A METHOD FOR DETERMINING INFORMATION DIFFUSION CASCADES ON SOCIAL NETWORKS

Introduction
Online social networks have become one of the most efficient communication platforms over the last two decades, with high socio-economic impact. This has motivated a large amount of recent research. The problems currently studied include network modelling, social network annotation, community detection, and user recommendation, among others.
An information diffusion process occurs when a piece of information flows from one individual to another in a network. To understand the underlying structure of a diffusion, it is necessary to construct the diffusion cascade, which requires knowing both the activation decisions of people and the social ties over which these activations occur. The problem has been studied for decades in areas such as epidemiology [3]. However, information diffusion on online social networks is much more complex: networks are now often very large, and the wide diversity of user profiles and the uncertainty in user behaviors make the existing methods less accurate and less efficient. Thus, there is a need for methods that approximate the mechanism of information diffusion on social networks more accurately and more efficiently.

Literature review and problem statement
Many existing studies adopt the Linear Threshold model (LT model) [4-7] or the Independent Cascade model (IC model) [8-10] to predict diffusion cascades. In the former, an acceptance threshold $\theta$ and an aggregation function $f$ are associated with each user (or node). In the traditional version of the LT model, $f$ is the sum of the weights of the edges from the active neighbours of a user $v$ to $v$, and $v$ becomes active if this sum exceeds the threshold $\theta$. The IC model, on the other hand, uses activation probabilities for edges instead of acceptance thresholds for nodes: an active node independently tries to activate each of its inactive neighbours and succeeds with the specified probability.
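The two activation rules can be contrasted in a few lines of Python. This is only a minimal sketch of the textbook LT and IC rules as described above; the function names and the way weights and probabilities are passed in are illustrative assumptions, not part of the cited models' reference implementations.

```python
import random


def lt_activates(active_in_weights, threshold):
    """LT rule: v becomes active when the summed weight of the edges
    from its already-active neighbours exceeds its threshold."""
    return sum(active_in_weights) > threshold


def ic_attempt(p_uv, rng=random):
    """IC rule: an active node u makes a single attempt to activate v,
    which succeeds with probability p_uv."""
    return rng.random() < p_uv
```

In the LT rule the decision is deterministic once the threshold is fixed, whereas in the IC rule every edge is an independent random trial.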
Some studies extend the LT and IC models by considering additional factors such as time decay and user profile. In [11], the authors introduce ASIC (Asynchronous IC), which models the influence between an active user and his/her neighbours using an exponential probability distribution over the delay of influence. In [12], the ASIC model is further extended with user profile information. The method in [13] considers different probability distributions for the delay of influence: exponential, power-law and Rayleigh distributions. Other extensions include the topic-sensitive IC model [14], influence models that consider positive and negative opinions [15], and an influence model that considers friend and foe relationships [16].
Some studies treat cascade prediction as a regression problem [17,18] or a classification problem [19,20]. These studies first identify a number of features that may correlate with the dependent variable, which can be the cascade size or the activation probabilities. They then learn a regression model or a classifier to estimate the value of the dependent variable.
Many other studies apply variations of the PageRank algorithm to rank user influence according to the network structure. Kwak et al. [21] rank users by applying PageRank to the follower-following graph in Twitter. The problem with this approach is that the network structure is relatively static compared to the activities of users in social networks. To deal with this problem, some studies include additional information in their models; for example, Topic-Sensitive PageRank [22] and TwitterRank [23] are able to compute per-topic influence ranks.
Most of the previous studies aim to predict the volume of aggregate activation (or the size of the cascade), a task closely related to, but different from, the one addressed in this paper. The work [18] estimates the total number of up-votes on Digg stories. The work [24] estimates the total hourly volume of news phrases. The work [25] estimates total daily hash-tag use. Some work in this line of research estimates the cascade size through sampling [5,26,27]. These works often use a fixed number of samples to estimate the expected size of the cascade. The problem here is that no single sample size fits all kinds of networks.
Another task closely related to the one addressed in this paper is influence maximization. The goal of influence maximization is to find a seed set of users whose activation maximizes the diffusion of information in the social network. Domingos and Richardson [4,28] address this problem using Markov random fields. The work [5] models it as a discrete optimization problem and proves its NP-hardness for both the LT and IC models.
Most of the previous works on influence maximization or cascade size prediction assume that the diffusion probabilities between users are given as inputs. In this paper, we instead address the problem of predicting these probabilities.

The aim and objectives of the study
The aim of this study is to understand the underlying information diffusion mechanisms on online social networks. More specifically, the study aims to predict the probability that a user adopts a piece of content when he/she is exposed to it.
To accomplish the aim, the following objectives have been set:
- to review carefully and thoroughly the related work on the information diffusion problem;
- to formulate the problem and the associated definitions formally;
- to solve the problem theoretically, based on real-world observations and analysis;
- to demonstrate the accuracy and efficiency of the proposed method experimentally by comparing it with other state-of-the-art algorithms on real-world datasets.

The proposed method
Quantifying the probability that a user will adopt a piece of content on an online social network is a very challenging task. In general, the exact mechanisms that drive users to take actions are unknown because they are diverse and vary between individuals. Influence between users is highly subjective, and its quantification depends strongly on the domain. Moreover, a user who is influential in a particular domain does not remain influential forever, so the behaviors and contents that users diffuse must be re-evaluated constantly. In this paper, a method is proposed that quantifies the probability that a user takes action with regard to a content being diffused on a social network. The method takes into account the history of user-user interaction and the user-content preference of every user in the network. The method also considers and quantifies external influence, which leads to a more accurate forecasting model of information diffusion.

1. Problem Formulation
An online social network is often represented by a graph in which nodes are users and edges are relationships between users. The relationship can be either directed (one-directional or two-directional) or undirected, depending on whether the network allows connections in a unilateral model (e.g. following in Twitter) or a bilateral model (e.g. friendship in Facebook). In a social network, users often publish contents to share or forward various kinds of information, such as ideas and political opinions.
Formally, we represent a social network as a graph $G = (V, E, T)$, where $V$ is the set of nodes, which represent users; $E = \{(u, v) \mid u, v \in V, u \neq v\}$ is the set of edges, which represent relationships among users; and $T: E \to \mathbb{N}$ is a time labeling function that labels each edge with the time-stamp at which the relationship between the two users was created.
Given a content $c$ that is being diffused over a social graph $G$, a node (user) $u \in V$ is in one of the following states with regard to $c$:
- unaware: no friends of $u$ are activated to $c$;
- exposed: some friends of $u$ are activated to $c$ but $u$ is not;
- active: $u$ has performed some social action (e.g. like, share, comment) associated with $c$.
Following the IC model, the activation mechanism of the diffusion is subject to the following constraints:
- when a node $u \in V$ becomes active, say at time $t$, it has only one chance to activate each currently inactive neighbour $v$;
- active nodes are only temporarily influential (contagious), for some number of time steps; once a node has made all its attempts, it becomes non-influential (non-contagious) but remains active;
- the activation succeeds with probability $p_{uv}$, independently of all other neighbours' attempts;
- if more than one contagious node tries to activate the same inactive node, the order in which they make their attempts is random.
The triple $\langle u, c, t \rangle$ represents the fact that user $u$ is activated to content $c$ at time $t$; we also say that $u$ propagated $c$ at time $t$. Using this notation, we are given an action log $H = (U, C, T)$, which records users' historical activities over a social network. Table 1 gives an example of an action log.
Problem Definition. Given a graph $G = (V, E, T)$, a content $c$, the current time $t$ and an action log $H$, the problem is to estimate the probability that a node is activated to $c$ at time $t$.
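To make the three node states concrete, the following sketch derives the state of a user from an action log. The representation is an assumption made for illustration: the log is a dictionary keyed by (user, content) with the activation time as value, and `neighbours` maps each user to his/her friends; neither structure is prescribed by the formulation above.

```python
def node_state(u, content, t, neighbours, action_log):
    """Return 'active', 'exposed' or 'unaware' for user u with regard to
    `content` at time t.

    action_log: dict mapping (user, content) -> activation time (hypothetical layout)
    neighbours: dict mapping user -> iterable of friends
    """
    t_u = action_log.get((u, content))
    if t_u is not None and t_u <= t:
        return "active"
    for w in neighbours.get(u, ()):
        t_w = action_log.get((w, content))
        if t_w is not None and t_w <= t:
            # At least one friend is already activated, but u is not.
            return "exposed"
    return "unaware"
```

For example, with a chain a, b, c in which a is activated at time 1 and b at time 3, user c is unaware at time 2 and becomes exposed at time 3.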

2. Measuring Pair-wise User Influence
To measure the influence between users on a social network, we adopt and modify the disease transmission model given in [3]. Consider a pair of connected users, one of whom, $u$, is active and the other, $v$, is exposed to a content $c$. Suppose that the average rate of content transmission from $u$ to $v$ is $r_{uv}$, and that the active user remains influential for a time $\pi_u$. According to the disease transmission mechanism in [3], the probability $1 - \tau_{uv}$ that the content will not be transmitted from $u$ to $v$ is:
$$1 - \tau_{uv} = \lim_{\delta t \to 0} \left(1 - r_{uv}\,\delta t\right)^{\pi_u / \delta t} = e^{-r_{uv} \pi_u}, \tag{1}$$
and the probability of transmission is:
$$\tau_{uv} = 1 - e^{-r_{uv} \pi_u}. \tag{2}$$
The above equation applies for continuous time. In our model, discrete time steps rather than continuous time are used, in which case instead of taking the limit in Eq. (1) we simply set $\delta t = 1$, resulting in
$$\tau_{uv} = 1 - \left(1 - r_{uv}\right)^{\pi_u}, \tag{3}$$
where $\pi_u$ is measured in time steps. In general, $r_{uv}$ and $\pi_u$ differ between users, so the probability of transmission also differs. Furthermore, the rate $r_{uv}$ is not symmetric, and thus the probability of transmission may not be the same in the two directions. In this paper, for simplicity, it is assumed that $\pi_u$ is identical for all users, and its optimal value is chosen from the set of discrete time steps $\{1, 2, \ldots, 10\}$.
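As a quick numerical illustration, the sketch below evaluates the transmission probability in both settings, assuming the discrete-time form $\tau_{uv} = 1 - (1 - r_{uv})^{\pi_u}$ (obtained by setting $\delta t = 1$) and the continuous-time form $\tau_{uv} = 1 - e^{-r_{uv}\pi_u}$ of the model in [3]. The function names are illustrative.

```python
import math


def transmission_prob_discrete(r_uv, pi_u):
    """Probability that u transmits to v within pi_u discrete time steps,
    given a per-step transmission rate r_uv in [0, 1]."""
    return 1.0 - (1.0 - r_uv) ** pi_u


def transmission_prob_continuous(r_uv, pi_u):
    """Continuous-time counterpart: 1 - exp(-r_uv * pi_u)."""
    return 1.0 - math.exp(-r_uv * pi_u)
```

With $r_{uv} = 0.1$ and $\pi_u = 3$, the discrete form gives $1 - 0.9^3 = 0.271$. Since $1 - r \le e^{-r}$, the discrete-time probability is never smaller than the continuous-time one for the same rate and duration.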
In disease transmission research, the transmission rate is often drawn from an appropriate distribution. This approach is not followed in this paper. Instead, we observe that pairwise influence can be defined based on the social ties and historical interactions between users, which are given in the action log $H$. Given two nodes $u$ and $v$, we assume that the influence from $u$ to $v$ is propagated only through the edge $(u, v)$ in $G$; we do not consider influence through intermediate nodes between $u$ and $v$.
The transmission rate $r_{uv}$ is proportional to the number of times pieces of content are successfully diffused from $u$ to $v$. However, in the IC model, there may be other nodes that also try to activate $v$. Therefore, influence must be "shared" among these nodes each time $v$ is activated. Let $C_{u \cdot v}$ be the set of contents that activated $u$ since the time $u$ and $v$ became friends on the social network, and let $C_{u \to v}$ be the set of contents that activated both $u$ and $v$ and for which $u$ was activated before $v$. The sets $C_{u \cdot v}$ and $C_{u \to v}$ can be easily obtained by scanning the action log $H$. The average content-transmission rate from $u$ to $v$ is given by Eq. (4), where $S_v^c$ is the set of neighbour nodes of $v$ that became active before $v$ with regard to the content $c$.
Eq. (4), however, does not differentiate between primary and secondary diffusion. Primary diffusion means that the node that diffused the content is also the node that created it; secondary diffusion means otherwise. It is observed that the influence through primary diffusion and through secondary diffusion between two nodes can be very different, and that ignoring the difference can severely affect the accuracy of diffusion models. For example, on online social networks, many people willingly adopt information created by their friends but only reluctantly adopt information created by a stranger. This is consistent with the fact that almost all diffusion cascades are very small, containing merely friends of the source node, and that large cascades are extremely rare [29,30].
The set $C_{u \cdot v}$ mentioned above is divided into two subsets, distinguished by the marker $\xi$: the contents that were created by $u$ and activated $v$, and the remaining contents. The transmission rate from $u$ to $v$ is now given by Eq. (5), which has one case for primary diffusion from $u$ to $v$ and another for all other diffusion.
Note that for primary diffusion the influence from $u$ to $v$ is not "shared" with any other nodes, as reflected in Eq. (5).
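The credit-sharing idea can be sketched as follows. The exact expressions of Eqs. (4) and (5) are not reproduced in this text, so the code makes explicit assumptions: each secondary diffusion gives $u$ an equal share $1/|S_v^c|$ of the credit, each primary diffusion gives $u$ the full credit of 1, and the total is normalised by the number of contents that activated $u$. The filter on the time at which $u$ and $v$ became friends is omitted for brevity, and all names and data layouts are hypothetical.

```python
def transmission_rate(u, v, activations, creators, neighbours_of_v):
    """Assumed form of the average content-transmission rate r_uv.

    activations: dict content -> dict user -> activation time
    creators: dict content -> user who created the content
    neighbours_of_v: set of users who can influence v (must contain u)
    """
    contents_u = [c for c, times in activations.items() if u in times]
    if not contents_u or u not in neighbours_of_v:
        return 0.0
    credit = 0.0
    for c in contents_u:
        times = activations[c]
        if v not in times or times[u] >= times[v]:
            continue  # c was not diffused from u to v
        if creators.get(c) == u:
            credit += 1.0  # primary diffusion: credit is not shared
        else:
            # secondary diffusion: share among neighbours active before v
            earlier = [w for w in neighbours_of_v
                       if w in times and times[w] < times[v]]
            credit += 1.0 / len(earlier)
    return credit / len(contents_u)
```

The resulting rate lies in $[0, 1]$, so it can be used directly as the per-step rate of a discrete-time transmission model.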

3. Measuring User-Content Preference
It is intuitive that the personal characteristics or habits of people are closely related to their behavior. This leads us to consider a topic-based influence measure in our problem setting.
Let $C = \{C_1, C_2, \ldots, C_m\}$ be a set of topics, where $C_i$ is the class label of the $i$-th topic. The probability $\delta_c^i$ that a content $c$ belongs to topic $C_i$ is determined by Eq. (6) from the attributes $x_k$ of $c$ and the probability that each attribute $x_k$ belongs to the topic $C_i$. In natural language processing, the attributes of documents are usually words or phrases that have been identified in advance.
It is assumed that the preferences (or interests) of a user for different topics are independent. Based on the contents on which a user $v$ took action in the past, the interest level of $v$ in topic $C_i$, denoted by $\rho_{v,i}$, is measured by Eq. (7), where $l$ is the number of contents that $v$ has adopted and $\delta_h^i$ is the probability that the $h$-th content that activated $v$ belongs to topic $C_i$, as given in Eq. (6).
When a content $c$ is being spread in a social network, the topic-based probability that a user $v$ will be activated by $c$ is measured by the similarity between the topics of $c$ and the preference of $v$ for those topics. The cosine similarity measure is applied to determine this similarity, denoted by $\mu_{c,v}$ and given by Eq. (8), where $\rho_{v,i}$ is the interest level of $v$ in topic $C_i$ as given in Eq. (7).
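A possible computation of the user-content preference is sketched below. It assumes that the interest level $\rho_{v,i}$ is the average of the topic probabilities $\delta_h^i$ over the $l$ contents adopted by $v$, and that $\mu_{c,v}$ is the standard cosine similarity between the topic vector of $c$ and the interest vector of $v$; both readings are assumptions, since the bodies of Eqs. (7) and (8) are not reproduced in this text. Function names are illustrative.

```python
import math


def interest_levels(adopted_topic_probs):
    """Average topic distribution over the contents a user has adopted.

    adopted_topic_probs: non-empty list of length-m lists; entry [h][i] is
    the probability that the h-th adopted content belongs to topic i.
    """
    l = len(adopted_topic_probs)
    m = len(adopted_topic_probs[0])
    return [sum(d[i] for d in adopted_topic_probs) / l for i in range(m)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because all topic probabilities are non-negative, the similarity obtained this way lies in $[0, 1]$ and can be read as a preference score.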
To sum up, user-user influence and user-content preference are combined into the probability that a user $u$ activates a user $v$ with regard to a content $c$, given by Eq. (9), where $\alpha \in [0, 1]$, $\tau_{uv}$ is the influence of $u$ on $v$ given in Eq. (5), and $\mu_{c,v}$ is given in Eq. (8) above. Note that, unlike user-content preference, the influence between users is independent of the content (Fig. 1).
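One natural way to combine the two measures with a weight $\alpha \in [0, 1]$ is a convex combination. The sketch below assumes this form, $p_{uv} = \alpha\,\tau_{uv} + (1 - \alpha)\,\mu_{c,v}$; the exact expression of Eq. (9) is not reproduced in this text, so the formula and the function name are assumptions.

```python
def diffusion_probability(tau_uv, mu_cv, alpha=0.5):
    """Assumed convex combination of user-user influence tau_uv and
    user-content preference mu_cv, weighted by alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * tau_uv + (1.0 - alpha) * mu_cv
```

With $\alpha = 1$ the content is ignored and only the pairwise influence matters; with $\alpha = 0$ only the topic preference of the exposed user matters.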

4. Measuring External Influence
In the model of this study, the effect of external influence on information diffusion is considered and quantified. Most existing works consider only internal influence among nodes and assume the absence of any external information. This assumption, however, does not hold in general. For example, an external rumour or mass media such as newspapers and television can easily reach people on social networks and eventually affect their actions toward an event or a piece of information. It is very hard to capture and study the effects of external influence, let alone to forecast them before they occur. This may be one of the reasons why very few studies so far deal directly with this problem. The works [31,32] also consider how external trends can affect their models, but their problem setting and methods are very different from ours.
If the growth (virality) of a cascade is low, it may appeal only to a small group of closely connected people, and thus external influence may be absent or small enough to be neglected. Conversely, if fast-growing stages of diffusion are observed, it can be concluded that external influence does exist and, as reported in [18], this influence can last for a long time.
It is assumed that when external influence exists, every inactive node in the social network receives the same additional amount of influence, denoted by $\Delta_{ext}$. The combined influence between two nodes $u$ and $v$ can be simply defined as the sum of $p_{uv}$ and $\Delta_{ext}$. However, this simple method has a problem: the sum can be greater than 1. To force the value to fall between 0 and 1, it is mapped through a logistic function, Eq. (10), which is bounded between 0 and 1. In this paper, the parameters of the logistic function are chosen as $A = 1/2$ and $B = 4$. The shape of the corresponding logistic function is shown in Fig. 1.
Suppose that in diffusion time step $t$ we observe $k$ newly activated nodes. Our task now is to estimate $\Delta_{ext}$. Let $S$ be the set of contagious nodes, and let $V_S$ be the set of nodes in $V$ that are exposed to $S$ at time step $t$. We denote by $S_v$ the set of contagious neighbours of a node $v$. Since in the IC model each contagious neighbour tries to activate $v$ independently, we define the activation function for each exposed node by Eq. (11). The expected number $I(S)$ of nodes activated by the set $S$ of contagious nodes is given by Eq. (12). Since $p_{uv}$ is already known, setting $I(S) = k$ allows us to estimate the value of $\Delta_{ext}$ by solving Eq. (12).
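The estimation of $\Delta_{ext}$ can be sketched numerically. The code below rests on several assumptions that are not confirmed by the text: the logistic function is taken to be $1/(1 + e^{-B(x - A)})$ with $A = 1/2$ and $B = 4$; an exposed node $v$ is activated with probability $1 - \prod_{u \in S_v}(1 - \tilde{p}_{uv})$, where $\tilde{p}_{uv}$ is the logistic-mapped value of $p_{uv} + \Delta_{ext}$; and the expected number of activations is the sum of these probabilities over $V_S$. Under these assumptions the expected count increases monotonically with $\Delta_{ext}$, so bisection suffices. All function names are hypothetical.

```python
import math


def squash(x, a=0.5, b=4.0):
    """Assumed logistic mapping into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-b * (x - a)))


def expected_activations(exposed, delta_ext):
    """exposed: dict v -> list of p_uv over the contagious neighbours u of v."""
    total = 0.0
    for probs in exposed.values():
        fail = 1.0
        for p in probs:
            fail *= 1.0 - squash(p + delta_ext)  # every attempt fails
        total += 1.0 - fail
    return total


def estimate_delta_ext(exposed, k, lo=0.0, hi=10.0, tol=1e-9):
    """Bisection for the delta_ext whose expected activation count equals k."""
    if expected_activations(exposed, lo) >= k:
        return lo
    if expected_activations(exposed, hi) <= k:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_activations(exposed, mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

If the observed count $k$ is already explained by the internal probabilities, the sketch returns the lower bound 0, which corresponds to the case where no external influence is detected.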

5. Diffusion tree construction
Here a method for constructing the diffusion tree of a particular piece of content is presented. Each node of the tree corresponds to a user who has adopted the content, and each edge links a user to another user called its "parent". Tree construction would be relatively straightforward if we knew exactly which user caused each other user to adopt the information. Unfortunately, users adopt content through a variety of unknown mechanisms, which complicates the construction task.
The set of potential parents of a node $v$ that adopted the content is denoted by $S_v$. The single most likely parent of a given adoption is identified, according to Eq. (13), as the potential parent that has the highest probability of influencing $v$. After each adoption has been identified as either a root or a child of another node, the cascade of information diffusion is constructed in the form of a diffusion tree, as shown in Algorithm 1.
Algorithm 1 generates the most likely diffusion tree (the tree with the highest probability). The probability of the generated tree $T$ is given by Eq. (14). Note that, according to Eq. (14), the probability of any exact tree is very small. However, the generated tree is still the one closest to the "ground truth", that is, the real diffusion tree.

Algorithm 1 Diffusion Tree Construction
Given: graph $G = (V, E, T)$, action log $H$, first active node $v_0$ at time $t_0$, tree $T = \emptyset$
set node $v_0$ as the root of the tree $T$;
$t = 0$;
at each time step $t \geq 0$ do
    $A_t \leftarrow$ set of active nodes at time $t$;
    for each contagious node $u \in A_t$ do
        $V_u \leftarrow$ set of inactive nodes in $V$ that are neighbours of $u$;
        for each node $v \in V_u$ do
            calculate the diffusion probability $p_{uv}$ from $u$ to $v$ according to Eq. (9);
    for each newly active node $v$ do
        find the most likely parent $u$ of $v$ according to Eq. (13);
        make a link from $u$ to $v$;
    stop if there are no more contagious nodes;
Output the diffusion tree $T$;
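The parent-selection step of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it assumes that activation times are already known from the action log, that a node stays contagious for `window` time steps, and that the most likely parent is the contagious neighbour with the largest $p_{uv}$. The data layout and names are hypothetical.

```python
def build_diffusion_tree(activation_time, neighbours, prob, window=1):
    """Assign to every adopter its most likely parent.

    activation_time: dict user -> time step at which the user adopted the content
    neighbours: dict user -> iterable of users who can influence it
    prob: function (u, v) -> diffusion probability p_uv
    window: number of time steps a node remains contagious
    Returns a dict child -> parent; roots map to None.
    """
    parent = {}
    for v, t_v in sorted(activation_time.items(), key=lambda kv: kv[1]):
        # Potential parents: neighbours that were contagious when v adopted.
        candidates = [u for u in neighbours.get(v, ())
                      if u in activation_time
                      and 0 < t_v - activation_time[u] <= window]
        if candidates:
            parent[v] = max(candidates, key=lambda u: prob(u, v))
        else:
            parent[v] = None
    return parent
```

Nodes without any contagious neighbour at the time of adoption become roots, which covers both the original seed and adoptions that cannot be explained by the observed network.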

Experiments
In this section, the effectiveness of our proposed method for cascade prediction on online social networks is evaluated. The datasets are described first, then the evaluation metrics are defined, and finally the evaluation results are discussed.
Datasets. Both synthetic and real-world datasets are used. For the synthetic dataset, the data generation method described in [33-35] is adopted. Specifically, the Kronecker generator [36] is used to generate graphs that mimic the structural properties and the information diffusion traces of real-world networks [37]. The generated graphs are directed, with a core-periphery structure. Ten graphs are generated with the parameter matrix [0.9 0.5; 0.5 0.3]. Each generated graph consists of about 8,192 nodes and 25,600 edges. A $K$-dimensional topic distribution for each node of a graph is sampled from a $K$-dimensional symmetric Dirichlet distribution. This is done by assigning to each node $j$ a uniformly distributed random variable $\theta_j \in (0, 1]^K$, which is the parameter of the Dirichlet distribution. Since the entries of $\theta$ are less than one, the generated contents are concentrated on a small subset of topics. For cascade generation, a content is first sampled from the Dirichlet distribution, and then the discrete-time independent cascade model is applied to generate a set of 5,000 cascades.
Two real-world datasets are also used. The first is a large meme dataset [38], which traces the spread of memes across 1,700 popular media sites and blogs [39]. The dataset classifies memes by topic and assigns each meme $m$ to an information cascade $t_m$, which is a record of the times at which sites mentioned meme $m$. The second real-world dataset is the Tencent Weibo dataset, released for KDD Cup 2012 [40]. For all three datasets, 4,000 cascades are used for training (learning diffusion probabilities) and 1,000 for testing.
Metrics. It is very hard to evaluate diffusion models based on the diffusion probabilities between users, because estimated probabilities cannot be compared directly with exact probabilities. In previous works, various measures have been used to compare diffusion models. Many studies use measures such as cascade size (or the number of users involved) [21,41,42]. Some studies use metrics related to shape patterns, such as frequencies [21,42] or correlations of shapes with events [21]. Metrics for temporal aspects, such as the time lag between messages [43], are also used. In this paper, we use cascade size to evaluate our proposed model. Given the diffusion probabilities between users, a close approximation of the exact cascade size can be computed by repeatedly simulating the cascade process and sampling the cascade size at each diffusion time step.
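The simulation-based estimate of cascade size can be sketched as a simple Monte Carlo procedure. The sketch assumes a single seed node, a contagious period of one time step, and a fixed number of runs; these choices, as well as the function names and the adjacency-list layout, are illustrative assumptions rather than the experimental protocol of this paper.

```python
import random


def simulate_ic(seed, out_neighbours, prob, rng):
    """Run one discrete-time independent cascade and return its final size."""
    active = {seed}
    frontier = [seed]  # nodes that became active in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in out_neighbours.get(u, ()):
                # u gets exactly one chance to activate each inactive neighbour
                if v not in active and rng.random() < prob(u, v):
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_cascade_size(seed, out_neighbours, prob, runs=1000, rng_seed=0):
    """Average cascade size over `runs` independent simulations."""
    rng = random.Random(rng_seed)
    total = sum(simulate_ic(seed, out_neighbours, prob, rng) for _ in range(runs))
    return total / runs
```

The number of runs trades accuracy for time; as noted in the literature review, no single sample size fits all networks, so `runs` would need to be tuned per dataset.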
The performance of our proposed method is compared with that of several existing ones. The first, called the IC model, is based on the Independent Cascade model and assigns the diffusion probability of a content to be simply the prior diffusion probability. The IC model does not consider user-user influence or content influence. For a pair of nodes $u$ and $v$, the diffusion probability is calculated as $p_{uv} = 1/d_{in}(v)$, where $d_{in}(v)$ denotes the in-degree of node $v$, as in [44,45]. The second, called the UI model, is a user interaction model that considers user-user influence. We compare our method with the UI method given in [46], which measures the relatedness between nodes in a graph using the theory of random walk with restart. Finally, we compare our method with a regression method, called the RM model, which estimates cascade size by regressing on user-based, content-based and time-based features [47].
To compare the methods, we adopt the relative error, which shows how far the estimated diffusion size is from the "ground truth". The relative error of the estimated diffusion size is computed as follows:
$$\left| \frac{\hat{I}(S)}{I(S)} - 1 \right| \times 100\,\%,$$
where $\hat{I}(S)$ is the diffusion size of the seed set $S$ estimated by the method, and $I(S)$ is the ground truth for $S$. In our experiments, $S$ contains only one node.
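For reference, the relative error can be computed as in the short sketch below, which assumes the absolute-value form $|\hat{I}(S)/I(S) - 1| \times 100\,\%$; the function name is illustrative.

```python
def relative_error(estimated_size, true_size):
    """Relative error, in percent, of an estimated diffusion size."""
    if true_size == 0:
        raise ValueError("the ground-truth size must be non-zero")
    return abs(estimated_size / true_size - 1.0) * 100.0
```

For instance, an estimate of 120 nodes against a ground truth of 100 nodes gives a relative error of about 20 %, and so does an estimate of 80 nodes.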
The study also aims to evaluate the methods using the most likely generated diffusion tree (the tree with the highest probability). Because the RM model can only estimate the diffusion size, it is excluded from this test. For a fair comparison, we prune the trees using a threshold $\theta$: any edge with a diffusion probability below $\theta$ is discarded. In this experiment, we set $\theta = 50\,\%$ and compare the size of the estimated trees with that of the actual diffusion tree. The experimental results are reported in Table 2. In all experiments, the value of $\alpha$ in Eq. (9) is set to 0.5. As can be seen in the table, our proposed method outperforms all other methods on all datasets.
The experimental results above do not reveal the effect of external trends on diffusion probabilities. External influence can speed up the process of information diffusion and produce much larger cascades than factors such as user-user and user-content interactions are able to explain. It is reasonable to hypothesize that in small cascades external trends have very little or even no influence. We therefore examine the effect of including external influence in our model by using the top 5 % of the largest cascades from the Meme and Tencent Weibo datasets. The experimental results are given in Table 3.
As can be seen in Table 3, when external influence exists, existing methods such as IC, UI and RM cannot take this factor into account and therefore suffer a significant decline in prediction accuracy. The proposed method, in contrast, still works well, with only a slight drop in prediction accuracy. However, our method has a weakness. It estimates the external influence in one time step and assumes that this influence persists unchanged in the next time step. Therefore, it cannot be used for early prediction and requires constant updating with real-world diffusion data. Analysing external influence more deeply and improving our treatment of it are topics for our future research.

Discussion of the research results of the method
We have demonstrated the effectiveness of our method with extensive experiments on both artificial and real-world datasets. The good experimental results obtained in section 5 are due to the following facts:
1) The proposed model can incorporate different sources of diffusion influence, namely user-user influence, user-content preference, and external influence. The mechanisms that drive users to adopt information are very diverse. If only one mechanism is considered and the others are neglected, important information that can contribute greatly to a better understanding of diffusion probabilities is missed. Network topology, the history of interaction activities and content preference are among the sources of information used in this paper to achieve our goal.
2) For user-user influence, our model can differentiate between two types of active nodes, namely the node that created the content and a node that merely acts as an intermediary. It is observed that the influence through these two types of nodes can be very different, and that ignoring the difference can severely affect the accuracy of diffusion models. To the best of our knowledge, no previous work has considered and incorporated this fact.
3) The proposed model can quantify external influence in information diffusion. This may be the most important contribution of our paper. Our method works in a sequential, time-step manner: if a large viral cascade is detected in the current time step, its influence is carried over to the next step. Most existing works, which consider only internal influence among nodes and assume the absence of any external information, will face difficulty when forecasting large viral cascades in which external influence does exist.
Because it captures different factors in one model, our proposed method can be easily extended to other social network analysis tasks such as influence maximization, recommendation systems and trust propagation.
However, several challenges remain. First, the proposed model requires the history of users' interaction activities for computing user-user influence and user-content preference, and therefore does not work well for users who are new to the social network. Secondly, the method for incorporating external influence works for the next time step only if the data of the previous step are known. Therefore, it cannot be used for early prediction and requires constant updating with real-world diffusion data, which is a hard task in practice. Moreover, for the sake of simplicity and fast computation, some restrictions and assumptions have been made in this paper, which can reduce the accuracy of the prediction.

Conclusions
1. The problem of predicting diffusion probability on online social networks under the popular IC model is addressed. The proposed model incorporates user-user influence, user-content preference and external influence in a unified framework intended to capture the actual mechanisms of information diffusion. To do so, the network topology, the interaction activities among users, and the interaction activities between users and diffused contents have been used.
2. Different properties of the information diffusion process are investigated and exploited in the proposed model. The disease transmission model is adopted and extended to make accurate pair-wise influence predictions. Moreover, the influence through different types of diffusing nodes is quantified. A topic-based analysis technique is used to characterize user habits in sharing favorite contents. Finally, external influence is detected and quantified for better prediction of diffusion probabilities.
3. The learned diffusion probabilities, which are the outputs of the model, are used to construct the most likely diffusion tree and to estimate the number of nodes reached by the diffusion. The diffusion tree helps to understand the diffusion process in a more intuitive and visual way. The diffusion size, or the scale of the information diffusion, is especially useful for predicting important social events.
4. Extensive experiments on both artificial and real-world datasets have been conducted. It is shown experimentally that the proposed model significantly improves the accuracy of cascade size prediction in comparison with other state-of-the-art methods. The experimental results confirm the advantages of the proposed method in predicting diffusion probabilities and show that the method works well both when external influence exists and when it does not.

Table 1
An example of action log

Table 2
Comparing performance of methods in estimating diffusion size

Table 3
Comparing performance of methods in estimating diffusion size where external influence exists