Introduction
The relationships established among the entities that interact in a large-scale system can be often modeled as a graph. This is particularly visible in the case of networks, where nodes are connected through links that allow exchanging or transmitting different types of flows (e.g., water, electricity, and information). The characterization of the relevance of the nodes and links in a graph is a problem that has been studied under different perspectives and that is generally related to the calculation of measures, i.e., numerical values that provide a notion of the relevance according to a specific criterion. While some measures focus on the inherent structure of the network (e.g., the well-known degree and betweenness centrality measures), others aggregate additional information into the calculation.
In this article, we study specifically the centrality measure provided by the PageRank. It has been used by search engine Google, as one of the means to rank the relevance of the results they offer to their users.1 Specific information about the PageRank value can be found in [4] and [5]. An analysis of this value and algorithms for its computation can be found in [19] and [20]. The rationale behind PageRank is that the relevance of a web page must be greater if there are significant pages pointing to it. Somehow it is as if each page votes the pages that it points to by means of the hyperlinks. Hence, the PageRank value depends only on the link structure of the graph that describes the web and its calculation can be seen as a variation to that of another popular centrality measure: eigenvector centrality.
In particular, we would like to gain insights into the problem of finding the most interesting nodes for given nodes to merge with from a PageRank viewpoint. While the aggregation of several nodes into a new one is beneficial in terms of incoming hyperlinks (as long as the new node is pointed to by all the links that were originally pointing to the merged nodes), it is clear that not all the nodes contribute equally to the fusion. Moreover, some fusions may be super-additive in the sense that the PageRank of the new node can be greater than the sum of the PageRanks of the merged nodes.
This viewpoint can be of interest in several applications where the fusion of nodes is an option to gain relevance, e.g., web pages, journals, conferences and companies. For example, PageRank was used together with other measures to evaluate company’s value taking into account the effect of network structures in [32], where data of one million Japanese companies were studied. In [9], the world trade matrix is studied to rank countries by means of PageRank and CheiRank, which is a variation that deals with the transposed link matrix. Another interesting work is [21], where economic influence and contagion propagation over the world economy are analyzed by means of a Google matrix of economic activities in the period 1995–2009. Other interesting applications can be found in [14], where the social network Twitter is examined by using this approach to detect the most influencing users, and in [22], where PageRank is used as a tool in citation analysis. Finally, it is worthy to mention [2], where the effect of newly created links on PageRank is analyzed to find out how much a web page can control its PageRank.
In order to analyze this problem, we propose two different approaches, from the global and local perspectives. In the global one, we assume that full information about the network structure is available. In order to gain insights into the relevance of the web pages, we propose different cooperative games in which the value of each coalition is characterized as the PageRank received by the new node resulting from aggregation. The Shapley value [31] of the games is related to the PageRank that a certain page should expect from joining randomly to a coalition of web pages. In this way, new centrality measures are obtained. Notice that the Shapley value has been proposed as a mathematical tool to measure node centrality in a network in many works of the literature. For example, in [28], the information diffusion process in a network is modeled as a cooperative game and its Shapley value is used to provide a measure of the node influence. Another related work is [16], where the Shapley value of the difference of a game and its graph-restricted version is proposed as a centrality measure in a network. Likewise, it has been also applied to characterize the relative relevance of links by means of the position value in [3]. Although other solution concepts also appear in the literature (e.g., in [18] a measure degree of centrality in a social network based on an extension of the Banzhaf power index is presented), the Shapley value is by far the most used game theoretical payoff rule used in this context, probably due to its properties and its straightforward interpretation in terms of average marginal contribution. Applications of the Shapley value as a measure to gain an insight into the relevance of the players in this context include social networks [18], [28], wine ranking [15], scientific influence attribution [29], shareholder influence attribution in companies [25], and control systems [24], [27], among many others.
In addition, we provide a mechanism to compute a numerical approximation of these values to overcome the combinatorial explosion issues that arise in this type of problems. It must be noticed that, while there are numerous studies on the Shapley value from a theoretical perspective, from a practical point of view the complexity of its computation is exponential, in general, leading to an NP-complete problem [8]. In the literature, the Shapley value has been computed for special classes of games as weighted voting games restricted by a tree [13] or weighted multiple majority games [1] where it has been calculated in polynomial time by algorithms based on generating functions, or particular cases of operations research games called extended tree games [17], among others. An approximation of the Shapley value for voting games by a randomized polynomial method is presented in [10]. Also, the Shapley value for some centrality-related cooperative games on networks has been solved analytically by exact algorithms in polynomial time [26]. In addition, a polynomial method based on sampling theory which can be applied to approximate the Shapley value for cooperative games is given in [7], since it is not possible to compute the marginal contributions from every player when the number of players is large. In fact, the authors of this work applied sampling as a method to describe the average of the marginal contributions for each player on the set of all possible permutations of the set of players from a representative sample of the set of all permutations. This method is only efficient if the worth of any coalition can be calculated in polynomial time.
The second approach we propose is based solely on local information. In particular, it is assumed that the nodes only have information about their neighbors. Hence, it is not possible to calculate directly the PageRank of any merger, although it can be approximated if certain simplifying assumptions hold, e.g., if the PageRank of the neighbors is not modified after the merger is formed or if the rest of the network can be approximated so that the PageRank of the nodes under study is preserved. Four alternatives are considered depending on different assumptions regarding the way the approximated PageRank is computed, which are assessed by simulation. As it will be seen in the corresponding section, our simulation results show that some of these methods are very accurate for determining super-additivity.
Finally, this work enhances substantially and contributes with more relevant results to the earlier version [23] in several ways: it includes the proofs of the theorems contained in that paper and it presents additional theoretical contributions to calculate analytically equivalent networks in a PageRank sense and estimators for the PageRank of a coalition of nodes when there is no global information available. As it will be shown, the methods proposed allow predicting with high accuracy whether a fusion of nodes will be super-additive from a PageRank viewpoint. Moreover, the limitations derived from the combinatorial explosion problem in large networks, which limited the applicability of our previous results, are solved in this article by means of a randomized method proposed in [6], which is adapted here to the games considered and assessed by means of a large example.
The outline of the paper is as follows. In Section II, grounds on the calculation of the PageRank and the problem setting are given. Section III deals with the centralized approach and introduces cooperative games based on the PageRank whose Shapley values are proposed as centrality measures. Section IV deals with the local information approach. Each of the proposed approaches is illustrated with different examples in the corresponding sections. Finally, Section V concludes the paper with final remarks and comments about future research.
Problem Setting
Let
The PageRank is itself a centrality measure related to eigenvector centrality. It provides information regarding the relevance of a certain web page based on the dominant eigenvector of a modified adjacency matrix. In particular, the PageRank of the web page \begin{equation*} x_{i}^{\ast }=\sum \limits _{j\in \mathcal {N}_{i}}\frac {x_{j}^{\ast }}{n_{j}}, \end{equation*}
The calculation of the PageRank can also be stated in matrix form. Let \begin{equation} \mathbf {x^{*}} =\mathbf {A} \mathbf {x^{*}} , \quad \mathbf {x^{*}} \in [{0,1}]^{| \mathcal {N} |}, ~ \sum \limits _{i\in \mathcal {N} } x_{i}^{*}=1, \end{equation}
\begin{equation*} a_{ij}=\begin{cases} \displaystyle \frac {1}{n_{j}} & j\in \mathcal {N}_{i}, \\ 0 & \mathrm {otherwise}. \end{cases} \end{equation*}
The matrix
The PageRank vector \begin{equation} \mathbf {x^{*}} =\mathbf {M} \mathbf {x^{*}} , \quad \mathbf {x^{*}} \in [{0,1}]^{| \mathcal {N} |}, \quad \sum \limits _{i\in \mathcal {N} } x_{i}^{*}=1 \end{equation}
A. Node Aggregation
Let
The aggregation of web pages can be analytically achieved by means of three auxiliary transformation matrices, namely \begin{equation*} \mathbf {P}_{ {\mathcal {S}}}= \begin{bmatrix} \mathbf {e}_{s_{1}}\mathbf {e}_{s_{2}}\ldots \mathbf {e}_{s_{s}}\mathbf {e} _{r_{1}}\mathbf {e}_{r_{2}}\ldots \mathbf {e}_{r_{r}} \end{bmatrix}, \end{equation*}
\begin{equation*} \mathbf {T} _ {\mathcal {S}}= \left [{ \begin{array}{cc} \mathbf {1}^{|\mathcal {S}|\times 1} & \mathbf {0}^{|\mathcal {S}|\times |\mathcal {R}|} \\ \mathbf {0}^{|\mathcal {R}|\times 1} & \mathbf {I}^{|\mathcal {R}|\times |\mathcal {R}|} \end{array} }\right ]. \end{equation*}
\begin{equation*} \mathbf {D} _ {\mathcal {S}}= \left [{ \begin{array}{cccc} \dfrac {1}{|\mathcal {S}|} & \mathbf {0}^{1\times |\mathcal {R}|} \\ \mathbf {0}^{|\mathcal {R}|\times 1} & \mathbf {I}^{|\mathcal {R}|\times |\mathcal {R}|} \end{array} }\right ]. \end{equation*}
The link matrix of \begin{equation*} \mathbf {A}^{\prime }=\mathbf {T}_{ {\mathcal {S}}}^{\mathrm {T}} \mathbf {P}_{ {\mathcal {S}}}^{\mathrm {T}}\mathbf {A}\mathbf {P}_{ {\mathcal {S}}} \mathbf {T}_{ {\mathcal {S}}}\mathbf {D}_{ {\mathcal {S}}}. \end{equation*}
The PageRank of the web page that results from merging the web pages in \begin{equation} {\mathbf {x}}'[k\!+\!1]= {\mathbf {M}}' {\mathbf {x}}'[k]=(1\!-\!m) {\mathbf {A}}' {\mathbf {x}}'[k]\!+\!\frac {m}{| {\mathcal {N}}'|}\mathbf {1}^{| {\mathcal {N}}'|\times 1}\qquad \end{equation}
Remark 1:
It is possible to consider alternatives in the way the matrix
An alternative case could be to consider that the random surfer jumps with an aggregated probability to the nodes inside \begin{equation*} \mathbf {J}^{\prime }:=\frac {1}{| {\mathcal {N}}|}\left [{ \begin{matrix} |\mathcal {S}|\mathbf {1^{1\times |\mathcal {N}^{\prime }|}} \\ \mathbf {1^{|\mathcal {R}|\times |\mathcal {N}^{\prime }|}} \end{matrix} }\right ]. \end{equation*}
B. An Academic Example
In this subsection, we introduce an example taken from [20], which will help us illustrate the main problem studied in this paper.
The network is shown in Fig. 1. As can be seen, it can be modeled by means of a directed graph \begin{equation*} \mathbf {A} = \left [{ \begin{array}{cccccc} 0 & \dfrac {1}{2} & 0 & 0 & 0 & 0 \\ \dfrac {1}{2} & 0 & \dfrac {1}{3} & 0 & 0 & 0 \\ 0 & \dfrac {1}{2} & 0 & \dfrac {1}{3} & 0 & 0 \\ \dfrac {1}{2} & 0 & \dfrac {1}{3} & 0 & 0 & \dfrac {1}{2} \\ 0 & 0 & 0 & \dfrac {1}{3} & 0 & \dfrac {1}{2} \\ 0 & 0 & \dfrac {1}{3} & \dfrac {1}{3} & 1 & 0 \\ \end{array} }\right ] \end{equation*}
\begin{equation*} \mathbf {M} \!=\! \begin{bmatrix} 0.025 & 0.450 & 0.025 & 0.025 & 0.025 & 0.025 \\ 0.450 & 0.025 & 0.308 & 0.025 & 0.025 & 0.025 \\ 0.025 & 0.450 & 0.025 & 0.308 & 0.025 & 0.025 \\ 0.450 & 0.025 & 0.308 & 0.025 & 0.025 & 0.450 \\ 0.025 & 0.025 & 0.025 & 0.308 & 0.025 & 0.450 \\ 0.025 & 0.025 & 0.308 & 0.308 & 0.875 & 0.025 \end{bmatrix}\!. \end{equation*}
Table 1 provides the values of the PageRank (PR) of this network. As can be seen, web pages 4, 5 and 6 have the highest PageRank values due to their larger numbers of incoming links. The value of web page 6 is greater because its incoming links come from web pages with larger values, namely pages 3, 4 and 5. It is very interesting that pages 4 and 5 share the same value of PageRank, specially since page 5 is a dangling node in [20].4
Finally, the aggregation of web pages can be studied by using (3). For example, the PageRank of the merger of web pages 1 and 2 is 0.11, which is greater than the PageRank of any of these web pages, as shown in Table 1. Nevertheless, the PageRank of the merger is lower than the sum of the PageRanks corresponding to the merged nodes. Next, let us examine what happens when nodes 1 and 4 are aggregated. In this case, the PageRank of the merger becomes 0.281, which is greater than the sum of the PageRanks of the nodes aggregated. Hence, the aggregation is super-additive in terms of PageRank.
In the next sections, we explore the generation of super-additive PageRank from different perspectives. In particular, we study which nodes are more likely to generate super-additive PageRank and also different estimators for the PageRank value of the merger when full information is not available.
Global Approach: Cooperative Games Based on the Pagerank
Game theory deals with situations in which there are several coupled interacting entities. The outcome of a game depends on the combinations of the decisions taken by its players. Depending on the way the decisions are taken, it is possible to classify the games as cooperative or noncooperative. In the former case, players may attain binding agreements; in the latter case, decisions are taken without guarantees regarding the behavior of the rest of the players. In this work, we focus on cooperative games, which are defined by two basic elements, namely a set of players and a characteristic function. More specifically, we propose to use the PageRank that an aggregation of web pages obtains as the value that corresponds to that coalition. The Shapley value [31] of these aggregation PageRank games is proposed as a centrality measure.
In this section, we first provide some grounds about cooperative game theory. Next we define how the PageRank can be computed. Finally, cooperative games based on the PageRank are defined.
A. Cooperative Game Theory
Cooperative game theory studies situations in which players can negotiate and attain binding agreements. Which coalitions of players should be expected and how to divide the benefits/costs derived from cooperation are major concerns in this branch of game theory. Formally, a cooperative game with transferable utility is a pair \begin{align} \phi _{i}(\mathcal {N},v)=\sum _{\mathcal {S}\subseteq \mathcal {N}\setminus \{i\}} \frac {|\mathcal {S}|!(|\mathcal {N}|\!-\!|\mathcal {S}|\!-\!1)!}{| \mathcal {N}|!}[v(\mathcal {S}\cup \{i\})\!-\!v(\mathcal {S})].\notag \\ {}\end{align}
B. The Pagerank Aggregation Game
The definition of a cooperative game requires a characteristic function that assigns a value to each of the possible coalitions of players that can be formed. In the PageRank aggregation game, the players are the web pages contained in \begin{equation} v(\mathcal {S})=[1\; 0 \ldots 0] \mathbf {x} ^{\prime *}=\mathbf {e} _{1}^{\mathrm {T}} \mathbf {x} ^{\prime *}. \end{equation}
Note that (5) depends only on the structure of the network of web pages
C. The Pagerank Difference Aggregation Game
An allocation vector of the previous game provides information regarding the players that are expected to provide more PageRank when they are aggregated into a coalition. While this information is valuable, it does not consider the net effect of the aggregation. It is interesting to know whether the PageRank of the merger is greater, lower or equal to the sum of the individual PageRanks of the players in the coalition. To this end, we define the PageRank difference aggregation game over the same set of players but with the following characteristic function:\begin{equation*} v_{\mathrm {dif}}(\mathcal {S} )=v(\mathcal {S} )-\sum \limits _{j\in \mathcal {S} } v(j), \end{equation*}
Remark 2:
The fact that the PageRank obtained by a coalition is lower than the sum of the PageRanks of its members should not be taken as if it is not rational to carry out the integration. A certain loss of PageRank can be acceptable in order to gain relevance in terms of centrality.
Finally, the following proposition simplifies the calculation of the Shapley value of the difference aggregation game.
Proposition 1:
The Shapley value of the difference aggregation game can be calculated as \begin{equation*} \phi _{i}(\mathcal {N},{v}_{\mathrm {dif}})=\phi _{i}(\mathcal {N}, {v})-{v}(i). \end{equation*}
Proof:
The Shapley value is linear by construction. Hence, \begin{align*} \phi _{i}(\mathcal {N},{v}_{\mathrm {dif}})=&\phi _{i}\left({\mathcal {N},v( \mathcal {S})-\sum \limits _{j\in \mathcal {S}}v(j)}\right) \\=&\phi _{i}( \mathcal {N},v(\mathcal {S}))-\phi _{i}\left({\mathcal {N},\sum \limits _{j\in \mathcal { S}}v(j)}\right). \end{align*}
Remark 3:
Notice that the computation of
D. An Academic Example (Cont.)
In this subsection, we work again with the example in Section II-B, which will help us show the differences between the PageRank of the web pages in the network and the Shapley value of the games proposed.
From the viewpoint of game theory, we model the web given by the directed graph
Table 3 provides the values of the PageRank (PR) and the Shapley values of the PageRank aggregation games of this network. As can be seen, the Shapley value of the aggregation game is fairly close to the PageRank, but it does recognize a difference on the relevance on nodes 4 and 5. In particular, node 4 is more important because of its greater impact when it is aggregated into a random coalition of web pages. For example, in all the coalitions with 5 web pages, the one with the lowest value is that in which web page 4 is not included. As a direct consequence, the relative relevance of page 6 is also reduced. The Shapley value of the difference aggregation game shows that nodes 1 to 4 are more likely to produce mergers with a higher PageRank value. Indeed, a closer look at Table 2 shows that most coalitions that produce a positive difference contain at least some of these players. Likewise, notice that the Shapley values of this game correspond to that of the aggregation game minus the PageRank of each player.5
This simple academic example shows the potential of the Shapley value of the aggregation PageRank games as an alternative centrality measure for the nodes. In addition, this perspective is also useful when evaluating the potential fusion of nodes inside the network. For example, several web owners that plan to integrate their web pages would be very interested in this type of information, specially since revenues are related to the number of visitors and hence PageRank. However, a problem arises at this point due to the combinatorial explosion and the need of centralized information. While there are efficient methods for the distributed computation of the PageRank [20], the computational and informational requirements to calculate this value are very demanding.
E. The Shapley Value of Large Networks
As previously pointed out, one of the most important solution concepts in cooperative games is the Shapley value [30]. The Shapley value allocates the worth of the grand coalition when all agents in the set of players decide to cooperate. However, a well-known problem of the Shapley value is its computation. The explosion on the number of coalitions that have to be computed hinders its calculation for large-scale games. For a set of players
In the sampling method given in [6], an alternative definition of the Shapley value is used. Indeed, the Shapley value can be expressed in terms of all possible orders of the players \begin{equation*} \phi _{i}(\mathcal {N},v)=\frac {1}{|\mathcal {N}|!}\sum _{\pi \in \Pi (\mathcal {N})}m_{i}^{\pi }(\mathcal {N},v), \text {for all }i\in \mathcal {N}, \end{equation*}
\begin{align} m_{i}^{\pi }(\mathcal {N},v)=&v(\{j\in \mathcal {N}\mid \pi (j)\leq \pi (i)\})\notag \\&\qquad \qquad \quad ~ -\,v(\{j\in \mathcal {N}\mid \pi (j)<\pi (i)\}),\qquad \end{align}
Next, some terminology corresponding to the sampling process to approximate the Shapley value is given as follows:
The population set
from which the sample is taken is represented by the set of all possible orders of the set of players\mathcal {P}=\Pi (\mathcal {N}) . The sample\mathcal {N} is an element of\mathcal {Q} i.e., the sample size is\begin{equation*} \underbrace {\mathcal {P}\times \mathcal {P}\times {\cdots }\times \mathcal {P}}_{\text {$q$ times}}, \end{equation*} View Source\begin{equation*} \underbrace {\mathcal {P}\times \mathcal {P}\times {\cdots }\times \mathcal {P}}_{\text {$q$ times}}, \end{equation*}
and it is obtained with replacement.q The parameter vector
under consideration consists of the Shapley value for each\phi =\left ({ \phi _{1},\ldots ,\phi _{n}}\right ) .i\in \mathcal {N} The characteristics observed for each sampling unit
correspond to the marginal contributions of the players in the order\pi \in \Pi (\mathcal {N}) , i.e.,\pi \begin{equation*} x(\pi )=\left ({ x_{1}(\pi ),\ldots ,x_{n}(\pi )}\right ) \quad \text {with}~x_{i}(\pi )=m_{i}^{\pi }(\mathcal {N},v). \end{equation*} View Source\begin{equation*} x(\pi )=\left ({ x_{1}(\pi ),\ldots ,x_{n}(\pi )}\right ) \quad \text {with}~x_{i}(\pi )=m_{i}^{\pi }(\mathcal {N},v). \end{equation*}
The estimate of the Shapley value,
, will be given by the average of the marginal contributions over the sample\widetilde {\phi }_{i}(\mathcal {N},v) , i.e.,Q \begin{equation*} \widetilde {\phi }_{i}(\mathcal {N},v)=\frac {1}{q}\sum _{\pi \in Q}m_{i}^{\pi }(\mathcal {N},v),\quad \text {for all}~i\in \mathcal {N}. \end{equation*} View Source\begin{equation*} \widetilde {\phi }_{i}(\mathcal {N},v)=\frac {1}{q}\sum _{\pi \in Q}m_{i}^{\pi }(\mathcal {N},v),\quad \text {for all}~i\in \mathcal {N}. \end{equation*}
Any order
will be taken with equal probability to determine the sample\pi \in \Pi (\mathcal {N}) in the process of selection.\mathcal {Q}
The sampling process described above gives an approximation of the Shapley value with desirable properties. For instance, it is possible to calculate the theoretical error in a probabilistic way. In particular, \begin{equation*} \widetilde {\phi }_{i}(\mathcal {N},v) \sim N({\phi }_{i}(\mathcal {N},v), \sigma ^{2}/q), \end{equation*}
F. Example
In this subsection we deal with a web composed of 200 nodes. The links between the nodes can be seen in Figure 2, where yellow and blue are used to denote respectively the presence and absence of links between two given nodes. Notice that the direct computation of the Shapley value of any of the games proposed in this work requires the computation of the value of 2200 coalitions (
The application of the method from the previous subsection is particularly interesting in this case. In particular, the fact that the PageRank of any given node is limited between 0 and 1 allows calculating accurate results with a reduced set of samples. For example, a maximum error of 0.01 in the PR requires the evaluation of 829400 coalitions. A maximum error ten times lower requires to compute 1353600 coalitions. As can be seen, the resulting computation burden is feasible within the limits imposed by current technology. In addition, it must be noticed that the computation can be easily parallelized.
Figure 3 shows the PageRank and Shapley values of the aggregation and difference games of the example calculated for a maximum error of
PageRank (PR), Shapley values of the aggregation (SV) and difference games (SVdg) of the 200 nodes example.
Local Approach
Even when we have shown that obtaining information regarding the suitability of merging a set of nodes
: These are the core nodes, i.e., those whose fusion is to be assessed.\mathcal {S} : Interface nodes, which are the nodes that point to or are pointed to by the nodes in\mathcal {I} . These are the neighbors of the core nodes, i.e., their interface with the rest of the network. We assume that information from this nodes is also available.\mathcal {S} : The rest of the nodes are grouped here and are called the outer nodes.\mathcal {O}
Without loss of generality, let us assume that the set \begin{equation} \mathbf {A}=\left [{ \begin{matrix} \mathbf {A}_{\mathcal {O}\mathcal {O}} & \mathbf {A}_{\mathcal {O}\mathcal {I}} & \mathbf {A}_{\mathcal {O}\mathcal {S}}\\ \mathbf {A}_{\mathcal {I}\mathcal {O}} & \mathbf {A}_{\mathcal {I}\mathcal {I}} & \mathbf {A}_{\mathcal {I}\mathcal {S}}\\ \mathbf {A}_{\mathcal {S}\mathcal {O}} & \mathbf {A}_{\mathcal {S}\mathcal {I}} & \mathbf {A}_{\mathcal {S}\mathcal {S}} \end{matrix} }\right ]. \end{equation}
Let us define the function PR\begin{equation*} PR(\mathcal {S})=(1\!-\!m)\left [{\mathbf {A}_{\mathcal {S}\mathcal {O}} ~~ \mathbf {A}_{\mathcal {S}\mathcal {I}} ~~ \mathbf {A}_{\mathcal {S}\mathcal {S}}}\right ]PR(\mathcal {N})+\frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1}. \end{equation*}
\begin{align} PR(\mathcal {S})=&\left [{(m-1)\mathbf {A}_{\mathcal {S}\mathcal {S}}+\mathbf {I}^{|\mathcal {S}|\times |\mathcal {S}|}}\right ]^{-1} \notag \\&\qquad \quad \cdot \,\left [{ (1-m)\mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})+\frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1}}\right ].\qquad \end{align}
Remark 4:
A more general version of (8) can be built by considering a non-homogeneous jump matrix \begin{align} PR(\mathcal {S})=&\left [{(m-1)\mathbf {A}_{\mathcal {S}\mathcal {S}}-m\mathbf {J}_{\mathcal {S}\mathcal {S}}+\mathbf {I}^{|\mathcal {S}|\times |\mathcal {S}|}}\right ]^{-1} \notag \\&\cdot \,\Biggl [{ (1-m)\mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})}\notag \\&{+\,m(\mathbf {J}_{\mathcal {S}\mathcal {O}}PR(\mathcal {O})+\mathbf {J}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I}))}\Biggr ]. \end{align}
A. Naive Coalition Value Estimation
Equation (8) allows us to calculate the PageRank of the set of core nodes from the PageRank of its neighbors. This also motivates us to introduce our first and simplest estimator for \begin{equation*} v^{\mathrm {SPR}}(\mathcal {S})=\mathbf {1}^{1\times |\mathcal {S}|} PR(\mathcal {S}), \end{equation*}
B. Ceteris Paribus Approach
Equation (8) can also be applied to the link-matrix of the network resulting from merging the nodes in the coalition \begin{equation} PR^{\prime }(\mathcal {S})=\frac {(1-m)\mathbf {A}^{\prime }_{\mathcal {S}\mathcal {I}} PR^{\prime }(\mathcal {I})+\frac {m}{|\mathcal {N}| -|\mathcal {S}|+1}}{(m-1)\mathbf {A}^{\prime }_{\mathcal {S}\mathcal {S}}+1}, \end{equation}
\begin{align*} \mathbf {A}^{\prime }_{\mathcal {S}\mathcal {S}}=&\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|},\\ \mathbf {A}^{\prime }_{\mathcal {S}\mathcal {I}}=&\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}. \end{align*}
\begin{equation*} v(\mathcal {S}) \ge v^{\mathrm {SPR}}(\mathcal {S}), \end{equation*}
Proposition 2:
A coalition \begin{align}&\hspace {-3.5pc}\frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR^{\prime }(\mathcal {I})+\frac {m}{|\mathcal {N}|-|\mathcal {S}|}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1} \notag \\\ge&\mathbf {1}^{1\times |\mathcal {S}|} \left [{(m-1)\mathbf {A}_{\mathcal {S}\mathcal {S}}+\mathbf {I}^{|\mathcal {S}|\times |\mathcal {S}|}}\right ]^{-1} \notag \\&\cdot \,\left [{ (1-m)\mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})+\frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1}}\right ]. \end{align}
No proof is provided for the proposition since it is a simple consequence of the definitions of
Equation (11) highlights two key ideas for a coalition to be super-additive:
Coalitions with nodes with a strong internal link interaction with respect to their interaction with rest of the network have more chances due to the term
in the denominator of (11). Given that\mathbf {A}_{\mathcal {S}\mathcal {S}} , the greater the elements of this matrix are, the better.m-1 < 0 Coalitions making their neighbors better off after the merging have better chances due to the influence of
in (11). That is, nodes forming a coalition should not only pay attention to their PageRank. Their neighbors’ PageRank is very important in this regard.PR^{\prime }(\mathcal {I})
Note also that it is not possible to evaluate Equation (11) before actually computing \begin{equation} v^{\mathrm {CP}}(\mathcal {S})= \frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I}) +\frac {m}{|\mathcal {N}|-|\mathcal {S}|+1}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}.\qquad \end{equation}
A second version of this estimator can be defined in the line of Equation (9), i.e., for the case in which the jump matrix is also affected by the node aggregation. In this case, the probability of randomly jumping to the merger becomes the sum of the probabilities of the merged nodes before the jump took place, hence giving \begin{equation} v^{\mathrm {CP2}}(\mathcal {S})=\frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})+m\frac {|\mathcal {S}|}{|\mathcal {N}|}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}. \end{equation}
C. Equivalent Subnetworks in a Pagerank Sense
In this subsection, we take a slightly different approach and pay attention to the following problem: given a network
Definition 1:
Let
Our goal is to find a network
Proposition 3:
Given a subnetwork
Proof:
Let
is a zero block because there is no link relationship between core and outer nodes.\mathbf {A}_{\mathcal {S}\mathcal {O}}^{\prime } Likewise,
and\mathbf {A}_{\mathcal {S}\mathcal {I}}^{\prime } must be equal in\mathbf {A}_{\mathcal {S}\mathcal {S}}^{\prime } and\mathbf {A}^{\prime } to preserve the link relationship between these groups of nodes.\mathbf {A}
Since Equation (8) can be applied to both networks, we can observe that \begin{equation*} \frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1} = \frac {m}{|\mathcal {N}^\prime |}\mathbf {1}^{|\mathcal {S}|\times 1}, \end{equation*}
According to Proposition 3, it is possible to obtain a network
Definition 2:
Let
From the definition, it is clear that wide-sense PageRank equivalence is less restrictive than strict PageRank equivalence. The reduced network
Proposition 4:
Let \begin{equation} \left ({(1-m)\mathbf {A}^{\prime }+m\mathbf {J}^{\prime }}\right )\mathrm {PR^{\prime }}(\mathcal {N}^\prime )=\mathrm {PR^\prime }(\mathcal {N}^\prime ). \end{equation}
\begin{align} \mathbf {A}_{\mathcal {I}\mathcal {I}}^{\prime }=&\mathbf {A}_{\mathcal {I}\mathcal {I}}, \\[2pt] \mathbf {A}_{\mathcal {I}\mathcal {S}}^{\prime }=&\mathbf {A}_{\mathcal {I}\mathcal {S}}, \\[2pt] \mathbf {A}_{\mathcal {S}\mathcal {I}}^{\prime }=&\mathbf {A}_{\mathcal {S}\mathcal {I}}, \\[2pt] \mathbf {A}_{\mathcal {S}\mathcal {S}}^{\prime }=&\mathbf {A}_{\mathcal {S}\mathcal {S}}, \\[2pt] \mathbf {A}_{\mathcal {S}\mathcal {O}}^{\prime }=&\mathbf {0}, \\[2pt] \mathbf {A} _{\mathcal {O}\mathcal {S}}^{\prime }=&\mathbf {0}, \\[2pt] PR^\prime (\mathcal {I})=&PR(\mathcal {I}), \\[2pt] PR^\prime (\mathcal {S})=&PR(\mathcal {S}). \end{align}
Proof:
The inner structure of
Notice that the conditions given are necessary but they do not lead in general to a unique solution. In particular, the following elements have to be designed:
Another noteworthy point is that there may not be a feasible solution for the set of constraints given, i.e., the set of equations may be incompatible, or even when a solution can be found
Once this merged network
Next, we present a particular case of this approach and the corresponding estimator.
1) Direct Approximation
Here, we introduce a direct approximation for \begin{equation} \mathbf {A}^{\prime }=\left [{ \begin{matrix} 1-\mathbf {1}^{1\times |\mathcal {I}|}\mathbf {A}_{\mathcal {I}\mathcal {O}}^{\prime } & \mathbf {A} _{\mathcal {O}\mathcal {I}}^{\prime } & 0 \\ \mathbf {A} _{\mathcal {I}\mathcal {O}}^{\prime } & \mathbf {A}_{\mathcal {I}\mathcal {I}} & \mathbf {A} _{\mathcal {I}\mathcal {S}} \\ 0 & \mathbf {A}_{\mathcal {S}\mathcal {I}} & \mathbf {A}_{\mathcal {S}\mathcal {S}} \end{matrix} }\right ], \end{equation}
\begin{equation} \mathbf {A}_{\mathcal {O}\mathcal {I}}^{\prime } =\mathbf {1}^{1\times |\mathcal {I}|}-\mathbf {1}^{1\times |\mathcal {I}|}\mathbf {A}_{\mathcal {I}\mathcal {I}} -\mathbf {1}^{1\times |\mathcal {S}|}\mathbf {A}_{\mathcal {S}\mathcal {I}} \end{equation}
\begin{equation} \mathbf {A}_{\mathcal {I}\mathcal {O}}^{\prime }=\frac {\mathbf {A}_{\mathcal {I}\mathcal {O}}PR(\mathcal {O})}{1-\sum \limits _{i\in \mathcal {S}\cup \mathcal {I}}PR(i)}. \end{equation}
Likewise, the matrix \begin{equation} \mathbf {J}^{\prime }=\left [{ \begin{matrix} \left({1-\dfrac {|\mathcal {I}|+|\mathcal {S}|}{|\mathcal {N}|}}\right)\mathbf {1}^{1 \times (1+|\mathcal {I}|+|\mathcal {S}|)} \\ \dfrac {1}{|\mathcal {N}|}\mathbf {1}^{(|\mathcal {I}|+|\mathcal {S}|)\times (1+|\mathcal {I}|+|\mathcal {S}|)} \end{matrix} }\right ]. \end{equation}
Once the simplified network is calculated, it is possible to assess the performance of the merger by using (10). To this end, the nodes in \begin{equation} v^{\mathrm {DA}}(\mathcal {S})= \frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR^{\prime \prime }(\mathcal {I})+\frac {m}{|\mathcal {I}|+2}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}, \end{equation}
A second version for this estimator is based on the calculation of the PageRank by means of the corresponding aggregated jump matrix \begin{align*} \mathbf {J}^{\prime \prime }=\left [{ \begin{matrix} \left({1-\dfrac {|\mathcal {I}|+|\mathcal {S}|}{|\mathcal {N}|}}\right)\mathbf {1}^{1 \times (2+|\mathcal {I}|)} \\[9pt] \dfrac {1}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {I}|\times (2+|\mathcal {I}|)} \\[9pt] \dfrac {|\mathcal {S}|}{|\mathcal {N}|}\mathbf {1}^{1\times (2+|\mathcal {I}|)} \end{matrix} }\right ]. \end{align*}
\begin{equation*} v^{\mathrm {DA2}}(\mathcal {S})= \frac { \left [{ (1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}} }\right ] PR^{\prime \prime }(\mathcal {I})+m\frac {|\mathcal {S}|}{|\mathcal {N}|}} {(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}.\qquad \quad \tag{28}\end{equation*}
D. Experimental Results
In this subsection, we show some experimental results to illustrate the different local information approaches proposed.
1) Equivalent Network Example
In Figure 4 a randomly generated 30 node network is depicted. Let’s suppose that we would like to assess whether nodes 4 and 8 should be merged, which would be our core nodes. We will take the 1-hop neighborhood as interface nodes, i.e., nodes 1, 2, 9, 11, 16, 19, 23, 24, 25, and 26. Consequently, nodes 3, 5, 6, 7, 10, 12, 13, 14, 15, 17, 18, 20, 21, 22, 27, 28, 29 and 30 are external nodes.
The method described after Proposition 4 can be applied to find a link-matrix equivalent in a wide PageRank sense by solving the set of bilinear equations that stems from (14). The result is given in (29), as shown at the top of this page
and corresponds to a 14-node network that provides the same PageRank value for all the nodes of interest.2) Estimator Assessment
Finally, we have assessed our estimators. To this end, 10000 simulations with randomly generated networks were carried out for different network sizes in the range [25, 200]. The probability that a link is established between any given two nodes was set to 0.1. The coalition size was also randomly set in the range [2, 5].
The results are shown in Tables 4–7. There, we can see data regarding the average absolute (
Clearly, the estimators work better for the cases they are designed for. It is remarkable that some of them offer much better results than the mere sum of PageRank of the merged nodes, specially regarding the prediction of super-additivity. In particular,
Conclusions
In this work, we have studied the problem of merging nodes in a network from a PageRank viewpoint. A global perspective analysis has allowed us to define games of interest in this context. Two measures have been given, one of which is directly related to the amount of additional PageRank that can be expected when merging with a node. A method for the computation of the new measures in polynomial time for large networks has been also applied.
The same problem has also been addressed from a local perspective. The lack of full information is a strong limitation that only allows obtaining estimates of the expected PageRank value. Different estimators have been introduced and experiments have been carried out to show the remarkable accuracy of some of these approximations, very particularly in their capability of assessing the potential of the coalition to generate additional PageRank.
Future work should deal with the utilization of models of restricted cooperation to integrate the structure of the directed graph in the solution. Likewise, the application of these values to coalitional control schemes [11], [12], [24] will be also studied.