Loading web-font TeX/Caligraphic/Regular
Node Aggregation for Enhancing PageRank | IEEE Journals & Magazine | IEEE Xplore

Node Aggregation for Enhancing PageRank


Best candidates for node aggregation in a 200 nodes network according to PageRank (PR) and Shapley values of the aggregation (SV) and difference games (SVdg).

Abstract:

In this paper, we study the problem of node aggregation under different perspectives for increasing PageRank of some nodes of interest. PageRank is one of the parameters ...Show More

Abstract:

In this paper, we study the problem of node aggregation under different perspectives for increasing PageRank of some nodes of interest. PageRank is one of the parameters used by the search engine Google to determine the relevance of a web page. We focus our attention to the problem of finding the best nodes in the network from an aggregation viewpoint, i.e., what are the best nodes to merge with for the given nodes. This problem is studied from global and local perspectives. Approximations are proposed to reduce the computation burden and to overcome the limitations resulting from the lack of centralized information. Several examples are presented to illustrate the different approaches that we propose.
Best candidates for node aggregation in a 200 nodes network according to PageRank (PR) and Shapley values of the aggregation (SV) and difference games (SVdg).
Published in: IEEE Access ( Volume: 5)
Page(s): 19799 - 19811
Date of Publication: 22 September 2017
Electronic ISSN: 2169-3536

Funding Agency:


SECTION I.

Introduction

The relationships established among the entities that interact in a large-scale system can be often modeled as a graph. This is particularly visible in the case of networks, where nodes are connected through links that allow exchanging or transmitting different types of flows (e.g., water, electricity, and information). The characterization of the relevance of the nodes and links in a graph is a problem that has been studied under different perspectives and that is generally related to the calculation of measures, i.e., numerical values that provide a notion of the relevance according to a specific criterion. While some measures focus on the inherent structure of the network (e.g., the well-known degree and betweenness centrality measures), others aggregate additional information into the calculation.

In this article, we study specifically the centrality measure provided by the PageRank. It has been used by search engine Google, as one of the means to rank the relevance of the results they offer to their users.1 Specific information about the PageRank value can be found in [4] and [5]. An analysis of this value and algorithms for its computation can be found in [19] and [20]. The rationale behind PageRank is that the relevance of a web page must be greater if there are significant pages pointing to it. Somehow it is as if each page votes the pages that it points to by means of the hyperlinks. Hence, the PageRank value depends only on the link structure of the graph that describes the web and its calculation can be seen as a variation to that of another popular centrality measure: eigenvector centrality.

In particular, we would like to gain insights into the problem of finding the most interesting nodes for given nodes to merge with from a PageRank viewpoint. While the aggregation of several nodes into a new one is beneficial in terms of incoming hyperlinks (as long as the new node is pointed to by all the links that were originally pointing to the merged nodes), it is clear that not all the nodes contribute equally to the fusion. Moreover, some fusions may be super-additive in the sense that the PageRank of the new node can be greater than the sum of the PageRanks of the merged nodes.

This viewpoint can be of interest in several applications where the fusion of nodes is an option to gain relevance, e.g., web pages, journals, conferences and companies. For example, PageRank was used together with other measures to evaluate company’s value taking into account the effect of network structures in [32], where data of one million Japanese companies were studied. In [9], the world trade matrix is studied to rank countries by means of PageRank and CheiRank, which is a variation that deals with the transposed link matrix. Another interesting work is [21], where economic influence and contagion propagation over the world economy are analyzed by means of a Google matrix of economic activities in the period 1995–2009. Other interesting applications can be found in [14], where the social network Twitter is examined by using this approach to detect the most influencing users, and in [22], where PageRank is used as a tool in citation analysis. Finally, it is worthy to mention [2], where the effect of newly created links on PageRank is analyzed to find out how much a web page can control its PageRank.

In order to analyze this problem, we propose two different approaches, from the global and local perspectives. In the global one, we assume that full information about the network structure is available. In order to gain insights into the relevance of the web pages, we propose different cooperative games in which the value of each coalition is characterized as the PageRank received by the new node resulting from aggregation. The Shapley value [31] of the games is related to the PageRank that a certain page should expect from joining randomly to a coalition of web pages. In this way, new centrality measures are obtained. Notice that the Shapley value has been proposed as a mathematical tool to measure node centrality in a network in many works of the literature. For example, in [28], the information diffusion process in a network is modeled as a cooperative game and its Shapley value is used to provide a measure of the node influence. Another related work is [16], where the Shapley value of the difference of a game and its graph-restricted version is proposed as a centrality measure in a network. Likewise, it has been also applied to characterize the relative relevance of links by means of the position value in [3]. Although other solution concepts also appear in the literature (e.g., in [18] a measure degree of centrality in a social network based on an extension of the Banzhaf power index is presented), the Shapley value is by far the most used game theoretical payoff rule used in this context, probably due to its properties and its straightforward interpretation in terms of average marginal contribution. Applications of the Shapley value as a measure to gain an insight into the relevance of the players in this context include social networks [18], [28], wine ranking [15], scientific influence attribution [29], shareholder influence attribution in companies [25], and control systems [24], [27], among many others.

In addition, we provide a mechanism to compute a numerical approximation of these values to overcome the combinatorial explosion issues that arise in this type of problems. It must be noticed that, while there are numerous studies on the Shapley value from a theoretical perspective, from a practical point of view the complexity of its computation is exponential, in general, leading to an NP-complete problem [8]. In the literature, the Shapley value has been computed for special classes of games as weighted voting games restricted by a tree [13] or weighted multiple majority games [1] where it has been calculated in polynomial time by algorithms based on generating functions, or particular cases of operations research games called extended tree games [17], among others. An approximation of the Shapley value for voting games by a randomized polynomial method is presented in [10]. Also, the Shapley value for some centrality-related cooperative games on networks has been solved analytically by exact algorithms in polynomial time [26]. In addition, a polynomial method based on sampling theory which can be applied to approximate the Shapley value for cooperative games is given in [7], since it is not possible to compute the marginal contributions from every player when the number of players is large. In fact, the authors of this work applied sampling as a method to describe the average of the marginal contributions for each player on the set of all possible permutations of the set of players from a representative sample of the set of all permutations. This method is only efficient if the worth of any coalition can be calculated in polynomial time.

The second approach we propose is based solely on local information. In particular, it is assumed that the nodes only have information about their neighbors. Hence, it is not possible to calculate directly the PageRank of any merger, although it can be approximated if certain simplifying assumptions hold, e.g., if the PageRank of the neighbors is not modified after the merger is formed or if the rest of the network can be approximated so that the PageRank of the nodes under study is preserved. Four alternatives are considered depending on different assumptions regarding the way the approximated PageRank is computed, which are assessed by simulation. As it will be seen in the corresponding section, our simulation results show that some of these methods are very accurate for determining super-additivity.

Finally, this work enhances substantially and contributes with more relevant results to the earlier version [23] in several ways: it includes the proofs of the theorems contained in that paper and it presents additional theoretical contributions to calculate analytically equivalent networks in a PageRank sense and estimators for the PageRank of a coalition of nodes when there is no global information available. As it will be shown, the methods proposed allow predicting with high accuracy whether a fusion of nodes will be super-additive from a PageRank viewpoint. Moreover, the limitations derived from the combinatorial explosion problem in large networks, which limited the applicability of our previous results, are solved in this article by means of a randomized method proposed in [6], which is adapted here to the games considered and assessed by means of a large example.

The outline of the paper is as follows. In Section II, grounds on the calculation of the PageRank and the problem setting are given. Section III deals with the centralized approach and introduces cooperative games based on the PageRank whose Shapley values are proposed as centrality measures. Section IV deals with the local information approach. Each of the proposed approaches is illustrated with different examples in the corresponding sections. Finally, Section V concludes the paper with final remarks and comments about future research.

SECTION II.

Problem Setting

Let \mathcal {N} be a set of networked web pages. The network can be represented as a directed graph \mathcal {G} =(\mathcal {N} ,\mathcal {E} ) , where \mathcal {E} is the set of edges representing the hyperlinks among the web pages. In case that a web page i has a link pointing to web page j then (i,j)\in \mathcal {E} .

The PageRank is itself a centrality measure related to eigenvector centrality. It provides information regarding the relevance of a certain web page based on the dominant eigenvector of a modified adjacency matrix. In particular, the PageRank of the web page i is given by the number x_{i}^{\ast }\in \lbrack 0,1] , with \sum _{i\in \mathcal {N}}x_{i}^{\ast }=1 . It relies on the assumption that a web page having links from important web pages is also important. Hence, the value of a web page i is the sum of the contributions from all web pages that have links to it, i.e., \begin{equation*} x_{i}^{\ast }=\sum \limits _{j\in \mathcal {N}_{i}}\frac {x_{j}^{\ast }}{n_{j}}, \end{equation*}

View SourceRight-click on figure for MathML and additional features. where \mathcal {N}_{i}:=\{j:(j,i)\in \mathcal {E}\} is the set of web pages pointing to the web page i and n_{j} is the number of outgoing links of web page j .

The calculation of the PageRank can also be stated in matrix form. Let \mathbf {x^{*}} :=[x_{i}^{*}]_{i\in \mathcal {N} } be a vector of the PageRanks of all the web pages. The calculation of the PageRank can be performed by solving \begin{equation} \mathbf {x^{*}} =\mathbf {A} \mathbf {x^{*}} , \quad \mathbf {x^{*}} \in [{0,1}]^{| \mathcal {N} |}, ~ \sum \limits _{i\in \mathcal {N} } x_{i}^{*}=1, \end{equation}

View SourceRight-click on figure for MathML and additional features.where \mathbf {A} is the so-called hyperlink matrix, a variation of the adjacency matrix, given by \begin{equation*} a_{ij}=\begin{cases} \displaystyle \frac {1}{n_{j}} & j\in \mathcal {N}_{i}, \\ 0 & \mathrm {otherwise}. \end{cases} \end{equation*}
View SourceRight-click on figure for MathML and additional features.

The matrix \mathbf {A} is column stochastic, i.e., \sum _{i\in \mathcal {N}}a_{ij}=1 for all j\in \mathcal {N} . As it is shown in [20], in order to have a graph of the network that satisfies this property it is necessary to remove the so-called dangling nodes, i.e., the pages with no links to other pages. This can be done by assuming that they have backward links to (some of) the pages pointing to them, which corresponds to the use of the back button in the web browser.

The PageRank vector \mathbf {x^{*}} is the nonnegative unit eigenvector that corresponds to the eigenvalue 1 of \mathbf {A} . However, in order to guarantee its uniqueness, (1) is slightly modified and the PageRank vector \mathbf {x^{*}} is defined as the solution of \begin{equation} \mathbf {x^{*}} =\mathbf {M} \mathbf {x^{*}} , \quad \mathbf {x^{*}} \in [{0,1}]^{| \mathcal {N} |}, \quad \sum \limits _{i\in \mathcal {N} } x_{i}^{*}=1 \end{equation}

View SourceRight-click on figure for MathML and additional features.with \mathbf {M}:=(1-m)\mathbf {A}+m\mathbf {J,}~m\in \left ({ 0,1}\right ) . Matrix \mathbf {M} is a convex combination of \mathbf {A} and \mathbf {J} , with \mathbf {J}=\frac {1}{|\mathcal {N}|}\mathbf {1^{|\mathcal {N}|\times | \mathcal {N}|}} , which is a jump matrix with all its elements equal to 1/|\mathcal {N}| . The jump matrix models the random jumps between pages that users perform while surfing the web. Notice that matrix \mathbf {M} is a column stochastic matrix with all positive entries. The parameter m is usually set to 0.15 by Google [5], [20].

A. Node Aggregation

Let \mathcal {S} =\{s_{1}, s_{2}, \ldots , s_{s}\} be a set composed of the web pages s_{1} , s_{2} , \ldots s_{s} that are aggregated into a single web page that preserves all the original links of the merged web pages.2 For convenience we also define \mathcal {R} =\{r_{1}, r_{2}, \ldots , r_{r}\} as the complementary set \mathcal {N} \setminus \mathcal {S} . In this paper, we are interested in the value of the resulting web page in the new configuration of the network, which will be denoted as the directed graph \mathcal {G} ^{\prime }=(\mathcal {N}^{\prime },\mathcal {E}^{\prime }) .

The aggregation of web pages can be analytically achieved by means of three auxiliary transformation matrices, namely \mathbf {P}_{ {\mathcal {S}}} , \mathbf {T}_{ {\mathcal {S}}} , and \mathbf {D}_{ {\mathcal {S}}} .3 Matrix \mathbf {P}_{ {\mathcal {S}}} is a permutation matrix whose purpose is to rearrange the link matrix so that the web pages inside \mathcal {S} appear next to each other in the first columns and rows. In particular, the permutation is given by \begin{equation*} \mathbf {P}_{ {\mathcal {S}}}= \begin{bmatrix} \mathbf {e}_{s_{1}}\mathbf {e}_{s_{2}}\ldots \mathbf {e}_{s_{s}}\mathbf {e} _{r_{1}}\mathbf {e}_{r_{2}}\ldots \mathbf {e}_{r_{r}} \end{bmatrix}, \end{equation*}

View SourceRight-click on figure for MathML and additional features. where \mathbf {e}_{i} denotes the column vector of length |\mathcal {N}| with 1 in the i -th entry and 0 in all other entries. \mathbf {T}_{ {\mathcal {S}}} is a transformation matrix whose goal is to aggregate into a single web page the members in \mathcal {S} . It is given by \begin{equation*} \mathbf {T} _ {\mathcal {S}}= \left [{ \begin{array}{cc} \mathbf {1}^{|\mathcal {S}|\times 1} & \mathbf {0}^{|\mathcal {S}|\times |\mathcal {R}|} \\ \mathbf {0}^{|\mathcal {R}|\times 1} & \mathbf {I}^{|\mathcal {R}|\times |\mathcal {R}|} \end{array} }\right ]. \end{equation*}
View SourceRight-click on figure for MathML and additional features.
Finally, \mathbf {D}_{ {\mathcal {S}}} guarantees that the new link matrix \mathbf {A}^{\prime } is also a column stochastic matrix. To this end, it normalizes the column in which the web pages are aggregated. Hence, \begin{equation*} \mathbf {D} _ {\mathcal {S}}= \left [{ \begin{array}{cccc} \dfrac {1}{|\mathcal {S}|} & \mathbf {0}^{1\times |\mathcal {R}|} \\ \mathbf {0}^{|\mathcal {R}|\times 1} & \mathbf {I}^{|\mathcal {R}|\times |\mathcal {R}|} \end{array} }\right ]. \end{equation*}
View SourceRight-click on figure for MathML and additional features.

The link matrix of \mathcal {G}^{\prime } can be derived from the original network \mathcal {G} as follows:\begin{equation*} \mathbf {A}^{\prime }=\mathbf {T}_{ {\mathcal {S}}}^{\mathrm {T}} \mathbf {P}_{ {\mathcal {S}}}^{\mathrm {T}}\mathbf {A}\mathbf {P}_{ {\mathcal {S}}} \mathbf {T}_{ {\mathcal {S}}}\mathbf {D}_{ {\mathcal {S}}}. \end{equation*}

View SourceRight-click on figure for MathML and additional features. Here, the rows and columns of \mathbf {A} are conveniently rearranged by \mathbf {P}_{ {\mathcal {S}}} , the rows and columns of the web pages in \mathcal {S} are merged by means of \mathbf {T}_{ {\mathcal {S}}} , and \mathbf {D}_{ {\mathcal {S}}} adjusts the final result to have the column stochastic matrix \mathbf {A}^{\prime } . The corresponding PageRank can be calculated as in (2), with \mathbf {M}^{\prime }:= (1-m)\mathbf {A}^{\prime }+m \mathbf {J}^{\prime } and \mathbf {J}^{\prime }=\frac {1}{|\mathcal {N}^{\prime }|}\mathbf {1^{|\mathcal {N}^{\prime }|\times |\mathcal {N}^{\prime }|}} .

The PageRank of the web page that results from merging the web pages in \mathcal {S} is defined as the PageRank of the aggregated web page in \mathcal {G} ^{\prime } , that is, the first component of the new PageRank vector \mathbf {x}^{\prime *} . This can be calculated as the limit of the sequence generated by the power method as \begin{equation} {\mathbf {x}}'[k\!+\!1]= {\mathbf {M}}' {\mathbf {x}}'[k]=(1\!-\!m) {\mathbf {A}}' {\mathbf {x}}'[k]\!+\!\frac {m}{| {\mathcal {N}}'|}\mathbf {1}^{| {\mathcal {N}}'|\times 1}\qquad \end{equation}

View SourceRight-click on figure for MathML and additional features. when k\rightarrow \infty and \mathbf {x}^{\prime }[{0}]=\frac {1}{|\mathcal {N}^{\prime }|}\mathbf {1^{|\mathcal {N}^{\prime }|\times 1}} .

Remark 1:

It is possible to consider alternatives in the way the matrix \mathbf {M}^{\prime } is built. The only strong requirement is that it must be a column stochastic matrix. One key question is whether to assume if the random surfer still jumps between the pages in the merged network \mathcal {G} ^{\prime } with the same pro- bability, i.e., if \mathbf {J}^{\prime }=\frac {1}{| {\mathcal {N}}'|}\mathbf {1^{| {\mathcal {N}}'|\times | {\mathcal {N}}'|}} should hold for this case. This corresponds to the extreme case in which the new node receives the same amount of random jumps as any other one even when it may emerge after the fusion of many nodes.

An alternative case could be to consider that the random surfer jumps with an aggregated probability to the nodes inside \mathcal {S} , i.e., \begin{equation*} \mathbf {J}^{\prime }:=\frac {1}{| {\mathcal {N}}|}\left [{ \begin{matrix} |\mathcal {S}|\mathbf {1^{1\times |\mathcal {N}^{\prime }|}} \\ \mathbf {1^{|\mathcal {R}|\times |\mathcal {N}^{\prime }|}} \end{matrix} }\right ]. \end{equation*}

View SourceRight-click on figure for MathML and additional features. In this case, it is assigned to the merged node an additional probability of random jumps because it comes from the aggregation of several independent nodes. Additional alternatives based on a different rationale are possible and even a convex combination of them could be used for the calculation. However, it should be noticed that strictly speaking the use of any of these alternatives does not lead to the PageRank but to a modified value for the merger.

B. An Academic Example

In this subsection, we introduce an example taken from [20], which will help us illustrate the main problem studied in this paper.

The network is shown in Fig. 1. As can be seen, it can be modeled by means of a directed graph \mathcal {G}=\left ({ \mathcal {N}, \mathcal {E}}\right ) , where \mathcal {N}=\{1,2,3,4,5,6\} and \mathcal {E} =\left \{{(1,2),(2,1),(1,4),(2,3),(3,2), (4,3),(3,4),(4,6),} { (6,4),(4,5),(5,6),(6,5),(3,6)}\right \} . The matrix \mathbf {A} of this network is \begin{equation*} \mathbf {A} = \left [{ \begin{array}{cccccc} 0 & \dfrac {1}{2} & 0 & 0 & 0 & 0 \\ \dfrac {1}{2} & 0 & \dfrac {1}{3} & 0 & 0 & 0 \\ 0 & \dfrac {1}{2} & 0 & \dfrac {1}{3} & 0 & 0 \\ \dfrac {1}{2} & 0 & \dfrac {1}{3} & 0 & 0 & \dfrac {1}{2} \\ 0 & 0 & 0 & \dfrac {1}{3} & 0 & \dfrac {1}{2} \\ 0 & 0 & \dfrac {1}{3} & \dfrac {1}{3} & 1 & 0 \\ \end{array} }\right ] \end{equation*}

View SourceRight-click on figure for MathML and additional features. and the corresponding matrix \mathbf {M} becomes \begin{equation*} \mathbf {M} \!=\! \begin{bmatrix} 0.025 & 0.450 & 0.025 & 0.025 & 0.025 & 0.025 \\ 0.450 & 0.025 & 0.308 & 0.025 & 0.025 & 0.025 \\ 0.025 & 0.450 & 0.025 & 0.308 & 0.025 & 0.025 \\ 0.450 & 0.025 & 0.308 & 0.025 & 0.025 & 0.450 \\ 0.025 & 0.025 & 0.025 & 0.308 & 0.025 & 0.450 \\ 0.025 & 0.025 & 0.308 & 0.308 & 0.875 & 0.025 \end{bmatrix}\!. \end{equation*}
View SourceRight-click on figure for MathML and additional features.

FIGURE 1. - Example web with six pages.
FIGURE 1.

Example web with six pages.

Table 1 provides the values of the PageRank (PR) of this network. As can be seen, web pages 4, 5 and 6 have the highest PageRank values due to their larger numbers of incoming links. The value of web page 6 is greater because its incoming links come from web pages with larger values, namely pages 3, 4 and 5. It is very interesting that pages 4 and 5 share the same value of PageRank, specially since page 5 is a dangling node in [20].4

TABLE 1 PageRank of the Example Network
Table 1- 
PageRank of the Example Network

Finally, the aggregation of web pages can be studied by using (3). For example, the PageRank of the merger of web pages 1 and 2 is 0.11, which is greater than the PageRank of any of these web pages, as shown in Table 1. Nevertheless, the PageRank of the merger is lower than the sum of the PageRanks corresponding to the merged nodes. Next, let us examine what happens when nodes 1 and 4 are aggregated. In this case, the PageRank of the merger becomes 0.281, which is greater than the sum of the PageRanks of the nodes aggregated. Hence, the aggregation is super-additive in terms of PageRank.

In the next sections, we explore the generation of super-additive PageRank from different perspectives. In particular, we study which nodes are more likely to generate super-additive PageRank and also different estimators for the PageRank value of the merger when full information is not available.

SECTION III.

Global Approach: Cooperative Games Based on the Pagerank

Game theory deals with situations in which there are several coupled interacting entities. The outcome of a game depends on the combinations of the decisions taken by its players. Depending on the way the decisions are taken, it is possible to classify the games as cooperative or noncooperative. In the former case, players may attain binding agreements; in the latter case, decisions are taken without guarantees regarding the behavior of the rest of the players. In this work, we focus on cooperative games, which are defined by two basic elements, namely a set of players and a characteristic function. More specifically, we propose to use the PageRank that an aggregation of web pages obtains as the value that corresponds to that coalition. The Shapley value [31] of these aggregation PageRank games is proposed as a centrality measure.

In this section, we first provide some grounds about cooperative game theory. Next we define how the PageRank can be computed. Finally, cooperative games based on the PageRank are defined.

A. Cooperative Game Theory

Cooperative game theory studies situations in which players can negotiate and attain binding agreements. Which coalitions of players should be expected and how to divide the benefits/costs derived from cooperation are major concerns in this branch of game theory. Formally, a cooperative game with transferable utility is a pair (\mathcal {N},v) , where \mathcal {N} is the set of players and v is a function that assigns a value to each possible coalition \mathcal {S}\subseteq \mathcal {N} , with v(\emptyset )=0 . A payoff rule is a vector that assigns a certain amount to each player according to its contribution to the game. In this work we use the Shapley value, which assigns to each player i\in \mathcal {N} the value \begin{align} \phi _{i}(\mathcal {N},v)=\sum _{\mathcal {S}\subseteq \mathcal {N}\setminus \{i\}} \frac {|\mathcal {S}|!(|\mathcal {N}|\!-\!|\mathcal {S}|\!-\!1)!}{| \mathcal {N}|!}[v(\mathcal {S}\cup \{i\})\!-\!v(\mathcal {S})].\notag \\ {}\end{align}

View SourceRight-click on figure for MathML and additional features. This can be interpreted as the value that each player gets according to a weighted average of the contributions he makes to the different coalitions. The Shapley value on the class of transferable utility games satisfies some interesting properties such as Linearity, Efficiency, Dummy player and Symmetry [31].

B. The Pagerank Aggregation Game

The definition of a cooperative game requires a characteristic function that assigns a value to each of the possible coalitions of players that can be formed. In the PageRank aggregation game, the players are the web pages contained in \mathcal {G} and the term coalition is understood in the sense of aggregation. A coalition of web pages \mathcal {S} =\{s_{1}, s_{2}, \ldots , s_{s}\} is the aggregation of these web pages into a single web page that preserves all the original links. The value of coalition \mathcal {S} is defined as the PageRank of the aggregated web page in \mathcal {G}^{\prime } , that is, the first component of the new PageRank vector \mathbf {x}^{\prime *} . Hence \begin{equation} v(\mathcal {S})=[1\; 0 \ldots 0] \mathbf {x} ^{\prime *}=\mathbf {e} _{1}^{\mathrm {T}} \mathbf {x} ^{\prime *}. \end{equation}

View SourceRight-click on figure for MathML and additional features.

Note that (5) depends only on the structure of the network of web pages {\mathcal {G}} and hence the Shapley value of this game provides us with a centrality measure of the web pages.

C. The Pagerank Difference Aggregation Game

An allocation vector of the previous game provides information regarding the players that are expected to provide more PageRank when they are aggregated into a coalition. While this information is valuable, it does not consider the net effect of the aggregation. It is interesting to know whether the PageRank of the merger is greater, lower or equal to the sum of the individual PageRanks of the players in the coalition. To this end, we define the PageRank difference aggregation game over the same set of players but with the following characteristic function:\begin{equation*} v_{\mathrm {dif}}(\mathcal {S} )=v(\mathcal {S} )-\sum \limits _{j\in \mathcal {S} } v(j), \end{equation*}

View SourceRight-click on figure for MathML and additional features. where for simplicity we have denoted v(\{i\}) by v(i) . The characteristic function v_{\mathrm {dif}}(\mathcal {S} ) measures the gain or loss of PageRank derived from the aggregation. If v_{\mathrm {dif}}(\mathcal {S} )>0 for a certain coalition \mathcal {S} then the aggregation is rational and the players have a strong incentive to perform the coalition. The Shapley value of the difference game provides information about the best players to be integrated from this perspective.

Remark 2:

The fact that the PageRank obtained by a coalition is lower than the sum of the PageRanks of its members should not be taken as if it is not rational to carry out the integration. A certain loss of PageRank can be acceptable in order to gain relevance in terms of centrality.

Finally, the following proposition simplifies the calculation of the Shapley value of the difference aggregation game.

Proposition 1:

The Shapley value of the difference aggregation game can be calculated as \begin{equation*} \phi _{i}(\mathcal {N},{v}_{\mathrm {dif}})=\phi _{i}(\mathcal {N}, {v})-{v}(i). \end{equation*}

View SourceRight-click on figure for MathML and additional features.

Proof:

The Shapley value is linear by construction. Hence, \begin{align*} \phi _{i}(\mathcal {N},{v}_{\mathrm {dif}})=&\phi _{i}\left({\mathcal {N},v( \mathcal {S})-\sum \limits _{j\in \mathcal {S}}v(j)}\right) \\=&\phi _{i}( \mathcal {N},v(\mathcal {S}))-\phi _{i}\left({\mathcal {N},\sum \limits _{j\in \mathcal { S}}v(j)}\right). \end{align*}

View SourceRight-click on figure for MathML and additional features. The Shapley value of the game \sum \limits _{j\in \mathcal {S} } v(j) is simply \phi _{i}\left({\mathcal {N} ,\sum \limits _{j\in \mathcal {S} } v(j)}\right)=v(i) because the marginal contribution in this game of adding any player to a random coalition is constant and equal to its own value v(i) .

Remark 3:

Notice that the computation of \phi _{i}(\mathcal {N} ,{v}_{\mathrm { dif}}) can be performed immediately after \phi _{i}(\mathcal {N} ,{v} ) is computed as it only requires to subtract the PageRank of the corresponding player to this amount.

D. An Academic Example (Cont.)

In this subsection, we work again with the example in Section II-B, which will help us show the differences between the PageRank of the web pages in the network and the Shapley value of the games proposed.

From the viewpoint of game theory, we model the web given by the directed graph \mathcal {G} as a cooperative game (\mathcal {N},{v}) where \mathcal {N} is the set of web pages and {v} is defined as the PageRank of the aggregated coalition. In Table 2, the value of the characteristic function {v} is shown for each coalition of web pages.

TABLE 2 Characteristic Function for the Example Network
Table 2- 
Characteristic Function for the Example Network

Table 3 provides the values of the PageRank (PR) and the Shapley values of the PageRank aggregation games of this network. As can be seen, the Shapley value of the aggregation game is fairly close to the PageRank, but it does recognize a difference on the relevance on nodes 4 and 5. In particular, node 4 is more important because of its greater impact when it is aggregated into a random coalition of web pages. For example, in all the coalitions with 5 web pages, the one with the lowest value is that in which web page 4 is not included. As a direct consequence, the relative relevance of page 6 is also reduced. The Shapley value of the difference aggregation game shows that nodes 1 to 4 are more likely to produce mergers with a higher PageRank value. Indeed, a closer look at Table 2 shows that most coalitions that produce a positive difference contain at least some of these players. Likewise, notice that the Shapley values of this game correspond to that of the aggregation game minus the PageRank of each player.5

TABLE 3 Comparison of the Values
Table 3- 
Comparison of the Values

This simple academic example shows the potential of the Shapley value of the aggregation PageRank games as an alternative centrality measure for the nodes. In addition, this perspective is also useful when evaluating the potential fusion of nodes inside the network. For example, several web owners that plan to integrate their web pages would be very interested in this type of information, specially since revenues are related to the number of visitors and hence PageRank. However, a problem arises at this point due to the combinatorial explosion and the need of centralized information. While there are efficient methods for the distributed computation of the PageRank [20], the computational and informational requirements to calculate this value are very demanding.

E. The Shapley Value of Large Networks

As previously pointed out, one of the most important solution concepts in cooperative games is the Shapley value [30]. The Shapley value allocates the worth of the grand coalition when all agents in the set of players decide to cooperate. However, a well-known problem of the Shapley value is its computation. The explosion on the number of coalitions that have to be computed hinders its calculation for large-scale games. For a set of players \mathcal {N} , 2^{|\mathcal {N} |} different coalitions need to be evaluated. For example, a network with only 30 nodes requires 230 coalitions to be evaluated, i.e., a billion of value computations for a relatively small size network. To overcome this issue, in this work we employ the numerical approximation of the Shapley value proposed in [6], which is based on a randomized method. Next, we describe briefly this method. It is important to emphasize that for the games defined previously the computation of the value of each coalition can be realized in polynomial time.

In the sampling method given in [6], an alternative definition of the Shapley value is used. Indeed, the Shapley value can be expressed in terms of all possible orders of the players \mathcal {N} , assuming that all different orders have the same probability, in the following way:\begin{equation*} \phi _{i}(\mathcal {N},v)=\frac {1}{|\mathcal {N}|!}\sum _{\pi \in \Pi (\mathcal {N})}m_{i}^{\pi }(\mathcal {N},v), \text {for all }i\in \mathcal {N}, \end{equation*}

View SourceRight-click on figure for MathML and additional features. where \Pi (\mathcal {N}) is the collection of all permutations \pi \colon \mathcal {N}\rightarrow \mathcal {N} on \mathcal {N} , and for every permutation \pi \in \Pi (\mathcal {N}) , \begin{align} m_{i}^{\pi }(\mathcal {N},v)=&v(\{j\in \mathcal {N}\mid \pi (j)\leq \pi (i)\})\notag \\&\qquad \qquad \quad ~ -\,v(\{j\in \mathcal {N}\mid \pi (j)<\pi (i)\}),\qquad \end{align}
View SourceRight-click on figure for MathML and additional features.
is the marginal contribution of player i to the players that are ranked before him in the order \pi . Therefore, the Shapley value assigns to every game the average over all marginal vectors associated to all permutations of the player set \mathcal {N} .

Next, some terminology corresponding to the sampling process to approximate the Shapley value is given as follows:

  1. The population set \mathcal {P}=\Pi (\mathcal {N}) from which the sample is taken is represented by the set of all possible orders of the set of players \mathcal {N} . The sample \mathcal {Q} is an element of \begin{equation*} \underbrace {\mathcal {P}\times \mathcal {P}\times {\cdots }\times \mathcal {P}}_{\text {$q$ times}}, \end{equation*}

    View SourceRight-click on figure for MathML and additional features. i.e., the sample size is q and it is obtained with replacement.

  2. The parameter vector \phi =\left ({ \phi _{1},\ldots ,\phi _{n}}\right ) under consideration consists of the Shapley value for each i\in \mathcal {N} .

  3. The characteristics observed for each sampling unit \pi \in \Pi (\mathcal {N}) correspond to the marginal contributions of the players in the order \pi , i.e., \begin{equation*} x(\pi )=\left ({ x_{1}(\pi ),\ldots ,x_{n}(\pi )}\right ) \quad \text {with}~x_{i}(\pi )=m_{i}^{\pi }(\mathcal {N},v). \end{equation*}

    View SourceRight-click on figure for MathML and additional features.

  4. The estimate of the Shapley value, \widetilde {\phi }_{i}(\mathcal {N},v) , will be given by the average of the marginal contributions over the sample Q , i.e., \begin{equation*} \widetilde {\phi }_{i}(\mathcal {N},v)=\frac {1}{q}\sum _{\pi \in Q}m_{i}^{\pi }(\mathcal {N},v),\quad \text {for all}~i\in \mathcal {N}. \end{equation*}

    View SourceRight-click on figure for MathML and additional features.

  5. Any order \pi \in \Pi (\mathcal {N}) will be taken with equal probability to determine the sample \mathcal {Q} in the process of selection.

The sampling process described above gives an approximation of the Shapley value with desirable properties. For instance, it is possible to calculate the theoretical error in a probabilistic way. In particular, \begin{equation*} \widetilde {\phi }_{i}(\mathcal {N},v) \sim N({\phi }_{i}(\mathcal {N},v), \sigma ^{2}/q), \end{equation*}

View SourceRight-click on figure for MathML and additional features. i.e., the estimator is unbiased and its variance is given by \sigma ^{2}/q [6]. Hence, if the sample size satisfies q\ge Z_{\alpha /2}^{2} \sigma ^{2}/e^{2} , then P(|\phi _{i}(\mathcal {N},v)- \widetilde {\phi }_{i}(\mathcal {N},v)|\le e)\ge 1-\alpha , with e being the approximation error, Z\sim N(0,1) , and Z_{\alpha /2}^{2} being the value such that P(Z\ge Z_{\alpha /2}^{2} )=\alpha /2 . Given that \sigma ^{2} is unknown, it is necessary to provide an upper bound, which becomes \sigma ^{2}\le (x^{i}_{\mathrm {max}}-x^{i}_{\mathrm {min}})/4 for any random variable bounded in a range [x^{i}_{\mathrm {min}}, x^{i}_{\mathrm {max}}] [6]. If we take into account that \phi _{i}(\mathcal {N},v)\in [{0, 1}] , then the bound simply becomes \sigma ^{2}\le 0.25 in our case.

F. Example

In this subsection we deal with a web composed of 200 nodes. The links between the nodes can be seen in Figure 2, where yellow and blue are used to denote respectively the presence and absence of links between two given nodes. Notice that the direct computation of the Shapley value of any of the games proposed in this work requires the computation of the value of 2200 coalitions (1.60\times 10^{60} ), which is not feasible.

FIGURE 2. - Example web with 200 pages.
FIGURE 2.

Example web with 200 pages.

The application of the method from the previous subsection is particularly interesting in this case. In particular, the fact that the PageRank of any given node is limited between 0 and 1 allows calculating accurate results with a reduced set of samples. For example, a maximum error of 0.01 in the PR requires the evaluation of 829400 coalitions. A maximum error ten times lower requires to compute 1353600 coalitions. As can be seen, the resulting computation burden is feasible within the limits imposed by current technology. In addition, it must be noticed that the computation can be easily parallelized.

Figure 3 shows the PageRank and Shapley values of the aggregation and difference games of the example calculated for a maximum error of 5\times 10^{-4} and a 99% confidence level. In general, there is a strong correlation between the PageRank and the Shapley value of the aggregation game. However, the difference in the value of the last nodes is remarkable. As can be seen in Figure 2, these nodes are almost isolated. Hence, it is difficult for the random surfer to get out of this set of nodes once he is there. As a consequence, the PageRank of these nodes is high. However, this effect is corrected in the Shapley value of the aggregation game. From this perspective, there is not much to gain by merging with the nodes in this set. Moreover, the Shapley value of the difference game shows losses in terms of PageRank for any merger containing one of the nodes in this set. Merging breaks the isolation of the nodes, and hence the gain of PageRank may be lost. Finally, the Shapley value of the difference game shows that nodes more likely to provide an increased PageRank are those between 50 and 90.

FIGURE 3. - PageRank (PR), Shapley values of the aggregation (SV) and difference games (SVdg) of the 200 nodes example.
FIGURE 3.

PageRank (PR), Shapley values of the aggregation (SV) and difference games (SVdg) of the 200 nodes example.

SECTION IV.

Local Approach

Even when we have shown that obtaining information regarding the suitability of merging a set of nodes \mathcal {S} is feasible, full information is not always available. In such cases, it is possible to assess the potential of the merger by means of an approximation of v(\mathcal {S}) , denoted by v^{\prime }(\mathcal {S} ) , before making any decision. To this end, let us partition the nodes \mathcal {N} in \mathcal {G} in three disjoint sets:

  • \mathcal {S} : These are the core nodes, i.e., those whose fusion is to be assessed.

  • \mathcal {I} : Interface nodes, which are the nodes that point to or are pointed to by the nodes in \mathcal {S} . These are the neighbors of the core nodes, i.e., their interface with the rest of the network. We assume that information from this nodes is also available.

  • \mathcal {O} : The rest of the nodes are grouped here and are called the outer nodes.

Without loss of generality, let us assume that the set \mathcal {N} in \mathcal {G} has its elements arranged in the following order: outer nodes (\mathcal {O} ), interface nodes (\mathcal {I} ), and core nodes (\mathcal {S} ). The link-matrix \mathbf {A} has this structure as well:\begin{equation} \mathbf {A}=\left [{ \begin{matrix} \mathbf {A}_{\mathcal {O}\mathcal {O}} & \mathbf {A}_{\mathcal {O}\mathcal {I}} & \mathbf {A}_{\mathcal {O}\mathcal {S}}\\ \mathbf {A}_{\mathcal {I}\mathcal {O}} & \mathbf {A}_{\mathcal {I}\mathcal {I}} & \mathbf {A}_{\mathcal {I}\mathcal {S}}\\ \mathbf {A}_{\mathcal {S}\mathcal {O}} & \mathbf {A}_{\mathcal {S}\mathcal {I}} & \mathbf {A}_{\mathcal {S}\mathcal {S}} \end{matrix} }\right ]. \end{equation}

View SourceRight-click on figure for MathML and additional features.

Let us define the function PR(\cdot ) that provides us with the PageRank vector of the nodes in its argument, e.g., PR(\mathcal {S} ) stands for the PageRank of the core nodes. As we saw, the PageRank is the eigenvector corresponding to the unit eigenvalue of the matrix \mathbf {M} in (2). Hence, the PageRank of the core nodes has to satisfy the following equation:\begin{equation*} PR(\mathcal {S})=(1\!-\!m)\left [{\mathbf {A}_{\mathcal {S}\mathcal {O}} ~~ \mathbf {A}_{\mathcal {S}\mathcal {I}} ~~ \mathbf {A}_{\mathcal {S}\mathcal {S}}}\right ]PR(\mathcal {N})+\frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1}. \end{equation*}

View SourceRight-click on figure for MathML and additional features. Taking into account that \mathbf {A}_{\mathcal {S}\mathcal {O}}=\mathbf {0} by construction and that we can decompose the PageRank vector as PR(\mathcal {N})=[PR(\mathcal {O})^{\mathrm {T}}, PR(\mathcal {I})^{\mathrm {T}}, PR(\mathcal {S})^{\mathrm {T}}]^{\mathrm {T}} , it is possible to rewrite the equation as \begin{align} PR(\mathcal {S})=&\left [{(m-1)\mathbf {A}_{\mathcal {S}\mathcal {S}}+\mathbf {I}^{|\mathcal {S}|\times |\mathcal {S}|}}\right ]^{-1} \notag \\&\qquad \quad \cdot \,\left [{ (1-m)\mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})+\frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1}}\right ].\qquad \end{align}
View SourceRight-click on figure for MathML and additional features.

Remark 4:

A more general version of (8) can be built by considering a non-homogeneous jump matrix \mathbf {J} . In this case, (8) becomes \begin{align} PR(\mathcal {S})=&\left [{(m-1)\mathbf {A}_{\mathcal {S}\mathcal {S}}-m\mathbf {J}_{\mathcal {S}\mathcal {S}}+\mathbf {I}^{|\mathcal {S}|\times |\mathcal {S}|}}\right ]^{-1} \notag \\&\cdot \,\Biggl [{ (1-m)\mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})}\notag \\&{+\,m(\mathbf {J}_{\mathcal {S}\mathcal {O}}PR(\mathcal {O})+\mathbf {J}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I}))}\Biggr ]. \end{align}

View SourceRight-click on figure for MathML and additional features.

A. Naive Coalition Value Estimation

Equation (8) allows us to calculate the PageRank of the set of core nodes from the PageRank of its neighbors. This also motivates us to introduce our first and simplest estimator for v(\mathcal {S}) as the sum of the PageRank of the corresponding nodes in \mathcal {S} as \begin{equation*} v^{\mathrm {SPR}}(\mathcal {S})=\mathbf {1}^{1\times |\mathcal {S}|} PR(\mathcal {S}), \end{equation*}

View SourceRight-click on figure for MathML and additional features. where PR(\mathcal {S}) is obtained as in (8). Despite its simplicity, v^{\mathrm {SPR}}(\mathcal {S}) has a problem: it cannot be used to obtain information regarding the super-additivity of the merger regarding PageRank. This is important because we would like to know whether the PageRank of the merger will be greater than or equal to the sum of the PageRanks of the merged nodes.

B. Ceteris Paribus Approach

Equation (8) can also be applied to the link-matrix of the network resulting from merging the nodes in the coalition \mathcal {S} , denoted by \mathbf {A}^{\prime } , i.e., \begin{equation} PR^{\prime }(\mathcal {S})=\frac {(1-m)\mathbf {A}^{\prime }_{\mathcal {S}\mathcal {I}} PR^{\prime }(\mathcal {I})+\frac {m}{|\mathcal {N}| -|\mathcal {S}|+1}}{(m-1)\mathbf {A}^{\prime }_{\mathcal {S}\mathcal {S}}+1}, \end{equation}

View SourceRight-click on figure for MathML and additional features. where the matrices \mathbf {A}^{\prime }_{\mathcal {S}\mathcal {S}} and \mathbf {A}^{\prime }_{\mathcal {S}\mathcal {I}} can be easily obtained from \mathbf {A} as \begin{align*} \mathbf {A}^{\prime }_{\mathcal {S}\mathcal {S}}=&\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|},\\ \mathbf {A}^{\prime }_{\mathcal {S}\mathcal {I}}=&\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}. \end{align*}
View SourceRight-click on figure for MathML and additional features.
Here, the superscript ^\prime denotes the PageRank and the matrices that correspond to the network after merging the core nodes. Note that this is consistent with Section II-A. Also, everything in (10) can be known in advance with the exception of PR^{\prime }(\mathcal {I}) , i.e., the PageRank of neighbors after the merging. Note moreover that Equation (10) provides also the actual value of v(\mathcal {S}) , i.e., v(\mathcal {S})=PR^{\prime }(\mathcal {S}) . As a consequence, and taking into account that super-additive coalitions must satisfy \begin{equation*} v(\mathcal {S}) \ge v^{\mathrm {SPR}}(\mathcal {S}), \end{equation*}
View SourceRight-click on figure for MathML and additional features.
it is possible to write the condition that has to be satisfied to have a super-additive coalition.

Proposition 2:

A coalition \mathcal {S} generates super-additivity in terms of PageRank if and only if the following condition holds:\begin{align}&\hspace {-3.5pc}\frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR^{\prime }(\mathcal {I})+\frac {m}{|\mathcal {N}|-|\mathcal {S}|}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1} \notag \\\ge&\mathbf {1}^{1\times |\mathcal {S}|} \left [{(m-1)\mathbf {A}_{\mathcal {S}\mathcal {S}}+\mathbf {I}^{|\mathcal {S}|\times |\mathcal {S}|}}\right ]^{-1} \notag \\&\cdot \,\left [{ (1-m)\mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})+\frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1}}\right ]. \end{align}

View SourceRight-click on figure for MathML and additional features.

No proof is provided for the proposition since it is a simple consequence of the definitions of v(\mathcal {S}) and v^{\mathrm {SPR}}(\mathcal {S}) .

Equation (11) highlights two key ideas for a coalition to be super-additive:

  • Coalitions with nodes with a strong internal link interaction with respect to their interaction with rest of the network have more chances due to the term \mathbf {A}_{\mathcal {S}\mathcal {S}} in the denominator of (11). Given that m-1 < 0 , the greater the elements of this matrix are, the better.

  • Coalitions making their neighbors better off after the merging have better chances due to the influence of PR^{\prime }(\mathcal {I}) in (11). That is, nodes forming a coalition should not only pay attention to their PageRank. Their neighbors’ PageRank is very important in this regard.

Note also that it is not possible to evaluate Equation (11) before actually computing \mathbf {A}^\prime . Nevertheless, if we take the classical ceteris paribus approach and assume that everything holds constant despite merging the nodes, we can evaluate Equation (11) by taking PR^{\prime }(\mathcal {I})=PR(\mathcal {I}) . This assumption provides us with a reasonable guess of whether we can expect a super-additive coalition as long as the PageRank of neighbors is not much altered after the merging, which is likely to happen in a large-scale network. As a consequence, we arrive at our ceteris paribus estimator for v(\mathcal {S}) , which is defined as \begin{equation} v^{\mathrm {CP}}(\mathcal {S})= \frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I}) +\frac {m}{|\mathcal {N}|-|\mathcal {S}|+1}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}.\qquad \end{equation}

View SourceRight-click on figure for MathML and additional features.

A second version of this estimator can be defined in the line of Equation (9), i.e., for the case in which the jump matrix is also affected by the node aggregation. In this case, the probability of randomly jumping to the merger becomes the sum of the probabilities of the merged nodes before the jump took place, hence giving \begin{equation} v^{\mathrm {CP2}}(\mathcal {S})=\frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR(\mathcal {I})+m\frac {|\mathcal {S}|}{|\mathcal {N}|}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}. \end{equation}

View SourceRight-click on figure for MathML and additional features.

C. Equivalent Subnetworks in a Pagerank Sense

In this subsection, we take a slightly different approach and pay attention to the following problem: given a network \mathcal {G} that contains a subnetwork \mathcal {G}_{\mathrm {sub}} , we would like to find another network \mathcal {G}^{\prime } that also contains \mathcal {G}_{\mathrm {sub}} so that the PageRank of the nodes in \mathcal {G}_{\mathrm {sub}} is equal in both networks \mathcal {G} and \mathcal {G}^{\prime } .

Definition 1:

Let \mathcal {G} and \mathcal {G} ^{\prime } be two networks containing a subnetwork \mathcal {G} _{\mathrm {sub}} . We say that they are equivalent in a strict PageRank sense for the subnetwork \mathcal {G} _{\mathrm {sub}} if the PageRank of the nodes in \mathcal {G} _{\mathrm {sub}} is the same in both networks.

Our goal is to find a network \mathcal {G}^{\prime } simpler than \mathcal {G} that allows assessing the convenience of merging some of the elements in \mathcal {G} _{\mathrm {sub}} . In particular, we assume that \mathcal {G} _{\mathrm {sub}} contains the nodes in \mathcal {S} and \mathcal {I} and their internal links.

Proposition 3:

Given a subnetwork \mathcal {G} _{\mathrm {sub}} of \mathcal {G} with a nonempty set \mathcal {S} of core nodes, a simplified network \mathcal {G} ^{\prime } that also contains \mathcal {G} _{\mathrm {sub}} must have the same number of nodes with \mathcal {G} to provide the same PageRank to the nodes in \mathcal {G} _{\mathrm {sub}} , i.e., to provide strict PageRank equivalence for subnetwork \mathcal {G} _{\mathrm {sub}} .

Proof:

Let \mathbf {A} and \mathbf {A}^{\prime } be the link-matrices that correspond respectively to \mathcal {G} and \mathcal {G}^{\prime } . Strict PageRank equivalence requires that the condition PR(\mathcal {S})=PR^\prime (\mathcal {S}) holds. Now note that:

  • \mathbf {A}_{\mathcal {S}\mathcal {O}}^{\prime } is a zero block because there is no link relationship between core and outer nodes.

  • Likewise, \mathbf {A}_{\mathcal {S}\mathcal {I}}^{\prime } and \mathbf {A}_{\mathcal {S}\mathcal {S}}^{\prime } must be equal in \mathbf {A}^{\prime } and \mathbf {A} to preserve the link relationship between these groups of nodes.

Since Equation (8) can be applied to both networks, we can observe that PR(\mathcal {S})=PR^\prime (\mathcal {S}) holds only if \begin{equation*} \frac {m}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {S}|\times 1} = \frac {m}{|\mathcal {N}^\prime |}\mathbf {1}^{|\mathcal {S}|\times 1}, \end{equation*}

View SourceRight-click on figure for MathML and additional features. which is equivalent to |\mathcal {N}|=|\mathcal {N}^{\prime }| . Consequently, a different number of nodes will lead to a different value of the PageRank in the core nodes.

According to Proposition 3, it is possible to obtain a network \mathcal {G} ^{\prime } with a simpler structure but the number of nodes must remain constant whenever there exist core nodes. Overcoming this issue requires to introduce changes in the way the PageRank is computed following the ideas of Remark 1. If the structure of \mathbf {M} is changed, then it may be possible to find a network \mathcal {G} ^{\prime } with a reduced number of nodes that still provides the same PageRank value for the nodes of the subnetwork under study according to the modified version of the PageRank computation. To this end, we introduce the following definition:

Definition 2:

Let \mathcal {G} and \mathcal {G}^{\prime } be two networks containing a subnetwork \mathcal {G}_{ \mathrm {sub}} . For \mathcal {G} ^{\prime } , let \mathbf {M}^{\prime } be a column stochastic matrix built using a non-uniform distribution of the random surfer jumps, which is denoted by \mathbf {J} ^{\prime } , i.e., \mathbf {M}^{\prime }=(1-m)\mathbf {A}^{\prime }+m\mathbf {J }^{\prime } . Let PR\mathrm {^{\prime }} be the eigenvector that corresponds to the eigenvalue 1 of this matrix. We say that \mathcal {G} and \mathcal {G}^{\prime } are equivalent in a wide PageRank sense for the subnetwork \mathcal {G}_{\mathrm {sub}} if the PageRank PR of \mathcal {G} and the modified PageRank \mathrm {PR^{\prime }} of \mathcal {G}^{\prime } are the same for the nodes in \mathcal {G}_{\mathrm {sub}} .

From the definition, it is clear that wide-sense PageRank equivalence is less restrictive than strict PageRank equivalence. The reduced network \mathcal {G} ^{\prime } can be ultimately described by \mathbf {A} ^{\prime } , which must be calculated.

Proposition 4:

Let \mathbf {A}^{\prime } and \mathbf {J}^{\prime } be nonnegative column stochastic matrices representing respectively the link-matrix and the jump matrix of network \mathcal {G}^{\prime } so that \begin{equation} \left ({(1-m)\mathbf {A}^{\prime }+m\mathbf {J}^{\prime }}\right )\mathrm {PR^{\prime }}(\mathcal {N}^\prime )=\mathrm {PR^\prime }(\mathcal {N}^\prime ). \end{equation}

View SourceRight-click on figure for MathML and additional features. For the networks \mathcal {G} and \mathcal {G}^{\prime } to be equivalent in a wide PageRank sense for a subnetwork \mathcal {G}_{\mathrm {sub}} , it holds that \begin{align} \mathbf {A}_{\mathcal {I}\mathcal {I}}^{\prime }=&\mathbf {A}_{\mathcal {I}\mathcal {I}}, \\[2pt] \mathbf {A}_{\mathcal {I}\mathcal {S}}^{\prime }=&\mathbf {A}_{\mathcal {I}\mathcal {S}}, \\[2pt] \mathbf {A}_{\mathcal {S}\mathcal {I}}^{\prime }=&\mathbf {A}_{\mathcal {S}\mathcal {I}}, \\[2pt] \mathbf {A}_{\mathcal {S}\mathcal {S}}^{\prime }=&\mathbf {A}_{\mathcal {S}\mathcal {S}}, \\[2pt] \mathbf {A}_{\mathcal {S}\mathcal {O}}^{\prime }=&\mathbf {0}, \\[2pt] \mathbf {A} _{\mathcal {O}\mathcal {S}}^{\prime }=&\mathbf {0}, \\[2pt] PR^\prime (\mathcal {I})=&PR(\mathcal {I}), \\[2pt] PR^\prime (\mathcal {S})=&PR(\mathcal {S}). \end{align}
View SourceRight-click on figure for MathML and additional features.

Proof:

The inner structure of \mathcal {G} _{\mathrm {sub}} is preserved by means of the equality constraints of \mathbf {A}^{\prime } (15)–(20). Both \mathbf {A}_{\mathcal {S}\mathcal {O}}^{\prime } and \mathbf {A}_{\mathcal {O}\mathcal {S}}^{\prime } are zero blocks of the corresponding sizes and the elements in \mathbf {A}_{\mathcal {I}\mathcal {I}}^{\prime }, \mathbf {A} _{\mathcal {I}\mathcal {S}}^{\prime }, \mathbf {A}_{\mathcal {S}\mathcal {I}}^{\prime } and \mathbf {A}_{\mathcal {S}\mathcal {S}}^{\prime } have the same values as those in the corresponding submatrices in \mathbf {A} . Finally, Equations (21) and (22) require that the core and interface nodes receive the same value in PR^\prime , which stems from the definition of equivalent network in a wide PageRank sense.

Notice that the conditions given are necessary but they do not lead in general to a unique solution. In particular, the following elements have to be designed: \mathbf {A}_{\mathcal {O}\mathcal {O}}^{\prime } , \mathbf {A}_{\mathcal {I}\mathcal {O}}^{\prime } , \mathbf {A}_{\mathcal {O}\mathcal {I}}^{\prime } , \mathbf {J}^{\prime } , and PR^\prime (\mathcal {O^\prime }) . The direct computation of these elements based on (14) leads us to a set of bilinear equations because there are cross multiplications in these variables. Alternatively, it may be preferable to solve a set of linear equations for different fixed values of the modified PageRank value of the external nodes, especially because there is an incentive to keep a low number of them for the sake of simplicity.

Another noteworthy point is that there may not be a feasible solution for the set of constraints given, i.e., the set of equations may be incompatible, or even when a solution can be found \mathbf {A}^{\prime } may be non-stochastic, etc. In that case, a new attempt can be made with an increase in the number of external nodes. Consecutive increments can be made until \mathbf {A}^{\prime } is found. Finally, notice that the increase of external elements is bounded because in the worst case a solution for the problem is guaranteed trivially for \mathcal {G}=\mathcal {G}^{\prime } . In that case \mathcal {G} would not be reducible for the considered subnetwork.

Once this merged network \mathcal {G}^{\prime } has been calculated, an estimation of v(\mathcal {S}) is given by v^{\prime }(\mathcal {S}) , i.e., the PageRank that corresponds to the fusion of the nodes in the reduced network.

Next, we present a particular case of this approach and the corresponding estimator.

1) Direct Approximation

Here, we introduce a direct approximation for v^{\prime } (\mathcal {S} ) based on a simplification of the original network \mathcal {G} in which we aggregate all the nodes outside \mathcal {G} _{\mathrm {sub}} into a single node. As we will see, even if full information is not available, it is possible to calculate this approximation if the links between the nodes in \mathcal {I} and \mathcal {O} are known. To this end, we construct \mathbf {A}^{\prime } as follows:\begin{equation} \mathbf {A}^{\prime }=\left [{ \begin{matrix} 1-\mathbf {1}^{1\times |\mathcal {I}|}\mathbf {A}_{\mathcal {I}\mathcal {O}}^{\prime } & \mathbf {A} _{\mathcal {O}\mathcal {I}}^{\prime } & 0 \\ \mathbf {A} _{\mathcal {I}\mathcal {O}}^{\prime } & \mathbf {A}_{\mathcal {I}\mathcal {I}} & \mathbf {A} _{\mathcal {I}\mathcal {S}} \\ 0 & \mathbf {A}_{\mathcal {S}\mathcal {I}} & \mathbf {A}_{\mathcal {S}\mathcal {S}} \end{matrix} }\right ], \end{equation}

View SourceRight-click on figure for MathML and additional features. with \begin{equation} \mathbf {A}_{\mathcal {O}\mathcal {I}}^{\prime } =\mathbf {1}^{1\times |\mathcal {I}|}-\mathbf {1}^{1\times |\mathcal {I}|}\mathbf {A}_{\mathcal {I}\mathcal {I}} -\mathbf {1}^{1\times |\mathcal {S}|}\mathbf {A}_{\mathcal {S}\mathcal {I}} \end{equation}
View SourceRight-click on figure for MathML and additional features.
and \begin{equation} \mathbf {A}_{\mathcal {I}\mathcal {O}}^{\prime }=\frac {\mathbf {A}_{\mathcal {I}\mathcal {O}}PR(\mathcal {O})}{1-\sum \limits _{i\in \mathcal {S}\cup \mathcal {I}}PR(i)}. \end{equation}
View SourceRight-click on figure for MathML and additional features.

Likewise, the matrix \mathbf {J}^{\prime } is built by accumulating the jump probability of the nodes aggregated into the new node while it remains constant for the rest of the nodes in the network, i.e., \begin{equation} \mathbf {J}^{\prime }=\left [{ \begin{matrix} \left({1-\dfrac {|\mathcal {I}|+|\mathcal {S}|}{|\mathcal {N}|}}\right)\mathbf {1}^{1 \times (1+|\mathcal {I}|+|\mathcal {S}|)} \\ \dfrac {1}{|\mathcal {N}|}\mathbf {1}^{(|\mathcal {I}|+|\mathcal {S}|)\times (1+|\mathcal {I}|+|\mathcal {S}|)} \end{matrix} }\right ]. \end{equation}

View SourceRight-click on figure for MathML and additional features.

Once the simplified network is calculated, it is possible to assess the performance of the merger by using (10). To this end, the nodes in \mathcal {S} are merged and the PageRank of the resulting merger PR^{\prime \prime }(\mathcal {S}) provides us with the direct approximation estimator v^{\mathrm {DA}}(\mathcal {S}) . It is also possible to calculate an analytical expression for this estimator based on the PageRank of the interface nodes as follows:\begin{equation} v^{\mathrm {DA}}(\mathcal {S})= \frac {(1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}}PR^{\prime \prime }(\mathcal {I})+\frac {m}{|\mathcal {I}|+2}}{(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}, \end{equation}

View SourceRight-click on figure for MathML and additional features. where PR^{\prime \prime }(\mathcal {I}) is the PageRank that the interface nodes \mathcal {I} receive in the new network. Here, the double apostrophe ^{\prime \prime } denotes that the network has been transformed twice before the calculation of the PageRank of these nodes. As can be seen, the rest of the elements in (27) belong to the original link matrix \mathbf {A} .

A second version for this estimator is based on the calculation of the PageRank by means of the corresponding aggregated jump matrix \begin{align*} \mathbf {J}^{\prime \prime }=\left [{ \begin{matrix} \left({1-\dfrac {|\mathcal {I}|+|\mathcal {S}|}{|\mathcal {N}|}}\right)\mathbf {1}^{1 \times (2+|\mathcal {I}|)} \\[9pt] \dfrac {1}{|\mathcal {N}|}\mathbf {1}^{|\mathcal {I}|\times (2+|\mathcal {I}|)} \\[9pt] \dfrac {|\mathcal {S}|}{|\mathcal {N}|}\mathbf {1}^{1\times (2+|\mathcal {I}|)} \end{matrix} }\right ]. \end{align*}

View SourceRight-click on figure for MathML and additional features. We denote the estimator by v^{\mathrm {DA2}}(\mathcal {S}) . We can also provide an analytical expression for this estimator based on the contribution of interface nodes by using (9) as follows:\begin{equation*} v^{\mathrm {DA2}}(\mathcal {S})= \frac { \left [{ (1-m)\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {I}} }\right ] PR^{\prime \prime }(\mathcal {I})+m\frac {|\mathcal {S}|}{|\mathcal {N}|}} {(m-1)\frac {\mathbf {1}^{1\times |\mathcal {S}|} \mathbf {A}_{\mathcal {S}\mathcal {S}} \mathbf {1}^{|\mathcal {S}|\times 1}}{|\mathcal {S}|}+1}.\qquad \quad \tag{28}\end{equation*}
View SourceRight-click on figure for MathML and additional features.

D. Experimental Results

In this subsection, we show some experimental results to illustrate the different local information approaches proposed.

1) Equivalent Network Example

In Figure 4 a randomly generated 30 node network is depicted. Let’s suppose that we would like to assess whether nodes 4 and 8 should be merged, which would be our core nodes. We will take the 1-hop neighborhood as interface nodes, i.e., nodes 1, 2, 9, 11, 16, 19, 23, 24, 25, and 26. Consequently, nodes 3, 5, 6, 7, 10, 12, 13, 14, 15, 17, 18, 20, 21, 22, 27, 28, 29 and 30 are external nodes.

FIGURE 4. - Example web with thirty pages.
FIGURE 4.

Example web with thirty pages.

The method described after Proposition 4 can be applied to find a link-matrix equivalent in a wide PageRank sense by solving the set of bilinear equations that stems from (14). The result is given in (29), as shown at the top of this page

and corresponds to a 14-node network that provides the same PageRank value for all the nodes of interest.

2) Estimator Assessment

Finally, we have assessed our estimators. To this end, 10000 simulations with randomly generated networks were carried out for different network sizes in the range [25, 200]. The probability that a link is established between any given two nodes was set to 0.1. The coalition size was also randomly set in the range [2, 5].

The results are shown in Tables 4–​7. There, we can see data regarding the average absolute (|\epsilon | ) and relative (|\epsilon _{r}| ) approximation error for two cases, namely, standard PageRank calculation (PR) and PageRank when an aggregated jump matrix is used (PR (aggr. \mathbf {J} )). As for the latter, it corresponds to the case where the merger receives a probability of jump that is equal to the sum of the probabilities of the nodes before merging. The standard deviations for these errors are also provided (|\epsilon |_\sigma and |\epsilon _{r}|_\sigma ). In addition, two additional columns are added to show the rates of correct predictions of PageRank super-additivity/subadditivity based on the estimator turns to be true. These are denoted by CFR, which stands for correct forecast rate. Note that the performance of v^{\mathrm {SPR}}(\mathcal {S}) in this regard is omitted because this estimator cannot be used to predict super-additivity.

TABLE 4 Assessment of the Estimations ( |\mathcal {N}|=25 )
Table 4- 
Assessment of the Estimations (
$|\mathcal {N}|=25$
)
TABLE 5 Assessment of the Estimations ( |\mathcal {N}|=50 )
Table 5- 
Assessment of the Estimations (
$|\mathcal {N}|=50$
)
TABLE 6 Assessment of the Estimations ( |\mathcal {N}|=100 )
Table 6- 
Assessment of the Estimations (
$|\mathcal {N}|=100$
)
TABLE 7 Assessment of the Estimations ( |\mathcal {N}|=200 )
Table 7- 
Assessment of the Estimations (
$|\mathcal {N}|=200$
)

Clearly, the estimators work better for the cases they are designed for. It is remarkable that some of them offer much better results than the mere sum of PageRank of the merged nodes, specially regarding the prediction of super-additivity. In particular, v^{\mathrm {CP}}(\mathcal {S}) and v^{\mathrm {DA2}}(\mathcal {S}) are capable of providing very accurate predictions on this matter. Moreover, these estimators become better as the size of the network grows, which make them very suitable for real large-scale networks. Likewise, the estimations provided by v^{\mathrm {CP2}}(\mathcal {S}) are good, although they are not as accurate as those of v^{\mathrm {DA2}}(\mathcal {S}) . Surprisingly, v^{\mathrm {DA}}(\mathcal {S}) offers very poor results, specially in terms of absolute relative error. In fact, the results of this estimator are even worse than those of v^{\mathrm {SPR}}(\mathcal {S}) .

SECTION V.

Conclusions

In this work, we have studied the problem of merging nodes in a network from a PageRank viewpoint. A global perspective analysis has allowed us to define games of interest in this context. Two measures have been given, one of which is directly related to the amount of additional PageRank that can be expected when merging with a node. A method for the computation of the new measures in polynomial time for large networks has been also applied.

The same problem has also been addressed from a local perspective. The lack of full information is a strong limitation that only allows obtaining estimates of the expected PageRank value. Different estimators have been introduced and experiments have been carried out to show the remarkable accuracy of some of these approximations, very particularly in their capability of assessing the potential of the coalition to generate additional PageRank.

Future work should deal with the utilization of models of restricted cooperation to integrate the structure of the directed graph in the solution. Likewise, the application of these values to coalitional control schemes [11], [12], [24] will be also studied.

References

References is not available for this document.