You are currently browsing the tag archive for the 'Math' tag.

This article formed the main technical content of a post I made called The Wisdom of a Random Crowd of One. The point of the post was to show how IT data can be used to produce meaningful models, with PageRank as an example. However PageRank turned out to be vastly interesting, and essentially took over the post. Here you will find just the PageRank material withoout the IT data modelling context for easier referencing and indexing.

Incidentally, PageRank gets its name from being the ranking algorithm of Larry Page rather than because it ranks web pages.

The Random Surfer

The hero of the PageRank story is an anonymous and robotic random surfer. He selects a random (arbitrary) starting page on the internet, looks at links on that page, and then selects one to follow at random. On the new page, he again looks over the links and surfs along a random link. He happily continues following links in this fashion. However, every now and again, the random surfer decides to jump to a totally random page where he then follows random links once again. If we could stand back and watch our random surfer, we would see him follow a series of random links, then teleport to another part of the internet, follow another series of links, teleport, follow links, teleport, follow links, and so on, ad infinitum.

Let’s assume that as our random surfer performs this mix of random linking and teleporting, he also takes the time to cast a vote of importance for each page he visits. So if he visits a page 10 times, then the random surfer allocates 10 votes to the page. Surprisingly, the PageRank metric is directly derived from the relative sizes of the page votes cast by this (infinite) random surfer process.

This seems deeply counterintuitive. Why would we expect the surfing habits of a random process to yield a useful guideline to the importance of pages on the Internet? While the surfing habits of people may be time consuming, and sometimes downright wasteful, we probably all think of ourselves as more than random-clicking automatons. However the proof of the pudding is in the searching, and Google has 70% of the search market. So apparently when all of the erratic meanderings of the random surfer are aggregated over a sufficiently long period, they do in fact provide a practical measure of internet page importance. We cannot explain this phenomenon any better than by simply labelling it as the wisdom of a random crowd of one.

Breathing Life into the Random Surfer

The data that can be gathered relatively easily by web crawlers are the set of links on a given page and the set of pages they point to. Let’s assume that there are M pages currently on the internet, where M is several billion or so. We can arrange the link information into M x M matrix P = [ Pij ] where Pij is the probability that page Pi links to page Pj (Pij is just the number of links on Pi to Pj divided by the total number of links on Pi).

The matrix P is called stochastic since the sum of each row is 1, which simply means that any page must link to somewhere (if Pi has no links then it links to itself with probability 1). So P represents the probability of surfing (linking) from Pi to Pj in one click by a random surfer. The nice property of P is that P^2 = P*P gives the probability that Pj can be reached from Pi in two clicks by the random surfer. In general P^N gives the probability that the random surfer ends up on page Pj after N clicks, starting from page Pi.

Can we say anything about P^N as N becomes large? Well if P is ergodic (defined below) then there will exist a probability vector

L = (p1, p2, …, pM)

such that as N becomes large then

P^N = (L, L, …, L)^t

This says that for large N, the rows of P^N are all tending to the common distribution L. So no matter what page Pi the random surfer starts surfing from, his long run page visiting behaviour is described by L. We learn quite a bit about the random surfer from L.

As we said above, the long run probability vector L only exists for matrices that are ergodic. Ergodic matrices are described by 3 properties; they are finite, irreducible, and aperiodic. Our matrix P is large but certainly finite. Two pages Pi and Pj are said to communicate if it is possible to reach Pj by following a series of links beginning at page Pi. The matrix P is irreducible if all pairs of pages communicate. But this is clearly not the case, since some pages have no links (so-called dangling pages). If our random surfer hits such a page then he gets stuck, we don’t get irreducibility and we don’t get L.

To get the random surfer surfing again we make the following adjustment to P. Recall that we have M pages and let R be the M x M matrix for which each entry is 1/M (that is, R models the ultimate random surfer who can jump from any page to any page in one click). Let d be a value less than one and create a new matrix G (the Google matrix) where

G = d*P + (1-d)*R

That is, G is a combination of P (real link data) and R (random link data). Our random surfer than follows links in P with probability d or follows jumps (teleports) to a totally random page with probability (1-d). The value of d will be something like 0.85.

Its should be easy to see that G is irreducible since R enables any two pages to communicate. Without going into details, G is also aperiodic since it is possible for a page to link to itself (which is also possible in P as well). So G is ergodic and we can in theory compute the long run page distribution L of the random surfer.

So now that we know L exists for G, it remains to compute it. We have not as yet considered that the number of pages M is several billion and growing. So a direct representation of G as an M x M matrix would require storage on the order of 10^(18), or in the exabyte range (giga, tera, peta, exa). Luckily most pages are likely to have only a few links (say less than 20) and we can represent G using lists which will bring us back into the gigabyte range.

Computing L from G is a large but tractable computation. L is an eigenvector of G and there is an iterative algorithm for computing L from G called the power method. The power method begins with an approximation for L and improves on each iteration. The rate of convergence to the true value of L is geometrically fast in terms of the parameter d. Therefore we can compute the long run behaviour of our random surfer.

The diagram below (available from Wikipedia) shows the PageRank analysis for a simple collection of pages. The arrows represent the link data and the pages are drawn in size relative to their PageRank importance. If we divided the number on each page by 100 this would be our long run probability vector L.

pagerank_graph