Sunday, August 15, 2010

My Attempts at Demystifying The PageRank Algorithm

One habit of mine is reading multiple books at any one time. Not holding the books at exactly the same time, but rotating the books after finishing one chapter or a section of each. The habit is borne out of another habit of cross-referencing, comparing the different perspectives offered by different authors, which oftentimes supplement each other. The two books I am reading at the same time as Michael Miller's The Complete Idiot's Guide to SEO are Donna L. Baker's How to Do Everything with Google Tools (McGraw Hill 2008) and Google Hacks by Tara Calishain and Rael Dornfest (O'Reilly, 2nd Ed., 2005).

In Chapter 8 (Webmastering) of the latter book, I bumped into the PageRank algorithm of Google reproduced below in its equation form (Hack #87 by Mark Horrell):

PR(A) = (1 – d) + d[PR(T1)/C(T1) + … + PR(Tn)/C(Tn)]

where PR(A) is the PageRank (hereinafter shortened to PR) of a page A, PR(T1) is the PR of a page T1 (which links to page A and is thus considered an incoming link of page A), n is the total number of incoming links, C(T1) is the number of outgoing (outbound) links from page T1, and so on for C(Tn), and d is a damping factor in the range 0 < d < 1.

Wikipedia gives the mathematical background to this now-famous equation, purportedly named after Larry Page, one of the two co-founders of Google together with Sergey Brin, including a variant of the above equation in which each PR is divided by the total number of pages, N, touted to be the one actually meant by Page and Brin based on the statement in their paper that "the sum of all PageRanks is one".

Wikipedia explains further:

“PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. … The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.”
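The iterative computation Wikipedia describes can be sketched in a few lines of Python using the non-normalized form of the equation above. The three-page link graph here is a made-up example for illustration:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # start every page at PR = 1
    for _ in range(iterations):           # repeated passes ("iterations")
        new_pr = {}
        for page in pages:
            # sum PR(T)/C(T) over every page T that links to `page`
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page site: A and B link to each other, B also links to C
ranks = pagerank({"A": ["B"], "B": ["A", "C"], "C": []})
```

After enough passes the values settle; B, which receives A's undivided vote, ends up with the highest PR of the three.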

It is easy to infer from the structure of the equation that the higher the number of incoming links (n), the higher the value of PR(A), i.e., its PR, since each incoming PR, PR(T1) to PR(Tn), always has a positive value. This is the quantity aspect, the more the merrier, which perhaps helped spawn the proliferation of link farms.

Similarly, the higher the PR of the individual incoming links, PR(T1) to PR(Tn), the higher will be the numeric value of PR(A). This then constitutes the quality aspect. There is another subtle twist to it, based on the fact that new webpages invariably start with a low PR simply because it takes time for them to be crawled, indexed, and linked. Thus, there is a preference for webpage A to be linked by older webpages rather than new ones.

However, do not forget the denominator term, C(T), whose presence alters the fractional value in the opposite sense to that of PR(T): the larger the number of outgoing links of each incoming link T, the smaller its contribution to the overall value of PR(A). In other words, you would like each of your incoming links to have fewer outgoing links of their own.
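The dilution effect of C(T) can be seen with a pair of assumed numbers (the PR value and link counts below are hypothetical, chosen only to make the arithmetic plain):

```python
d = 0.85      # the usual damping factor
pr_t = 4.0    # hypothetical PR of a linking page T

# The same incoming PR, divided among different numbers of outbound links:
contribution_focused = d * pr_t / 2    # T has only 2 outbound links
contribution_diluted = d * pr_t / 20   # T has 20 outbound links
```

The focused link passes ten times the weight of the diluted one, even though the linking page's own PR is identical in both cases.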

Now, is that not a paradox of sorts that tends to foster selfish thinking? Certainly it would not appear to be a win-win situation for all. But, wait. There is another facet to it, as amply expounded by Wikipedia:

“Google assigns a numeric weighting from 0-10 (but 0 is used just for penalized or non-analyzed pages) for each webpage on the Internet; this PageRank denotes a site’s importance in the eyes of Google. The PageRank is derived from a theoretical probability value on a logarithmic scale like the Richter Scale.”

The practical implication of using a log (short for logarithmic) scale is that the difference between successive ranking numerals is not an algebraic but a geometric progression, or more specifically in this case, an order-of-magnitude difference, i.e., 10 times. Imagine thus:

A PR of 1 may have the actual values from 0.1 to 9.9;
A PR of 2 may have the actual values from 10 to 99.9;
A PR of 3 may have the actual values from 100 to 999.9; and so on.
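One possible mapping that captures this order-of-magnitude idea (an assumption for illustration only, not Google's actual formula) is to take the displayed rank as one plus the power of ten of an underlying score:

```python
import math

def toolbar_pr(score):
    # Hypothetical mapping: scores an order of magnitude apart
    # land in successive toolbar ranks.
    return math.floor(math.log10(score)) + 1

# A score of 5 maps to rank 1, 50 to rank 2, 500 to rank 3:
# each step up requires roughly ten times the underlying score.
```

The exact thresholds above are invented; the point is only that each additional rank demands ten times the underlying value, not a fixed increment.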

Thus an increase in PR by 1 can only come from tremendous effort, and certainly not in a linear trend: climbing from 1 to 2 is one thing, but the higher the PR, the much greater the effort needed to elevate to the next ranking. In the grand scheme of things, then, having incoming links from webpages with a higher or lower number of outgoing links matters less than having a larger number of links, and it is out of one's control anyway. Now I can really appreciate the chasm that separates this blog site of mine from Wikipedia's, as I seem to remember remarking rather casually when pitting a PageRank of 1-2 against a 9.

If PR were the only criterion used by Google, then getting a high rank would be no different from winning a popularity contest. In the words of Wikipedia, “a PageRank results from a "ballot" among all the other pages on the World Wide Web about how important a page is. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page there is no support for that page.”

Reality is of course a different story. There is another part to the rank determination conducted by Google, a much greater part I hope, as reflected in the following:

“Hypertext-Matching Analysis: Our search engine also analyzes page content. However, instead of simply scanning for page-based text (which can be manipulated by site publishers through meta-tags), our technology analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user's query.”

Now I do not have to retract my previous blog on content being king, and all the content writers can sleep better.

Upon further Google (what else?) search, I came upon an excellent explanation of the PageRank algorithm by Phil Craven that alerted me to two more features:

- That a website's internal links are included. Therefore, the more pages a website holds, the higher its PR, as they contribute the same as an incoming link would. The only caveat is that the pages have to be indexed by Google, that is, on their own merit, and no cookie-cutter stuff.

- “That when a page votes its PageRank value to other pages, its own PageRank is not reduced by the value that it is voting. The page doing the voting doesn't give away its PageRank and end up with nothing. It isn't a transfer of PageRank.” That is, do not be afraid to link and plunge into the world of inter-connectedness, even though these links may be virtual, and the spirit of reciprocity (an outbound link begets an inbound one) lives on.

The last piece of the PageRank puzzle is the little d. What role does it play? Wikipedia puts it thus:

“The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d.”

And in Page and Brin's own words:

“The d damping factor is the probability at each page the "random surfer" will get bored and request another random page.”

This fits in with our perception of human nature, for we are not machines and are not capable of executing an infinite number of clicks. Other than physical exhaustion, boredom will set in much quicker, and the rate of quickening seems to be a function of age, as we can all attest. Mathematically, we know that its inclusion depreciates the overall value of PR. If d were zero, all webpages would have the same PR of 1, a non-starter in the first place. Another thing to note is that the PR can never be zero: if a complete introvert writes a single webpage with no links whatsoever, it would still garner a PR of (1 – d), or 0.15 in practice.
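That floor falls straight out of the equation: with no incoming links, the summation term is empty, and only the (1 – d) constant remains. With the commonly cited d = 0.85:

```python
d = 0.85  # the commonly cited damping factor

# A page with no incoming links: the bracketed sum in the
# PageRank equation is empty, leaving just the constant term.
pr_isolated = (1 - d) + d * 0
```

So even the complete introvert's lone, linkless page keeps its 0.15, and no page can sink to zero.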

Is all the above, which I spent the better half of the morning transiting into the afternoon understanding and collating, worth the trouble? Intellectually, a resounding YES. It has been a long while since I have worked my brain cells so hard within such a short span of time. It is invigorating, to say the least. My analytical mind still works.

Practically, though, I am sobered by the recent clarification from Google, appearing almost as an epilogue in Wikipedia:

“On October 15, 2009, Google employee Susan Moskwa confirmed that the company had removed PageRank from its Webmaster Tools section. Her post said in part, "We’ve been telling people for a long time that they shouldn’t focus on PageRank so much; many site owners seem to think it's the most important metric for them to track, which is simply not true."”

Susan Moskwa's actual post was more explicit, with further elaboration punctuated by a smiley at the end (tongue in cheek?):

"We removed it because we felt it was silly to tell people not to think about it, but then to show them the data, implying that they should look at it. :-)"

Then why retain it in the Google Toolbar? Perhaps the whole situation is better explained in a separate post that Susan Moskwa pointed to, which I will get to later.

Ah well, life moves on. And so will we.
