Monday, March 11, 2013

Distance from Philosophy

I was playing around with the HTML parsers available for Python. I wrote various little 'toy scripts' to scrape content from websites. While doing this absolutely pointless thing just for fun, I remembered something I had read in the mouseover text of an XKCD webcomic: if you start from any Wikipedia page, click on the first link in the article text that is not in parentheses or italics, and keep doing this, you will eventually end up on Wikipedia's page for 'Philosophy'.

I didn't believe it at first, of course. It seemed absurd. So I decided to check whether the claim was true and started picking random topics totally unrelated to philosophy, like, say, cats or apples, and began to follow the links from those pages. I was amazed when, in every single case, I ended up on the page for philosophy.

As I was writing my toy scripts and playing around, I had an idea: measure "how far from philosophy everything was". It sounds crazy when you hear it. It's even weirder to type. (Having such thoughts is probably the weirdest of all.)

So I wrote a script that accepts a word entered by the user, goes to the corresponding Wikipedia page, follows the first article link in the text of each page, and counts how many links it has to follow before it hits the page for philosophy. I thought I'd share it with you, in case you want to take a break, kill time and amuse yourself:

from html.parser import HTMLParser
import httplib2

links = []  # /wiki/ links collected from the page currently being parsed


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # keep article links only (assumes href is the first attribute);
        # skip files and Wikipedia meta pages
        if tag == "a" and attrs and attrs[0][1] and attrs[0][1].startswith("/wiki/") and (not attrs[0][1].startswith("/wiki/File")) and (attrs[0][1].find("Wikipedia") == -1):
            links.append(attrs[0][1])


def countlinks(keyword):
    '''Count the number of pages visited when you begin with
    'keyword' on Wikipedia until you reach the page for 'Philosophy'.
    '''
    global links
    h = httplib2.Http()
    site = ''  # base URL of Wikipedia article pages; the keyword is appended to it
    url = site + keyword
    count = 0
    keywords_seen = [keyword]
    while keyword != "Philosophy":
        response, content = h.request(url)
        content = content.decode("utf-8", "ignore")
        start = content.find("<p>")  # wikipedia explanations begin at a <p> tag
        links = []
        parser = MyHTMLParser()
        parser.feed(content[start:])
        keyword = links[0].replace("/wiki/", '')
        initial = 1
        while keyword in keywords_seen:  # skip pages already visited, to avoid going in circles
            keyword = links[initial].replace("/wiki/", '')
            initial += 1
        keywords_seen.append(keyword)
        url = site + keyword
        print("visiting page about: ", keyword)
        count += 1

    print("\n found Philosophy after ", count, " links\n")


countlinks(input())  # read the starting word from the user
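
The link filter in the script reads attrs[0][1], so it only works when href happens to be the first attribute of the <a> tag. A slightly more defensive variant, as a rough sketch: look href up by name and keep the links on the parser object instead of in a global. The class name ArticleLinkParser, the helper article_links and the sample HTML are made up for illustration and are not part of the script.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect hrefs that look like links to Wikipedia articles."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""  # href may be missing or None
        # Keep article links; skip files and Wikipedia meta pages.
        if (href.startswith("/wiki/")
                and not href.startswith("/wiki/File")
                and "Wikipedia" not in href):
            self.links.append(href)


def article_links(html):
    """Return the article links found in an HTML string, in order."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


# Made-up HTML fragment, for illustration only.
sample = (
    '<p>A <a class="image" href="/wiki/File:Dog.jpg">picture</a> of a '
    '<a title="Mammal" href="/wiki/Mammal">mammal</a>, see '
    '<a href="/wiki/Wikipedia:About">about</a> and '
    '<a href="http://example.com/">elsewhere</a>.</p>'
)
print(article_links(sample))
```

On the made-up sample this prints ['/wiki/Mammal']: the file link, the Wikipedia: meta page and the external link are all skipped, and the Mammal link is found even though title comes before href.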


Save this code as "" and run using:
python <enter>
<enter the word you want to start with, e.g. dog, cat, Rolling Stones, etc.> <hit enter>
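
The counting logic can also be tried without fetching anything. A minimal sketch, assuming a hand-written table that lists each page's links in order; the function name distance_to and the table toy_links are made up for illustration, and the links in the table are invented rather than taken from Wikipedia:

```python
def distance_to(target, start, first_links):
    """Count the links followed from start until target is reached.

    first_links maps a page title to the links on that page, in order.
    Returns None if the walk runs out of unvisited links.
    """
    seen = {start}
    page = start
    count = 0
    while page != target:
        # Follow the first link that has not been visited yet.
        unvisited = [p for p in first_links.get(page, []) if p not in seen]
        if not unvisited:
            return None
        page = unvisited[0]
        seen.add(page)
        count += 1
    return count


# Invented link table; these are not the real links on Wikipedia.
toy_links = {
    "Dog": ["Animal", "Pet"],
    "Animal": ["Dog", "Biology"],
    "Biology": ["Science"],
    "Science": ["Knowledge"],
    "Knowledge": ["Philosophy"],
}
print(distance_to("Philosophy", "Dog", toy_links))
```

With this invented table the walk goes Dog, Animal, Biology, Science, Knowledge, Philosophy and prints 5. From Animal the first link points back to Dog, which has already been visited, so the next link is followed instead.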

P.S. The distance from 'dog' to philosophy is 23 links.