• 19Apr
    Categories: python

    Have you ever needed to extract information from a webpage automatically? I have, many times. Lazy as I am, and given that the process had to run many times each day, I wanted a script to do it for me. I’m not going to say which webpage I needed to extract info from, but let’s say the site is http://example.org. Now, the people behind this website are clever: they don’t want automated scripts downloading and using their information. So if the HTTP headers show that a request does not come from a web browser such as Firefox, Opera, Safari or IE, they deny the request.
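
To see why such a check can spot a script, it helps to look at what Python announces when you do nothing special. The sketch below is an aside under stated assumptions: it uses Python 3's `urllib.request` (the successor of `urllib2`), and it only inspects the default headers locally, without contacting any site.

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers added to every request unless the request sets its own values.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default User-agent names the library, not a browser.
default_user_agent = default_headers["User-agent"]
print(default_user_agent)  # a string of the form "Python-urllib/<version>"
```

A server that only lets known browser names through would have an easy time rejecting a request that identifies itself like this.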

    Our nerdy curiosity tells us not to give up! So how do we cope with this kind of obstacle? Well, Python is a fantastic language with modules for almost everything. One module in the Python 2 standard library is “urllib2”, which lets us set our own request headers and thereby “mimic” a web browser. Here is a short Python script that does this:

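    #!/usr/bin/env python
    # Python 2 script: urllib2 was merged into urllib.request in Python 3.
    import urllib2

    main_url = "http://example.org"
    # Present ourselves as Firefox on Linux instead of the default Python-urllib agent.
    txheaders = {'User-agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3'}

    # Install an opener that keeps cookies between requests.
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)
    # data=None makes this a GET request; an empty string would turn it into a POST.
    req = urllib2.Request(main_url, None, txheaders)
    handle = urllib2.urlopen(req)
    print handle.read()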

    And that’s all there is to it. The web server at http://example.org inspects the request headers, finds a valid User-agent value, and concludes that the client is a Firefox web browser running on Linux.
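
For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched with `urllib.request`. This is a rough port and not the original script: the names `build_request`, `fetch`, `MAIN_URL` and `USER_AGENT` are made up for illustration, and the 10-second timeout is an arbitrary choice. The last lines only build the request and inspect it locally, so nothing is sent until you call `fetch` yourself.

```python
import urllib.request
from http.cookiejar import CookieJar

MAIN_URL = "http://example.org"
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) "
              "Gecko/20070309 Firefox/2.0.0.3")


def build_request(url, user_agent):
    """Return a GET request that carries a browser-like User-Agent header."""
    # data=None keeps this a GET; passing bytes would turn it into a POST.
    return urllib.request.Request(url, data=None,
                                  headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Send the request through a cookie-aware opener and return the body as text."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    with opener.open(build_request(url, user_agent), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Inspect the request locally; no network traffic happens here.
req = build_request(MAIN_URL, USER_AGENT)
print(req.get_method())
print(req.get_header("User-agent"))  # Request stores header names in this capitalization
```

Calling `fetch(MAIN_URL, USER_AGENT)` would then perform the actual download.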

    One important note: It is illegal to steal information from sites and present it on your own site without an agreement with the source website and a reference to the source. Remember to always “be nice”.
