<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fellinghaug Blog &#187; cherrypy</title>
	<atom:link href="http://asbjorn.fellinghaug.com/blog/tag/cherrypy/feed/" rel="self" type="application/rss+xml" />
	<link>http://asbjorn.fellinghaug.com/blog</link>
	<description>&#62;&#62;&#62; from fellinghaug import asbjorn; asbjorn.play()</description>
	<lastBuildDate>Thu, 19 Nov 2009 21:22:01 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Cheating with HTTP headers in Python</title>
		<link>http://asbjorn.fellinghaug.com/blog/2008/04/cheating-with-http-headers-in-python/</link>
		<comments>http://asbjorn.fellinghaug.com/blog/2008/04/cheating-with-http-headers-in-python/#comments</comments>
		<pubDate>Sat, 19 Apr 2008 20:10:30 +0000</pubDate>
		<dc:creator>Asbjørn Alexander Fellinghaug</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[cherrypy]]></category>

		<guid isPermaLink="false">http://asbjorn.fellinghaug.com/wp/?p=13</guid>
		<description><![CDATA[Have you ever needed to extract information from a webpage automatically? Well, I have, multiple times. And, lazy as I am, and given that the process had to run many times each day, I wanted to create a fancy script to do this. Now, I&#8217;m not gonna [...]]]></description>
			<content:encoded><![CDATA[<p>Have you ever needed to extract information from a webpage automatically? Well, I have, multiple times. And, lazy as I am, and given that the process had to run many times each day, I wanted to create a fancy script to do this. Now, I&#8217;m not gonna go into which webpage I needed to extract info from, but let&#8217;s say the site with the wanted information is http://example.org. Now, the people behind this website are clever: they don&#8217;t want automatic scripts to download and use their information. So, if the HTTP headers show that an HTTP request does not come from a web browser such as Firefox, Opera, Safari or IE, they deny the request.</p>
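<p>To make the idea concrete, here is a minimal sketch of the kind of check such a site might run on its side. This is purely an assumption for illustration: the function names <code>looks_like_browser</code> and <code>status_for</code> and the token list are hypothetical, and the real site may do something quite different.</p>

```python
# Hypothetical server-side check. The names and the token list are
# assumptions for illustration, not taken from any real website.
BROWSER_TOKENS = ("Firefox", "Opera", "Safari", "MSIE")


def looks_like_browser(user_agent):
    """Return True if the User-Agent string mentions a known browser."""
    if not user_agent:
        return False
    return any(token in user_agent for token in BROWSER_TOKENS)


def status_for(headers):
    """Return HTTP status 200 for browser-like requests, 403 otherwise."""
    if looks_like_browser(headers.get("User-Agent")):
        return 200
    return 403
```

<p>With a check like this, a script that sends Python&#8217;s default <code>Python-urllib/2.x</code> User-Agent would be rejected, while any request carrying a browser-like string would pass.</p>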
<p>Our nerdy <span class="green"><span class="black">curiosity would tell us not to give up! So, how do we cope with this kind of obstacle? Well, Python is a fantastic programming language with modules for nearly everything. One built-in module is &#8220;urllib2&#8221;, which can &#8220;mimic&#8221; a web browser. Here is some simple Python code for achieving this:</span></span></p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span>
main_url = <span style="color: #483d8b;">&quot;http://example.org&quot;</span>
txheaders = <span style="color: black;">&#123;</span><span style="color: #483d8b;">'User-agent'</span>: <span style="color: #483d8b;">'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3'</span><span style="color: black;">&#125;</span>
&nbsp;
opener = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">build_opener</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">urllib2</span>.<span style="color: black;">HTTPCookieProcessor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #dc143c;">urllib2</span>.<span style="color: black;">install_opener</span><span style="color: black;">&#40;</span>opener<span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># data=None keeps this a GET request; passing '' would turn it into a POST</span>
req = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">Request</span><span style="color: black;">&#40;</span>main_url, <span style="color: #008000;">None</span>, txheaders<span style="color: black;">&#41;</span>
handle = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>req<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">print</span> handle.<span style="color: black;">readlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>And that&#8217;s all there is to it. The web server at example.org inspects the headers, which are valid and tell it that the client is a Firefox web browser running on Linux.</p>
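<p>The urllib2 module exists only in Python 2. For Python 3, the same approach can be sketched with <code>urllib.request</code>. Treat this as a rough sketch, not a definitive implementation: the helper names <code>build_request</code> and <code>fetch</code> are my own inventions for illustration.</p>

```python
import urllib.request

# The same Firefox-on-Linux User-Agent string used in the script above.
FIREFOX_UA = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) "
              "Gecko/20070309 Firefox/2.0.0.3")


def build_request(url, user_agent=FIREFOX_UA):
    """Build a GET request that carries a browser-like User-Agent header."""
    # No data argument is passed, so the request method stays GET.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=FIREFOX_UA):
    """Fetch the URL with cookie handling and return the raw body as bytes."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
    with opener.open(build_request(url, user_agent)) as response:
        return response.read()
```

<p>Calling <code>fetch("http://example.org")</code> performs the actual download. <code>build_request</code> is kept separate so the headers can be inspected without touching the network.</p>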
<p><strong>One important note: </strong>It is illegal to steal information from sites and present it on your own site without an agreement with the source website and a reference to the source. Remember to always &#8220;be nice&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://asbjorn.fellinghaug.com/blog/2008/04/cheating-with-http-headers-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
