~ A Useless but interesting tool ~
By Winky (slightly edited by fravia+)
Published @ http://www.searchlores.org in late February 2006 |
Version 0.01
An introduction to the power of python (and of proxies), by Winky.
It is supposed to be a discussion, not an essay.
This discussion/essay is related to the python script: winky_proxy.py.
See also Winky's presentation:
See also a tool for a useless but interesting tool, by Moonman
A Useless but interesting tool ;)
By Winky, February 2006
Anyway this tool is dedicated to faf :)
and to Wincky, my overly verbose "useless but
interesting" friend.
And also to all the other guys on ebmb, even megalo :)
If the shoe fits wear it :)
Background
Recently I was reading the "Twisted Book" about the twisted python framework.
Like most books it contains a lot of trivial information. This is not to
knock the twisted framework. Anyway, inside, the book shows you how
to make a simple proxy with twisted. Unfortunately, since twisted is a
single threaded tool, it is very difficult to customize. I had noticed
on the various message boards that people were posting stuff that had been
encrypted. As usual, because I am lazy, I am too lazy to break the
encryption.
Anyway I came up with the idea that it would be neat to make a
tool so that, when you posted to the board, you could surround the text you
wanted encrypted with <winky>secret
text</winky>. But the even cooler part is
that if you previewed the board or viewed the message in your web browser
you could read it.
Planning
Obviously we need a proxy of some sort.
The proxy will scan posted content looking for <winky>secret
text</winky> and encrypt the secret
text, leaving the tags in place.
In reverse, when we view the webpage, it will do the opposite.
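The planned transform can be sketched in a few lines of python. This is a minimal modern sketch, not the actual winky_proxy.py code: the function names encode_tags/decode_tags are mine, and I assume the hex encoding shown in the examples at the end of this page.

```python
import re

# match the secret text between the tags (non-greedy, so multiple
# <winky> blocks in one post are handled separately)
TAG = re.compile(r"<winky>(.*?)</winky>", re.DOTALL)

def encode_tags(text):
    # replace the secret text with its hex representation, tags stay in place
    return TAG.sub(lambda m: "<winky>%s</winky>" % m.group(1).encode().hex(),
                   text)

def decode_tags(text):
    # the reverse: turn the hex back into clear text
    return TAG.sub(lambda m: "<winky>%s</winky>" % bytes.fromhex(m.group(1)).decode(),
                   text)
```

For example, encode_tags("<winky>hi</winky>") gives "<winky>6869</winky>", and decode_tags turns it back.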
Method of Operation
Originally I was going to create the proxy with twisted. As
I explained earlier, because twisted is single threaded it is difficult
to use, so I abandoned the idea. I decided to look on the net for other
proxies made in python. The simplest one I found was by SUZUKI
Hisao. If possible, always avoid writing code
completely by yourself and base it on other people's work. Do not
reinvent the wheel. As you will also see later, my design, even though
based on Mr Suzuki's, differs in a few key points. I will explain why I
chose this design. You will also see that my design has various
limitations, which we will discuss.
Ingredients
required
- python interpreter http://www.python.org/download/
- a good text editor
nice to have
- A good python editor/IDE
http://wiki.python.org/moin/PythonEditors . Personally I use DrPython
(buggy as hell, but I am used to the bugs ;)). If you are just
starting out I would probably use SPE http://wiki.python.org/moin/SPE .
I have never used it myself, but I have heard good things about it.
- A good python shell. Personally I use PyCrust, but maybe
you should try IPython. I think IPython is probably better, but I am old
and crotchety and do not like change.
- A good python debugger. I usually use Winpdb
http://www.digitalpeers.com/pythondebugger/ . If you are old
school you can debug from the command line ;).
Interlude (languages and tools)
One of the most important things for a coder is tools. Your tools are
your editor, hexeditor, debugger etc. It is important that the tools fit
you. For instance I regularly use 3-4 different editors:
DrPython, NotePad++, Boa Constructor. The reason is that each editor
excels slightly in one area, and also I like them :).
What language to use? For many people a computer language is like a
religion. I have used python here. I like python, and I know its
limitations and quirks. Every language has certain areas where it is
better. PHP is good for webpages, but personally I like python better
for the web. Perl and Ruby are basically the same as python, just a
different way of doing the same thing. Scheme and Lisp are functional
programming languages, and until recently I did not appreciate them.
Look at new languages constantly, but on the other hand do not believe
the hype that any of them is a magic bullet.
The great part about python/perl is that they have many 3rd party
modules etc.
The proxy
Here is the proxy winky_proxy.py in python.
What is a webproxy? A webproxy is
basically a webserver and a webclient combined in one. Your webbrowser
connects to the webserver part, then the webserver part becomes a
client to fetch your request, which in turn is passed back to you.
If you do not know how the HTTP protocol works, look on the net:
http://en.wikipedia.org/wiki/HTT
Let's examine Mr SUZUKI's proxy:
http://mail.python.org/pipermail/python-list/2003-June/168957.html
Python has many different types of servers built in.
He has decided to subclass BaseHTTPServer.HTTPServer
to make a threaded server.
Each request will be fed into the class ProxyHandler.
def handle(self): is
overridden to allow you to specify which clients you will allow to
connect to the server.
<I>HTTP defines eight methods indicating the desired
action to be performed on the identified resource.</I>
For the most part GET and POST are the ones that are of interest; the
others are rarely used.
GET and POST are very similar.
To get the server to process GET requests he creates a method called
def do_GET(self):
.......
He then sets do_POST = do_GET so that his proxy handles both of
these methods the same way.
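The do_POST = do_GET trick looks like this in a minimal handler. This is a sketch using http.server (the Python 3 name of BaseHTTPServer); the class body here is mine, not Mr Suzuki's actual code:

```python
from http.server import BaseHTTPRequestHandler  # BaseHTTPServer in python 2.x

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path holds whatever request target the browser sent
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(("you asked for %s\n" % self.path).encode())

    # the very same method object now also answers POST requests
    do_POST = do_GET
```

Since do_POST is just another name bound to the same function, GET and POST go through identical code.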
When do_GET is called,
self.path will equal the requested URL.
He then uses the function urlparse.urlparse(self.path, 'http')
if self.path =
http://www.google.ca/search?hl=en&q=a+b+c+d&btnG=Google+Search&meta=
then
scm = "http"
netloc = "www.google.ca"
path = "/search"
query =
"hl=en&q=a+b+c+d&btnG=Google+Search&meta="
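You can check those values yourself in a python shell. In Python 3 the function lives in urllib.parse instead of the old urlparse module, but the result is the same:

```python
from urllib.parse import urlparse  # in python 2.x: import urlparse

# split the full request URL into its components
parts = urlparse("http://www.google.ca/search?hl=en&q=a+b+c+d&btnG=Google+Search&meta=")
print(parts.scheme)   # http
print(parts.netloc)   # www.google.ca
print(parts.path)     # /search
print(parts.query)    # hl=en&q=a+b+c+d&btnG=Google+Search&meta=
```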
He creates a socket and connects to "www.google.ca".
He then "manually" creates a GET or POST request
GET /search HTTP/1.1
Host: www.google.ca
and all the other headers
and sends it to the socket.
Then he listens on the socket, and when there is data on the socket he
reads it and then feeds it to the client connected to the proxy. The
actual reading and writing to the socket occurs in
def _read_write(self, soc, max_idling=20)
(Also a good example of how to use select so sockets do not block ;) )
The reason he has this relatively complex system is that as data
arrives at the proxy it is immediately read and sent to the client.
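The idea behind _read_write can be sketched roughly like this. This is my own simplified version of a select-based pump, not the actual code from the proxy:

```python
import select

def pump(sock_a, sock_b, idle_limit=20):
    # shuttle bytes between two sockets until one side closes
    # or nothing has arrived for idle_limit seconds
    idle = 0
    while idle < idle_limit:
        # select blocks for at most 1 second, so neither socket ever
        # blocks the other one -- data is forwarded as soon as it arrives
        readable, _, _ = select.select([sock_a, sock_b], [], [], 1.0)
        if not readable:
            idle += 1          # nothing arrived this second
            continue
        idle = 0
        for src in readable:
            dst = sock_b if src is sock_a else sock_a
            data = src.recv(8192)
            if not data:       # peer closed the connection
                return
            dst.sendall(data)
```

The key point is the same as in Mr Suzuki's version: select tells us which socket has data, so each chunk is forwarded immediately instead of waiting for the whole page.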
Examination
If you just need a proxy, the above code will work. Unfortunately we
need to modify the query string before it goes up and, more importantly,
modify the data before it goes back to the client. In the page content
we need to search for the matching <winky>
</winky> tags.
Another thing is that the searchlores mb DQ uses POST instead of GET,
which has to be dealt with slightly differently.
Browsing through the python documentation we notice there are two cool
modules, urllib and urllib2, which fetch webpages. Instead of using a
raw socket, let's fetch the webpage with urllib2 (check the python docs
for how to use it; this is also a great way to make a spider).
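Fetching a page this way really is a one-liner. A sketch, using urllib.request (the Python 3 name of urllib2); the data: URL is only there so the example runs without a network connection, against a real site you would pass the real URL:

```python
import urllib.request  # this module was called urllib2 in python 2.x

# urlopen returns a file-like response object carrying the page body and
# its headers; a data: URL keeps the example self-contained (no network
# needed) -- for a real fetch you would pass e.g. "http://www.python.org/"
page = urllib.request.urlopen("data:,hello")
body = page.read()
print(body)  # b'hello'
```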
Implementation
We will need separate do_GET and do_POST methods.
In both methods the params are fetched and parsed.
In do_POST the query string has to be read from self.rfile
(traditionally stdin).
Then they call the common method def _GET_POST(self, pageUrl, data=None),
which fetches the webpage and then sends it back to the client.
Note that I remove the header:
try :
    del page.headers.dict['transfer-encoding']
except KeyError :
    pass
When the page is fetched, urllib2 will show which encoding is used, i.e.
chunked, gzip etc.
Since the entire page is now on the proxy, these headers no longer apply
and have to be removed.
Also I used the built in methods to send to the client:
self.send_error(400, "http error")
self.send_response(page.code, page.msg)
self.send_header(key, value)
self.end_headers()
self.wfile.write(page_body)
Always -if possible- use builtin functions; these functions have been
used by millions of coders. That means, generally speaking, they are bug
free and will work better than home built versions. Also, since all the
modules are open source, if required you can go into the source and see
what is happening.
The encoding decoding
On the searchlores MBs the query param that contains the
message posted is "content".
Basically in the do_POST method I do this:
if query_dict.has_key("content") :
    query_dict["content"] = [encode_buffer(query_dict["content"][0])]
Also notice that I set page.headers.dict['Content-Length'] = str(len(page_body)).
This is because I am converting hex to ASCII or vice versa, which means that the Content-Length of the page will have changed.
I think that Content-Length does not have to be 100% accurate with all browsers, but I am not sure.
When I encode I only match <winky></winky>
tags.
When I decode I match both
<winky></winky> and the escape coded form
&lt;winky&gt;&lt;/winky&gt;.
The reason is so that the preview post will work, and the text in the text form box will be in clear text.
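The double matching for the preview can be sketched with a single regex. My own sketch, not the code from winky_proxy.py: when the board previews a post, the tags come back HTML-escaped as &lt;winky&gt;, so the decoder has to recognise both the raw and the escaped form:

```python
import re

# group 1: opening tag (raw or HTML-escaped), group 2: the hex payload,
# group 3: closing tag (raw or HTML-escaped)
DECODE_TAG = re.compile(
    r"(<winky>|&lt;winky&gt;)([0-9a-fA-F]+)(</winky>|&lt;/winky&gt;)")

def decode_page(html):
    # put the decoded clear text back between whichever tag form we found
    return DECODE_TAG.sub(
        lambda m: m.group(1) + bytes.fromhex(m.group(2)).decode("latin-1") + m.group(3),
        html)
```

Because the tags themselves are kept as matched, the preview page (escaped tags) and the normal board view (raw tags) both decode correctly.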
How to teach yourself.
I purposely have not commented the code. Most code in the wild is not
commented, or has only very sparse comments. When you read a book, not
every line is commented.
If stuck, run the code through a debugger and see what is happening.
One of the powers of python is its "shell":
any code object can be inspected at runtime.
For example, I want to know what methods are available on the object
returned by urllib2
http://www.python.org/doc/lib/urllib2-examples.html
Open a python shell and do this:
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> dir(f)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'geturl', 'headers', 'info',
'msg', 'next', 'read', 'readline', 'readlines', 'url']
WOW it shows you all the methods and vars ;)
>>> f.headers.dict
{'content-length': '11885', 'accept-ranges': 'bytes',
'server': 'Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e',
'last-modified': 'Sun, 19 Feb 2006 15:12:41 GMT', 'connection': 'close',
'etag': '"601f6-2e6d-3599cc40"', 'date': 'Sat, 25 Feb 2006 10:50:54 GMT', 'content-type': 'text/html'}
Those are the headers contained in the response.
Anyway you should get the idea :)
Conclusion
The winky proxy (name patent pending :P) is an interesting but useless tool :).
One of the limitations of urllib2, the way I am using it, is that the pages will not be sent to the client while the page loads.
Also the proxy does not work on gmail; I suspect it fails on all ajax pages, and maybe you need to implement the CONNECT method.
Also, if you are going to parse html/xhtml, do not use your own homemade
parser but something like a SAX or DOM parser (go look in the python docs).
Another thing is that not all the HTTP methods are handled, and the errors are not handled very elegantly either.
Anyway, not too shabby for 160 lines of code ;)
(c) Winky 2006
Winky's presentation
An example of it in action is
http://fravia.2113.ch/phplab/mbs.php3/mb003?num=1140855607&thread=1140751397
What you see when the proxy is in place:
Re: Re: Re: Re: Re: Re: go (25/02/06 09:20:07)
<winky>lets all go to the ball park</winky>
<winky>I am going to bash megalos head in with hammer ;)</winky>
;)
Otherwise you see:
Re: Re: Re: Re: Re: Re: go (25/02/06 09:20:07)
<winky>6C65747320616C6C20676F20746F207468652062616C6C207061726B</winky>
<winky>4920616D20676F696E6720746F2062617368206D6567616C6F73206865616420696E20776974682068616D6D6572203B29</winky>
;)
I also think that it is pretty cool that it works for the preview function.
Probably this discussion about the proxy requires a lot of editing & fixing up.
But it is supposed to be a discussion, not an essay ;)
Anyway ;)
(c) III Millennium: [fravia+], all rights reserved, reversed, reviled and revealed