~ A Useless but interesting tool ~
By Winky (slightly edited by fravia+)
Published @ http://www.searchlores.org in late February 2006 |
Version 0.01
An introduction to the power of python (and of proxies), by Winky.
It is supposed to be a discussion, not an essay.
This discussion/essay is related to the python script: winky_proxy.py.
See also Winky's presentation:
See also a tool for a useless but interesting tool, by Moonman
A Useless but interesting tool ;)
By Winky, February 2006
Anyway this tool is dedicated to faf :)
and to Wincky, my overly verbose "useless but
interesting" friend.
And also to all the other guys on ebmb, even megalo :)
If the shoe fits wear it :)
Background
Recently I was reading the "Twisted Book" about the twisted python framework.
Like most books it contains a lot of trivial information. This is not to
knock the twisted framework. Anyway, inside, the book shows you how
to make a simple proxy with twisted. Unfortunately, since twisted is a
single threaded tool, it is very difficult to customize. I had noticed
on the various message boards that people were posting stuff that had been
encrypted. As usual, because I am lazy, I am too lazy to break the
encryption.
Anyway I came up with the idea that it would be neat to make a
tool so that, when you posted to the board, you could surround the text you
wanted encrypted with <winky>secret
text</winky>. But the even cooler part is
that if you previewed the board or viewed the message in your web browser
you could read it.
Planning
Obviously we need a proxy of some sort.
The proxy will scan posted content looking for <winky>secret
text</winky> and encrypt the secret
text, leaving the tags in place.
In reverse, when we view the webpage, it will do the opposite.
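The planned transform can be sketched in a few lines of python. This is a minimal modern sketch, not the actual winky_proxy.py code: the function names encode_tags/decode_tags are mine, and I assume the hex encoding shown in the examples at the end of this page.

```python
import re

# match the secret text between the tags (non-greedy, so multiple
# <winky> blocks in one post are handled separately)
TAG = re.compile(r"<winky>(.*?)</winky>", re.DOTALL)

def encode_tags(text):
    # replace the secret text with its hex representation, tags stay in place
    return TAG.sub(lambda m: "<winky>%s</winky>" % m.group(1).encode().hex(),
                   text)

def decode_tags(text):
    # the reverse: turn the hex back into clear text
    return TAG.sub(lambda m: "<winky>%s</winky>" % bytes.fromhex(m.group(1)).decode(),
                   text)
```

For example, encode_tags("<winky>hi</winky>") gives "<winky>6869</winky>", and decode_tags turns it back.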
Method of Operation
Originally I was going to create the proxy with twisted. As
I explained earlier, because twisted is single threaded it is difficult
to use, so I abandoned the idea. I decided to look on the net for other
proxies made in python. The simplest one I found was by SUZUKI
Hisao. If possible, always avoid writing code
completely by yourself and base it on other people's work. Do not
reinvent the wheel. As you will also see later, my design, even though
based on Mr Suzuki's, differs in a few key points. I will explain why I
chose this design. You will also see that my design has various
limitations, which we will discuss.
Ingredients
required
- python interpreter http://www.python.org/download/
- a good text editor
nice to have
- A good python editor/IDE
http://wiki.python.org/moin/PythonEditors . Personally I use DrPython
(buggy as hell, but I am used to the bugs ;)). If you are just
starting out I would probably use SPE http://wiki.python.org/moin/SPE .
I have never used it myself, but I have heard good things about it.
- A good python shell. Personally I use PyCrust, but maybe
you should try IPython. I think IPython is probably better, but I am old
and crotchety and do not like change.
- A good python debugger. I usually use Winpdb
http://www.digitalpeers.com/pythondebugger/ . If you are old
school you can debug from the command line ;).
Interlude (languages and tools)
One of the most important things for a coder is tools. Your tools are
your editor, hexeditor, debugger etc. It is important that the tools fit
you. For instance I regularly use 3-4 different editors:
DrPython, NotePad++, Boa Constructor. The reason is that each editor
excels slightly in one area, and also I like them :).
What language to use? For many people a computer language is like a
religion. I have used python here. I like python, and I know its
limitations and quirks. Every language has certain areas where it is
better. PHP is good for webpages, but personally I like python better
for the web. Perl and Ruby are basically the same as python, just a
different way of doing the same thing. Scheme and Lisp are functional
programming languages, and until recently I did not appreciate them.
Look at new languages constantly, but on the other hand do not believe
the hype that any of them is a magic bullet.
The great part about python/perl is that they have many 3rd party
modules etc.
The proxy
Here is the proxy winky_proxy.py in python.
What is a webproxy? A webproxy is
basically a webserver and a webclient combined in one. Your webbrowser
connects to the webserver part, then the webserver part becomes a
client to fetch your request, which in turn is passed back to you.
If you do not know how the HTTP protocol works, look on the net:
http://en.wikipedia.org/wiki/HTT
Let's examine Mr SUZUKI's proxy:
http://mail.python.org/pipermail/python-list/2003-June/168957.html
Python has many different types of servers built in.
He has decided to subclass BaseHTTPServer.HTTPServer
to make a threaded server.
Each request will be fed into the class ProxyHandler.
def handle(self): is
overridden to allow you to specify which clients you will allow to
connect to the server.
<I>HTTP defines eight methods indicating the desired
action to be performed on the identified resource.</I>
For the most part GET and POST are the ones that are of interest; the
others are rarely used.
GET and POST are very similar.
To get the server to process GET requests he creates a method called
def do_GET(self):
.......
He then sets do_POST = do_GET so that his proxy handles both of
these methods the same way.
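The do_POST = do_GET trick looks like this in a minimal handler. This is a sketch using http.server (the Python 3 name of BaseHTTPServer); the class body here is mine, not Mr Suzuki's actual code:

```python
from http.server import BaseHTTPRequestHandler  # BaseHTTPServer in python 2.x

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path holds whatever request target the browser sent
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(("you asked for %s\n" % self.path).encode())

    # the very same method object now also answers POST requests
    do_POST = do_GET
```

Since do_POST is just another name bound to the same function, GET and POST go through identical code.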
When do_GET is called,
self.path will equal the requested URL.
He then uses the function urlparse.urlparse(self.path, 'http')
if self.path =
http://www.google.ca/search?hl=en&q=a+b+c+d&btnG=Google+Search&meta=
then
scm = "http"
netloc = "www.google.ca"
path = "/search"
query =
"hl=en&q=a+b+c+d&btnG=Google+Search&meta="
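You can check those values yourself in a python shell. In Python 3 the function lives in urllib.parse instead of the old urlparse module, but the result is the same:

```python
from urllib.parse import urlparse  # in python 2.x: import urlparse

# split the full request URL into its components
parts = urlparse("http://www.google.ca/search?hl=en&q=a+b+c+d&btnG=Google+Search&meta=")
print(parts.scheme)   # http
print(parts.netloc)   # www.google.ca
print(parts.path)     # /search
print(parts.query)    # hl=en&q=a+b+c+d&btnG=Google+Search&meta=
```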
He creates a socket and connects to "www.google.ca".
He then "manually" creates a GET or POST request
GET /search HTTP/1.1
Host: www.google.ca
and all the other headers
and sends it to the socket.
Then he listens on the socket, and when there is data on the socket he
reads it and then feeds it to the client connected to the proxy. The
actual reading and writing to the socket occurs in
def _read_write(self, soc, max_idling=20)
(Also a good example of how to use select so sockets do not block ;) )
The reason he has this relatively complex system is that as data
arrives at the proxy it is immediately read and sent to the client.
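The idea behind _read_write can be sketched roughly like this. This is my own simplified version of a select-based pump, not the actual code from the proxy:

```python
import select

def pump(sock_a, sock_b, idle_limit=20):
    # shuttle bytes between two sockets until one side closes
    # or nothing has arrived for idle_limit seconds
    idle = 0
    while idle < idle_limit:
        # select blocks for at most 1 second, so neither socket ever
        # blocks the other one -- data is forwarded as soon as it arrives
        readable, _, _ = select.select([sock_a, sock_b], [], [], 1.0)
        if not readable:
            idle += 1          # nothing arrived this second
            continue
        idle = 0
        for src in readable:
            dst = sock_b if src is sock_a else sock_a
            data = src.recv(8192)
            if not data:       # peer closed the connection
                return
            dst.sendall(data)
```

The key point is the same as in Mr Suzuki's version: select tells us which socket has data, so each chunk is forwarded immediately instead of waiting for the whole page.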
Examination
If you just need a proxy, the above code will work. Unfortunately we
need to modify the query string before it goes up and, more importantly,
modify the data before it goes back to the client. In the page content
we need to search for the matching <winky>
</winky> tags.
Another thing is that the searchlores mb DQ uses POST instead of GET,
which has to be dealt with slightly differently.
Browsing through the python documentation we notice there are two cool
modules, urllib and urllib2, which fetch webpages. Instead of using a
raw socket, let's fetch the webpage with urllib2 (check the python docs
for how to use it; this is also a great way to make a spider).
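Fetching a page this way really is a one-liner. A sketch, using urllib.request (the Python 3 name of urllib2); the data: URL is only there so the example runs without a network connection, against a real site you would pass the real URL:

```python
import urllib.request  # this module was called urllib2 in python 2.x

# urlopen returns a file-like response object carrying the page body and
# its headers; a data: URL keeps the example self-contained (no network
# needed) -- for a real fetch you would pass e.g. "http://www.python.org/"
page = urllib.request.urlopen("data:,hello")
body = page.read()
print(body)  # b'hello'
```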
Implementation
We will need separate do_GET and do_POST methods.
In both methods the params are fetched and parsed.
In do_POST the query string has to be read from self.rfile
(traditionally stdin).
Then they call the common method def _GET_POST(self, pageUrl, data=None),
which fetches the webpage and then sends it back to the client.
Note that I remove the header:
try :
    del page.headers.dict['transfer-encoding']
except KeyError :
    pass
When the page is fetched, urllib2 will show which encoding is used, i.e.
chunked, gzip etc.
Since the entire page is now on the proxy, these headers no longer apply
and have to be removed.
Also I used the built in methods to send to the client:
self.send_error(400, "http error")
self.send_response(page.code, page.msg)
self.send_header(key, value)
self.end_headers()
self.wfile.write(page_body)
Always -if possible- use builtin functions; these functions have been
used by millions of coders. That means, generally speaking, they are bug
free and will work better than home built versions. Also, since all the
modules are open source, if required you can go into the source and see
what is happening.
The encoding decoding
On the searchlores MBs the query param that contains the
message posted is "content".
Basically in the do_POST method I do this:
if query_dict.has_key("content") :
    query_dict["content"] = [encode_buffer(query_dict["content"][0])]
Also notice that I set page.headers.dict['Content-Length'] = str(len(page_body)).
This is because I am converting hex to ASCII or vice versa, which means that the Content-Length of the page will have changed.
I think that Content-Length does not have to be 100% accurate with all browsers, but I am not sure.
When I encode I only match <winky></winky>
tags.
When I decode I match both
<winky></winky> and the escape coded form
&lt;winky&gt;&lt;/winky&gt;.
The reason is so that the preview post will work, and the text in the text form box will be in clear text.
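The double matching for the preview can be sketched with a single regex. My own sketch, not the code from winky_proxy.py: when the board previews a post, the tags come back HTML-escaped as &lt;winky&gt;, so the decoder has to recognise both the raw and the escaped form:

```python
import re

# group 1: opening tag (raw or HTML-escaped), group 2: the hex payload,
# group 3: closing tag (raw or HTML-escaped)
DECODE_TAG = re.compile(
    r"(<winky>|&lt;winky&gt;)([0-9a-fA-F]+)(</winky>|&lt;/winky&gt;)")

def decode_page(html):
    # put the decoded clear text back between whichever tag form we found
    return DECODE_TAG.sub(
        lambda m: m.group(1) + bytes.fromhex(m.group(2)).decode("latin-1") + m.group(3),
        html)
```

Because the tags themselves are kept as matched, the preview page (escaped tags) and the normal board view (raw tags) both decode correctly.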
How to teach yourself.
I purposely have not commented the code. Most code in the wild is not
commented, or has only very sparse comments. When you read a book, not
every line is commented.
If stuck, run the code through a debugger and see what is happening.
One of the powers of python is its "shell":
any code object can be inspected at runtime.
For example, I want to know what methods are available on the object
returned by urllib2
http://www.python.org/doc/lib/urllib2-examples.html
Open a python shell and do this:
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> dir(f)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'geturl', 'headers', 'info',
'msg', 'next', 'read', 'readline', 'readlines', 'url']
WOW it shows you all the methods and vars ;)
>>> f.headers.dict
{'content-length': '11885', 'accept-ranges': 'bytes',
'server': 'Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e',
'last-modified': 'Sun, 19 Feb 2006 15:12:41 GMT', 'connection': 'close',
'etag': '"601f6-2e6d-3599cc40"', 'date': 'Sat, 25 Feb 2006 10:50:54 GMT', 'content-type': 'text/html'}
Those are the headers contained in the response.
Anyway you should get the idea :)
Conclusion
The winky proxy (name patent pending :P) is an interesting but useless tool :).
One of the limitations of urllib2, the way I am using it, is that the pages will not be sent to the client while the page loads.
Also the proxy does not work on gmail; I suspect it fails on all ajax pages, and maybe you need to implement the CONNECT method.
Also, if you are going to parse html/xhtml, do not use your own homemade
parser but something like a SAX or DOM parser (go look in the python docs).
Another thing is that not all the HTTP methods are handled, and the errors are not handled very elegantly either.
Anyway, not too shabby for 160 lines of code ;)
(c) Winky 2006
Winky's presentation
An example of it in action is
http://fravia.2113.ch/phplab/mbs.php3/mb003?num=1140855607&thread=1140751397
What you see when the proxy is in place:
Re: Re: Re: Re: Re: Re: go (25/02/06 09:20:07)
<winky>lets all go to the ball park</winky>
<winky>I am going to bash megalos head in with hammer ;)</winky>
;)
Otherwise you see:
Re: Re: Re: Re: Re: Re: go (25/02/06 09:20:07)
<winky>6C65747320616C6C20676F20746F207468652062616C6C207061726B</winky>
<winky>4920616D20676F696E6720746F2062617368206D6567616C6F73206865616420696E20776974682068616D6D6572203B29</winky>
;)
I also think that it is pretty cool that it works for the preview function.
Probably this discussion about the proxy requires a lot of editing & fixing up.
But it is supposed to be a discussion, not an essay ;)
Anyway ;)
(c) III Millennium: [fravia+], all rights reserved, reversed, reviled and revealed