Published @ http://www.searchlores.org in early April 2006
Version 0.01
(part of the bots section)
Not sure what I have convinced you of. I am not a missionary who goes around and converts people to "python". Hmm, that could be a lot of fun. Pick up java programmers. Burn them with hot irons: recant, you infidel, give up evil java and convert to python :).

In a nutshell there are 2 ways of processing html, sax and dom. See http://www.boddie.org.uk/python/HTML.html and look at libxml2dom. sax works like runtime: it handles each tag as it is read, so it is usually quicker but a little more complex. 1 pass. dom makes a "tree", so it is slower. Load html -> tree, parse the tree etc. 2-3 passes.

svd (I think it was him) made the comment that all languages are the same. He is correct but he is wrong (paradox). Some languages are more suited for certain tasks. What he was trying to say is that this "religion" crap with languages is nonsense.

For scheme you would use http://www.neilvandyke.org/htmlprag/, it is basically a dom parser, but it creates an "S-Expression" tree.

To use urllib2 with google there is a slight wrinkle. Google checks the browser type. If it is not a proper browser, it sends back an error. No problem.

    import urllib2

    opener = urllib2.build_opener()
    # Send a browser-like User-agent so google does not answer with an error
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    f = opener.open("http://www.google.ca/search?q=fravia&btnG=Google+Search&meta=")
    page = f.read()
    f.close()

Tell google you are using Mozilla ;)

> Now you are bound to write a 'cleaner' for ask (ex-teoma), msn search and
> google :-)

That is an exercise left to the reader ;). To be honest, right now I have no use for stripping search engines. But on the other hand, when I was starting out, there were many things which I wanted to do but could not find info on. That stuff on your old site about teach a dude to fish .... :)

Using the same idea, let's say you had a list of VIN numbers for vehicles you wanted to check, 5000-10000 of them. Find a website that does it, and basically make a small script to extract the info you want.

A few tips I have learned. USE THE CSS :).
Examining google css we see that <a class=l href="http://acrigs.com/FRAVIA/FRAVIA_index.htm"> .... search for tag <a> with class="l", pull everything out till the next anchor. I am not sure if the above will work, but that is the idea ;). I just glanced at the html. Google is suspect: it uses javascript to rewrite the anchors (class=l) to stuff in their redirection crap.

A good editor that shows the html as a tree can be useful. I use quanta on linux, but whatever makes your tractor go will be ok. I have a whole sack of editors I use for different purposes, or how I feel that day etc ;).

I am not an expert on the subject of html stripping. But if my tips can be useful for someone, that is good ;). I have to balance my karma. You can publish the code I sent you yesterday, and these tips. Just do not use my real name. Call it maybe "Winky strips for yahoo" ;).

Lots of good languages: python, ruby, lua, perl, tcl to mention a few. They all do a similar thing, it boils down to taste. If you see that stallman fellow, punch him in the head for GNU's Autoconf. I admit it is powerful, but mere human beings cannot understand it ;). 1 week trying to make a cross compiler ;). Probably svd is smarter than me and can understand it better :).

Anyway I am happy I was helpful. Your site has given me a lot of pleasure over the years.

Winky ;)

On 4/1/06, fravia wrote:
> Nice proof of concept, I'm checking it right now.
> I'll seriously consider python for my own fun during the next sabbatical
> months, you have convinced me :-)
>
> Btw: the 'stripper' in the Subject threw your mail straight in the dev/null
> lot, but the yahoo yahoo search and queries words allowed the saving angel
> to make a copy of it.
>
> Now you are bound to write a 'cleaner' for ask (ex-teoma), msn search and
> google :-)
>
>
> winky writes:
>
> > Too hot to work yesterday and I noticed on the board they were
> > talking about "stripping" and searching html.
> >
> > Anyway decided to make one for yahoo, it is primitive, a learning tool,
> > but could be expanded to handle next queries etc.
> >
> > yahoo_stripper.py -> is the actual script itself.
> > clean.html -> is the "output from the script"
> >
> > To create the "docs" which reside in the html directory just run the
> > script thru epydoc.
> >
> > This sort of tactic of stripping webpages is much more effective than
> > just blindly using regular expressions.
> >
> > If needed you can look for the tag of interest and then use reg-expr.
> >
> > Sorry, archive is in tar.gz but I am on a linux box and too lazy to reboot ;)
> >
> > Possible uses: a spider ;). Whatever makes your tractor go :).
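
The "USE THE CSS" tip from the reply (look for <a> tags with class "l" and pull out what sits inside them) might be sketched roughly as follows. This is only an illustration, not the yahoo_stripper.py script, which is not reproduced here. It assumes Python 3 and its standard html.parser module, which is event driven in the sax style, rather than the urllib2-era tools mentioned in the mail. The names ClassLinkExtractor and extract_links are hypothetical, and the sample markup is invented apart from the one anchor quoted in the text; real result pages will look different and, as noted, may have their anchors rewritten by javascript.

```python
from html.parser import HTMLParser


class ClassLinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the matching anchor currently being read
        self._text = []     # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # class may hold several space-separated names, or be missing entirely
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching anchor
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, css_class="l"):
    """Return the (href, text) pairs of anchors with the given class (hypothetical helper)."""
    parser = ClassLinkExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample page; only the first anchor comes from the text above.
SAMPLE = '''
<p><a class=l href="http://acrigs.com/FRAVIA/FRAVIA_index.htm">Fravia <b>index</b></a>
<a href="/cache?x=1">Cached</a>
<a class="l" href="http://www.searchlores.org">searchlores</a></p>
'''

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, "|", text)
```

The parser makes a single pass over the page and never builds a tree, which is the sax trade-off described in the reply: less memory and usually less time, at the cost of tracking by hand which tag you are currently inside.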