I like Perl and I've been learning it for a while. It's a good language to learn - fairly straightforward, quick, very powerful and ideal for bots, CGI and the net generally! I hope that +fravia will publish this as part of the botstart section and that the bot section will start botting - it's my very first bot. Most of the source code is included - it's yours for a little work.
You will need:
Perl (standard on Linux and freely available) and various Perl modules (small, free downloads),
net access,
a text editor,
Linux (not absolutely necessary, but it's far superior, free and a real operating system).
What can I say about Perl? It's a good language to learn. Virtually all CGI is done in Perl but it's good for virtually anything that you'd care to do and it's possible to develop applications very quickly. I'm not yet that experienced at Perl - this is my first 'real' app and I'm certain that this bot is not written at all well, but it is written. Perhaps that's the best thing about Perl - it enables you to do things that would not otherwise be possible. The CPAN Perl code repository on the net holds vast quantities of free code to do almost anything you could ever wish - but you have to be able to use Perl. You will need to download at least the LWP (it stands for libwww-perl) modules from CPAN for HCUbot or any Perl bot to work.
There are many Perl bots available on the net, but I'm fairly certain that you will not find one that does exactly what you want. There's also a convention among bot writers not to give bots to people who do not understand them - it's considered irresponsible. Of course, once you've learned how to build bots, you can be as irresponsible as you like. What all this means is that you have to learn to appreciate them and Perl or you don't deserve them. Don't worry, it's easy enough - just a little effort.
Please note that this is not good Perl code and I am not a programmer. Rather it shows that to start using Perl you only need to understand scalars, arrays, hashes and regexes. I hope that HCUbot is replaced soon with a better HCUbot2 that shows me how it should be done.
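To make those four building blocks concrete, here's a minimal sketch that uses one of each. It is not part of HCUbot; the host name www.example.com and the helper host_of() are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar: a single value (string or number).
my $site = "www.example.com";          # hypothetical host name

# Array: an ordered list, used here as a queue of URLs to fetch.
my @queue = ("http://$site/", "http://$site/index.html");

# Hash: key => value pairs, used here to remember where a page was saved.
my %seen = ("http://$site/" => "1.ext");

# Regex: pull the host name out of a URL (hypothetical helper).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^http://([^/]+)} ? $1 : "";
}

my $next = shift @queue;               # take the first URL off the queue
print "next url: $next\n";
print "host:     ", host_of($next), "\n";
print "saved as: $seen{$next}\n" if exists $seen{$next};
print "left in queue: ", scalar(@queue), "\n";
```

The same shapes turn up all through HCUbot: a queue of URLs in an array, hashes keyed by URL, and regexes to decide what to keep.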
What I've done is provide most of the source to 'HCUbot' - a very simple web retrieval bot that retrieves many web pages from a single site. What's missing is the code to call subroutines and pass arguments to them. The idea is that by assembling the bot you will earn the right to use it. Those familiar with C, C++ or Java will have very little difficulty. This bot is fairly limited in what it can achieve (and bots can do far more than download web pages) but you are free to add any functionality you like - just write the code.
To get a taste of Perl, take a look at this. It's a very simple script I knocked together to convert Perl's 'pod' documentation to plain text. It's not actually necessary as there is already a pod-text conversion utility available, but I was practicing my regexes.
#!/usr/bin/perl -w
##### Strips formatting characters from pod documents
##### so that plain(er) txt is achieved
use diagnostics;

$infile  = "$ARGV[0]";
$outfile = "/output.txt";
open (INFILE, "<$infile")   or die "$0: Can't open $infile: $!\n";
open (OUTFILE, ">$outfile") or die "$0: Can't open $outfile: $!\n";
while (<INFILE>) {
    # needs to handle all these '=item *', '=over 4', '=head1 DESCRIPTION',
    # '=cut', '=item $ua = LWP::Robot ...', etc
    s/^=\w+\s*[\d|\s|\*]*//og;   # removes things starting =
    s/\w*<(.+?)>/$1/og;          # removes < and > around terms and preceding letters.
    # These are 'regular expressions' or 'regex(es)' - it may
    # look scary, but it's actually very simple when you
    # understand how and they're very powerful.
    print OUTFILE;
}
close OUTFILE;
close INFILE;
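To see what those two substitutions actually do, here's a small sketch that applies them to a couple of sample pod lines instead of a file. The sub name strip_pod_line() and the sample lines are made up for illustration; the regexes are the ones from the script.

```perl
use strict;
use warnings;

# Apply the script's two substitutions to one line of pod.
sub strip_pod_line {
    my ($line) = @_;
    $line =~ s/^=\w+\s*[\d|\s|\*]*//;   # drop a leading =command plus any digits, spaces or *
    $line =~ s/\w*<(.+?)>/$1/g;         # C<print> becomes print, B<bold text> becomes bold text
    return $line;
}

for my $sample ("=head1 DESCRIPTION\n",
                "=item * See C<print> and B<bold text>.\n") {
    print strip_pod_line($sample);
}
```

A line that holds nothing but a command, such as '=over 4' or '=cut', comes out empty, newline included, because \s also matches the newline.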
Now take a look at this. All Perl tutorials say that there are many ways to achieve what you want. I wanted to process all the files in a directory with a certain file extension.
sub process_files {
    my ($dir) = @_;
    opendir(DIR, $dir) or die " $0: Can't open $dir: $! \n";
    @files = readdir DIR;   # @files contains every file in the directory
    foreach $file (@files) {
        if ($file !~ /^\d+\.ext$/o)     # regex filters @files so that
            {next;}
        push (@extfiles, $file);        # @extfiles contains only .ext files
        # print "dev - pushed $file\n";
    }
    # open each file for processing
    foreach $file (@extfiles) {
        open(FILE,"> $file") or die "$0: Unable to open file $file - $!\n";
        # do something to file
        close FILE;
    }
    closedir DIR;
}

Or, the second attempt.

sub process_files {
    my ($dir) = @_;
    opendir(DIR, $dir) or die " $0: Can't open $dir: $! \n";
    # don't need @files array at all
    while ($file = <*.ext>) {
        open(FILE,"> $file") or die "$0: Unable to open file $file - $!\n";
        # do something to file
        close FILE;
    }
    closedir DIR;
}

But the best way is like this,

sub process_files {
    my ($dir) = @_;
    opendir(DIR, $dir) or die " $0: Can't open $dir: $! \n";
    @files = glob("*.ext");     # Easy when you know how, eh?
    foreach $file (@files) {
        open(FILE,"> $file") or die "$0: Unable to open file $file - $!\n";
        # do something to file
        close FILE;
    }
    closedir DIR;
}

Very soon after Fravia published this essay, the following comments and corrections by [blue] were posted to his messageboard. I am very pleased to include these corrections and welcome others. Four ways to achieve the same thing.
1. It's always better to directly parse directory list:

from perlop

chmod 0644, <*.c>;

Because globbing invokes a shell, it's often faster to call readdir() yourself and do your own grep() on the filenames. Furthermore, due to its current implementation of using a shell, the glob() routine may get ``Arg list too long'' errors (unless you've installed tcsh(1L) as /bin/csh).

I think the best way to parse the directory is something like this:

opendir(DIR, $path) || die "Can't open $path: $!";
# Avoid "." and ".."
@files=grep( !/^\./, readdir(DIR) );
closedir(DIR);

Any decent operating system implementing a file system cache will anyway read entire directory.

2. Speaking about Win32 ActivePerl is absolutely compatible and BTW there is Perl on almost any OS you can think of.

[blue]
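Putting [blue]'s readdir() and grep() suggestion together with the extension filter from the earlier attempts gives something like the sketch below. The sub name files_with_ext() is made up for illustration, and sorting the result is my own addition so that the order is predictable.

```perl
use strict;
use warnings;

# Return the names in $dir that end in .$ext, skipping dot files,
# using readdir and grep rather than glob.
sub files_with_ext {
    my ($dir, $ext) = @_;
    opendir(my $dh, $dir) or die "$0: Can't open $dir: $!\n";
    my @found = grep { !/^\./ && /\.\Q$ext\E$/ } readdir($dh);
    closedir($dh);
    my @sorted = sort @found;
    return @sorted;
}

my $dir = @ARGV ? $ARGV[0] : ".";      # directory to scan, defaults to the current one
print "$_\n" for files_with_ext($dir, "ext");
```

Note that the names that come back are relative to $dir, so you either chdir into the directory first (as HCUbot does) or put "$dir/" in front of each name before opening it.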
OK, here's HCUbot. HCUbot is written as a Linux application - it will need work to run on windoze (I've not used Perl over windoze - I think that it needs explicit sockets programming). Correction! [blue] states above that Perl for windoze (ActivePerl) is absolutely compatible. I've downloaded ActivePerl (1.5 meg) and I'm going to give it a go.
HCUbot produces many messages for help in development. You will see the headers sent to the server and the response headers back. I redirect the output to a file like this 'perl HCUbot www.orasomename.com > /tmp/BOTtestoutput' or the messages to the screen are overwhelming.
Perl helps you all the way with excellent error messages. You can write it cryptically or you can write it simply. I'm going to write it simply until I learn more - this code is quite clear to me. Use 'use diagnostics' and the -w switch only while developing - they can cause strange messages to be sent to servers. If something doesn't work, try it a slightly different way. I tend to use print statements to identify where perl fails (you may have noticed ;) and it seems to work well but there's also a very good debugger built in.
There are notes after the source to explain what's happening.
#!/usr/bin/perl -w
# remove -w switch after sorting
use diagnostics;        # for development, remove after sorting
# use strict; hmm
use HTTP::Status;
use HTTP::Response;
use LWP::RobotUA;       # haha! did it
use URI::URL;
use HTML::Parse;
use vars qw($opt_h);    # needs work
use Getopt::Std;

my $url;
print "dev - $0 started - initialising variables.\n";
my $arg = (shift @ARGV);
my $domain_name = "http://".$arg."/";
print "dev - \$domain_name is $domain_name\n";
local @get_list = $domain_name;     # is this ok??? # Yes
print "dev - \@get_list is @get_list\n";
local %hcuing = ();
print "dev - \%hcuing is initialised as ()\n";
# referer section
local %referer = ();
print "dev - \%referer is initialised as ()\n";
local $counter = 0;     # for naming locally-stored files
print "dev - \$counter is $counter\n";
local $maxcount = 15;
my $mirror = 0;

########################################################
### N.B. SUBROUTINES CALLED FROM THIS BLOCK ###
&change_dir($arg);
while (($url = shift @get_list) && ($counter < $maxcount)) {
    ##### INSERT #####
    ###  CODE  ###
    #### HERE ####
} ## while there are URLs to fetch
&shut_down;     #not strictly necessary (helps development, or helped me)
### N.B. SUBROUTINES CALLED FROM THIS BLOCK ###

## print_help() er, prints help ###########
sub print_help {
    print << "HELP";
usage: $0 [-h] domain-name
  -h  help
Example: $0 www.ora.com
HELP
}

## change_dir, change to user's home directory ###
sub change_dir {
    my ($dirname) = @_;
    $dirname =~ /http:\/\/(\w+)/;
    print "dev - \$dirname to be created is $dirname\n";
    # change to user's home directory
    chdir();
    my $pwd = `pwd`;
    print "dev - changed to user's home directory. Directory is $pwd\n";
    # makedir beneath user's home directory with appropriate permissions
    if (! ( -d $dirname)) {
        mkdir($dirname,0770)    # the execute bit is needed to chdir into it
            or die "Unable to create directory $dirname $!\n";
        print "dev - created directory $dirname\n";
    }
    # move into that directory - will be creating/renaming files
    chdir($dirname);
    $pwd = `pwd`;
    print "dev - changed to directory $pwd";
    return 0;
}

## get_html() retrieves html pages ######
sub get_html {
    my($url) = @_;
    print "dev - in sub get_html()\n";
    # Create a User Agent object
    # your email address here ~ be responsible ~
    $ua = new LWP::RobotUA 'HCUbot','jclinton@whitehouse.gov';
    $ua->delay(0.01);   # short delay but probably enough
    # Ask the User Agent object to request a URL.
    # Results go into the response object (HTTP::Response).
    my $request = new HTTP::Request('GET', $url);
    print "dev - \$url is $url\n";
    if (defined $referer{$url}) {   # referer implementation, works
        $ref = $referer{$url};
        $request->referer($ref);
    }
    my $response = $ua->request($request);
    ##### for development/debugging purposes #######
    print "\ndev - \$request>as_string is \n";
    print $request->as_string;
    print "\ndev - \$response->as_string is \n";
    print $response->as_string;
    ##### for development/debugging purposes #######
    return ($response->code, $response->content_type, $response->content);
}

## not_good() ############
## checks that page was received ok and that it is html #####
# returns 1 if the request was not OK or HTML, else 0
sub not_good {
    my ($code, $type) = @_;
    print "dev - in sub not_good \n";
    if ($code != RC_OK) {
        print "$url had response code of $code";
        return 1;
    }
    if ($type !~ /text\/html/) {
        warn("$url is not HTML.");
        return 1;
    }
    return 0;   # return false (0) if document is ok
}

## save_html() #########
sub save_html {
    my ($url,$data) = @_;
    print "dev - in sub save_html \n";
    $counter++;
    open(SAVEFILE,">$counter.ext") or die "unable to save file $url as $counter.ext \n";
    print SAVEFILE $data;
    close SAVEFILE;
    # save %hcuing hash entry for $url and local($counter) filename
    # Hash entry now defined as well as existing
    $hcuing{$url} = "$counter\.ext";
    print "dev - \%hcuing key $url given value $counter\.ext\n";
    return 0;
}

## extract_hyperlinks() #######
## extracts relative urls, calls absolutise_url()
sub extract_hyperlinks {
    my ($data, $url) = @_;
    print "dev - in sub extract_hyperlinks \n";
    my $parsed_html=HTML::Parse::parse_html($data);
    for (@{ $parsed_html->extract_links(qw (a)) }) {
        my ($link) = @$_;
        my ($absolute_link) = absolutise_url($link, $url);
        # only interested in i. same-domain
        ## and #### ii. non-queued or fetched hyperlinks
        ##### This is the second filter for documents to retrieve
        if (($absolute_link =~ /$domain_name/o) && (! exists $hcuing{$absolute_link})) {
            # queue for retrieval
            push (@get_list, "$absolute_link");
            # create but not define hash entry so that url is only queued once
            $hcuing{$absolute_link} = "";
            print "dev - \%hcuing key $absolute_link created. \n";
            # referer hash
            $referer{$absolute_link} = "$url";
            print "dev - \%referer key $absolute_link with value $url created. \n";
        }
    }
    $parsed_html->delete();     # manual garbage collection
    return 0;
}

## converts relative to absolute urls ######
sub absolutise_url {
    my ($partial, $model) = @_;
    print "dev - in sub absolutise_url()\n";
    my $url = new URI::URL($partial, $model);
    my $absolutised = $url->abs->as_string;
    ## URI::URL returns duplicated urls - filter further
    #!###
    #!~ THIS REGEX IS IMPORTANT ~!#
    #!~ - first filter for queuing docs for retrieval ~!#
    # must have extension htm(l)
    # tried /html*#{0}/ and /html*[^#]/
    if ( $absolutised =~ /htm[^#]*$/ ) {
        print "dev - absolutise_url() returning: $absolutised. \n";
        return $absolutised;
    }
    else {
        print "dev - absolutise_url() returning null: (not $absolutised). \n";
        # want to return null - will this work? yes
        return $absolutised = "";
    }
}

## shut_down ##########
## for development use
sub shut_down {
    ## there's probably a name for this by convention
    ## yeah, maybe shut_down
    print "dev - in END section\n";
    open(SAVEHASH,">hcuing") or die "unable to open hcuing hash file for saving.\n";
    print SAVEHASH %hcuing or die "unable to print hcuing hash file to disk. \n";
    close SAVEHASH;
    open(SAVEGETLIST,">getlist") or die "unable to open getlist file for saving.\n";
    print SAVEGETLIST @get_list or die "unable to print \@getlist to disk. \n";
    close SAVEGETLIST;
    open(SAVEGETLIST,">referer") or die "unable to open referer file for saving.\n";
    print SAVEGETLIST %referer or die "unable to print \@referer to disk. \n";
    close SAVEGETLIST;
    # print each %hcuing key-value pair
    foreach $k (sort keys %hcuing) {
        print "dev - \%hcuing $k => $hcuing{$k}\n";
    }
    # print each %referer key-value pair
    foreach $k (sort keys %referer) {
        print "dev - \%referer $k => $referer{$k}\n";
    }
}

# possible enhancements
# edit documents so links point to local copies
# scope properly
# enhance that regex
# mirroring facility
HCUbot is replacing a browser, sending requests for web pages and receiving responses. HCUbot can even pretend to be a browser - any browser you like. This line
$ua = new LWP::RobotUA 'HCUbot','jclinton@whitehouse.gov';
identifies HCUbot as HCUbot, while the jclinton... is the email address the server administrator should contact if your bot screws up her server - she'll send you an awfully polite email. So to pretend to be a particular browser, you would replace HCUbot with something like "Mozilla/3". You'll have to check the exact string that the browser actually sends.
HCUbot sends a GET request to the server. It says that it wants a particular web page by saying GET followed by the url of the document that you're after. There are other request methods - HEAD, POST and a few others - and LWP adds a mirror method of its own (did you notice that my $mirror = 0; variable in the initialising variables section?). Mirror compares the document on the server with your local copy and retrieves it only if the server's document is newer. As far as I know, LWP does this by sending a GET with an If-Modified-Since header carrying the date of your local file, so that the server sends the document only if it has changed. You can make the same comparison yourself with a HEAD request, which retrieves only the headers for the document. The headers contain the size of the document and the date that it was last amended, and your machine then decides whether to fetch it. That my $mirror = 0; variable initialisation is for HCUbot to mirror documents (not yet implemented).
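The decision itself - fetch the document if the server's copy is newer or a different size from your local copy - can be sketched in plain Perl without touching the net. This is only an illustration of the comparison, not LWP's code and not part of HCUbot; the sub names http_date_to_epoch() and needs_fetch() are made up, and the date and size are borrowed from the Oracle response header in this essay.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
             Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Convert an HTTP date such as "Thu, 20 Jul 1999 16:31:27 GMT" to
# seconds since the epoch; returns nothing if the date does not parse.
sub http_date_to_epoch {
    my ($date) = @_;
    return unless $date =~ /(\d{1,2}) (\w{3}) (\d{4}) (\d\d):(\d\d):(\d\d) GMT/;
    my ($day, $mon, $year, $h, $m, $s) = ($1, $2, $3, $4, $5, $6);
    return unless exists $MONTH{$mon};
    return timegm($s, $m, $h, $day, $MONTH{$mon}, $year);
}

# Decide whether the server's copy should be fetched again.
sub needs_fetch {
    my ($local_mtime, $local_size, $last_modified, $content_length) = @_;
    my $remote_mtime = http_date_to_epoch($last_modified);
    return 1 unless defined $remote_mtime;          # can't tell, so fetch
    return 1 if $remote_mtime > $local_mtime;       # server copy is newer
    return 1 if $content_length != $local_size;     # size differs
    return 0;
}

my $last_modified = "Thu, 20 Jul 1999 16:31:27 GMT";
my $local_mtime   = http_date_to_epoch($last_modified) - 3600;  # pretend our copy is an hour older
my $fetch = needs_fetch($local_mtime, 12723, $last_modified, 12723);
print $fetch ? "fetch it again\n" : "keep the local copy\n";
```

In a real bot the local values would come from stat() on the saved file and the remote ones from the Last-Modified and Content-Length headers of a HEAD response.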
Let's take a look at some headers that HCUbot works with.
dev - in sub get_html()                 ## messages starting "dev - ..." are produced by
dev - $url is http://www.oracle.com/    ## HCUbot so that you know what's happening.

dev - $request>as_string is
GET http://www.oracle.com/              # Here's the request header
From: jclinton@whitehouse.gov
User-Agent: hcuBOT

dev - $response->as_string is
HTTP/1.1 200 OK                         # Here's the response header, that we
Cache-Control: public                   # get back from the server
Date: Thu, 20 Jul 1999 20:18:19 GMT
Accept-Ranges: bytes
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, HEAD
Content-Length: 12723
Content-Type: text/html
ETag: "8ef7c2d83beac682e5b0bb90ecc3791a"
Last-Modified: Thu, 20 Jul 1999 16:31:27 GMT
Client-Date: Thu, 20 Jul 1999 23:28:07 GMT
Client-Peer: 205.207.44.16:80
Title: Oracle Corporation - Home
X-Meta-Description: Oracle Corp. (Nasdaq: ORCL) is the world's leading supplier of software for enterprise information management.
X-Meta-Keywords: database,software,Oracle,Oracle8i,relational server, server,application,tools,decision support tools,internet,internet computing, CRM,customer relationship management,e-business,PL/SQL,XML,Year 2000,Euro, Java, technology

<html>      # and the html document requested with a GET starts here.
Quite a whopper that response header, they're not normally that big. The request is simple on this one, it's jclinton@whitehouse.gov saying GET http://www.oracle.com/ using User-Agent: hcuBOT.
The important part of the response is the first line "HTTP/1.1 200 OK".
Hypertext Transfer Protocol (HTTP) will either be 1.1 or 1.0. Version 0.9 only supports the GET method and is not used now as far as I'm aware. 1.0 supports GET, HEAD, POST, PUT, DELETE, LINK and UNLINK. 1.1 supports a few extra methods. The Allow header in this response says that the server will accept HEAD and GET requests.
An important part is the response code. We want response code 200 as shown here, which is the server replying "OK, here's the document you asked for". Response codes 100 to 199 are informational and you'll rarely see them (HTTP 1.0 doesn't define any). 200-299 mean the request was successful, but that doesn't really mean that you'll get the document. 300-399 are redirections, which can cause a bit of trouble. 400-499 are client errors, which you don't want: 400 is bad request (syntax error in the request header), 404 is document not found - just like when you click on a stale link. Server errors are the 500 range, which you don't want either. 500 is internal server error, one that you don't want but will get often. I implemented the referer in HCUbot to try to avoid RC500s and made some other changes. The referer is the page that gave us the link. You'll sometimes get the document even with a RC500.
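Here's a minimal sketch of reading that first line and sorting the code into its range. HCUbot doesn't need this, because LWP hands back $response->code directly; the sub names parse_status_line() and code_class() are made up for illustration.

```perl
use strict;
use warnings;

# Split a status line such as "HTTP/1.1 200 OK" into
# (version, code, message); returns an empty list if it doesn't parse.
sub parse_status_line {
    my ($line) = @_;
    return unless $line =~ m{^HTTP/(\d+\.\d+)\s+(\d{3})\s*(.*)$};
    return ($1, $2, $3);
}

# Name the range that a response code falls in.
sub code_class {
    my ($code) = @_;
    return "informational" if $code >= 100 && $code < 200;
    return "success"       if $code >= 200 && $code < 300;
    return "redirection"   if $code >= 300 && $code < 400;
    return "client error"  if $code >= 400 && $code < 500;
    return "server error"  if $code >= 500 && $code < 600;
    return "unknown";
}

for my $line ("HTTP/1.1 200 OK",
              "HTTP/1.0 404 Not Found",
              "HTTP/1.1 500 Internal Server Error") {
    my ($version, $code, $message) = parse_status_line($line);
    print "$code ($message) is a ", code_class($code), " code\n";
}
```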
Here's a request header with a referer. It's saying "I want http://www.oracle.com/html/custcom.html, I got this url from http://www.oracle.com/".
dev - $request->as_string is
GET http://www.oracle.com/html/custcom.html
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
User-Agent: hcuBOT

dev - $response->as_string is
HTTP/1.1 200 OK
Date: Thu, 20 Jul 1999 20:18:23 GMT
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, HEAD
Content-Type: text/html
Client-Date: Thu, 20 Jul 1999 23:28:11 GMT
Client-Peer: 205.207.44.16:80
Title: Oracle Corporation - Customers.com

<html>      # HTML document follows
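The request half of that dump is just a few lines of text, and it can be sketched in plain Perl to show where each piece comes from. The sub build_request() is made up for illustration; in HCUbot it is LWP that builds the request. It only mimics the printed dump above - a real request line on the wire also carries the protocol version.

```perl
use strict;
use warnings;

# Assemble request header text in the style of the dump above.
# The Referer line is only added when a referer is known.
sub build_request {
    my ($url, $agent, $from, $referer) = @_;
    my @lines = ("GET $url", "From: $from");
    push @lines, "Referer: $referer" if defined $referer;
    push @lines, "User-Agent: $agent";
    return join("\n", @lines) . "\n";
}

# Single quotes stop Perl treating @whitehouse as an array.
print build_request('http://www.oracle.com/html/custcom.html',
                    'hcuBOT',
                    'jclinton@whitehouse.gov',
                    'http://www.oracle.com/');
```

Swap 'hcuBOT' for a browser's string and the server sees a browser; leave out the last argument and you get the simpler request from the first dump.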
In HCUbot there's this code to test if the document was received OK (response code 200) and that it's html
sub not_good {
    my ($code, $type) = @_;
    if ($code != RC_OK) {
        print "$url had response code of $code";
        return 1;
    }
    if ($type !~ /text\/html/) {
        warn("$url is not HTML.");
        return 1;
    }
    return 0;   # return false (0) if document is ok
}

Back to HCUbot. HCUbot uses the LWP (it stands for libwww-perl) Perl module, which is a predefined library of code that deals with net protocols. To write a bot in C++, for example, you'd want to include a networking library, just as iostream.h and math.h are included. What happens is that your program calls on functions in these stored libraries. LWP relieves the programmer (that's me or you) of sockets programming. A socket is how you program the net - you read and write to a socket like you would read or write to a file except that it's more complex. Socket programming allows more control.
Specifically, HCUbot uses LWP::RobotUA, the robot user agent, which is an appropriate module for web robots. RobotUA is often called 'polite' because it's careful not to aggravate servers. In particular it delays requests to the server. The default delay, however, is one minute, which I think is far too long for today's servers. That's why HCUbot calls $ua->delay(0.01) - the delay is given in minutes, so that's about 0.6 seconds.
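The arithmetic behind that delay is worth a quick sketch. This is not RobotUA's code, only the idea of waiting out a delay given in minutes; the sub name seconds_to_wait() and the timestamps are made up for illustration.

```perl
use strict;
use warnings;

# Seconds still to wait before the next request to the same server,
# given the time of the last request, the time now (both in seconds)
# and a delay expressed in minutes.
sub seconds_to_wait {
    my ($last_request, $now, $delay_minutes) = @_;
    my $wait = $last_request + $delay_minutes * 60 - $now;
    return $wait > 0 ? $wait : 0;
}

my $last = 1000;    # hypothetical time of the previous request
printf "delay of 1 minute, 10 seconds later: wait %.1f seconds\n",
    seconds_to_wait($last, $last + 10, 1);
printf "delay of 0.01 minutes, straight away: wait %.1f seconds\n",
    seconds_to_wait($last, $last, 0.01);
```

With the one-minute delay a bot fetching 15 pages spends most of a quarter of an hour asleep, which is why the setting matters.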
This is how HCUbot works.
dev - in 'MAIN'calling get_html section dev - in sub get_html() dev - $url is http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907191000.18885.html&mode=corp dev - $request->as_string is GET http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907191000.18885.html&mode=corp From: jclinton@whitehouse.gov Referer: http://www.oracle.com/ User-Agent: hcuBOT dev - $response->as_string is HTTP/1.1 200 OK Date: Thu, 20 Jul 1999 20:18:44 GMT Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition Allow: GET, POST Content-Type: text/html Client-Date: Thu, 20 Jul 1999 23:28:32 GMT Client-Peer: 205.207.44.16:80 Title: Press Release <html> <head><title>Press Release</title></head> <body bgcolor="#ffffff"> <!--header--> <table width=600 cellpadding=0 cellspacing=0 border=0> <tr><td colspan=2 align=right> <map name="top"> <area shape="rect" coords="0,0,140,25" href="/" target="_top"> <area shape="rect" coords="343,1,385,23" href="/"target="_top"> <area shape="rect" coords="386,1,441,23" href="/html/sitemap.html" target="_top"> <area shape="rect" coords="442,1,503,23" href="/html/siteidx_frame.html" target="_top"> </map> <img width=528 height=28 src="/templates/images/hdr_top.gif" usemap="#top" border=0 alt="home,site map,site index"></td> <td valign=top rowspan=2> <a href="/ebusiness/" target="_top"> <img width=72 height=56 src="/templates/images/hdr_eb.gif" border=0 alt="#1 e-business"></a> </td></tr> <tr><td valign=top width=203> <div class="search"> <FORM method=GET action="http://orasearch.oracle.com/cgi-bin/query"> <INPUT TYPE=hidden NAME=mss VALUE=simple> <INPUT TYPE=hidden NAME=pg VALUE=q> <INPUT TYPE=hidden NAME=fmt VALUE=.> <INPUT TYPE=hidden NAME=what VALUE=web> <INPUT NAME=q size=10 maxlength=800 VALUE=""><INPUT TYPE="image" src="/templates/images/search_btn.gif" width=36 height=18 value="go" border=0> </FORM> </div> </td> <td valign=top align=right width=397> <map name="tabs"> <area shape="rect" coords="5,0,84,16" href="http://oraclestore.oracle.com" 
target="_top"> <area shape="rect" coords="85,0,168,16" href="/download/" target="_top"> <area shape="rect" coords="169,0,219,16" href="/support/" target="_top"> <area shape="rect" coords="200,0,259,16" href="/cgi-bin/press/pr.cgi" target="_top"> <area shape="rect" coords="260,0,309,16" href="/corporate/seminars_and_events/" target="_top"> <area shape="rect" coords="310,0,392,16" href="/siteadmin/html/contactus.html" target="_top"> </map> <img width=397 height=28 src="/templates/images/hdr_tab.gif" usemap="#tabs" border=0 alt="Main Navigation Bar"></td></tr> </table> <table width=560><tr><td> <img ALIGN=center WIDTH=246 HEIGHT=40 SRC="/corporate/press/images/pr_ban.jpg" ALT=""><br> <form action="pr.cgi" method="post"> <INPUT TYPE="HIDDEN" NAME="status" VALUE="Search"> <div align=right><INPUT TYPE="SUBMIT" VALUE="Return to Corporate Press Release Index"> </div> </form> <h2>Oracle Capitalizes on Enterprise Demand for Linux Offerings with Announcement of Oracle 8i on Linux</h2> (July 19, 1999)<p> <P><B>Contact(s):</B><TABLE WIDTH=100%><TR><TD VALIGN=TOP ALIGN=LEFT><FONT SIZE=-1>Reema Bahnasy<BR>Oracle Corp.<BR>650/506-3397<BR><A HREF="mailto:rbahnasy@us.oracle.com">rbahnasy@us.oracle.com</A></FONT></TD><TD VALIGN=TOP ALIGN=LEFT><FONT SIZE=-1>Karesha McGee<BR>Applied Communications<BR>415/365-0202<BR><A HREF="mailto:kmcgee@appliedcom.com">kmcgee@appliedcom.com</A></FONT></TD></TR></TABLE><P> <P> Early Adopters Programs Draws Nearly 20,000 Developers <P> REDWOOD SHORES, Calif., July 19, 1999-<A HREF="http://www.oracle.com/">Oracle Corporation</A>, the number one choice for e-business, today announced dramatic growth and demand for Oracle on Linux with strong adoption in both enterprise and general business markets. Oracle also announced the general availability of Oracle8i(TM) on Linux, after a successful early adopter's program. 
<P> Since <A HREF="http://www.oracle.com/">Oracle Corp.</A> announced Oracle8 on Linux, there have been over 50,000 downloads from Oracle(R) Technology Network (<A HREF="http://technet.oracle.com/">http://technet.oracle.com/ ). Now, after the announcement of Oracle8i , there have been nearly 20,000 registrants for early access in the first few weeks. Outside the development community, Oracle has also seen overwhelming customer adoption with an excess of 800 paying customers today-over half of these orders from enterprise accounts and the remainder from small to mid-sized businesses and organizations. <P> "Until the availability of Oracle database on Linux, we either had to rely on NT or use one of the shareware database servers available for Linux," says Jonathan August, President and CEO of Internection, Inc., a company providing customized Internet services solutions to businesses, including web hosting and e-commerce solutions. "Neither solution provided us the security, performance, manageability or reliability required by our customers. Oracle brings enterprise credibility and robustness to our products. As a result, we've gained access to customers ranging from small businesses to Fortune 100 enterprises like Prudential and Pfizer. Our total revenue since the additional of Oracle on Linux has increased by 250 percent." <P> "Oracle on Linux combines enterprise level reliability, scalability and performance with a free, robust and well supported operating system," says Nick Marden, technical director of e-commerce, Xoom.com, and e-commerce service provider. "It enables Xoom.com to better understand our members' needs and respond to them quickly. Oracle on Linux represents an extraordinary value and it gets the job done." <P> "Oracle is committed to bringing superior technology to the Linux community," says Chuck Rozwat, senior vice president of Server Technologies at Oracle. "Oracle8i on Linux comes with both Java and XML built right in. 
Together they offer the most cost-effective way to deploy scalable Internet applications." <P> Oracle8i is the first and only database specifically designed for the Internet. Oracle8i extends Oracle's long-standing technology leadership in the areas of data management, transaction processing and data warehousing to the new medium of the Internet. Oracle8i is the centerpiece of Oracle's Internet Platform, which also includes Oracle Application Server and Oracle's Internet development tools. <P> Oracle Corporation is the world's leading supplier of software for information management, and the world's second largest software company. With annual revenues of more than $8.8 billion, the company offers its database, application server, tools and application products, along with related consulting, education and support services, in more than 145 countries around the world. <P> For more information about Oracle, please call 650/506-7000. Oracle's World Wide Web address is (URL) <A HREF="http://www.oracle.com/.">http://www.oracle.com/. <P> <P><CENTER><STRONG># # #</CENTER></STRONG><P> <P> <B>Trademarks</B><BR> Oracle is a registered trademark and Oracle8i is a trademark or registered trademark of Oracle corporation. Other names may be trademarks of their respective owners. 
</td></tr></table> <html> <body bgcolor="#ffffff" link="000000"> <img src="/images/line.gif" width=600 height=1> <br clear=all> <table width=600 cellpadding=0 cellspacing=0 border=0> <tr> <td align="right" width="100"> <div class="FOOTER"> <a href="/appserver/"> <font FACE="Arial, Helvetica" SIZE="1"> Powered by Oracle Application Server </a></font> </div></td> <td align="left" width="50"> <div class="FOOTER"> <img src="/images/clear_dot.gif" width=50 height=1> </div></td> <td width=450> <div class="FOOTER"> <center> <font FACE="Arial, Helvetica" SIZE="1"> <a href="/" target="_top">Home</a> | <a href="/html/sitemap.html"target="_top">Site Map</a> | <a href="/html/siteidx_frame.html" target="_top">Site Index</a> | <a HREF="http://orasearch.oracle.com" target="_top">Search</a> <br> <a HREF="http://oraclestore.oracle.com/" target="_top">Oracle Store</a> | <a href="/download/" target="_top">Free Download</a> | <a HREF="/support/" target="_top">Support</a> | <a HREF="/cgi-bin/press/pr.cgi" target="_top">News</a> | <a HREF="/corporate/seminars_and_events/" target="_top">Events</a> | <a HREF="/siteadmin/html/contactus.html" target="_top">Contact Oracle</a> <br> <a href="/products/index.htm" target="_top">Products</a> | <a href="/services/index.htm" target="_top">Services</a> | <a href="/solutions/index.htm" target="_top">Business Solutions</a> | <a href="/corporate/oracle_at_work/" target="_top">Customer Successes</a> | <a href="/partners/index.htm" target="_top">Partners</a> <br> <a href="http://technet.oracle.com" target="_top">Developers/IT</a> | <a href="/corporate/index.htm" target="_top">About Oracle</a> | <a href="/international/html/" target="_top">International</a> | <a HREF="/html/employ.html" target="_top">Employment</a> | <a HREF="http://cnn.com/customnews" target="_top">cnn custom news</a> <br><p> <b>Copyright © 1995,1999 Oracle Corporation. 
All Rights Reserved.<br></b> <A HREF="/html/copyright.html">Legal Notices and Terms of Use</a> | <a href="/html/privacy.html">PRIVACY STATEMENT</a></font></center></div></td></tr> </table> <br clear=all> <table width=600 cellpadding=0 cellspacing=0 border=0> <tr><td align=right> <a href="http://ad.doubleclick.net/jump/www.oracle.com/products/trial/html/trial.html"> <img src="http://ad.doubleclick.net/ad/www.oracle.com/products/trial/html/trial.html" width=468 height=60 border=0 ismap></a> </td></tr> </table> </body> </html> <!--end footer--> </body> </html>

dev - in sub not_good
dev - in sub save_html
dev - %tebotize key http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907191000.18885.html&mode=corp given value 5.tbt
dev - in sub extract_hyperlinks
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/ebusiness/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected mailto:rbahnasy@us.oracle.com).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected mailto:kmcgee@appliedcom.com).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://technet.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/.).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/appserver/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/html/sitemap.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/html/siteidx_frame.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://orasearch.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://oraclestore.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/download/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/support/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/cgi-bin/press/pr.cgi).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/corporate/seminars_and_events/).
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/siteadmin/html/contactus.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/products/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/services/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/solutions/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://www.oracle.com/corporate/oracle_at_work/).
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/partners/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://technet.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/corporate/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/international/html/.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/html/employ.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected http://cnn.com/customnews).
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/html/copyright.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://www.oracle.com/html/privacy.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning: http://ad.doubleclick.net/jump/www.oracle.com/products/trial/html/trial.html.

dev - in 'MAIN'calling get_html section
dev - in sub get_html()
dev - $url is http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907130500.13306.html&mode=corp
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
User-Agent: hcuBOT

dev - $response->as_string is
HTTP/1.1 200 OK
Date: Thu, 20 Jul 1999 20:18:48 GMT
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, POST
Content-Type: text/html
Client-Date: Thu, 20 Jul 1999 23:28:36 GMT
Client-Peer: 205.207.44.16:80
Title: Press Release

<html> <head><title>Press Release</title></head> <body bgcolor="#ffffff"> <!--header--> <table width=600 cellpadding=0 cellspacing=0 border=0> <tr><td colspan=2 align=right>

dev - $request>as_string is
GET http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907130500.13306.html&mode=corp
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
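The accept and reject decisions in that output come from the first filter regex in absolutise_url(), and the %hcuing hash makes sure that a url is only queued once. Here's a minimal sketch of both ideas on their own, using a few urls from the output above. The sub names passes_first_filter() and queue_once() are made up for illustration, and the second (same-domain) filter is left out.

```perl
use strict;
use warnings;

# First filter, using the regex from absolutise_url(): the url must
# contain "htm" with no '#' between it and the end of the url.
sub passes_first_filter {
    my ($url) = @_;
    return $url =~ /htm[^#]*$/ ? 1 : 0;
}

# Queue each url at most once, the way %hcuing is used in
# extract_hyperlinks(). Returns the length of the queue.
sub queue_once {
    my ($queue, $seen, @urls) = @_;
    for my $url (@urls) {
        next unless passes_first_filter($url);
        next if exists $seen->{$url};
        $seen->{$url} = "";            # key exists, no file name yet
        push @$queue, $url;
    }
    return scalar @$queue;
}

my (@get_list, %seen);
queue_once(\@get_list, \%seen,
    'http://www.oracle.com/ebusiness/',
    'http://www.oracle.com/html/sitemap.html',
    'mailto:rbahnasy@us.oracle.com',
    'http://www.oracle.com/html/sitemap.html',      # duplicate, queued only once
    'http://www.oracle.com/international/html/',
);
print "$_\n" for @get_list;
```

Notice that the regex lets through anything with "htm" near the end, including a directory such as /international/html/, which is one reason 'enhance that regex' is on the list of possible enhancements.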
Perl is not the only language to write bots in.
You can install Linux to your Windoze machine - you know you want to.
You could try something like this at altavista: '+Perl +tutorial' or '+Perl +robot +tutorial'.
I expect to update this page fairly soon with an improved HCUbot.