~ How search engines work ~
Version March 2000
Eh, this is still in fieri, and is just a first attempt, ya know...
by fravia+, March 2000
A search engine is a complex funny beast, as you can imagine.
Its eyes are the programs it uses in order to find information,
its great ugly body is a giant database of continuously updated findings,
and its brain is the set of specific algos used to decide the relevance
of results for a given query.
Of course there are major "parts". Each engine needs (at least):
- to FIND information;
- to STORE information;
- to ORGANIZE its stored information (weighting, ranking);
- to ALLOW querying (user interface);
and, since most users of search engines are basically idiots incapable of searching with
more than one term,
- to automatically AMELIORATE the mostly poor query they got as input
(query expansion, neural net techniques, pseudo-relevance feedback) in order to
offer results that make some sense (a tiny sketch of such expansion follows this list).
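To make that last point a little more concrete, here is a minimal sketch of query
expansion, in python. The toy synonym table and the expand_query function are my own
invention for illustration, not anything a real engine publishes; real engines use much
more elaborate statistical tricks (term co-occurrence, user logs and so on).

    # Hypothetical, minimal query expansion: pad a poor one-term query with
    # related terms so the engine has something more to chew on.
    SYNONYMS = {                          # toy thesaurus, invented for illustration
        "car":    ["automobile", "vehicle"],
        "search": ["seek", "query", "lookup"],
    }

    def expand_query(query):
        """Return the original terms plus any known related terms."""
        terms = query.lower().split()
        expanded = list(terms)
        for term in terms:
            expanded.extend(SYNONYMS.get(term, []))
        return expanded

    print(expand_query("car search"))
    # ['car', 'search', 'automobile', 'vehicle', 'seek', 'query', 'lookup']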
Finding information
Human URL-adding and spidering.
Each search engine has a "spider" (aka crawler, aka bot)
which automatically scours the Net for sites and reports its findings.
The spider visits web pages, reads them, and then follows links to other pages
within the site. At times these spiders discover "uncharted pages" by following
links outside the submitted or found site (this is happening less and less now,
for reasons discussed [elsewhere] on my site). This is what it means when someone
refers to a site being "spidered". The spider returns to the site on a regular basis,
say every couple of months, to check whether the site still exists and to
look for changes. Note that the average life of a site on the web was less than
two months at the beginning of 2000.
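If you fancy seeing the beast's eyes at work, here is a minimal python sketch of such a
spider. The seed list, the politeness delay and everything else are my own assumptions
for illustration: a real engine's crawler is a distributed monster, not twenty lines of
script.

    # Hypothetical, minimal spider: fetch a page, harvest its links, repeat.
    import time
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=50):
        """Breadth-first crawl starting from seed_urls; returns {url: html}."""
        queue, seen, pages = list(seed_urls), set(seed_urls), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                      # dead or hostile site: skip it
            pages[url] = html
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)        # follow links to yet-uncharted pages
                    queue.append(absolute)
            time.sleep(1)                     # be polite, don't hammer the server
        return pages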
The spider's findings land in the second part of all search engines, the database,
aka the index or catalog: a giant book containing a copy (google) or an archived
and zipped reference to every web page that the spider has found. If a web page
changes, this database is updated with the new information.
This indexing of the spidered info can, nota bene, take weeks.
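The usual way to organize such a catalog is an 'inverted index': for every word, the
list of pages it appears in and how often. A minimal python sketch, assuming a pages
dictionary like the one a crawler would return; real indexes also store word positions,
formats, anchor text and much more.

    import re
    from collections import defaultdict

    def build_index(pages):
        """pages: {url: text}. Returns {word: {url: occurrences}}."""
        index = defaultdict(dict)
        for url, text in pages.items():
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word][url] = index[word].get(url, 0) + 1
        return index

    # index["fravia"] -> {"http://example.org/searching.htm": 3, ...}  (invented example)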
Search-engine-specific algos are another important element of all search engines.
These are the programs and algos that grep and sift through the millions of pages
recorded in the database to find matches for a query and rank them in order of
what the programmers believe should be most relevant. All search engines have the
parts listed above, but the 'tuning' is different; that's the
reason the same query will give DIFFERENT results on different search engines.
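Just to show how mechanical this 'relevance' business is, here is a minimal python
ranking pass over the inverted index sketched above. The scoring formula (plain term
counts, shorter pages slightly favoured) is my own invention for illustration; every
real engine guards its own secret sauce.

    def rank(index, query_terms, page_lengths):
        """index: inverted index as above; page_lengths: {url: word count}.
        Both inputs, like the formula itself, are hypothetical examples."""
        scores = {}
        for term in query_terms:
            for url, count in index.get(term, {}).items():
                scores[url] = scores.get(url, 0.0) + count / page_lengths[url]
        # highest score first: this ordering IS the 'tuning' that differs per engine
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)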
In the other pages of this
[details]
section we will examine 'in depth' the specific quirks of the main
search engines.
Of course the 'relevance ranking' part is a tricky business.
Filtering unwanted information out can be even more difficult than
searching for info in the first place, since in searching for info you may allow
a certain margin of uncertainty, but when you filter something out, that's a binary
aut aut, not a vel vel decision.
You should beware of filter agents (unless you have or happen to have found
their source code :-) Godzilla knows what silly algos the programmers have implemented
to filter out information that YOU wanted but that Billy the userzombie
usually discards.
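To see why such binary filtering is dangerous, a tiny python sketch with an invented
stoplist: one over-broad word in the list and a perfectly good page vanishes from your
results, with no 'maybe' in between.

    # Hypothetical filter agent: any page containing a 'bad' word is silently dropped.
    STOPLIST = {"casino", "viagra", "crack"}       # invented list; note what 'crack' kills

    def survives_filter(text):
        words = set(text.lower().split())
        return not (words & STOPLIST)              # aut aut: kept or killed, nothing in between

    print(survives_filter("how to crack a difficult search problem"))   # False: filtered out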
In fact the most commonly used algos leave a great margin for errors and blunders
(a small python sketch of the first and last of these follows the list):
- link analysis ("best sites based on how many sites link to that page"):
I am sure you can list by yourself a dozen good reasons NOT to trust
such a methodology :-)
- result clustering: that's summarizing results into (supposedly) useful groups and
automatic creation of taxonomies (and directories) on the fly... see Northernlight
for an example: those folders may at times even be useful, but are often puzzling,
to say the least;
- click weighting: this is a hotbot speciality. The user clicks on a link from the
retrieved results? That site will get a plus. The user remains very little on that site
and goes back? It's a bogus site, it will have less weight next time. The user clicks
first on the site listed number eight and skips the first seven? That site gets more
weight, and so on... it is easy to imagine how this system will automagically
tend to a 'common denominator' of crap/spam sites that bozo the user loves and that
you personally wouldn't want to touch with a badger pole. Another good reason to
always begin from the fourth (at least) page of results downwards :-)
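Here is that minimal python sketch of link analysis and click weighting; the naive
inlink counting and the click bonus/penalty numbers are pure invention on my part,
meant only to show how mechanically such 'relevance' gets cooked.

    # Hypothetical link analysis + click weighting, invented numbers throughout.
    def inlink_scores(link_graph):
        """link_graph: {url: [urls it links to]}. Naive score = number of inbound links."""
        scores = {}
        for source, targets in link_graph.items():
            for target in targets:
                scores[target] = scores.get(target, 0) + 1
        return scores        # easily gamed: a thousand junk pages all linking to one site

    def adjust_for_clicks(score, clicked, seconds_on_site):
        """Clicked and stayed: bonus. Clicked and bounced straight back: penalty."""
        if not clicked:
            return score
        return score * (1.2 if seconds_on_site > 30 else 0.8)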
All the methods and algos listed above are arbitrary, to say the least. Maybe future
clustering algos will be able to use format distinctions more efficiently, since the
relevance analysis necessary to evaluate a scientific paper must
obviously use different parameters from those used to evaluate a spam-commercial
site. But for the moment you had better be very careful and skeptical about the results
presented, and try to use queries that are unlikely to have
been standardized by the programmers. Once more, never forget that there are two billion
sites and that the search engines 'see' only a tiny part of that huge mass. Search engines
are useful to start a query, but once you have found the signal among the
noise, you will usually switch to more advanced and complicated seeking techniques.
Does the arbitrariness of modern automated relevance ranking methods
mean that only human directories can offer correct ranking results?
Of course not: the real problem is that categorization is per se very hard to create
and to maintain, and that if you put three humans together they will almost
immediately - and quite hotly - disagree about each other's
chosen categorization guidelines and choices.
Automated systems admittedly make more blunders than humans and return a greater
number of irrelevant sites, but their ability to cast larger nets compensates for this.
Anyway, both will happily try to impose their 'petty' taxonomical visions on you,
which more often than not have the added disadvantage of being extremely provincial.
(This is typical of all Euro-American-centric attempts: you'll quickly realize how true
this is once you have learned how to search - and fish - extremely
valuable info on, for instance, Chinese, Korean or Russian sites).
A request for help disguised as an anti-commercial rant :-)
Note that on the ugly web of today many commercial bastards offer
this kind of (easy to find) information to the zombies
on a 'per pay' basis. Your help in developing this specific detail section of my site
is therefore even more appreciated than usual, as a first, and I hope painful,
'retaliating' action against those that want to sell easy-to-find information instead
of spreading it. Painful because if we get deep enough with this stuff we will cast
a knowledge spike through our enemies' heart: their wallet.
(c) 2000: [fravia+], all rights
reserved