How search engines work
Version March 2000
 
    Eh, this is still in fieri, and is just a first attempt, ya know...
   
by fravia+, March 2000
A search engine is a complex funny beast, as you can imagine. Its eyes are the programs it uses in order to find information, its great ugly body is a giant database of continuously updated findings, and its brain is the set of specific algos used to decide the relevance of results for a given query.
Of course there are major "parts". Each engine needs (at least):
 - to FIND information;
 - to STORE information;
 - to ORGANIZE its stored information (weighting, ranking);
 - to ALLOW querying (user interface);
and, since most users of the search engines are basically idiots incapable of searching with more than one term,
 - to automatically AMELIORATE the mostly poor queries they get as input (query expansion, neural net techniques, pseudo-relevance feedback) in order to offer results that make some sense (see the sketch below).
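To fix the idea of that last AMELIORATE point, here is a toy sketch of query expansion in Python. Everything in it (the SYNONYMS table, the function name) is my own invention, not any real engine's code: the real beasts use full thesauri, neural net techniques and pseudo-relevance feedback instead of a hardcoded table.

    # Toy query expansion: pad a poor one-term query with synonyms,
    # so the engine has something more to match against.
    SYNONYMS = {
        "search": ["seek", "query", "find"],
        "engine": ["spider", "crawler", "bot"],
    }

    def expand_query(query):
        """Return the original query terms plus any known synonyms."""
        terms = query.lower().split()
        expanded = list(terms)
        for term in terms:
            expanded.extend(SYNONYMS.get(term, []))
        return expanded

    # expand_query("search")  ->  ['search', 'seek', 'query', 'find']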
 
Finding information
Human URL-adding and spidering.
Each search engine has a "spider" (aka crawler, aka bot) which automatically scours the Net for sites and reports its findings. The spider visits web pages, reads them, and then follows links to other pages within the site. At times these spiders discover "uncharted pages" by following links outside the submitted or found site (this is happening less and less now, for reasons discussed [elsewhere] on my site).

This is what it means when someone refers to a site being "spidered". The spider returns to the site on a regular basis, say every couple of months, to check whether the site still exists and to look for changes. Note that the average life of a site on the web was less than two months at the beginning of 2000.
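Here is a minimal sketch of such a spider in Python, using nothing but the standard library. The names are mine, and real crawlers of course add politeness delays, robots.txt handling, duplicate detection and much more.

    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collects the href target of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def spider(start_url, max_pages=50):
        """Breadth-first crawl: fetch a page, keep its text, queue its links."""
        seen, queue, pages = set(), deque([start_url]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                          # dead or unreachable: skip it
            pages[url] = html                     # the spider's "findings"
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))  # follow links, on- or off-site
        return pages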
The spider's findings land in the second part of all search engines, the database, aka the index or catalog: a giant book containing either a copy of (Google) or an archived and zipped reference to every web page that the spider has found. If a web page changes, this database is updated with the new information.

This indexing process of the spidered info, nota bene, can take weeks.
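The classic data structure behind such a catalog is the inverted index: for every word, the set of pages that contain it. A toy version follows (my own simplification; real indexes also store word positions, formats, link data and so on):

    from collections import defaultdict

    def build_index(pages):
        """Map each word to the set of URLs whose text contains it.
        `pages` is a dict of url -> page text, e.g. a spider's findings."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in text.lower().split():
                index[word].add(url)
        return index

    # build_index({"a.html": "search engines", "b.html": "search lores"})
    # -> {'search': {'a.html', 'b.html'}, 'engines': {'a.html'}, 'lores': {'b.html'}}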
Search engine specific algos are another important element of all search engines. These are the programs that grep and sift through the millions of pages recorded in the database to find matches for a query and rank them in the order of what the programmers believe should be most relevant. All search engines have the parts listed above, but the 'tuning' is different, and that's the reason the same query will give DIFFERENT results on different search engines. In the other pages of this [details] section we will examine 'in depth' the specific quirks of the main search engines.
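The simplest possible 'tuning' would be plain term counting. A toy ranker, just to show where each engine hides its secret sauce (my own names, nobody's actual algo):

    def rank(query, pages):
        """Score each page by how often the query terms occur in it,
        then return the URLs sorted best-first. Every real engine
        replaces this naive scoring with its own 'tuned' formula."""
        terms = query.lower().split()
        scores = {}
        for url, text in pages.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score:
                scores[url] = score
        return sorted(scores, key=scores.get, reverse=True)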
Of course the 'relevance ranking' part is a tricky business. Filtering unwanted information out can be even more difficult than searching for info in the first place, since when searching for info you may allow a certain margin of uncertainty, but when you filter something out, that's a binary aut aut, not a vel vel decision.
You should beware of filter agents (unless you have, or happen to have found, their source code :-) Godzilla knows what silly algos have been implemented by the programmers to filter out information that YOU wanted but that Billy the userzombie usually discards.
 
In fact the most commonly used algos leave a great margin for errors and blunders:
 - link analysis ("best sites based on how many sites link to that page"): I am sure you can list by yourself a dozen good reasons NOT to trust such a methodology :-)
 - result clustering: that's summarizing results into (supposedly) useful groups and automatic creation of taxonomies (and directories) on the fly... see Northernlight for an example: those folders may at times even be useful, but are often puzzling, to say the least;
 - click weighting: this is a hotbot speciality (see the sketch after this list). Does the user click on a link from the retrieved results? That site will get a plus. Does the user remain very little on that site and go back? It's a bogus site, and will have less weight next time. Does the user click first on the site listed number eight and jump the first seven? That site gets more weight, and so on... it is easy to imagine how this system will automagically tend towards a 'common denominator' of crap/spam sites that bozo the user loves and that you personally wouldn't want to touch with a badger pole. Another good reason to always begin from the fourth (at least) page of results downwards :-)
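Since hotbot does not publish its formula, the following is pure guesswork on my part: a toy version of such click weighting, with made-up thresholds and numbers, only to show the mechanism (and why it drifts towards what bozo the user loves):

    def click_weight(weights, results, clicked, seconds_on_site):
        """Adjust per-site weights after one user interaction.
        `results` is the ranked list shown; `clicked` the chosen URL."""
        w = weights.get(clicked, 1.0)
        if seconds_on_site < 10:
            weights[clicked] = w * 0.8      # bounced straight back: bogus site, demote
        else:
            weights[clicked] = w * 1.2      # user stayed: promote
            for url in results[:results.index(clicked)]:
                weights[url] = weights.get(url, 1.0) * 0.95  # skipped hits lose a little
        return weights

    # weights = {}
    # click_weight(weights, ["s1", "s2", "s3"], "s3", 120)
    # -> {'s3': 1.2, 's1': 0.95, 's2': 0.95}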
 
 
All the methods and algos listed above are arbitrary, to say the least. Maybe future clustering algos will be able to use format distinctions more efficiently, since the relevance analysis needed to evaluate a scientific paper must obviously use different parameters from those used to evaluate a spam-commercial site. But for the moment you had better be very careful and sceptical about the results presented, and try to use queries that are unlikely to have been standardized by the programmers. Once more, never forget that there are 2 milliard sites and that the search engines 'see' only a tiny part of that huge mass. Search engines are useful to start a query, but once you have found the signal among the noise, you will usually switch to more advanced and complicated seeking techniques.
Does the arbitrariness of modern automated relevance ranking methods mean that only human directories can offer correct ranking results? Of course not: the real problem is that categorization is per se very hard to create and to maintain, and that if you put three humans together they will almost immediately - and quite hotly - disagree about each other's chosen categorization guidelines and choices. Automated systems admittedly make more blunders than humans and return a greater number of irrelevant sites, but their ability to cast larger nets compensates for this. Anyway, both will happily try to impose on you their 'petty' taxonomical visions, which more often than not also have the added disadvantage of being extremely provincial. (This is typical of all euro-americanocentric attempts: you'll quickly realize how true this is once you have learned how to search -and fish- extremely valuable info on -for instance- Chinese, Korean or Russian sites.)
A request for help disguised as an anti-commercial rant :-)
Note that on the ugly web of today many commercial bastards offer this kind of (easy to find) information to the zombies on a 'per pay' basis. Your help in developing this specific detail section of my site is therefore even more appreciated than usual, as a first, and I hope painful, 'retaliating' action against those that want to sell easy-to-find information instead of spreading it. Painful because if we get deep enough into this stuff we will cast a knowledge spike through our enemies' heart: their wallet.
 
   
(c) 2000: [fravia+], all rights reserved