~ How search engines work ~
Version March 2000
Eh, this is still in fieri, and is just a first attempt, ya know...
by fravia+, March 2000
A search engine is a complex funny beast, as you can imagine.
Its eyes are the programs it uses in order to find information,
its great ugly body is a giant database of continuously updated findings,
and its brain is the set of specific algos used to decide the relevance
of results for a given query.
Of course there are major "parts". Each engine needs (at least):
- to FIND information;
- to STORE information;
- to ORGANIZE its stored information (weighting, ranking);
- to ALLOW querying (user interface);
and, since most users of search engines are basically idiots incapable of searching with
more than one term,
- to automatically AMELIORATE the mostly poor query they got as input
(query expansion, neural net techniques, pseudo-relevance feedback) in order to
offer results that make some sense (a tiny sketch of such expansion follows this list).
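To make that last point a little more concrete, here is a minimal sketch of query
expansion, in python. The toy synonym table and the expand_query function are my own
invention for illustration, not anything a real engine publishes; real engines use much
more elaborate statistical tricks (term co-occurrence, user logs and so on).

    # Hypothetical, minimal query expansion: pad a poor one-term query with
    # related terms so the engine has something more to chew on.
    SYNONYMS = {                          # toy thesaurus, invented for illustration
        "car":    ["automobile", "vehicle"],
        "search": ["seek", "query", "lookup"],
    }

    def expand_query(query):
        """Return the original terms plus any known related terms."""
        terms = query.lower().split()
        expanded = list(terms)
        for term in terms:
            expanded.extend(SYNONYMS.get(term, []))
        return expanded

    print(expand_query("car search"))
    # ['car', 'search', 'automobile', 'vehicle', 'seek', 'query', 'lookup']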
Finding information
Human URL-adding and spidering.
Each search engine has a "spider" (aka crawler, aka bot)
which automatically scours the Net for sites and reports its findings.
The spider visits web pages, reads them, and then follows links to other pages
within the site. At times these spiders discover "uncharted pages" by following
links outside the submitted or found site (this is happening less and less now,
for reasons discussed [elsewhere] on my site). This is what it means when someone
refers to a site being "spidered". The spider returns to the site on a regular basis,
say every couple of months, to check whether the site still exists and to
look for changes. Note that the average life of a site on the web was less than
two months at the beginning of 2000.
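If you fancy seeing the beast's eyes at work, here is a minimal python sketch of such a
spider. The seed list, the politeness delay and everything else are my own assumptions
for illustration: a real engine's crawler is a distributed monster, not twenty lines of
script.

    # Hypothetical, minimal spider: fetch a page, harvest its links, repeat.
    import time
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=50):
        """Breadth-first crawl starting from seed_urls; returns {url: html}."""
        queue, seen, pages = list(seed_urls), set(seed_urls), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                      # dead or hostile site: skip it
            pages[url] = html
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)        # follow links to yet-uncharted pages
                    queue.append(absolute)
            time.sleep(1)                     # be polite, don't hammer the server
        return pages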
The spider's findings land in the second part of all search engines, the database,
aka the index or catalog: a giant book containing a copy (google) or an archived
and zipped reference to every web page that the spider has found. If a web page
changes, this database is updated with the new information.
This indexing of the spidered info can, nota bene, take weeks.
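The usual way to organize such a catalog is an 'inverted index': for every word, the
list of pages it appears in and how often. A minimal python sketch, assuming a pages
dictionary like the one a crawler would return; real indexes also store word positions,
formats, anchor text and much more.

    import re
    from collections import defaultdict

    def build_index(pages):
        """pages: {url: text}. Returns {word: {url: occurrences}}."""
        index = defaultdict(dict)
        for url, text in pages.items():
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word][url] = index[word].get(url, 0) + 1
        return index

    # index["fravia"] -> {"http://example.org/searching.htm": 3, ...}  (invented example)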
Search-engine-specific algos are another important element of all search engines.
These are the programs and algos that grep and sift through the millions of pages
recorded in the database to find matches for a query and rank them in order of
what the programmers believe should be most relevant. All search engines have the
parts listed above, but the 'tuning' is different; that's the
reason the same query will give DIFFERENT results on different search engines.
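Just to show how mechanical this 'relevance' business is, here is a minimal python
ranking pass over the inverted index sketched above. The scoring formula (plain term
counts, shorter pages slightly favoured) is my own invention for illustration; every
real engine guards its own secret sauce.

    def rank(index, query_terms, page_lengths):
        """index: inverted index as above; page_lengths: {url: word count}.
        Both inputs, like the formula itself, are hypothetical examples."""
        scores = {}
        for term in query_terms:
            for url, count in index.get(term, {}).items():
                scores[url] = scores.get(url, 0.0) + count / page_lengths[url]
        # highest score first: this ordering IS the 'tuning' that differs per engine
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)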
In the other pages of this
[details]
section we will examine 'in depth' the specific quirks of the main
search engines.
Of course the 'relevance ranking' part is a tricky business.
Filtering unwanted information out can be even more difficult than
searching for info in the first place, since in searching for info you may allow
a certain margin of uncertainty, but when you filter something out, that's a binary
aut aut, not a vel vel decision.
You should beware of filter agents (unless you have or happen to have found
their source code :-) Godzilla knows what silly algos the programmers have implemented
to filter out information that YOU wanted but that Billy the userzombie
usually discards.
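To see why such binary filtering is dangerous, a tiny python sketch with an invented
stoplist: one over-broad word in the list and a perfectly good page vanishes from your
results, with no 'maybe' in between.

    # Hypothetical filter agent: any page containing a 'bad' word is silently dropped.
    STOPLIST = {"casino", "viagra", "crack"}       # invented list; note what 'crack' kills

    def survives_filter(text):
        words = set(text.lower().split())
        return not (words & STOPLIST)              # aut aut: kept or killed, nothing in between

    print(survives_filter("how to crack a difficult search problem"))   # False: filtered out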
In fact the most commonly used algos leave a great margin for errors and blunders
(a small python sketch of the first and last of these follows the list):
- link analysis ("best sites based on how many sites link to that page"):
I am sure you can list by yourself a dozen good reasons NOT to trust
such a methodology :-)
- result clustering: that's summarizing results into (supposedly) useful groups and
automatic creation of taxonomies (and directories) on the fly... see Northernlight
for an example: those folders may at times even be useful, but are often puzzling,
to say the least;
- click weighting: this is a hotbot speciality. The user clicks on a link from the
retrieved results? That site will get a plus. The user remains very little on that site
and goes back? It's a bogus site, it will have less weight next time. The user clicks
first on the site listed number eight and skips the first seven? That site gets more
weight, and so on... it is easy to imagine how this system will automagically
tend to a 'common denominator' of crap/spam sites that bozo the user loves and that
you personally wouldn't want to touch with a badger pole. Another good reason to
always begin from the fourth (at least) page of results downwards :-)
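Here is that minimal python sketch of link analysis and click weighting; the naive
inlink counting and the click bonus/penalty numbers are pure invention on my part,
meant only to show how mechanically such 'relevance' gets cooked.

    # Hypothetical link analysis + click weighting, invented numbers throughout.
    def inlink_scores(link_graph):
        """link_graph: {url: [urls it links to]}. Naive score = number of inbound links."""
        scores = {}
        for source, targets in link_graph.items():
            for target in targets:
                scores[target] = scores.get(target, 0) + 1
        return scores        # easily gamed: a thousand junk pages all linking to one site

    def adjust_for_clicks(score, clicked, seconds_on_site):
        """Clicked and stayed: bonus. Clicked and bounced straight back: penalty."""
        if not clicked:
            return score
        return score * (1.2 if seconds_on_site > 30 else 0.8)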
All the methods and algos listed above are arbitrary, to say the least. Maybe future
clustering algos will be able to use format distinctions more efficiently, since the
relevance analysis necessary to evaluate a scientific paper must
obviously use different parameters from those used to evaluate a spam-commercial
site. But for the moment you had better be very careful and skeptical about the results
presented, and try to use queries that are unlikely to have
been standardized by the programmers. Once more, never forget that there are two billion
sites and that the search engines 'see' only a tiny part of that huge mass. Search engines
are useful to start a query, but once you have found the signal among the
noise, you will usually switch to more advanced and complicated seeking techniques.
Does the arbitrariness of modern automated relevance ranking methods
mean that only human directories can offer correct ranking results?
Of course not: the real problem is that categorization is per se very hard to create
and to maintain, and that if you put three humans together they will almost
immediately - and quite hotly - disagree about each other's
chosen categorization guidelines and choices.
Automated systems admittedly make more blunders than humans and return a greater
number of irrelevant sites, but their ability to cast larger nets compensates for this.
Anyway, both will happily try to impose their 'petty' taxonomical visions on you,
which more often than not have the added disadvantage of being extremely provincial.
(This is typical of all Euro-American-centric attempts: you'll quickly realize how true
this is once you have learned how to search - and fish - extremely
valuable info on, for instance, Chinese, Korean or Russian sites).
A request for help disguised as an anti-commercial rant :-)
Note that on the ugly web of today many commercial bastards offer
this kind of (easy to find) information to the zombies
on a 'per pay' basis. Your help in developing this specific detail section of my site
is therefore even more appreciated than usual, as a first, and I hope painful,
'retaliating' action against those that want to sell easy-to-find information instead
of spreading it. Painful because if we get deep enough with this stuff we will cast
a knowledge spike through our enemies' heart: their wallet.
(c) 2000: [fravia+], all rights
reserved