fravia @ Paris
Ecole polytechnique, 6 February 2001



How to search: The structure of the web -why should you care?

Out there, on the web, somewhere: different laws, alien approaches, original methods, unique tactics, new solutions, proposals you would never have thought of in order to solve your local (& probably provincial) problem.


Please excuse my English, but I fear that my French would be even worse. There are also sound terminological reasons to use English during today's workshop. I will, however, answer your questions in French afterwards, if need be. Note that I won't give many references supporting my statements. References are not necessary among searchers: once it has been stated -for instance- that the diameter of the web is around 19 clicks, you already have enough data (angles) to find all the necessary references by yourselves, and then some. You'll be able to check or refute such a statement on your own. That's the beauty of websearching. It is reminiscent of medieval "Quellenforschung" techniques. Yep, you should search for that as well :-)

HOW TO SEARCH THE WEB
volume, diameter and structure

First of all you must understand the incredible depth, but also the amazing limits of the web.

Let's see... how many fairy tales do you think human beings have ever written since the dawn of human culture?
How many songs has our race sung?
How many pictures have humans drawn?
How many books in how many languages have been compiled?
The mind shudders, eh?

Nevertheless (unless you're really ill-fated in your queries) every song, every image, every text, every book and every software program is now being stored -with fewer and fewer exceptions- on the web.
I'm not exaggerating. Whole national libraries are going massively on-line in some godforsaken country at this very moment. Tumbling prices for scanners, hard disks and web-connections have made this possible.
In fact every media product is already there... somewhere, albeit buried under huge mountains of futilities and commercial crap.

But there are even more important goodies than media products out there. There are SOLUTIONS! Imagine you're an administrator, or a consumer defender, or a lawyer, or a scientist, or simply yourself, seeking a solution. A solution to what doesn't matter, it's all the same.
Well, the solution to your problem is there. Actually you'll probably find MORE THAN ONE solution to your current problem, and maybe you'll even be able to build on what you have found to develop yet another approach.
Out there, on the web, somewhere: different laws, alien approaches, original methods, unique tactics, new solutions, proposals you would have never thought of in order to solve your local "provincial" problem in -say- French Nantes, have been found and tried out (maybe in Spanish) in -say- Buenos Aires (or vice versa, or anywhere else under any other language ~ nation ~ sky ~ latitude).
Alas! You desperately need that info, but you won't be able to find it... if you don't know how to search...
The solutions are there, somewhere, but -like a tantalizing juicy fruit- just out of your reach.
That happens -most of the time- because you're not able to search the web effectively. Effective searching -on the web- is nowadays a very important gift. Once you learn it, it won't matter any more if the info, or the program, or the solution, or the picture you're looking for is freely available, or kept prisoner inside a hidden database in order to make money with it, or buried beneath tons of commercial advertisement crap. You'll find it wherever it is, cutting through the web-pudding with the dissecting sharpness of your seeker's logical swords (and tools).

We will learn the basic approaches, right here, today. You'll be able to apply them again and again, for a long time. I believe you won't forget what we'll learn today... but you'll judge for yourselves how useful this could be. Note that these approaches are mostly simple and almost banal... when you examine them a posteriori :-).

As usual, I also want to underline the cosmic power that software reverse engineering can give you. Therefore I will close this workshop by demonstrating a simple technique to get rid of advertisement banners planted inside software (using the latest version of the very good 'Opera' browser as example and target). Since this approach does not require any particular software reversing skill and can thus be applied by anyone of normal intelligence, it should help to put a nice stop to a pestering habit that utterly annoys me :-)

Let's start from the very beginning: what's the web and what does it look like? As you'll see this matters quite a lot when searching.

3 and a half milliard (billions). That's the volume.



Today's web depth is anyone's guess. Nobody really knows how big it is, but the most reliable data give, for the beginning of February 2001 (today), around 3.500.000.000 pages (that's 3 and a half milliard pages, duh). The most capable search engines (Google, FAST and Altavista... when their servers are not overloaded) cover at best a small part of it. Let's hope one third (crossing our fingers and our toes for good measure ;-) but even this appears optimistic: it looks more and more like they are simply not able to keep pace with its growth.

Search engines are not enough (by a long shot not enough) in order to search this huge bulk of scattered information, therefore different methods MUST be used to search the web.

Since we are dealing with web dimensions, let's quickly have a look at its diameter as well.
As strange as it may seem to you, the diameter of the main portion of the web (the "bulk" aka "central core") is relatively small: around 19 links (also called 'degrees' in this context) on average. Since the average number of links per page is seven (even if on many sites you'll find hundreds of links), given the (presumed) dimensions of the web you can hop from any Internet site to any other one (yep!) using on average just 19 clicks. This figure will not increase much with the growth of the web (the correlation is logarithmic) and may rise to 21 or 22 links at most (on average). This is also called the "small world" phenomenon. That said, the real chances that you can reach at all (not on average) any random site from any other random site CLICKING ONLY FORWARD are just around 25%. The bigger a site (node), the higher the number of pages it will have with specific and often unrelated links. Note, moreover, that a path from site A to site B DOES NOT imply that there's also a path from site B to site A (duh). This, as we'll see in a moment, is of paramount importance for seekers.
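
To get a feeling for why the diameter grows so slowly, here is a minimal Python sketch. It is NOT the measurement behind the 19 clicks figure. It assumes an idealised web in which every page carries the same number of links and no two links ever lead to the same page, so that d clicks reach about k^d pages. The function name clicks_needed and the whole model are assumptions made for illustration only.

```python
import math


def clicks_needed(pages: float, links_per_page: float) -> float:
    """Clicks needed to reach `pages` pages in an idealised tree-like web.

    Assumption (not from the talk): each click multiplies the number of
    reachable pages by `links_per_page`, so d clicks reach k**d pages and
    d = log(pages) / log(links_per_page).
    """
    if pages < 1 or links_per_page <= 1:
        raise ValueError("need pages >= 1 and links_per_page > 1")
    return math.log(pages) / math.log(links_per_page)


if __name__ == "__main__":
    today = clicks_needed(3.5e9, 7)      # volume and link average quoted above
    tenfold = clicks_needed(3.5e10, 7)   # a web ten times bigger
    print(f"idealised clicks today:        {today:.2f}")
    print(f"idealised clicks, 10x the web: {tenfold:.2f}")
    print(f"extra clicks for 10x growth:   {tenfold - today:.2f}")
```

The absolute numbers of this toy model come out lower than 19, because on the real web many links lead to pages you have already reached. Only the shape of the curve matters here: multiplying the size of the web by ten adds a little more than one click.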




The structure of the web, why should you care?

Since the best 'global' (or 'main') search engines cover one third of the web at best, you must often resort to other methods of searching if you want to find what you are looking for. These 'other methods' depend of course on the nature of your target (a specific piece of information, a file, music, a book, a document or an image; information about somebody; hidden or classified information inside a non-public database... and so on) but boil down most of the time to a 'broader search' through local and regional search engines, a 'combing' search on usenet newsgroups and on various messageboards, an advanced search using self-made ad hoc bots (mostly in Perl nowadays, but you can of course use good ole C or powerful PHP or REBOL or whatever you want to build a bot, even -yeeerk!- Visual Basic if you fancy), and various other techniques like combing, klebing, guessing, hacking hidden databases and stalking that we'll (in part) discuss today.
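
The heart of any self-made bot is pulling the links out of a page so that it knows where to go next. What follows is only a rough sketch of that single step, written in Python rather than Perl and using nothing but the standard library. The names LinkCollector and extract_links, and the example addresses, are hypothetical. Fetching pages, politeness delays and everything else a real bot needs are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of all <a href="..."> tags found in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Return every link target in `html`, as absolute addresses."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hypothetical page and address, used only to exercise the function.
    page = '<p><a href="/docs/a.htm">A</a> <a href="http://example.org/b">B</a></p>'
    for link in extract_links(page, "http://example.com/index.htm"):
        print(link)
```

A bot would feed each extracted address back into a queue of pages to visit, keeping a set of the ones already seen.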



The structure of the web (from a searcher's standpoint) is of paramount importance to decide which techniques you should apply and where. Let's start with the 'Nucleus'. Imagine a huge bulk of nearly 3,5 milliard pages, all mutually interconnected. This is the CORE of the web: strongly interconnected pages, sites, usenet newsgroups, messageboards, you name it. This is the 'web' as you know it, where you happily browse from link to link smearing all your personal data along, as we'll see later, when we examine some anonymity themes. It is not easy to represent the web three-dimensionally. The Nucleus is far from being a compact and uniform 'ball' of mutually interconnected sites. You should think of a fractal-like entity, with almost 'organic' features, with spaghetti-like 'tubes' that quickly connect some areas while leaving 'link-holes' in many places. It would probably look like a chunk of Gruyère :-)

As the image behind my shoulders should make clear, only a relatively small part of the Nucleus has been indexed by any search engine.

Take note that a part of this huge bulk is USENET, an incredible mass of micro-information, which can frequently give very useful results, especially for seeking, combing, klebing and stalking purposes.
We'll get back to Usenet later, when dealing with micro-searching techniques.

So, back to our 'image' of the structure of the web. We have (today) this 3,5 milliard page Nucleus (the bulk) with a part of it consisting of usenet newsgroups (articles, images, files and people). Only a part of this core has been indexed by search engines. Never forget this.
For searching purposes we are now going to divide the web into parts that link to and/or are linked from the Nucleus.

First of all there's a big area called "Outside linked". This consists of pages and files that the Nucleus links towards, but that do NOT point reciprocally back to the Nucleus. Thus they are 'outside' the reciprocally connected bulk, yet they are not particularly difficult to find. In order to search this part of the web, mostly made of non-hidden databases (imagine a huge collection of images, for instance) you'll use the same techniques (power-searching, combing and building your own searchbots) commonly used in order to search the Nucleus.

Then there's another big area called "Outside linkers". The pages located in this area of the web "point" to the Nucleus but are not linked back from it. Imagine as an example the "personal links" pages of a scientist: lotta juicy links to the Nucleus yet no need to publicise their existence. There is -by definition- NO LINK from the Nucleus that you browse and peruse to this part of the Web. A page with information you may need is there, somewhere, without any link whatsoever that could bring you to it.
The "outside linkers" are a part of the web you cannot reach using "normal" search techniques, since no link whatsoever points to them. Yet they may hoard knowledge you need. There are, fortunately, some techniques that you can apply in order to find them, the most common one being klebing, as we'll see in a minute.
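
The three areas described so far can be read as a statement about a directed graph, and a tiny Python sketch may make the one-way nature of links more tangible. This is only an illustration under assumptions: the web is modelled as a dictionary mapping each page to the pages it links to, a 'seed' page is assumed to lie inside the Nucleus, and the names reachable, reverse, classify and all the toy page names are invented for this example.

```python
from collections import deque


def reachable(graph: dict, start: str) -> set:
    """All pages reachable from `start` by following links forward (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def reverse(graph: dict) -> dict:
    """The same web with every link turned around."""
    flipped = {}
    for page, targets in graph.items():
        flipped.setdefault(page, [])
        for target in targets:
            flipped.setdefault(target, []).append(page)
    return flipped


def classify(graph: dict, seed: str) -> dict:
    """Sort pages into the areas described above, relative to `seed`.

    `seed` is assumed to be a page that sits inside the Nucleus.
    """
    forward = reachable(graph, seed)             # pages the seed can click to
    backward = reachable(reverse(graph), seed)   # pages that can click to the seed
    pages = set(graph) | {t for ts in graph.values() for t in ts}
    return {
        "nucleus": forward & backward,
        "outside_linked": forward - backward,
        "outside_linkers": backward - forward,
        "unconnected": pages - forward - backward,
    }


if __name__ == "__main__":
    # Hypothetical toy web: page names are invented for illustration.
    toy_web = {
        "core1": ["core2", "image_db"],
        "core2": ["core1"],
        "scientist_links": ["core1"],
        "image_db": [],
        "island": [],
    }
    for area, members in classify(toy_web, "core1").items():
        print(area, sorted(members))
```

Note that the "outside linkers" only show up here because the sketch is handed the whole graph and can walk the links backwards. A seeker who starts inside the Nucleus and clicks only forward never gets that luxury, which is exactly why other techniques are needed.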

Then we have a big area (or to be more precise, a very big quantity of small areas scattered around the Nucleus, but I didn't know how to draw it) of hidden databases which are connected to the bulk, but that can be accessed (in theory) only conditionally.
In fact these 'hidden databases' contain pages or files that the Nucleus points to, and that may (or may not) point back to the Nucleus. Yet for commercial (or other access-restrictive) reasons visitors of sites located here are supposed to "pay" (or adhere to some "clan") in order to access them. As you may imagine, these pages are NOT mutually linked (but they might point to some "outside linkers" as well).
Fortunately (for us, unfortunately for the commercial bastards) the web was originally built in order to share (and neither to hoard nor to sell) knowledge. And thus the building blocks, the "basic frames" behind the structure of the web, are still helpful for those that want to share.
If I may dare a comparison: exactly as it is pretty easy to break any software protection written in a higher language if you know (and use) some assembly, so it is easy to break any server-user delivered "barrier" to a given database if you know (and can "outflank" and/or exploit) the protocols used by browsers and servers.
Let's simply say that it is relatively easy to access all pages in this "hidden databases" area of the web by reversing the (simple) Perl or JavaScript tricks used to keep them "off limits" for zombies and lusers ("loser users"). A sound knowledge of Internet protocols and the ability to Guess (which can be a very sophisticated Art) represent an incredibly powerful approach in this field.

This ease of access is extremely worrying, since a not too small part of this area is made of so-called "politically and strategically sensitive" data, hidden within military or government servers, which are -alas- connected to the web (this being in this context a very big mistake eo ipso IMHO).
Since this is a workshop "for the establishment", I'll limit myself to pointing out that if you learn how to search the web you'll soon find, and without excessive fatigue, a whole plethora of effective ways to access this kind of info as well... should you fancy it.

But this is exactly the problem we started with: how do you search effectively? Since the invention of the "Altavista-type" of text indexing, search techniques have become COMPLETELY different from before. Let's first of all have a deeper look inside Altavista.


(c) III Millennium by [fravia+], all rights reserved and reversed