tadimens.htm: How to search the web, by fravia+: tadimens

Searching problems, web problems

[A deep and uncharted web] ("first optimistical law" of seeking)
[You'll always find what you want] ("second optimistical law" of seeking)
[Plagiarism and Anti-plagiarism] ("third optimistical law" of seeking)
[Ethical problems] ("fourth optimistical law" of seeking)
[Smaller laws] ("minor laws" of seeking)

[Final advices]

A deep and uncharted web

The web is huge, of course, but that's not a real problem. The problem is that time and space have a different meaning on the web. Add to this the truth that everything that 'happens' is carved forever: try to pull something 'off the web' and you will soon realize that you won't be able to do it. Everything you write and publish will defy eternity, carved in electrons: the very moment you put something on the web, someone, somewhere, will make a copy of it. It is bound to reappear, somewhere, sometime: indestructible and redoubtable powers of the void.

Time is different too. As anyone with real web experience knows, it happens that something you wrote, or published, remains unanswered - and apparently uncared for - for months, or years... and then, all of a sudden, when you have almost forgotten it yourself, a dozen people begin contacting you out of the void, with an enormous and, for you, inexplicable interest in what you wrote so long ago.
There are in fact micro-communities, working along similar paths, that remain unaware of each other's existence for an inexplicably long time. This has of course to do with both the vastness of the web and the fact that people do not know how to search.
Furthermore, the web is truly international, in a depth that even those that did travel a lot (mostly) underestimate. Some of the readers interacting with you may have problems, ideals and aims so 'alien' from your point of view that you cannot even hope to understand them. On the other hand this multicultural and truly international collaboration may bring some sparkles of fresh breath in a world of cloned Euro-American zombies who drink the same coke with the same bottles, wear the same shirts, the same shoes (and often enough the same pants), and sit ritually in the same McDonald's in order to perform their compulsory collective and quick "reverse shitting".

The web is truly huge: at the moment there should be around 32 billion (32 milliard) indexable pages on the "visible" web, according to most (self-proclaimed) experts. Less than one half of these documents are indexed by the main search engines.

How many documents there really are is anybody's guess.
Moreover there's an "invisible" (or "deep") "hidden databases" web (made out of dynamic, not persistent, pages, the content of which can only be found by a direct query) which could be up to 500 times bigger than the "visible" (or "surface") web.
The fact is that the main search engines (at this very moment, on a continuously moving landscape: Google, Yahoo, MSN, ask, exalead and Gigablast) cover at best one third of the indexable web, and probably far less. The pace of growth being amazing as well, there is no hope that the main search engines will ever be able to cope with the www.
Search engines alone are therefore not enough (not by a long shot) to search this huge bulk of scattered information: different methods MUST be used to search the web.

But for searchers the huge number of sites and the incredible pace of growth are no reason for despair. In fact, even if the web keeps growing at the same incredible pace it has registered in recent years, its DIAMETER will remain low. Thus searchers will always be able to find what they are looking for, provided they know how to search and how to program their own [bots.htm] and scrolls.

At the moment the diameter of the web should still hover around 19 'clicks'.
Since the average number of links per page is seven (on some sites you'll find hundreds of links, on some others: none), given the presumed dimensions of the web you should be able to hop from any Internet site to any other one using - on average - just 19 clicks.
NOTA BENE: This limit will not increase much with the growth of the web (the correlation is logarithmic): it may increase to a maximum of 21 or 22 clicks towards the end of the century.
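
A small sketch may make the logarithmic correlation more tangible. It assumes the empirical fit d = 0.35 + 2.06 * log10(N) published by Albert, Jeong and Barabasi in 1999; the coefficients come from that study, not from this page, and the function name is purely illustrative.

```python
import math

def estimated_diameter(pages, a=0.35, b=2.06):
    """Estimate the average number of clicks between two pages.

    `pages` is the number of documents on the web. The coefficients
    a and b are ASSUMPTIONS taken from a published empirical fit;
    they are not derived anywhere on this page.
    """
    return a + b * math.log10(pages)

# Each tenfold growth of the web adds only b (about two) clicks.
for pages in (8e8, 8e9, 8e10):
    print(f"{pages:.0e} pages -> about {estimated_diameter(pages):.1f} clicks")
```

Under this assumed fit a web of 800 million pages lands close to the 19 clicks quoted above, and multiplying the number of pages by ten adds only about two clicks, which is why the diameter barely moves while the web explodes.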

Therefore (this is the important corollary): a seeker that knows his skills will always be able to find what he's looking for in a relatively short time, no matter how big the web is or will be, this being the "first optimistical law of seeking" :-)
That said, the real chance that you can get at all (I mean, not "on average") from one random site to another random site, clicking only forward, is just around 25%.
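
The difference between "on average" and "at all" can be illustrated on a toy graph. What follows is only a sketch on a hypothetical five-page web (page names and links are made up): it follows links forward only, breadth first, and counts which fraction of ordered page pairs is actually connected.

```python
from collections import deque

def reachable_from(graph, start):
    """Return the pages reachable from `start` by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)  # a page does not count as its own target
    return seen

def forward_reachability(graph):
    """Fraction of ordered (source, target) pairs joined by a forward path."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    pairs = len(pages) * (len(pages) - 1)
    hits = sum(len(reachable_from(graph, page)) for page in pages)
    return hits / pairs if pairs else 0.0

# Hypothetical toy web: a, b and c link in a circle, d is a dead end,
# e links in but nobody links to it.
toy_web = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
    "e": ["a"],
}

print(forward_reachability(toy_web))
```

Dead ends (like d) and pages nobody links to (like e) are what drag the fraction down, even when the pages that are connected sit only a few clicks apart.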

The structure of the web, from a searcher's point of view, is - to say the least - pretty weird, as you'll see in the picture below.
Yet the structure of the web is of paramount importance (from a searcher's point of view, eh) in order to decide which techniques you should apply when searching, and where, i.e. in which areas.
Imagine a huge bulk of around 35 billion (35 milliard) pages, all mutually interconnected. This is the 'Nucleus', the very CORE of the web, a bulk of strongly or loosely interconnected pages, sites, usenet newsgroups, messageboards, you name it.
This is the 'web' everyone knows, where you happily browse from link to link and from blog to advertisement, smearing all your personal data along, as we try to counter in the '[Anonymity]' lore. It is not easy to represent the web three-dimensionally. The Nucleus is far from being a compact and uniform 'ball' of mutually interconnected sites: you should think of a fractal-like entity, with almost 'organic' features, with spaghetti-like 'tubes' that quickly connect some areas while leaving 'link-holes' in many places. In fact it would probably look like a chunk of Gruyere (the commercial advertisement banners being the cause of its bad smell :-)

This image I have made could be of some help...

[Image, 532*380: the structure of the web]

As you can see there are four main different areas: 'bulk', 'hidden databases', 'outside linkers' and 'outside linked'.
Different techniques are used to access these different areas.

An important area of the web is made of hidden databases. These are pages that the Nucleus points to, and that may (or may not) point back to the Nucleus. Yet for commercial (or other access-restrictive) reasons visitors of sites located here are supposed to "pay" or adhere to some political, religious or social "clan" in order to access them. As you may imagine, these pages are NOT mutually linked.
Fortunately (for us... unfortunately for the commercial bastards) the web was originally built in order to share (neither to hoard nor to sell) knowledge. And thus the building blocks, the "basic frames" behind the structure of the web, are still the same as those of the web of old: a tool made to freely and gratuitously exchange information.
If I may dare a comparison: exactly as it is pretty easy to break any software protection written in a higher language if you know (and use) assembly, so it is easy to break any server delivered 'barrier' to a given database if you know (and can outflank) the protocols used by browsers and servers.
As a result let's simply say that it is relatively easy to access 99% of the pages you may encounter by reversing the (simple) perl or javascript tricks used to keep zombies and lusers "off limits". (You won't even have to resort to common exploits à la "politically correct" :-)

The Nucleus "points" to another area of the web, the "outside linked". The sites in this area are linked from the Nucleus but do not point back to it. A simple example: the elements of a database of images, linked from the Nucleus but not necessarily pointing back to it. This part of the web, made mostly out of 'storage clumps' and "non hidden" databases (but not only), can be searched and combed with the same searching techniques that we usually apply when searching the Nucleus. These pages are "outside" the nucleus, yet not particularly difficult to find.

Like matter and anti-matter, the "outside linked" pages we spoke of above have an inverse counterpart on the web: the "outside linkers" pages. Indeed all the pages located in this specific area of the web do "point" to the Nucleus but are not pointed to from it. Imagine - as an example - the personal links page of a scientist: lotta interesting links to the Nucleus, yet no need to publicize its existence. A page with information you may need is there, somewhere, without any link whatsoever that could bring you to it. Indeed there are, by definition, no links back from the Nucleus to these "outside linkers".
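
The relation between the Nucleus and its two "outside" areas can be sketched on a toy graph. This is only an illustration, assuming a hypothetical link graph and one page known to sit inside the Nucleus (all page names are made up): pages that both reach that page and are reached from it belong to the Nucleus, pages that are only reached are "outside linked", pages that only reach it are "outside linkers".

```python
def reach(graph, start):
    """All pages reachable from `start` along the edges of `graph` (start included)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def classify(links, core_page):
    """Assign every page to 'nucleus', 'outside linked', 'outside linkers' or 'other'."""
    # Build the reversed graph so we can also walk links backwards.
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    forward = reach(links, core_page)      # pages the core page can reach
    backward = reach(reverse, core_page)   # pages that can reach the core page
    areas = {}
    for page in set(links) | set(reverse):
        if page in forward and page in backward:
            areas[page] = "nucleus"
        elif page in forward:
            areas[page] = "outside linked"
        elif page in backward:
            areas[page] = "outside linkers"
        else:
            areas[page] = "other"
    return areas

# Hypothetical toy web
links = {
    "portal": ["forum"],
    "forum": ["portal", "image-store"],
    "image-store": [],
    "scientist-links": ["portal"],
}

for page, area in sorted(classify(links, "portal").items()):
    print(page, "->", area)
```

Note that the backward walk is exactly what a searcher cannot do from inside a browser: links can only be followed forward, which is why the "outside linkers" need techniques of their own.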
Since no link whatsoever points back to them, the "outside linkers" are a part of the web you cannot reach using "normal" search techniques. Yet these places may hoard knowledge you need. There are, fortunately, some techniques that you can apply in order to find them, the most simple and common one being 'klebing' (using referrers and luring/stalking techniques on top of it).
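
The referrer part of this idea can be sketched in a few lines. This is not the full 'klebing' technique, only a minimal illustration of reading referrers: it assumes a server access log in the common "combined" format, and the host names and log lines are made up. Whenever a visitor follows a link from an unknown page to yours, the address of that page shows up in the referrer field.

```python
import re
from urllib.parse import urlparse

# Tail of a "combined" log line: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')

def external_referrers(lines, own_host):
    """Return the referring URLs that do not belong to `own_host`."""
    found = set()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue  # not a combined-format line, skip it
        referer = match.group("referer")
        host = urlparse(referer).hostname  # None for "-" (no referrer sent)
        if host and host != own_host:
            found.add(referer)
    return found

# Hypothetical log excerpt
sample_log = [
    '10.0.0.1 - - [01/Jan/2006:10:00:00 +0000] "GET /page.htm HTTP/1.0" 200 512 "http://hidden.example.org/links.htm" "Mozilla/4.0"',
    '10.0.0.2 - - [01/Jan/2006:10:01:00 +0000] "GET /page.htm HTTP/1.0" 200 512 "http://www.example.com/index.htm" "Mozilla/4.0"',
    '10.0.0.3 - - [01/Jan/2006:10:02:00 +0000] "GET /page.htm HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]

print(external_referrers(sample_log, "www.example.com"))
```

Here the first line betrays a links page that nothing in the Nucleus points to; the other two lines (an internal referrer and an empty one) are discarded.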
Anyway, first of all learn the basics!

You'll always find what you want

And even when all your searching techniques have failed, when all your cunning approaches did not catch anything, when your lonely searches seem endless, when all your tricks have brought you no reward... even when a rude database dares deny you access, even when your target has been pulled off the web, jailed, destroyed, censored, annihilated by the powers that be... even in those dire moments you will always know that you can find what you are looking for even if it is no longer there!
Yes, as strange as it may seem to you, this is the "second optimistical law of seeking" :-)
How do you find your disappeared target? You travel through time.
How do you travel through time? You take advantage of the fact that EVERYTHING THAT HAS BEEN PUT ON THE WEB ONCE WILL LIVE ON COPYCATTED ELECTRONS FOR ETERNITY, either through the usual & well-known public '[time machines]' (basically huge caches or 'photosnaps' of the web at a given time), or through more 'gray' (and difficult) channels and alleys that you'll discover and take advantage of in due time.
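
As a minimal sketch of the public route, the lines below build the address of an archived copy. They assume the Internet Archive's Wayback Machine as one example of such a 'time machine' (this page does not say which ones are meant); the function names and the target address are purely illustrative, and nothing is fetched.

```python
def snapshot_url(page_url, timestamp="2006"):
    """Build a Wayback Machine address for `page_url` near `timestamp`.

    `timestamp` is a prefix of YYYYMMDDhhmmss; the archive answers with
    the closest snapshot it holds, if it holds any at all.
    """
    return f"https://web.archive.org/web/{timestamp}/{page_url}"

def all_snapshots_url(page_url):
    """Build the address of the calendar listing every stored snapshot."""
    return f"https://web.archive.org/web/*/{page_url}"

# Hypothetical target that has been 'pulled off the web'
target = "http://www.example.com/page.htm"
print(snapshot_url(target, "20060101"))
print(all_snapshots_url(target))
```

Whether a copy exists depends of course on the archive having visited the page before it disappeared, which is where the more 'gray' channels come in.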
Anyway be assured: you are going to be a seeker, you will be able to find everything, even targets that do not exist any more.
Given that everything you put on the web will live forever on copycatted electrons, we can derive a third law...

Plagiarism and Anti-plagiarism

A Plagiarist is someone who 'appropriates' information or thoughts or data he finds on the web without giving due reference to the original Author of the data / thoughts / information he is using, trying to 'give the impression' that he is the real Author of such data / thoughts / information.
Unfortunately (for the Plagiarist) on the web it is as easy to plagiarize as it is to find out the plagiarism, using ad hoc searching techniques.

In fact the web is incredibly 'sticky': try searching for the first line of this very page:
"Knowing how to find anything you want on the web, once you will have learned it, will give you power."
You will most probably (on the slippery web you may always 'slip' :-) land here again.
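
An exact-phrase search of this kind can be sketched as follows. The search address and the parameter name are hypothetical placeholders (replace them with those of whatever engine you use); the only real point is that the snippet is wrapped in double quotes, so that the engine looks for the whole phrase and not for the single words.

```python
from urllib.parse import urlencode

def exact_phrase_query(phrase, base="https://search.example/search", param="q"):
    """Build a query address that searches for `phrase` as one quoted string.

    `base` and `param` are HYPOTHETICAL placeholders, not a real engine.
    """
    quoted = '"' + phrase.replace('"', "") + '"'  # strip inner quotes, wrap in quotes
    return base + "?" + urlencode({param: quoted})

print(exact_phrase_query("give you power"))
```

The same trick, applied to a distinctive sentence lifted from any text, is the simplest way to trace a snippet back to the places where it has been copied.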
This is the reason those that create content should never fear to give it out for free! A correct 'restitution' of the paternity of any given snippet is possible using ad hoc searching techniques (like the ones regarding the 'disappeared web' we have pointed to in the previous chapter).
This is the reason we don't need to fear being plagiarized by self-described 'Super searcher experts', 'internationally respected Internet trainers' and/or 'Revered Authorities on Web search engines', nor by any other of the commercial hot-air balloons that abound on the web (and that you should learn to evaluate, or rather 'devaluate' :-), on your own): they will NEVER be as able as we are in providing NEW content. It would take them a whole life of (heavy) studies just to reach those levels we have already abandoned. Besides, I firmly believe that the very moment you cease to do something for fun, joy or interest and you begin doing it for 'money', your value decreases... :-)
The above explains also - incidentally - the reason you may (and should) want to contribute to THIS site: your work will never be plagiarized without being discovered (and eventually triggering a well-deserved punishment :-)
This is still the realm of the web of old, and no trespassers are allowed from the commercial Sargasso seas.
Your contributions may be used by others, they may be taught, learned, shared, developed or built upon, but they will always remain YOURS: together we'll take care of that :-)
Thus the "third optimistical law of seeking" reads: 'On the web EVERYTHING - and at the same time NOTHING - can be plagiarized' :-)

Ethical problems

Knowing how to find anything you want on the web, once you will have learned it, will give you power. As always, that power can be used for 'evil' deeds and/or for 'good' deeds (let's leave aside, for now, the rather complex question of what would be 'evil' and what 'good', just use your own parameters).
A similar problem arose on my previous 'page of reverse engineering' (1995-2000), a site that dealt specifically with software reverse engineering techniques and tools. The idea was to convert young crackers (i.e. people interested almost only in breaking software protections) into software reverse engineers, something that the [world needs badly], especially given the many [malwares] practices around.
The experiment worked only in part, hence the decision a couple of years ago to freeze that site.

Spreading knowledge on the web is indeed a difficult act of balance.
There is for instance a special section, ported from the previous site, called 'ideale.htm', that was indeed intended as a sort of small 'introduction' to various "lores of destruction" (yet purposely not server-attack oriented). There is indeed enough information (and enough 'angles') there to allow anyone with average searching skills to find out, on other parts of the web, everything he may need to wreak havoc on a server and then some.
Not that anyone would really need help for that: a simple search like [+directory +indexing +bugtraq] will "turn some mighty big stones", as a contributor pointed out some time ago.

This raises an interesting paradox.
Basically it is now clear that those that do learn how to search, which is what this site is intended to teach, will be able to find on the web - if need be - ANY KNOWLEDGE, and thus any one of the million possible exploits, some of them 'so recent that your server administrator won't even understand what's going on'.
Thus, once they have learned how to search, they COULD turn out to be potentially dangerous little script kiddies.
Yet -at the same time- there are reasons to believe (and hope) they won't be 'simple' script kiddies anymore once they realize what incredible power the sheer fact of being capable to search gives them.
No kidding: we came to the conclusion - for me quite unexpected until a couple of years ago - that a good seeker can be MUCH MORE DANGEROUS than a good cracker or a good hacker or a good virus writer or a good reverser, because HE CAN BE ALL OF THEM AT ONCE, lacking their in-depth understanding of their respective fields, but easily compensating for this with his evaluation and searching skills.
That's one of the reasons for our insisting (through the [reality cracking] section and various ramblings) on the ETHICAL side of the coin: may the "dangerous" seekers we contribute to create endanger the enemies of poetry, knowledge, diversity and tolerance and annoy instead the apostles of compelled behaviour, consumerist slavery, monolithic thinking and religious fanaticism (be it under the forms of catholic, jewish or islamic intolerance).
This said, I do believe also that seekers are and will be open-minded by definition, and thus pretty difficult to force into any predefined, pre-digested or 'obligatory' way of thought.
Therefore ethics should kiss their shoulders 'semi-automatically' when they seek.
Seekers will grow ethically for the simple fact they seek, this evolution into an ethical seeker being the "fourth optimistical law of seeking" :-)
In fact, judging from the firewall logs, most attacks seem to come from idiots that DO NOT KNOW how to search... hence they are using common exploits and can easily be stopped (and if necessary directly and immediately punished :-)

Anyway, first of all, learn how to seek!

Smaller laws

The web is heavily skewed to the recent and the trivial
This is unfortunately a consequence of the massive web-presence of 'uneducated' masses of zombies (the "unwashed") and advertisement slaves, who don't know what the web of old was like, don't know how to search and don't understand that everything can - of course - be obtained for free.
These guinea pigs, while running day and night inside their masters' turning wheels (and bogus sites), are being conditioned to pay for everything, books, images, music... gosh, they give out money even for software!

Every time they discover that everything they pay for can be fetched for free elsewhere (and often enough in a better version with higher quality) they have, however, a healthy Satori-like reaction.
Things will/should change (albeit slowly) with the more and more frequent presence of free literature and scholarly works and papers.

Final advices

This introduction is getting far too long... go back to the [entrance] and click on the main logo... your long trip into the lore of searching is about to begin.

As you'll discover perusing my site, there are many more sections. For security reasons some of them, like the [PHP Lab], are located on a different server, while some others will become accessible only once you yourself have gathered enough searching - and hopefully [ethical] - skills.
Yes, there are indeed many pages on this site... you are not compelled to follow any logical path. You may peruse everything at will, you are welcome. Do not be scared, nor paralysed if you don't understand everything immediately, knowledge is like one of the chill white wines bottled in the ancient lagoons I come from: you should sip it slowly and knowingly, else it won't do you no good.

Back to the entrance