Web-searching Session @
FOSDEM, 17/02/2002

(ULB, Brussels ~ Sunday 17 February 2002)

Advanced Web searching approaches
by fravia+

This file dwells @ http://www.searchlores.org/fosdem.htm

Scaletta of this session
Searching for disappeared sites    Two rabbits out of the hat   
Slides    Opera 6.1    Some reading material   

Scaletta
NOT MUCH TO DO WITH FSF
This workshop has -apparently- not much to do with the free software movement, nor with the open source movement, and in this it differs from most of the other workshops at FOSDEM.
Yet -together with my friend Richard Stallman- I think that people working for such worthy aims deserve to know the techniques I will describe today.
Another difference, as you will have noticed, is that I am the only one here who uses a pseudonym instead of his real name. The simple reason for this is that some of the searching techniques I will describe may be seen as "non kosher" in many of our euroamerican copyright-obsessed dictatorships.
ANYTHING
In fact we will see together today how to find ANYTHING on the web.
Indeed the web is so huge that -with almost no exceptions- searching in the correct way will enable you to find anything you may be looking for. Any image, any music, any book, any document, any data, any software (proprietary or not), any newspaper... that has been published in the history of mankind. Whole national libraries are going on line at this very moment in some god-forgotten country in Africa or Asia. Half a million people are putting, in these very 5 seconds, half a million scanned images on some god-forgotten homepage. Another 200 thousand users are uploading, in THESE five seconds, 200 thousand mp3s, somewhere on the web.
Some of these books, of these images, of this music, have never been on the web before. But they will now remain on the Internet for ETERNITY.
NO WAY BACK
Indeed, everything that has been published once will remain on the web for ever and ever (and ever), in copycatted electrons, because the very moment something is there it will be copied. A simple demonstration of this is that you can find things that DO NOT EXIST ANYMORE on the web using one of the many repositories.
One of them is Google's cache, another one is the Wayback machine, but there are many more that will allow you to find data that have been 'pulled off' the web.
SOMEWHERE... WHERE?
Yep, all this stuff is on the web, somewhere, but where?
You will have to use a lot of different techniques and approaches to search the web effectively. As you will see, google, though worthy, is by FAR not the solution for your searches. In order to understand WHY you have to use different tools, you should first of all understand what the web looks like from a searcher's point of view.
Structure
Explain structure
Explain diameter 19: do not despair
There is ONE important thing in this image that I hope you will not forget: the difference between the INDEXED web (a couple of milliards/billions of pages) and the NOT INDEXED web (9 milliards/billions of pages more). So when you are searching with the main search engines, with google, or with fast, or with wisenut, you are limiting yourself to -in the best case- a FIFTH of the web.
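The "fifth" is simple arithmetic. A minimal sketch, assuming "a couple" means roughly 2 billion indexed pages against the 9 billion not indexed (both are the rough estimates quoted above, nothing more precise):

```python
# Rough share of the web reachable through the main search engines.
# Assumption: "a couple" of billion is taken as 2; both numbers are the
# approximate estimates quoted in the text.
INDEXED_BILLIONS = 2
NOT_INDEXED_BILLIONS = 9


def indexed_share(indexed: float, not_indexed: float) -> float:
    """Return the indexed part as a fraction of the whole web."""
    return indexed / (indexed + not_indexed)


share = indexed_share(INDEXED_BILLIONS, NOT_INDEXED_BILLIONS)
print(f"indexed share: {share:.0%}")
```

With these figures the share comes out a little under one fifth, which is why "in the best case" is the right hedge.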
SO, HOW DO YOU SEARCH?
Ok, how do you search for your needles in this huge ocean of commercial hay?
Let's begin with the beginning: usually you do not search a specific target: you search people that have searched that target. If the target has enough signal among the noise you may even search for people that have searched people that have searched for that specific target... :-)
This approach is called COMBING, and is rather effective. But before explaining it, we will have to understand how the MAIN search engines really work, and WHY they are there. Simply stated, these "free" search engines exist in order to grep what you and millions of other users are searching for.
Anonymity - proxies - free homepages - free email addresses - free search engines
EXAMPLE ~ S.E. DIFFERENCES
Let's take an example: you are interested in a specific camera, how to use it, if it is worth using... whatever. Let's say a Nikon F2.
Now, of course, you could search on google for nikon F2:
search?as_q=%22nikon+f2%22&num=100: 8180 results
This is a good, simple query and it is what most people would do. And they may even be happy with it.
Nevertheless a good idea would be to use ANOTHER main engine as well, let's say FAST:
&query=%22nikon+f2%22: 2561 results
Before discussing this, it would not be bad to use at least a THIRD main search engine on such a broad query:
query.dll?q=%22nikon+f2%22: 6807 results
A first thing to understand is that you should ALWAYS use at least two main search engines, when starting a broad query. As you may see if you follow the links above, wisenut, for instance, is more 'asian-centered' than google, which in our case, searching for a japanese camera, would probably be useful.
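The query fragments above are just the phrase "nikon f2" percent-encoded for a URL. A minimal sketch with Python's standard library; only the parameter names as_q and num come from the google fragment above, and the engines' host names are deliberately left out because they are not given here:

```python
from urllib.parse import urlencode


def encode_query(params: dict) -> str:
    """Percent-encode query parameters: '"' becomes %22, a space becomes +."""
    return urlencode(params)


# The quoted phrase asks the engine for the exact words, in that order.
google_fragment = encode_query({"as_q": '"nikon f2"', "num": 100})
print(google_fragment)
```

The same encoded phrase can then be pasted into the query parameter of any other main search engine, which makes the "at least two engines" habit cheap to follow.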
Now a normal user would be happy: Woah, 2000 - 6000 - 8000 results! I may browse for ever just here
In fact, first of all, you CANNOT really see all those results. There is a difference between the number stated by the search engines and the results you may really check yourself.
If you really tried to see ALL those links, you would quickly discover that the engines stop serving results long before the alleged total.
Here is a table I made two months ago, based on another query. As you can see, there is a huge difference between alleged results and results you can investigate:
THE YO-YO Index
(Based on the broad query: "advanced searching")

s.e.           Yo-yo index     real max    middle      3/4         Alleged Total
Google         3,82            999         500         750         26100
Altavista      1,27            400         200         300         31834
Lycos          100%            real!       real!       real!       18240~23940
Fast           17,08 (46,31)   4010        2005        2670        23419 (8853)
Wisenut        0,72            300         150         230         41776
Northernlight  n/a (high)      n/a (high)  n/a (high)  n/a (high)  19344
Hotbot         7,28            1397        700         1050        19200
Teoma          2,57            194         100         150         7540
Excite         43,42           1000        500         750         2303
Yahoo          5,13            677         225         450         13200
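
The table does not spell out how the index is computed, but most of its figures are consistent with "real max as a percentage of alleged total". A minimal sketch under that assumption (the function name is mine, not the author's):

```python
def yo_yo_index(real_max: int, alleged_total: int) -> float:
    """Percentage of the alleged results that can actually be viewed.

    Assumption: index = 100 * real max / alleged total, inferred from the
    table's figures rather than stated by the author.
    """
    if alleged_total <= 0:
        raise ValueError("alleged_total must be positive")
    return 100.0 * real_max / alleged_total


# (real max, alleged total) pairs copied from the table above.
rows = {"Hotbot": (1397, 19200), "Teoma": (194, 7540), "Excite": (1000, 2303)}
for engine, (real_max, alleged) in rows.items():
    print(f"{engine}: {yo_yo_index(real_max, alleged):.2f}")
```

A low index means the engine promises far more than it lets you check: the alleged total goes up, the real one comes back down, hence the yo-yo.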
WHILE WE ARE STILL AT THE MAIN SEARCH ENGINES
Some simple rules:
1. Always use more than one search engine! "Google alone and you will never be done!"
2. Always use lowercase
3. Always use MORE searchterms, not only one: "one-two-three-four, and if possible even more!"
This is EXTREMELY important. Note that -lacking better ideas- even a simple REPETITION of the same term can give you more accurate results:
nikon: 1,410,000 (alleged) results
nikon nikon: 627,000 (alleged) results
If you are interested in this 'pleonastic' stuff, read The epanaleptical approach.

Since we did not do it before, it's time to use more searchterms now. Here is a "better" query for our target (I will use google, but remember -yourself- to use also OTHER main searchengines when broadsearching: you will be amazed by the non-overlapping results).
&q=%22nikon+f2%22+nikkormat
Interesting, eh?
Now look at this query:
q=%22links.html%23nikon%22
You may not recognize the querycodes above... it is just "links.html#nikon". We are already slowly moving away from simple main search engine searching towards combing. In fact I was searching for pages of links that are of interest for my query. I can go further:
"nikon.htm" OR "nikon2.htm"
"nikon*.htm"
Another approach: +nikon nikkor +photo resources
As you see, it's commercial-infested... there is some need for our yo-yo here
You get the idea.
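Querycodes such as %22links.html%23nikon%22 can be decoded mechanically: %22 is a double quote, %23 is #, and + stands for a space. A small sketch using Python's standard library (the helper name is mine):

```python
from urllib.parse import unquote_plus


def decode_query(fragment: str) -> str:
    """Turn a percent-encoded query value back into readable text."""
    return unquote_plus(fragment)


print(decode_query("%22links.html%23nikon%22"))
print(decode_query("%22nikon+f2%22+nikkormat"))
```

Reading other people's query strings this way is a quick method to learn which operators and phrases an engine accepts.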
I could also use the netcraft trick: do we happen to have many "nikonsites"?
Woah... 800 site NAMES contain the word nikon! (But many of them will be dormant.)
BUT THE REAL DIFFERENCE
But the real difference between the simple queries we have made above and a good seeker's approach is that above we are still just "skimming", or only slightly touching, the relatively small INDEXED part of the web. We are still missing 4/5ths of it! That's the reason you will have to learn at least some rudiments of combing.
The first -simple- combing approach (remember: searching those that have already searched) is to use old glorious USENET!
Usenet
Messageboards
Homepages
Webrings
Local searching (spanish search engines - buscadores hispanos)
getting at the target from behind: netcraft, synecdochial searching.
GRAN FINALE
Guessing
Passwords through google
Database accessing (politically correct)
brute forcing? Guessing / Searching
Bots searching, scrolls, wands
Software reversing: commercial bots capering
ANY QUESTIONS?
Now, at the beginning of our workshop I told you that you can find ANYTHING on the web. My experience has taught me that there is almost always -unfortunately- ONE sad exception. It is almost always next to impossible to quickly find the curious targets that people ask for at the end of my workshops, so do not ask me to find something specific for you now...



SEARCHING FOR DISAPPEARED SITES

http://web.archive.org/collections/web/advanced.html ~ The 'Wayback' machine, explore the Net as it was!


Visit the 'Wayback' machine at Alexa.
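
A lookup in the Wayback machine can also be written directly as a URL. The sketch below assumes the archive's public address scheme, http://web.archive.org/web/<timestamp>/<original URL>, where * in place of the timestamp lists all stored copies; that scheme is not given on this page, and the helper name is mine:

```python
def wayback_url(original_url: str, timestamp: str = "*") -> str:
    """Build a Wayback machine address for a page that may have vanished.

    timestamp: digits such as "2002" or "20020217" (YYYYMMDD), or "*"
    to ask for the list of all captures. Assumed URL scheme, see above.
    """
    if timestamp != "*" and not timestamp.isdigit():
        raise ValueError("timestamp must be digits or '*'")
    return f"http://web.archive.org/web/{timestamp}/{original_url}"


print(wayback_url("http://www.searchlores.org/fosdem.htm"))
print(wayback_url("http://www.searchlores.org/fosdem.htm", "2002"))
```

The helper only builds the address; whether a copy exists still has to be checked by opening it.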


Alternatively learn how to navigate through [Google's cache]!

NETCRAFT SITE SEARCH

(http://www.netcraft.com/ ~ Explore 15,049,382 web sites)

VERY useful: you find a lot of sites based on their own name and then, as an added convenience, you also discover immediately what they are running on... verbum sapienti sat est eheh, I mean... a word to the wise is enough...
Search Tips
Example: site contains [searchengi] (a thousand sites eh!)
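
What a "site contains" search does can be pictured as a substring match on host names. A toy sketch; the host names below are invented for illustration, and this is not a description of how Netcraft implements it:

```python
def sites_containing(hostnames, fragment):
    """Return the host names whose text contains the fragment (case-insensitive)."""
    needle = fragment.lower()
    return [host for host in hostnames if needle in host.lower()]


# Hypothetical host names, made up for this example.
known_hosts = [
    "www.searchengine-watch.example",
    "nikonfans.example",
    "SearchEngines.example",
    "photo.example",
]
print(sites_containing(known_hosts, "searchengi"))
```

A truncated fragment like searchengi is useful precisely because it catches both the singular and the plural in one pass.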





RABBITS (out of the hat)

Searching entries 'around the web', no specific target, using 'common' passwords:
For instance: bob:bob

Searching entries to a specific site (not necessarily pr0n :-):
For instance: "http://*:*@www" supermodeltits


The above is not 'politically correct', is it? But it works. And speaking of "political correctness", some of you will love the Borland hardcoded password faux pas... Databases are inherently weak little beasts, duh, quod erat demonstrandum.

SLIDES

(FOSDEM, 17/02/2002)

How big?

Structure

Kosher - non kosher

Web coverage, short term, long term

Main search engines coverage

Opera 6.1 (windoze opensourcing)
Traditionally I conclude my workshops by examining ways to remove advertisements from this FANTASTIC Web-browser, which beats hands down any other browser on the market.
You'll find a Linux version at
http://www.opera.com/linux/, where you can either use the ad-infested version or purchase the non-ad one.
Below I'll explain to you how to eliminate advertisements in Opera 6.1 for windoze, but I would advise you to purchase, as I did, your own version, and help slick, good and fast Opera against the awful Netscapian and Microblowian browsersaurii.

I do not believe that the following information will damage Opera, on the contrary: