Some basic must know - 1: Terminology

Hey! You are allowed to continue without reading the following stuff only if you are really sure that you already know these basic matters...
(Note: when a URL is missing the document name, the server will automatically deliver any document named "default" or "index" found in that directory.)
A URL (Uniform Resource Locator) is the mechanism used for Web addresses. URLs are used in Web browsers to find the location of a particular Web page. A URL consists of three main parts: the protocol, the host name and the directory location. First comes the protocol, usually "http" (Hypertext Transfer Protocol) or "ftp" (File Transfer Protocol).
Take for instance http://www.searchlores.org/milano/milano.htm: the protocol comes first, followed by a colon and two slashes. In this case it is the HTTP protocol, which means a Web page is at that location. The next portion is the Internet host name, www.searchlores.org.
Somewhere on the Internet (here) is a system with that name, with a corresponding
IP numeric address provided by the Internet DNS service (see below).
The last portion is a specific document inside a directory structure, in this case /milano/milano.htm. Since a Web server will typically host many different Web pages in multiple directories, the URL provides a way of specifying where to look.
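If you want to see this dissection done mechanically, here is a minimal sketch (I'm assuming Python as the snippet language throughout these notes):

    # splitting a URL into its three main parts - a minimal Python sketch
    from urllib.parse import urlparse

    parts = urlparse("http://www.searchlores.org/milano/milano.htm")
    print(parts.scheme)   # 'http'                - the protocol
    print(parts.netloc)   # 'www.searchlores.org' - the host name
    print(parts.path)     # '/milano/milano.htm'  - the document location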
Remember that you can almost always retrieve a page EVEN IF ITS URL HAS CHANGED, provided you use an exact portion of the text of that given page.
Try for instance to retrieve this very page using
google
or fast :-)
Internet Protocols ~ IP
The most fundamental protocol is called IP, for 'Internet Protocol' (duh).
On a network, the common term for a location is 'address', and each system on the Internet has an address. Since you are on the web right now, in order to read this, you are automatically a node of the web and have an address as well. You can check your current address using the C:\Windows\Winipcfg.exe utility (if you run windoze).
This 'IP address' has many possible formats, an often overlooked fact which is quite relevant for seekers, as you'll understand in due time.
Internally, each computer system uses an IP address composed of four numbers, usually written for humans with dots between them. An example numeric IP address is '209.103.174.104' (an older IP address of my main site). However, since it's easier for humans to remember names instead of numbers, most IP addresses have corresponding names, also separated by dots. The previous address, written as a name, corresponded to 'www.searchlores.org'.
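To give you a first taste of those "many possible formats": the four dotted octets are just one human-friendly spelling of a single 32-bit number, and some browsers may accept the other spellings too. A minimal Python sketch (the conversions are exact arithmetic; the searching uses are left to you):

    # one IP address, several formats: dotted quad, plain decimal "dword", hex
    import socket, struct

    dotted = "209.103.174.104"
    dword = struct.unpack("!I", socket.inet_aton(dotted))[0]  # 4 octets -> one 32-bit int
    print(dword)        # 3513233000  -> a browser may accept http://3513233000/
    print(hex(dword))   # 0xd167ae68  -> or http://0xd167ae68/
    print(socket.inet_ntoa(struct.pack("!I", dword)))  # back to '209.103.174.104'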
Scattered throughout the Internet are dedicated systems with the responsibility of
translating Internet name addresses into the IP address numeric form. These
systems are called 'name servers'. Another term used in conjunction with Internet name addresses is 'host name', because every Internet address must correspond to a hosting computer system somewhere on the Internet. The systems that provide IP name-to-number translation are called 'Domain Name Servers', or DNS.
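You can perform such a name-to-number lookup yourself; a one-line Python sketch (the answer depends, of course, on the DNS servers you query and on the current state of the domain):

    import socket
    # ask your configured DNS for the numeric address behind a host name
    print(socket.gethostbyname("www.searchlores.org"))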
In Windows there is a HOSTS file that acts as your 'personal DNS server': before accessing any remote computer through a name-to-address lookup, Windows checks the HOSTS file to see if the name has ALREADY been defined as an address there. You can edit your HOSTS file in order to block annoying advertisements.
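A couple of sample lines (the blocked host names below are invented placeholders; the file itself usually lives in C:\Windows\ on Win9x and under system32\drivers\etc\ on the NT flavours). Pointing an ad-server name at your own machine makes its banners simply fail to load:

    127.0.0.1    localhost
    # invented ad-server names, for illustration only:
    127.0.0.1    ads.annoying-banners.example
    127.0.0.1    counters.spying-tracker.example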
Request for Comments (RFC)
If you want to learn more about these things NOTHING can be better than RFCs.
The Internet Request For Comments (or RFC) documents are the written definitions of the protocols and policies of the Internet. In fact every aspect of the Internet is documented in these documents. RFCs are usually in plain text form.
Once an RFC is published, it is not changed. However, it may be "obsoleted" by later work. Unfortunately, there is no easy way to "browse" RFCs to discover which RFC is the latest on a particular topic, although there are various Web indexes which can be useful and where you can use each server's specific form to search for specific terms.
Some basic must know - 2: Search engines' limits and annoyances
One of the amazing characteristics (and probably advantages) of the Web is that it is not indexed in any standard manner. As a consequence, retrieving information can seem difficult. Commercial search engines à la google, fast, teoma or altavista, the most popular tools for locating web pages, return a huge mass of results: a general query risks retrieving thousands of pages, many of them completely irrelevant, most of them quite off-topic. The problem is that, often enough, you simply get too many results.
All search engines crawl the Web and log in their databases
all words from the
web pages they have gathered. Some search engines, like google or
baidu, even mirror
all pages in their cache, thus allowing you to find even 'disappeared' or "lost" pages. These repositories,
incidentally, make it almost impossible to "retire" something, once put on the web.
I have conducted a personal experiment, trying to "pull" off the web my old site, oriented to software reverse engineering (which I have 'deprecated' :-), and found it next to impossible. In fact, once you publish something on the web, if it has some original content, or if it is of some interest, "it goes forth and multiplies". Back to the crawling search engines with terabytes of compressed texts: once you start logging words from hundreds of millions of pages, the results of an (unstructured or "simpleton") query can be overwhelming.
Without searching knowledge,
and without a clear strategy,
using a search engine is like wandering aimlessly in the dark, without
spectacles and light, in the stacks of
a poorly organized library trying to find a particular book.
A good Seeker's supper should not smell too much.
See, the problem is not only how to find your info, but also, and foremost, how to evaluate it. Imagine your search gives you "only" 200 - possibly valuable - results. That's way too many for effective human evaluation purposes. Let us quickly demonstrate this:
let's say you have a good "zen" perception and it takes you just
half a minute to quickly "diagonally scan" a
page with your eyes~brain to "feel" if it is worth keeping~reading or not... that
makes 100 minutes for that single query! More than one and a half hours smelling possibly stale
pages! Clearly that's not a
very effective searching approach. For this very reason you should master some techniques that will allow you
to drastically reduce the
number of fishes you'll have to smell before choosing the ingredients for
your seeker's supper.
Robot-driven search engines can be defined as search engines which use a bot (aka "spider", "worm", "wanderer" or "crawler") to automatically collect sites for their index. They are different from subject directories/trees, which are hierarchical and rely on people to add sites to their index. No search engine can actually cope with the exponential growth of the web. The best engines (google, fast and teoma) actually cover less than 20% of the texts, images, programs and sounds present on the web. You'll have to use advanced searching techniques like your own bot writing, local digging, going regional, combing, luring and klebing to (try to) get at the remaining 80%.
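Apropos "own bot writing": here is a deliberately minimal spider sketch, in Python rather than in rebol (assume a real bot would also honour robots.txt, throttle its requests and parse HTML properly):

    # a tiny breadth-first crawler: fetch a page, harvest its links, follow them
    import re
    import urllib.request
    from urllib.parse import urljoin
    from collections import deque

    def crawl(start_url, max_pages=10):
        seen, queue = set(), deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("latin-1", "replace")
            except Exception:
                continue                               # dead or unreachable: skip it
            for href in re.findall(r'href="([^"#]+)"', html, re.I):
                queue.append(urljoin(url, href))       # naive href extraction
        return seen

    print(crawl("http://www.searchlores.org/"))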
The main search engines are in fact not following "uncharted" links any more: the task of keeping the links they have already gathered UPDATED is difficult enough (as the many 404s you'll encounter testify). They now spider only "on submission".
This said, you should by all means still use the main search engines (which on average overlap for only about 50% of their coverage; therefore you may well find on one engine something that another engine does not show). Be aware of the algos they use (highly variable from engine to engine, as explained in detail elsewhere on my site): they are mostly based on common parameters. For keyword occurrence, for instance (a toy scorer is sketched after this list):
Order of appearance of keyword terms
Frequency of keyword
Keyword in title and metatags
Funny or rare keywords
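A toy illustration of how such parameters might be combined (the weights are invented; no real engine publishes its actual formula, and each engine's differs):

    # toy relevance scorer: frequency + title bonus + early-appearance bonus
    import re

    def score(page_html, keyword):
        text = page_html.lower()
        kw = keyword.lower()
        freq = text.count(kw)                              # keyword frequency
        early = 0 <= text.find(kw) < 500                   # appears near the top?
        title = re.search(r"<title>(.*?)</title>", text, re.S)
        in_title = bool(title and kw in title.group(1))    # keyword in title
        return freq + (5 if in_title else 0) + (2 if early else 0)

    print(score("<html><title>search tips</title>search, search...</html>", "search"))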
Take into account, however, the fact that all main search engines are purposely spammed through the many tricks used to "get on the top positions", which -once more- depend on the specific algorithms used by each search engine. The most common spamming techniques are:
Overuse
or repetition of keywords
Use of keywords that do not relate to
the content of the site
Use of fast meta refresh
Use of coloured text on same-color background (a very old trick)
Duplication of pages with different URLs
Use of different pages that bridge to the same URL
Creation of dynamic pages on the fly as an answer to your search (one of the worst spamming techniques)
This -for many "broad" queries- means that people should actually simply JUMP the first twenty to thirty results (SERPs) and start the results' evaluation directly from page 4 downwards :-)
Note moreover that all main search engines are now beginning to sell "slots" and to allow some "jolly" answers -based on whatever query you have made- to figure in the SERPs, thus pushing some sites into the first positions. One more reason to jump to page 3 or 4: screw commercial spammers and faked algos.
Finally, never forget that the very reason for someone SETTING UP a search engine "for free" (sic) is to log ALL your queries in order to sell such data to third parties (see [seamara2.htm]). Therefore you should by all means be quite concerned with all sorts of anonymity matters and countermeasures.
Some basic must know - 3: When to use what (directories and search engines)
You can and will learn to use effectively the various search tools on the web,
but don't forget that the most important part of searching for information
happens before you even get online.
It helps a lot to know which tool to use when,
and that all depends on what kind of question you're
asking or what type of information you're looking for.
I mean the "when" literally: the time of day can affect results. It is often useful to avoid intensive searches when euroamericans are [awake].
Normal/Advanced inconsistencies & quirks
Remember also that most search engines have a "normal" and an "advanced" search mask, and that the "advanced" search mask gives DIFFERENT RESULTS vis-à-vis the normal one. If you try an altavista query for [+how to search +hints] on the [normal] form you'll get 1999 pages (September 2000); try the same query on the [advanced] form and you'll get only two! Note that search engine quirks will give you eight pages on the advanced form if you invert the query string to [+hints +"how to search"]: a 400% difference on the 'advanced' form that does not show up on the normal one: [+hints +"how to search"], where with this chiasmatic inversion you'll get just the same pages as before.
Directories and search engines
As almost everyone knows, there are two main (and quite different) search tools on the web: directories and search engines. Directories like [yahoo], [Open directory] and [LookSmart] are CATEGORIZED lists, with brief descriptions of the sites. Categories are based on submissions by web site owners, edited by more or less capable, mostly volunteer editors. Directories are good when you want prompts to guide you towards your signal, or when you want to go on a surf trip... when you need specific information quickly you'll be better off with a search engine. See: modern search engines index ALL words on the pages they register, and are therefore very useful if you know how to squeeze what you need out of the noise.
Let's seek this very page on fast/alltheweb... let's choose -purposely- a very RARE sequence of words, one that I have written some lines above... you'll immediately see how this page jumps out of a billion sites :-) "where with this chiasmatic inversion you'll get". The same works with google as well: "where with this chiasmatic inversion you'll get". The reason you should choose a RARE sequence of words is obvious. The web is the contrary of a library: the more rare and esoteric your quarry, the easier it is to find. Spelling mistakes are an added bonus: "you will get" will give you millions of pages, "you woll get" will give you only one page (one at the time of writing this... probably two by the time this page is indexed anew).
Of course directories and search engines are by no means the only options you have for searching your quarries. You should choose your search tools by "categorizing your question":
Broad Topic
"What's out there on combing?"
"What's out there on searching strategies?"
"Can I get info about the actual web dimensions?"
Directories + Search engines + newsgroups + personal pages
Uncommon Hunt
Specific and unusual or unique
"I need info on the algos used by a specific search engine"
"I'm looking for source code I could re-use for my searching bots"
"I heard about a brand new searching algorithm"
Search Engines + Meta-Search Engines + newsgroups + messageboards + maillists + personal pages + your own bots
"Societal searching"
Anything people "are talking about", when a quick answer is needed in that field
"New tips for searching warez"
"Has anyone made new searching bots in rebol?"
"I just downloaded a new version of Ferret and it does not seem to work"
Always remember: Before plunging into the interesting (and
often startling) world of web searching,
ask yourself "Could someone have already done this work for me?"
[Combing], i.e. searching - inter alia - people that
have already searched the very information you are seeking, can give
extremely interesting results. Messageboards, private pages and
the whole usenet world are the seas you'll troll around when combing. In some cases the troll is meant in its REAL sense: read the
[trolling
FAQ] if
you don't know that a
troll is simply a posting on
Usenet or on any messageboard designed to attract predictable responses. And this can in some cases be used to gain some extra-knowledge: see the ad hoc
[trolls page] if you are interested.
Focusing on specific needs
The best
way to look at Internet resources is through the focus of specific
needs. Otherwise, you can spend a lifetime drifting through archipelagos of fascinating,
but ultimately fruitless links.
A good rule of thumb is that if in less than 15 minutes you don't at least approach what you are searching for, you had better revise your search strategy.
So "approaching" needs a definition.
A search session can be divided in 5 phases:
Preparation (layout of the search strategy)
Start (or "broad") phase (signal weak)
Refining (main corrections)
Approaching (small corrections: signal very strong)
Closing in (the last efforts, no more noise on the signal)
With the "Dah-daa! Bingo I found my target!" as ultimate aim, of course.
Each of these phases has specific characteristics and needs specific knowledge,
that I try to explain elsewhere on this site.
Yet there are some "global" parameters that are valuable for ALL phases, the
most important one being, possibly, the capacity to remain "on track" during
your session.
Once more, as Sielaff said long ago about footnotes and bibliography:
"Focus always on your specific
needs. Otherwise, you can spend a lifetime drifting through archipelagos of fascinating,
but ultimately fruitless links". I don't need to underline how true this is for Internet
searching as well...
Imagine yourself as a 'blade', cutting through millions of useless "pudding sites" to find the few "rosinen" you are looking for. DO NOT GO ASTRAY!
If you are looking for -say-
some scripts in rebol that you badly need to ameliorate your own web searching bots
you SHOULD NOT stop in order to
read some fascinating info about IRC-bots,
even if you never saw anything as good elsewhere about this subject
even if these techniques could
(eventually) be of some
use,
even if you are genuinely interested in that stuff.
The above reasons do not matter! Do not listen to
Sirens: they are
luring you in order to crush your search vessel
onto the Scylla & Charybdis of all searchers! For crying out loud: wake up!
YOU ARE NOT searching for
IRC-bots, you are searching for rebol-scripts! Write your task in
big red letters on a yellow "post-it" and
stick it in the middle of your screen (next to the counter that is
counting down your 15 minutes maximal allowed search time :-)
This is more important than most newbie seekers believe. The problem is that surfing and browsing the internet is quite seductive. You continue to find tidbits of interest. In fact, no matter how accurate your query was, a significant percentage of the search results you obtain will not correspond to what you were looking for. Oh, you'll get results fairly quickly; the real time-wasting problem is the document review phase. This is the reason you should spend more time on the query formulation: you'll save MUCH MORE time during the results evaluation phase. We'll see together how you can automate part of the evaluation task (thus saving a huge amount of time).
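To make the idea concrete, here is a minimal sketch of such an evaluation helper (an assumed workflow, not a finished bot): fetch each candidate result and keep only the pages that contain ALL your must-have terms, so that far fewer fishes ever reach your nose.

    # pre-filter a list of result URLs before any human "diagonal scanning"
    import urllib.request

    def keep_relevant(urls, required_terms):
        kept = []
        for url in urls:
            try:
                text = urllib.request.urlopen(url, timeout=10).read().decode("latin-1", "replace").lower()
            except Exception:
                continue                    # dead link: one 404 less to review
            if all(term.lower() in text for term in required_terms):
                kept.append(url)
        return kept

    # e.g. boil 200 candidates down to the handful worth half a minute each:
    # print(keep_relevant(candidate_urls, ["rebol", "bot"]))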
Admittedly it can at times be quite interesting, from a seeker's point of view, to investigate WHY you got some queer results (one of the assignments inside [lab1] requires you to search for a latin sentence, a specific search that usually gives some "funny" results), but in general you should imagine yourself (and your browser~vessel) as a sharp "blade" cutting through sites rather than as a lazy butterfly surfer drifting from link to link.
Let's imagine you are interested in some of the goodies that
the Massachusetts Institute of Technology
may have about searching... well, you can limit your search to pages at MIT only: +host:mit.edu +"search tips" for instance (or whatever other combination you may fancy).
Try it! [+host:mit.edu +"search tips"]
This page's special Google-tip
eliminating redundancy when searching
filetype is a very useful option in order to eliminate some noise when searching...
Try this kind of filtering!
[+fravia -filetype:htm]