One thing (among many others) that the unaware learn from your site is that the search engines don't index all the web (far from it...), and that they have serious limitations. Nevertheless, I didn't find much information on your pages about these limitations, other than database capacity limits. So, I've decided to write this short essay about this (major) problem.
 
  PS : My reference for the current number of Apache servers can be checked here : 
   http://www.netcraft.com/survey/
  Microsoft's servers lose the competition... as they deserve!
Rumsteack
 
    
 Title : Search engine limitations, or What search engines don't index, and why?
  These few lines should be useful for people who don't know the problems the 
  s.e. encounter when indexing web sites. It would be easy to think that 
  ALL the data on indexed pages is indeed searchable, because it would be easy 
  to think that submitting a site to a s.e. necessarily gets ALL of its 
  content indexed. Too easy a thought, actually: that's not true at all!
There are several things which can prevent some parts of a site from being 
  indexed by a search engine spider.
  First, the 'voluntary' things: you are aware of what you are doing. In this 
  category, we have the possible use of robots.txt, the robots meta-tag and 
  special 'stop indexing' commands. Consequently, you see that some private 
  parts (generally, for obvious reasons, the most interesting ones) of a site 
  can easily be protected (from the search engine spiders, but also from YOU).

  - robots.txt
    This text file was set up in 1994, following some complaints from the 
  sysadmins about web spiders indexing 'indelicate' matters. Each of its lines 
  has the general form 

      <field>: <optionalspace> <value> <optionalspace> 

  with one line per bot/directory. Special commands are allowed, but 
  unrecognised fields are simply ignored. 'User-agent: *' addresses all bots 
  at once, and of course multiple combinations are allowed: you can even think 
  of sites which are "racist" toward some specific bots and not toward others 
  (for example : User-agent: Bot I like). We will come back to the exact 
  fields, and to a full example, a bit further down.

  - robots meta-tag
    Note that not every spider obeys this tag. Still, it is one of the few 
  ways a web-page AUTHOR (no server admin needed anymore) can prevent some 
  pages of his site from being indexed.
    This meta-tag (like all meta-tags, by the way) should be placed in the 
  head section of your page. It is one line, like this:
<meta name="robots" content="">     The content value can be : "all", "none" or mixes.    
"index" mean that the page will be indexed.
    "noindex" is the contrary.
    "follow" means that the spider will follow the links present on the page.
    "nofollow" is the contrary.     "all" is equivalent to "index, 
follow". It is also the default value, if 
  robots meta-tag is not present.
    "none" is equivalent to "noindex, nofollow".   
    So, when searching for specific URLs (url: or u:), keep in mind that if 
  the search engine you use obeys this meta-tag, you can't reach a page which 
  has been protected with "noindex".

  - special commands
    These commands enable the author to choose what he wants to be indexed on 
  a page. For example, he can choose to keep some passages invisible to the 
  search engines (for whatever reason!).
    Nevertheless, the only command I know is the one capable of stopping 
  Infoseek's spider. Here it is :

      <!--stopindex--> Write whatever you want here <!--startindex-->
    Authors are consequently able to choose what they want indexed or not. 
  All the *special* words (mp3, appz, warez, hack, commercial crap, smut sites 
  nuking, destroy commercial bastards, :-D ) can be removed from the 'field of 
  vision' of the search engines, even if the page itself is indexed. This is 
  VERY useful to know when searching for - ahem - special content.
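    To make this concrete, here is a minimal sketch (a made-up page; remember, 
  only Infoseek's spider was known to honour these comments) of how the trick 
  is used inside a page:

      <body>
      <p>This paragraph gets indexed normally.</p>
      <!--stopindex-->
      <p>mp3, warez and other *special* words live here, unseen by the spider.</p>
      <!--startindex-->
      <p>Indexing resumes from this point on.</p>
      </body>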
  Given that my little knowledge of this lore ends here, let's talk about the 
  other things that our beloved tools don't like. That is, let's talk about 
  the content they CAN'T index. In fact, spiders can not only be blocked by 
  admins/authors through the above procedures, they also have problems when 
  they bump into the following things.
  I'm going to talk about the ones I know, but you should consider what 
  follows ONLY as a base to begin your own future works and researches in 
  this field. 
  - Domain or password restrictions
    The most commonly known method is the .htaccess file.
  Within this file, you set up a list of the resources you do not want 
  everyone to be able to see. In other words, you create a database of logins 
  and passwords, associated with the different parts of your site.
    The use of this file was introduced with the NCSA servers, and continued 
  with the Apache servers. So, given that there are 7.8+ million Apache 
  servers in use, you see the huge quantity of data which can be blocked from 
  indexing, just with this (basic) method.
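    Purely as an illustration (the realm name and paths are made up), such a 
  protection usually boils down to an .htaccess file of this kind, pointing 
  to a password file created with Apache's htpasswd utility:

      # everything in this directory now asks for a login/password
      AuthType Basic
      AuthName "Members only"
      AuthUserFile /home/somesite/.htpasswd
      Require valid-user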
 I wrote 
    "Just with this method" because there are many others: SQL and other 
  databases have similar possibilities as well.
Moreover, Javascript offers some blocking possibilities (yes, it is possible 
  to protect a site with vanilla Javascript: see the 'gate keeper' scripts). 
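    For the record, a classic 'gate keeper' works more or less like this (a 
  minimal sketch, not any particular site's script): the password is never 
  written anywhere in the page, it simply IS the name of the hidden page, so 
  reading the source code tells you nothing:

      <script language="JavaScript">
      <!--
      // the password itself is the name of the protected page
      function gate() {
        var pwd = prompt("Enter the password:", "");
        location.href = pwd + ".htm";  // wrong guess = 404, right one = the hidden page
      }
      //-->
      </script>
      <a href="javascript:gate()">Members entrance</a>

  Which also tells you how to attack it: guess, or harvest, likely page names.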
In all these cases, you will need some 
  cracking/hacking skills if you want to make these sites more *cooperative*. 
    Now, let me come back for a moment to the robots.txt file we met in the 
  first part, because it deserves a closer look. With it, a site-owner can 
  indicate which parts of his site he doesn't want the well-behaved spiders 
  to index.
    This file is placed in the root directory of a site (that's the only 
  place where the robots look for it).
    Actually, when a well-behaving robot visits a site, it first looks for 
  the /robots.txt file. Provided it finds one, it will obey it and won't 
  index any of the pages the file forbids.
    Nevertheless, keep in mind that you can write your OWN bots, which you 
  can make as 'irresponsible' as you like (or you can use utilities like 
  Teleport Pro, for example, with its "obey robots.txt file" box unchecked !!)
    The content of this file (which can be viewed with your browser, by the 
  way... see where I want to go? :-D ) consists of lines which all look like 
  this (the field name is case insensitive, but the value is case sensitive !):

      User-agent: the name of the spider(s) you (don't) like
      Disallow: the directories (or the files) you do not want the previous 
                bot to index
      Disallow: / forbids the whole site, while an empty value allows the 
                whole site
      #: for inserting comments in the file
    For example:

      User-agent: Bot I like
      Disallow:

      User-agent: Bot I dislike
      Disallow: /

      User-agent: *
      Disallow: /
    A site with a robots.txt file of this kind can be a pain for the search 
  engines. BUT, since you can see the file with your browser, you are able to 
  find out which parts of a site are "restricted" with no effort whatsoever!
    That's also finished for this robots.txt digression; back to the things 
  the spiders have problems with.
  - HTML errors
    Search engines parse HTML files, just like your browser does. But the 
  s.e. may not be as forgiving of the errors that your browser would gladly 
  let go by. A syntax checker, or better, a good knowledge of HTML, can be 
  helpful to avoid these mistakes, by the way (or to use them on purpose :-D ).
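    One classic slip (or trick) along these lines is the unclosed comment: 
  depending on the parser, everything that follows it may silently be treated 
  as part of the comment. A contrived example:

      <p>This paragraph is visible and indexable.</p>
      <!-- oops, this comment is never closed
      <p>Depending on the parser, this text may never make it into the index.</p>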
  - Special file formats
    There are certain kinds of files that the search engines simply don't 
  read. Actually, if a plug-in or an add-on program is needed to read a file, 
  then the search engine will (probably, that's not true for all of them) 
  ignore it.
    Some examples are PDF, PostScript, RTF, Word, Excel and PowerPoint files. 
  Searching on the links pointing to such files, rather than on their 
  content, can be of some help here. 
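    For instance (a made-up query, to be adapted to the syntax of your 
  favourite s.e.): the s.e. has at least seen the pages that link to these 
  files, so you can hunt for the linking pages instead,

      +cryptography +".pdf"

  and then follow the links by hand.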
  - Size limits
    Though theoretically the search engines offer full-text indexing, 
  practically there is an upper limit on file size, beyond which the robots 
  simply stop indexing a file. 
  - Frames and image maps
    These methods of 'enhancing' a site give the s.e. trouble: frames and 
  image maps are not indexed by all of them. So, the content (including 
  links) put into image maps and framed sites is not scanned by all of the 
  search engines. 
  But maybe this is not so bad after all! 
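    To see why, here is a minimal, made-up frameset page: the URL that gets 
  indexed contains almost no text of its own, since everything lives in the 
  framed documents, and only the <noframes> section (if the author bothered 
  to write one) gives the spider something to chew on:

      <html>
      <head><title>My framed site</title></head>
      <frameset cols="20%,80%">
        <frame src="menu.htm">
        <frame src="content.htm">
        <noframes>
          <body>The text and links placed here are all many spiders will ever see.</body>
        </noframes>
      </frameset>
      </html>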
  - Dynamic pages
    On some servers most of the web pages are dynamic (produced from 
  databases or other applications, in response to a request), and they do not 
  exist as static HTML files. Consequently, the search engines CANNOT index them.
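    A typical address of this kind looks something like the following (a 
  made-up example); everything after the "?" is a request that only exists at 
  the moment a visitor makes it:

      http://www.somesite.com/cgi-bin/catalog.cgi?section=essays&id=42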
    To better understand the problem, we can say that the URLs for database 
  access that contain a "?" symbol in them are, as a rule, not indexed by the 
  s.e. I hope you see what a huge amount of information you are losing here!! 
    That also explains why you must use specific search engines when 
  searching for specific subjects: the big s.e. don't have any access to the 
  specialised databases that exist (what is called 'the Invisible Web'). So, 
  different targets, different tools, as someone I don't recall said :-D 
    I hope you have enjoyed this little text, and I hope you are now aware of 
  the REAL limitations of the s.e. (that is, not only their size), which 
  oblige us to find OTHER tools for our searching purposes (combing and the 
  like). 
Rumsteack, May 2000
 
    
