~ Yahoo!/Inktomi's search syntax ~

No one lights a candle and hides it under a bushel, except Yahoo!

First published @ Searchlores in February 2004 | Version 0.03 September 2005 | By Nemo

Version 0.01 is still on http://www.searchlores.org @ inktomi.htm (note the missing l in ".htm")


¤ Introduction ¤ Inktomi unveiled ¤ Inktomi's syntax ¤ References ¤

Introduction

Inktomi (the search engine powering Yahoo!) is one of the best search engines out there. Unfortunately its search syntax is not well documented, which is a pity, because Inktomi offers one of the richest search syntaxes, with lots of unique features and a ranking algo which often works quite well. The purpose of this essay is precisely to document Inktomi's search syntax and to provide examples showing its usefulness. To that end I read old HotBot's search FAQs and the search FAQs of other Inktomi web partners. The core syntax found in them was expanded using search engines and the WayBack Machine. Finally, additional search syntax was guessed from the source code of old HotBot's advanced search pages: feature:homepage, originurlextension: and stem:.


Inktomi unveiled

Inktomi is the search engine behind Yahoo!'s search box. By 2004 Inktomi had two databases: WebMap and Web Search 9.

I think that WebMap's database is now searchable, because Yahoo! claims to have 19.2 Billion documents in its database (I think that this figure is credible, as we shall see later), and that they still use this database to distill a subset (corresponding to the Web Search 9 database) containing the best documents.

Inktomi relies on title, keywords (meta tag) and text to sort search results (cf. [4] page 23). The use of meta tags is a good idea: webmasters know for sure what their pages are all about! Inktomi has grown wise and is no longer vulnerable to long-title spam, as the following example shows: title:hotel title:hotels (you are asking for trouble when you search for pages whose title contains several word variants including plural, singular, and tense, because this is the favorite trick used by spammers and SEOs to boost their pages). Inktomi also relies on anchor text to sort search results: like Google, it is vulnerable to Google bombs, as the following queries show:

french military victories,   miserable failure   and   weapons of mass destruction.

Anchor text and the content of the keywords and description meta tags are as important as the page's title and serve the same purpose from the seeker's point of view. Maybe one sunny day Inktomi will also offer the inanchor: operator (as Google and MSN already do:) and the keywords: and description: operators (as Gigablast and Inktomi, more precisely Ultraseek, already do:). Given that any of these fields is prone to abuse by spammers and SEOs, let's hope that Inktomi will also offer the text: operator.

Inktomi was bought by Yahoo! in 2003. Yahoo! has also acquired Altavista and AllTheWeb; unfortunately there is no longer any reason to use these latter search engines, because they now share Inktomi's search syntax. Let's hope that, once they have grown Inktomi's database, they finally add the still missing operators supported by the former Altavista and AllTheWeb:

anchor:, applet:, filesize:, image:, limip:, link: or link.all:, text:, altavista's truncation (*, ** and ?) and altavista's proximity operators (~ aka NEAR, ~~ # aka within #, <, <~, > and >~)

to help find the needles in Inktomi's immense haystack.


Inktomi's syntax

Default search

Multiple search terms are processed as an AND operation.

Basic inclusion/exclusion of terms

You can use the + and - signs to include, respectively exclude, a term. To exclude terms in an effective way, read my search engines anti-optimization essay.

Boolean search

Inktomi offers full Boolean searching. Its operators are OR and NOT (as in Google, no operator stands for AND); you can use - instead of NOT, and searches can be nested using parentheses (). Operators must be in upper case. You are well advised not to use the OR operator for keyword variants, because your query will attract irrelevant search results (Inktomi gives a higher rank to documents containing all ORed keywords); in those cases you should use stemming whenever you can. Example, compare:

Phrase search

Inktomi lets you search for phrases by enclosing them with quotes ("). You can also use underscores (_) to build phrases (partially discovered by fagan), compare:

The standard way of searching for phrases inside fields like title:, inurl:, etc. does not work (example title:"index of"). Nevertheless, for every such field you have two ways of searching for phrases. Example for title: (the others are similar):

Phrase searches are often used to search for documents generated by some kind of software (which therefore contain some fixed strings of text). The "index of", or more precisely title:index_of, search is a classical example, where you search for open directories, in this case those generated by the Apache server.
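The two field-phrase forms that appear in this essay (the underscore form title:index_of and the quoted form "title:index title:of") can be generated mechanically. The following is only a minimal sketch: the function name field_phrase_queries is hypothetical, and it assumes that both forms behave the same way for the other fields (inurl:, path:), which the essay states but which is not verified here.

```python
def field_phrase_queries(field, phrase):
    """Return the two field-restricted phrase forms used in this essay.

    Assumption: both forms work for every field operator, not only title:.
    """
    words = phrase.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    # Form 1: a single field prefix, words glued together with underscores.
    underscore_form = f"{field}:" + "_".join(words)
    # Form 2: the field prefix repeated on every word, the whole thing quoted.
    repeated_form = '"' + " ".join(f"{field}:{word}" for word in words) + '"'
    return underscore_form, repeated_form


for query in field_phrase_queries("title", "index of"):
    print(query)
```

Either string can then be pasted into the search box together with the other operators.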

Phrase searches are also a valuable tool when you arrive at pages showing a glimpse of some document and trying to sell you the whole document... More often than not, that very same document is available somewhere else for free! Let's take these bastards, who have stolen the previous version of this very same document and are trying to sell access to their database for $9.99 a month. There you find the following snippet:

Inktomi is one of the best search engines out there. Unfortunately its search syntax is not well documented, which is a pity, because Inktomi offers one of the richest search syntaxes, with lots of unique features and a ranking algo which works often quite well. The purpose of this essay consists precisely in documenting Inktmi's search syntax and providing examples showing its usefulness. For that purpose old HotBot's search FAQs and others Inktomi's web partners' search FAQs were read. The core syntax present in them was expanded using search engines and the WayBack Machine. Finally, from the source code of old HotBot's advanced search pages, additional search syntax was guessed: feature:homepage, originurlextension: and stem:. Inktomi unveiled Inktomi doesn't provide a public search engine in a way that search engines like AltaVista or Google do. This paper is the property of learnessays.com Copyright © 2003-2005

My dear reader, when you find something like this all you have to do is take a phrase and put it into a search engine, example: "Inktomi is one of the best search engines out there". Moral: phrases provide very powerful spells to summon the document you want!

Wildcards

The asterisk * can be used within a phrase search to match any word in that position. Thanks to the * you can do proximity searches on Yahoo! This is a very handy feature for searching images, for instance, because most people follow the "content - name relation" when naming files. For example, if you are searching for a Caravaggio picture, you can do the following search on Yahoo: "caravaggio * jpg". That way you'll get pages linking to or containing images named "caravaggio_2.jpg", "caravaggio 07.jpg", etc. Do not expect as many search results as in Google, because Yahoo does not index the alt attribute of images (Google does), nor the src attribute of images, nor the href attribute of <a...> tags.

Case

Inktomi has no case sensitive searching. Using either lower or upper case results in the same hits.

Truncation

No truncation (*, ?) is currently available, but you can use word stemming (stem:).

Stop words

All words are searched. There are no known stop words.

Ranking

Inktomi was one of the first search engines to let you change its ranking algorithm. This is done by giving each keyword a weight. Weight factors can vary between 0.0 and 9.9 and the syntax is weight*keyword; by default each keyword has weight 1.0, as you can see by comparing these two queries: 1.0*fravia and fravia. The simplest way of using this feature is the 80 - 20 rule, i.e. multiply bad keywords (the highly spammed ones) by 0.2*, multiply context keywords by 0.0* (they must be there, but should not disturb the ranking algo) and multiply good keywords (those less likely to be spammed) by 1.0*. Example:

As a rule of thumb this Pareto rule is not too shabby...
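The 80 - 20 weighting can be applied mechanically once you have sorted your keywords into good, bad and context ones. This is a minimal sketch under stated assumptions: the function name weighted_query and the sample keywords are hypothetical, and only the weight*keyword syntax and the 0.0 to 9.9 range come from the text above.

```python
def weighted_query(good=(), bad=(), context=()):
    """Build a query with the 80 - 20 weights: 1.0, 0.2 and 0.0."""

    def term(weight, keyword):
        # The essay gives 0.0 to 9.9 as the allowed range for weights.
        if not 0.0 <= weight <= 9.9:
            raise ValueError("weight must lie between 0.0 and 9.9")
        return f"{weight:.1f}*{keyword}"

    parts = [term(1.0, keyword) for keyword in good]      # rank on these
    parts += [term(0.2, keyword) for keyword in bad]      # highly spammed
    parts += [term(0.0, keyword) for keyword in context]  # present, unranked
    return " ".join(parts)


# Hypothetical keyword split, for illustration only.
print(weighted_query(good=["fravia"], bad=["search"], context=["web"]))
```

Which keywords count as bad or good is a judgment call for each query; the sketch only takes care of the syntax.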

depth:[number]

Denotes how deep in a site's directory structure webpages will be searched. The number (0, 1, 2, 3, 4) specifies the maximum number of subdirectories, relative to the host's root directory, which may appear in the URL. As a general rule (not universal! duh:) a webpage's content increases with directory depth and, besides, spammers think that webpages in the root directory get a ranking boost and are more likely to be indexed, so they often put their doorway pages there. This useful feature offers a handy way of getting rid of those annoyances... excluding the root directories' pages!

Example: title:german hear feature:audio -depth:0
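To get a feeling for what depth: counts, here is a small sketch that computes the number of subdirectories in a URL's path. It is only an illustration: the function name url_depth is hypothetical, the example.org URLs are made up, and the assumption that the last path segment is a file name unless the path ends with a slash is mine, not something documented by Inktomi.

```python
from urllib.parse import urlsplit


def url_depth(url):
    """Count the subdirectories between the host's root and the document."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    # Assumption: without a trailing slash the last segment is a file name.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


for url in (
    "http://www.searchlores.org/",
    "http://www.searchlores.org/inktomi.htm",
    "http://example.org/a/b/page.htm",
):
    print(url_depth(url), url)
```

Under these assumptions a page sitting directly in the root directory has depth 0, which is what -depth:0 in the example above would exclude.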

domain: (aka site:)

Restricts a search to the selected domain. Domains can be specified up to three levels deep. Once you have found a promising site, this operator gives you a way of building a local search engine and, in that way, of flying directly to the meat. Example of use (constructing a local search engine for the searchlores site): domain:searchlores.org.

feature:acrobat

Searches for pages linking to PDF files, although there are some who do escape. Compare the queries:

As quality documents, like papers, are often written in PDF format, this filter provides a way of getting high quality pages: those linking to those very files. Example: "link structure" feature:acrobat. As PDF files may not have been indexed for some reason (examples: robots.txt file or robots meta tags), this feature may provide, in an indirect way, some interesting results.

feature:activex

Detects pages containing embedded content, be it sounds, movies, flash, java, pdf files, powerpoint presentations, etc... almost everything can be embedded in a webpage. The detection is made by verifying the presence of an <object... > tag, as you can see comparing the results of following queries with the original pages:

Content embedded with the <embed...> tag is not matched by feature:activex, as the following example shows:

As the canonical way of embedding content for M$IE is by using activex, and as almost every luser uses M$IE, page creators are compelled to also embed content using the <object...> tag which, nowadays, is also the official HTML 4.01 standard. That said, this feature provides a handy way of getting (or excluding:) pages containing precisely that kind of content. Example (searching for pages containing fravia's workshops embedded as movie or sound): fravia stem:workshop feature:activex.

feature:applet

Detects <applet ...> tags in page's source code, compare:

The tags <object ...> (for Internet Explorer) and <embed ...> (for Netscape) can also be used to embed applets, but Inktomi doesn't detect applets embedded this way. Compare:

Documents containing links to .class or .java files aren't taken into account either, compare:

Example of use -searching for pages where you can play chess interactively: feature:applet title:play title:chess.

feature:audio

Detects if a page contains a link to an audio file. Audio files could be among others: wav, mp3, m3u, mid, midi, au, snd, ... The link could be in a:

feature:audio doesn't match embedded audio files:

If you want to search for embedded audio files you have to resort to the rather coarse feature:activex. Example of use (searching for audio files of fravia's workshops): fravia feature:audio

feature:flash

Contrary to what we could expect, Inktomi detects neither <embed ...> tags nor <object ...> tags. For Inktomi feature:flash simply means webpages linking to files with the extensions fla, spl or swf, compare:

If you want to search for embedded flash you have to resort to the rather coarse feature:activex.

feature:form

Inktomi's crown jewel. Detects the <form> tag in a page's source code. Inktomi may not index the hidden web, but it offers you a way of knowing where the front doors are! For instance you can use Inktomi to find law databases or translation services: dutch english translate url feature:form, etc.

feature:frame

Detects pages containing frames.

feature:homepage

Restricts your search to personal pages (identifier ~). Very useful, because the tilde is still the convention for personal pages on educational sites. Example: web search feature:acrobat feature:homepage.

feature:image

Detects <img...> tag in HTML or a link to an image.

Interested in finding images of birds of paradise? Try the following query on Yahoo!:

("bird of paradise" OR "birds of paradise") (papua OR "new guinea") feature:image -stem:travel -stem:hotel

Images are widely used for aesthetic reasons. If an HTML webpage doesn't contain images you may wonder if there's a hidden agenda... probably it's a cloaked/spammed page by a spammer putting in only keywords 'n' links and not taking the trouble to build a real webpage. You can often trash those annoyances using this useful feature!

feature:index

Restricts your search results to the host's top page. Very useful for finding sites about a given theme! The host's homepage is the site's most valuable real estate: there the site's owner should put a summary of what his site is all about and provide links to his most important pages. Example, searching for FTP search engines: ftp search feature:index feature:form. Inktomi indexes approximately 1,520,000,000 webhosts, cf.: feature:index. 1.5 Billion webhosts is quite an odd figure, because Inktomi has 19.2 Billion documents in its database, so, on average, Inktomi indexes 13 documents per webserver. Given that some domains spam a lot, for most servers Inktomi indexes only the entry page... maybe there's nothing more to index... Nevertheless it is quite odd. Although, as ritz points out, this is probably anecdotal evidence that the number of sites containing n pages follows a Pareto distribution.
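The average quoted above is just the ratio of the two figures. A quick sketch of the arithmetic, using only the numbers given in this section (the variable names are mine):

```python
# Figures quoted in this essay: Yahoo!'s claimed index size and the
# approximate number of hits for feature:index.
documents = 19_200_000_000
hosts = 1_520_000_000

docs_per_host = documents / hosts
print(f"{docs_per_host:.1f} documents per host on average")
```

The result rounds to the 13 documents per webserver mentioned in the text.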

feature:javascript

Detects pages containing the <script ...> tag with the attribute language="javascript", compare:

Inktomi doesn't recognize javascript embedded in other tags' attributes, compare:

Webpages linking to javascript files (extension .js) are not considered as containing javascript, compare:

Javascript, with the help of forms, is a cheap, yet powerful, way of providing interactive pages. Sometimes it is the right tool to cut out all the bragging pages that do not offer the interactive content they promise. Example:

title:german exercises feature:javascript feature:form

Spitze!

feature:meta

Detects <meta ...> tags in webpage's source code.

feature:shockwave

Detects pages containing links to files with extension dcr, dir, fla, spl or swf, compare:

If you want to search for embedded shockwave you have to resort to the rather coarse feature:activex.

feature:script

Detects <script ...> tags in HTML, in particular detects other script languages than javascript (for instance VB script), compare:

feature:table

Searches for pages containing the <table ...> tag. Tables are widely used to control a page's layout and, of course, to build tables! If an HTML webpage doesn't contain tables you might wonder if there's a hidden agenda... probably it's a cloaked/spammed page by a SEO fearing that some search engines may not fully support tables, or by a spammer putting in only keywords 'n' links and not taking the trouble to build a real webpage. You can often trash those annoyances using this useful feature!

feature:title

Detects pages containing the <title> tag. As almost all webpages contain a title, this feature gives a good estimate of how many HTML documents are in Inktomi's database. Cf.: feature:title.

feature:video

Searches for pages linking to video files (file extensions: avi, mpg, mpeg, mov, etc.). Videos embedded with <img...> tags using the legacy attribute dynsrc are not matched, compare:

nor are pages with <embed...> or <object...> tags, compare:

If you want to search for embedded video files you have to resort to the rather coarse feature:activex. Interested in finding videos of fravia's workshops? Try the following query on Yahoo!: fravia feature:video.

feature:vrml

Search for pages containing a link to a vrml file (wrl, wrz, vrml). Compare:

Inktomi is unable to see embedded vrml files. Compare:

Example: web links graph feature:vrml

hostname:

Allows one to find all documents from a particular host only. It has similar uses to those already found on domain:.

inurl:

Searches for words in the URL. You can also search for phrases, but the syntax isn't the one we would expect, inurl:"keyword1 keyword2"; instead it is "inurl:keyword1 inurl:keyword2". Once you have found a promising directory on a site, this operator gives you a way of building a local search engine and, in that way, of flying directly to the meat, or of getting a 'directory listing'. Example of use (constructing a local search engine for the Seeker's message board): domain:2113.ch inurl:mb001.

link:

Finds pages containing hypertext links to the exact specified URL. Comes in handy when you land, using a search engine, on a webpage you like and you want more similar pages from that web site. In those cases you can try to find the 'table of contents' of that specific site. Example, using the URL of this essay to find where searchlores' 'tables of contents' are: link:http://www.searchlores.org/inktomi.html domain:searchlores.org; among other 'tables of contents' you find the following ones (once they get reindexed:): http://www.searchlores.org/news.htm, http://www.searchlores.org/essays.htm and http://www.searchlores.org/main.htm.

Intelligent seekers search the web backwards! Once they have found a good site, they identify the most interesting pages on that site ('table of contents' pages are good candidates) and see who's linking to them. This strategy provides a way, once we know a good site, of finding more good sites or pages linking to good sites. The rationale is that good sites only link to good sites! Searching backwards, once you have found a good site, is the main way of combing the web for good sites which others have already found. Example: link:http://www.searchlores.org/news.htm, link:http://www.searchlores.org/essays.htm and link:http://www.searchlores.org/main.htm.

Inktomi doesn't have an operator which lets you search for keywords on links, so you don't have a direct way of searching for links to a given directory. The only thing you can do in such cases is to use wget to get a directory listing from that directory and feed all those URLs to Yahoo using wget and the link: operator.
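Once you have the directory listing, turning it into link: queries is easy to automate. The following is a minimal sketch: the function name link_queries is hypothetical, fetching the listing (with wget or otherwise) and submitting the queries to Yahoo are left out, and prepending http:// when the scheme is missing is my assumption, based on the fact that every link: and url: example in this essay carries it.

```python
def link_queries(urls):
    """Turn a list of URLs into one link: query per URL."""
    queries = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the listing
        # Assumption: link: wants the full URL, scheme included.
        if not url.startswith(("http://", "https://")):
            url = "http://" + url
        queries.append(f"link:{url}")
    return queries


listing = [
    "http://www.searchlores.org/news.htm",
    "www.searchlores.org/essays.htm",
    "",
]
for query in link_queries(listing):
    print(query)
```

Each resulting line is a separate query; there is no single query covering the whole directory.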

linkdomain:

Searches for pages linking to any page in a given domain up to three levels deep. This operator provides a more versatile way of searching the web backwards (cf. the link: operator). Examples of use:

linkextension:

Searches for pages linking to files with a given extension. This operator provides a way of searching for files which are not downloaded and processed by Inktomi's spiders, such as images, audio, videos and other binary files. One possible use for this operator is searching for blogs. Most blogs, at least most blogs that are nowadays worth reading, have an RSS feed somewhere on them. This operator provides a great way of finding pages containing RSS feeds, which are usually just an RSS, XML, RDF, or ATOM document. Interested in finding blogs on web search techniques? Try: searchlores linkextension:rss, searchlores linkextension:xml, searchlores linkextension:rdf and searchlores linkextension:atom.
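The four feed queries above differ only in the extension, so they can be generated for any topic. A minimal sketch; the names FEED_EXTENSIONS and feed_queries are hypothetical, and the extension list is simply the one given in the text.

```python
# Feed extensions listed in the essay.
FEED_EXTENSIONS = ("rss", "xml", "rdf", "atom")


def feed_queries(keywords):
    """Return one linkextension: query per feed extension."""
    return [f"{keywords} linkextension:{ext}" for ext in FEED_EXTENSIONS]


for query in feed_queries("searchlores"):
    print(query)
```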

originurlextension:

Restricts documents search by type, aka file extension. Document's type is a good proxy of document's quality. Some examples of high quality documents are .pdf, .doc (word), .xls (spread sheets), .ppt (power point presentations), .ps, .dvi and .rtf files. Example of use: web search originurlextension:pdf.

outgoingurltype:[url_type]

Searches for pages linking to documents with a given MIME type. The MIME type is inferred from the document's extension, as the following example shows: outgoingurltype:image/jpeg -linkextension:jpg -linkextension:jpeg -linkextension:jpe -linkextension:jfif. This operator does more or less the same as linkextension:, although it is a little more general, because it clusters file extensions by type.
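The same kind of inference from extension to MIME type can be reproduced with Python's standard mimetypes module. This is only an analogy: Inktomi's own extension table is not public, so the mapping below is Python's, not Inktomi's, and the function name outgoing_url_type is hypothetical.

```python
import mimetypes


def outgoing_url_type(filename):
    """Guess a MIME type from the file extension alone."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Several extensions cluster under a single type, as with outgoingurltype:.
for name in ("photo.jpg", "photo.jpeg", "essay.pdf"):
    print(name, "->", outgoing_url_type(name))
```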

path: (aka originurlpath:)

Searches for words in the URL's path. You can also search for phrases, but the syntax isn't the one we would expect, path:"keyword1 keyword2"; instead it is "path:keyword1 path:keyword2". Once you have found a promising directory on a site, this operator gives you a way of building a local search engine and, in that way, of flying directly to the meat, or of getting a 'directory listing'. Example of use (constructing a local search engine for the Seeker's message board): domain:2113.ch "path:phplab path:mbs.php3" inurl:mb001.

region:name

Restricts your search to a geographical region (africa, asia, centralamerica, downunder aka Oceania, europe, mediterranean, mideast aka Middle East, northamerica, southamerica, southeastasia). You can find which countries are included in each region here. I think that Inktomi, for each domain name, either uses the information obtained by a whois search when the top-level domain is .com, .org, .net, .biz, .edu, etc., as we can infer from the following two queries on Yahoo:

or assigns the country corresponding to the two-letter top-level domain name (examples: .au, .ca, .de, .es, .fr, .uk, .us, etc). The main use for this operator is restricting your search to a given geographical region, example: stem:laws noise stem:levels region:europe. This field can also be used to estimate how many documents are in Inktomi's database:

region:africa                32,100,000 documents
region:asia                 549,000,000 documents
region:centralamerica        13,300,000 documents
region:downunder            242,000,000 documents
region:europe             4,200,000,000 documents
region:mediterranean         25,800,000 documents
region:mideast               51,600,000 documents
region:northamerica      13,000,000,000 documents
region:southamerica         274,000,000 documents
region:southeastasia         45,000,000 documents
Total:                   18,723,000,000 documents

Yahoo claims to have 19.2 Billion (the American one:) pages in its index. Although our estimate is only 18.7 Billion documents, I believe that Yahoo's number might be correct because, for instance, when the host name is an IP number Inktomi does not perform a reverse DNS lookup. Take one of Google's IPs: http://216.239.37.99/. This page (url:http://216.239.37.99/) is in Inktomi's database and the server is located in the United States, but for Inktomi this server is not located in the US, as shown by the following query: url:http://216.239.37.99/ -region:northamerica.
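If you have a list of URLs and want to know which of them are likely to fall outside the region: counts for this reason, you can check whether the host part is a bare IP number. A minimal sketch using Python's standard library; the function name host_is_ip is hypothetical, and the idea that such hosts get no region at all is only what the query above suggests.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """Return True when the URL's host is an IP number, not a name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False  # not a valid IPv4/IPv6 literal, so a host name
    return True


for url in ("http://216.239.37.99/", "http://www.searchlores.org/"):
    print(host_is_ip(url), url)
```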

As HTML tags are almost exclusively used in HTML documents, the exclusion of feature:title, feature:meta, feature:image, feature:table, linkdomain:com and outgoingurltype:text/html, from region:[africa, centralamerica, downunder, europe, mediterranean, mideast, northamerica, southamerica, southeastasia] provides a good estimation of how many non HTML documents are in Inktomi's database:

region:africa -...                273,000 documents
region:asia -...                4,910,000 documents
region:centralamerica -...        289,000 documents
region:downunder -...           2,200,000 documents
region:europe -...             40,300,000 documents
region:mediterranean -...         228,000 documents
region:mideast -...               567,000 documents
region:northamerica -...      170,000,000 documents
region:southamerica -...        1,750,000 documents
region:southeastasia -...         451,000 documents
Total:                        221,968,000 documents

From this we conclude that Inktomi has at least 18,723,000,000 - 222,000,000 HTML documents, i.e. 18.5 Billion HTML documents. In the previous SERPs you can also find some URLs that Inktomi knows exist but has not indexed yet. As Inktomi has 384,000,000 pdf documents (cf. originurlextension:pdf), 60,700,000 word documents (cf. originurlextension:doc), 10,800,000 Excel spreadsheets (cf. originurlextension:xls), 11,500,000 powerpoint presentations (cf. originurlextension:ppt), 6,110,000 rich text files (cf. originurlextension:rtf) and 61,600,000 plain text files (cf. originurlextension:txt), Inktomi can't have more than 165 Million unindexed URLs.
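The lower bound on HTML documents is a plain subtraction of the two totals stated above. A quick sketch of that step, with the totals taken as given in the text (the variable names are mine):

```python
# Totals as stated in the essay.
estimated_total = 18_723_000_000  # sum of the region: counts
non_html = 222_000_000            # rounded sum of the non-HTML counts

html_documents = estimated_total - non_html
print(f"{html_documents:,} HTML documents")
print(f"about {html_documents / 1e9:.1f} Billion")
```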

stem:

Searches for documents containing grammatical word variants including plural, singular, and tense. Example: web search -stem:advertise -stem:business -stem:christ -stem:game -stem:genealogy -stem:host -stem:hotel -stem:job -stem:offer -stem:position -stem:product -stem:service -stem:shop -stem:travel.
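A long tail of -stem: exclusions like the one above is tedious to type, so it can be generated from a word list. A minimal sketch; the function name exclude_stems is hypothetical and the noise words are a subset of the example above.

```python
def exclude_stems(query, noise_words):
    """Append one -stem: exclusion per noise word to the query."""
    exclusions = " ".join(f"-stem:{word}" for word in noise_words)
    return f"{query} {exclusions}".strip()


noise = ["advertise", "business", "hotel", "travel"]
print(exclude_stems("web search", noise))
```

Keeping your favorite noise words in a list lets you reuse the same anti-spam tail across queries.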

title: (aka intitle:)

Searches for words in the title. You can also search for phrases, but the syntax isn't the one we would expect, title:"keyword1 keyword2"; instead it is "title:keyword1 title:keyword2". Inktomi has a rather odd behavior for title searches because, sometimes, the page titles shown in SERPs do not contain the keywords you are asking for, example title:search. Apparently the reason for this odd behavior is that the actual page title is replaced by the title used in the Yahoo! directory, as SEOs claim. Classical example of use: "title:index title:of" -originurlextension:htm -originurlextension:html

url:

Webmasters use this operator to see if a page is in Inktomi's database. Example: url:http://www.searchlores.org/welcome.htm (do not forget the http:// part! duh:). From a seeker's point of view its main use is to pin down a page and reconstruct its content without seeing the original document or the document's cache. We can do so because Yahoo shows the context of your search terms. The reconstruction is done step by step, as shown in the following example:

This feature comes in handy when Yahoo's cache of the document is hidden from you... either by spammers trying to hide their mischief or by webmasters showing you the cake but redirecting you to the subscribe/pay page, snatching the cake away before you can eat it!



References

  1. The Inktomi Difference
  2. The Inktomi Difference (old version)
  3. Inktomi Unveils Web Search 9
  4. Frequently Asked Questions (HotWired old HotBot's FAQ)
  5. HotBot Review on Search Engine Showdown
  6. MSN Search Review on Search Engine Showdown
  7. BBC - Search - Advanced Search Tips & Tricks
  8. HotBot FAQ: HotBot Tools
  9. HotBot Help | Search Tips
  10. Search Engine Marketing Tutorial - Lycos InSite
  11. ENKEL SÖKNING
  12. Yahoo! Help
(c) Nemo 2004 2005    nem0@nowhere.org    replace nowhere by linuxmail.


(c) III Millennium: [fravia+], all rights reserved, reversed, reviled and revealed