machine_translation.htm: linguistic web searching: searching in foreign places

Updated March/2009 version 0.12a
Machine Translation by fravia+ 1st published in March 2009 How to search the Web in languages you don't know		search lore

Introduction	(Language barriers & language magic boots)
Tools of the trade	(A small panoramic view) | Google Translate | Googlex | EurLex | Eurodicautom/Iate | Babylon | Systran | MS-own | Yahoo's own | Lec Translator | rikai |
Silly tricks	(when you're in a hurry)

Introduction
Language barriers & language magic boots

Přes šteští vidění tak slepý Přes dar slyšení tak hluchý		The luck of being able to see and, notwithstanding, so blind. The joy of being able to hear and, notwithstanding, so deaf
			Ancient Seekers' lore

Granted! Personal (good) knowledge of foreign languages is always a mighty addition -and weapon- for all web-seekers: You happen to know -say- Czech and whoop-là! A whole world of browsing and untapped information opens up. Yet even the best linguist-seekers ("lusi naturae", all of them :-) cannot cope with the fathomless linguistic depths of the infinite web: in order to fetch real information on a given subject you very often have to delve inside languages you don't understand a single word of.

So what d'you do? You need to find (or pilfer) some linguistic magic boots! So you (learn to) use & abuse all kinds of online public machine translation services, which are nowadays getting better and better almost by the minute, especially when they are in fact -strictly speaking- NO "machine" translations at all!
In fact any "automated translation" works better when it builds only (or predominantly) on huge corpora of texts that have been originally translated by humans, as proved by the recent google experiments. "Statistical, data-driven translation" à la Babelfish (a name used originally by altavista -and now by yahoo- but apparently also behind google's future "translation gadgets") does wonders, even in google's own watered-down, online public version of google translate. The results clearly beat any attempt to actually "translate" phrases ex novo, adapting algos for each specific couple of languages (as per syntax-based translation models à la eurodicautom). All these syntax-based models need huge -and often not very useful- sets of specific rules built for each language couple, while modern systems are data-driven. Such systems are self-optimizing: they can grep from already translated materials the translations of terminology, and even some stylistic phrasing.

"Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And -look and behold- all of a sudden, the worst algorithm is performing better than the best algorithm on less training data"
(Peter Norvig)

This basically means that a good and almost error-free "machine translation" can actually be obtained ONLY for languages that already have millions (or even better: billions) of documents that have been translated by HUMAN translators. Note that the professional capacity of the human translators themselves has no importance whatsoever once such critical mass has been reached... once you have a billion documents, the law of statistics will let the correct version emerge among all the wrong ones.
Granted: maybe the result won't be the most "poetic" translation, but it will for sure be a very correct one, while all erroneous (and among these a few very poetic and nice) translations will sink automatically below the horizon. Yep! The triumph of bureaucracy (and of mass-conformity) versus inventiveness... but also versus poor quality.
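A toy sketch of this "critical mass" intuition, in Python. The candidate strings are invented for illustration, and a plain majority vote is only a stand-in: real statistical systems estimate probabilities over phrases in aligned corpora; they do not count whole sentences.

```python
from collections import Counter


def consensus(translations):
    """Return the most frequent rendering and its share of all renderings."""
    # Normalise trivially so that case and stray spaces do not split the vote.
    counts = Counter(t.strip().lower() for t in translations)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(translations)


# Hypothetical renderings of one source sentence by ten human translators.
candidates = (
    ["the cat sleeps"] * 6
    + ["the cat is sleeping"] * 3
    + ["cat the sleep"]
)

best, share = consensus(candidates)
print(best, share)
```

The plain (bureaucratic) version wins, the rarer and perhaps nicer variant sinks, and the outright wrong one disappears: exactly the trade-off described above.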

UN LANGUAGES Alas, very few languages have such big corpora of human translations. Yet some indeed do: Arabic, Chinese, English, French, Russian and Spanish have their own UN-corpora (and translation memories): at the UN all documents, useful or not, are HUMAN-produced in the six official languages and are issued simultaneously when all the language versions are available.
Take this example: http://www.google.com/search?num=100&hl=en&lr=&ie=ISO-8859-1&q=%22well+being+of+present+and+future+generations%22+site%3Aun.org&btnG=Search and you'll notice that google did indeed index quite a lot of (all?) UN docs, and that google delivers them MORE quickly than the UN's own servers :-).
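For seekers who prefer to build such queries in a script, here is a minimal Python sketch that rebuilds the example URL from its parts. The function name is made up for illustration; the parameter names are simply those visible in the URL above, and there is no guarantee that google still honours all of them.

```python
from urllib.parse import urlencode


def google_site_phrase_url(phrase, site, num=100):
    """Build a Google query for an exact phrase restricted to one site."""
    params = {
        "num": num,          # results per page
        "hl": "en",          # interface language
        "lr": "",            # no language restriction
        "ie": "ISO-8859-1",  # input encoding, as in the example URL
        "q": f'"{phrase}" site:{site}',
        "btnG": "Search",
    }
    return "http://www.google.com/search?" + urlencode(params)


url = google_site_phrase_url(
    "well being of present and future generations", "un.org"
)
print(url)
```

Changing the phrase or the `site:` target is then a one-line affair.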

The UN do not make it easy to switch to the other language versions, but any average web-seeker can quickly grasp the structure of their databases: On the web, as we all know, nomen est omen.

EU LANGUAGES Add to the UN also the European Union's huge translation services: Know that Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish are at the moment the official languages of the European Union, whose many institutions tend to produce HUGE amounts of human-translated "paperasses".

These languages have been adopted at different times, following the development of the European Union (and because of macho nationalistic postures and of the laws of political compromising, even completely useless "languages", like Maltese or Irish, had to be lavishly translated as well). Therefore not all languages can count (yet) on wide enough corpora.
The EU probably offers, for statistical translation purposes, enough translation memories and data only for the "oldest" EU-languages: Dutch, English, French, German and Italian. Maybe Greek, Spanish and Portuguese too, nowadays.
We underline the need for millions of human translated pages for real "statistical" translation effectiveness!
This means that English and French (maybe Spanish too) are the ONLY languages that have a double (complete and huge) UN + EU corpus of human translated texts. I guess that if google really used all the documents of the UN and European institutions it has indexed, its real (and uncrippled) "babelfish" translations should already be almost perfect for English and French (and maybe Spanish).
Note that for French and English (to and from) there is also a big set of Canadian human translations.

This said and understood, the recent developments in -for instance- Arabic-English and Chinese-English translation (both, since 2005, heavily subsidized and fostered for obvious american political reasons), with the results that google is now capable of delivering for free, are truly staggering. Even if I cannot really judge, since I speak neither arabic nor chinese, google's "babelfish-alike" public translation service conveys enough meaning to allow me -mine de rien- to easily browse and peruse Arabic and Chinese sites. And it feels (almost) like "real" translation! Yessir! No poor simple gisting any more.
How and on whose budgets they did manage such amazing results I can only guess :-)

Have a try yourself, using the (crippled) public version of translate.google. Here is an Arabic-English example: this Arabic snippet was taken on the fly from the Al Jazeera site, just click on it :-)

As you can see Arabic... قالت مصادر فلسطينية إنه تم قطع شوط مهم في معالجة القضايا الخلافية المطروحة في حوار الفصائل الفلسطينية المجتمعة بالقاهرة، وإن عمل لجنة المصالحة يسير بشكل جيد وإيجابي. يأتي ذلك بينما تواصل وفود الفصائل الفلسطينية اليوم الأربعاء جلسات الحوار التي افتتحت أمس في القاهرة برعاية مصرية ومشاركة جامعة الدول العربية، ويتوقع أن تستمر على مدى عشرة أيام.
...gives, in English: "Palestinian sources said that it had been important strides in addressing the controversial issues raised in the dialogue, the Palestinian factions meeting in Cairo, although the work of the reconciliation committee is progressing well and positively. This comes as it continues and the Palestinian factions on Wednesday, the dialogue sessions, which opened yesterday in Cairo under Egyptian auspices and participation of the League of Arab States, and is expected to continue for ten days."

Some minor glitches, but nothing terrible. Quite impressive. Of course in this specific case it was unnecessary: Al Jazeera has an english human translated edition. But you get the point.

"Most state-of-the-art commercial machine translation systems in use today have been developed using a rules-based approach and require a lot of work by linguists to define vocabularies and grammars. Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model"

(Franz Och)

Translate.Google is a gorgeous free tool, but it is still worth noting that some VERY POWERFUL "specific" free translation tools are NOT well known and publicized. Just to give an example, for russian/english and english/russian one of the best online machine translation tools is provided by a russian server (alas quite slow): http://www.rambler.ru/dict/.

In the following "tools of the trade" section we will examine more in depth some among the various "machine translation" services seekers can use. As usual we'll also understand how to "steal some magic boots" and -more generally- how seekers can take advantage of the foolish commercialisation attempts of a Web whose very STRUCTURE was made for sharing... not for hoarding and not for selling.

Tools of the trade
A small panoramic view

Google translate

Da real beefz

Fundamental tool for "unknown language searching". Translates from the following languages: Albanian, Arabic, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese.

Since Autumn 2007, Google seems to have dispensed entirely with the rules-based system (which was provided by Systran) and offers translations using its own statistical method for all its languages. With the recent additions of turkish, thai, hungarian, estonian, albanian, maltese (sic!) and galician, the total number of languages of Googletranslate has reached 41 (counting the two "chineses" as one), bringing google's translation matrix in march 2009 to the astonishing level of 1,640 language couples (41*40, counting each direction separately).
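The arithmetic is easy to check: every language can be translated into each of the others, and the two directions count as two couples. A short Python sketch (the function name is just an illustration):

```python
from itertools import permutations


def language_couples(languages):
    """All ordered (source, target) couples with source != target."""
    return list(permutations(languages, 2))


# Three languages already give 3 * 2 = 6 couples.
for source, target in language_couples(["Czech", "English", "Maltese"]):
    print(source, "->", target)

# With 41 languages the matrix has 41 * 40 couples.
total = len(language_couples(range(41)))
print(total)
```

Each added language therefore brings in twice as many new couples as there were languages before it, which is why the matrix grows so quickly.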

A "translated search" option will allow you to get automatically translated pages relevant to your search: http://translate.google.com/translate_s?hl=en
Obviously your choice of the target language/geographic area/community should depend on your searchstring and targets, as explained in the lore of local searching. For instance: http://translate.google.com/translate_s?hl=en&clss=&q=rapidshare+uploading+depositfile&tq=&sl=en&tl=uk
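The same "translated search" URL can be assembled programmatically. This is a minimal Python sketch under the assumption that the parameters seen in the example (`q` for the searchstring, `sl` and `tl` for source and target language) are all that matter; the function name is hypothetical and the service may well change or vanish.

```python
from urllib.parse import urlencode


def translated_search_url(query, source="en", target="uk"):
    """Rebuild the translate_s URL pattern shown in the example."""
    params = {
        "hl": "en",    # interface language
        "clss": "",    # left empty, as in the example
        "q": query,    # searchstring, typed in the source language
        "tq": "",      # left empty, as in the example
        "sl": source,  # language you search in
        "tl": target,  # language of the pages you want to reach
    }
    return "http://translate.google.com/translate_s?" + urlencode(params)


url = translated_search_url("rapidshare uploading depositfile")
print(url)
```

Swapping `target` is then all it takes to aim the same query at another linguistic community.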

Some minor glitches are still present on this google's Babelfish (crippled) public version. Try for instance: moi je cherche un tas de choses bizarres sur la toile and see how the tense gets lost in "I sought a lot of strange things on the web".
I personally find this tool particularly useful for really awkward languages (for me) like vietnamese, japanese, thai, korean or chinese.

Also check the limits of automatically translated web pages choosing english to french (or using "swap") and then inputting, for instance
http://www.searchlores.org/longtermsearching.htm
A "Poor man" proxy Note that these "URL translation" services can also be used as a quick "poor man proxy" in order to bypass many moronic -school, parents, religious or state- censorships. Just ask for a -say- "Albanian-English" translation and you'll have google (not you) fetching and delivering your target in perfect english (this will only work if the censorship software doesn't block the keywords inside the URLs, in which case you'll have to use other methods, for instance socks and/or ssh proxy, to bypass).

Finally for your own intranets playing pleasure you'll find an ad-hoc google "Ajax" translation mask on its ad-hoc page. It's all simple javascript, so you can port the whole mask wherever you want.

Googlex

A gazillion of human made translations

Use google (instead of the slow Eur-Lex search masks) to quickly find a gazillion EU translations.
I am proud to have devised this simple "googlex" (google+eur-lex) mask: despite its obvious simplicity, it turned out to be an incredibly versatile and powerful translation web-tool, useful for anyone dealing with web-searches (and, more generally, translations).
In fact, once you have found your target, if you choose to fetch the cached copy of google instead of the original eur-lex server document you'll cut two mustards with one stone :-) You'll have an automatic highlight of your original searchstring AND you'll probably get the document itself much more quickly.

More detailed instructions are available on its ad-hoc page: googlex.htm

Googlex		Version .09/03

Eur-Lex

Have bilingual display, will understand
Following a European Parliament resolution of 19 December 2002, access to the old "legal" database of the European institutions, "Celex" (Communitatis Europeae LEX), has been free of charge since 1 July 2004.
The result of the merging with the EUR-Lex portal is the new system, also named EurLex.
Visit the EUR-Lex search mask but check also the googlex tool discussion about searching the same documents through google's servers.

Official Journals can for instance quickly be gathered: http://eur-lex.europa.eu/JOIndex.do?year=2009&serieL&textfield2=11&Submit=Search&_submit=Search&ihmlang=en
See? OJ L 2009/11: only those 4 parameters do change.
If you want a specific section, you must know the page number in the OJ as well: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2009:011:0083:0083:EN:PDF.
Self explanatory URL, duh. Note that changing the page number will still give you the complete pdf file of the specific OJ subsection.
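Since the URL is that regular, it can be generated. A minimal Python sketch, assuming the zero-padding seen in the example (three digits for the issue, four for the pages); the function name and argument names are invented here:

```python
def oj_pdf_url(series, year, issue, first_page, last_page=None,
               lang="EN", fmt="PDF"):
    """Build a LexUriServ URL for a section of the Official Journal."""
    if last_page is None:
        last_page = first_page
    # Pattern read off the example: OJ:<series>:<year>:<issue>:<from>:<to>:<lang>:<format>
    uri = (f"OJ:{series}:{year}:{issue:03d}:"
           f"{first_page:04d}:{last_page:04d}:{lang}:{fmt}")
    return "http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=" + uri


url = oj_pdf_url("L", 2009, 11, 83)
print(url)
```

Whether every language code and format is actually served for a given issue is something only the server can tell you.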
If you want a bilingual display, nothing beats the Eurlex bilingual facility, though. For instance OJ L 2009/11 (in English and starting at page 6):
http://eur-lex.europa.eu/Notice.do?mode=dbl&lng1=en,es&lang=&lng2=bg,cs,da,de,el,en,es,et,fi,fr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv,&val=486721:cs&page=1&hwords=Commission+Directive+2009%2F2%2FEC+of+15+January+2009+amending%7Efor+the+purpose+of+its+adaptation+to+technical+progress%7Efor+the+31st+time%7ECouncil+Directive+67%2F548%2FEEC+on+the+approximation+of+the+laws%7Eregulations+and+administrative+provisions+relating+to+the+classification%7Epackaging+and+labelling+of+dangerous+substances%7E
Note the mode=dbl&lng1=en,es, here for english into spanish.
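To switch the language couple of such a bilingual URL without retyping it, you can rewrite just the `lng1` parameter. A Python sketch, tried here on a shortened version of the URL above (whether the server accepts the shortened form is an assumption); the function name is made up:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_bilingual_pair(url, first, second):
    """Return the URL with lng1 set to '<first>,<second>'."""
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as 'lang='.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query = [(k, f"{first},{second}" if k == "lng1" else v)
             for k, v in query]
    # Keep commas and colons readable instead of percent-encoding them.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",:")))


short = ("http://eur-lex.europa.eu/Notice.do"
         "?mode=dbl&lng1=en,es&lang=&val=486721:cs&page=1")
swapped = set_bilingual_pair(short, "en", "fr")
print(swapped)
```

Note that the round trip through `parse_qsl` and `urlencode` may re-encode other characters in an equivalent form (for instance `%7E` becomes `~`).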

Eurodicautom-Iate

Old but sturdy one-term translation tool
The old Eurodicautom EU-translators' database is now freely available in its "iate" (Inter-Agency Terminology Exchange) implementation.
Eurodicautom covers at the moment both the "old" and "new" EU-languages (see above for a complete list). It is not really a comprehensive dictionary: it is focused on technical language. The server it is hosted on is probably (as of early 2009) a pentium III held together by rubber bands and small pieces of adhesive tape: it might at times work with sufficient speed... if you are lucky.

The four original languages of Eurodicautom were Dutch, French, German and Italian, to which Danish and English were added in 1973, Greek in 1981, Portuguese and Spanish in 1986, Finnish and Swedish in 1995 and so on. Latin is also (sporadically) present. It contains millions of terms and hundreds of thousands of abbreviations.
All current official EU-languages are fed in, because this free tool aims at meeting the (patently nonsensical yet politically motivated) European Union wish of conferring the same recognition to each language... as if languages were really equal :-)

Let's try as example to fetch all the available translations for the term "search":
http://iate.europa.eu/iatediff/SearchByQuery.do?method=search&saveStats=true&screenSize=1680x1050&query=search&valid=Search+&sourceLanguage=en&targetLanguages=s&domain=0&typeOfSearch=s&request=
Notice: 1) &sourceLanguage=en for english; 2) &targetLanguages=s for "any" target language (but you can specify only some EU-languages), and, 3) the query itself which is here the term "search" in &query=search, a term that you can of course change on the fly inside Opera's address bar with a different query.
Why the developers have insisted so much on masking the real http:// request that their script sends to the server, beats me :-)
Also note you could use optional useful "domain" criteria... for instance limiting the query to "3236 Information technology and data processing" if need be.
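Once unmasked, the request is easy to reproduce. The following Python sketch rebuilds the example URL; the parameter names and fixed values are copied from it, while the function name and its defaults are my own. Passing the domain number (e.g. "3236") straight into `domain` is a guess, not something the example confirms.

```python
from urllib.parse import urlencode

IATE_SEARCH = "http://iate.europa.eu/iatediff/SearchByQuery.do"


def iate_query_url(term, source="en", targets="s", domain="0"):
    """Build an IATE single-term query following the example URL."""
    params = {
        "method": "search",
        "saveStats": "true",
        "screenSize": "1680x1050",  # copied as-is from the example
        "query": term,
        "valid": "Search ",         # trailing space, as in the example
        "sourceLanguage": source,
        "targetLanguages": targets, # "s" stands for any target language
        "domain": domain,           # "0" as in the example (no restriction)
        "typeOfSearch": "s",
        "request": "",
    }
    return IATE_SEARCH + "?" + urlencode(params)


url = iate_query_url("search")
print(url)
```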

Criticisms You must however be careful with multiterm searches: Eurodicautom is mostly a stupid and blind "single-term" search technical database, and multiple terms (try substituting "search" with "web+preferences") are seldom useful. Phrase searching, which is a sine qua non for translators, is usually MUCH better served through yahoo or google (again the laws of "statistical" versus "syntax based" translation approaches). Use Eur-Lex, or (probably better) our own "googlex" form, calling up a document in -say- Spanish and then switching to English -or whatever- and you'll thattaway probably cut your own translating mustard much better.

As someone (rightly) wrote many years ago: "The original logic was that Eurodicautom would ensure consistent translation and usage across European institutions, but the database is really too large to meet that goal. A great many terms go into it, but there isn't much effort to check how much those terms are actually being used. As neat as huge term databases are, quality control is the real Achilles heel of this concept. Where possible, it's better to have a small database of known good terms than a big database of terms of unknown quality - although it's better still to have both. I tend to think of this sort of database as a last-ditch resource - cheaper than original term research and better than making something up, but definitely a source of questionable quality"

Real life translators dealing with bureaucratic papers could be interested also in the eurovoc thesaurus (that you can download for free).

Babylon

A cracked proprietary dream
Babylon dictionaries (only covering "important" languages).
These can be downloaded for free, and theoretically you would need the babylon program to use them.
Yet theory and practice often differ, especially in our web-netherworld, and thanks to Bilbo and many others among the world's finest reversers at woodman's messageboards it is now relatively easy to port every Babylon book to GNU/Linux in its COMPLETE FORM (if necessary obtorto collo :-) using some ad hoc small scripts.

Of course you could go for the complete book instead, but in that case check the laws of your country (or the country of your proxy) about the possibility that some of the findings of your queries could be in fact not in the public domain, despite their massive and widespread presence on the web (some patent-obsessed clowns really seem to believe they can put toothpaste back inside the tubes). So never do anything illegal on the web: there's mostly no need whatsoever anyway. The real lesson here is for the patent-obsessed: "If you try to fence off your little corner of the Internet, you’re better off herding cats".
Note in fact that some countries, like S.Marino or Somalia, and their proxies, didn't adhere to the various bogus conventions of the patent holders' mafia.

Systran

Another proprietary-software dream
Systran is a very old translation system, that works well, sometimes, for some language couples. It is a commercial proprietary solution, though, and of course there's no reason whatsoever to pay for results you can have for free -and often better- elsewhere.
They should offer it for free (instead of just letting lusers leech it all over the web), with source code and everything, let it be ported to GNU/Linux (it is obscenely geared towards windows and various Microsoft-crapola products like word) and so get it improved and finetuned by better programmers than those they seem to have in-house. This would give them a (weak) chance against the google bulldozer.

Language pairs: English - Arabic, French - Dutch, English - Chinese, French - German, English - Dutch, French - Greek, English - French, French - Italian, English - German, French - Portuguese, English - Greek, French - Spanish, English - Italian, Italian - German, English - Japanese, Italian - Portuguese, English - Korean, Portuguese - German, English - Polish, Spanish - German, English - Portuguese, Spanish - Italian, English - Russian, Spanish - Portuguese, English - Spanish, English - Swedish.

It's not that bad, after all. Try our text=moi+je+cherche+un+tas+de+choses+bizarres+sur+la+toile and you'll get an almost correct "me I seek a lot of odd things on the web".
This said, its freely accessible mask for direct text translation OR webpage translation could come in handy, at least as a simple "poor man" proxy (see above).

MS live translator aka microsofttranslator

Nice try, pity it's still behind
Microsoft must have thought they HAD to compete in this sector too with google. A pity that they seem still to be far behind the horizon.
There's now a new domain (delivered by akamai's sniffing and censorship-prone clowns): http://www.microsofttranslator.com/, rumored to be slightly quicker.

Languages available are English to/from: Arabic, Chinese Simplified, Chinese Traditional, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian (RUS ==> EN only) and Spanish. Also Chinese Simplified <==> Chinese Traditional.
If you click on a "Translate this page" link in the Live Search results or enter a web address in the web page translation box, translations will be presented to you in the "Bilingual Viewer," providing easy access to the original web page and its translation. There are also some interesting view-options, like hovering translation or hovering original.
Alas! Not all pages will be translated: following a typically "microsoft-moronic" approach, this interesting feature is -purposely- optimised for the dangerous (in fact: masochistic) MS-Explorer browser and works badly with all real browsers (Worse! It does not work at all with the best browser of the planet: opera).

Try our "moi+je+cherche+un+tas+de+choses+bizarres+sur+la+toile" and you'll get a slightly "tarzan-like" text with the dubious choice of "look" over "search": "me look a lot of strange things on the Web".
This said, its freely accessible mask for direct text translation OR webpage translation could come in handy, at least as a simple "poor man" proxy (see above).

Yahoo babelfish

Average, again: yahoo style
The old Altavista's machine translation service, repackaged by Yahoo, powered by Systran.
Again, the "Babelfish" name was chosen.

A futile attempt, by yahoo, to keep abreast against the google bulldozer.

Try our "moi+je+cherche+un+tas+de+choses+bizarres+sur+la+toile" and you'll get a horrific "me I seek an odd heap of things on the fabric".
This said, its freely accessible mask for text and webpage translation is not that bad and could come in handy, at least as a simple "poor man" proxy (see above).

Lec translator

Below average commercial crap
Lec's (Language Engineering Company) translator works -somehow- for japanese/english and english/japanese.
Also it has Pashto (quite useful if you are roaming around between Kabul, Kandahar and Bahawalpur :-)

For some other language combinations -frankly- it is not even funny :-(
Luckily they provide a demo that will show all its limits.

Our "moi je cherche un tas de choses bizarres sur la toile" example will result in the rather outlandish "me I look for a heap of bizarre things on canvas".

Again this just proves how on the web free products (mostly) beat commercial products black and blue. This of course also happens elsewhere, in the software world (see GNU/Linux versus Windows) and in the real one (see realicra/aquafina_and_dasani.htm).

Rikai

Japanese-english hovering translation
Among "hovering translators", Rikai is (for japanese to english) surely one of the best free scripts available. You'll just have to use Opera's excellent "right click" ==> "block content" feature in order to eliminate its obnoxious advertisements.
Try it out at http://www.rikai.com/perl/Home.pl (pass the cursor on some japanese symbols and enjoy :-)

Of course the whole point is to use it on sites you happen to find during your searches. In fact the rikai perl scripts work quite well when applied to a japanese site you want to understand: here for instance comics.shogakukan.co.jp.

Note how clicking on any link will bring your "hovering rikai" feature along with you.

Silly tricks
(When you are in a hurry)

The "images" kniff

Well, not so silly, after all. Let's imagine you found the arabic term حصان, and let's imagine you don't know arabic and you want to quickly find out what it means. You could either use one of the online translators described above (for instance google's one: حصان) or you could just input your foreign unknown term inside any "images" search engine! This will often give you an even more interesting, because richer in context, explanation of the unknown term (here for instance using google images).
Just to put things on par, let's choose an example from another (but equally religious-obsessed :-) language: "חתול" means of course חתול... what else?
Note btw, that this images-related trick works in beautiful black and white ONLY if you use yahoo instead (click on the "show only b&w" option)... the equivalent option: &imgc=mono, does NOT work very well in google.
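If you want to script the "images" trick, the only real work is percent-encoding the non-Latin term. A Python sketch: the endpoint below is an assumption of mine (it is not given in the text), while `imgc=mono` is the google option mentioned above.

```python
from urllib.parse import urlencode

# Assumed image-search endpoint; replace with whatever engine you prefer.
IMAGE_SEARCH = "http://images.google.com/images"


def image_search_url(term, mono=False):
    """Build an image-search URL for a term in any script."""
    params = {"q": term}  # urlencode emits UTF-8 percent-escapes
    if mono:
        params["imgc"] = "mono"  # black and white only
    return IMAGE_SEARCH + "?" + urlencode(params)


TERM = "\u062d\u0635\u0627\u0646"  # the arabic term used in the example above
url = image_search_url(TERM)
print(url)
```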

fravia+, march 2009
