Machine Translation

This page collects everything we consider important for doing machine translation (MT) with The Beast.

Atomic Types

With The Beast you need a fixed set of possible values for the attributes of objects/rows. These are internally mapped to integers for efficient processing and storage. Let's call these sets Atomic Types for now.

For Phrase-Based MT it would make sense to have the following types (written in TheBeast format):

CREATE TYPE SourceWord (Ich, mag, das, boot);
CREATE TYPE TargetWord (I, like, the, boat);
CREATE TYPE ID (0,1,2,3,4,5,6...);
CREATE TYPE Position (0,1,2,3,4);
CREATE TYPE TranslationPhrase ("I like","the boat", "I like the", "the boat", "the",...);

Note: Soon it will be possible to define integer types as ranges, e.g. CREATE TYPE ID (1..10).
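
To make the idea of an Atomic Type concrete, here is a minimal Python sketch of a fixed value set that is mapped to integers. It is not TheBeast's actual implementation; the class name AtomicType and its methods are hypothetical, and assigning consecutive integers in declaration order is an assumption.

```python
class AtomicType:
    """A fixed, ordered set of values, each mapped to an integer (illustrative only)."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)
        # Assumption: integers are assigned consecutively in declaration order.
        self.index = {value: i for i, value in enumerate(self.values)}
        if len(self.index) != len(self.values):
            raise ValueError(f"duplicate value in type {name}")

    def encode(self, value):
        # Raises KeyError for values that are not part of the type.
        return self.index[value]

    def decode(self, i):
        return self.values[i]


source_word = AtomicType("SourceWord", ["Ich", "mag", "das", "boot"])
encoded = [source_word.encode(w) for w in ["Ich", "mag", "das", "boot"]]
```

Because the value set is closed, a word outside the type cannot be encoded at all, which is why the types have to list every value up front.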


I would propose the following schema for our variables. We would need a table to store a source sentence. A table of tokens with indices might be a good idea:

   position Position,
   word SourceWord);    

Then we need a table for the possible translation phrases (plus the span of the source sentence each phrase belongs to):

   id ID,
   words TranslationPhrase,
   begin Position,
   end Position);
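
As a rough illustration of what the rows of these two tables could look like, the following Python sketch tokenizes a source sentence and enumerates candidate phrases. The names SourceToken, PhraseCandidate, source_tokens and phrase_candidates are hypothetical, the toy phrase table is made up for the example, and treating end as an inclusive position is an assumption.

```python
from typing import NamedTuple


class SourceToken(NamedTuple):
    position: int
    word: str


class PhraseCandidate(NamedTuple):
    id: int
    words: str
    begin: int  # first source position covered
    end: int    # last source position covered (assumed inclusive)


def source_tokens(sentence):
    """One row per token, indexed by its position in the sentence."""
    return [SourceToken(i, w) for i, w in enumerate(sentence.split())]


def phrase_candidates(tokens, phrase_table):
    """One row per (source span, target phrase) pair found in the phrase table."""
    rows = []
    for begin in range(len(tokens)):
        for end in range(begin, len(tokens)):
            span = tuple(t.word for t in tokens[begin:end + 1])
            for target in phrase_table.get(span, []):
                rows.append(PhraseCandidate(len(rows), target, begin, end))
    return rows


# Toy phrase table (made up): source word tuple -> possible target phrases.
toy_phrase_table = {
    ("Ich", "mag"): ["I like"],
    ("Ich", "mag", "das"): ["I like the"],
    ("das",): ["the"],
    ("das", "boot"): ["the boat"],
}

tokens = source_tokens("Ich mag das boot")
candidates = phrase_candidates(tokens, toy_phrase_table)
```

Each candidate gets its own id, so the same target phrase can appear several times if it translates different source spans.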

We also need a table, Follows, that represents our results: pairs of (target) phrases that say what we translate and in what order. Each entry states that, in the translation, the first phrase of the pair follows the second.

   first ID,
   second ID);
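
To show how such pairs encode a translation, here is a small Python sketch that reads the target sentence back from a set of (first, second) pairs. It assumes that "follows" means "comes directly after" and that the pairs form a single chain; the function name linearize and the example ids are hypothetical.

```python
def linearize(follows, phrases):
    """Rebuild the target sentence from (first, second) pairs.

    Assumption: (first, second) means phrase `first` comes directly
    after phrase `second`, and all pairs form one chain.
    """
    successor = {second: first for first, second in follows}
    followers = {first for first, _ in follows}
    # The chain starts at the only phrase that follows nothing.
    starts = [s for s in successor if s not in followers]
    if len(starts) != 1:
        raise ValueError("pairs do not form a single chain")
    order = [starts[0]]
    seen = {starts[0]}
    while order[-1] in successor:
        nxt = successor[order[-1]]
        if nxt in seen:
            raise ValueError("pairs contain a cycle")
        order.append(nxt)
        seen.add(nxt)
    if len(order) != len(follows) + 1:
        raise ValueError("some pairs are not part of the chain")
    return " ".join(phrases[i] for i in order)


# Phrase 3 ("the boat") follows phrase 0 ("I like").
example_phrases = {0: "I like", 3: "the boat"}
translation = linearize({(3, 0)}, example_phrases)
```

Pair sets that branch or loop are rejected here; in the model itself this is the kind of thing hard constraints would have to rule out.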

Finally, we need a table, PharaohScore, that gives us the scores from Pharaoh for each phrase pair. For this we need a weighted table (a table that has an extra real-valued column).

   first ID,
   second ID);


For now we get the local scores from the PharaohScore table:

FOR Follows SELECT s.first, s.second FROM PharaohScore s WEIGHT s;

This means that for the Follows table we select all pairs from the score table and give each pair the weight it has in the score table. Whether we keep these pairs will depend on the hard constraints we add.
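
The interplay of local weights and hard constraints can be sketched in Python as follows. This is only an illustration, not how The Beast does inference: the weights are made up, the score of a solution is assumed to be the sum of the weights of its pairs, and the constraint (no phrase has two predecessors or two successors) is a hypothetical example of a hard constraint.

```python
from itertools import combinations

# Toy PharaohScore rows: (first, second) -> weight. The values are made up.
pharaoh_score = {(3, 0): 1.5, (0, 3): -0.5, (2, 0): 0.25}


def follows_weights(score_table):
    """Every scored pair becomes a Follows candidate with the same weight."""
    return dict(score_table)


def solution_score(follows, weights):
    """Assumption: a solution's score is the sum of its pairs' local weights."""
    return sum(weights[pair] for pair in follows)


def satisfies_constraints(follows):
    """Hypothetical hard constraint: at most one predecessor and one successor."""
    firsts = [first for first, _ in follows]
    seconds = [second for _, second in follows]
    return len(set(firsts)) == len(firsts) and len(set(seconds)) == len(seconds)


def best_solution(weights):
    """Brute force over all subsets of candidates; fine only for toy inputs."""
    pairs = list(weights)
    subsets = (set(c) for r in range(len(pairs) + 1)
               for c in combinations(pairs, r))
    feasible = [s for s in subsets if satisfies_constraints(s)]
    return max(feasible, key=lambda s: solution_score(s, weights))


weights = follows_weights(pharaoh_score)
best = best_solution(weights)
```

Here the pair (2, 0) has a positive weight but is dropped from the best solution, because keeping it together with (3, 0) would violate the constraint.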

-- SebastianRiedel - 26 Oct 2006

Topic revision: r3 - 29 Oct 2006 - 04:01:56 - SebastianRiedel