Chodorow et al. (2007)

Martin Chodorow, Joel Tetreault and Na-Rae Han (2007). "Detection of Grammatical Errors Involving Prepositions". Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions, Prague.

Types of preposition errors: (a) incorrect preposition selection; (b) extraneous prepositions.

Methodology

  1. A TRAINING SET of sentences containing any of a vocabulary of 34 selected prepositions (7 million preposition occurrences in total) was extracted from:
    • the MetaMetrics corpus of 1100 and 1200 Lexile text (11th and 12th grade)
    • newspaper text from the San Jose Mercury News
  2. The sentences were POS tagged and chunked (into NPs and VPs)
  3. Each preposition occurrence/sentence pair was abstracted into a feature vector of 25 contextual features, e.g.
    • bigram to left
    • headword of following phrase
    • preceding verb
    • lemma of preceding word
    • trigram to right
    • etc.
  4. Feature-value pairs which occurred fewer than 10 times were eliminated (to avoid the need for smoothing)
  5. The abstracted training set was used to train a maximum entropy model to estimate the probability of each of the 34 prepositions
  6. A TEST SET was derived from a random selection of the 1100 Lexile text, which had not been used for training. This test set involved 18,157 occurrences of the 34 prepositions.
  7. Each preposition occurrence/sentence pair in the test set was POS tagged, chunked and abstracted into a feature vector, as described above.
  8. Each abstracted feature vector in the test set was presented to the ME model, which had to classify the abstracted preposition contexts into one of 34 classes (one for each preposition type in the vocabulary)
    • agreement = 0.69
    • kappa = 0.64
  9. We examined the errors made by our classifier and discovered that many of these involved a situation where its highest ranked preposition differed from the second highest ranked by just a couple of percentage points, corresponding to contexts where two distinct prepositions are both grammatical
  10. We reran step 8, allowing the classifier to skip any cases where the difference between the probabilities of the first and second ranked prepositions was less than 0.60 (in other words, it responded only when it was confident of its decision). 50% of cases were skipped, and for the remainder:
    • agreement = 0.90
    • kappa = 0.88
  11. A second TEST SET was created, consisting of 2,000 preposition contexts from ESL test essays written by Chinese, Japanese and Russian native speakers. Some problems for the classifier involved:
    • misspelled words - the classifier was able to skip any context containing misspelled words
    • punctuation errors, e.g. missing commas - the classifier was allowed to skip contexts which our heuristics judged to be common sites of missing commas
    • antonyms - when the highest ranked preposition is an antonym of the one used by the student (e.g. with/without), the classifier was prevented from reporting an error
    • benefactives - when the highest ranked preposition for a benefactive "for person/organisation" is not "for", the classifier was prevented from reporting an error
  12. The classifier was run on the 2,000 extracted, abstracted preposition contexts, along with a filter which told it to skip contexts which were problematic, as in step 11
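
The skipping rule in step 10, the antonym filter in step 11 and the agreement/kappa figures can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy probability distributions and the single-entry antonym list are hypothetical, and it assumes that the reported kappa is Cohen's kappa and that an error is flagged whenever a confident top-ranked preposition differs from the one the writer used.

```python
from collections import Counter

# Hypothetical antonym list; the notes give only with/without as an example.
ANTONYMS = {frozenset(("with", "without"))}


def decide(probs, margin=0.60):
    """Return the top-ranked preposition, or None to skip the context.

    probs maps each candidate preposition to its estimated probability.
    The classifier answers only if first and second rank differ by >= margin.
    """
    ranked = sorted(probs.values(), reverse=True)
    best = max(probs, key=probs.get)
    return best if ranked[0] - ranked[1] >= margin else None


def flag_error(student_prep, probs, margin=0.60):
    """Simplified error report: confident, different, and not an antonym."""
    best = decide(probs, margin)
    if best is None or best == student_prep:
        return False
    return frozenset((best, student_prep)) not in ANTONYMS


def agreement_and_kappa(gold, predicted):
    """Observed agreement and Cohen's kappa over paired labels."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Chance agreement: product of the two marginal label distributions.
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    if expected == 1:  # both sides use a single identical label
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Toy test set: (preposition the writer used, model probabilities).
contexts = [
    ("in", {"in": 0.85, "on": 0.10, "at": 0.05}),
    ("on", {"on": 0.48, "in": 0.45, "at": 0.07}),  # near tie, so skipped
    ("at", {"at": 0.80, "in": 0.15, "on": 0.05}),
    ("on", {"in": 0.90, "on": 0.06, "at": 0.04}),  # confident but wrong
    ("on", {"on": 0.82, "in": 0.10, "at": 0.08}),
]

answered = [(gold, decide(p)) for gold, p in contexts]
answered = [(gold, pred) for gold, pred in answered if pred is not None]
skipped = len(contexts) - len(answered)

observed, kappa = agreement_and_kappa(
    [gold for gold, _ in answered], [pred for _, pred in answered]
)
print(f"skipped={skipped} agreement={observed:.2f} kappa={kappa:.2f}")
```

On this toy data one of the five contexts is skipped; agreement and kappa are then computed over the four answered contexts only, which mirrors how the figures in step 10 exclude the skipped cases.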

-- MarkMcConville - 09 Sep 2008

 