Unsupervised Adaptation

Data

  • PTB Gold.
  • BIO unlabelled.

Requirements

  • Script to randomly add a domain tag to each dependency label: add_domains(G,X)
  • Script to fix the domain tags on dependency labels: fix_domains(G,X)
  • Script to filter out system dependencies that are incorrect compared to a gold standard: filter
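
A rough Python sketch of these helpers, assuming a tab-separated CoNLL-style format with the dependency label in a fixed column (the column indices, the "-" tag separator, and the filter_deps name are assumptions, not the actual scripts):

  import random

  LABEL_COL = 7                  # assumed position of the dependency-label column
  ALL_DOMAINS = {"G", "N", "B"}  # General, News, Bio

  def add_domains(domains, lines):
      """Append a randomly chosen tag from `domains` to every dependency
      label, e.g. SBJ -> SBJ-N."""
      out = []
      for line in lines:
          cols = line.rstrip("\n").split("\t")
          if len(cols) > LABEL_COL:  # skip blank sentence separators
              cols[LABEL_COL] += "-" + random.choice(domains)
          out.append("\t".join(cols))
      return out

  def fix_domains(domains, lines):
      """Replace any domain tag outside `domains` with a random tag from
      `domains` (e.g. an N tag on BIO output becomes G or B)."""
      out = []
      for line in lines:
          cols = line.rstrip("\n").split("\t")
          if len(cols) > LABEL_COL:
              label, _, dom = cols[LABEL_COL].rpartition("-")
              if dom in ALL_DOMAINS and dom not in domains:
                  cols[LABEL_COL] = label + "-" + random.choice(domains)
          out.append("\t".join(cols))
      return out

  def filter_deps(system_cols, gold_cols, head_col=6):
      """Revert a token's system head to the gold head when they disagree.
      The real filter also reports how many parses and domain tags it
      touched; that bookkeeping is omitted here."""
      fixed = list(system_cols)
      if fixed[head_col] != gold_cols[head_col]:
          fixed[head_col] = gold_cols[head_col]
      return fixed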

Initial Procedure

  1. PTB -> add_domains(G,N) -> PTB-D
  2. PTB-D -> train parser
  3. BIO -> run parser -> BIO-D-noisy
  4. BIO-D-noisy -> fix_domains(G,B) -> BIO-D

This gives us the following data:

  • PTB-D: gold dependencies with randomly assigned domain labels.
  • BIO-D: system dependencies with a mix of system and randomly assigned domain labels.
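
Put together, the initial procedure might look like the following driver; train() and parse() stand in for hypothetical wrappers around the mstparser command line, and read_lines() for file I/O:

  # Step 1: randomly tag the gold PTB with General (G) or News (N).
  ptb_d = add_domains(["G", "N"], read_lines("ptb.conll"))

  # Step 2: train a parser on the domain-tagged PTB.
  model = train(ptb_d)

  # Step 3: parse the unlabelled BIO data; its domain tags will be noisy.
  bio_d_noisy = parse(model, read_lines("bio.conll"))

  # Step 4: force every BIO domain tag to be General (G) or Bio (B).
  bio_d = fix_domains(["G", "B"], bio_d_noisy)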

Iterative Phase

  1. PTB-D + BIO-D -> train parser
  2. PTB + BIO -> run parser -> PTB-D-noisy + BIO-D-noisy
  3. PTB-D-noisy -> fix_domains(G, N) -> PTB-D
  4. PTB-D + PTB-Gold -> filter -> PTB-D
  5. BIO-D-noisy -> fix_domains(G, B) -> BIO-D
  6. If not converged: goto step 1.
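
The iterative phase is then a loop over the same hypothetical wrappers; the convergence test used here (the domain-tag counts stop changing) is one possible choice, not something fixed by these notes:

  prev = None
  while True:
      # Steps 1-2: retrain on the union, reparse both sources.
      model = train(ptb_d + bio_d)
      ptb_d_noisy = parse(model, read_lines("ptb.conll"))
      bio_d_noisy = parse(model, read_lines("bio.conll"))

      # Steps 3-4: restrict PTB tags to G/N, then revert wrong edges
      # to the gold standard with the filter script.
      ptb_d = fix_domains(["G", "N"], ptb_d_noisy)
      ptb_d = filter_parses(ptb_d, read_lines("ptb.conll"))  # hypothetical wrapper

      # Step 5: restrict BIO tags to G/B.
      bio_d = fix_domains(["G", "B"], bio_d_noisy)

      # Step 6: stop once the domain-tag distribution is stable.
      counts = domain_counts(ptb_d + bio_d)  # hypothetical tally
      if counts == prev:
          break
      prev = counts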

Alternatively

  1. Define a loss function l_d1(gold-structure, guessed-gold, guess) for the domain d1 for which we have gold data (penalizes out-of-domain labels and wrong structure).
  2. Define a loss function l_d2(guessed-gold, guess) for the domain without gold data (only penalizes out-of-domain labels).
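
One way to write these losses down concretely; the edge decomposition, the trade-off weight \lambda, and the Iverson brackets [[.]] are assumptions, with y the gold structure, \hat{y} the guess, and edges written as (head, modifier, label) triples:

  \ell_{d_1}(y, \hat{y}_g, \hat{y}) =
      \sum_{(h,m,r) \in \hat{y}} \Big( [\![ (h,m) \notin y ]\!]
          + \lambda \, [\![ \mathrm{dom}(r) \neq d_1 ]\!] \Big)

  \ell_{d_2}(\hat{y}_g, \hat{y}) =
      \lambda \sum_{(h,m,r) \in \hat{y}} [\![ \mathrm{dom}(r) \neq d_2 ]\!]

In this reading the guessed gold \hat{y}_g does not enter either sum; it would only matter if we also wanted to anchor the d2 structure to the previous iteration's output, which the notes leave open.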

Preliminary Results

We trained on the ptb-pbiotb data with General, News and Bio domain tags added randomly, using mstparser-0.4.

                      General     News      Bio
  PTB                  112175   112223        0
  Bio TB                35682        0    11568
  Initial              147857   112223    11568
  Iteration 1 noise    207354    63264     1030
  Iteration 1 fixed    209201    58799     3648

Filter output for iteration 1:

Parses fixed: 2708
Domains added: 2708
Domains changed: 5244

New Direction

  • Train on PTB with label N.
  • Run PTB model on PBIOTB.
  • Relabel PBIOTB with label B.
  • Combine PTB and PBIOTB data.
  • Train model on gold PTB and news gold PBIOTB.
  • Run model on training data.
  • Compare unlabelled accuracy of output on our gold PTB and news gold PBIOTB.

We want to see high accuracy on the PTB and low accuracy on the PBIOTB at first. Low PBIOTB accuracy would mean that edges in our fool's-gold PBIOTB have changed, i.e. we are not creating the same edges the baseline parser (trained on PTB) would create. If the accuracy of the output against the news gold PBIOTB is low, take the output on PBIOTB to be our fool's gold.
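
A minimal sketch of the unlabelled-accuracy check (representing a sentence as a list of head indices is an assumption):

  def unlabelled_accuracy(system_heads, gold_heads):
      """Fraction of tokens whose predicted head matches the reference
      head; dependency and domain labels are ignored."""
      assert len(system_heads) == len(gold_heads)
      correct = sum(s == g for s, g in zip(system_heads, gold_heads))
      return correct / len(gold_heads)

  # The pattern we want at first:
  # unlabelled_accuracy(output_ptb, gold_ptb)            -> high
  # unlabelled_accuracy(output_pbiotb, news_gold_pbiotb) -> low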

Preliminary Results

Trained on the ptb-pbiotb data set using two domain tags (news and bio). Ran the resulting model (ptb.1of2.pbiotb.1of2.25.iteration0.model, using mstparser-0.2) on the training data again. The edges did not change much at all, and almost all the labels were given the news domain; only 1003 bio domain labels remained.

Possible reasons:

  • Training data order. We have so much news data first that the bio data can never overcome the strong news weights.
    • Try training on the pbiotb-ptb order instead.
    • Randomise the data (we would have to reconstruct the original order afterwards for comparisons; see the sketch after this list).
  • mstparser-0.2 doesn't have many labelled features, so most of the weight goes onto a set of general features.
    • Add labels to these general features; the labels can just be domain labels (they don't have to be dependency labels).
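
A sketch of the randomisation with the bookkeeping needed to restore the original order (function names are illustrative):

  import random

  def shuffle_with_permutation(sentences, seed=0):
      """Shuffle the sentences, returning the permutation so the
      original order can be reconstructed for later comparisons."""
      order = list(range(len(sentences)))
      random.Random(seed).shuffle(order)
      return [sentences[i] for i in order], order

  def restore_order(shuffled, order):
      """Invert the permutation produced by shuffle_with_permutation."""
      restored = [None] * len(shuffled)
      for pos, i in enumerate(order):
          restored[i] = shuffled[pos]
      return restored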

Ran using the pbiotb-ptb order instead. Result on the training set again: 1282 bio labels; the rest are news.

