How to train a sequence CRF model with Mallet

I am a new Mallet user and have started with the latest stable version, 2.0.8. My task is to code a sequence tagger.

This is the code:

ArrayList<Pipe> pipes = new ArrayList<>();

pipes.add(new SaveDataInSource());
pipes.add(new CharSequence2TokenSequence());
pipes.add(new TokenTextCharPrefix("prefix1=", 1));
pipes.add(new TokenTextCharPrefix("prefix2=", 2));  
pipes.add(new TokenTextCharSuffix("suffix1=", 1));
pipes.add(new TokenTextCharSuffix("suffix2=", 2));  
pipes.add(new TokenText("word="));  
pipes.add(new RegexMatches("CAPITALIZED", Pattern.compile("^\\p{Lu}.*")));
pipes.add(new RegexMatches("STARTSNUMBER", Pattern.compile("^[0-9].*")));
pipes.add(new RegexMatches("HYPHENATED", Pattern.compile(".*\\-.*")));                
pipes.add(new TokenTextCharNGrams("bigram=", new int[] {2}));                
pipes.add(new TokenTextCharNGrams("trigram=", new int[] {3}));                
pipes.add(new MyTargetTagger()); 
pipes.add(new PrintTokenSequenceFeatures()); 
pipes.add(new TokenSequence2FeatureVectorSequence()); 

String[] str = new String[] {
    "this is the first sentence John how are you",
    "this is the second sentence Maria how are you",
    "this is the third sentence Will how are you"
};                

Pipe pipe = new SerialPipes(pipes);

InstanceList trainingInstances = new InstanceList(pipe);
trainingInstances.addThruPipe(new ArrayIterator(str));            

CRF crf = new CRF(pipe, null);
crf.addStatesForThreeQuarterLabelsConnectedAsIn(trainingInstances);
crf.addStartState();

Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you ?",null,null,null));                
System.out.println(r.getData().toString());

As you can see, I have used a custom Pipe (MyTargetTagger) that has this code:

public Instance pipe (Instance carrier)
{            
    TokenSequence ts = (TokenSequence) carrier.getData();           
    LabelSequence labelSeq = new LabelSequence(getTargetAlphabet());

    for (int i = 0; i < ts.size(); i++) {       
        if (ts.get(i).getText().equals("John")) {
            labelSeq.add("PERSON");
        } else if (ts.get(i).getText().equals("Maria")) {
            labelSeq.add("PERSON");
        } else if (ts.get(i).getText().equals("Will")) {
            labelSeq.add("PERSON");
        } else {
            labelSeq.add("O");
        }
    }

    System.out.print(labelSeq.toString());

    carrier.setTarget(labelSeq);
    // A Pipe must hand the (modified) instance on to the next pipe.
    return carrier;
}

It is silly, I know, but it is only a test to understand how the target labels are interpreted. The labels of the three sentences are identical (obviously):

0: O (0)
1: O (0)
2: O (0)
3: O (0)
4: O (0)
5: PERSON (1)
6: O (0)
7: O (0)
8: O (0)
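
For reference, the labelling rule in MyTargetTagger can be reproduced without Mallet. This is only a minimal sketch in plain Java: the class name LabelRuleSketch and the whitespace split are assumptions made for illustration and are not part of Mallet or of the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Mallet class: mirrors the if/else chain in MyTargetTagger.
public class LabelRuleSketch {
    private static final Set<String> NAMES =
            new HashSet<>(Arrays.asList("John", "Maria", "Will"));

    // Assumption: tokens are separated by whitespace, which holds for the three training sentences.
    public static List<String> label(String sentence) {
        List<String> labels = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            labels.add(NAMES.contains(token) ? "PERSON" : "O");
        }
        return labels;
    }

    public static void main(String[] args) {
        // Prints the same label sequence as the listing above, one label per token.
        System.out.println(label("this is the first sentence John how are you"));
    }
}
```

Under this rule a name that is not in the hard-coded list, such as "Bruno", would be labelled O, so the targets only ever mark the three training names.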

As you can see, I also added pipes.add(new PrintTokenSequenceFeatures()); this is its output:

First sentence:

name: array:0
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t 
O bigram=is word=is suffix1=s prefix1=i 
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t 
O trigram=rst trigram=irs trigram=fir bigram=st bigram=rs bigram=ir bigram=fi word=first suffix2=st suffix1=t prefix2=fi prefix1=f 
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s 
PERSON trigram=ohn trigram=Joh bigram=hn bigram=oh bigram=Jo CAPITALIZED word=John suffix2=hn suffix1=n prefix2=Jo prefix1=J 
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h 
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a 
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y 

Second sentence:

name: array:1
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t 
O bigram=is word=is suffix1=s prefix1=i 
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t 
O trigram=ond trigram=con trigram=eco trigram=sec bigram=nd bigram=on bigram=co bigram=ec bigram=se word=second suffix2=nd suffix1=d prefix2=se prefix1=s 
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s 
PERSON trigram=ria trigram=ari trigram=Mar bigram=ia bigram=ri bigram=ar bigram=Ma CAPITALIZED word=Maria suffix2=ia suffix1=a prefix2=Ma prefix1=M 
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h 
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a 
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y

Third sentence:

name: array:2
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t 
O bigram=is word=is suffix1=s prefix1=i 
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t 
O trigram=ird trigram=hir trigram=thi bigram=rd bigram=ir bigram=hi bigram=th word=third suffix2=rd suffix1=d prefix2=th prefix1=t 
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s 
PERSON trigram=ill trigram=Wil bigram=ll bigram=il bigram=Wi CAPITALIZED word=Will suffix2=ll suffix1=l prefix2=Wi prefix1=W 
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h 
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a 
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
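
To make the printed features easier to check by hand, here is a rough plain-Java sketch that rebuilds the same feature names for a single token. It is not how Mallet implements these pipes: the class TokenFeatureSketch and its length conditions are assumptions read off the output above (for example, "is" gets prefix1 and suffix1 but no prefix2 or suffix2, while "the" does get trigram=the).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Mallet class: rebuilds the feature names seen in the printed output.
public class TokenFeatureSketch {
    private static final Pattern CAPITALIZED = Pattern.compile("^\\p{Lu}.*");
    private static final Pattern STARTSNUMBER = Pattern.compile("^[0-9].*");
    private static final Pattern HYPHENATED = Pattern.compile(".*\\-.*");

    public static List<String> features(String token) {
        List<String> f = new ArrayList<>();
        // Assumption from the output: an affix of length n appears only if the token is longer than n.
        for (int n = 1; n <= 2; n++) {
            if (token.length() > n) {
                f.add("prefix" + n + "=" + token.substring(0, n));
                f.add("suffix" + n + "=" + token.substring(token.length() - n));
            }
        }
        f.add("word=" + token);
        if (CAPITALIZED.matcher(token).matches()) f.add("CAPITALIZED");
        if (STARTSNUMBER.matcher(token).matches()) f.add("STARTSNUMBER");
        if (HYPHENATED.matcher(token).matches()) f.add("HYPHENATED");
        addNGrams(f, token, 2, "bigram=");
        addNGrams(f, token, 3, "trigram=");
        return f;
    }

    // Assumption from the output: every window of length n is emitted, repeats included
    // ("sentence" shows bigram=en twice), and a token of length exactly n yields one n-gram.
    private static void addNGrams(List<String> f, String token, int n, String name) {
        for (int i = 0; i + n <= token.length(); i++) {
            f.add(name + token.substring(i, i + n));
        }
    }

    public static void main(String[] args) {
        System.out.println(features("John"));
        System.out.println(features("Bruno"));
    }
}
```

Comparing the two printed lists suggests that, with this feature set, "Bruno" shares only the CAPITALIZED feature with the three training names; the order of features here differs from the order PrintTokenSequenceFeatures prints.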

When I do:

Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you",null,null,null));                
System.out.println(r.getData().toString()); 

to see the prediction for a new instance, the output is:

PERSON O PERSON O PERSON O PERSON O

Why do I get this output?

I know that I need a lot more data to train my model properly, of course, but I would like to know whether there are problems with my code.

Thank you so much!
