I am a new Mallet user and have started with the latest stable version, 2.0.8. My task is to write a sequence tagger.
This is the code:
ArrayList<Pipe> pipes = new ArrayList<>();
pipes.add(new SaveDataInSource());
pipes.add(new CharSequence2TokenSequence());
pipes.add(new TokenTextCharPrefix("prefix1=", 1));
pipes.add(new TokenTextCharPrefix("prefix2=", 2));
pipes.add(new TokenTextCharSuffix("suffix1=", 1));
pipes.add(new TokenTextCharSuffix("suffix2=", 2));
pipes.add(new TokenText("word="));
pipes.add(new RegexMatches("CAPITALIZED", Pattern.compile("^\\p{Lu}.*")));
pipes.add(new RegexMatches("STARTSNUMBER", Pattern.compile("^[0-9].*")));
pipes.add(new RegexMatches("HYPHENATED", Pattern.compile(".*\\-.*")));
pipes.add(new TokenTextCharNGrams("bigram=", new int[] {2}));
pipes.add(new TokenTextCharNGrams("trigram=", new int[] {3}));
pipes.add(new MyTargetTagger());
pipes.add(new PrintTokenSequenceFeatures());
pipes.add(new TokenSequence2FeatureVectorSequence());
String[] str = new String[] {
    "this is the first sentence John how are you",
    "this is the second sentence Maria how are you",
    "this is the third sentence Will how are you"
};
Pipe pipe = new SerialPipes(pipes);
InstanceList trainingInstances = new InstanceList(pipe);
trainingInstances.addThruPipe(new ArrayIterator(str));
CRF crf = new CRF(pipe, null);
crf.addStatesForThreeQuarterLabelsConnectedAsIn(trainingInstances);
crf.addStartState();
Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you ?",null,null,null));
System.out.println(r.getData().toString());
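To make the three RegexMatches features above easier to check by hand, here is a minimal sketch in plain Java that applies the same patterns to a single token. It assumes a feature fires only when the pattern matches the whole token; the class RegexFeatureSketch and the method features are hypothetical names for illustration and are not part of Mallet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexFeatureSketch {
    // Same patterns as in the pipe list above.
    private static final Pattern CAPITALIZED = Pattern.compile("^\\p{Lu}.*");
    private static final Pattern STARTSNUMBER = Pattern.compile("^[0-9].*");
    private static final Pattern HYPHENATED = Pattern.compile(".*\\-.*");

    // Hypothetical helper: returns the names of the patterns that match the whole token.
    static List<String> features(String token) {
        List<String> fired = new ArrayList<>();
        if (CAPITALIZED.matcher(token).matches()) {
            fired.add("CAPITALIZED");
        }
        if (STARTSNUMBER.matcher(token).matches()) {
            fired.add("STARTSNUMBER");
        }
        if (HYPHENATED.matcher(token).matches()) {
            fired.add("HYPHENATED");
        }
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(features("John"));       // [CAPITALIZED]
        System.out.println(features("this"));       // []
        System.out.println(features("well-known")); // [HYPHENATED]
    }
}
```

With these patterns, the only regex feature that can fire on my three training sentences is CAPITALIZED, on the name tokens.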
As you can see, I have used a new Pipe (MyTargetTagger) that has this code:
public Instance pipe (Instance carrier)
{
    TokenSequence ts = (TokenSequence) carrier.getData();
    LabelSequence labelSeq = new LabelSequence(getTargetAlphabet());
    // One label per token: PERSON for the three known names, O otherwise.
    for (int i = 0; i < ts.size(); i++) {
        if (ts.get(i).getText().equals("John")) {
            labelSeq.add("PERSON");
        } else if (ts.get(i).getText().equals("Maria")) {
            labelSeq.add("PERSON");
        } else if (ts.get(i).getText().equals("Will")) {
            labelSeq.add("PERSON");
        } else {
            labelSeq.add("O");
        }
    }
    System.out.print(labelSeq.toString());
    carrier.setTarget(labelSeq);
    // A Pipe must hand the (modified) instance on to the next pipe.
    return carrier;
}
It is stupid, I know, but it is only a test to understand how the target labels are interpreted. The labels for the three sentences are identical (obviously):
0: O (0)
1: O (0)
2: O (0)
3: O (0)
4: O (0)
5: PERSON (1)
6: O (0)
7: O (0)
8: O (0)
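The labeling rule can also be checked outside Mallet. This is a minimal sketch in plain Java, assuming simple whitespace tokenization; the class LabelRuleSketch and the method labelTokens are hypothetical names for illustration and only mimic what MyTargetTagger does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class LabelRuleSketch {
    // The three names that MyTargetTagger labels as PERSON.
    private static final Set<String> PERSON_NAMES = Set.of("John", "Maria", "Will");

    // Hypothetical helper: returns one label per whitespace-separated token.
    static List<String> labelTokens(String sentence) {
        List<String> labels = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            labels.add(PERSON_NAMES.contains(token) ? "PERSON" : "O");
        }
        return labels;
    }

    public static void main(String[] args) {
        // prints [O, O, O, O, O, PERSON, O, O, O]
        System.out.println(labelTokens("this is the first sentence John how are you"));
    }
}
```

The printed list has nine labels with PERSON at index 5, which agrees with the label sequence shown above.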
As you can see, I also added pipes.add(new PrintTokenSequenceFeatures());
this is the output:
First sentence:
name: array:0
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t
O bigram=is word=is suffix1=s prefix1=i
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t
O trigram=rst trigram=irs trigram=fir bigram=st bigram=rs bigram=ir bigram=fi word=first suffix2=st suffix1=t prefix2=fi prefix1=f
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s
PERSON trigram=ohn trigram=Joh bigram=hn bigram=oh bigram=Jo CAPITALIZED word=John suffix2=hn suffix1=n prefix2=Jo prefix1=J
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
Second sentence:
name: array:1
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t
O bigram=is word=is suffix1=s prefix1=i
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t
O trigram=ond trigram=con trigram=eco trigram=sec bigram=nd bigram=on bigram=co bigram=ec bigram=se word=second suffix2=nd suffix1=d prefix2=se prefix1=s
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s
PERSON trigram=ria trigram=ari trigram=Mar bigram=ia bigram=ri bigram=ar bigram=Ma CAPITALIZED word=Maria suffix2=ia suffix1=a prefix2=Ma prefix1=M
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
Third sentence:
name: array:2
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t
O bigram=is word=is suffix1=s prefix1=i
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t
O trigram=ird trigram=hir trigram=thi bigram=rd bigram=ir bigram=hi bigram=th word=third suffix2=rd suffix1=d prefix2=th prefix1=t
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s
PERSON trigram=ill trigram=Wil bigram=ll bigram=il bigram=Wi CAPITALIZED word=Will suffix2=ll suffix1=l prefix2=Wi prefix1=W
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
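The bigram= and trigram= features in the listings above are the character n-grams of each token. The following is a minimal sketch in plain Java of that extraction, written only to cross-check the printed features; CharNGramSketch and charNGrams are hypothetical names, and this is not Mallet's TokenTextCharNGrams implementation.

```java
import java.util.ArrayList;
import java.util.List;

class CharNGramSketch {
    // Hypothetical helper: returns every substring of length n, left to right.
    // Tokens shorter than n produce an empty list.
    static List<String> charNGrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("this", 2)); // [th, hi, is]
        System.out.println(charNGrams("this", 3)); // [thi, his]
        System.out.println(charNGrams("is", 3));   // []
    }
}
```

This agrees with the output above: "this" gets three bigrams and two trigrams, while "is" gets a single bigram and no trigram.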
When I do:
Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you",null,null,null));
System.out.println(r.getData().toString());
to see how the model performs on a new instance, the output is:
PERSON O PERSON O PERSON O PERSON O
Why do I get this output?
I know that I need a lot more data to train my model better, of course, but I would like to know if there are problems with my code.
Thank you so much!