I am currently working on a natural language processing project in which I need to convert the unstructured bibliography section at the end of a research article into structured metadata such as "Year", "Author", "Journal", "Volume ID", "Page Number", "Title", etc.
For example, input:
McCallum, A.; Nigam, K.; and Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169–178
Expected output:
<Author> McCallum, A.</Author> <Author>Nigam, K.</Author> <Author>Ungar, L. H.</Author>
<Year> 2000 </Year>
<Title>Efficient clustering of high-dimensional data sets with application to reference matching</Title> and so on
Tool used: CRFsuite
Data-set: contains 12,000 references, with
- journal titles,
- words from article titles,
- location names.
Each word in a line is treated as a token, and for each token I derive the following features:
- BOR: the token is at the start of the line
- EOR: the token is at the end of the line
- digitFeature: the token is a digit
- Year: the token is in a year format such as 19** or 20**
- whether the token is available in the current data-set
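To make the feature list above concrete, here is a minimal Python sketch of how the per-token features could be computed. It is only an illustration, not the exact code used in the project: the function names `token_features` and `line_features` are made up, stripping surrounding punctuation before the digit and year checks is an assumption, and the data-set lookup feature is left out because it depends on the data.

```python
import re

# Year format from the feature list: 19** or 20** (four digits).
YEAR_RE = re.compile(r"^(19|20)\d{2}$")

def token_features(tokens, i):
    """Return a feature dict for the token at position i (hypothetical helper)."""
    token = tokens[i]
    # Assumption: ignore surrounding punctuation, so "(2000)." counts as a year.
    core = token.strip(".,;:()[]")
    return {
        "BOR": i == 0,                       # token starts the reference
        "EOR": i == len(tokens) - 1,         # token ends the reference
        "digitFeature": core.isdigit(),      # token consists only of digits
        "Year": bool(YEAR_RE.match(core)),   # token looks like 19** or 20**
    }

def line_features(line):
    """Split a reference line on whitespace and compute features per token."""
    tokens = line.split()
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]

reference = (
    "McCallum, A.; Nigam, K.; and Ungar, L. H. (2000). Efficient clustering of "
    "high-dimensional data sets with application to reference matching. "
    "In Knowledge Discovery and Data Mining, 169–178"
)
tokens, feats = line_features(reference)
for tok, f in zip(tokens, feats):
    active = [name for name, on in f.items() if on]
    print(tok, active)
```

Each feature dict corresponds to one item of a sequence that a CRF toolkit would be trained on, together with a label sequence of the same length.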
With the above tool and data-set I got only 63.7% accuracy. Accuracy is very low for "Title" and good for "Year" and "Volume".
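Since the accuracy differs so much between labels, it may help to break it down per label. The following is a small sketch of one way to do that, assuming the gold and predicted labels are available as flat lists of the same length; the function name `per_label_accuracy` and the toy label sequences are invented for illustration and are not results from the data-set.

```python
from collections import Counter

def per_label_accuracy(gold, predicted):
    """Fraction of tokens with each gold label that were predicted correctly."""
    total = Counter()
    correct = Counter()
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy example (made-up labels, not real output).
gold = ["Author", "Author", "Year", "Title", "Title", "Title"]
predicted = ["Author", "Author", "Year", "Title", "Author", "Journal"]
scores = per_label_accuracy(gold, predicted)
for label, acc in scores.items():
    print(f"{label}: {acc:.2f}")
```

A breakdown like this shows which labels the low-scoring tokens are being confused with, which can suggest what kind of feature is missing.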
Questions:
- Can I derive any additional features?
- Can I use any other tool?
I'd suggest basing your solution on existing approaches. Take a look, for example, at this paper:
sections 3.2 and 4.2 describe dozens of features.
As for CRF implementations, there are other tools, like this one, but I don't think the tool is the source of the low accuracy.