Using Stanford CoreNLP/NER to extract titles (of books, articles, etc.)?

Is there some sequence of tags that could indicate a title on a webpage? For example, extracting the title of a book from its Amazon page, where other text/sentences may have similar sentence structures. I feel like this is an extremely fundamental task, but I cannot figure out exactly how to do it with Stanford's NER/CoreNLP.

Thanks in advance!

There are 2 answers

1
Vineet Kosaraju On

Here is a solution that does not use the CoreNLP library at all: if you are looking for the title of a webpage, why not parse the <title> tag?

For example, the <title> of the Amazon book page for the Hunger Games trilogy (http://www.amazon.com/Hunger-Games-Trilogy-Boxset/dp/0545626382/ref=sr_1_2?s=books&ie=UTF8&qid=1386299491&sr=1-2&keywords=hunger+games) is:

The Hunger Games Trilogy Boxset: Suzanne Collins: 9780545626385: Amazon.com: Books

Of course, the content of the title tag depends on the website: it may describe the specific page, or it may just be the generic name of the overall site.
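As a rough illustration of the idea, here is a minimal Python sketch that pulls the <title> text out of an HTML string using only the standard library. The names TitleParser, extract_title and guess_book_title are made up for this example, and the ": " split is purely an assumption based on the Amazon title shown above; it will cut off a book title that itself contains ": ".

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts (inline SVG can contain others).
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace and newlines into single spaces.
            self.title = " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def guess_book_title(page_title):
    # ASSUMPTION: the site formats titles as "Book: Author: ISBN: Site: ...".
    # This is site-specific and breaks if the book title contains ": ".
    return page_title.split(": ")[0].strip()
```

You would fetch the page yourself (for example with urllib.request) and pass the decoded HTML to extract_title; guess_book_title then only makes sense for sites whose title format you have checked by hand.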

0
Yasen On

Detecting a sequence of HTML tags is not really an NLP problem; see web scraping. You can write a set of rules (regex, XQuery, etc.) to detect the titles in your specific corpus. PDFs and other document formats also have some sort of markup or metadata you can exploit; see the Apache Tika parser.
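To make the "set of rules" idea concrete, here is a small Python sketch that tries a few regex rules in order and returns the first non-empty match. The three patterns (og:title meta tag, first h1, then title) are only guesses at what a corpus might contain, and title_by_rules is a made-up name; regexes on HTML are brittle, so treat this as a starting point for one known site layout rather than a general parser.

```python
import re
from html import unescape

# ASSUMPTION: these rules are illustrative; order them by how much you
# trust each source of the title in your own corpus.
TITLE_RULES = [
    re.compile(r'<meta\s+property="og:title"\s+content="([^"]*)"', re.I),
    re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
    re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
]
TAG = re.compile(r"<[^>]+>")


def title_by_rules(html):
    """Return the text captured by the first rule that matches, else None."""
    for rule in TITLE_RULES:
        match = rule.search(html)
        if match:
            # Drop nested tags, decode entities, collapse whitespace.
            text = TAG.sub("", match.group(1))
            text = " ".join(unescape(text).split())
            if text:
                return text
    return None
```

Trying the most specific rule first and falling back to the generic <title> keeps the site-specific knowledge in one list that is easy to adjust per corpus.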

For scientific articles, you can usually count on the title being the first block of text, followed by a couple of newlines, or something like that.
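A quick Python sketch of that heuristic, assuming you already have the article as plain text (for instance from a text extractor): take the first run of non-empty lines and stop at the first blank line. The function name first_block_title and the max_lines cutoff are my own assumptions, added so that a long opening paragraph is not mistaken for a title.

```python
def first_block_title(text, max_lines=3):
    """Return the first non-empty block of lines joined into one string.

    Returns None if the text is empty or the first block is longer than
    max_lines (ASSUMPTION: real titles rarely wrap over more lines).
    """
    block = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            block.append(stripped)
        elif block:
            # First blank line after the block ends the title candidate.
            break
    if not block or len(block) > max_lines:
        return None
    return " ".join(block)
```

For a made-up input such as "A Study of Title\nExtraction Heuristics\n\nJane Doe\n\nAbstract", the first block is the two wrapped title lines. Journals that print a running header or a logo line first will defeat this, so check a sample of your corpus before relying on it.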