Using Stanford CoreNLP/NER to extract titles (of books, articles, etc)?

Question

Asked by Jess on 06 December 2013 at 02:21 · 1.2k views

Is there some sequence of tags that could indicate a title on a webpage? For example, I want to extract the title of a book from its Amazon page, where other text and sentences may have similar structures. I feel like this is an extremely fundamental task, but I cannot figure out exactly how to do it with Stanford's NER/CoreNLP.

Thanks in advance!

There are 2 answers

**Vineet Kosaraju** · Answer 1 · 2013-12-06T03:14:50+00:00

Here is a solution that does not use the CoreNLP library: if you are looking for a title on a webpage, why not parse the <title> tag?

For example, the title of the Amazon book page for The Hunger Games (http://www.amazon.com/Hunger-Games-Trilogy-Boxset/dp/0545626382/ref=sr_1_2?s=books&ie=UTF8&qid=1386299491&sr=1-2&keywords=hunger+games) is:

The Hunger Games Trilogy Boxset: Suzanne Collins: 9780545626385: Amazon.com: Books

Of course, the contents of the title tag depend on the website: they may describe the specific page, or they may just be the generic title of the overall site.
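As a rough illustration of this idea, here is a minimal Python sketch that reads the <title> tag using only the standard library's `html.parser`. The helper names `extract_title` and `book_title_from_page_title` are made up for this example, and the assumption that the book title is the first field before ": " comes only from the one Amazon page title shown above. It would break on book titles that themselves contain ": ", so treat it as a starting point rather than a general rule.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> element is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Join the collected chunks and collapse runs of whitespace.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the text of the page's <title> tag ('' if there is none)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def book_title_from_page_title(page_title):
    """Hypothetical site-specific rule: take the first ': '-separated field.

    Assumption: the page title looks like the Amazon example above,
    i.e. 'Book Title: Author: ISBN: Site: Category'.
    """
    return page_title.split(": ")[0]


page = (
    "<html><head><title>The Hunger Games Trilogy Boxset: Suzanne Collins: "
    "9780545626385: Amazon.com: Books</title></head><body></body></html>"
)
print(book_title_from_page_title(extract_title(page)))
```

Fetching the page itself is left out on purpose; any HTTP client can supply the `html` string.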

**Yasen** · Answer 2 · 2013-12-30T20:56:10+00:00

Detecting a sequence of HTML tags is not really an NLP problem; see web scraping. You can write a set of rules (regex, XQuery, etc.) to detect the titles in your specific corpus. PDFs and other document formats also have some sort of markup you can exploit; see the Apache Tika parser.
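To make the "set of rules" idea concrete, here is a small Python sketch of one regex rule. The markup it targets, an `<h1 class="book-title">` element, is purely hypothetical and stands in for whatever element actually holds the title in your corpus; you would write one such rule per site or document template.

```python
import re

# Hypothetical rule: in this imagined corpus, the title sits in
# <h1 class="book-title">...</h1>. Replace with the markup you actually see.
TITLE_RULE = re.compile(
    r'<h1\s+class="book-title"\s*>(.*?)</h1>',
    re.IGNORECASE | re.DOTALL,
)


def title_by_rule(html):
    """Return the title matched by TITLE_RULE, or None if the rule does not fire."""
    match = TITLE_RULE.search(html)
    if match is None:
        return None
    # Drop any nested tags, then collapse runs of whitespace.
    inner = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(inner.split())


sample = '<div><h1 class="book-title">\n  The <em>Hunger</em> Games\n</h1></div>'
print(title_by_rule(sample))
```

Regexes like this are brittle against markup changes; for anything beyond a fixed template, a real HTML parser with XPath or CSS selectors is usually the sturdier choice.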

For scientific articles, you can usually count on the title being the first thing before a couple of newlines, or something like that.
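That heuristic could be sketched in Python roughly as follows, assuming the article has already been converted to plain text (for example by a PDF-to-text tool). The function name `first_block_title` is made up for this example, and the rule is only as reliable as the text extraction in front of it.

```python
import re


def first_block_title(text):
    """Guess the title as the first block of text before a blank line.

    Assumption: the title is the first non-empty block in the extracted
    text and is separated from the rest by at least one blank line.
    """
    # Split on blank lines (a newline, optional whitespace, another newline).
    blocks = re.split(r"\n\s*\n", text.strip())
    # A title may wrap over several lines, so collapse internal whitespace.
    return " ".join(blocks[0].split())


article = (
    "\n\nA Study of\nTitle Extraction\n\n"
    "Jane Doe\n\n"
    "Abstract. We look at ...\n"
)
print(first_block_title(article))
```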
