Detecting the start and end of dialog sections in prose


I have looked through a lot of the open source NLP tools (OpenNLP primarily) and I do not see anything which automates the task of detecting the start and end of dialog.

The sentence detection tools find the boundaries of the full sentence. The tokenizers accurately tokenize the punctuation, but still don't detect start and end. I've read many scholarly articles (such as) where dialog detection is assumed. But I don't see any tools which automate this as general purpose dialog detection.

For instance, text like this:

"I am happy," she said.

Should have "I am happy," defined as dialog. Text like this:

"This is a really long piece of dialog spoken by a character.

"That spans across multiple paragraphs."

Should have the whole thing identified as dialog (even though the end of the first paragraph is missing the closing quotation mark). Also there are weirder ways of specifying dialog. Such as with dashes:

They were walking when Joe spoke up.
--I really like walking.

Plus, often internal dialog will be denoted with italics, such as:

Joe walked down the street. *I really hope I don't get hit by a bus.*

Is there an NLP tool that can detect dialog sections like this? Or a way to do this with OpenNLP that I just missed?


There are 2 answers

1
Igor On

I'm not aware of any tool that does this out of the box in a domain-independent way. For specific domains you could probably train something. In a call transcript, for example, you will most likely have an A-B-A-B (etc.) structure, where two people take turns talking. But when more people are involved in a dialog, things get a lot more complicated. Whether you can do this with orthographic features (like single/double quotes) also depends on whether the people who constructed your text/corpus marked dialog in a neat and consistent way.
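To make the orthographic route concrete, here is a rough Python sketch covering the three conventions from the question: double quotes (including a quotation that continues into a following paragraph), lines that open with a dash, and `*italic*` internal dialog. It is only a heuristic under stated assumptions, not a general-purpose detector: it assumes one paragraph per line, double quotation marks only, and asterisks for italics. The function name `find_dialog` and the labels `quote`, `dash`, and `thought` are made up for this example.

```python
import re
from typing import List, Optional, Tuple

# Assumption: dialog uses straight or curly double quotes only.
QUOTE_CHARS = '"\u201c\u201d'
# Assumption: dash-style dialog starts a line with "--", an em dash, or an en dash.
DASH_LINE = re.compile(r"^(?:--|\u2014|\u2013)\s*(.+)$")
# Assumption: italics are written as *text* (Markdown-like).
ITALIC = re.compile(r"\*([^*\n]+)\*")


def find_dialog(text: str) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs; kind is 'quote', 'dash', or 'thought'."""
    spans: List[Tuple[str, str]] = []
    pending: Optional[str] = None  # text of a quotation that is still open

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate paragraphs

        dash = DASH_LINE.match(line)
        if dash:
            spans.append(("dash", dash.group(1)))
            continue

        marks = [i for i, ch in enumerate(line) if ch in QUOTE_CHARS]

        if pending is not None:
            if marks and marks[0] == 0:
                # The paragraph re-opens the quotation: skip that mark.
                marks = marks[1:]
                if marks:
                    # The next mark closes the multi-paragraph quotation.
                    spans.append(("quote", pending + " " + line[1:marks[0]]))
                    pending = None
                    marks = marks[1:]
                else:
                    # Still no closing mark: keep accumulating.
                    pending += " " + line[1:]
                    continue
            else:
                # Never closed and not continued: emit what we have.
                spans.append(("quote", pending))
                pending = None

        # Pair the remaining marks as open/close.
        for start, end in zip(marks[0::2], marks[1::2]):
            spans.append(("quote", line[start + 1:end]))
        if len(marks) % 2 == 1:
            # An unpaired mark opens a quotation that may continue below.
            pending = line[marks[-1] + 1:]

        spans.extend(("thought", m) for m in ITALIC.findall(line))

    if pending is not None:
        spans.append(("quote", pending))
    return spans
```

Calling `find_dialog('"I am happy," she said.')` yields a single `quote` span containing `I am happy,`. Hard-wrapped text, nested or single quotes, and inconsistent punctuation will all trip this up, which is exactly the "neat and consistent" caveat above.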

I recently stumbled upon a tool that does discourse parsing: http://alt.qcri.org/tools/discourse-parser/

This provides you with something called a rhetorical structure tree, which is a representation of the input document that clarifies which sentence has which relation to another sentence. I haven't tried it for dialogs and have no idea about performance there. But it relies on a somewhat more semantically aware way of cutting texts into pieces. Maybe that could help you. The tool is not as user-friendly as the CoreNLP/OpenNLP bunch though, and it requires (at least it did for me) quite some fiddling around to get up and running.

Anyway, that is probably (way) too much information. Short answer: as far as I know, there is no tool for this that is easy to set up and use.

0
CleverPatrick On

After some searching, it looks like the Stanford NLP tools have a "QuoteAnnotator" that is exactly what I am looking for.