NLP of Legal Texts?

1.7k views Asked by At

I have a corpus of a few 100-thousand legal documents (mostly from the European Union) – laws, commentary, court documents etc. I am trying to algorithmically make some sense of them.

I have modeled the known relationships (temporal, this-changes-that, etc). But on the single-document level, I wish I had better tools to allow fast comprehension. I am open for ideas, but here's a more specific question:

For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.

I played around with google's new Parsey McParface, and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.

3

There are 3 answers

0
Nandadeep On

I see you have an interesting usecase. You've also mentioned the presence of a corpus(which a really good plus). Let me relate a solution that I had sketched for extracting the crux from research papers.

To make sense out of documents, you need triggers to tell(or train) the computer to look for these "triggers". You can approach this using a supervised learning algorithm with a simple implementation of a text classification problem at the most basic level. But this would need prior work, help from domain experts initially for discerning "triggers" from the textual data. There are tools to extract gists of sentences - for example, take noun phrases in a sentence, assign weights based on co-occurences and represent them as vectors. This is your training data. This can be a really good start to incorporating NLP into your domain.

0
AudioBubble On

Don't use triggers. What you need is a word sense disambiguation and domain adaptation. You want to make sense of is in the documents i.e understand the semantics to figure out the meaning. You can build a legal ontology of terms in skos or json-ld format represent it ontologically in a knowledge graph and use it with dependency parsing like tensorflow/parseymcparseface. Or, you can stream your documents in using a kappa based architecture - something like kafka-flink-elasticsearch with added intermediate NLP layers using CoreNLP/Tensorflow/UIMA, cache your indexing setup between flink and elasticsearch using redis to speed up the process. To understand relevancy you can apply specific cases from boosting in your search. Furthermore, apply sentiment analysis to work out intents and truthness. Your use case is one of an information extraction, summarization, and semantic web/linked data. As EU has a different legal system you would need to generalize first on what is really a legal document then narrow it down to specific legal concepts as they relate to a topic or region. You could also use here topic modelling techniques from LDA or Word2Vec/Sense2Vec. Also, Lemon might also help from converting lexical to semantics and semantics to lexical i.e NLP->ontology ->ontology->NLP. Essentially, feed the clustering into your classification of a named entity recognition. You can also use the clustering to assist you in building out the ontology or seeing what word vectors are in a document or set of documents using cosine similarity. But, in order to do all that it be best to visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.

0
Gabriel M On

In order to make sense out of documents you need to perform some sort of semantic analysis. You have two main possibilities with their exemples:

Use Frame Semantics: http://www.cs.cmu.edu/~ark/SEMAFOR/

Use Semantic Role Labeling (SRL): http://cogcomp.org/page/demo_view/srl

Once you are able to extract information from the documents then you may apply some post-processing to determine which information is relevant. Finding which information is relevant is task related and I don't think you can find a generic tool that extracts "the relevant" information.