English word segmentation in NLP?


I am new to the NLP domain, but my current research needs some text parsing (also called keyword extraction) from URL addresses, e.g. this fake URL:

http://ads.goole.com/appid/heads

Two constraints are placed on my parsing:

  1. The first "ads" and the last "heads" should be kept distinct, because the "ads" inside "heads" is just part of a suffix rather than an advertisement.

  2. The "appid" can be parsed into two parts; that is 'app' and 'id', both taking semantic meanings on the Internet.

I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word by its grammatical meaning, which falls short of my expectation. Google is smarter about "appid": it suggests "app id".

I cannot tap into Google's search history, which is presumably how it knows to suggest "app id" (many people have searched for these words). Are there offline methods that can perform similar parsing?


UPDATE:

Please skip regex suggestions, because there is a potentially unknown number of word compositions like "appid" in even simple URLs.

Thanks,

Jamin


There are 2 answers

Frank Riccobono (BEST ANSWER)

Rather than tokenization, what it sounds like you really want to do is called word segmentation. It is, for example, a way to make sense of asentencethathasnospaces.

I haven't gone through this entire tutorial, but it should get you started. They even give URLs as a potential use case.

http://jeremykun.com/2012/01/15/word-segmentation/
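
To give a sense of what that tutorial builds, here is a minimal sketch of the underlying idea: score every possible split of a token by word frequency and keep the best one. The frequency table and the names WORD_COUNTS, word_score, and segment are my own illustrative stand-ins, not the tutorial's actual code or data.

from functools import lru_cache
from math import log10

# Hypothetical unigram counts; a real segmenter loads these from a large corpus.
WORD_COUNTS = {
    'ads': 300, 'app': 400, 'id': 350, 'appid': 1,
    'heads': 200, 'head': 250, 'a': 900, 'pp': 2,
}
TOTAL = sum(WORD_COUNTS.values())

def word_score(word):
    # Log-probability of a single word; unknown words are penalized by length.
    if word in WORD_COUNTS:
        return log10(WORD_COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    # Best split of `text` into words, ranked by summed log-probability.
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=lambda words: sum(word_score(w) for w in words))

print(segment('appid'))  # ('app', 'id')
print(segment('heads'))  # ('heads',)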

GrantJ

The Python wordsegment module can do this. It's an Apache2-licensed module for English word segmentation, written in pure Python, and based on a trillion-word corpus.

It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.

Installation is easy with pip:

$ pip install wordsegment

Simply call segment to get a list of words:

>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']

As you noticed, the old corpus doesn't rank "app id" very highly. That's OK; we can easily teach it. Simply add the phrase to the bigram_counts dictionary:

>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']

I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.
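
One caveat: newer releases of wordsegment reorganized the API, so the snippets above may need adjusting. As far as I can tell, you now call load() before segmenting, and the frequency tables are exposed as UNIGRAMS and BIGRAMS rather than bigram_counts. A rough equivalent under that newer API would be:

>>> import wordsegment as ws
>>> ws.load()  # newer versions require loading the corpus data first
>>> ws.BIGRAMS['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')  # should give the same segmentation as above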