I am new in the NLP domain, but my current research needs some text parsing (or called keyword extraction) from URL addresses, e.g. a fake URL,
http://ads.goole.com/appid/heads
Two constraints are put on my parsing,
The first "ads" and last "heads" should be distinct because "ads" in the "heads" means more suffix rather than an advertisement.
The "appid" can be parsed into two parts; that is 'app' and 'id', both taking semantic meanings on the Internet.
I have tried the Stanford NLP toolkit and Google search engine. The former tries to classify each word in a grammar meaning which is under my expectation. The Google engine shows more smartness about "appid" which gives me suggestions about "app id".
I can not look over the reference of search history in Google search so that it gives me "app id" because there are many people have searched these words. Can I get some offline line methods to perform similar parsing??
UPDATE:
Please skip the regex suggestions because there is a potentially unknown number of compositions of words like "appid" in even simple URLs.
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but this should get you started. They even give urls as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/