Weighted disjunction in Perl Regular Expressions?

564 views Asked by At

I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.

My situation is this: I need to separate an address into its component parts based on a regular expression match on the "Identifier elements" of the address -- A comparable English example would be words like "state", "road", or "boulevard"--IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English), we specified the identifier type after each name

United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER

(Where the words in CAPS are what I have called "identifiers").

We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
245 NUMBER

OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:

云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ; Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley

This is easy enough--a lazy match on a potential candidate identifier names, separated into a disjunctive list.

For China, the following are the "province-level" entities:

省 (Province) , 自治区 (Autonomous Region) , 市 (Municipality)

So my regex so far looks like this:

(.+?(?:(?:省)|(?:自治区)|(?:市)))

I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:

(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

So to match a province entity followed by a city entity:

(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市

This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.

The problem? If I have a pure disjunction of these elements, we have the following:

(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))

What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.

Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))

We have two input strings:
1. "crap catelephant crap city", where we wanted "Crap catelephant" and "crap city" 2. "crap catelephant city" , where we wanted "crap cat" "elephant city"

Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities have the same identifier that are not at the same level.

Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:

广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District

(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)

The match for a greedy search would be
江门市开平市

This is wrong, as the two adjacent entities should be separated into their constituent parts. Once is at the level of provincial city, one is a county-level city.

Back to the original point, and I thank you for reading this far, is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. 村委会 instead of simple 村 for example, "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right in finding disjunctive matches. Is this a valid assumption to make? Should I put the most frequently-occurring identifiers first in the disjunctive list?

If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn't have to be Chinese--I think more generally it is a question about the mechanics of the regex disjunctive match -- in what order does it preference the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?

In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness? (Incidentally, my work philosophy in college?)

1

There are 1 answers

1
Mark Byers On BEST ANSWER

How alternations are handled depends on the particular regular expression engine. For almost all engines (including Perl's regular expression engine) the alternation matches eagerly - that is, it matches the left-most choice first and only tries another alternative if this fails. For example, if you have /(cat|catelephant)/ it will never match catelephant. The solution is to reorder the choices so that the most specific comes first.