In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic
I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search is done through the contents of the large file as an NSString. I need to match any word that appears bracketed by a newline and a tab character, and then grab the whole line, for example if I have the word "monday" in my NSArray I want to match this line within the dictionary file:
monday M AH N D IY
This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its ultimate output. I also need to find alternate pronunciations of the words which are listed as follows:
monday(2) M AH N D EY
The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for iterations of the word followed by parentheses containing a single number bracketed by a newline and a tab character.
I have a 100% working NSRegularExpression method as follows:
NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.
// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("];
int firstRound = 0;
for(NSString *word in array) {
if(firstRound == 0) { // this is the first round
firstRound++;
} else { // After the first iteration we need an OR operator first.
[mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
}
[mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}
[mutablePatternString appendString:@")\\t.*$"];
// This results in this regex pattern:
// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$
NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
options:NSRegularExpressionAnchorsMatchLines
error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
options:0
range:NSMakeRange(rangeLocation, rangeLength)
usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
[matches addObject:[string substringWithRange:result.range]];
}];
[mutablePatternString release];
// matches array is returned to the caller.
My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:
• The 3.2 MB text file has the words to match listed in alphabetical order
• The array of arbitrary words to look up are always in alphabetical order when they get to this method
• Alternate pronunciations start with (2) in parens after the word, not (1)
• If there is no (2) there won't be a (3), (4) or more
• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.
Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.
****EDIT****
Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:
dasblinkenlight's create NSDictionary approach https://stackoverflow.com/a/11958852/119717: 5.259676 seconds
Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717: 0.609593 seconds
dasblinkenlight's multiple NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717: 1.255130 seconds
my first hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.372215 seconds
my second hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.337549 seconds
The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up a quarter of the original time so this was enormously helpful, thank you dasblinkenlight and Ωmega for talking it through with me.
Here is my hybrid approach of dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer as well at this point. It uses dasblinkenlight's method of doing a forward search through the string and then performs the full regex on a small range in the event of a hit, so it exploits the fact that the dictionary and words to look up are both in alphabetical order and benefits from the optimized regex. Wish I had two best answer checks to give out! This gives the correct results and takes about half of the time of the pure regex approach on the Simulator (I have to test on the device later to see what the time comparison is on the iPhone 4 which is the reference device):
EDIT: here is another version which uses no regex and shaves a little bit more time off of the method: