I have to heuristically determine the format pattern strings by analyzing the formatted results.
For example I have these strings:
You have 3 unread messages.
You have 10 unread messages.
I'm sorry, Dave. I'm afraid I can't do that.
I'm sorry, Frank. I'm afraid I can't do that.
This statement is false.
I want to derive these format strings:
You have %s unread messages
I'm sorry, %s. I'm afraid I can't do that.
This statement is false.
Which approaches and/or algorithms could help me here?
My first thought was using machine learning stuff, but my guts tell me this could be a rather classic problem.
Some additional requirements:
- The type of the parameter is irrelevant, i.e. I don't need the information if the parameter originally was
%s
or%d
or if it was padded or aligned. - There can be more than one parameter (or none at all)
- Typically the data consists of thousands of formatted strings, but only tens of format patterns.
Cluster the strings by some metric of similarity (I'd try length of longest common subsequence, LCS). Determining the number of clusters is the hard part, if you don't know it beforehand.
Within each cluster, determine the LCS of all strings in it, recording the position of the gaps that occur. Replace the gaps with
%s
. (You may want to build a function that returns an LCS-based format string andfold
/reduce
that over the cluster.)The above is a greedy algorithm that, given {
foobar
,fooBaR
} producesfoo%sa%s
. You may want to replace any pair of occurrences of%s
separated by a single character (or a single non-whitespace char, etc) by a single%s
, recursively.