difflib with more than two file names

812 views Asked by At

I have several file names that I am trying to compare. Here are some examples:

files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg', 'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']

What I need to do is extract "FilePrefix" from each file name, which changes depending on the directory. I have several folders containing many jpg's. Within each folder, each jpg has a FilePrefix in common with every other jpg in that directory. I need the variable portion of the jpg file name. I am unable to predict what FilePrefix is going to be ahead of time.

I had the idea to just compare two file names using difflib (in Python) and extract FilePrefix (and subsequently the variable portion) that way. I've run into the following issue:

>>>> comp1 = SequenceMatcher(None, files[0], files[1])
>>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=11), Match(a=12, b=12, size=4), Match(a=16, b=16, size=0)]

>>>> comp1 = SequenceMatcher(None, files[1], files[2])
>>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=10), Match(a=11, b=11, size=5), Match(a=16, b=16, size=0)]

As you can see, the first size does not match up. It's confusing the ten's and digit's place, making it hard for me to match a difference between more than two files. Is there a correct way to find a minimum size among all files within the directory? Or alternatively, is there a better way to extract FilePrefix?

Thank you.

2

There are 2 answers

5
abarnert On BEST ANSWER

It's not that it's "confusing the ten's and digit's place", it's that in the first matchup the ten's place isn't different, so it's considered part of the matching prefix.

For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:

def prefix(x, y):
    comp = SequenceMatcher(None, x, y)
    matches = comp.get_matching_blocks()
    prefix_match = matches[0]
    prefix_size = prefix_match[2]
    return prefix_size

pairs = zip(files, files[1:])
matches = (prefix(x, y) for x, y in pairs)
prefixlen = min(matches)
prefix = files[0][:prefixlen]

The prefix function is pretty straightforward, except for one thing: I made it take a single tuple of two values instead of two arguments, just to make it easier to call with map. And I used the [2] instead of .size because there's an annoying bug in 2.7 difflib where the second call to get_matching_blocks may return a tuple instead of a namedtuple. This won't affect the code as-is, but if you add some debugging prints it will break.

Now, pairs is a list of all adjacent pairs of names, created by zipping together names and names[1:]. (If this isn't clear, print(zip(names, names[1:]). If you're using Python 3.x, you'll need to print(list(zip(names, names[1:])) instead, because zip returns a lazy iterator instead of a printable list.)

Now we just want to call prefix on each of the pairs, and take the smallest value we get back. That's what min is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)

You could obviously compact this into two or three lines while still leaving it readable:

prefixlen = min(SequenceMatcher(None, x, y).get_matching_blocks()[0][2] 
                for x, y in zip(files, files[1:]))
prefix = files[0][:prefixlen]

However, it's worth considering that SequenceMatcher is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, which means it's essentially O(N^3) on the length of the strings, when it only needs to be O(NM) where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.

So, why not just do it manually?

def prefixes(name):
    while name:
        yield name
        name = name[:-1]

def maxprefix(names):
    first, names = names[0], names[1:]
    for prefix in prefixes(first):
        if all(name.startswith(prefix) for name in names):
            return prefix

prefixes(first) just gives you 'FilePrefix10.jpg', 'FilePrefix10.jp','FilePrefix10.j, etc. down to'F'`. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.


And you can do this even faster by thinking character by character instead of prefix by prefix:

def maxprefix(names):
    for i, letters in enumerate(zip(*names)):
        if len(set(letters)) > 1:
            return names[0][:i]

Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).

The zip reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is, [('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …].

The enumerate just gives us the index along with the value. So, instead of getting ('F', 'F', 'F', 'F') you get 0, ('F, 'F', F', 'F'). We need that index for the last step.

Now, to check that ('F', 'F', 'F', 'F') are all the same, I just put them in a set. If they're all the same, the set will have just one element—{'F'}, then {'i'}, etc. If they're not, it'll have multiple elements—{'1', '2'}—and that's how we know we've gone past the prefix.

1
MrWonderful On

The only way to be certain is to check ALL the filenames. So just iterate through them all, checking against the kept maximum matching string as you go.

You might try something like this:

files = ['FilePrefix10.jpg',
         'FilePrefix11.jpg',
         'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg',
         'FileProtector354.jpg
         ]
prefix=files[0]
max = 0
for f in files:
    for c in range(0, len(prefix)):
        if prefix[:c] != f[:c]:
            prefix = f[:c-1]
            max = c - 1
print prefix, max

Please pardon the 'un-Pythonicness' of the solution, but I wanted the algorithm to be obvious to any level programmer.