I have several file names that I am trying to compare. Here are some examples:
files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg', 'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']
What I need to do is extract "FilePrefix" from each file name, which changes depending on the directory. I have several folders containing many jpg's. Within each folder, each jpg has a FilePrefix in common with every other jpg in that directory. I need the variable portion of the jpg file name. I am unable to predict what FilePrefix is going to be ahead of time.
I had the idea to just compare two file names using difflib (in Python) and extract FilePrefix (and subsequently the variable portion) that way. I've run into the following issue:
>>>> comp1 = SequenceMatcher(None, files[0], files[1])
>>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=11), Match(a=12, b=12, size=4), Match(a=16, b=16, size=0)]
>>>> comp1 = SequenceMatcher(None, files[1], files[2])
>>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=10), Match(a=11, b=11, size=5), Match(a=16, b=16, size=0)]
As you can see, the first size
does not match up. It's confusing the ten's and digit's place, making it hard for me to match a difference between more than two files. Is there a correct way to find a minimum size
among all files within the directory? Or alternatively, is there a better way to extract FilePrefix?
Thank you.
It's not that it's "confusing the ten's and digit's place", it's that in the first matchup the ten's place isn't different, so it's considered part of the matching prefix.
For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:
The
prefix
function is pretty straightforward, except for one thing: I made it take a single tuple of two values instead of two arguments, just to make it easier to call withmap
. And I used the[2]
instead of.size
because there's an annoying bug in 2.7difflib
where the second call toget_matching_blocks
may return atuple
instead of anamedtuple
. This won't affect the code as-is, but if you add some debuggingprint
s it will break.Now,
pairs
is a list of all adjacent pairs of names, created byzip
ping togethernames
andnames[1:]
. (If this isn't clear,print(zip(names, names[1:])
. If you're using Python 3.x, you'll need toprint(list(zip(names, names[1:]))
instead, becausezip
returns a lazy iterator instead of a printable list.)Now we just want to call
prefix
on each of the pairs, and take the smallest value we get back. That's whatmin
is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)You could obviously compact this into two or three lines while still leaving it readable:
However, it's worth considering that
SequenceMatcher
is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, which means it's essentially O(N^3) on the length of the strings, when it only needs to be O(NM) where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.So, why not just do it manually?
prefixes(first)
just gives you'FilePrefix10.jpg'
,'FilePrefix10.jp',
'FilePrefix10.j, etc. down to
'F'`. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.And you can do this even faster by thinking character by character instead of prefix by prefix:
Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).
The
zip
reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is,[('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …]
.The
enumerate
just gives us the index along with the value. So, instead of getting('F', 'F', 'F', 'F')
you get0, ('F, 'F', F', 'F')
. We need that index for the last step.Now, to check that
('F', 'F', 'F', 'F')
are all the same, I just put them in aset
. If they're all the same, the set will have just one element—{'F'}
, then{'i'}
, etc. If they're not, it'll have multiple elements—{'1', '2'}
—and that's how we know we've gone past the prefix.