Loop through a list of files with a specific MIME type in sh


I have a directory and need a list of the files in it whose MIME type is application/pdf, so that I can loop through them and process each one with my CompressPdf function. The remaining files only need to be copied over to the destination directory using cp, for which I need a loopable list as well.

The obvious obstacle is handling arbitrary UNIX filenames correctly, which means using NUL as the delimiter. So far I've come up with this:

find "dir-to-search" -type f -print0 | xargs -0 file -0 --mime-type -F " " | grep -zZ "application/pdf"

But grep doesn't handle the results correctly because file -0 inserts NUL right after the file name, with \n after the MIME information. It would return something like this:

0000000   .   /   f   i   l   e   1   .   p   d   f  \0
0000010   a   p   p   l   i   c   a   t   i   o   n   /   p   d   f  \n
0000020   .   /   f   i   l   e   2   .   p   d   f  \0
0000030   a   p   p   l   i   c   a   t   i   o   n   /   p   d   f  \n
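
To make the mismatch concrete, here is a small sketch that feeds grep -z a simulated stream in the layout shown in the dump. The sample function and its two filenames are made up for illustration, and a grep that supports -z (GNU grep) is assumed.

```shell
# Simulated single -0 output, laid out as in the dump:
# filename, NUL, MIME type, newline.
sample() { printf './file1.pdf\0application/pdf\n./file2.txt\0text/plain\n'; }

# grep -z splits records on NUL, so it sees three records:
#   "./file1.pdf"
#   "application/pdf\n./file2.txt"
#   "text/plain\n"
# The only record that matches is the MIME type of file1 glued to the
# name of file2, not the name of the PDF itself.
# tr turns the NUL terminator into '|' so it is visible on a terminal.
sample | grep -z 'application/pdf' | tr '\0' '|'
```

The filename and its type never end up in the same record, which is why filtering on the type cannot return the right name with this layout.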

Another obstacle is that putting everything on one line limits the ability to run several lines of code in each iteration. Calling xargs -I{} sh -c {} inline spawns a new process, which cannot call my CompressPdf function. I am using Dash, and export -f CompressPdf does not work there. Executing $0 recursively is my best bet.

Currently, my code works well when it processes several PDF files concurrently, recursing through a single directory. However, this approach prevents me from processing a large number of files at once.

Can someone help me with this? I'm trying to write this in Dash instead of Bash for a little more performance, despite the fact that arrays are not available. I can switch to Bash if there is no other way.

There is 1 answer

KamilCuk

Try this:

find . -type f -print0 |
xargs -0 file -0 -0 --mime-type |
sed -nz 'N;/\x00application\/pdf$/s///p'

First, from man file:

-0, --print0

If this option is repeated more than once, then file prints just the filename followed by a NUL followed by the description (or ERROR: text) followed by a second NUL for each entry.

So specify it twice.

Then I use sed -z to read the NUL-separated stream two records at a time (-z is a GNU extension to sed). N appends the next record, so the pattern space holds the filename, a NUL, and the MIME type. If it ends with application/pdf, the NUL and the MIME type are removed and the filename is printed.
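
As a quick check of that filter without touching real files, the sketch below runs it on a simulated `file -0 -0 --mime-type` stream. The sample and pdf_names functions and the filenames are hypothetical, and GNU sed is assumed. The -n option is used so that sed prints only what the p flag emits; without it, every record pair would also be auto-printed.

```shell
# Simulated `file -0 -0 --mime-type` output:
# filename, NUL, MIME type, NUL, repeated for each file.
sample() {
    printf '%s\0' './a b.pdf'   application/pdf \
                  './notes.txt' text/plain \
                  './c.pdf'     application/pdf
}

# N joins each filename with its MIME type (separated by a NUL under -z).
# If the pair ends in application/pdf, the NUL and the type are stripped
# and the filename alone is printed, still NUL-terminated.
pdf_names() { sed -nz 'N;/\x00application\/pdf$/s///p'; }

# tr makes the NUL terminators visible as newlines for inspection only;
# a real consumer should keep the NULs.
sample | pdf_names | tr '\0' '\n'
```

Only the two PDF names should come out, including the one that contains a space.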

You can always work around zero terminated strings with xxd:

find . -type f -print0 |
xargs -0 file -0 -0 --mime-type |
# convert to hex
xxd -p -c1 | tr '\n' ' ' | sed 's/00 /\n/g' |
# have filename and mime type on a single line
sed 'N;s/\n/00 /' |
# this step acts as grep: keep only the records whose MIME type is
# application/pdf (matched in hex) and drop the type from the output
sed -n '/ 00 '"$(printf '%s' 'application/pdf' | xxd -p | tr -d '\n' | sed -r 's/(..)/\1\n/g' | paste -sd' ')"'/s// 00/p' |
# reverse the stream from hex to ascii
xxd -r -p
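
The pipelines above produce the list of PDF names but do not cover the looping part of the question. One possible way around NUL parsing in Dash is to skip find and xargs altogether and walk the tree with shell globs, which keeps everything in the same script so that shell functions stay callable. The following is only a sketch of that alternative, not part of the original answer: process_tree and mime_of are hypothetical names, and CompressPdf is a placeholder assumed to take an input path and an output path.

```shell
#!/bin/sh
# Placeholder for the asker's function (assumed signature: input, output).
# Here it only copies; replace the body with the real compression step.
CompressPdf() { cp -- "$1" "$2"; }

# MIME type of a single file; -b leaves the filename out of the output.
mime_of() { file -b --mime-type -- "$1"; }

# Mirror src into dst, sending PDFs through CompressPdf and copying the rest.
# The body is a ( ... ) subshell so that each recursive call gets its own
# copies of src, dst, f and name; POSIX sh has no local variables.
process_tree() (
    src=$1 dst=$2
    mkdir -p -- "$dst"
    # The three patterns together cover regular and hidden names.
    for f in "$src"/* "$src"/.[!.]* "$src"/..?*; do
        [ -L "$f" ] && continue        # skip symlinks, as find -type f does
        name=${f##*/}
        if [ -d "$f" ]; then
            process_tree "$f" "$dst/$name"
        elif [ -f "$f" ]; then
            case $(mime_of "$f") in
                application/pdf) CompressPdf "$f" "$dst/$name" ;;
                *)               cp -- "$f" "$dst/$name" ;;
            esac
        fi
        # Unmatched patterns stay literal, fail both tests, and are ignored.
    done
)
```

Calling process_tree "dir-to-search" "destination-dir" would then handle both kinds of files in one pass. Filenames are never written to a pipe, so spaces and newlines in names need no special handling, at the cost of one file invocation per file and one subshell per directory.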