Why do scripts behave differently when called from the command line vs. git attributes?


Updated scripts are attached below; these now work on my sample document.

Why do the following Python scripts behave differently when called via git attributes than when called from the command line?

I have two scripts modeled on Mercurial's zipdoc functionality. All I'm attempting to do is unzip docx files (store the archive members uncompressed) on store (filter.clean) and zip them again on restore (filter.smudge). Both scripts work well on their own, but once I configure them through git attributes they stop working, and I don't understand why.

I've tested by doing the following:

Testing the Scripts (git bash)

$ cat original.docx | python ~/Documents/pyscripts/unzip.py > uncompress.docx

$ cat uncompress.docx | python ~/Documents/pyscripts/zip.py > compress.docx

$ md5sum uncompress.docx compress.docx

I can open both the uncompressed and compressed files with Microsoft Word with no errors. The scripts work as expected.

Test Git Attributes

  1. I set both clean and smudge to cat and verified that my file saves and restores without problems.
  2. I set clean to python ~/Documents/pyscripts/unzip.py. After a commit and checkout the file is larger (uncompressed) but produces errors when opened in MS Word. The md5 also does not match the "script only" test above, although the file size is identical.
  3. I set clean back to cat and set smudge to python ~/Documents/pyscripts/zip.py. After a commit and checkout the file is smaller (compressed) but again produces errors when opened in MS Word. Again the md5 differs from the "script only" test, but the file size matches.
  4. Setting both clean and smudge to the Python scripts produces an error, as expected.

I'm really lost here. I thought git attributes simply provide the file content on stdin and read the result from stdout. I've tested both scripts with a pipe from cat on the input and a redirect to a file on the output, and they work fine. I know the scripts are running under git because the files change size as expected; however, a small change is introduced somewhere in the file.
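One way to reproduce the git side of this without committing is to run the filter with pipes on both ends, which is how git launches filter commands. In the shell test above, stdout was redirected to a regular file, which supports seek and tell in a way a pipe does not. The following is a minimal Python 3 sketch of such a check; `run_filter_through_pipes` and `md5_hex` are hypothetical helper names introduced here for illustration and are not part of git or of the scripts below.

```python
import hashlib
import subprocess
import sys


def run_filter_through_pipes(command, data):
    """Run `command`, feeding `data` on a stdin pipe and capturing a stdout pipe.

    Both ends are pipes rather than regular files, which approximates the
    way git talks to a clean or smudge filter.
    """
    result = subprocess.run(
        command,
        input=data,              # bytes written to the child's stdin pipe
        stdout=subprocess.PIPE,  # child's stdout is a pipe, not a file
        check=True,              # raise if the filter exits with an error
    )
    return result.stdout


def md5_hex(data):
    """Return the md5 digest of `data` as hex, comparable to md5sum output."""
    return hashlib.md5(data).hexdigest()
```

Calling `run_filter_through_pipes` with the command for unzip.py and the bytes of original.docx, then comparing `md5_hex` of the result with the md5sum of uncompress.docx, should show whether the script's output depends on stdout being a file rather than a pipe.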

Additional References

I'm using msysgit on Win7; all commands above were typed into the git bash window.

Git Attributes Description

Uncompress Script

import sys
import zipfile

# Set stdin and stdout to binary read/write
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)

# Prefer the faster C implementation when it is available (Python 2)
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

# Wrap stdio into a file like object
inString = StringIO(sys.stdin.read())
outString = StringIO()

# Store each member uncompressed
try:
    with zipfile.ZipFile(inString, 'r') as inFile:
        with zipfile.ZipFile(outString, 'w', zipfile.ZIP_STORED) as outFile:
            for memberInfo in inFile.infolist():
                member = inFile.read(memberInfo)
                memberInfo.compress_type = zipfile.ZIP_STORED
                outFile.writestr(memberInfo, member)
    result = outString.getvalue()
except zipfile.BadZipfile:
    # Not a zip archive: pass the input through unchanged
    result = inString.getvalue()

# Write exactly one result to stdout, in a single call
sys.stdout.write(result)

Compress Script

import sys
import zipfile

# Set stdin and stdout to binary read/write
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)

# Prefer the faster C implementation when it is available (Python 2)
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

# Wrap stdio into a file like object
inString = StringIO(sys.stdin.read())
outString = StringIO()

# Store each member compressed
try:
    with zipfile.ZipFile(inString, 'r') as inFile:
        with zipfile.ZipFile(outString, 'w', zipfile.ZIP_DEFLATED) as outFile:
            for memberInfo in inFile.infolist():
                member = inFile.read(memberInfo)
                memberInfo.compress_type = zipfile.ZIP_DEFLATED
                outFile.writestr(memberInfo, member)
    result = outString.getvalue()
except zipfile.BadZipfile:
    # Not a zip archive: pass the input through unchanged
    result = inString.getvalue()

# Write exactly one result to stdout, in a single call
sys.stdout.write(result)

Edit: Formatting

Edit 2: Scripts updated to write to memory rather than stdout during file processing.

There is 1 answer

user1585512

I've found that using zipfile.ZipFile() with the target being stdout was causing a problem. Opening the zipfile with the target being a StringIO() and then at the end writing the full StringIO file into stdout has solved that problem.
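The buffering approach described above can be sketched in Python 3 as follows. This is an illustrative rewrite under stated assumptions, not the scripts from the question (which target Python 2): the names `recompress` and `filter_stdio` are hypothetical, and each member is rebuilt with a fresh `ZipInfo` that copies only the file name, timestamp, and external attributes. A plausible reason the direct approach failed, which the source does not confirm, is that `zipfile` records byte offsets in the archive and stdout is a pipe when git runs the filter, whereas `io.BytesIO` is always seekable.

```python
import io
import sys
import zipfile


def recompress(data, compress_type):
    """Rewrite a zip archive held in `data` using `compress_type` for every member.

    The archive is built entirely in memory; if `data` is not a zip archive
    it is returned unchanged.
    """
    out = io.BytesIO()
    try:
        with zipfile.ZipFile(io.BytesIO(data), "r") as src:
            with zipfile.ZipFile(out, "w", compress_type) as dst:
                for info in src.infolist():
                    payload = src.read(info)
                    # Fresh header that keeps the original name and timestamp,
                    # so repeated runs on the same input stay stable.
                    new_info = zipfile.ZipInfo(info.filename, info.date_time)
                    new_info.external_attr = info.external_attr
                    new_info.compress_type = compress_type
                    dst.writestr(new_info, payload)
    except zipfile.BadZipFile:
        return data
    return out.getvalue()


def filter_stdio(compress_type):
    """Read all of stdin, recompress it, and write the result to stdout once."""
    data = sys.stdin.buffer.read()
    sys.stdout.buffer.write(recompress(data, compress_type))
    sys.stdout.buffer.flush()
```

Under these assumptions, a clean script would call `filter_stdio(zipfile.ZIP_STORED)` and a smudge script would call `filter_stdio(zipfile.ZIP_DEFLATED)`. Using `sys.stdin.buffer` and `sys.stdout.buffer` keeps the streams binary in Python 3, which serves the same purpose as the `msvcrt.setmode` calls in the Python 2 scripts.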

I haven't tested this extensively, and it's possible some .docx contents won't be handled well, but only time will tell. My test files now open without error, and as a bonus the .docx file in the working directory is smaller because the script uses higher compression than the standard .docx format.

I have confirmed that after performing several edits and commits on a .docx file I can open the file, edit one line, and commit without a large delta being added to the repo size. For example, for a 19KB file with 3 previous edits in the repo history, adding a new line at the top created a delta of only 1KB in the repo after garbage collection. Running the same test (as close as I could) with Mercurial resulted in a 9.3KB delta commit. I'm no Mercurial expert, but my understanding is that there is no "gc" command for Mercurial, so none was run.