Is there any method to efficiently apply large git patches?

We received a large patch that modifies about 17000 files and is 5.2G in size. When we applied it with git apply -3, it had not finished after 12 hours.

We split the patch into smaller patches per file and applied them one by one, so that at least we could see the progress.
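A minimal sketch of one way to do such a per-file split, assuming the patch is a plain git-style diff in which every file section begins with a line starting with "diff --git ". The function name split_patch_per_file and the numbered output file names are hypothetical, not the tooling actually used here. Anything before the first header, such as a commit message, is skipped.

```shell
#!/bin/sh
# Hypothetical helper: split a git-style patch into one numbered patch per file.
# Assumes every file section begins with a line that starts with "diff --git ".
split_patch_per_file() {
    patch_file=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    awk -v dir="$out_dir" '
        /^diff --git / {
            # A new file section starts: close the previous output file.
            if (out != "") close(out)
            n++
            out = sprintf("%s/%06d.patch", dir, n)
        }
        # Lines before the first "diff --git" header are skipped.
        out != "" { print > out }
    ' "$patch_file"
}
```

For example, split_patch_per_file all_in_one_patch file_patches would write file_patches/000001.patch, file_patches/000002.patch, and so on, which also makes progress visible while applying them in order.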

Once again, it got stuck at one of the file patches, which is still as large as 111M. It modifies an HTML file.

We split this file patch further into one patch per chunk, which gave about 57000 chunk patches. Each chunk patch takes around 2-3 seconds to apply, so applying all of them would take roughly 32 to 48 hours, likely even longer than applying the single file patch. Next I'll try splitting it into patches that contain several chunks each.

Is there any method to efficiently apply such large patches? Thanks.

Update:

As @ti7 suggested, I tried patch and it solved the problem.

In my case, we have 2 kinds of large patches.

One is adding/removing a large binary and the content of the binary is contained as text in the patch. One of the binaries is 188M and the patch size that removes it is 374M.

The other is modifying a large text and has millions of deletions and insertions. One of the text files is 70M before and 162M after. The patch size is 181M and has 2388623 insertions and 426959 deletions.

After some tests, I think what makes a patch "large" here is the number of insertions and deletions rather than its size in bytes.

For the binary patch,

  • git apply -3, 7 seconds
  • git apply, 6 seconds
  • patch, 5 seconds

For the text patch,

  • git apply -3, stuck, not finished after 10 minutes
  • git apply, stuck, not finished after 10 minutes
  • patch, 3 seconds

The binary has only 1 insertion and/or 1 deletion. git apply or patch can finish in seconds. All are acceptable.

The text patch has a huge number of insertions and deletions, and patch is clearly much faster in this case. From some posts about patch, I learned that some versions cannot handle adding, removing, or renaming a file. Luckily, the patch on my machine handles these well.

So we split the all-in-one patch into smaller patches per file. For each one, we first try timeout 10s git apply -3 file_patch. If it cannot finish in 10 seconds, we fall back to timeout 10s patch -p1 < file_patch.
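The fallback described above could be sketched roughly as follows. The function name apply_file_patch and the optional second argument are hypothetical. The sketch assumes GNU coreutils timeout, which exits with status 124 when it has to stop the command, and it falls back to patch only in that case. It does not try to clean up after a git apply that was killed partway, so treat it as an illustration rather than a hardened script.

```shell
#!/bin/sh
# Hypothetical helper: try git apply -3 with a time limit, and fall back to
# patch only when git apply was stopped by the timeout.
apply_file_patch() {
    file_patch=$1
    limit=${2:-10s}   # default mirrors the 10 seconds used in the question
    timeout "$limit" git apply -3 "$file_patch"
    status=$?
    # GNU timeout exits with 124 when it had to stop the command.
    if [ "$status" -ne 124 ]; then
        return "$status"
    fi
    echo "git apply timed out on $file_patch, retrying with patch" >&2
    timeout "$limit" patch -p1 < "$file_patch"
}
```

Called in a loop over the per-file patches, a non-zero return value marks the patches that neither tool could apply within the limit and that need a manual look.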

In the end, it took about an hour and a half to apply all 17000 patches. That is much better than applying the all-in-one patch and being stuck for 12 hours with nothing done.

I also tried patch -p1 < all_in_one_patch. It took only 1m27s, so I think we can improve our patch flow further.

There are 2 answers

Answer by ti7 (best answer):

You may be able to use patch instead of git apply to speed up patching!

To my knowledge, patch spools out a new file line by line, splicing in the changes as it goes, while git apply does additional context checking. As @j6t notes in a comment (though I haven't confirmed it), git apply will also attempt to load and patch the entire file at once before writing it out.
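As a rough illustration of switching to patch, the sketch below checks a patch with a dry run before applying it. The wrapper name apply_with_patch is hypothetical, and --dry-run is a GNU patch option that other patch implementations may not provide. Note that the dry run reads the whole patch once more, and that patch only changes the working tree, so the result still has to be staged with git afterwards.

```shell
#!/bin/sh
# Hypothetical wrapper: verify the patch with a dry run, then apply it.
# --dry-run is a GNU patch option; other implementations may differ.
apply_with_patch() {
    patch_file=$1
    if ! patch -p1 --dry-run < "$patch_file" > /dev/null; then
        echo "dry run failed, nothing was changed" >&2
        return 1
    fi
    # -p1 strips the leading a/ and b/ path components of git-style diffs.
    patch -p1 < "$patch_file"
}
```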

Answer by VonC:

Another argument for patch: git apply now officially rejects patches of about 1 GiB or more.

With Git 2.39 (Q4 2022), "git apply" limits its input to a bit less than 1 GiB.

See commit f1c0e39 (25 Oct 2022) by Taylor Blau (ttaylorr).
(Merged by Taylor Blau -- ttaylorr -- in commit c41ec63, 30 Oct 2022)

apply: reject patches larger than ~1 GiB

Reported-by: 정재우
Suggested-by: Johannes Schindelin
Signed-off-by: Taylor Blau

The apply code is not prepared to handle extremely large files.
It uses "int" in some places, and "unsigned long" in others.

This combination leads to unfortunate problems when switching between the two types.
Using "int" prevents us from handling large files, since large offsets will wrap around and spill into small negative values, which can result in wrong behavior (like accessing the patch buffer with a negative offset).

Converting from "unsigned long" to "int" also has truncation problems even on LLP64 platforms where "long" is the same size as "int", since the former is unsigned but the latter is not.

To avoid potential overflow and truncation issues in git apply, apply similar treatment as in dcd1742 ("xdiff: reject files larger than ~1GB", 2015-09-24, Git v2.7.0-rc0 -- merge listed in batch #2), where the xdiff code was taught to reject large files for similar reasons.

The maximum size was chosen somewhat arbitrarily, but picking a value just shy of a gigabyte allows us to double it without overflowing 2^31-1 (after which point our value would wrap around to a negative number).
To give ourselves a bit of extra margin, the maximum patch size is a MiB smaller than a full GiB, which gives us some slop in case we allocate "(records + 1) * sizeof(int)" or similar.

Luckily, the security implications of these conversion issues are relatively uninteresting, because a victim needs to be convinced to apply a malicious patch.
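The arithmetic in the commit message can be turned into a small pre-check before choosing a tool. This is only a sketch: the names MAX_APPLY_SIZE, fits_git_apply, and choose_tool are chosen for illustration, the value is derived from the "a MiB smaller than a full GiB" wording, and whether a patch of exactly that size is accepted is an assumption.

```shell
#!/bin/sh
# Limit described in the quoted commit message: one MiB less than one GiB.
# 1023 * 1024 * 1024 = 1072693248 bytes; doubling it gives 2145386496,
# which still fits below 2^31 - 1 = 2147483647.
MAX_APPLY_SIZE=$((1023 * 1024 * 1024))

# Succeeds when a patch of the given byte size is below the limit.
# Treating the limit itself as too large is an assumption.
fits_git_apply() {
    [ "$1" -lt "$MAX_APPLY_SIZE" ]
}

# Prints which tool to try first for the given patch file.
choose_tool() {
    size=$(( $(wc -c < "$1") ))
    if fits_git_apply "$size"; then
        echo "git apply"
    else
        echo "patch"
    fi
}
```

By this check, a 5.2G all-in-one patch like the one in the question would be routed to patch, or would first have to be split into pieces below the limit.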


As noted by Gabriel Devillers in the comments:

I tried to apply a patch of size 1.6 GB with Git 2.41 and got error:

git apply: failed to read: No such file or directory 

which is totally unclear.


With Git 2.42 (Q3 2023), "git apply" punts when it is fed too large a patch input; the error message it gives when that happens has been clarified.

See commit 42612e1 (26 Jun 2023) by Phillip Wood (phillipwood).
(Merged by Junio C Hamano -- gitster -- in commit 84b889b, 06 Jul 2023)

apply: improve error messages when reading patch

Reported-by: Premek Vysoky
Signed-off-by: Phillip Wood

Commit f1c0e39 ("apply: reject patches larger than ~1 GiB", 2022-10-25, Git v2.39.0-rc0 -- merge listed in batch #9) added a limit on the size of patch that apply will process to avoid integer overflows.
The implementation re-used the existing error message for when we are unable to read the patch.
This is unfortunate because (a) it does not signal to the user that the patch is being rejected because it is too large and (b) it uses error_errno() without setting errno.

This patch adds a specific error message for the case when a patch is too large.
It also updates the existing message to make it clearer that it is the patch that cannot be read rather than any other file and marks both messages for translation.
The "git apply" prefix is also dropped to match most of the rest of the error messages in apply.c (there are still a few error messages that are prefixed with "git apply" and are not marked for translation after this patch).
The test added in f1c0e39 is updated accordingly.