Get large string without catastrophic backtracking regex


I want to use a regex to pull a specific file (e.g. package-lock.json) out of a git diff. I'm fetching the whole diff via the GitHub API (using Octokit.js), so as far as I'm aware I can't just run git diff on that one file. The diff for a file like package-lock.json is very large, and when I try to extract it with a regular expression, the match fails due to catastrophic backtracking.

Essentially, the file structure looks like this:

diff --git a/package-lock.json b/package-lock.json
lots of content

diff --git a/next-file b/next-file

Therefore my idea was to get everything between the two diff --git strings.

I figured I could just use /(?<=diff --git )(.+?)(?=diff)/gs. This works fine when the text matched by the lookahead is not too far ahead, but deep into a large file it stops working due to catastrophic backtracking.

I understand why this is happening, but not how to get around it. Perhaps I should be sorting this out some other way and only using regex for the more specific details?

Any help would be appreciated.


There is 1 answer

Andy Lester (best answer)

You're working with lines of data, and regexes don't work well like that, as you've found out. Use a tool like awk that can find ranges of lines.

Given this file foo.txt:

Here is stuff I don't care about
diff --git a/package-lock.json b/package-lock.json
lots of content

diff --git a/next-file b/next-file
Don't care about this either.

Use awk to specify the range of lines you want to print:

$ awk '/^diff --git a\/package-lock/,/^diff --git a\/next-file/' foo.txt
diff --git a/package-lock.json b/package-lock.json
lots of content

diff --git a/next-file b/next-file
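
Since the diff in the question arrives as a string inside a JavaScript program, the same line-range idea can be sketched there without a multi-line regex. This is only a minimal sketch: the function name extractFileDiff and the sample variable are hypothetical, and it assumes the file was not renamed (so the a/ and b/ paths in the header are identical). Unlike the awk range above, it stops before the next "diff --git" header rather than including it.

```javascript
// Hypothetical helper: returns the section of a git diff that belongs to
// one file, or null if that file's header is not present.
// Assumption: the a/ and b/ paths are the same (no rename).
function extractFileDiff(diffText, filePath) {
  const header = `diff --git a/${filePath} b/${filePath}`;
  // Split into lines, tolerating Windows line endings.
  const lines = diffText.split(/\r?\n/);
  const start = lines.indexOf(header);
  if (start === -1) return null;

  // Scan forward to the next file header, or to the end of the diff.
  let end = start + 1;
  while (end < lines.length && !lines[end].startsWith("diff --git ")) {
    end++;
  }
  return lines.slice(start, end).join("\n");
}

// Same content as foo.txt above.
const sample = [
  "Here is stuff I don't care about",
  "diff --git a/package-lock.json b/package-lock.json",
  "lots of content",
  "",
  "diff --git a/next-file b/next-file",
  "Don't care about this either.",
].join("\n");

console.log(extractFileDiff(sample, "package-lock.json"));
```

Each line is visited once, so the cost grows linearly with the size of the diff and there is nothing for a regex engine to backtrack over.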