Prefer showing lines with only either additions or deletions in git diff

As an example, if we have the following (meaningless) code committed:

if (x.some_property)
{
    dothing1();
    dothing2();
}

and we add a conditional branch on top and make the existing code the else branch:

if (x.something_else)
{
    now_other_here();
    second_here();
}
else if (x.some_property)
{
    dothing1();
    dothing2();
}

then git diff with all supported diffing algorithms shows:

-    if (x.some_property)
+    if (x.something_else)
+    {
+        now_other_here();
+        second_here();
+    }
+    else if (x.some_property)

This is correct. In my mind, there is another possible diff:

+    if (x.something_else)
+    {
+        now_other_here();
+        second_here();
+    }
-    if (x.some_property)
+    else if (x.some_property)

Why does git prefer the first? I find the second diff easier to parse as a human, for two reasons:

  • the changed line only contains additions/deletions and not both (i.e., no modifications); and
  • the changed line contains fewer modified characters.

My assumption is that git parses the file from top to bottom and thus prefers to show modifications earlier rather than later.

Is it possible to change git's behavior such that it will show the second diff?

There is 1 answer

Answer by jingx:

Writing this as an answer instead of a comment only because of the length limit. Pretty sure this is not the answer you are looking for. :-)

To achieve what you want, which would definitely yield more intuitive results, a diff algorithm would need to be able to do one of two things:

The first option is for it to recognize that:

else if (x.some_property)

was changed from

if (x.some_property)

That would require diff to look inside each pair of differing lines, measure how different they are, and check the result against some sort of "similarity threshold". It would be a performance hit, and where to set the threshold is very subjective: how many characters would need to change before diff decides that two lines are completely unrelated?
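To make the "similarity threshold" idea concrete, here is a minimal sketch in Python using the standard library's difflib. It is not how git works; it only illustrates scoring a removed line against each added line. The threshold value and the helper names (SIMILARITY_THRESHOLD, similarity, best_match) are assumptions made up for this illustration.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off: pairs scoring at least this much would be treated as
# "the same line, modified" rather than an unrelated deletion plus addition.
SIMILARITY_THRESHOLD = 0.8


def similarity(old_line: str, new_line: str) -> float:
    """Return a score from 0.0 to 1.0 for two lines, ignoring indentation."""
    return SequenceMatcher(None, old_line.strip(), new_line.strip()).ratio()


def best_match(removed: str, added: list[str]) -> tuple[str, float]:
    """Return the added line that most resembles the removed line, with its score."""
    scored = [(line, similarity(removed, line)) for line in added]
    return max(scored, key=lambda pair: pair[1])


# The removed and added lines from the question's example.
removed = "if (x.some_property)"
added = [
    "if (x.something_else)",
    "{",
    "    now_other_here();",
    "    second_here();",
    "}",
    "else if (x.some_property)",
]

line, score = best_match(removed, added)
print(line)                           # the added line most similar to the removed one
print(score >= SIMILARITY_THRESHOLD)  # whether it clears the assumed threshold
```

With these inputs the removed line pairs with the else if line rather than with the new if line, which is the pairing the second diff in the question relies on. The cost is visible too: every removed line is compared character by character against every added line, and the 0.8 cut-off is an arbitrary choice.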

The other option would be to teach diff to actually parse the language. Then it would be able to see that the whole code block following the else if line did not change, and recognize that the else if line was originally the if line. However, again, there is going to be a big performance hit in turning diff from a "dumb" line-based text processor into an "intelligent" syntax-aware parser. Is it worth it? I don't know.