Detecting additions/removals from string differences between texts

I have two versions of a short text, e.g.:

old = "(a) The provisions of this article apply to machinery of class 6."
new = "(a) The provisions of this article apply to machinery of class 6, unless one of the following exceptions apply: (i) the owner depends on the vehicle, (ii) the vehicle is newer than 2 years of age, (iii) the city council grants a special permission"

I want to compare these two texts at a high level. So far I have a string-level comparison using the following code:

import difflib
differ = difflib.ndiff(old.splitlines(), new.splitlines())

summary = []

for line in differ:
    prefix = line[:2]
    if prefix == '  ':
        summary.append(line[2:])
    elif prefix == '- ':
        summary.append(f"Removed: {line[2:]}")
    elif prefix == '+ ':
        summary.append(f"Added: {line[2:]}")

However, I want a higher-level summary of the changes, as I have thousands of these diffs. For a given diff, I'm imagining a summary along the lines of diff_summary = "Adds exceptions, making article less restrictive".

I want to leverage text summarization (e.g. huggingface-based models) but I'm missing the entry point. How do I employ text summarization to summarize changes to texts?

Answer by alvas:

If you want git-diff-like output, you can try:

from difflib import unified_diff

s1 = "(a) The provisions of this article apply to machinery of class 6."
s2 = "(a) The provisions of this article apply to machinery of class 6, unless one of the following exceptions apply: (i) the owner depends on the vehicle, (ii) the vehicle is newer than 2 years of age, (iii) the city council grants a special permission"

print('\n'.join(unified_diff(s1.split(), s2.split())))

[out]:

--- 

+++ 

@@ -9,4 +9,36 @@

 machinery
 of
 class
-6.
+6,
+unless
+one
+of
+the
+following
+exceptions
+apply:
+(i)
+the
+owner
+depends
+on
+the
+vehicle,
+(ii)
+the
+vehicle
+is
+newer
+than
+2
+years
+of
+age,
+(iii)
+the
+city
+council
+grants
+a
+special
+permission

But you most probably want something fancier that groups the changes into phrasal/clausal highlights, e.g.:

from difflib import unified_diff

s1 = "(a) The provisions of this article apply to machinery of class 6."
s2 = "(a) The provisions of this article apply to machinery of class 6, unless one of the following exceptions apply: (i) the owner depends on the vehicle, (ii) the vehicle is newer than 2 years of age, (iii) the city council grants a special permission"

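current_diff = []
direction = None
for diff_line in unified_diff(s1.split(), s2.split()):
    # Skip the '---'/'+++' headers, '@@' hunk markers and unchanged context tokens.
    if diff_line[0] not in ["-", '+'] or diff_line in ['--- \n', '+++ \n']:
        continue
    # A change of sign starts a new run, so print the run collected so far.
    if diff_line[0] != direction:
        if current_diff:
            print(direction, current_diff)
        direction = diff_line[0]
        current_diff = []

    # Drop the leading sign and keep the token.
    current_diff.append(diff_line[1:])

# Print the final run.
print(direction, current_diff)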

[out]:

- ['6.']
+ ['6,', 'unless', 'one', 'of', 'the', 'following', 'exceptions', 'apply:', '(i)', 'the', 'owner', 'depends', 'on', 'the', 'vehicle,', '(ii)', 'the', 'vehicle', 'is', 'newer', 'than', '2', 'years', 'of', 'age,', '(iii)', 'the', 'city', 'council', 'grants', 'a', 'special', 'permission']

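Caveat: You are relying on str.split() to find token differences, so punctuation stays attached to the words. That is why "6." and "6," show up as a removed and an added token even though only the punctuation changed.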
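If the whitespace-only tokenization gets in the way, one option is to split punctuation into its own tokens and read the changes off difflib.SequenceMatcher directly. This is only a rough sketch: the regex and the helper names tokenize and token_changes are my own assumptions for illustration, not part of difflib.

```python
import re
from difflib import SequenceMatcher

# Assumed tokenization rule: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def token_changes(old, new):
    """Return (tag, old_tokens, new_tokens) for every opcode that is not 'equal'."""
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

s1 = "apply to machinery of class 6."
s2 = "apply to machinery of class 6, unless exempt."
for change in token_changes(s1, s2):
    print(change)
# ('insert', [], [',', 'unless', 'exempt'])
```

With punctuation split off, the "6" is no longer reported as changed; only the inserted clause is. The tags returned by get_opcodes() are 'replace', 'delete', 'insert' and 'equal'.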

And to get the output into the shape that you want:

from difflib import unified_diff

s1 = "(a) The provisions of this article apply to machinery of class 6."
s2 = "(a) The provisions of this article apply to machinery of class 6, unless one of the following exceptions apply: (i) the owner depends on the vehicle, (ii) the vehicle is newer than 2 years of age, (iii) the city council grants a special permission"

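# Map the diff signs to human-readable labels.
meaningful = {'-': 'Removed:', '+': 'Added:'}

current_diff = []
direction = None
for diff_line in unified_diff(s1.split(), s2.split()):
    # Skip the '---'/'+++' headers, '@@' hunk markers and unchanged context tokens.
    if diff_line[0] not in ["-", '+'] or diff_line in ['--- \n', '+++ \n']:
        continue
    # A change of sign starts a new run, so print the run collected so far.
    if diff_line[0] != direction:
        if current_diff:
            print(meaningful[direction], ' '.join(current_diff))
        direction = diff_line[0]
        current_diff = []

    # Drop the leading sign and keep the token.
    current_diff.append(diff_line[1:])

# Print the final run.
print(meaningful[direction], ' '.join(current_diff))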

[out]:

Removed: 6.
Added: 6, unless one of the following exceptions apply: (i) the owner depends on the vehicle, (ii) the vehicle is newer than 2 years of age, (iii) the city council grants a special permission
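Since the question mentions thousands of diffs, here is a rough sketch of how the "Removed:/Added:" text could be produced by a function and then handed to whatever summarizer you settle on. The names describe_change, summarize_changes and first_line are hypothetical, and the summarize argument is just any callable taking a string and returning a string (for example a wrapper around a Hugging Face model, as the question suggests). I have not checked how well off-the-shelf summarization models handle this kind of input, so treat it as an entry point rather than a finished solution.

```python
from difflib import SequenceMatcher

def describe_change(old, new):
    """Render word-level changes between two texts as 'Removed:'/'Added:' lines."""
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    lines = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            lines.append("Removed: " + " ".join(a[i1:i2]))
        if tag in ("insert", "replace"):
            lines.append("Added: " + " ".join(b[j1:j2]))
    return "\n".join(lines)

def summarize_changes(pairs, summarize):
    """Apply a caller-supplied summarize(text) -> str to each (old, new) pair."""
    return [summarize(describe_change(old, new)) for old, new in pairs]

# Stand-in summarizer for demonstration only: keeps the first line of the description.
def first_line(text):
    return text.splitlines()[0] if text else "No change"

pairs = [("machinery of class 6.", "machinery of class 6, unless exempted")]
print(describe_change(*pairs[0]))
# Removed: 6.
# Added: 6, unless exempted
print(summarize_changes(pairs, first_line))
# ['Removed: 6.']
```

Keeping the summarizer as a plain callable means the diff step can be tested on its own, and the model can be swapped without touching the rest.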