I have two versions of a short text, e.g.:
old = "(a) The provisions of this article apply to machinery of class 6."
new = "(a) The provisions of this article apply to machinery of class 6, unless one of the following exceptions apply: (i) the owner depends on the vehicle, (ii) the vehicle is newer than 2 years of age, (iii) the city council grants a special permission"
I now want to compare these texts at a high level. I have accomplished a string-level comparison using the following code:
import difflib

differ = difflib.ndiff(old.splitlines(), new.splitlines())
summary = []
for line in differ:
    # ndiff prefixes every line with a two-character code: '  ', '- ', '+ ' or '? '
    prefix = line[:2]
    if prefix == '  ':
        summary.append(line[2:])
    elif prefix == '- ':
        summary.append(f"Removed: {line[2:]}")
    elif prefix == '+ ':
        summary.append(f"Added: {line[2:]}")
However, I want a higher level summary of the changes as I have thousands of these diffs. For a given diff, I'm imagining the diff summary being something like diff_summary = "Adds exceptions, making article less restrictive".
I want to leverage text summarization (e.g. huggingface-based models) but I'm missing the entry point. How do I employ text summarization to summarize changes to texts?
If you would like git-diff-like output, you can try:
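One option, as a minimal sketch, is `difflib.unified_diff` from the standard library, which emits the `---`/`+++`/`@@` layout that git uses. The helper name `unified` and the `old`/`new` file labels are illustrative choices, not something the question prescribes.

```python
import difflib

old = "(a) The provisions of this article apply to machinery of class 6."
new = (
    "(a) The provisions of this article apply to machinery of class 6, "
    "unless one of the following exceptions apply: "
    "(i) the owner depends on the vehicle, "
    "(ii) the vehicle is newer than 2 years of age, "
    "(iii) the city council grants a special permission"
)


def unified(old: str, new: str) -> list[str]:
    """Return a git-style unified diff of two texts, one entry per output line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old",   # illustrative label for the first text
            tofile="new",     # illustrative label for the second text
            lineterm="",      # inputs carry no newlines, so add none to headers
        )
    )


for line in unified(old, new):
    print(line)
```

Because each text is a single line, this reports the whole line as removed and re-added, which is why a line-level diff is of limited use for short legal clauses.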
But most probably you want something fancier that gives you phrasal/clausal highlights, e.g.
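A minimal sketch of such phrase-level highlights, assuming whitespace tokens from `str.split()` compared with `difflib.SequenceMatcher`. The function name `token_changes` and the `removed`/`added` tags are illustrative.

```python
import difflib

old = "(a) The provisions of this article apply to machinery of class 6."
new = (
    "(a) The provisions of this article apply to machinery of class 6, "
    "unless one of the following exceptions apply: "
    "(i) the owner depends on the vehicle, "
    "(ii) the vehicle is newer than 2 years of age, "
    "(iii) the city council grants a special permission"
)


def token_changes(old: str, new: str) -> list[tuple[str, str]]:
    """Return (tag, phrase) pairs for the token spans that differ.

    Tokens come from str.split(), so punctuation stays attached to words.
    """
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A 'replace' opcode contributes both a removed and an added phrase.
        if tag in ("delete", "replace"):
            changes.append(("removed", " ".join(a[i1:i2])))
        if tag in ("insert", "replace"):
            changes.append(("added", " ".join(b[j1:j2])))
    return changes


for tag, phrase in token_changes(old, new):
    print(f"{tag}: {phrase}")
```

On the example pair this yields one removed phrase (`6.`) and one added phrase that starts at `6,` and runs to the end of the new text, so the added exceptions show up as a single contiguous clause.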
Caveat: you are relying on str.split() to find token differences, so punctuation stays attached to the neighbouring word.
Most probably you will also need some post-processing to get the outputs into the shape that you want.
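One way to handle both points, sketched under assumptions: swap `str.split()` for a regex tokenizer that separates punctuation from words, then render each changed span as a `Removed:` / `Added:` line in the style of the question's own code. The pattern `\w+|[^\w\s]` and the names `TOKEN`, `changed_text` and `describe_changes` are illustrative choices, not part of any library.

```python
import difflib
import re

# Assumed tokenization: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def changed_text(text: str, matches: list, lo: int, hi: int) -> str:
    """Slice the original text covering tokens lo..hi-1, keeping its spacing."""
    return text[matches[lo].start():matches[hi - 1].end()]


def describe_changes(old: str, new: str) -> str:
    """Render token-level changes as 'Removed: ...' / 'Added: ...' lines."""
    ma, mb = list(TOKEN.finditer(old)), list(TOKEN.finditer(new))
    a = [m.group() for m in ma]
    b = [m.group() for m in mb]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    lines = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            lines.append(f"Removed: {changed_text(old, ma, i1, i2)}")
        if tag in ("insert", "replace"):
            lines.append(f"Added: {changed_text(new, mb, j1, j2)}")
    return "\n".join(lines) if lines else "No changes"


print(describe_changes("apply to class 6.", "apply to class 6, unless exempt."))
# prints: Added: , unless exempt
```

Slicing the original string by match offsets keeps the source spacing intact, so the output is readable text rather than space-joined tokens. A compact description like this could then serve as the input to a summarization model, which is left as an open design choice here.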