Difflib get_opcodes returning weird results


I'm using difflib's get_opcodes to transform one English paragraph into another. It mostly works, but the result contains some odd errors. Here's an example:

Original:

Abstract: Efferocytosis is attenuated in vulnerable plaques of advanced atherosclerosis. T-cell immunoglobulin and mucin domain 4 (TIMD4) is a recognition receptor protein for efferocytosis. It may participate in atherosclerosis mouse models.

Target:

Abstract: Efferocytosis, the process of engulfing and removing apoptotic cells, is attenuated in vulnerable plaques of advanced atherosclerosis. T-cell immunoglobulin and mucin domain 4 (TIMD4) is a recognition receptor protein for efferocytosis that has been implicated in atherosclerosis mouse models.

Result:

Abstract: Efferocytosis, the process of engulfing and removing apoptotic cells, is attenuated in vulnerable plaques of advanced atherosclerosis. T-cell immunoglobulin and mucin domain 4 (TIMD4) is a recognition receptor protein for efferocytosis {thathat} has been implicated in atherosclerosis mouse models.

Notice that the word in the curly braces is wrong: it reads "thathat" where the target has "that" (the braces are not in the output; I added them to highlight the location). So I printed out the opcodes (truncated):

insert, 0, 0, 0, 8
equal, 0, 23, 8, 31
insert, 23, 23, 31, 87
equal, 23, 189, 87, 253
delete, 189, 190, 253, 253
equal, 190, 191, 253, 254
replace, 191, 192, 254, 257
equal, 192, 194, 257, 259
equal, 192, 193, 254, 255
replace, 193, 194, 255, 269
equal, 194, 195, 269, 270
...

I notice that the positions in the opcodes go from 259 back to 255 (decreasing) instead of always increasing. This coincides with the location of the incorrect word, and the same pattern seems to appear at every later error.

Any idea as to why this is happening? How can I fix this?
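
For comparison, here is a minimal sketch of one way to apply opcodes, assuming they all come from a single SequenceMatcher and that the output is built as a new string rather than edited in place. The difflib documentation states that each tuple from get_opcodes starts where the previous one ended (i1 equals the previous i2, j1 equals the previous j2), so the overlapping ranges in the listing above (192-194 followed by 192-193) may mean the opcodes were merged from more than one matcher or post-processed somewhere; that is only a guess. The function name apply_opcodes and the shortened sample strings are hypothetical, used only for illustration.

```python
from difflib import SequenceMatcher


def apply_opcodes(a: str, b: str) -> str:
    """Rebuild b from a using the opcodes of a single SequenceMatcher."""
    # autojunk=False turns off the "popular element" heuristic for long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pieces = []
    prev_i2 = prev_j2 = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Documented invariant: each opcode starts where the previous one ended.
        assert i1 == prev_i2 and j1 == prev_j2, "opcodes are not contiguous"
        if tag == "equal":
            pieces.append(a[i1:i2])   # unchanged text, copied from a
        elif tag in ("replace", "insert"):
            pieces.append(b[j1:j2])   # new text, copied from b
        # "delete" contributes nothing to the output
        prev_i2, prev_j2 = i2, j2
    return "".join(pieces)


# Shortened, hypothetical inputs based on the last sentence of the example.
original = "It may participate in atherosclerosis mouse models."
target = "that has been implicated in atherosclerosis mouse models."
result = apply_opcodes(original, target)
```

Because every piece is sliced from the untouched a or b, the indices never shift while the result is being built. If the real code instead modifies the original string in place, applying the opcodes from last to first (or tracking an offset) would be needed to keep the indices valid.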
