Is very known article Eugene W. Myers article “An O(ND) Difference Algorithm and Its Variations”. Also I see jcoglan blog about it. Base algorithm has square space complexity, whereas Divide and Conquer extension decreases space complexity to linear.
For comparison two sequences is possible many correct solutions, which some are more useful for users than others.
For example, comparing two texts:
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
and
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
base, square space algorithm give nicer solution: deletions and insertions are grouped together (this sample is simple swapped methods) whereas linear algorithm give worse solution although length Longest Common Sub-sequence is the same.
Furthermore, is needed heuristic maybe dependent on source type (c++ and Python source maybe must have differ heuristics). In Github mhagger/diff-slider-tools is corpus thousands human preferred diff solutions.
How to cause in linear space :
- group deletions and insertions
- enable heuristics