Say I have one large string and an array of substrings that, when joined, equal the large string apart from small differences.
For example (note the subtle differences between the strings):
large_str = ("hello, this is a long string, that may be made up of multiple "
             "substrings that approximately match the original string")
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
"subsrings tat aproimately ", "match the orginal strng"]
How can I best align the strings to produce a new set of substrings from the original large_str? For example:
["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]
Additional Info
The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.
An implementation using Python's difflib:
Output:
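As a rough sketch of how a difflib-based alignment could work (not necessarily the implementation referred to above): join the OCR'd pieces, let `difflib.SequenceMatcher` align the joined text against `large_str`, and translate each page-break offset through the resulting opcodes. The names `align_substrings` and `noisy` are made up for illustration, and the rule for breaks that fall inside a mismatched region is an assumption, not something the question specifies.

```python
from difflib import SequenceMatcher
from itertools import accumulate


def align_substrings(large_str, sub_strs):
    """Split large_str at the positions that correspond to the breaks in sub_strs."""
    noisy = "".join(sub_strs)
    # Offsets in `noisy` where each piece ends (the last piece ends at the end).
    breaks = list(accumulate(len(s) for s in sub_strs))[:-1]

    # autojunk=False avoids the "popular element" heuristic on long texts.
    opcodes = SequenceMatcher(None, noisy, large_str, autojunk=False).get_opcodes()

    cuts = [0]
    idx = 0
    for p in breaks:
        if not opcodes:  # both strings empty
            cuts.append(0)
            continue
        # Advance to the first opcode whose noisy-side range extends past p.
        while idx < len(opcodes) - 1 and opcodes[idx][2] <= p:
            idx += 1
        tag, i1, i2, j1, j2 = opcodes[idx]
        if tag == "equal":
            cut = j1 + (p - i1)
        else:
            # Assumption: inside a mismatch, keep the same offset but clamp it
            # to the end of the corresponding range in large_str.
            cut = min(j1 + (p - i1), j2)
        cuts.append(cut)
    cuts.append(len(large_str))

    return [large_str[a:b] for a, b in zip(cuts, cuts[1:])]


large_str = ("hello, this is a long string, that may be made up of multiple "
             "substrings that approximately match the original string")
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
            "subsrings tat aproimately ", "match the orginal strng"]

for piece in align_substrings(large_str, sub_strs):
    print(repr(piece))
```

With this rule, characters that exist only in `large_str` and sit exactly at a break are attached to the preceding piece, and the returned pieces always concatenate back to `large_str`. Where exactly a break lands still depends on the alignment `SequenceMatcher` finds, so heavily garbled text near a page boundary may shift a cut by a few characters.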