I'm going to go a bit in-depth with my problem, you can jump to the TL;DR if you don't want to read all of this
What I'm trying to do
I need to store a "file" (text document) which can be user-edited. If I have my original file (which could be huge)
Lorem ipsum dolor sit amet
and the user were to make a change:
Foo ipsum amet_ sit
Basically, I have the original string and the user-edited string. I want to find the differences, "edits". To prevent storing duplicates of very large strings. I want to store the original and the "edits". Then apply the edits to the original. Kind of like data de-duplication. The problem is that I have no idea how different edits can be and I also need to be able to apply those edits to the string.
Attempts
Because the text could be huge, I am wondering what would be the most "efficient" way to store edits to the text without storing two separate versions. My first guess was something along the lines of:
var str = 'Original String of text...'.split(' ') || [],
mod = 'Modified String of text...'.split(' ') || [], i, edits = [];
for (i = 0; i < str.length; i += 1) {
edits.push(str[i]===mod[i] ? undefined : mod[i]);
}
console.log(edits); // ["Modified", null, null, null] (desired output)
then to revert back:
for (i = 0; i < str.length; i += 1) {
str[i] = edits[i] || str[i];
}
str.join(' '); // "Modified String of text..."
Basically, I'm trying to split the text by spaces into arrays. Compare the arrays and store the differences. Then apply the differences to generate the modified version
Problems
But if the amount of spaces were to change, problems would occur:
str
: Original String of text...
mod
: OriginalString of text...
Output: OriginalString of text... text...
My desired output: OriginalString of text...
Even if I were to switch str.length
with mod.length
and edits.length
like:
// Get edits
var str = 'Original String of text...'.split(' ') || [],
mod = 'Modified String of text...'.split(' ') || [], i, edits = [];
for (i = 0; i < mod.length; i += 1) {
edits.push(str[i]===mod[i] ? undefined : mod[i]);
}
// Apply edits
var final = [];
for (i = 0; i < edits.length; i += 1) {
final[i] = edits[i] || str[i];
}
final = final.join(' ');
edits
would be: ["ModifiedString", "of", "text..."]
in result making the whole 'storing edits thing useless. And even worse if a word were to be added / removed. If str
were to become Original String of lots of text...
. The output would still be the same.
I can see that they're many flaws in the way I'm doing this, but I can't think of any other way.
Snippet:
document.getElementById('go').onclick = function() {
var str = document.getElementById('a').value.split(' ') || [],
mod = document.getElementById('b').value.split(' ') || [],
i, edits = [];
for (i = 0; i < mod.length; i += 1) {
edits.push(str[i] === mod[i] ? undefined : mod[i]);
}
// Apply edits
var final = [];
for (i = 0; i < edits.length; i += 1) {
final[i] = edits[i] || str[i];
}
final = final.join(' ');
alert(final);
};
document.getElementById('go2').onclick = function() {
var str = document.getElementById('a').value.split(' ') || [],
mod = document.getElementById('b').value.split(' ') || [],
i, edits = [];
for (i = 0; i < str.length; i += 1) {
edits.push(str[i] === mod[i] ? undefined : mod[i]);
}
for (i = 0; i < str.length; i += 1) {
str[i] = edits[i] || str[i];
}
alert(str.join(' ')); // "Modified String of text..."
};
Base String:
<input id="a">
<br/>Modified String:
<input id="b" />
<br/>
<button id="go">Second method</button>
<button id="go2">First Method</button>
TL;DR:
How would you find the changes between two strings?
I'm dealing with large pieces of text each could be about a megabyte hundred kilobytes. This is running on the browser
Running a proper diff using just JavaScript can be potentially slow, but it depends on the performance requirements and the quality of the diff, and of course how often it must be run.
One quite efficient way would be to track the edits when the user is actually editing the document and only store those changes just after the moment they are done. For this you can use for example ACE editor, or any other editor that supports change tracking.
http://ace.c9.io/
ACE is tracking the changes while the document is edited. The ACE editor tracks the commands in easily comprehensible format like:
You can hook to the changes of the ACE editor and listen to the change events:
To learn how it works just create some test runs which capture the changes. Basicly there are only those for commands:
Removing the newline from the text can be a bit tricky.
When you have this list of changes you are ready to replay the changes against the text file. You can even merge similar or overlapping changes into a single change - for example inserts to subsequent characters could be merged into a single change.
There will be some problems when you are testing this, composing the string back to text is not trivial but quite doable and should not be more than around 100 lines of code or so.
The nice thing is that when you are finished, you have also undo and redo commands easily available, so you can replay the whole editing process.