I have N translations of the same document, divided into parts (let's call them verses). Some translations omit some verses, and no translation contains ALL of the verses.
I want to 'align' the translations (i.e. create records in a database or rows in a spreadsheet) based on content, by creating groups. Each group should contain M verses, where M is the number of translations in which the verse appears (M ≤ N). No verse may belong to more than one group.
What I have thus far (using various APIs available for Python):
- Construct a 1D list of all verses in all translations (keeping track of which verse comes from which translation)
- For each verse:
  - Translate the verse to English using Google Translate
  - Get the tf-idf similarity of the verse relative to all other verses
  - Find the most similar verse in every other translation
In effect I end up with a graph with directional edges. Each edge carries a likelihood (a percentage) expressing how similar the verse it points to is to the verse it points from.
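The steps above can be sketched roughly as follows. This is a plain-Python stand-in (no external libraries): the verse texts, IDs, and translation names are illustrative toy data, the Google Translate step is omitted, and the tf-idf weighting is a simplified smoothed variant rather than any particular library's formula.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf vectors (as term -> weight dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed idf
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# (translation, verse_id, English text) -- toy data, not real verses
verses = [
    ("T1", "A", "in the beginning god created the heavens and the earth"),
    ("T2", "B", "at the start god made heaven and earth"),
    ("T3", "C", "god created heaven and earth in the beginning"),
    ("T1", "D", "let there be light and there was light"),
    ("T2", "E", "god said let light exist and light appeared"),
    ("T3", "F", "there was light when god called for light"),
]

vecs = tfidf_vectors([text.split() for _, _, text in verses])

# Directed edges: for each verse, its most similar verse in every other translation
best = {}  # (verse_id, other_translation) -> (best_match_id, similarity)
for i, (ti, vi, _) in enumerate(verses):
    for j, (tj, vj, _) in enumerate(verses):
        if ti == tj:
            continue
        s = cosine(vecs[i], vecs[j])
        if s > best.get((vi, tj), ("", -1.0))[1]:
            best[(vi, tj)] = (vj, s)
```

The resulting `best` dict is exactly the directed graph described: one outgoing edge per verse per other translation, weighted by similarity.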
Example:
- N = 3 translations
- 2 verses in each translation
- Correct grouping (like a human would group them) is (A,B,C), (D,E,F)
- My algorithm gives a graph of pairwise similarity percentages (figure omitted)
The correct grouping is obvious to the human eye.
How can I expand this algorithm to achieve the grouping that I need? The results will be checked by humans, so it need not be perfect, but it has to be automated.
Some definitions to make the explanation easier:
- P(x, y) - the probability from node x to node y (e.g. above, P(a, b) = 77 and P(b, a) = 85).
- CP(x, y) - the combined probability; it can be P(x, y) * P(y, x) or P(x, y) + P(y, x).
The algorithm I'd suggest is as follows:
Find a couple x, y with the highest CP(x, y) and treat them as one node (a.k.a. x_y). Re-calculate the graph so that each edge to either of the two nodes is taken into account. This can be done pretty efficiently using a matrix representation of the graph. Iterate this step until you have M groups.
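A minimal sketch of this greedy merging, under some stated assumptions: P is given as a dict of directed percentages, CP(x, y) = P(x, y) * P(y, x), groups never merge if they would hold two verses from the same translation, and merging stops when no valid positive-CP pair remains (the answer above stops at a fixed group count instead). The dict-based recomputation replaces the matrix representation for brevity; all names and weights are illustrative.

```python
def align(P, translation_of):
    """Greedily merge verses into groups by combined probability.

    P: dict (x, y) -> directed similarity percentage.
    translation_of: dict verse -> the translation it belongs to.
    """
    groups = [frozenset([v]) for v in translation_of]

    def cp(g1, g2):
        # combined probability between groups: best cross-pair CP
        return max(P.get((x, y), 0) * P.get((y, x), 0)
                   for x in g1 for y in g2)

    while True:
        candidate = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                trans_i = {translation_of[v] for v in groups[i]}
                trans_j = {translation_of[v] for v in groups[j]}
                if trans_i & trans_j:      # would duplicate a translation
                    continue
                score = cp(groups[i], groups[j])
                if score > 0 and (candidate is None or score > candidate[0]):
                    candidate = (score, i, j)
        if candidate is None:              # nothing left worth merging
            break
        _, i, j = candidate
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups

# Toy data mirroring the 3-translation, 2-verse example from the question
translation_of = {"A": 1, "B": 2, "C": 3, "D": 1, "E": 2, "F": 3}
P = {}
for x, y in [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F")]:
    P[(x, y)], P[(y, x)] = 90, 85      # strong mutual edges within a group
P[("A", "E")], P[("E", "A")] = 40, 35  # weak noise edge across groups

groups = align(P, translation_of)      # -> the grouping (A,B,C), (D,E,F)
```

The same-translation check is what keeps each group at most one verse per translation, matching the question's constraint that a group holds M ≤ N verses.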