I have N translations of the same document, divided into parts (let's call them verses). Some translations omit some verses, and no translation contains ALL of the verses.
I want to 'align' the translations (i.e. create records in a database or rows in a spreadsheet) based on content, by creating groups. Each group should contain M verses, where M is the number of translations in which that verse appears (M ≤ N). No verse may belong to more than one group.
What I have thus far (using various APIs available for Python):
- Construct a 1D list of all verses in all translations (keeping track of which verse comes from which translation)
- For each verse:
  - Translate the verse to English using Google Translate
  - Get the tf-idf similarity of the verse relative to all other verses
  - Find the most similar verse in every other translation
In effect I end up with a graph with directed edges. Each edge carries a likelihood (a percentage) giving the similarity between the verse it points from and the verse it points to.
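For reference, these steps can be sketched with scikit-learn. Everything here is illustrative: the sample verses, the translation IDs, and the assumption that the texts have already been machine-translated to English.

```python
# Sketch of the similarity graph described above, using scikit-learn.
# Sample data and names are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (translation_id, verse_text) pairs -- the flat 1D list of all verses.
verses = [
    (0, "in the beginning god created the heavens and the earth"),
    (0, "and the earth was without form and void"),
    (1, "at the start god made heaven and earth"),
    (1, "the earth was formless and empty"),
    (2, "god created the sky and the earth in the beginning"),
]

texts = [text for _, text in verses]
tfidf = TfidfVectorizer().fit_transform(texts)

# P[i, j]: weight of the directed edge from verse i to verse j.
# Note that cosine similarity is symmetric (P[i, j] == P[j, i]);
# an asymmetric scoring scheme would make the two directions differ.
P = cosine_similarity(tfidf)

# For each verse, find the most similar verse in every *other* translation.
for i, (src, _) in enumerate(verses):
    for other in {t for t, _ in verses} - {src}:
        candidates = [j for j, (t, _) in enumerate(verses) if t == other]
        best = max(candidates, key=lambda j: P[i, j])
        print(f"verse {i} -> best match in translation {other}: verse {best}")
```

From here the matrix P is exactly the edge-weight graph described above, just stored densely.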
Example:
- N = 3 translations
- 2 verses in each translation
- Correct grouping (like a human would group them) is (A,B,C), (D,E,F)
- My algorithm gives: a figure of the similarity graph (directed edges labelled with percentages, e.g. 77 from a to b and 85 from b to a). The correct grouping is obvious to the human eye.
How can I expand this algorithm to achieve the grouping that I need? The results will be checked by humans, so it need not be perfect, but it has to be automated.
Some definitions to make the explanation easier:
- P(x,y) - probability from node x to node y (e.g. above, P(a,b) = 77 and P(b,a) = 85).
- CP(x,y) - combined probability; can be P(x,y) * P(y,x) or P(x,y) + P(y,x).
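In matrix form, both CP variants reduce to one elementwise operation. A quick sketch, with a made-up 3-node P matrix:

```python
import numpy as np

# Hypothetical directed similarity matrix: P[x, y] is the weight of
# the edge from x to y (values here are illustrative only).
P = np.array([
    [0.00, 0.77, 0.10],
    [0.85, 0.00, 0.20],
    [0.15, 0.25, 0.00],
])

CP_mul = P * P.T   # CP(x, y) = P(x, y) * P(y, x)
CP_add = P + P.T   # CP(x, y) = P(x, y) + P(y, x)

# Either way, CP is symmetric: CP(x, y) == CP(y, x).
```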
The algorithm I'd suggest is as follows:
Find a couple x, y with the highest CP(x, y) and then treat them as one node (a.k.a. x_y). Re-calculate the graph so that each edge to either of the two nodes is taken into account; this is done pretty efficiently using a matrix representation of the graph. Iterate this step until you have M groups.
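This merge loop amounts to greedy agglomerative clustering over the CP matrix. Below is a minimal sketch under my own assumptions, not the exact code implied above: the merge rule (taking the maximum over the merged rows/columns, i.e. single linkage) and the toy CP values for the six-verse example are made up; a mean- or sum-based merge rule would be equally defensible.

```python
import numpy as np

def greedy_group(CP, target_groups):
    """Greedily merge the pair of groups with the highest combined
    probability until only target_groups groups remain.

    CP must be a symmetric matrix; the diagonal is ignored. The merge
    rule here (maximum over merged rows/columns, i.e. single linkage)
    is an assumption, not the only reasonable choice.
    """
    groups = [[i] for i in range(CP.shape[0])]
    CP = CP.astype(float)            # work on a copy
    np.fill_diagonal(CP, -np.inf)    # never merge a group with itself

    while len(groups) > target_groups:
        # Row/column of the best remaining pair (ensure a < b).
        a, b = divmod(int(np.argmax(CP)), CP.shape[0])
        if a > b:
            a, b = b, a
        groups[a].extend(groups.pop(b))
        # Fold row/column b into a, then drop b from the matrix.
        merged = np.maximum(CP[a], CP[b])
        CP[a, :] = merged
        CP[:, a] = merged
        CP = np.delete(np.delete(CP, b, axis=0), b, axis=1)
        CP[a, a] = -np.inf
    return groups

# Toy CP matrix for the question's six verses A..F: high combined
# probability inside (A,B,C) and (D,E,F), low across (values made up).
CP = np.array([
    [0., .9, .8, .1, .2, .1],
    [.9, 0., .7, .2, .1, .2],
    [.8, .7, 0., .1, .1, .1],
    [.1, .2, .1, 0., .9, .8],
    [.2, .1, .1, .9, 0., .7],
    [.1, .2, .1, .8, .7, 0.],
])
groups = greedy_group(CP, 2)
# groups -> [[0, 1, 2], [3, 4, 5]], i.e. (A,B,C) and (D,E,F)
```

Stopping at a fixed number of groups is the simplest criterion; since the results are human-checked anyway, stopping instead when the best remaining CP drops below a threshold may handle verses that genuinely have no counterpart.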