Algorithm to group parts of documents that belong together

95 views Asked by At

I have N translations of the same document, divided into parts (lets call them verses). Some translations have omitted some verses. No translation contains ALL of the verses.

I want to 'align' the translations (i.e. create records in a database or rows in a spreadsheet) based on content, by creating groups. Each group should contain M verses, where M is the number of translations in which the verse appears, and M < N. No verse may belong to more than one group.

What I have thus far (using various APIs available for Python):

  1. Construct a 1D list of all verses in all translations (keeping track of which verses comes from which translations)
  2. For each verse:
    • Translate the verse to English using Google Translate
    • Get the tf-idf similarity of the verse relative to all other verses
    • Find the most similar verse in every other translation

In effect I end up with a graph with directional edges. Each edge has a likelihood (percentage) which shows the similarity of the verse that it points to, with the verse that it points from.

Example:

  • N = 3 translations
  • 2 verses in each translation
  • Correct grouping (like a human would group them) is (A,B,C), (D,E,F)
  • My algorithm gives: Graph of verses The correct grouping is obvious to the human eye.

How can I expand this algorithm to achieve the grouping that I need? The results will be checked by humans, so it need not be perfect, but it has to be automated.

2

There are 2 answers

1
shapiro yaacov On BEST ANSWER

Some definitions to make the explanation easier:
P(x,y) - probability from node a to b. ( e.g. above - P(a,b)=77 and P(b,a)=85 ).
CP(x,y) - combined probability. can be P(x,y) * P(y,x) or P(x,y) + P(y,x).

The algorithm I'd suggest is as follows:

Find a couple x, y with the highest CP(x, y) and then treat them as one node (a.k.a. x_y). Re-calculate the graph so each edge to any of the two nodes is taken into account. This is done pretty efficiently using a matrix representation of the graph.
Iterate this step until you have M groups.

0
Petr On

If the verses are ordered as you write in the comments, then this can easily be formulated as the edit distance problem.

Firstly assume you have only two languages. You can reformulate your problem as follows: you need to convert one translation (A) to another (B) with following operations: you can either delete a verse (this will mean that this verse is present in A, but not in B), you can insert a verse (meaning it is not present in A, but is present in B), or you can substitute one verse with another (meaning that you match these two verses). You can assign costs to each of these operations; the cost for substituting will depend on verses similarity that you have already calculated, and you will need to define the cost for insertion or deletion somehow (you might need to experiment in this). After this, you run the standard algorithm mentioned in the Wikipedia, and you get your matching in quadratic time.

If you have more then two languages, you can either use a similar exact algorithm, but it will run slower (O(N^k) with N begin the maximal number of verses and k begin the number of languages), or you might employ some approximate algorithm such as match two languages first, then add third language, etc.