Optimal string from segments with words and timestamps

I have a list of segments of transcribed audio. Each transcription contains words with start and end timestamps and a confidence score. For example:

 Goedemiddag, 1.13 1.54 0.94892578125
 welkom       1.68 1.98 0.9658203125
 bij          1.98 2.16 0.99853515625
 de           2.16 2.3  0.98291015625
 middag       2.3  2.84 0.858154296875
 show         2.84 3.58 0.81549072265625
 Ik           3.92 4.06 0.99365234375

And the next:

 Welkom       1.52 1.96 0.856689453125
 bij          1.96 2.12 0.99853515625
 de           2.12 2.3  0.9833984375
 middag       2.3  2.84 0.843994140625
 show         2.84 3.56 0.812255859375
 Ik           3.94 4.06 0.99267578125
 ben          4.06 4.24 0.9990234375

Etc.

The segments overlap (currently each 5-second transcription overlaps the next by 4 seconds, but the overlap will be smaller later on).

Is there a general algorithm that merges these sequences into one big optimal sequence*? By optimal I mean taking the word-level confidence into account. In this example the words are transcribed identically, but the transcriber might make a mistake, i.e. a word can be mis-transcribed, so an exact string match cannot be assumed.

[*] I looked into the optimal sequence alignment algorithms from bioinformatics, but they do not take into account the timestamp/position information that I already have, so it seems there should be room for optimization.

There is 1 answer

olegarch On

I do not know of an obvious algorithm that solves this problem, but the following proposal may be useful:

  1. Create a vote structure that contains a pair (letter, confidence).
  2. Create a 2D array of such structures, where:
  • X is a "timeslot", in your case a 10 ms timeslot position, like 1.13, 1.14, etc.
  • Y is a channel. In your example, you have 2 overlapping channels.

Then, for each chunk, compute the array position of each letter and store in vote[time][chan] the pair generated from that chunk. Once every chunk has been applied, scan the array by timeslot and, for each slot, select the letter with the highest vote level.
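
The voting idea above could be sketched in Python roughly as follows. This is only an illustration under some assumptions, not a tested solution: it votes on whole words rather than individual letters (a simplification of the proposal), it treats two words as the same when they match case-insensitively, and it merges consecutive slots that elect the same word into one output word. The names Word, to_slot and merge_by_vote are made up for this example.

```python
from collections import namedtuple

# Hypothetical record for one transcribed word (not from any library).
Word = namedtuple("Word", "text start end conf")

def to_slot(t):
    """Map a time in seconds to a 10 ms timeslot index."""
    return int(round(t * 100))

def merge_by_vote(channels):
    """channels: one list of Word per overlapping segment."""
    n_slots = max(to_slot(w.end) for ch in channels for w in ch)
    # vote[slot][chan] holds the Word covering that slot, or None.
    vote = [[None] * len(channels) for _ in range(n_slots)]
    for c, ch in enumerate(channels):
        for w in ch:
            for s in range(to_slot(w.start), to_slot(w.end)):
                vote[s][c] = w

    merged = []
    prev_key = None
    for s in range(n_slots):
        candidates = [w for w in vote[s] if w is not None]
        if not candidates:
            prev_key = None  # silence ends the current run
            continue
        # The most confident channel wins this slot.
        best = max(candidates, key=lambda w: w.conf)
        key = best.text.casefold()
        if key == prev_key:
            # Same word continues: keep the more confident version.
            if best.conf > merged[-1].conf:
                merged[-1] = best
        else:
            merged.append(best)
        prev_key = key
    return merged

# The two segments from the question.
seg1 = [
    Word("Goedemiddag,", 1.13, 1.54, 0.94892578125),
    Word("welkom", 1.68, 1.98, 0.9658203125),
    Word("bij", 1.98, 2.16, 0.99853515625),
    Word("de", 2.16, 2.3, 0.98291015625),
    Word("middag", 2.3, 2.84, 0.858154296875),
    Word("show", 2.84, 3.58, 0.81549072265625),
    Word("Ik", 3.92, 4.06, 0.99365234375),
]
seg2 = [
    Word("Welkom", 1.52, 1.96, 0.856689453125),
    Word("bij", 1.96, 2.12, 0.99853515625),
    Word("de", 2.12, 2.3, 0.9833984375),
    Word("middag", 2.3, 2.84, 0.843994140625),
    Word("show", 2.84, 3.56, 0.812255859375),
    Word("Ik", 3.94, 4.06, 0.99267578125),
    Word("ben", 4.06, 4.24, 0.9990234375),
]

merged = merge_by_vote([seg1, seg2])
print(" ".join(w.text for w in merged))
```

Note the limits of this sketch: a mis-transcribed word that wins only a few slots in the middle of another word would split that word in the output, and two genuinely repeated words with no silent slot between them would be merged into one. A real solution would probably need a smoothing or minimum-duration step on top of the raw vote.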