I have a transformer encoder-decoder model that I want to train jointly with CTC. The encoder outputs (via softmax) frame-wise CTC probabilities of shape (batch x maxframes x vocab), and the decoder outputs a per-character probability distribution of shape (batch x maxsequencelength x vocab). I want to combine the two for joint decoding. How do I go about this?
What I tried: I tried to combine them linearly as (1-lambda)*Pctc + lambda*Pdecoder, but the two tensors have different shapes, so they cannot be added element-wise. I think I need to decode or collapse the CTC output into character probabilities first, i.e. remove all the blanks and repeated symbols, but I have no idea how to do that.
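To make the question concrete, here is a minimal sketch of the two pieces I think are involved, written in plain Python for a single utterance. It is not a definitive implementation. The names `ctc_greedy_collapse`, `ctc_log_prob` and `joint_score` are hypothetical, and I am assuming the blank symbol has index 0. The first function does the collapse described above (merge repeats, then drop blanks) on the per-frame argmax. The second uses the standard CTC forward recursion to score a whole candidate character sequence, which would sidestep the shape mismatch by combining the two models per hypothesis in the log domain instead of per frame.

```python
import math

BLANK = 0  # assumption: index of the CTC blank symbol in the vocabulary


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Merge consecutive repeated ids, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_log_prob(log_probs, target, blank=BLANK):
    """log P_ctc(target | x) via the CTC forward recursion.

    log_probs: list of T frames, each a list of V log-probabilities.
    target:    list of character ids (no blanks).
    """
    # Interleave blanks: [blank, c1, blank, c2, ..., blank]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)

    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            a = alpha[s]                       # stay on the same symbol
            if s >= 1:
                a = _logaddexp(a, alpha[s - 1])  # advance by one
            # skip over a blank, allowed only between different characters
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = _logaddexp(a, alpha[s - 2])
            if a != -math.inf:
                new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last character or the trailing blank
    total = alpha[S - 1]
    if S > 1:
        total = _logaddexp(total, alpha[S - 2])
    return total


def joint_score(ctc_lp, dec_lp, lam):
    """Same weighting as in my attempt, but on sequence log-probabilities."""
    return (1.0 - lam) * ctc_lp + lam * dec_lp
```

With this, each candidate sequence produced by the decoder (for example from beam search) could be rescored as `joint_score(ctc_log_prob(frames, hyp), decoder_log_prob, lam)`, where `decoder_log_prob` is the sum of the decoder's per-character log-probabilities for that hypothesis. Whether rescoring complete hypotheses like this is enough, or whether the CTC score has to be applied to partial prefixes during the beam search, is part of what I am unsure about.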