Is Speech-to-Text-to-Translation an Impossible Dream?


Theoretically, one could use a laptop's, tablet's, or phone's microphone to capture spoken words, convert them to text on the screen, and then, by accessing an API such as Google Translate, see "a" (not "the" - hardly ever, anyway) rough "draft" of a translation of those words (say, from English to Spanish or from Spanish to English).
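In code, the pipeline I have in mind would look something like the Python sketch below. It leans on the third-party speech_recognition and googletrans packages (both unofficial wrappers around Google services, and their exact APIs vary by version), so treat it as an outline of the idea rather than a finished program:

    # Minimal sketch of the microphone -> text -> translation pipeline.
    # Assumes the third-party speech_recognition and googletrans packages;
    # both are unofficial and their exact APIs change between versions.
    import speech_recognition as sr
    from googletrans import Translator

    recognizer = sr.Recognizer()
    translator = Translator()

    with sr.Microphone() as source:
        # Sample the room briefly so steady background hum is ignored.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        print("Listening...")
        audio = recognizer.listen(source)  # blocks until a pause is detected

    # Speech-to-text via Google's web speech recognizer.
    english_text = recognizer.recognize_google(audio, language="en-US")
    print("Heard:", english_text)

    # Text-to-text translation; source and target are fixed, not auto-detected.
    result = translator.translate(english_text, src="en", dest="es")
    print("Draft translation:", result.text)

Note that the source and target languages are hard-coded rather than auto-detected, and that listen() simply waits for a pause - which leads straight to the problems below.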

I was thinking this would be useful in a courtroom - as a sort of "hands-free memo pad" for court interpreters.

Theoretically simple, but is it feasible? I see several potential problems:

The software would have to be told which is the source language and which is the target language. Otherwise, if the device were left to its own devices (auto-detect), there might be a delay, and sometimes it would even draw the wrong conclusion.

Background noises and voices would have to be filtered out.

The translation (attempt) would only be valid once the speaker had finished their sentence - and how would the software know that? By the length of the pauses? Some people pause within a sentence for a long time; some people barely pause between sentences, so...how would that work? (I sketch a crude pause-based approach below.)

People might not speak clearly, or might speak in hard-to-understand accents.

And this is without even mentioning (except here, obliquely) that context is often misconstrued by the robot underlord translators.
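To make the pause problem concrete: the crudest approach I can imagine is an energy-based endpointer that flushes a "sentence" whenever the signal stays quiet long enough. Here is a minimal Python sketch, with made-up frame sizes and thresholds (real systems use trained voice-activity detectors, not a fixed RMS cutoff):

    # Crude pause-based endpointing: flush a "sentence" after enough silence.
    # Frame size, threshold, and pause length are illustrative guesses only.
    import numpy as np

    SAMPLE_RATE = 16000          # samples per second
    FRAME = 480                  # 30 ms frames at 16 kHz
    SILENCE_RMS = 300.0          # below this RMS a frame counts as silence
    END_PAUSE_FRAMES = 25        # ~750 ms of silence ends a "sentence"

    def split_on_pauses(samples: np.ndarray) -> list[np.ndarray]:
        """Split 16-bit mono audio into chunks at long pauses."""
        chunks, current, silent_run = [], [], 0
        for start in range(0, len(samples) - FRAME, FRAME):
            frame = samples[start:start + FRAME].astype(np.float64)
            rms = np.sqrt(np.mean(frame ** 2))
            current.append(frame)
            silent_run = silent_run + 1 if rms < SILENCE_RMS else 0
            # Only flush once we have heard something besides silence.
            if silent_run >= END_PAUSE_FRAMES and len(current) > silent_run:
                chunks.append(np.concatenate(current))
                current, silent_run = [], 0
        if current:
            chunks.append(np.concatenate(current))
        return chunks

...and this is exactly where it would break down: a long mid-sentence pause produces a premature cut, and a fast talker who barely pauses never produces one at all.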

My intuition is that if Abraham Lincoln and Martin Luther King were speaking at the same time (which, even in a courtroom, does happen at times), the software would come up with something like this:

For score and seven years ago I am happy to join with you to day. Our fathers brought fourth on this continent, a new nation, in what will go down in history as the greatest conceived in Liberty, and. Dedicated to the perspiration that demonstration for freedom in all men are created equal. The history of our nation.

...which would then be translated something like so:

Por puntuación y hace siete años que estoy encantado de unirme a ustedes hoy. Nuestros padres trajeron cuarto en este continente, una nueva nación, en lo que va a pasar a la historia como el mayor concebida en la libertad, y. Dedicada a la transpiración que la demostración por la libertad en todos los hombres son creados iguales. La historia de nuestra nación.

What I'm saying, I guess, is that humans "rock" when it comes to this sort of thing - at least compared to machines (software) in their current degree of sophistication, but do we, or will we, "rock" enough to overcome this problem? Is there a way to surmount these hurdles, at least to a sufficient extent for such a program to be worth the trouble to use? Perfection would be unattainable; matching human skill would also be, I believe, an unreachable goal, especially because of the context factor. Nevertheless: can Speech-to-Text-to-Context-to-Translation be done even relatively well and, if so, how?


1 Answer

Emil A. (best answer):

I believe it's possible and it can be done relatively well:

  • the device should be able to understand the context, at least partially, based on data from all kinds of sensors and from memory; these would need to be finely tuned to give a good result, but isn't that something people actually do all the time? We evaluate context based on what we see, what we feel, and where we are - and on what we've seen, what we've felt, and where we've been. A smart device should be able to reproduce that

  • the device should be able to guess where a sentence starts and ends based on everything it knows about the given language - people do the same (a rough heuristic along these lines is sketched below).
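As a very rough illustration of that second point - purely a heuristic sketch in Python, with a hand-picked word list and made-up pause thresholds, nothing like a real language model:

    # Rough language-aware endpointing heuristic, combined with pause length:
    # only flush the partial transcript if it does not end on a word that
    # almost never ends an English sentence. Word list and thresholds are
    # illustrative guesses, not a real model.
    NON_FINAL_WORDS = {
        "a", "an", "the", "and", "or", "but", "of", "to", "in",
        "on", "with", "for", "that", "which", "because", "if",
    }

    def looks_finished(partial_transcript: str) -> bool:
        """Guess whether a partial transcript could be a complete sentence."""
        words = partial_transcript.strip().lower().split()
        return bool(words) and words[-1] not in NON_FINAL_WORDS

    def should_flush(partial_transcript: str, pause_seconds: float) -> bool:
        """Combine pause length with a crude linguistic cue."""
        if pause_seconds >= 1.5:          # a very long pause ends it regardless
            return True
        return pause_seconds >= 0.6 and looks_finished(partial_transcript)

A real system would replace the word list with a language model's estimate of how likely the partial sentence is to be complete, but the principle is the same: combine what the device hears (the pause) with what it knows about the language.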

If the device had the same sensors, knowledge, and memory that people do, then it could theoretically do the same.

Even a blink of an eye can give a lot of context. I think it all boils down to the complexity and range of the data the device accepts and uses to translate the text correctly: the more it knows, the better it is.