I'm building something like a "brainstorming" tool: a group of people can shout terms into a microphone. The input is converted to text (Google Speech-to-Text) and displayed in a word cloud. The word cloud groups identical words (or terms). But I can't identify the individual terms correctly. Google only splits the input if there is a long silence between utterances. If two people shout shortly after one another, the different ideas are handled as one single idea. That's not what I want. Any ideas? E.g. one person says "dark blue" and another person says "dark red". Google gives me one output: "dark blue dark red".

Nikolay Shmyrev On

Google has an experimental speaker diarization feature, but it does not work very reliably. Speaker separation is also supported by other toolkits and APIs.
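
If diarization does give you a speaker label per word, splitting the transcript into terms could look roughly like the sketch below. It assumes you have already turned the recognizer output into a list of `(word, speaker_tag)` pairs in spoken order; that input shape and the function name `split_terms_by_speaker` are hypothetical, not part of any API. Each run of consecutive words from the same speaker is treated as one term.

```python
from collections import Counter
from itertools import groupby


def split_terms_by_speaker(words):
    """Group consecutive words with the same speaker tag into one term.

    `words` is an iterable of (word, speaker_tag) pairs in spoken order.
    This input shape is an assumption; adapt it to whatever your
    recognizer actually returns.
    """
    terms = []
    # groupby starts a new group every time the speaker tag changes
    for _, group in groupby(words, key=lambda pair: pair[1]):
        terms.append(" ".join(word for word, _ in group))
    return terms


# Example from the question: two people shouting right after each other
words = [("dark", 1), ("blue", 1), ("dark", 2), ("red", 2)]
terms = split_terms_by_speaker(words)

# Count identical terms for the word cloud
counts = Counter(terms)
```

Note that this only splits on a change of speaker, so two terms shouted back to back by the same person would still end up merged, and any mislabeled word from the diarizer will split or join terms incorrectly.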