I have tried SoX for removing silence and noise from an audio file, and I would like to understand the technical details of how it does this. That understanding matters before professional software can rely on it (I know it works well and has been used by many).
When noise is sampled with the noiseprof effect to build a noise profile and then removed with noisered, what is SoX actually doing? The same question applies to the vad effect. Is there a technical explanation, or a published paper, that I can read to understand it?
I have a background in signal processing due to my studies (scientific basics of speech and music, communication sciences) and just had a look into the code of the noise reduction algorithm of sox.
Without analyzing it too deeply, it seems to compute an FFT of both the noise profile and the original signal, subtract the former from the latter, and then run an inverse FFT to re-create a signal similar to the original.
This should attenuate each frequency by the amount it appears in the noise signal.
The whole process seems to be done window by window, which should allow streaming.
As I said, this is just based on my background knowledge and the short glance I took at the code, so there might be aspects which I didn't grasp.
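To make the idea concrete, here is a minimal sketch of plain spectral subtraction, done window by window with overlap-add. It is not SoX's code: the function names, the `frame` size, the square-root Hann window and the `amount` scaling factor are all my own assumptions for illustration, and it assumes NumPy is installed.

```python
import numpy as np

def _window(frame):
    # Square-root periodic Hann: applied before the FFT and again after the
    # inverse FFT, the overlapped windows sum to 1 at a hop of frame // 2.
    return np.sqrt(np.hanning(frame + 1)[:-1])

def noise_magnitude_profile(noise, frame=256):
    """Average magnitude spectrum of a noise-only clip (one value per FFT bin)."""
    window = _window(frame)
    hop = frame // 2
    mags = [np.abs(np.fft.rfft(noise[s:s + frame] * window))
            for s in range(0, len(noise) - frame + 1, hop)]
    return np.mean(mags, axis=0)

def spectral_subtract(signal, profile, frame=256, amount=1.0):
    """Subtract `amount * profile` from each window's magnitude spectrum."""
    window = _window(frame)
    hop = frame // 2
    out = np.zeros(len(signal))
    for s in range(0, len(signal) - frame + 1, hop):
        spec = np.fft.rfft(signal[s:s + frame] * window)
        # Reduce each bin by the noise magnitude, never going below zero.
        mag = np.maximum(np.abs(spec) - amount * profile, 0.0)
        # Keep the original phase and go back to the time domain.
        cleaned = mag * np.exp(1j * np.angle(spec))
        out[s:s + frame] += np.fft.irfft(cleaned, n=frame) * window
    return out
```

Each window only needs its own samples plus the fixed profile, which is why this kind of processing can run on a stream. The first and last half window fade in and out because only one window covers them.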
EDIT:
I also had a glance at the VAD code; it seems to monitor the spectrum for energy in the specified frequency range and, where it finds some, declares that part "voice". All parts (windows) not declared "voice" are then silenced (as far as I can see). Effectively, this should remove all background noise in a pure-voice recording.
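The behaviour described above can be sketched as a simple band-energy detector. Again, this is only an illustration of the idea and not SoX's implementation: the function names, the default `band` of 300 to 3400 Hz, the `threshold` value and the non-overlapping windows are assumptions I picked for the example, and it assumes NumPy is installed.

```python
import numpy as np

def band_energy_flags(signal, rate, frame=256, band=(300.0, 3400.0), threshold=1e-3):
    """Return one boolean per window: True if the in-band energy exceeds `threshold`."""
    freqs = np.fft.rfftfreq(frame, d=1.0 / rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    window = np.hanning(frame)
    flags = []
    for s in range(0, len(signal) - frame + 1, frame):
        power = np.abs(np.fft.rfft(signal[s:s + frame] * window)) ** 2
        # Normalise by frame**2 so the threshold does not depend on window length.
        flags.append(power[in_band].sum() / frame ** 2 >= threshold)
    return np.array(flags, dtype=bool)

def silence_non_voice(signal, flags, frame=256):
    """Zero every window that was not flagged as voice."""
    out = np.zeros(len(signal))
    for i, is_voice in enumerate(flags):
        if is_voice:
            out[i * frame:(i + 1) * frame] = signal[i * frame:(i + 1) * frame]
    return out
```

A detector this crude flags any sound with enough energy in the band, so broadband noise would also count as "voice"; a practical detector needs more than a fixed energy threshold.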