I would like to access various datasets from huggingface.co that contain audio data. To begin, I am using the GigaSpeech dataset.
I understand how to use an `IterableDataset` (by including `streaming=True` when calling `load_dataset(...)`). However, this appears to download the entire audio file at once, as the returned item has a key `audio` whose value has keys `path` and `array`, where `array` appears to contain the sample data for the entire audio file.
I am using `torchaudio.io.StreamReader`, which appears to support streaming from a URL (i.e. from a remote file). I am wondering if it might be possible to have the `IterableDataset` (or something like it) iterate over the URLs to the audio files rather than downloading them directly.
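To make the idea concrete, here is a minimal sketch of what "iterating without decoding" might look like. It assumes that the streaming dataset's `audio` column can be cast to `datasets.Audio(decode=False)`, in which case each item should carry `path` and raw `bytes` instead of a decoded `array`. The helper names `iter_undecoded_audio` and `as_file_like` are hypothetical, and whether this works for GigaSpeech specifically (a gated dataset) is untested here.

```python
import io


def iter_undecoded_audio(dataset_name, config_name, split="train"):
    """Yield (path, raw_bytes) pairs without decoding the audio.

    Assumption: the dataset exposes an "audio" column that can be cast
    to Audio(decode=False); gated datasets also need authentication.
    """
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import Audio, load_dataset

    ds = load_dataset(dataset_name, name=config_name, split=split, streaming=True)
    ds = ds.cast_column("audio", Audio(decode=False))
    for example in ds:
        audio = example["audio"]
        # With decoding disabled, the item holds the encoded file contents.
        yield audio["path"], audio["bytes"]


def as_file_like(raw_bytes):
    """Wrap encoded audio bytes in a seekable file-like object."""
    return io.BytesIO(raw_bytes)
```

If this holds, the file-like object returned by `as_file_like` could presumably be passed as the `src` of `torchaudio.io.StreamReader`, which is documented to accept file-like objects as well as paths and URLs. Note that this still transfers the encoded file; it only skips eager decoding into an array.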
If this is not possible: I've looked in the cache folder several times and I can't find the audio file or even the folder that `path` seems to allude to. At any rate, since `array` seems to contain the audio data from the file, reading the source file itself appears unnecessary. However, `torchaudio.io.StreamReader` does not seem to support "streaming" from an array. I would like to know what the best method is to easily perform "streaming" with possible resampling over the `array` (whose `dtype` is `torch.float64`, but will need to be converted to `numpy.float32` at some point).
Obviously, I could implement my own windowing and resampling on the array, but it would be much better if I could use something pre-existing that works out of the box very similarly to the StreamReader.
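For reference, the hand-rolled fallback I have in mind looks roughly like the sketch below. It assumes `array` is a 1-D, NumPy-compatible sequence of samples; `resample_linear` and `stream_chunks` are hypothetical names. The resampling is plain linear interpolation with no anti-aliasing filter, so it is a stand-in for a proper resampler (for example `torchaudio.functional.resample`), not a replacement for one.

```python
import numpy as np


def resample_linear(samples, orig_sr, new_sr):
    """Crude linear-interpolation resampler (no anti-aliasing filter)."""
    samples = np.asarray(samples, dtype=np.float32)
    if orig_sr == new_sr or samples.size == 0:
        return samples
    n_out = int(round(samples.shape[0] * new_sr / orig_sr))
    old_t = np.arange(samples.shape[0]) / orig_sr  # original sample times (s)
    new_t = np.arange(n_out) / new_sr              # target sample times (s)
    return np.interp(new_t, old_t, samples).astype(np.float32)


def stream_chunks(array, orig_sr, new_sr, frames_per_chunk):
    """Yield float32 chunks of at most `frames_per_chunk` resampled frames."""
    # Resample the whole array once so chunk boundaries add no artifacts.
    resampled = resample_linear(array, orig_sr, new_sr)
    for start in range(0, resampled.shape[0], frames_per_chunk):
        yield resampled[start:start + frames_per_chunk]
```

Usage would be something like `for chunk in stream_chunks(item["audio"]["array"], 16000, 8000, 3000): ...`, where the last chunk may be shorter than `frames_per_chunk`.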