I am trying to download audio files from a Hugging Face dataset in Google Colab with the code below, but I get an error.
pip install datasets
from datasets import DatasetDict
from collections import defaultdict
from datasets import load_dataset
ds = load_dataset('imvladikon/hebrew_speech_kan')
a = ds['train'][0]['audio']['path']
print(a)
from huggingface_hub import hf_hub_download
audio_file_url = '/root/.cache/huggingface/datasets/downloads/extracted/8ce7402f6482c6053251d7f3000eec88668c994beb48b7ca7352e77ef810a0b6/train/e429593fede945c185897e378a5839f4198.wav'
hf_hub_download(audio_file_url)
Error:
---------------------------------------------------------------------------
HFValidationError Traceback (most recent call last)
<ipython-input-36-6fb2d1a885ee> in <cell line: 3>()
1 from huggingface_hub import hf_hub_download
2 audio_file_url = '/root/.cache/huggingface/datasets/downloads/extracted/8ce7402f6482c6053251d7f3000eec88668c994beb48b7ca7352e77ef810a0b6/train/e429593fede945c185897e378a5839f4198.wav'
----> 3 hf_hub_download(audio_file_url)
1 frames
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py in validate_repo_id(repo_id)
156
157 if repo_id.count("/") > 1:
--> 158 raise HFValidationError(
159 "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
160 f" '{repo_id}'. Use `repo_type` argument if needed."
HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/huggingface/datasets/downloads/extracted/8ce7402f6482c6053251d7f3000eec88668c994beb48b7ca7352e77ef810a0b6/train/e429593fede945c185897e378a5839f4198.wav'. Use `repo_type` argument if needed.
I also tried building the URL from the repo id and filename:
from huggingface_hub import hf_hub_url
hf_hub_url(
repo_id="imvladikon/hebrew_speech_kan", filename="e429593fede945c185897e378a5839f4198.wav"
)
This outputs the URL https://huggingface.co/imvladikon/hebrew_speech_kan/resolve/main/e429593fede945c185897e378a5839f4198.wav but when I open it, the Hugging Face website reports that the repository is not available.
Any help is appreciated. Thanks in advance.
The given function needs a repo_id and a filename to run, so try this:
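As a minimal sketch of that call, assuming the file is stored in the dataset repo under the path you pass as filename: hf_hub_download takes the repo id and the file's path inside the repo as separate arguments, and repo_type="dataset" is needed because the default repo type is a model repo. The helper names is_valid_repo_id_shape and fetch_audio are hypothetical, introduced only for illustration, and the slash check only mirrors the line shown in the traceback, not the library's full validation.

```python
def is_valid_repo_id_shape(candidate: str) -> bool:
    """Mirror the check shown in the traceback: at most one '/' is allowed."""
    return candidate.count("/") <= 1


def fetch_audio(repo_id: str, filename: str) -> str:
    """Download one file from a dataset repo and return its local path.

    `filename` is the file's path inside the repo (an assumption: the file
    must actually exist there as a standalone file).
    """
    # Fail early if a local cache path was passed where a repo id belongs.
    if not is_valid_repo_id_shape(repo_id):
        raise ValueError(f"{repo_id!r} looks like a file path, not a repo id")

    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # repo_type="dataset" is required because the default repo type is "model".
    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
```

Usage (needs network access) would look like fetch_audio("imvladikon/hebrew_speech_kan", "<path of the file inside the repo>"). The extracted/ directory in the cached path suggests the audio was unpacked from an archive, so an individual .wav may not exist as a standalone file in the repo; in that case the local path already printed by print(a) points to the downloaded file and can be used directly. hf_hub_url accepts the same repo_type="dataset" argument, which may be why the URL generated without it, lacking the /datasets/ prefix, was reported as unavailable.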