Whisper Inference


In the transcribe stage, why is N_FRAMES subtracted from the mel length? And in the loop over the mel segments, it looks like the last segment is skipped when it is shorter than 3000 frames. Why? Suppose the mel is [80, 4100]: the segments would be [80, 3000] and [80, 1100]. The model transcribes the first segment, [80, 3000], but it seems to do nothing with the remaining [80, 1100].


    # Pad 30-seconds of silence to the input audio, for slicing
    mel = log_mel_spectrogram(audio, model.dims.n_mels, padding=N_SAMPLES)
    content_frames = mel.shape[-1] - N_FRAMES  # N_FRAMES = 3000
    content_duration = float(content_frames * HOP_LENGTH / SAMPLE_RATE)
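To check the frame arithmetic for the [80, 4100] example, here is a minimal sketch. It is not Whisper's code: `plan_segments` is a hypothetical helper, the mel is a zero-filled NumPy array standing in for a real spectrogram, and the loop is assumed to always advance by a full window (the real loop may advance by less, depending on predicted timestamps). The constants are the usual 16 kHz / hop 160 / 3000-frame values.

```python
import numpy as np

SAMPLE_RATE = 16000  # samples per second
HOP_LENGTH = 160     # samples per mel frame
N_FRAMES = 3000      # frames in one 30-second window


def plan_segments(total_frames, n_frames=N_FRAMES):
    """Hypothetical helper: list (seek, segment_size) pairs for a padded mel.

    total_frames is the mel length *including* the 30 s of padding,
    so the real content is total_frames - n_frames.
    """
    content_frames = total_frames - n_frames
    segments = []
    seek = 0
    while seek < content_frames:
        # The last window may hold fewer than n_frames of real content.
        segment_size = min(n_frames, content_frames - seek)
        segments.append((seek, segment_size))
        seek += segment_size  # assumption: always advance by the full window
    return content_frames, segments


# 4100 content frames plus 3000 frames of padded silence (stand-in data).
mel = np.zeros((80, 4100 + N_FRAMES), dtype=np.float32)

content_frames, segments = plan_segments(mel.shape[-1])
content_duration = float(content_frames * HOP_LENGTH / SAMPLE_RATE)

# Because of the padding, a full-width slice starting at the last seek
# position still exists: content first, then silence.
last_seek, last_size = segments[-1]
last_window = mel[:, last_seek : last_seek + N_FRAMES]

print(content_frames, content_duration)
print(segments)
print(last_window.shape)
```

Under these assumptions, subtracting `N_FRAMES` only removes the padding from the frame count, giving 4100 content frames (41 seconds). The plan contains two windows, `(0, 3000)` and `(3000, 1100)`, so the short tail is not dropped in this sketch; the padding is what lets a full 3000-frame slice be taken starting at frame 3000.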
