Running the fairseq-preprocess script produces binary files with integer indices corresponding to token ids in a dictionary.
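For illustration, the relationship between tokens and ids can be sketched in plain Python. This is a toy stand-in and not fairseq's implementation: the vocabulary and the `encode`/`decode` helpers are hypothetical, and the real binary files use fairseq's own on-disk format.

```python
# Toy illustration only: a hypothetical vocabulary, not fairseq's Dictionary class.
vocab = ["<pad>", "<unk>", "hello", "world"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(tokens):
    """Map tokens to integer ids, falling back to the <unk> id for unknown tokens."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(t, unk) for t in tokens]


def decode(ids):
    """Map integer ids back to a space-joined string using the same vocabulary."""
    return " ".join(vocab[i] for i in ids)


ids = encode("hello world".split())
print(ids)
print(decode(ids))
```

The point is that the ids alone are enough to recover the tokenized text, provided the dictionary used during preprocessing is still available.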
When I no longer have the original tokenized texts, what is the simplest way to explore the binarized dataset? The documentation does not say much about how a dataset can be loaded for debugging purposes.
I worked around this by loading the trained model and using it to decode the binarized sentences back to strings: