How to train a model in SageMaker Studio with .train and .test extension dataset files?


I'm trying to implement ML models with Amazon SageMaker Studio. The model I want to use is from Hugging Face, and it uses a dataset from the CoNLL Corpora.

Following the instructions in the Hugging Face documentation, I have to read a CSV file with train = pd.read_csv. The problem is the dataset file extension: the files end in .train and .test. The error I'm getting is: "ParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 3"

Is there a way to convert .test files to CSV files? Or how should I read files with these extensions?

Links

Dataset: https://www.kaggle.com/nltkdata/conll-corpora

Model: https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner


There is 1 answer

durga_sury On

The dataset in your link seems to be tab-separated, not comma-separated.

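pd.read_csv does not care about the file extension, so the .train and .test files can be read as they are. You only need to pass the right delimiter, for example df = pd.read_csv("<filename>", sep="\t")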
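A slightly fuller sketch of the same idea, in case it helps. It assumes the file has no header row, one token per line, and blank lines between sentences. The helper name read_conll, the column names, and the tiny inline sample are made up for illustration and are not taken from the dataset. If the columns in your copy turn out to be separated by single spaces rather than tabs, pass sep=" " instead.

```python
import csv
import io

import pandas as pd


def read_conll(source, sep="\t", columns=("token", "tag")):
    """Read a headerless, one-token-per-line file into a DataFrame.

    `columns` is a hypothetical naming choice; adjust it to the number
    of columns your file actually has.
    """
    return pd.read_csv(
        source,                  # a path such as "<filename>" or a file-like object
        sep=sep,                 # assumed delimiter; try " " if tabs do not match
        header=None,             # the first line is data, not column names
        names=list(columns),
        quoting=csv.QUOTE_NONE,  # treat a lone " token as text, not as a quote
        skip_blank_lines=True,   # drop the empty lines between sentences
        keep_default_na=False,   # keep tokens like "NA" or "null" as strings
    )


# Made-up sample standing in for a real .train/.test file
sample = 'Melbourne\tB-LOC\n(\tO\n"\tO\n\nEl\tO\n'
df = read_conll(io.StringIO(sample))

# To answer the "convert to CSV" part: write the frame back out comma-separated.
# With a path argument, e.g. df.to_csv("train.csv", index=False), it writes a file.
csv_text = df.to_csv(index=False)
```

With a real file you would call read_conll("<filename>") instead of using the in-memory sample. If pandas raises a UnicodeDecodeError, passing an explicit encoding to pd.read_csv (for example encoding="latin-1") may be worth trying, though I have not checked which encoding this particular dataset uses.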