How to save the DistilBertTokenizer encodings after tokenizing my x_train values


I am using Transformers and DistilBERT for text classification. My dataset has 700,000 rows, so it is quite heavy. I am running my code on Google Colab. I used this code before building my model:

from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer

X = dfreadtrain['review_text'].values
y = dfreadtrain['rating'].values
# stratify on y (the snippet never defines train_y)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42, shuffle=True)
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(list(x_train), truncation=True, padding=True)
test_encodings = tokenizer(list(x_test), truncation=True, padding=True)
print(type(train_encodings))

This part took many hours to run, but as you know Google Colab terminates the session and I lose the results. Is it possible to save train_encodings and test_encodings to a file? They are <class 'transformers.tokenization_utils_base.BatchEncoding'> objects.
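For reference, a BatchEncoding behaves like a dict of plain Python lists here (no return_tensors argument is passed), so one possible approach is to pickle that data to a persistent location such as a mounted Google Drive folder and reload it in a later session instead of re-tokenizing. A rough sketch (the Drive path is only an example):

import os
import pickle
from transformers import BatchEncoding

# Example location on a mounted Google Drive so the files survive a session reset
# (mount first with: from google.colab import drive; drive.mount('/content/drive'))
ENC_DIR = "/content/drive/MyDrive/encodings"
os.makedirs(ENC_DIR, exist_ok=True)

# Save the underlying dicts (input_ids, attention_mask) of the two encodings
with open(os.path.join(ENC_DIR, "train_encodings.pkl"), "wb") as f:
    pickle.dump(dict(train_encodings), f)
with open(os.path.join(ENC_DIR, "test_encodings.pkl"), "wb") as f:
    pickle.dump(dict(test_encodings), f)

# The tokenizer itself can be saved and reloaded with the standard methods
tokenizer.save_pretrained(os.path.join(ENC_DIR, "tokenizer"))

# In a later session, reload instead of re-tokenizing
with open(os.path.join(ENC_DIR, "train_encodings.pkl"), "rb") as f:
    train_encodings = BatchEncoding(pickle.load(f))
with open(os.path.join(ENC_DIR, "test_encodings.pkl"), "rb") as f:
    test_encodings = BatchEncoding(pickle.load(f))
# tokenizer = DistilBertTokenizer.from_pretrained(os.path.join(ENC_DIR, "tokenizer"))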

Many thanks in advance.

There is 1 answer:

Answer from TF_Chinmay:

Colab has its limitations, so I used the following procedure in my recent topic-modelling project:

Set up TensorFlow in the cloud (I used GCP) -> install CUDA, cuDNN, etc.

Then use a terminal multiplexer so that your logs don't go away if the SSH connection breaks.

Refer to: https://www.tensorflow.org/install/pip

  1. Create VM

  2. SSH into it

  3. Install Python

  4. Install TensorFlow

  5. Install the NVIDIA driver

  6. Install the CUDA Toolkit (use your NVIDIA account / create a new account)

  7. Install cuDNN

  8. Set up SSH between your local machine and the VM

  9. Upload files.

Once done, use tmux as follows.

Open a terminal -> connect to the VM -> type tmux and press ENTER. Then run your code there.

Close the terminal -> it will ask whether to terminate -> click yes. The tmux session keeps running on the VM.

Next time you open a terminal -> list the tmux sessions with tmux ls -> reattach to your session with tmux a -t SessionName
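
Tying this back to the question: a minimal sketch of the kind of script you could leave running inside the tmux session, so the expensive tokenization runs once and its output lands on disk even if the SSH connection drops (MODEL_NAME and the input file name are assumptions; the rest follows the question's code):

# tokenize_reviews.py - run inside tmux, e.g. with: python tokenize_reviews.py
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer

MODEL_NAME = "distilbert-base-uncased"    # assumed checkpoint name
dfreadtrain = pd.read_csv("reviews.csv")  # placeholder for the 700k-row dataset

X = dfreadtrain["review_text"].values
y = dfreadtrain["rating"].values
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42, shuffle=True)

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(list(x_train), truncation=True, padding=True)
test_encodings = tokenizer(list(x_test), truncation=True, padding=True)

# Persist the results immediately so the long run is never lost
with open("train_encodings.pkl", "wb") as f:
    pickle.dump(dict(train_encodings), f)
with open("test_encodings.pkl", "wb") as f:
    pickle.dump(dict(test_encodings), f)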