I want to split the data into train_dataset and test_dataset variables. The function tokenize_and_split_data does not work because the utilities library is not defined. I am working in Python on Google Colab.
import datasets
import tempfile
import logging
import random
import config
import os
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines
#from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
logger = logging.getLogger(__name__)
global_config = None
model_name = "EleutherAI/pythia-70m"
training_config = {
    "model": {
        "pretrained_name": model_name,
        "max_length": 2048
    },
    "datasets": {
        "use_hf": use_hf,
        "path": dataset_path
    },
    "verbose": True
}
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)
print(train_dataset)
print(test_dataset)
Above is the code. I cannot install the utilities library, and the function tokenize_and_split_data is not defined. Can you help me, please?
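Download "utilities.py" from here and place it in your Python folder named "...\Lib\site-packages". You can find this path by running "python -m site" in cmd (the verbose flag "python -v" also shows it in its import trace). Then restore the line "from utilities import *" that is commented out in your code, so that tokenize_and_split_data is defined.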
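If it helps, here is a minimal sketch for checking where the file should go and whether Python can see it. The helper names (site_packages_dir, is_importable, add_module_dir) are my own and not part of any library. It relies only on the standard library. On Google Colab it is usually enough to upload utilities.py to the notebook's working directory, or to add the folder that contains it to sys.path as shown here.

```python
import importlib.util
import sys
import sysconfig


def site_packages_dir():
    """Return the site-packages directory of the running interpreter."""
    return sysconfig.get_paths()["purelib"]


def is_importable(module_name):
    """Return True if `import module_name` would find a module on sys.path."""
    return importlib.util.find_spec(module_name) is not None


def add_module_dir(directory):
    """Put `directory` at the front of sys.path so modules saved there can be imported."""
    if directory not in sys.path:
        sys.path.insert(0, directory)


# Show where a downloaded utilities.py could be placed,
# and whether Python can already find it.
print("site-packages:", site_packages_dir())
print("utilities importable:", is_importable("utilities"))
```

If the last line prints False after you have saved the file, call add_module_dir with the folder that holds utilities.py (a hypothetical location of your choosing) and check again before running "from utilities import *".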