I am trying to use SentencePiece to tokenize a large number of source code files in several different languages.
import os
import sentencepiece as spm

# root_dir, output_dir, and tokenize_file are defined elsewhere in my script

# Train SentencePiece model
file_paths = []
for dir_name, _, file_list in os.walk(root_dir):
    for filename in file_list:
        if filename.endswith(('.cpp', '.php', '.cs', '.c', '.java')):
            file_path = os.path.join(dir_name, filename)
            file_paths.append(file_path)
            try:
                # Sanity check that every collected file can actually be opened
                with open(file_path, 'r') as file:
                    pass
            except IOError:
                print(f"Cannot open file: {file_path}")

# Attempted workaround: replace commas in the paths with underscores
file_paths = [file_path.replace(',', '_') for file_path in file_paths]

with open('file_names.txt', 'w') as file:
    for file_path in file_paths:
        file.write(file_path + '\n')

spm.SentencePieceTrainer.train(
    f'--input={",".join(file_paths)} --model_prefix=m --vocab_size=499 --model_type=bpe'
)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Tokenize files
for file_path in file_paths:
    tokens = tokenize_file(file_path, sp)
    output_filename = os.path.basename(file_path)
    with open(os.path.join(output_dir, output_filename), 'w', encoding='utf-8') as file:
        for token in tokens:
            file.write(token + '\n')
I get an error like this:
OSError: Not found: unknown field name "2\2061-v1.0.0
\src\Uninitialized_variable_Datatype_pointer_good.cpp,C:\Users\Andrew\Desktop\C++" in TrainerSpec.
What I gathered from this error is that there is some problem with how the file names are being parsed: a comma, or some other character sequence, in one of the paths seems to make the trainer split the --input value incorrectly.
I added a check to print the paths immediately before passing them to the trainer, and they are exactly correct; there is no sign of the mangled path that appears in the error. I also added the step above that replaces every comma in the paths with an underscore, but that didn't help (and in hindsight it only rewrites the strings in my Python list, not the files on disk).
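In case the single flag string itself is part of the problem, I understand the Python bindings also accept keyword arguments instead of a flag string; a minimal sketch of what I believe the equivalent call looks like (the values mirror my call above):

spm.SentencePieceTrainer.train(
    input=file_paths,   # same list of collected source files
    model_prefix='m',
    vocab_size=499,
    model_type='bpe',
)

Although, if a list passed to input is still comma-joined internally, I would expect it to hit the same parsing problem.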
I couldn't find much in the SentencePiece documentation about tokenizing source code files.
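Is the right workaround to bypass file paths entirely and stream the text into the trainer? Based on the sentence_iterator and model_writer options in the sentencepiece Python API, I'm imagining something like this (untested sketch):

import io
import sentencepiece as spm

def iter_lines(paths):
    # Yield every line of every collected source file
    for path in paths:
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            yield from f

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter_lines(file_paths),
    model_writer=model,  # trained model bytes are written here
    vocab_size=499,
    model_type='bpe',
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

Is that the intended way to train on files whose paths can contain commas or other special characters?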