SentencePiece tokenizer incorrectly concatenating input files


I am trying to use SentencePiece to tokenize a large number of source code files in several different languages.

    import os
    import sentencepiece as spm

    # Train SentencePiece model: collect every source file under root_dir
    file_paths = []
    for dir_name, _, file_list in os.walk(root_dir):
        for filename in file_list:
            if filename.endswith(('.cpp', '.php', '.cs', '.c', '.java')):
                file_path = os.path.join(dir_name, filename)
                # Sanity check that each file can actually be opened
                try:
                    with open(file_path, 'r') as file:
                        pass
                except IOError:
                    print(f"Cannot open file: {file_path}")
                file_paths.append(file_path)

    # Attempted workaround: replace commas in the paths with underscores
    file_paths = [file_path.replace(',', '_') for file_path in file_paths]

    # Log the final list of paths for inspection
    with open('file_names.txt', 'w') as file:
        for file_path in file_paths:
            file.write(file_path + '\n')

    spm.SentencePieceTrainer.train(
        f'--input={",".join(file_paths)} --model_prefix=m --vocab_size=499 --model_type=bpe'
    )

    sp = spm.SentencePieceProcessor()
    sp.load('m.model')

    # Tokenize each file, writing one token per line into output_dir
    for file_path in file_paths:
        tokens = tokenize_file(file_path, sp)
        output_filename = os.path.basename(file_path)
        with open(os.path.join(output_dir, output_filename), 'w', encoding='utf-8') as file:
            for token in tokens:
                file.write(token + '\n')
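
For reference, `tokenize_file` is a small helper that reads a file and encodes it with the loaded model, roughly along these lines (a sketch; assuming UTF-8 input):

    def tokenize_file(file_path, sp):
        # Read the whole source file and let SentencePiece split it into pieces
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text = f.read()
        return sp.encode_as_pieces(text)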

I get an error like this:

    OSError: Not found: unknown field name "2\2061-v1.0.0
    \src\Uninitialized_variable_Datatype_pointer_good.cpp,C:\Users\Andrew\Desktop\C++" in TrainerSpec.

What I gathered from this error is that something goes wrong while the file list is being parsed: a comma, or some other character sequence in the file names, causes the trainer to split the comma-joined list of paths in the wrong places.
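
To make that suspicion concrete: since I hand the trainer a single flag string, anything inside a path that looks like a separator could throw off the parse. A tiny illustration with made-up paths:

    # Made-up paths, just to illustrate the concern
    paths = [r'C:\demo\C++ project\a.cpp', r'C:\demo\v1,0\b.cpp']
    arg = f'--input={",".join(paths)} --model_prefix=m --vocab_size=499 --model_type=bpe'
    print(arg)
    # --input=C:\demo\C++ project\a.cpp,C:\demo\v1,0\b.cpp --model_prefix=m ...
    # The comma inside "v1,0" is indistinguishable from the path separators,
    # and the space inside "C++ project" splits the flag string itself.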

I added a check that prints the paths immediately before they are passed to the trainer, and they are exactly correct; there is no sign of the mangled path that appears in the error. I also added the step that replaces every comma in a path with an underscore, but that didn't help.
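
One thing I haven't tried yet is the keyword-argument form of the trainer, which, as far as I understand the Python bindings, avoids re-splitting a single flag string on spaces (a sketch; the bindings apparently still join list values with commas internally, so commas in paths may still need the underscore workaround):

    # Sketch: keyword arguments instead of one '--flag' string; `input`
    # accepts a list of filenames here, so there is no manual join and no
    # whitespace re-parsing of the whole command line.
    spm.SentencePieceTrainer.train(
        input=file_paths,
        model_prefix='m',
        vocab_size=499,
        model_type='bpe',
    )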

I couldn't find much in the SentencePiece documentation about tokenizing source code files.
