This is my first time posting here, so be gentle, please.
I have written the following code:
import pandas as pd
import spacy

df = pd.read_csv('../../../Data/conll2003.dev.conll', sep='\t', on_bad_lines='skip', header=None)
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1500000
## https://stackoverflow.com/questions/48169545/does-spacy-take-as-input-a-list-of-tokens
all_tokens = []
for token in df[0]:
    all_tokens.append(str(token))
string = ' '.join(all_tokens)
doc = nlp(string)
token_tuples = tuple(enumerate(doc))
outfile = open('./conll2003.dev.syntax_corrupt.conll', 'w')
i = 0 ## initiate by looking at the first token in the doc
for x, token in enumerate(df[0]):
    for num, tok in token_tuples[i:]: ## we add this step to ensure that the for loop always looks from the last token that was a match, since doc is longer
        ## than df[0], otherwise it would at some point start looking from earlier tokens since spacy has more tokens and if there is an accidental match, it
        ## would provide the wrong dep and head
        if token == tok.text:
            i = num ## get the number from the token tuples as new starting point
            outfile.write(str(df[0][x]) + '\t' + str(df[1][x]) + '\t' + str(df[2][x]) + '\t' + str(df[3][x]) + '\t' + str(tok.dep_) + '\t' + str(tok.head.text) + '\n')
            break
        else:
            outfile.write(str(df[0][x]) + '\t' + str(df[1][x]) + '\t' + str(df[2][x]) + '\t' + str(df[3][x]) + '\t' + 'no_dep' + '\t' + 'no_head' + '\n')
            break
outfile.close()
The code is supposed to take data from the CoNLL-2003 shared task on NER, join the individual tokens into one string (the data comes pre-tokenized), and feed that string into spaCy to make use of its dependency parsing. After that, I want to write the same lines that were in the original file plus two new columns containing the dependency relation and the respective head token.
spaCy obviously tokenizes the text differently from the pre-tokenized data, so I had to find a way to attribute the correct relation to the correct token, since len(doc) != len(df[0]).
It works fine if I do not include the else statement: it writes the correct relation with the token to the outfile. However, when I do include it, I would expect it to print one line with the values "no_dep" and "no_head" (for the token spaCy did not take into account) and then continue printing the tokens for which there is information on the dependency relations (because the break statement should break the loop, yeah?). But it does not. It writes "no_dep" and "no_head" for every following token instead of going back to writing the actual relations.
In other words:
inputfile (snippet):
LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O
West NNP B-NP B-MISC
outputfile without else statement:
LONDON NNP B-NP B-LOC nmod Simmons
West NNP B-NP B-MISC nmod Indian
what I want with the else statement:
LONDON NNP B-NP B-LOC nmod Simmons
1996-08-30 CD I-NP O no_dep no_head
West NNP B-NP B-MISC nmod Indian
what I get:
LONDON NNP B-NP B-LOC no_dep no_head
1996-08-30 CD I-NP O no_dep no_head
West NNP B-NP B-MISC no_dep no_head
(Note that the first line in the output file does have the correct dependency relation and head; the problem starts from the second line.)
Any ideas what it is that I'm doing wrong? Thanks!
You should preserve the original tokenization. To do this, manually create the Doc in order to skip the tokenizer in the pipeline:
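A minimal sketch of that idea, assuming spaCy v3. The helper name `parse_pretokenized` is hypothetical (it is not part of spaCy), and `nlp` stands for whatever pipeline you loaded. It builds a `Doc` directly from the existing tokens and then calls each pipeline component on it, so the tokenizer never runs and the doc ends up with exactly one token per input word. The demo uses `spacy.blank("en")` only so that it runs without a trained model; a blank pipeline has no parser, so it shows the alignment but not the dependency labels.

```python
import spacy
from spacy.tokens import Doc


def parse_pretokenized(nlp, words):
    """Hypothetical helper: run nlp's components on already-tokenized words."""
    # Build the Doc from the given tokens instead of from raw text,
    # which bypasses spaCy's own tokenizer entirely.
    doc = Doc(nlp.vocab, words=[str(w) for w in words])
    # Apply the remaining components (tagger, parser, ...) in pipeline order.
    for _name, component in nlp.pipeline:
        doc = component(doc)
    return doc


# Demo on a blank pipeline (no trained components, so no model is required).
demo_words = ["LONDON", "1996-08-30", "West"]
demo_doc = parse_pretokenized(spacy.blank("en"), demo_words)
# The doc has one token per input word, in the same order.
print(len(demo_doc) == len(demo_words))
```

With the setup from the question, something like `doc = parse_pretokenized(nlp, df[0])` should give a doc where `doc[x]` corresponds to row `x` of `df`. In that case `tok.dep_` and `tok.head.text` can be written out row by row, and the inner search loop with its `else` branch is no longer needed.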