Given a document string s of a certain length and a language mask l of the same length, I would like to process each part (span?) of the document with the corresponding spaCy language model.
Say, for example:
s = 'As one would say in German: Wie man auf englisch zu sagen pflegt'
l = ['en'] * 27 + ['de'] * 37
I would like to construct a document out of:

import spacy

nlp_de = spacy.load('de')
nlp_en = spacy.load('en')

# pull out the characters belonging to each language and run them
# through the matching pipeline
d_de = nlp_de(u"".join([c for i, c in enumerate(s) if l[i] == "de"]))
d_en = nlp_en(u"".join([c for i, c in enumerate(s) if l[i] == "en"]))
And now I would somehow have to glue those two parts back together. Unfortunately, a spaCy document holds a reference to its own vocabulary, so merging the two documents would be ambiguous.
How should I model my multi-language documents with spaCy?
Two thoughts regarding this:
If most of your text looks like your example, I would try to separate the text by language (for your example that would yield two spans, one per language) and process each one on its own (see the sketch below).
If that's not the case, I'm not sure whether spaCy has built-in support for code-switching; if it doesn't, you'll need to build your own models (or try to combine spaCy's existing ones), depending on your actual task.
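A minimal sketch of the first idea, assuming the mask marks contiguous runs of each language: group the characters of s into (language, text) spans with itertools.groupby, run each span through the matching pipeline, and keep the resulting Doc objects side by side instead of merging them. The split_spans helper and the nlp_models dict are just illustrative names, not spaCy API.

import itertools
import spacy

# one pipeline per language code that appears in the mask
nlp_models = {
    'de': spacy.load('de'),
    'en': spacy.load('en'),
}

def split_spans(s, l):
    """Yield (language, text) pairs for each contiguous run in the mask."""
    idx = 0
    for lang, group in itertools.groupby(l):
        length = sum(1 for _ in group)
        yield lang, s[idx:idx + length]
        idx += length

s = 'As one would say in German: Wie man auf englisch zu sagen pflegt'
l = ['en'] * 27 + ['de'] * 37

# one Doc per span, each parsed by the model for its language
docs = [(lang, nlp_models[lang](text.strip())) for lang, text in split_spans(s, l)]

Keeping the spans as separate Doc objects sidesteps the vocabulary problem: each Doc stays tied to the Vocab of the model that produced it, and the original order can still be reconstructed from the list.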