spaCy process document with multiple languages


Given a document string s of a certain length and a language mask l of the same length, I would like to process each part (span?) of the document with the corresponding spaCy language model.

Say, for example:

s = 'As one would say in German: Wie man auf englisch zu sagen pflegt'
l = ['en'] * 27 + ['de'] * 37

I would like to construct a document out of

import spacy
nlp_de = spacy.load('de')
nlp_en = spacy.load('en')

d_de = nlp_de(u"".join([c for i,c in enumerate(s) if l[i] == "de"]))
d_en = nlp_en(u"".join([c for i,c in enumerate(s) if l[i] == "en"]))

And now I would somehow have to glue those two parts together. Unfortunately, each spaCy document holds a reference to its own vocabulary, so merging them would be ambiguous.

How should I model my multi-language documents with spacy?

1 Answer

lazary

Two thoughts regarding this:

  1. Code-switching: the mixing of more than one language within (mainly, but not restricted to) spoken text. This is not exactly your example.
  2. Sentences like yours, which are more or less separable.

If most of your text is like your example, I would try to separate the text by language (for your example, that would yield two sentences, each processed on its own).
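The "separate and process each part" idea can be sketched with the language mask from the question (a minimal helper, not a spaCy API; the splitting logic and the name split_by_language are my own):

```python
from itertools import groupby

def split_by_language(s, mask):
    """Yield (lang, text) for each contiguous same-language run in the mask."""
    i = 0
    for lang, run in groupby(mask):
        n = sum(1 for _ in run)  # length of this run
        yield lang, s[i:i + n]
        i += n

s = 'As one would say in German: Wie man auf englisch zu sagen pflegt'
l = ['en'] * 27 + ['de'] * 37

chunks = list(split_by_language(s, l))
# [('en', 'As one would say in German:'),
#  ('de', ' Wie man auf englisch zu sagen pflegt')]
```

Each chunk can then be handed to the matching pipeline (e.g. nlp_en(text) or nlp_de(text)), keeping the results as separate Doc objects instead of trying to glue them into one.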

If it's the other case, I'm not sure whether spaCy has built-in support for code-switching; if not, you'll need to build your own models (or try to combine spaCy's existing ones), depending on your actual task.