In German, every job has a feminine and a masculine version. The feminine one is derived from the masculine one by adding an "-in" suffix. In the plural form, this turns into "-innen".
Example:
      | English          | German
------+------------------+-----------------------
masc. | teacher  doctor  | Lehrer      Arzt
fem.  | teacher  doctor  | Lehrerin    Ärztin
masc. | teachers doctors | Lehrer      Ärzte
fem.  | teachers doctors | Lehrerinnen Ärztinnen
Currently, I'm using NLTK's nltk.stem.snowball.GermanStemmer.
It returns these stems:
Lehrer -> lehr | Arzt -> arzt
Lehrerin -> lehrerin | Ärztin -> arztin
Lehrer -> lehr | Ärzte -> arzt
Lehrerinnen -> lehrerinn | Ärztinnen -> arztinn
Is there a way to make this stemmer return the same stems for all four versions, feminine and masculine ones? Alternatively, is there any other stemmer doing that?
Update
I ended up adding "-innen" and "-in" as the first entries in the step 1 suffix-tuple like so:
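from nltk.stem.snowball import GermanStemmer

stemmer = GermanStemmer()
# Prepend the feminine suffixes to the (name-mangled) step 1 suffix tuple
stemmer._GermanStemmer__step1_suffixes = ("innen", "in") + stemmer._GermanStemmer__step1_suffixes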
This way, all of the above words are stemmed to lehr and arzt, respectively. Also, all other "job forms" that I have tried so far are stemmed correctly, meaning the masculine and feminine forms have the same stem. And if the "job form" is derived from a verb, like Lehrer/in, it has the same stem as the verb.
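To illustrate why the new suffixes go in as the first entries, and why the attribute is spelled _GermanStemmer__step1_suffixes, here is a minimal sketch using a toy class. ToyStemmer is hypothetical and is not NLTK's implementation: it ignores the word regions and the other steps of the real algorithm, and it assumes only that suffixes are tried in tuple order and that matching stops at the first hit. The attribute spelling follows from Python's name mangling of class attributes that start with two underscores.

```python
class ToyStemmer:
    # A double leading underscore makes Python mangle the name:
    # outside the class it is reachable as _ToyStemmer__step1_suffixes.
    __step1_suffixes = ("ern", "em", "er", "en", "es", "e", "s")

    def stem(self, word: str) -> str:
        """Strip the first suffix in the tuple that matches, then stop."""
        word = word.lower()
        for suffix in self.__step1_suffixes:
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word


plain = ToyStemmer()

# Same trick as in the update: override the tuple on one instance,
# putting the new suffixes first so they are tried before "en".
prepended = ToyStemmer()
prepended._ToyStemmer__step1_suffixes = (
    ("innen", "in") + prepended._ToyStemmer__step1_suffixes
)

# Appending instead lets "en" match first, so "innen" is never reached.
appended = ToyStemmer()
appended._ToyStemmer__step1_suffixes = (
    appended._ToyStemmer__step1_suffixes + ("innen", "in")
)

for name, s in (("plain", plain), ("prepended", prepended), ("appended", appended)):
    print(name, s.stem("Lehrerinnen"), s.stem("Lehrerin"))
```

In this toy model only the prepended variant reduces both Lehrerin and Lehrerinnen to the same string. Because the override is set on a single instance, the class-level tuple and any other stemmer instances stay untouched.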
The German Snowball stemmer follows a three-step process:
1. "ern", "em", "er", "en", "es", "e", "s" suffixes
2. "est", "en", "er", "st" suffixes
3. "isch", "lich", "heit", "keit", "end", "ung", "ig", "ik" suffixes

Not knowing a lot about German grammar, it seems that
"in" would belong to the same class as the step 3 suffixes (these are referred to as "derivational suffixes" in the NLTK source). It would seem that adding "in" to this list of suffixes should force the Snowball stemmer to remove it, but there are two problems.

The first problem is that, from your examples, I see that "in" becomes "inn" when followed by "en". This could be worked around by adding both "in" and "inn" to the list of step 3 suffixes, but that doesn't solve the second problem.

Looking at the GermanStemmer.stem() source, each step will only remove a single suffix. Thus, if there is more than one derivational suffix (i.e. "in" plus any of the suffixes listed above), only one of them will be removed.

In such cases (and I don't know enough about German to know if this can actually happen), you'd need to manually edit GermanStemmer.stem() to add a fourth, "in"-removal step. This would also allow finer control in the case of plurals. But honestly, at that point it's probably better to just remove "in" ad hoc by wrapping your GermanStemmer.stem() call.

--Edit--
If you wanted to add "in" to one of the Snowball stemmer steps, you can do so by adding the one-element tuple ("in",) to the stemmer's _GermanStemmer__step3_suffixes attribute. Note the comma after "in"; the code will not work without it. You can also replace the 3 with whichever step you wish to modify. I'm not entirely sure why it's _GermanStemmer__step3_suffixes and not just __step3_suffixes, but I've verified that this works on Python 3.6.4 and NLTK 3.2.5.

I would not recommend this approach, though, as it will not properly deal with "innen". Also, since each step removes at most one suffix, it will not properly deal with words like Lehrerinnen, which have "en", "in", and "er" (step 3 doesn't check for "er"). I think your best bet is to just copy and paste the entirety of GermanStemmer (found via the source code link above; use Ctrl+F) and add a step 2.5 to stem() that checks for and removes "in"/"inn".
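As a rough illustration of the ad hoc wrapping idea mentioned before the edit, here is a minimal sketch. The names strip_feminine_suffix, stem_job_title, FEMININE_SUFFIXES and MIN_REMAINDER are all made up for this example, and the minimum-length guard is an assumption rather than anything taken from NLTK. The wrapper takes the stemming function as a parameter, so it does not depend on a particular stemmer.

```python
from typing import Callable

# Longest suffix first, so "innen" is tried before "in".
FEMININE_SUFFIXES = ("innen", "in")
# Assumed guard: leave the word alone if fewer than 3 letters would remain.
MIN_REMAINDER = 3


def strip_feminine_suffix(word: str) -> str:
    """Drop a trailing "innen" or "in" if enough of the word remains."""
    lowered = word.lower()
    for suffix in FEMININE_SUFFIXES:
        if lowered.endswith(suffix) and len(word) - len(suffix) >= MIN_REMAINDER:
            return word[: -len(suffix)]
    return word


def stem_job_title(word: str, stem: Callable[[str], str]) -> str:
    """Strip the feminine suffix first, then pass the rest to the stemmer."""
    return stem(strip_feminine_suffix(word))
```

With NLTK you would call it as stem_job_title("Lehrerinnen", GermanStemmer().stem), which hands Lehrer to the stemmer and so should give the same stem as the masculine form. Be aware that this blunt check also shortens unrelated words that happen to end in "in", such as Berlin or Termin, so it only makes sense if the input is known to consist of job titles.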