In German, every job has a feminine and a masculine version. The feminine one is derived from the masculine one by adding an "-in" suffix. In the plural form, this turns into "-innen".
Example:
      | English          | German
------+------------------+-----------------------
masc. | teacher doctor   | Lehrer      Arzt
fem.  | teacher doctor   | Lehrerin    Ärztin
masc. | teachers doctors | Lehrer      Ärzte
fem.  | teachers doctors | Lehrerinnen Ärztinnen
Currently, I'm using NLTK's nltk.stem.snowball.GermanStemmer.
It returns these stems:
Lehrer      -> lehr      | Arzt      -> arzt
Lehrerin    -> lehrerin  | Ärztin    -> arztin
Lehrer      -> lehr      | Ärzte     -> arzt
Lehrerinnen -> lehrerinn | Ärztinnen -> arztinn
Is there a way to make this stemmer return the same stems for all four versions, feminine and masculine ones? Alternatively, is there any other stemmer doing that?
Update
I ended up adding "-innen" and "-in" as the first entries in the step 1 suffix-tuple like so:
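from nltk.stem.snowball import GermanStemmer

stemmer = GermanStemmer()
# Put "innen" and "in" first so they are checked before the built-in step 1 suffixes.
stemmer._GermanStemmer__step1_suffixes = ("innen", "in") + stemmer._GermanStemmer__step1_suffixes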
This way, all of the above words are stemmed to lehr and arzt, respectively. All other job titles I have tried so far are also stemmed correctly, meaning the masculine and feminine forms share the same stem. And if the job title is derived from a verb, as with Lehrer/in, the forms have the same stem as the verb.
The German Snowball stemmer follows a three step process:

1. Remove `ern`, `em`, `er`, `en`, `es`, `e`, `s` suffixes
2. Remove `est`, `en`, `er`, `st` suffixes
3. Remove `isch`, `lich`, `heit`, `keit`, `end`, `ung`, `ig`, `ik` suffixes

Not knowing a lot about German grammar, it seems that `in` would belong to the same class as the step 3 suffixes (these are referred to as "derivational suffixes" in the NLTK source). It would seem that adding `in` to this list of suffixes should force the Snowball stemmer to remove it, but there are two problems.

The first problem is that, from your examples, `in` becomes `inn` when followed by `en`. This could be worked around by adding both `in` and `inn` to the list of step 3 suffixes, but that doesn't solve the second problem.

Looking at the `GermanStemmer.stem()` source, each step will only remove a single suffix. Thus, if there is more than one derivational suffix (i.e. `in` plus any of the suffixes listed above), only one will be removed.

In such cases (and I don't know enough about German to know if this can actually happen), you'd need to manually edit `GermanStemmer.stem()` to add a fourth "`in` removal" step. This would also allow finer control in the case of plurals. But honestly, at that point it's probably better to just remove `in` ad hoc by wrapping your `GermanStemmer.stem()` call.
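As a rough sketch of such a wrapper: the helper names `strip_feminine_suffix` and `stem_job_title` and the minimum-length guard are made up for illustration and are not part of NLTK. The idea is to strip `innen`/`in` before handing the word to whatever stem function you use, so the stemmer only ever sees the masculine form.

```python
from typing import Callable

FEMININE_SUFFIXES = ("innen", "in")  # longest first, so "innen" wins over "in"
MIN_REMAINDER = 3  # assumed guard so very short words are left alone


def strip_feminine_suffix(word: str) -> str:
    """Lowercase the word and drop a trailing "innen" or "in", if present."""
    lowered = word.lower()
    for suffix in FEMININE_SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_REMAINDER:
            return lowered[: -len(suffix)]
    return lowered


def stem_job_title(word: str, stem: Callable[[str], str]) -> str:
    """Strip the feminine suffix first, then apply the given stem function."""
    return stem(strip_feminine_suffix(word))


# The stripping step on its own, without any stemmer:
print(strip_feminine_suffix("Lehrerinnen"))  # lehrer
print(strip_feminine_suffix("Lehrerin"))     # lehrer

# With NLTK you would pass the real stemmer's method, e.g.
# stem_job_title("Lehrerinnen", GermanStemmer().stem)
```

Note that this strips blindly: a word like `Berlin` would be cut to `berl`, so it is probably only safe on input you already know to be job titles.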
--Edit--

If you wanted to add `in` to one of the Snowball stemmer steps, you can do so by extending the `_GermanStemmer__step3_suffixes` tuple on your stemmer instance with the one-element tuple `("in",)`. Note the comma after `"in"`. This will not work without it. You can also replace the `3` with whichever step you wish to modify. I'm not entirely sure why it's `_GermanStemmer__step3_suffixes` and not just `__step3_suffixes`, but I've verified that this works on Python 3.6.4 and NLTK 3.2.5.

I would not recommend this approach, though, as it will not properly deal with `innen`. Also, since each step removes a maximum of one suffix, it will not properly deal with words like `Lehrerinnen`, which have `en`, `in`, and `er` (step 3 doesn't check for `er`). I think your best bet is to just copy and paste the entirety of `GermanStemmer` (found in the source code link above; use `ctrl+f`) and add a step 2.5 to `stem()` that checks for and removes `in`/`inn`.
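On the attribute name: the `_GermanStemmer` prefix comes from Python's name mangling, which rewrites any `__name` attribute used inside a class body to `_ClassName__name`. The sketch below uses a hypothetical `ToyStemmer` class (a stand-in written for this example, not NLTK code) to show the mangled name, the one-element tuple, and the "one suffix per step" behaviour.

```python
class ToyStemmer:
    """Hypothetical stand-in for GermanStemmer; not NLTK code."""

    __step3_suffixes = ("heit", "keit", "ung")

    def stem(self, word: str) -> str:
        word = word.lower()
        # Inside the class body, self.__step3_suffixes is compiled to
        # self._ToyStemmer__step3_suffixes.
        for suffix in self.__step3_suffixes:
            if word.endswith(suffix):
                return word[: -len(suffix)]  # at most one suffix is removed
        return word


toy = ToyStemmer()
before = toy.stem("Lehrerin")

# From outside the class, the mangled name has to be written out in full.
# ("in",) is a one-element tuple; ("in") is just the string "in", and adding
# a string to a tuple raises a TypeError.
toy._ToyStemmer__step3_suffixes += ("in",)
after = toy.stem("Lehrerin")

print(before, after)  # lehrerin lehrer
```

The `+=` here creates a new tuple on the instance, so other `ToyStemmer` objects keep the original suffix list.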