I am using nltk.word_tokenize for the Dari language. The problem is that some single words are written with a space inside them. For example, the word "زنده گی", which means "life". There are many other words like this: for every word that ends with the character "ه" we have to put a space before the suffix, otherwise the letters join incorrectly, as in "زندهگی".
Can anyone help me, using [tag:regex] or any other way, so that the tokenizer does not split a word at that space when the first part ends with "ه" and the next part starts with "گ"?
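For example, a minimal script that shows the behaviour (assuming NLTK and its punkt tokenizer data are installed):

    from nltk.tokenize import word_tokenize

    text = "زنده گی"            # one word, but written with an ordinary space
    print(word_tokenize(text))  # ['زنده', 'گی'] -- the word is split in two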
To resolve this problem, in Persian we have a character called the zero-width non-joiner (نیم‌فاصله in Persian, also called "half space" or "semi space"). It has two symbol codes: one is standard (U+200C) and the other is non-standard but widely used.
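A minimal sketch of that idea, assuming the standard code point U+200C: replacing the plain space with the zero-width non-joiner keeps the two parts visually separated while the tokenizer sees a single word.

    from nltk.tokenize import word_tokenize

    ZWNJ = "\u200c"                       # standard zero-width non-joiner (half space)
    fixed = "زنده گی".replace(" ", ZWNJ)  # becomes زنده‌گی
    print(word_tokenize(fixed))           # a single token now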
As far as I know, Dari is very similar to Persian. So first of all you should correct all words like زنده گی to زنده‌گی, converting every wrong space to a half space. Then you can simply use a regex to match all the words of a sentence; see the online demo (the black bullet in the test string is the half space, which regex101 does not display, but if you check the match information and look at Match 5 you will see that it is matched correctly). A sketch of such a pattern is shown below.
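The exact pattern is only in the demo link, but as a hedged sketch, a pattern along these lines treats the half space as part of a word, so parts joined with U+200C are matched as one unit:

    import re

    sentence = "زنده\u200cگی می\u200cشود"   # sample text that already uses half spaces
    # \w matches Unicode letters and digits; adding \u200c keeps the
    # half space inside a word instead of splitting the match there.
    print(re.findall(r"[\w\u200c]+", sentence))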
For converting the wrong spaces of a large text to half spaces there is an add-in for Microsoft Word called Virastyar, which is free and open source. You can install it and refine your whole text. But keep in mind that this add-in was created for Persian, not Dari. For example, in Persian we write زنده‌گی as زندگی, so it cannot correct that particular word for you. But other words, like می شود, are easily corrected and converted to می‌شود. You can also add custom words to its database.
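Virastyar itself is a Word add-in, but the same idea can be sketched in a few lines of rule-based Python. This is only an illustration (not Virastyar's actual logic), using the two correction rules from the examples above:

    import re

    ZWNJ = "\u200c"

    def to_half_space(text):
        # "می شود" -> "می‌شود": half space after the prefix "می"
        text = re.sub(r"\bمی (?=\S)", "می" + ZWNJ, text)
        # "زنده گی" -> "زنده‌گی": half space between a final "ه" and the suffix "گی"
        text = re.sub(r"ه گی\b", "ه" + ZWNJ + "گی", text)
        return text

    print(to_half_space("زنده گی می شود"))

A real corrector needs a word list to avoid false positives, which is why being able to add custom words to the database matters.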