How to split Arabic text programmatically while keeping pause marks attached to words


I tried to split an Arabic string wherever there is a space between words. However, the pause mark gets split off as a separate token, even though it appears attached to a word with no space in between. For example, "ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ" contains two pause marks: one attached to فِيهِ and the other to رَيْبَ.

Code: ("ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ").split(" ")

However, when I tried to split, it gave me a result like this. [ذَٰلِكَ, الْكِتَابُ, لَا, رَيْبَ, ۛ, فِيهِ, ۛ, هُدًى, لِّلْمُتَّقِينَ]

I expect a result like this:

[ذَٰلِكَ, الْكِتَابُ, لَا, رَيْبَ ۛ, فِيهِ ۛ, هُدًى, لِّلْمُتَّقِينَ]


There is 1 answer

Andj On

The pause mark (U+06DB) is a non-spacing mark, and in this text it uses a space character as its base, so splitting on spaces separates it from the preceding word. The best way to handle this is a regex pattern that optionally includes the space and the pause mark in question.
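
To see why the split behaves this way, the two characters involved can be inspected with the standard unicodedata module. This is only an illustrative check, assuming the text encodes each pause mark as a space followed by U+06DB (the code points used in the pattern in this answer); the helper name describe is made up for the example.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Assumption: each pause mark is stored as U+0020 followed by U+06DB.
for ch in "\u0020\u06DB":
    print(*describe(ch))
```

A general category of Mn means "non-spacing mark", while the space reports Zs ("space separator"), which is why str.split(" ") cuts between the word and the mark.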

I will use the regex module rather than re, since it has better Unicode support. The pattern is r'([\p{Block=Arabic}]+(?:\u0020\u06DB)?)'. It matches a run of characters from the Arabic block, optionally followed by \u0020\u06DB (a space and the pause mark).

Then use regex.findall() to get all the matches.

You may need to extend the [\p{Block=Arabic}]+ part of the pattern with additional blocks to cover Quranic text.

Additionally, the second component of the pattern could be expanded to include other pause marks.

import regex  # third-party module (pip install regex), not the standard-library re
text = 'ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ'
# A run of Arabic-block characters, optionally followed by space + U+06DB
pattern = r'([\p{Block=Arabic}]+(?:\u0020\u06DB)?)'
results = regex.findall(pattern, text)
print(results)
# ['ذَٰلِكَ', 'ٱلْكِتَـٰبُ', 'لَا', 'رَيْبَ ۛ', 'فِيهِ ۛ', 'هُدًى', 'لِّلْمُتَّقِينَ']
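
If installing the third-party regex module is not an option, a similar result can be sketched with the standard re module by splitting only on spaces that are not followed by a pause mark. The function name split_keeping_pause_marks and the range U+06D6 to U+06DC (the small high Quranic annotation signs) are assumptions for illustration; adjust the range to the marks your text actually uses.

```python
import re

# Assumption: the pause marks of interest fall in U+06D6..U+06DC.
PAUSE_MARKS = "\u06D6-\u06DC"

# A space separates tokens only when it is NOT followed by a pause mark.
_SPLIT_RE = re.compile(rf" (?![{PAUSE_MARKS}])")

def split_keeping_pause_marks(text):
    """Split on spaces, keeping 'space + pause mark' attached to the previous word."""
    return _SPLIT_RE.split(text)

text = 'ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ'
print(split_keeping_pause_marks(text))
```

Unlike the findall() approach, this only decides where to cut, so any non-Arabic tokens in the input are kept rather than skipped.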