How to split Arabic Text properly with the Pause Mark programmatically

Question

164 views · Asked by Tarikul Islam Tuhin on 02 November 2023 at 14:27

I tried to split an Arabic string on the spaces between words. However, the pause mark is split off as a separate item, even though there appears to be no space between the word and the pause mark. For example, "ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ" contains two pause marks: one attached to فِيهِ and the other to رَيْبَ.

Code: ("ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ").split(" ")

However, the split gave me this result: [ذَٰلِكَ, الْكِتَابُ, لَا, رَيْبَ, ۛ, فِيهِ, ۛ, هُدًى, لِّلْمُتَّقِينَ]

I expect a result like this:

[ذَٰلِكَ, الْكِتَابُ, لَا, رَيْبَ ۛ, فِيهِ ۛ, هُدًى, لِّلْمُتَّقِينَ]

There is 1 answer

**Andj** · Answer 1 · 2023-11-04T17:51:21+00:00

The annotation is a non-spacing mark using a space character as the base. The best way to handle this is to use a regex pattern that optionally includes the space and the pause mark in question.
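To see why a plain split on spaces separates the mark from its word, it can help to inspect the characters involved. The following is a minimal sketch using the standard-library unicodedata module; the helper name describe and the escaped sample fragment are illustrative assumptions, not part of the original answer.

```python
import unicodedata

def describe(ch):
    """Return the Unicode name and general category of one character."""
    return unicodedata.name(ch), unicodedata.category(ch)

# Hypothetical sample: a word followed by a space and the pause mark,
# written with escapes so the code points are explicit.
fragment = "\u0631\u064E\u064A\u0652\u0628\u064E\u0020\u06DB"

# Look at the last two characters: the base (a space) and the mark itself.
for ch in fragment[-2:]:
    print(f"U+{ord(ch):04X}", *describe(ch))
# U+0020 SPACE Zs
# U+06DB ARABIC SMALL HIGH THREE DOTS Mn
```

The category Mn marks U+06DB as a non-spacing mark, and the character before it is an ordinary space, so split(" ") cuts between the word and the mark.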

I will use the third-party regex module rather than re, since it has better Unicode support (re does not support \p{...} property classes). I will use the pattern r'([\p{Block=Arabic}]+(?:\u0020\u06DB)?)'. This looks for a run of characters from the Arabic block, optionally followed by \u0020\u06DB (a space and the pause mark).

Then use regex.findall() to get all the matches.

You may need to add further Unicode blocks to the [\p{Block=Arabic}]+ part of the pattern to cover Quranic text.

Additionally, the second component of the pattern could be expanded to include other pause marks.

```python
import regex  # third-party module: pip install regex

text = 'ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ'
# A run of Arabic-block characters, optionally followed by space + U+06DB.
pattern = r'([\p{Block=Arabic}]+(?:\u0020\u06DB)?)'
results = regex.findall(pattern, text)
print(results)
# ['ذَٰلِكَ', 'ٱلْكِتَـٰبُ', 'لَا', 'رَيْبَ ۛ', 'فِيهِ ۛ', 'هُدًى', 'لِّلْمُتَّقِينَ']
```
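If installing regex is not an option, a similar result may be possible with the standard-library re module by splitting on a space only when it is not followed by a pause mark. This is a different technique from the findall() approach above, offered as a sketch. The function name split_keeping_pause_marks is hypothetical, and treating the range U+06D6 to U+06DC (the Quranic small high annotation signs) as the set of pause marks is an assumption that should be adjusted to the marks actually present in your text.

```python
import re

# Assumption: U+06D6..U+06DC stands in for "the pause marks"; widen or
# narrow this range to match the marks used in your text.
# The negative lookahead (?!...) rejects a space that precedes such a mark.
SPLIT_PATTERN = re.compile(r"\u0020(?![\u06D6-\u06DC])")

def split_keeping_pause_marks(text):
    """Split on U+0020, keeping 'space + pause mark' attached to the word before it."""
    return SPLIT_PATTERN.split(text)

text = 'ذَٰلِكَ ٱلْكِتَـٰبُ لَا رَيْبَ ۛ فِيهِ ۛ هُدًى لِّلْمُتَّقِينَ'
for token in split_keeping_pause_marks(text):
    print(token)
```

Unlike the block-based pattern, this sketch does not restrict tokens to Arabic characters, so any non-Arabic text between spaces is kept as a token as well. It also assumes the words are separated by single U+0020 spaces, as in the sample text.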
