Get Python characters from Asian text


Hello everyone. I have a problem with the word "बन्दूक": a notepad's character count shows 3 characters, but the code "_charaters = list(line)" gives 6 characters. How can I get only the 3 characters?

Example:

  • न्दू

There are 2 answers

Answer by JosefZ:

Maybe you are looking for the pyuegc module:

An implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation”

Example (partially commented, string "बन्दूक" hard-coded):

from pyuegc import EGC

def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""


unistr = "बन्दूक"
egcs = EGC(unistr)
print(_output(unistr, egcs))

# The code above is essentially copied from https://pypi.org/project/pyuegc/
# The code below gives deeper insight into the EGC results

import json
print('\n' + json.dumps(unistr))
print(json.dumps(egcs) + '\n')

import unicodedata
for egc in egcs:
    print(f'\nEGC  {egc}  {json.dumps(egc)}')
    # List every code point that makes up this cluster
    for uch in egc:
        print(f'char {uch}  {json.dumps(uch)}  {unicodedata.name(uch, "???")}')

Result: .\SO\78102711.py

# String: बन्दूक
# Length of string: 6
# EGC: ['ब', 'न्दू', 'क']
# Length of EGC: 3


"\u092c\u0928\u094d\u0926\u0942\u0915"
["\u092c", "\u0928\u094d\u0926\u0942", "\u0915"]


EGC  ब  "\u092c"
char ब  "\u092c"  DEVANAGARI LETTER BA

EGC  न्दू  "\u0928\u094d\u0926\u0942"
char न  "\u0928"  DEVANAGARI LETTER NA
char ्  "\u094d"  DEVANAGARI SIGN VIRAMA
char द  "\u0926"  DEVANAGARI LETTER DA
char ू  "\u0942"  DEVANAGARI VOWEL SIGN UU

EGC  क  "\u0915"
char क  "\u0915"  DEVANAGARI LETTER KA
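
If installing a module is not an option, the grouping shown in the output above can be roughly imitated with the standard library alone. This is only a sketch under simplifying assumptions, not an implementation of UAX #29: the function name rough_clusters is made up for illustration, it attaches every combining mark (general category M*) to the preceding cluster, and it treats only the Devanagari virama U+094D as a conjunct joiner. It will get other scripts, emoji sequences and many edge cases wrong, so pyuegc remains the better choice for real text.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (assumption: the only joiner handled)

def rough_clusters(text):
    """Group code points into approximate user-perceived characters.

    Hypothetical helper: a simplification, not the UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        # Combining marks (Mn, Mc, Me) belong to the preceding base character
        is_mark = unicodedata.category(ch).startswith("M")
        # A letter right after a virama continues the conjunct
        joins_conjunct = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "\u092c\u0928\u094d\u0926\u0942\u0915"  # the word from the question
print(len(word), len(rough_clusters(word)))
```

For the word in the question this prints 6 and 3, matching the code point count and the EGC count above.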

Answer by Andj:

An alternative approach is to use PyICU and segment the text into graphemes with a break iterator. ICU4C provides grapheme (character), word and sentence break iterators for a range of locales.

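import icu

def get_boundaries(loc, s):
    # Character (grapheme cluster) break iterator for the given locale
    bi = icu.BreakIterator.createCharacterInstance(loc)
    bi.setText(s)
    # Iterating yields the boundaries that follow the start of the text
    boundaries = [*bi]
    # Prepend the start so that consecutive pairs delimit each cluster
    boundaries.insert(0, 0)
    return boundaries

def get_graphemes(loc, text):
    boundary_indices = get_boundaries(loc, text)
    # Slice the text between each pair of consecutive boundaries
    return [text[start:end] for start, end in zip(boundary_indices, boundary_indices[1:])]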

print(get_graphemes(icu.Locale('hi'), "बन्दूक"))
# ['ब', 'न्दू', 'क']
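
The slicing step does not depend on ICU, so it can be tried on its own. The sketch below uses a hypothetical helper name, slice_at, and hard-codes the boundary list [0, 1, 5, 6] that corresponds to the three clusters printed above; it assumes the boundaries are valid indices into the Python string. As far as I know, ICU reports offsets in UTF-16 code units, so that assumption holds for this word (all of its characters are in the Basic Multilingual Plane) but may need care for text containing characters outside it.

```python
def slice_at(text, boundaries):
    """Cut text into pieces between consecutive boundary indices.

    Hypothetical helper; boundaries must start at 0 and end at len(text).
    """
    return [text[start:end] for start, end in zip(boundaries, boundaries[1:])]

word = "\u092c\u0928\u094d\u0926\u0942\u0915"  # the word from the question
# Assumed boundaries, matching the three clusters shown above
pieces = slice_at(word, [0, 1, 5, 6])
print(pieces, len(pieces))
```

Pairing each boundary with its successor via zip avoids the manual index arithmetic and yields one slice per grapheme.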