Trying to pull back ONLY foreign script in a large file using Python 3

Question

51 views Asked by user23311916 At 28 January 2024 at 20:52

I have a very large file with intermingled English and foreign-language material (mostly Arabic script and some other languages). I'm not concerned with detecting or identifying the language. I'm new to Python and struggling to write this code. I want my code to read the .txt file, ignore anything in English/Latin/ASCII-encodable characters, and write just the foreign-language material to a new .txt file.

So far, I have:

sourceDoc = open("test_doc.txt", 'r')
destDoc = open("test_doc_dest.txt", 'w')
for line in sourceDoc:
    try:
        line.isascii()
    except:
        destDoc.write(line)
sourceDoc.close()
destDoc.close()

One thing I know is probably wrong with my code is that I'm asking it to find foreign language material and write the whole line to the new file. I really just want to write the foreign language material to the new file and discard the rest of the line if it's in English/Latin characters.

I keep getting a traceback for line 3 in the code and for line 23 in the source document, which is where the first instance of foreign language material appears.

Original Q&A

There is 1 answer

**Andj** · Answer 1 · 2024-02-11T23:22:37+00:00

I'd be inclined to use a regex pattern that matches non-Latin characters. Such a pattern also matches characters whose script value is Common or Inherited, so punctuation and whitespace are kept.

Python's built-in re module doesn't support Unicode property notation (POSIX or Perl style), so I need to use the regex module rather than re.

Install module: pip install -U regex.

Both re and regex provide a fullmatch function (re.fullmatch and regex.fullmatch), which returns a match only if the whole string matches the pattern.
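To illustrate the difference between matching the whole string and matching anywhere in it, here is a minimal sketch. It uses only the standard-library re module, so the pattern is a stand-in (a run of non-ASCII characters) rather than a Unicode script property; the names NON_ASCII, arabic and mixed are made up for the demo.

```python
import re  # standard library; enough here because the pattern avoids \p{...}

# Stand-in pattern (an assumption for this demo): a run of non-ASCII characters.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

arabic = "\u0645\u0631\u062d\u0628\u0627"  # an Arabic word, written as escapes
mixed = "hello " + arabic

# fullmatch: the pattern must cover the entire string
whole_arabic = NON_ASCII.fullmatch(arabic) is not None
whole_mixed = NON_ASCII.fullmatch(mixed) is not None

# search: a match anywhere in the string is enough
any_mixed = NON_ASCII.search(mixed) is not None

print(whole_arabic, whole_mixed, any_mixed)  # prints: True False True
```

Only the purely Arabic string passes fullmatch, while search also accepts the mixed string. The same distinction applies to regex.fullmatch and regex.search.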

The code below can be written in different ways; the key points are 1) using the regex module instead of re; and 2) matching non-Latin text using \P{Latin}.

With regex imported as re, either re.fullmatch(r'\P{Latin}+', line) or re.match(r'^\P{Latin}+$', line) would be appropriate. Everything else is flexible, so write the code as it suits you.

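import regex as re  # third-party regex module, aliased so the calls read like re
# \P{Latin} matches any character whose script is not Latin
pattern = re.compile(r'\P{Latin}+')
with open("multi-lang.txt", "r", encoding="utf-8") as f:
    lines = (line.rstrip() for line in f)
    # keep only the lines made up entirely of non-Latin characters
    lines_subset = [line for line in lines if pattern.fullmatch(line)]
with open("multi-lang-out.txt", "w", encoding="utf-8") as o:
    o.write('\n'.join(lines_subset))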

If you only want to exclude pure ASCII lines (on the potentially mistaken assumption that only English is pure ASCII) and keep everything else, search for lines that contain at least one non-ASCII character:

import regex as re  # same third-party module as in the first snippet
p = r'[\P{Basic Latin}]'   # or r'[^[:ASCII:]]'
with open("multi-lang.txt", "r", encoding="utf-8") as f:
    lines = (line.rstrip() for line in f)
    # keep any line that contains at least one non-ASCII character
    lines_subset = [line for line in lines if re.search(p, line)]
with open("multi-lang-out.txt", "w", encoding="utf-8") as o:
    o.write('\n'.join(lines_subset))
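
Both snippets above keep or drop whole lines. The question also asks for the English part of a mixed line to be discarded. A minimal sketch of that follows, assuming "foreign" can be approximated as "non-ASCII" and using only the standard-library re module. This is a swapped-in simplification, not the \P{Latin} approach above, and foreign_fragments and extract_foreign are hypothetical helper names.

```python
import re  # standard library only

# Assumption: "foreign" means non-ASCII. A run is one or more non-ASCII
# characters, optionally continued across whitespace to further non-ASCII
# text, so multi-word phrases stay together.
FOREIGN_RUN = re.compile(r'[^\x00-\x7F]+(?:\s+[^\x00-\x7F]+)*')


def foreign_fragments(line):
    """Return the non-ASCII runs found in one line, in order."""
    return FOREIGN_RUN.findall(line)


def extract_foreign(src_path, dest_path):
    """Write only the non-ASCII fragments of src_path to dest_path."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for line in src:  # iterates line by line instead of loading the file
            fragments = foreign_fragments(line)
            if fragments:
                # fragments from the same source line share one output line
                dest.write(" ".join(fragments) + "\n")
```

It could be called as extract_foreign("test_doc.txt", "test_doc_dest.txt"), using the file names from the question. Under this assumption, ASCII digits and punctuation inside a foreign passage are dropped, and accented Latin letters such as é count as foreign.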
