I have a very large file with intermingled English and foreign language material (mostly Arabic script and some other languages). I'm not concerned with detecting or identifying the language. I'm new to Python and struggling to write this code. I want my code to read the .txt file, ignore anything in English/Latin/ascii-encodable characters, and print just the foreign language material to a new .txt file.
So far, I have:
    sourceDoc = open("test_doc.txt", 'r')
    destDoc = open("test_doc_dest.txt", 'w')
    for line in sourceDoc:
        try:
            line.isascii()
        except:
            destDoc.write(line)
    sourceDoc.close()
    destDoc.close()
One thing I know is probably wrong with my code is that I'm asking it to find foreign language material and write the whole line to the new file. I really just want to write the foreign language material to the new file and discard the rest of the line if it's in English/Latin characters.
I keep getting a traceback for line 3 in the code and for line 23 in the source document, which is where the first instance of foreign language material appears.
I'd be inclined to use a regex pattern that matches non-Latin characters. This also includes characters with a script value of Common and Inherited, i.e. it keeps punctuation and whitespace.

To use Unicode POSIX or Perl style notation, I need to use the regex module rather than the re module. Install the module:

    pip install -U regex

Both re and regex have a fullmatch method (re.fullmatch or regex.fullmatch), which only returns a match if the whole string matches the pattern.

The code can be written in different ways; the key points are 1) using the regex module instead of re, and 2) matching non-Latin text using \P{Latin}, which the standard re module does not support.

Both regex.fullmatch(r'\P{Latin}+', line) and regex.match(r'^\P{Latin}+$', line) would be appropriate. Everything else is fluid; code it as suits you.

If you want to exclude only pure ASCII lines (on the potentially mistaken assumption that only English is ASCII-only) and keep everything else, search for lines containing non-ASCII characters:
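A minimal sketch of that last option, using only the standard library. It assumes the source file is UTF-8 encoded. The names NON_ASCII, keep_non_ascii_lines and filter_file are made up for illustration and are not part of any library.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def keep_non_ascii_lines(lines):
    """Yield only the lines that contain at least one non-ASCII character."""
    for line in lines:
        if NON_ASCII.search(line):
            yield line


def filter_file(source_path, dest_path, encoding="utf-8"):
    """Copy the lines containing non-ASCII characters to dest_path."""
    # Assumption: both files are UTF-8. Passing the encoding explicitly
    # avoids depending on the platform's default encoding.
    with open(source_path, encoding=encoding) as src, \
         open(dest_path, "w", encoding=encoding) as dest:
        dest.writelines(keep_non_ascii_lines(src))
```

With the file names from the question, this would be called as filter_file("test_doc.txt", "test_doc_dest.txt"). Note that it keeps whole lines, so a line mixing English and Arabic is written out in full. For the script-based variant described above, the same structure should work with the regex module by testing each line with regex.fullmatch(r'\P{Latin}+', line) instead.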