Detecting corrupt characters in UTF-8 encoded text file

Question

Asked by user2056389 on 09 June 2015 at 17:30

I have a text file that was edited with the wrong character encoding, so some of its strings contain mojibake and corrupt characters when I open it as UTF-8. What scripting language would be the most efficient at detecting these corrupt characters? Perl is not an option. I am trying to scan the file with a script and output the line number, and possibly the offset, of each corrupted character. How do I go about this? I was thinking about using awk, but I don't know what regular expression to use to search for the corrupted characters. If I could be pointed in the right direction, that would be great.

More Comprehensive Input:

I want the script to tell me the line number that has the corrupted characters, which would be the fifth line in the above example. Also, the text file contains several languages: English, Chinese, French, Spanish, Russian, Portuguese, Turkish, French_Euro, German, Dutch, Flemish, Korean, and Portuguese_Moz. It also contains a few special characters such as #, !, and ***.

I used this if statement to get the above output:

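# Flag records whose first field contains a character outside the ASCII range
if ($1 ~ /[^\x00-\x7F]/) {
    print NR ":", $0 > "output.txt";   # line number, then the whole line
    count++;                           # running total of flagged lines
}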
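
Because the file legitimately contains Chinese, Russian, Korean and accented Latin text, a test for non-ASCII characters like the one above matches every such line, corrupt or not. A different check, sketched here in Python as an assumption (the question leaves the language open apart from Perl), is to decode each line strictly as UTF-8 and report the line number and byte offset of every byte sequence that fails to decode. The function name `find_invalid_utf8` and the sample bytes are hypothetical and only for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, byte_offset, bad_bytes) for each malformed UTF-8 sequence.

    Line numbers are 1-based; byte offsets are 0-based within the line.
    """
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        pos = 0
        while pos < len(line):
            try:
                # Strict decoding raises on the first malformed sequence.
                line[pos:].decode("utf-8")
                break  # the rest of this line is valid UTF-8
            except UnicodeDecodeError as err:
                # err.start / err.end are relative to the slice we decoded.
                start = pos + err.start
                end = pos + err.end
                yield line_no, start, line[start:end]
                pos = end  # resume scanning after the bad bytes


# Hypothetical sample: line 3 holds a stray 0xFF byte, line 4 a Latin-1 "é".
sample = "ok\ncaf\u00e9\n".encode("utf-8") + b"bad \xff byte\n" + b"caf\xe9\n"
for line_no, offset, bad in find_invalid_utf8(sample):
    print(f"{line_no}:{offset}: {bad!r}")
```

To scan a real file, read it in binary mode (for example `open(path, "rb").read()`) so that Python does not decode it first. This only catches malformed byte sequences; text that was mis-decoded and then saved again as valid UTF-8 passes this check.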

There is 1 answer

**Ed Morton** · Accepted Answer · 2015-06-10T18:52:59+00:00

This prints every line that contains at least one character outside the ASCII range, prefixed with its line number:

$ awk '/[^\x00-\x7F]/{ print NR ":", $0 }' file
1: Interruptor EC nÃ£o estÃ¡ em DESLOCAR
4: è¾…åŠ©é©¾é©¶å®¤é—¨å…³é—
5: Porte cab. aux. fermÃ©e
7: Ð”Ð²ÐµÑ€ÑŒ Ð°Ð¿Ð¿Ð°Ñ€Ð°Ñ‚Ð½Ð¾Ð¹ ÐºÐ°Ð¼ÐµÑ€Ñ‹ Ð·Ð°ÐºÑ€Ñ‹Ñ‚Ð°
13: é«˜åŽ‹ä¿æŠ¤æ‰‹æŸ„å‘ä¸‹
14: BarriÃ¨re descendue
16: ÐžÐ³Ñ€Ð°Ð½Ð¸Ñ‡. ÐŸÐ»Ð°Ð½ÐºÐ° Ð’Ð’Ðš Ð¾Ð¿ÑƒÑ‰.
19: Barra de separaÃ§Ã£o descida
22: DPæœªå¯åŠ¨
23: Puiss. rÃ©p. non activÃ©e
25: !!! Ð’Ð½ÐµÑˆÐ½ÑÑ Ð¼Ð¾Ñ‰Ð½Ð¾ÑÑ‚ÑŒ Ð½Ðµ Ð²ÐºÐ»ÑŽÑ‡ÐµÐ½Ð°
26: PotÃªncia Dist NÃ£o Ativada
28: PotÃªncia dist nÃ£o activada
31: æœºè½¦æœªç§»åŠ¨
33: Motor no se estÃ¡ moviendo
34: Ð›Ð¾ÐºÐ¾Ð¼Ð¾Ñ‚Ð¸Ð² Ð½ÐµÐ¿Ð¾Ð´Ð²Ð¸Ð¶ÐµÐ½
35: Auto NÃ£o se Movendo
37: A nÃ£o se move
40: æœºè½¦çŠ¶å†µå…è®¸è‡ªåŠ¨åœæœº
41: Conditions auto\npermettent arrÃªt auto
43: Ð£ÑÑ‚Ð°Ð½Ð¾Ð²ÐºÐ¸ Ð»Ð¾ÐºÐ¾Ð¼Ð¾Ñ‚Ð¸Ð²Ð°\nÐŸÑ€ÐµÐ´ÑƒÑÐ¼Ð°Ñ‚Ñ€Ð¸Ð²Ð°ÑŽÑ‚ Ð     °Ð²Ñ‚Ð¾Ð¼Ð°Ñ‚Ð¸Ñ‡ÐµÑÐºÑƒÑŽ Ð¾ÑÑ‚Ð°Ð½Ð¾Ð²ÐºÑƒ
44: CondiÃ§Ãµes da moto\nPermitem Auto Parada

Is that good enough? If not, please edit your question to show more comprehensive sample input, including cases for which the above does not work.
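
If the ASCII-range test turns out to be too broad, one possible refinement (not part of the answer above, and sketched in Python rather than awk as an assumption) is a round-trip heuristic for mojibake: a word is suspicious if it can be encoded back to Windows-1252 and the resulting bytes decode as valid UTF-8 into a different string. The names `looks_like_mojibake` and `find_mojibake`, the choice of `cp1252` as the wrong codec, and the sample text are all hypothetical.

```python
import re

# Words are runs of anything except space, tab, CR and LF. A no-break space
# (U+00A0) is kept inside a word because it can be part of a mojibake sequence.
TOKEN = re.compile(r"[^ \t\r\n]+")


def looks_like_mojibake(text: str, wrong_codec: str = "cp1252") -> bool:
    """Return True if `text` looks like UTF-8 bytes that were decoded as `wrong_codec`."""
    try:
        repaired = text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in the wrong codec, or the bytes are not UTF-8:
        # treat the text as not matching this particular corruption pattern.
        return False
    # Pure ASCII round-trips unchanged, so only a changed result is suspicious.
    return repaired != text


def find_mojibake(text: str, wrong_codec: str = "cp1252"):
    """Yield (line_number, column, word) for each suspicious word.

    Line numbers are 1-based; columns are 0-based character offsets.
    """
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in TOKEN.finditer(line):
            if looks_like_mojibake(match.group(), wrong_codec):
                yield line_no, match.start(), match.group()


# Hypothetical sample: lines 1 and 4 are correct text, lines 2 and 5 are not.
sample = "Barrière descendue\nPorte cab. aux. fermÃ©e\nplain ascii\nДверь\nnÃ£o estÃ¡"
for line_no, column, word in find_mojibake(sample):
    print(f"{line_no}:{column}: {word}")
```

This is a heuristic with known gaps. Windows-1252 leaves a few byte values undefined, so mojibake that passed through those bytes is missed, and a word that legitimately contains a sequence such as "Ã©" would be reported as a false positive. A different wrong codec (for example `latin-1`) can be passed as `wrong_codec` if that better matches how the file was mis-edited.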
