Detecting corrupt characters in a UTF-8 encoded text file


I have a text file that was edited with the wrong character encoding, so some of its strings contain mojibake and corrupt characters when I open it as UTF-8. Which scripting language would be the most efficient for detecting these corrupt characters? Perl is not an option. I want a script that scans the text file and outputs the line number, and possibly the offset, of each corrupted character. How do I go about this? I was thinking of using awk, but I don't know what regular expression to use to search for the corrupted characters. If someone could point me in the right direction, that would be great.

More Comprehensive Input:

I want the script to tell me the number of each line that has corrupted characters, which would be the fifth line in the above example. The text file also contains several languages: English, Chinese, French, Spanish, Russian, Portuguese, Turkish, French_Euro, German, Dutch, Flemish, Korean, and Portuguese_Moz. It has a few special characters as well, such as # and ! and ***.

I used this if statement to get the above output:

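# Note: $1 is only the first whitespace-separated field, not the whole line.
if ($1 ~ /[^\x00-\x7F]/) {
    # Write the line number and the full line to output.txt.
    print NR ":", $0 > "output.txt"
    count++
}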

There is 1 answer

Ed Morton (best answer)

This prints every line that contains a character outside the ASCII range:

$ awk '/[^\x00-\x7F]/{ print NR ":", $0 }' file
1: Interruptor EC não está em DESLOCAR
4: 辅助驾驶室门关闭
5: Porte cab. aux. fermée
7: Дверь аппаратной камеры закрыта
13: 高压ä¿æŠ¤æ‰‹æŸ„å‘下
14: Barrière descendue
16: Огранич. Планка ВВК опущ.
19: Barra de separação descida
22: DP未å¯åŠ¨
23: Puiss. rép. non activée
25: !!! ВнешнÑÑ Ð¼Ð¾Ñ‰Ð½Ð¾ÑÑ‚ÑŒ не включена
26: Potência Dist Não Ativada
28: Potência dist não activada
31: 机车未移动
33: Motor no se está moviendo
34: Локомотив неподвижен
35: Auto Não se Movendo
37: A não se move
40: 机车状况å…许自动åœæœº
41: Conditions auto\npermettent arrêt auto
43: УÑтановки локомотива\nПредуÑматривают Ð     °Ð²Ñ‚оматичеÑкую оÑтановку
44: Condições da moto\nPermitem Auto Parada

Is that good enough? If not, please edit your question to show more comprehensive sample input, including cases for which the above does not work.
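Since the awk command above flags every non-ASCII line, including valid French, Russian, and Chinese text, a narrower check may help. The following is a minimal sketch in Python rather than awk. It rests on two assumptions that are not from the question: that the corruption comes from UTF-8 bytes being misread as Windows-1252 (cp1252), and that a heuristic with some false positives and misses is acceptable. The names `SUSPECT` and `scan` are hypothetical. It reports lines that are not valid UTF-8 at all (with a byte offset) and lines where a character in the range U+00C2 to U+00F4 is directly followed by a character that cp1252 assigns to a byte between 0x80 and 0xBF (with a character offset).

```python
import re

# Characters that cp1252 assigns to bytes 0x80-0xBF, i.e. what UTF-8
# continuation bytes look like after a wrong cp1252 decode. Bytes that
# cp1252 leaves undefined are skipped via errors="ignore".
_CONT = bytes(range(0x80, 0xC0)).decode("cp1252", errors="ignore")

# Heuristic (an assumption, not a definitive test): a character in
# U+00C2..U+00F4 (a misread UTF-8 lead byte) directly followed by a
# misread continuation byte or a C1 control character.
SUSPECT = re.compile("[\u00c2-\u00f4][" + re.escape(_CONT) + "\u0080-\u009f]")


def scan(raw_lines):
    """Yield (line_number, offset, reason) for each suspicious line.

    raw_lines is an iterable of bytes objects, for example a file
    opened in "rb" mode. Line numbers start at 1, offsets at 0.
    """
    for lineno, raw in enumerate(raw_lines, 1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte.
            yield lineno, err.start, "invalid UTF-8"
            continue
        match = SUSPECT.search(text)
        if match:
            # match.start() is a character offset, not a byte offset.
            yield lineno, match.start(), "possible mojibake"
```

To scan a file, open it in binary mode and iterate over the results, for example `for lineno, offset, reason in scan(open("file", "rb")): print(lineno, offset, reason)`. Legitimate text can still trigger the pattern (an accented letter directly followed by a no-break space, for instance), and corrupted text whose bytes were dropped along the way can slip through, so the output is best treated as a list of candidate lines to review by hand.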