I have a text file that was edited with the wrong character encoding, so some of its strings contain mojibake and corrupt characters when I open it as UTF-8. Which scripting language would be most efficient at detecting these corrupt characters? Perl is not an option. Essentially, I am looking for a way to scan through the text file with a script and output the line number, and ideally the offset, wherever a corrupted character is found. How do I go about this? I was thinking about using AWK, but I don't know what regular expression to use to match the corrupted characters. If I could be pointed in the right direction, that would be great.
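To make the goal concrete, here is a rough, untested sketch of the kind of report I am after. It assumes GNU awk (gawk) run with -b (--characters-as-bytes) so the input is handled as raw bytes, and the script and file names in the comments are only placeholders. It walks each line byte by byte and prints the line number and 1-based byte offset of anything that is not structurally valid UTF-8; it does not bother rejecting overlong or surrogate encodings. I am hoping there is a simpler, regex-based way to get the same information.

# Save as, say, check_utf8.awk and run with:
#   gawk -b -f check_utf8.awk somefile.txt
# The -b flag makes gawk treat all data as single bytes, which the
# byte table built below depends on.

BEGIN {
    # Map each single-byte string to its numeric byte value (1-255).
    for (b = 1; b < 256; b++)
        _ord[sprintf("%c", b)] = b
}

# Value of the byte at 1-based position i of s, or -1 past the end of s.
function byte_at(s, i) {
    if (i > length(s))
        return -1
    return _ord[substr(s, i, 1)]
}

{
    n = length($0)          # byte count of the line (because of -b)
    i = 1
    while (i <= n) {
        c = byte_at($0, i)
        if (c < 0x80) {                        # plain ASCII byte
            len = 1
        } else if (c >= 0xC2 && c <= 0xDF) {   # lead byte of a 2-byte sequence
            len = 2
        } else if (c >= 0xE0 && c <= 0xEF) {   # lead byte of a 3-byte sequence
            len = 3
        } else if (c >= 0xF0 && c <= 0xF4) {   # lead byte of a 4-byte sequence
            len = 4
        } else {                               # C0, C1, F5-FF or a stray continuation byte
            printf "line %d, byte offset %d: invalid byte 0x%02X\n", NR, i, c
            i++
            continue
        }
        # Every byte after the lead byte must be a continuation byte (80-BF).
        ok = 1
        for (j = 1; j < len; j++) {
            cc = byte_at($0, i + j)
            if (cc < 0x80 || cc > 0xBF) {
                ok = 0
                break
            }
        }
        if (!ok)
            printf "line %d, byte offset %d: malformed sequence starting 0x%02X\n", NR, i, c
        i += (ok ? len : 1)
    }
}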
More Comprehensive Input:
I want the script to tell me the line number that has the corrupted characters, which would be the fifth line in the example above. Also, the text file contains several languages: English, Chinese, French, Spanish, Russian, Portuguese, Turkish, French_Euro, German, Dutch, Flemish, Korean, and Portuguese_Moz. There are also a few special characters such as #, !, and ***.
I used this if statement to get the above output:
if ($0 ~ /[^\x00-\x7F]/) {           # any byte outside the 7-bit ASCII range ($0 so the whole line is checked, not just the first field)
    print NR ":", $0 > "output.txt"; # record the line number and the offending line
    count++;
}
This finds all characters outside the ASCII range. Is that good enough? If not, please edit your question to show more comprehensive sample input, including cases for which the above does not work.
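One case where it may not be good enough: the file legitimately contains Chinese, Russian, Korean, and accented Latin text, so [^\x00-\x7F] will flag every valid non-ASCII character along with the corrupt ones. A stricter, untested variant, again assuming GNU awk run in byte mode and with the file name as a placeholder, reports only the UTF-8 encoding of the U+FFFD replacement character (bytes EF BF BD), which editors typically substitute for bytes they could not decode, plus bytes that can never occur in well-formed UTF-8 (C0, C1, F5-FF):

# Print file name, line number, and the line for every suspicious byte sequence.
LC_ALL=C gawk -b '
    /\xEF\xBF\xBD|[\xC0\xC1\xF5-\xFF]/ { print FILENAME ":" NR ": " $0 }
' somefile.txt

Note that mis-decoded text that still happens to be valid UTF-8 (for example "Ã©" where "é" was intended) would slip past both checks and still has to be spotted by eye.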