Issues going from UTF-8 to Windows Latin1 and back


I am working within a rather complicated system where data is entered by a respondent and stored in an XML file that uses UTF-8 encoding. That data is then uploaded to an Oracle database that uses latin1 encoding. As a result, strange characters such as upside-down question marks show up for the small percentage of our data (~0.01%) where respondents typed non-Latin characters. I then take this data and build NLP models using fasttext, which expects UTF-8 data.

Obviously there are plenty of issues here. My first question: what are the ramifications of feeding fasttext a file in latin1 when it is expecting UTF-8? For example, I might have a response that shows up in my database as:

HE'S

If I read this file in R but tell it the encoding is UTF-8, it then shows up as:

HE<U+0092>S

Is fasttext essentially doing the same thing? Will fasttext read it as one word with 11 characters?
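To make the byte-level question concrete, here is a minimal Python sketch (Python rather than R, purely for illustration). It assumes the database column really holds Windows-1252 bytes, so the apostrophe is stored as the single byte 0x92; the variable names are made up for this example. It only shows how that one byte decodes under three different assumptions, and it does not claim to reproduce how fasttext tokenizes such input.

```python
# Assumption: this is what the stored response looks like as raw bytes
# when the column is Windows-1252 and the apostrophe is a curly quote.
raw = b"HE\x92S"

# Decoded with Windows-1252, byte 0x92 is U+2019 (right single quotation mark).
as_cp1252 = raw.decode("cp1252")

# Decoded as ISO-8859-1, the same byte becomes the C1 control character U+0092.
as_latin1 = raw.decode("latin-1")

# Decoded strictly as UTF-8, the byte is invalid: 0x92 is a continuation
# byte that is not preceded by a lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A lenient UTF-8 reader substitutes U+FFFD (the replacement character).
as_utf8_lenient = raw.decode("utf-8", errors="replace")

print(repr(as_cp1252), repr(as_latin1), utf8_ok, repr(as_utf8_lenient))
```

In every case the decoded string has four characters, not eleven: `<U+0092>` is most likely just how R prints a single unprintable character, not literal text in the data.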

I am trying to convince my database administrators that it is worthwhile switching our Oracle database to UTF-8, but they see only 0.01% of characters showing up as upside-down question marks and think it isn't worth the risk. I guess my question is: the problem is more far-reaching than just upside-down question marks in this situation, right?

My second question: we no longer have the original UTF-8 XML files that were used to populate the database. Is it possible to convert the latin1 database we have back to UTF-8 without them?
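A small Python sketch of what "converting back" can and cannot do, under stated assumptions: `store_as_cp1252` is a hypothetical stand-in for the lossy load step (not Oracle's actual conversion logic), and it assumes unrepresentable characters are replaced with an inverted question mark, as described above. `recover_utf8` is likewise an illustrative helper.

```python
def store_as_cp1252(text: str) -> bytes:
    """Hypothetical lossy load: characters with no Windows-1252 byte
    are replaced by an inverted question mark (byte 0xBF)."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode("cp1252"))
        except UnicodeEncodeError:
            out.append(b"\xbf")
    return b"".join(out)


def recover_utf8(stored: bytes) -> bytes:
    """Re-encode stored Windows-1252 bytes as UTF-8."""
    return stored.decode("cp1252").encode("utf-8")


# Text that fits in Windows-1252 survives the round trip unchanged.
original = "HE\u2019S caf\u00e9"
round_tripped = recover_utf8(store_as_cp1252(original)).decode("utf-8")

# Text that does not fit collapses to the same bytes, so the original
# characters cannot be told apart afterwards.
lost_a = store_as_cp1252("\u6771\u4eac")  # two CJK characters
lost_b = store_as_cp1252("\u03a9\u03b1")  # two Greek letters

print(round_tripped == original, lost_a, lost_b)
```

If these assumptions hold, everything that was representable in the database's character set can be re-encoded as UTF-8 without the XML files, while the characters already stored as upside-down question marks cannot be reconstructed from the database alone.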
