Saving an HTML Blob file produces weird text inside


I have a comma-delimited file that I'm saving to a Blob. I'm using the latest Chromium-based Edge browser. This particular code (TypeScript) has not changed for many months. But I suddenly noticed that if I save the file with a particular datetime string in it, I see weird text in the output instead of the datetime string.

Here is the datetime string I'm saving (and fully expect to see in the saved file):

‎9‎/‎26‎/‎2020‎ ‎7‎:‎00‎:‎00‎ ‎AM

Here is the weird text that appears instead:

â€Ž9â€Ž/â€Ž26â€Ž/â€Ž2020â€Ž â€Ž7â€Ž:â€Ž00â€Ž:â€Ž00â€Ž â€ŽAM

Judging by the fact that I couldn't simply copy and paste this weird string into this edit window (it thinks I'm trying to paste an image), I'm guessing it is binary. That is probably a huge hint, but it's not ringing any bells for me.

So the question is: why is this binary when I'm certain I'm writing out a string?

After some digging around, I was able to determine that there seems to be an encoding issue. I'm still not sure why. In addition, upon closer inspection of the weird string, the date is actually in there. It just looks strange because each component is padded with the weird sequence "â€Ž".

Answer by Kaiido (best answer)

Your string is full of the invisible Unicode character 'LEFT-TO-RIGHT MARK' (U+200E). Replacing each one with a visible placeholder shows where they are:

const text = `‎9‎/‎26‎/‎2020‎ ‎7‎:‎00‎:‎00‎ ‎AM`;
console.log( text.replace( /\u200e/g, "[LTR]" ) );
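If the marks are not wanted in the saved file at all, the same regular-expression approach can remove them instead of labeling them. This is only a minimal sketch: `stripDirectionalMarks` is a hypothetical helper name, not something from the original code, and it assumes the marks carry no meaning for your data (it also drops RIGHT-TO-LEFT MARK, U+200F, which is an extra assumption).

```javascript
// Hypothetical helper (not part of the original code): removes
// LEFT-TO-RIGHT MARK (U+200E) and RIGHT-TO-LEFT MARK (U+200F).
function stripDirectionalMarks(input) {
  return input.replace(/[\u200e\u200f]/g, "");
}

// The same datetime string as in the question, with the marks written as escapes.
const marked = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const cleaned = stripDirectionalMarks(marked);

console.log(cleaned); // 9/26/2020 7:00:00 AM
```

With the marks gone, the string is plain ASCII, so it reads the same whether the file is opened as UTF-8 or as Windows-1252.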

Somehow, you are reading your file as Windows-1252. You don't say how you are reading it, so it's hard to tell what went wrong, but note that Windows-1252 is the default encoding when opening a text file directly in most browsers. When the reader finds the UTF-8 sequence 0xE2 0x80 0x8E, it decodes each byte as a separate Windows-1252 character (unlike the ASCII characters, which are encoded identically in both encodings), so this single character gets read as â€Ž:

const text = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const blob = new Blob( [ text ] ); // here 'text' is encoded as UTF-8
const reader = new FileReader();
reader.onload = (evt) => {
  console.log( reader.result );
  const OPs_result = "â€Ž9â€Ž/â€Ž26â€Ž/â€Ž2020â€Ž â€Ž7â€Ž:â€Ž00â€Ž:â€Ž00â€Ž â€ŽAM";
  console.log( "is same as OP's result?", OPs_result === reader.result );
};
reader.readAsText( blob, "Windows-1252" );
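The byte-level view of the same mismatch can be sketched without a Blob or FileReader, using `TextEncoder` and `TextDecoder`. This assumes an environment whose `TextDecoder` supports the "windows-1252" label (browsers do; Node.js needs a build with full ICU data).

```javascript
// TextEncoder always produces UTF-8, the same encoding Blob uses for strings.
const bytes = new TextEncoder().encode("\u200e");
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // e2 80 8e

// Decoding those three bytes as Windows-1252 yields three separate characters.
const misread = new TextDecoder("windows-1252").decode(bytes);
console.log(misread); // â€Ž

// Decoding them as UTF-8 gives back the single original character.
const reread = new TextDecoder("utf-8").decode(bytes);
console.log(reread === "\u200e"); // true
```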

However, reading this same file as UTF-8 would render these characters correctly:

const text = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const blob = new Blob( [ text ] ); // here 'text' is encoded as UTF-8
blob.text() // reads as UTF-8
  .then( console.log );

And if you want to help your browser to open this text file as UTF-8 instead of the default Windows-1252, you can prepend a BOM to this file, as demonstrated in this answer:

const text = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const without_BOM = new Blob( [ text ] );
const BOM = new Uint8Array([0xEF,0xBB,0xBF]);
const with_BOM = new Blob( [ BOM, text ] );

document.getElementById( "without_BOM" ).href = URL.createObjectURL( without_BOM );
document.getElementById( "with_BOM" ).href = URL.createObjectURL( with_BOM );

<!-- HTML for the two links used by the script above -->
<a id="without_BOM">Open the file without BOM</a><br>
<a id="with_BOM">Open the file with BOM</a>
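The BOM step can also be wrapped in a small function that returns the bytes directly, which makes it easy to check what actually ends up in the file. This is a rough sketch only: `encodeWithBom` and the sample CSV content are hypothetical, not taken from the original code.

```javascript
// Hypothetical helper: returns the UTF-8 bytes of `text`, preceded by a UTF-8 BOM.
function encodeWithBom(text) {
  const bom = Uint8Array.of(0xef, 0xbb, 0xbf);
  const body = new TextEncoder().encode(text);
  const out = new Uint8Array(bom.length + body.length);
  out.set(bom, 0);
  out.set(body, bom.length);
  return out;
}

// Made-up CSV content containing LEFT-TO-RIGHT MARKs.
const csvBytes = encodeWithBom("date\n\u200e9\u200e/\u200e26\u200e/\u200e2020\n");

// The first three bytes are the BOM.
const startsWithBom =
  csvBytes[0] === 0xef && csvBytes[1] === 0xbb && csvBytes[2] === 0xbf;
console.log(startsWithBom); // true

// A UTF-8 TextDecoder drops the BOM by default, so readers get the text back unchanged.
const decoded = new TextDecoder("utf-8").decode(csvBytes);
console.log(decoded.startsWith("date")); // true
```

In a browser, the resulting bytes could then be passed to `new Blob([csvBytes])` in place of the `[ BOM, text ]` pair used above.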

And if you wish to encode your csv files as Windows-1252, then you can check this answer.