I collected a bunch of tweets and printed them to the command line. Here is what I got:
The tweets are in different languages, so I suspect some of them are Arabic. Could control characters be responsible for this output? There are a few thousand lines that somehow get contracted into one, and as far as I can tell, the characters overlay each other.
What is going on?
How your data is interpreted when it is printed to a console depends on the system's default text encoding and locale.
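I'd rather look at the hex data you receive, e.g. 0x4142430d0a..., than at the text as rendered through Unicode, UTF, or whatever text encoding your system is using. The raw bytes would show whether control characters such as a carriage return (0x0d) are present.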
An introduction to the different text encodings can be found at http://en.wikipedia.org/wiki/Character_encoding
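As a rough sketch of how you could inspect the raw data instead of the rendered text, something like the following Python might help. It assumes the tweets are already available as Python strings and that UTF-8 is the encoding of interest. The names `hex_dump`, `find_control_chars`, and `sample` are made up for illustration and are not part of any Twitter library.

```python
import unicodedata


def hex_dump(text, encoding="utf-8"):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def find_control_chars(text):
    """List (index, code point, name) for control (Cc) and format (Cf) characters."""
    found = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in ("Cc", "Cf"):
            # ASCII control characters have no Unicode name, so supply a default.
            name = unicodedata.name(ch, "<unnamed control>")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


# Hypothetical sample: CR LF followed by a right-to-left override character.
sample = "ABC\r\n\u202eabc"
print(hex_dump(sample))
for hit in find_control_chars(sample):
    print(hit)
```

If the hits include bare carriage returns (U+000D without a following U+000A) or bidirectional format characters such as U+202E, those would be plausible candidates for lines overwriting each other on the console, although the actual cause depends on your data and terminal.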