I am currently running a honeypot to catch forum spammers, and I have been having problems storing non-Latin characters in my database. I have utf8_unicode_ci set at both the database and table level, and I use mysql_query("SET NAMES 'utf8'") to make sure the information is sent as UTF-8.
Information such as time is stored as INT; IP, username and the like are stored as VARCHAR and TEXT. The only differences with the spam data are that I run it through base64_encode(htmlspecialchars()) before inserting it, that the spam column is a MEDIUMBLOB, and that I use COMPRESS() in the query for that column.
With Latin characters it returns the correct data, but with non-Latin characters such as Russian and Thai it does not.
For example:
Уровня конечного начальники или не
Will return as:
Ð£Ñ€Ð¾Ð²Ð½Ñ ÐºÐ¾Ð½ÐµÑ‡Ð½Ð¾Ð³Ð¾ начальнÐ
or just diamonds with question marks in them.
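One way to narrow this down, as a rough sketch rather than code from the honeypot itself: base64 output is plain ASCII, so once the text has been through base64_encode() the database character set should not be able to alter it. The check below assumes the file is saved as UTF-8; the variable names and the sample string are made up for illustration. If the round trip succeeds, the damage is more likely happening before the data reaches this step.

```php
<?php
// Hypothetical check: $input, $stored and $restored are illustrative names.
// Assumes this file is saved as UTF-8.
$input = "Уровня <b>";

// Same transformation as described above, with the charset given explicitly.
$stored = base64_encode(htmlspecialchars($input, ENT_QUOTES, 'UTF-8'));

// base64 output should consist of ASCII letters, digits, '+', '/' and '=' only.
$isAscii = (preg_match('/^[A-Za-z0-9+\/=]*$/', $stored) === 1);

// Undo both steps and compare with the original bytes.
$restored = htmlspecialchars_decode(base64_decode($stored), ENT_QUOTES);

var_dump($isAscii);
var_dump($restored === $input);
```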
I managed to store this kind of information correctly years ago when I created a forum, but I cannot remember how I did it. I have been searching all day and have not found a solution that works for me.
Edit: Extra info, if it's any help.
- Apache/2.2.14 (Ubuntu)
- MySQL client version: 5.1.41
- PHP extension: php5-mysql
Turns out that the page that sends spam submissions from my domains to the main hub didn't have
header("Content-Type: text/html; charset=utf-8");
As a result, the data was being corrupted whenever a request was made to that page.
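For anyone seeing similar output, here is a minimal sketch of how this kind of garbling can arise. It assumes the UTF-8 bytes were read as Windows-1252 somewhere along the way and then re-encoded as UTF-8; which component actually did that here is not known, so treat it as an illustration only. It needs the mbstring extension and PHP 7 or later (for the "\u{...}" escapes), and the file must be saved as UTF-8. The variable names are made up for the example.

```php
<?php
// Illustration only: reproduces the style of corruption shown in the question.
// Assumes this file is saved as UTF-8 and mbstring is loaded.

// First four letters of the Russian example text.
$original = "Уров";

// Misread the UTF-8 bytes as Windows-1252 and convert them "to UTF-8" again.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Undo the mistake: turning the garbled text back into Windows-1252 bytes
// yields the original UTF-8 byte sequence.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

// The garbled form spelled out as code points: Ð £ Ñ € Ð ¾ Ð ²
$expected = "\u{00D0}\u{00A3}\u{00D1}\u{20AC}\u{00D0}\u{00BE}\u{00D0}\u{00B2}";

var_dump($garbled === $expected);
var_dump($repaired === $original);
```

Reversing the conversion like this only works when no bytes were dropped on the way. Declaring the charset on every page that handles the data, as with the header() call above, avoids the problem in the first place.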