I'm looking for a dataset with the Mandarin pronunciations of all Chinese characters in bopomofo and/or pinyin. It also needs to be open source so that I can copy it into my own code bases.
Where can I find Chinese character bopomofo/pinyin data?
2.2k views. Asked by Nathan Breit.
There are 2 answers
This is a bit of a late entry, but I was searching for the same thing last year and ended up compiling my own character/bopomofo database from a number of different data sets. I have put enough work into it to comfortably call it my own, so you should check it out! It's part of a RubyGem I made to sort by bopomofo (I had a system that would not let me change the database collation settings): https://github.com/nallan/a-b-chi
It sounds like you might be looking for the Unihan Database. The Unihan Database is maintained by the Unicode Consortium.
For an example, here is the data for 爱.
Here is the description of the organization and content of the Unihan Database. Be sure to read that to understand what the data is referring to.
If this is the information you want, you can download the ZIP archive that contains all this data.
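Once the archive is unpacked, the readings can be pulled out with a few lines of code. Here is a minimal Python sketch, assuming the tab-separated layout used by the Unihan data files (code point, field name, value, with `#` comment lines) and that the Mandarin readings live in the `kMandarin` field. The function name and the sample lines are just for illustration; check the field descriptions in the Unihan documentation before relying on it.

```python
def parse_mandarin_readings(lines):
    """Map each character to its list of kMandarin pinyin readings."""
    readings = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        codepoint, field, value = line.split("\t", 2)
        if field != "kMandarin":
            continue
        # "U+7231" -> the character itself.
        char = chr(int(codepoint[2:], 16))
        # A field may hold more than one space-separated reading.
        readings[char] = value.split()
    return readings


# Illustrative sample in the same layout as the data file.
sample = [
    "# Unihan sample (illustrative)",
    "U+7231\tkDefinition\tlove, be fond of, like",
    "U+7231\tkMandarin\tài",
    "U+611B\tkMandarin\tài",
]
print(parse_mandarin_readings(sample))
```

For the real data you would pass an open file instead of the sample list, e.g. `open("Unihan_Readings.txt", encoding="utf-8")` (the path depends on where you unpacked the archive).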
The Unihan Database doesn't have Bopomofo (Zhuyin) pronunciations, but it does have Pinyin readings. Converting from Pinyin to Zhuyin is a mechanical, table-driven mapping, and there are a lot of online tools that can do it for you.
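If you would rather do the conversion yourself, here is a rough Python sketch of the mapping for a single tone-marked pinyin syllable. It is not taken from any particular library: the function name and tables are my own, it assumes standard Hanyu Pinyin spelling, it puts the neutral-tone dot before the syllable (one common convention), and it does not handle erhua or odd interjection readings such as "ń" or "hm" (those raise `ValueError`).

```python
import unicodedata

# Combining marks that carry the tone after NFD decomposition.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}
ZHUYIN_TONES = {1: "", 2: "ˊ", 3: "ˇ", 4: "ˋ"}

# Two-letter initials come first so they match before z, c, s.
INITIALS = {
    "zh": "ㄓ", "ch": "ㄔ", "sh": "ㄕ",
    "b": "ㄅ", "p": "ㄆ", "m": "ㄇ", "f": "ㄈ",
    "d": "ㄉ", "t": "ㄊ", "n": "ㄋ", "l": "ㄌ",
    "g": "ㄍ", "k": "ㄎ", "h": "ㄏ",
    "j": "ㄐ", "q": "ㄑ", "x": "ㄒ",
    "r": "ㄖ", "z": "ㄗ", "c": "ㄘ", "s": "ㄙ",
}

# "v" stands in for ü inside this table.
FINALS = {
    "a": "ㄚ", "o": "ㄛ", "e": "ㄜ", "ai": "ㄞ", "ei": "ㄟ",
    "ao": "ㄠ", "ou": "ㄡ", "an": "ㄢ", "en": "ㄣ", "ang": "ㄤ",
    "eng": "ㄥ", "er": "ㄦ",
    "i": "ㄧ", "ia": "ㄧㄚ", "io": "ㄧㄛ", "ie": "ㄧㄝ", "iao": "ㄧㄠ",
    "iu": "ㄧㄡ", "iou": "ㄧㄡ", "ian": "ㄧㄢ", "in": "ㄧㄣ",
    "iang": "ㄧㄤ", "ing": "ㄧㄥ", "iong": "ㄩㄥ",
    "u": "ㄨ", "ua": "ㄨㄚ", "uo": "ㄨㄛ", "uai": "ㄨㄞ",
    "ui": "ㄨㄟ", "uei": "ㄨㄟ", "uan": "ㄨㄢ", "un": "ㄨㄣ",
    "uen": "ㄨㄣ", "uang": "ㄨㄤ", "ong": "ㄨㄥ", "ueng": "ㄨㄥ",
    "v": "ㄩ", "ve": "ㄩㄝ", "van": "ㄩㄢ", "vn": "ㄩㄣ",
}


def pinyin_to_zhuyin(syllable):
    """Convert one tone-marked pinyin syllable to Zhuyin."""
    # Split accents off the letters and record the tone (5 = neutral).
    tone, letters = 5, []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters)).replace("ü", "v")

    # Undo the y/w spelling rules to recover the underlying final.
    if base.startswith("yu"):
        initial, final = "", "v" + base[2:]
    elif base.startswith("yi"):
        initial, final = "", base[1:]
    elif base.startswith("y"):
        initial, final = "", "i" + base[1:]
    elif base == "wu":
        initial, final = "", "u"
    elif base.startswith("w"):
        initial, final = "", "u" + base[1:]
    else:
        initial = next((i for i in INITIALS if base.startswith(i)), "")
        final = base[len(initial):]
        # After j, q, x the written "u" is really ü.
        if initial in ("j", "q", "x") and final.startswith("u"):
            final = "v" + final[1:]

    if initial in ("zh", "ch", "sh", "r", "z", "c", "s") and final == "i":
        # zhi, chi, shi, ri, zi, ci, si: the initial stands alone.
        body = INITIALS[initial]
    elif final in FINALS:
        body = INITIALS.get(initial, "") + FINALS[final]
    else:
        raise ValueError(f"unsupported syllable: {syllable!r}")

    return "˙" + body if tone == 5 else body + ZHUYIN_TONES[tone]


for s in ["ài", "zhōng", "xué", "nǚ", "shì", "de"]:
    print(s, pinyin_to_zhuyin(s))
```

The whole thing is table lookups plus a handful of spelling rules (y/w at the start of a syllable, ü written as u after j/q/x, and the "empty" final in zhi/chi/shi/ri/zi/ci/si), which is why the conversion is straightforward to automate.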
As for licensing issues, the Unihan Database data files have a liberal copyright notice. So, you shouldn't run into any problems using that data in your own software.