Where can I find Chinese character bopomofo/pinyin data?


I'm looking for a dataset with the Mandarin pronunciations of all Chinese characters in bopomofo and/or pinyin. It also needs to be open source, so that I can copy it into my own code bases.


There are 2 answers

1
tsroten On BEST ANSWER

It sounds like you might be looking for the Unihan Database. The Unihan Database is maintained by the Unicode Consortium.

The Unihan database is the repository for the Unicode Consortium’s collective knowledge regarding the CJK Unified Ideographs contained in the Unicode Standard. It contains mapping data to allow conversion to and from other coded character sets and additional information to help implement support for the various languages which use the Han ideographic script.

For example, here is the data for 爱.

Here is the description of the organization and content of the Unihan Database. Be sure to read that to understand what the data is referring to.

If this is the information you want, you can download the ZIP archive that contains all this data.
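
Once the archive is unpacked, the pinyin readings can be pulled out with a few lines of code. The following is only a sketch: it assumes the readings live in a tab-separated text file (`Unihan_Readings.txt` in the releases I have seen) with lines of the form `U+XXXX<TAB>field<TAB>value`, and that the `kMandarin` field holds the pinyin. The function name and the sample lines are made up for illustration, so check the database description for the exact layout.

```python
def parse_mandarin_readings(lines):
    """Map each character to its list of kMandarin pinyin readings.

    Assumes the layout 'U+XXXX<TAB>field<TAB>value' with '#' comment lines.
    """
    readings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a data line in the assumed layout
        codepoint, field, value = parts
        if field != "kMandarin":
            continue  # keep only the Mandarin pinyin field
        char = chr(int(codepoint[2:], 16))  # "U+7231" -> "爱"
        readings[char] = value.split()      # a field may list several readings
    return readings


# Illustrative sample lines in the assumed layout (not copied from the file).
sample = [
    "# sample data",
    "U+7231\tkMandarin\tài",
    "U+7231\tkCantonese\toi3",
    "U+4E2D\tkMandarin\tzhōng",
]
print(parse_mandarin_readings(sample))
```

For the real file you would pass an open file object instead of `sample`, e.g. `open("Unihan_Readings.txt", encoding="utf-8")`.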

The Unihan Database doesn't have Bopomofo (Zhuyin) pronunciations, but it has Pinyin readings. Converting from Pinyin to Zhuyin is simple; there are a lot of online tools that can do it for you.
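
If you would rather do the conversion in your own code, a table lookup covers the regular syllables. This is a rough sketch, not a complete converter: it assumes tone-marked pinyin input (as in `ài`), treats unmarked syllables as neutral tone, and rejects anything outside its tables (for example syllabic nasals or erhua forms). All names here are my own.

```python
import unicodedata

# Combining tone marks (after NFD) -> Zhuyin tone marks; first tone is unmarked.
TONE_MARKS = {"\u0304": "", "\u0301": "ˊ", "\u030c": "ˇ", "\u0300": "ˋ"}

# Two-letter initials come first so "zh" is matched before "z".
INITIALS = {
    "zh": "ㄓ", "ch": "ㄔ", "sh": "ㄕ",
    "b": "ㄅ", "p": "ㄆ", "m": "ㄇ", "f": "ㄈ", "d": "ㄉ", "t": "ㄊ",
    "n": "ㄋ", "l": "ㄌ", "g": "ㄍ", "k": "ㄎ", "h": "ㄏ", "j": "ㄐ",
    "q": "ㄑ", "x": "ㄒ", "r": "ㄖ", "z": "ㄗ", "c": "ㄘ", "s": "ㄙ",
}
FINALS = {
    "a": "ㄚ", "o": "ㄛ", "e": "ㄜ", "ê": "ㄝ", "ai": "ㄞ", "ei": "ㄟ",
    "ao": "ㄠ", "ou": "ㄡ", "an": "ㄢ", "en": "ㄣ", "ang": "ㄤ",
    "eng": "ㄥ", "er": "ㄦ",
    "i": "ㄧ", "ia": "ㄧㄚ", "ie": "ㄧㄝ", "iao": "ㄧㄠ", "iu": "ㄧㄡ",
    "ian": "ㄧㄢ", "in": "ㄧㄣ", "iang": "ㄧㄤ", "ing": "ㄧㄥ",
    "u": "ㄨ", "ua": "ㄨㄚ", "uo": "ㄨㄛ", "uai": "ㄨㄞ", "ui": "ㄨㄟ",
    "uan": "ㄨㄢ", "un": "ㄨㄣ", "uang": "ㄨㄤ", "ong": "ㄨㄥ",
    "ü": "ㄩ", "üe": "ㄩㄝ", "üan": "ㄩㄢ", "ün": "ㄩㄣ", "iong": "ㄩㄥ",
}
# Syllables spelled with y/w have no initial; map them to the plain final.
ZERO_INITIAL = {
    "yi": "i", "ya": "ia", "ye": "ie", "yao": "iao", "you": "iu",
    "yan": "ian", "yin": "in", "yang": "iang", "ying": "ing",
    "wu": "u", "wa": "ua", "wo": "uo", "wai": "uai", "wei": "ui",
    "wan": "uan", "wen": "un", "wang": "uang", "weng": "ong",
    "yu": "ü", "yue": "üe", "yuan": "üan", "yun": "ün", "yong": "iong",
}
# In zhi, chi, shi, ri, zi, ci, si the "i" is not written in Zhuyin.
APICAL = {"zh", "ch", "sh", "r", "z", "c", "s"}


def split_tone(syllable):
    """Return (toneless syllable, Zhuyin tone mark or None if unmarked)."""
    tone = None
    kept = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # keeps the diaeresis of ü
    return unicodedata.normalize("NFC", "".join(kept)), tone


def pinyin_to_zhuyin(syllable):
    """Convert one tone-marked pinyin syllable to Zhuyin (sketch only)."""
    base, tone = split_tone(syllable)
    if base in ZERO_INITIAL:
        initial, final = "", ZERO_INITIAL[base]
    else:
        initial = next((i for i in INITIALS if base.startswith(i)), "")
        final = base[len(initial):]
    if initial in APICAL and final == "i":
        body = INITIALS[initial]
    else:
        if initial in {"j", "q", "x"} and final.startswith("u"):
            final = "ü" + final[1:]  # ju, qu, xu are written with u but mean ü
        if final not in FINALS:
            raise ValueError(f"unsupported syllable: {syllable!r}")
        body = INITIALS.get(initial, "") + FINALS[final]
    # The neutral-tone dot goes in front; other tone marks follow the syllable.
    return "˙" + body if tone is None else body + tone


for s in ["ài", "zhōng", "nǚ", "xué", "shì", "de"]:
    print(s, pinyin_to_zhuyin(s))
```

Placing the neutral-tone dot before the syllable is one common convention; move it to the end if your target format expects that.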

As for licensing issues, the Unihan Database data files have a liberal copyright notice. So, you shouldn't run into any problems using that data in your own software.

0
NallaN On

This is a bit of a late entry, but I was searching for the same thing last year and ended up compiling my own character/bopomofo database from a bunch of different data sets. I have put enough work into it to comfortably call it my own, so you should check it out! It's part of a RubyGem I made to sort by bopomofo (I had a system that would not let me change the database collation settings): https://github.com/nallan/a-b-chi