How can I convert all Japanese hiragana to katakana characters in Python?

From the hiragana and katakana charts, it looks like it should be possible to "normalize" Japanese text into either hiragana or katakana. It's pretty straightforward to build a table and implement a dictionary/regex table for search/replace. Does anyone know where this work has already been done?

3

There are 3 answers

2
diverscuba23 On BEST ANSWER

Why would you want to do this, though? Katakana is traditionally used for words borrowed from other languages, while hiragana is used for native Japanese words. By normalizing Japanese text to one form or the other you could actually make it harder to read (at least for me it would be harder, since I would be losing the context that the script carries).

But in answer to your question, this seems to do what you're asking: JCONV

3
John Machin On

You could do what you want to do very quickly using str.translate.
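
As a minimal sketch of the str.translate approach: the table below assumes the hiragana range U+3041 to U+3096 and relies on each katakana code point sitting 0x60 above its hiragana counterpart. The names HIRA_TO_KATA and hiragana_to_katakana are made up for illustration.

```python
# Translation table: hiragana code point -> katakana code point.
# Assumes the range U+3041..U+3096; katakana sits 0x60 higher (U+30A1..U+30F6).
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def hiragana_to_katakana(text: str) -> str:
    """Return text with hiragana replaced by katakana; other characters are kept."""
    return text.translate(HIRA_TO_KATA)


print(hiragana_to_katakana("ひらがな"))
print(hiragana_to_katakana("食べる"))  # the kanji is left untouched
```

str.translate accepts a dict keyed by code point, so characters outside the table (kanji, Latin letters, punctuation) pass through unchanged.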

However it is not readily apparent why you would want to do that.

What I would call normalising in a language written in a Latin-based alphabet would include lowercasing, normalising whitespace, and stripping accents etc so that the result was ASCII. The purpose of doing that would be not for display but for comparing user-entered text in some kind of fuzzy search/match/lookup scenario. The point being that mistakes of accent etc are quite common even with native writers of the languages in question.

Given the role that Hiragana plays in the Japanese writing system (words often have a Kanji stem and Hiragana suffixes) I can't imagine any use for changing Hiragana characters to Katakana ... please enlighten me.

0
SmoothKen On

Here is a way without loading an extra package. The commonly used hiragana occupy the Unicode code points U+3041 to U+3096, and the corresponding katakana occupy U+30A1 to U+30F6, in the same order.

Hence do the following:

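# Unicode code points: first and last hiragana, and first katakana
hira_start = 0x3041
hira_end = 0x3096
kata_start = 0x30A1

# Map each hiragana character to the katakana character at the same offset
hira_to_kata = dict()
for i in range(hira_start, hira_end + 1):
    hira_to_kata[chr(i)] = chr(i - hira_start + kata_start)
    print(chr(i), chr(i - hira_start + kata_start))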
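
To actually convert a string with such a dictionary, and to go the other way (katakana to hiragana), something like the following should work. This is only a sketch: the convert helper and the kata_to_hira name are not part of the answer above, and the mapping covers only the U+3041 to U+3096 range.

```python
hira_start, hira_end, kata_start = 0x3041, 0x3096, 0x30A1

# Same character-to-character mapping as above, built as a comprehension
hira_to_kata = {
    chr(i): chr(i - hira_start + kata_start)
    for i in range(hira_start, hira_end + 1)
}
# Inverting the mapping gives katakana -> hiragana
kata_to_hira = {kata: hira for hira, kata in hira_to_kata.items()}


def convert(text, mapping):
    # Characters missing from the mapping are passed through unchanged
    return "".join(mapping.get(ch, ch) for ch in text)


print(convert("ひらがな", hira_to_kata))
print(convert("コーヒー", kata_to_hira))
```

Note that the prolonged sound mark ー (U+30FC) lies outside the mapped range, so it is left as it is in the second call.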