How to convert from Java ASCII properties to UTF8 (Java 9) properties

I have an ASCII-encoded Java properties file containing Unicode escapes (\u0123) that I need to convert to the new Java 9 UTF-8 format. Control character escapes (\r, \n, ...) need to stay as they are, but e.g. \u00E4 should become ä (UTF-8 encoded).

The motivation to convert them to UTF-8 is that it simplifies the workflow with translators.

I've tried multiple options with iconv and uconv (from ICU) but was unable to get a good result. Asking ChatGPT also didn't yield a fully working solution.

This is not about troubles with UTF-8 Java properties files in editors or how to get UTF-8 properties files to work in Java pre-9.

There is 1 answer

Florian
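This needs GNU awk (gawk), because RT and strtonum are gawk extensions:

awk -v RS='\\\\u[0-9a-fA-F]{4}' '{ORS=""; print $0; if (RT != "") printf "%c",strtonum("0x"substr(RT,3)) } END {print ""}' messages.properties

The if (RT != "") guard matters for the last record, which is not followed by an escape: RT is empty there, and without the guard printf "%c" would be called with 0 and write a NUL byte at the end of the output.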

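RS='\\\\u[0-9a-fA-F]{4}' sets the record separator to a regex that matches Unicode escapes. The backslash is doubled twice because awk processes escape sequences once when it reads the -v assignment and once more when it interprets the value as a regex.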

The {...} block is then run for every record (a stretch of text without a Unicode escape, followed by one Unicode escape, which gawk stores in RT).

ORS="" sets the output record separator to the empty string, so print appends nothing (in particular no newline) after $0

print $0 prints the string before the record separator

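printf "%c",strtonum("0x"substr(RT,3)) first extracts the hex digits from the matched escape, e.g. 0123 from \u0123 (the substr part), then prepends "0x" so that strtonum interprets them as hexadecimal, and finally prints the resulting number as the character with that code (printf "%c"). gawk writes that character as UTF-8 only when it runs in a UTF-8 locale.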

END {print ""} runs once after the last record. Because ORS is empty at that point it prints nothing, so it is a no-op and can be left out.
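
For repeated use, the one-liner can be wrapped in a small shell function. The following is only a sketch under a few assumptions: the function name convert_escapes and the output file name are made up for illustration, gawk is installed under that name, and the shell runs in a UTF-8 locale. It skips the printf when RT is empty, so nothing extra is written after the final record. It does not recombine surrogate pairs such as \uD83D\uDE00, and it also converts escapes like \u0020 whose literal form may change how the line is parsed, so the result is worth checking on a copy first.

```shell
# convert_escapes is a hypothetical helper name, not part of any tool.
# Requires GNU awk: RT and strtonum are gawk extensions.
convert_escapes() {
  gawk -v RS='\\\\u[0-9a-fA-F]{4}' '
    {
      ORS = ""      # print must not append a newline after each record
      print $0      # text before the matched \uXXXX escape
      # RT holds the matched escape; it is empty after the last record.
      if (RT != "") printf "%c", strtonum("0x" substr(RT, 3))
    }' "$@"
}

# Usage sketch (file names are placeholders):
#   convert_escapes messages.properties > messages.utf8.properties
```

Because the function passes its arguments straight to gawk, it reads standard input when called without a file name, which makes it easy to try on a single line before converting a whole file.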