Is there known URI scheme or URN namespace for Unicode characters?

346 views Asked by At

I need to reference to a Unicode character with a URI. Following IANA references list multiple schemes and namespaces but do not mention anything about identifiers for the Unicode characters. Does anyone know if something like this exists already?

I hoped to find something like

  • unicode://U+0394
  • urn:unicode://0394
  • http://unicode.org/unicode/0394

for the greek capital letter delta Δ.

If someone wonders, this is for a semantic web like application that uses URIs as identifiers for concepts, including concepts of the Unicode characters.

3

There are 3 answers

0
Jukka K. Korpela On BEST ANSWER

I’m afraid there is no URL or URN for referring authoritative information on a Unicode character in general. In the Unicode Standard, information about individual characters is partly in the so-called character database (mostly plain text files in specific formats), partly in the Code Charts (PDF files). Neither of them offers a way to point at an individual character. Moreover, the information there is not exhaustive: there are important remarks on individual characters information scattered around the standard.

The Decodeunicode site has individually addressable items, such as

http://www.decodeunicode.org/en/u+0394

but its information content varies a lot and is generally very limited. It is not official, and it currently contains Unicode 5.0 only.

The Fileformat.info site is much more systematic, but it, too, is unofficial. It is basically limited to formal properties and data derivable from them, plus comments extracted from the Code Charts, plus instructions on typing the character in Windows, plus information about support in fonts—but that’s quite a lot! Example:

http://www.fileformat.info/info/unicode/char/0394/

0
IS4 On

Since this is also tagged , I will try to pick URIs that are easily (and permanently) dereferenceable and cannot be mistaken for a document describing that character: the data: scheme. Not only can that refer to a character in Unicode, but any encoding, and also any string thereof.

data:;charset=utf-8,%CE%94

Attempting to open this URI should result in a text/plain file with the single character as its content.

If the system accepts IRIs (as many semantic web applications do), the character can be included directly:

data:;charset=utf-8,Δ

This is mapped to the same URI as shown above, and your browser may convert it directly. Specifying UTF-8 is necessary in this case, since the mapping is not defined for other encodings.

0
global uuid database On

[ EDIT ] : found this URL matching your needs : http://unicode.org/cldr/utility/character.jsp?a=1F40F

.

Well, there is an URL referencing the authoritative information on the Unicode database, even though it does not describe (as said in the other answer) all the information on one specific character.

You have the following URL, pointing to the latest Unicode database. This is a simple list of existing valid Unicode characters. Some upcoming characters are missing (㋿), and you should expect it to be mutable.

The contents looks like the following, which isn't so practical to use as-is.

$ grep -ai kangaroo UnicodeData.txt -C 7
1F991;SQUID;So;0;ON;;;;;N;;;;;
1F992;GIRAFFE FACE;So;0;ON;;;;;N;;;;;
1F993;ZEBRA FACE;So;0;ON;;;;;N;;;;;
1F994;HEDGEHOG;So;0;ON;;;;;N;;;;;
1F995;SAUROPOD;So;0;ON;;;;;N;;;;;
1F996;T-REX;So;0;ON;;;;;N;;;;;
1F997;CRICKET;So;0;ON;;;;;N;;;;;
1F998;KANGAROO;So;0;ON;;;;;N;;;;;
1F999;LLAMA;So;0;ON;;;;;N;;;;;
1F99A;PEACOCK;So;0;ON;;;;;N;;;;;
1F99B;HIPPOPOTAMUS;So;0;ON;;;;;N;;;;;
1F99C;PARROT;So;0;ON;;;;;N;;;;;
1F99D;RACCOON;So;0;ON;;;;;N;;;;;
1F99E;LOBSTER;So;0;ON;;;;;N;;;;;
1F99F;MOSQUITO;So;0;ON;;;;;N;;;;;

You could build up a hacky « hash-based » namespace with a suffix like this, but that's definitely non-standard.