Like most Unicode-related tutorials I have seen on the net (including documents such as GNU’s Libc manual), this one leaves out a number of very important gotchas that programmers are just going to have to know about. The best place to read up on these issues is the Unicode Technical Reports and Unicode Standard Annexes. The first I would suggest is Unicode Standard Annex #15 on Normalisation, as this has a major impact on correct code.
Briefly, there are four normal forms, but you only need to know about two of them: Normalisation Form C (NFC) and Normalisation Form KC (NFKC). Put very simply, Unicode text sent out (e.g. to the network) SHOULD be in NFC. Any strings you compare (filenames, usernames, or anything you wish to sort on) need to be normalised into NFKC prior to comparison. NFKC replaces any “compatibility characters” with their plain equivalents, so that equivalent strings end up as the same sequence of code points in memory and can be compared. Here is an example:
>>> "Richard IV" == "Richard \u2163"
False
A human reader would consider the strings "Richard IV" and "Richard Ⅳ" identical. The first string ends in 'I' + 'V', but the second ends in a single code point, U+2163 ROMAN NUMERAL FOUR. This is a ‘compatibility character’. It generally shouldn’t be used, but some systems may automatically have mapped ‘I’ + ‘V’ into U+2163 (or it may have been transcoded from another character set). The NFKC normalisation process essentially changes occurrences of U+2163 to ‘I’ + ‘V’. Other examples in European scripts typically come from ligatures such as U+0133 LATIN SMALL LIGATURE IJ (ĳ), which under NFKC is decomposed to ‘i’ + ‘j’; there are plenty more examples in the Normalisation document.
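The two rules above (NFKC before comparing, NFC before sending out) can be sketched with the unicodedata module from Python's standard library. The helper names nfkc_equal and for_output are my own, chosen for illustration, and this is only a minimal sketch rather than a complete comparison routine:

```python
import unicodedata


def nfkc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def for_output(text: str) -> str:
    """Normalise text to NFC before it is sent out (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


plain = "Richard IV"
compat = "Richard \u2163"  # ends in U+2163 ROMAN NUMERAL FOUR

print(plain == compat)            # raw code points differ
print(nfkc_equal(plain, compat))  # equal once both are in NFKC

# The IJ ligature U+0133 is decomposed to 'i' + 'j' under NFKC.
print(unicodedata.normalize("NFKC", "\u0133") == "ij")

# NFC composes 'e' + U+0301 COMBINING ACUTE ACCENT into U+00E9,
# but leaves the compatibility character U+2163 untouched.
print(for_output("e\u0301") == "\u00e9")
print(for_output(compat) == compat)
```

Note that NFC alone does not make the two "Richard" strings equal, which is why the comparison goes through NFKC.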
Another major issue is that normalisation is not closed under concatenation: the string formed by NFKC(string1) + NFKC(string2) is not guaranteed to be NFKC-normalised, although an optimised function such as NFKC_concat(string1, string2) can be defined.
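A small illustration of the concatenation problem, again using Python's unicodedata module. The nfkc_concat below is a hypothetical helper in its simplest, unoptimised form: it just re-normalises the whole joined string, whereas an optimised version would presumably only need to look at the text around the join.

```python
import unicodedata


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def nfkc_concat(left: str, right: str) -> str:
    """Hypothetical helper: join two strings and re-normalise the result."""
    return nfkc(left + right)


left = nfkc("e")        # already normalised
right = nfkc("\u0301")  # U+0301 COMBINING ACUTE ACCENT, also already normalised

joined = left + right
print(joined == nfkc(joined))                # the plain join is not in NFKC
print(nfkc_concat(left, right) == "\u00e9")  # re-normalising composes U+00E9
```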
Another very important document to read is the Security Considerations document (Unicode Technical Report #36), which very broadly speaking covers two themes: visual security, as illustrated by paypa<capital-i>.com, and technical security issues. This document also gives user-agents a number of security-related recommendations.
I think I shall eventually put together a reading list for people who want to get into Unicode correctly (I’m just getting into it myself and have been exploring the various normative documents), but that may be a little time in coming.
People with an interest in network protocols should also arm themselves with a knowledge of standards such as stringprep (RFC 3454), which allows us to pick and choose rules (creating what is known as a stringprep ‘profile’) to limit how particular strings, such as usernames, may be represented.
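To give a feel for what a profile looks like, here is a rough sketch built on the stringprep module in Python's standard library, which exposes the RFC 3454 tables as lookup functions. The function prep_username and its particular choice of tables are my own illustration, not any registered profile; it skips the bidirectional and unassigned-code-point checks a real profile would need, and it normalises against the current Unicode database whereas the stringprep tables are based on Unicode 3.2.

```python
import stringprep
import unicodedata


def prep_username(name: str) -> str:
    """Illustrative stringprep-style preparation (hypothetical profile)."""
    # Step 1, map: drop characters "commonly mapped to nothing" (table B.1)
    # and case-fold the rest for use with NFKC (table B.2).
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in name
        if not stringprep.in_table_b1(ch)
    )

    # Step 2, normalise: apply NFKC.
    normalised = unicodedata.normalize("NFKC", mapped)

    # Step 3, prohibit: reject a chosen subset of the C tables.
    for ch in normalised:
        if (
            stringprep.in_table_c12(ch)         # non-ASCII space characters
            or stringprep.in_table_c21_c22(ch)  # control characters
            or stringprep.in_table_c3(ch)       # private use
            or stringprep.in_table_c4(ch)       # non-character code points
            or stringprep.in_table_c5(ch)       # surrogate codes
        ):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")

    return normalised


print(prep_username("Richard \u2163") == prep_username("Richard IV"))
```

Because the mapping step case-folds, this hypothetical profile treats usernames as case-insensitive; a profile that wanted case-sensitive names would simply leave table B.2 out.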
In conclusion, the programming world is in for a nasty shake-up; there is a definite requirement for CORRECT training resources to be available to the programming world at large.
Great, but you left out a major gotcha