[Nickle] Nickle does not handle Unicode grapheme clusters containing a combining character properly

Bart Massey bart at cs.pdx.edu
Sun Oct 27 01:18:58 PDT 2013


Unicode Standard Annex #29 (UAX #29) gives rules for deciding where
to break grapheme clusters (user-perceived characters). A
combining-character sequence such as "n\u0303" is supposed to be
treated as a single grapheme cluster. However, Nickle treats it as a
pair of characters. This causes strlen() to yield a wrong count (2
rather than 1 for this example) and string indexing to yield a
partial or wrong grapheme cluster. It is not obvious what should
happen in the latter case, since Nickle currently converts from UTF-8
to UTF-32 automatically upon referencing an index in a string:
returning a two-code-point UTF-32 value would be confusing if not
outright wrong. --Bart
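
To make the difference between the two counts concrete, here is a
minimal sketch in Python (not Nickle). It is only an approximation
under a stated assumption: a cluster is taken to be a base code point
followed by any code points with a nonzero canonical combining class.
That is far less than the full UAX #29 rules (it ignores Hangul
syllables, ZWJ emoji sequences, CRLF, and so on), and the name
grapheme_clusters() is made up for illustration.

```python
import unicodedata


def grapheme_clusters(s):
    """Approximate grapheme clusters: a base code point plus any
    following combining marks. Assumption: this is NOT full UAX #29
    segmentation, just enough to show the combining-character case."""
    clusters = []
    for ch in s:
        # A nonzero canonical combining class marks a combining
        # character, which attaches to the preceding cluster.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = "man\u0303ana"                      # 'n' followed by COMBINING TILDE
code_point_count = len(s)               # counts code points
cluster_count = len(grapheme_clusters(s))  # counts approximate clusters
print(code_point_count, cluster_count)
```

Here the code-point count is 7 and the cluster count is 6. Note that
indexing into the list of clusters returns a string of one or more
code points rather than a single integer code point, which is the
same design question raised above for Nickle's string indexing.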

