Internationalization support in Picolibc

There are two major internationalization APIs in the C library: locales and iconv. Iconv is an isolated component which only performs charset conversion in ways that don't interact with anything else in the library. Locales affect pretty much every API that deals with strings and covers charset conversion along with a huge range of localized information from character classification to formatting of time, money, people's names, addresses and even standard paper sizes.

Picolibc inherits it's implementation of both of these from newlib. Given that embedded applications rarely need advanced functionality from either these APIs, I hadn't spent much time exploring this space.

Newlib locale code

When run on Cygwin, Newlib's locale support is quite complete as it leverages the underlying Windows locale support. Without Windows support, everything aside from charset conversion and character classification data is stubbed out at the bottom of the stack. Because the implementation can support full locale functionality, the implementation is designed for that, with large data structures and lots of code.

Charset conversion and character classification data for locales is all built-in; none of that can be loaded at runtime. There is support for all of the ISO-8859 charsets, three JIS variants, a bunch of Windows code pages and a few other single-byte encodings.

One oddity in this code is that when using a JIS locale, wide characters are stored in EUC-JP rather than Unicode. Every other locale uses Unicode. This means APIs like wctype are implemented by mapping the JIS-encoded character to Unicode and then using the underlying Unicode character classification tables. One consequence of this is that there isn't any Unicode to JIS mapping provided as it isn't necessary.

When testing the charset conversion and Unicode character classification data, I found numerous minor errors and a couple of pretty significant ones. The JIS conversion code had the most serious issue I found; most of the conversions are in a 2d array which is manually indexed with the wrong value for the length of each row. This led to nearly every translated value being incorrect.

The charset conversion tables and Unicode classification data are now generated using python charset support and the standard Unicode data files. In addition, tests have been added which compare Picolibc to the system C library for every supported charset.

Newlib iconv code

The iconv charset support is completely separate from the locale charset support with a much wider range of supported targets. It also supports loading charset data from files at runtime, which reduces the size of application images.

Because the iconv and locale implementations are completely separate, the charset support isn't the same. Iconv supports a lot more charsets, but it doesn't support all of those available to locales. For example, Iconv has Big5 support which locale lacks. Conversely, locale has Shift-JIS support which iconv does not.

There's also a difference in how charset names are mapped in the two APIs. The locale code has a small fixed set of aliases, which doesn't include things like US-ASCII or ANSI X3.4. In contrast, the iconv code has an extensive database of charset aliases which are compiled into the library.

Picolibc has a few tests for the iconv API which verify charset names and perform some translations. Without an external reference, it's hard to know if the results are correct.

POSIX vs C internationalization

In addition to including the iconv API, POSIX extends locale support in a couple of ways:

Exposing locale objects via the newlocale, uselocale, duplocale and freelocale APIs.
uselocale sets a per-thread locale, rather than the process-wide locale.

Goals for Picolibc internationalization support

For charsets, supporting UTF-8 should cover the bulk of embedded application needs, and even that is probably more than what most applications require. Most (all?) compilers use Unicode for wide character and string constants. That means wchar_t needs to be Unicode in every locale.

Aside from charset support, the rest of the locale infrastructure is heavily focused on creating human-consumable strings. I don't think it's a stretch to say that none of this is very useful these days, even for systems with sophisticated user interactions. For picolibc, the cost to provide any of this would be high.

Having two completely separate charset conversion datasets makes for a confusing and error-prone experience for developers. Replacing iconv with code that leverages the existing locale support for translating between multi-byte and wide-character representations will save a bunch of source code and improve consistency.

Embedded systems can be very sensitive to memory usage, both read-only and read-write. Applications not using internationalization capabilities shouldn't pay a heavy premium even when the library binary is built with support. For the most sensitive targets, the library should be configurable to remove unnecessary functionality.

Picolibc needs to be conforming with at least the C language standard, and as much of POSIX as makes sense. Fortunately, the requirements for C are modest as it only includes a few locale-related APIs and doesn't include iconv.

Finally, picolibc should test these APIs to make sure they conform with relevant standards, especially character set translation and character classification. The easiest way to do this is to reference another implementation of the same API and compare results.

Switching to Unicode for JIS wchar_t

This involved ripping the JIS to Unicode translations out of all of the wide character APIs and inserting them into the translations between multi-byte and wide-char representations. The missing Unicode to JIS translation was kludged by iterating over all JIS code points until a matching Unicode value was found. That's an obvious place for a performance improvement, but at least it works.

Tiny locale

This is a minimal implementation of locales which conforms with the C language standard while providing only charset translation and character classification data. It handles all of the existing charsets, but splits things into three levels

ASCII
UTF-8
Extended, including any or all of: a. ISO 8859 b. Windows code pages and other 8-bit encodings c. JIS (JIS, EUC-JP and Shift-JIS)

When built for ASCII-only, all of the locale support is short-circuited, except for error checking. In addition, support in printf and scanf for wide characters is removed by default (it can be re-enabled with the -Dio-wchar=true meson option). This offers the smallest code size. Because the wctype APIs (e.g. iswupper) are all locale-specific, this mode restricts them to ASCII-only, which means they become wrappers on top of the ctype APIs with added range checking.

When built for UTF-8, character classification for wide characters uses tables that provide the full Unicode range. Setlocale now selects between two locales, "C" and "C.UTF-8". Any locale name other than "C" selects the UTF-8 version. If the locale name contains "." or "-", then the rest of the locale name is taken to be a charset name and matched against the list of supported charsets. In this mode, only "us_ascii", "ascii" and "utf-8" are recognized.

Because a single byte of a utf-8 string with the high-bit set is not a complete character, all of the ctype APIs in this mode can use the same implementation as the ASCII-only mode. This means the small ctype implementation is available.

Calling setlocale(LC_ALL, "C.UTF-8") will allow the application to use the APIs which translate between multi-byte and wide-characters to deal with UTF-8 encoded strings. In addition, scanf and printf can read and write UTF-8 strings into wchar_t strings.

Locale names are converted into locale IDs, an enumeration which lists the available locales. Each ID implies a specific charset as that's the only thing which differs between them. This means a locale can be encoded in a few bytes rather than an array of strings.

In terms of memory usage, applications not using locales and not using the wctype APIs should see only a small increase in code space. That's due to the wchar_t support added to printf and scanf which need to translate between multi-byte and wide-character representations. There aren't any tables required as ASCII and UTF-8 are directly convertible to Unicode. On ARM-v7m, The added code in printf and scanf add up to about 1kB and another 32 bytes of RAM is used.

The big difference when enabling extended charset support is that all of the charset conversion and character classification operations become table driven and dependent on the locale. Depending on the extended charsets supported, these can be quite large. With all of the extended charsets included, this adds an additional 30kB of code and static data and uses another 56 bytes of RAM.

There are two known gaps in functionality compared with the newlib code:

Locale strings that encode different locales for different categories. That's nominally required by POSIX as LC_ALL is supposed to return a string sufficient to restore the locale, but the only category which actually matters is LC_CTYPE.
No nl_langinfo support. This would be fairly easy to add, returning appropriate constant values for each parameter.

Tiny locale was merged to picolibc main in this PR

Tiny iconv

Replacing the bulky newlib iconv code was far easier than swapping locale implementations. Essentially all that iconv does is compute two functions, one which maps from multi-byte to wide-char in one locale and another which maps from wide-char to multi-byte in another locale.

Once the JIS locales were fixed to use Unicode, the new iconv implementation was straightforward. POSIX doesn't provide any _l version of mbrtowc or wcrtomb, so using standard C APIs would have been clunky. Instead, the implementation uses the internal APIs to compute the correct charset conversion functions. The entire implementation fits in under 200 lines of code.

Tiny iconv is in process in this PR

Future directions

Right now, both of these new bits of code sit in the source tree parallel to the old versions. I'm not seeing any particular reason to keep the old versions around; they have provided a useful point of comparison in developing the new code, but I don't think they offer any compelling benefits going forward.