Picolibc Updates

I thought work on picolibc would slow down at some point, but I keep finding more things that need work. I spent a few weeks working in libm and then discovered some important memory allocation bugs in the last week that needed attention too.

Cleaning up the Picolibc Math Library

Picolibc uses the same math library sources as newlib, which includes code from a range of sources:

  • SunPro (Sun Microsystems). This forms the bulk of the common code for the math library, with copyright dates stretching back to 1993. This code is designed for processors with FPUs and uses 'float' for float functions and 'double' for double functions.

  • NetBSD. This is where the complex functions came from, with copyright dates of 2007.

  • FreeBSD. fenv support for aarch64, arm and sparc.

  • IBM. SPU processor support, along with a number of stubs for long-double support where long double is the same as double.

  • Various processor vendors have provided processor-specific code for exceptions and a few custom functions.

  • Szabolcs Nagy and Wilco Dijkstra (ARM). These two re-implemented some of the more important functions in 2017-2018 for both float and double using double precision arithmetic and fused multiply-add primitives to improve performance for systems with hardware double precision support.

The original SunPro math code had been split into two levels at some point:

  1. IEEE-754 functions. These offer pure IEEE-754 semantics, including return values and exceptions. They do not set the POSIX errno value. These are all prefixed with __ieee754_ and can be called directly by applications if desired (see the sketch after this list).

  2. POSIX functions. These can offer POSIX semantics, including setting errno and returning expected values when errno is set.
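
For illustration, calling one of these entry points directly looks something like this; the __ieee754_ names aren't in the public headers, so this sketch declares the prototype by hand:

extern double __ieee754_sqrt(double);   /* not in <math.h>; declared by hand */

double
ieee_sqrt(double x)
{
    /* Pure IEEE-754 semantics: sqrt of a negative value returns NaN
     * and raises the 'invalid' exception, but errno is never touched. */
    return __ieee754_sqrt(x);
}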

New Code Sponsored by ARM

Szabolcs Nagy and Wilco Dijkstra's work in the last few years has been to improve the performance of some of the core math functions, which is much appreciated. They've adopted a more modern coding style (C99) and written faster code at the expense of a larger memory footprint.

One interesting choice was to use double computations for the float implementations of various functions. This makes these functions shorter and more accurate than versions done using float throughout. However, for machines which don't have HW double, this pulls in soft double code which adds considerable size to the resulting binary and slows down the computations, especially if the platform does support HW float.

The new code also takes advantage of HW fused-multiply-add instructions. Those offer more precision than a sequence of primitive instructions, and so the new code can be much shorter as a result.
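
Here's a tiny illustration of the difference using the standard C99 fma() function; the constants are chosen so that rounding the product separately loses the interesting bits:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double a = 1.0 + 0x1p-27;    /* chosen so the exact product 1 - 2^-54 */
    double b = 1.0 - 0x1p-27;    /* cannot be represented in a double     */
    double c = -1.0;

    /* volatile forces a separately-rounded product even if the
     * compiler would otherwise contract the expression into an fma */
    volatile double prod = a * b;
    double separate = prod + c;          /* rounds twice: prints 0x0p+0 */

    double fused = fma(a, b, c);         /* rounds once: prints -0x1p-54 */

    printf("separate %a fused %a\n", separate, fused);
    return 0;
}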

The method used to detect whether the target machine supported fma operations was slightly broken on 32-bit ARM platforms. Targets with 'float' fma acceleration but without 'double' fma acceleration would use the shorter code sequence, but with an emulated double fma operation built from the less precise sequence of separate operations, leading to significant reductions in the quality of the resulting math functions.

I fixed the double fma detection and then also added float fma detection along with implementations of float and double fma for ARM and RISC-V. Now both of those platforms get fma-enhanced math functions where available.

Errno Adventures

I'd submitted patches to newlib a while ago that aliased the regular math library names to the __ieee754_ functions when the library was configured to not set errno, which is pretty common for embedded environments where a shared errno is a pain anyway.

Note the use of the word “can” in the remark about the old POSIX wrapper functions. That's because all of these functions are run-time switchable between “_IEEE_” and “_POSIX_” mode using the _LIB_VERSION global symbol. When left in the usual _IEEE_ mode, none of this extra code was ever executed, so these wrapper functions never did anything beyond what the underlying __ieee754_ functions did.

The new code by Nagy and Dijkstra changed how functions are structured to eliminate the underlying IEEE-754 API. These new functions use tail calls to various __math_ error reporting functions. Those can be configured at library build time to set errno or not, centralizing those decisions in a few functions.
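
One of those error paths looks roughly like the following -- a simplified sketch of the idea rather than the actual newlib source, with WANT_ERRNO standing in for the build-time switch described below:

#include <errno.h>

/* Simplified sketch of one centralized error path for the float
 * functions: raise the IEEE 'invalid' exception and, when the
 * library is built to want errno, set it as well. */
float
__math_invalidf(float x)
{
    float y = (x - x) / (x - x);    /* 0/0 raises FE_INVALID, yields NaN */
#if WANT_ERRNO
    errno = EDOM;
#endif
    return y;
}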

The result of this combination of source material is that in the default configuration, some library functions (those written by Nagy and Dijkstra) would set errno and others (the old SunPro code) would not. To disable all errno references, the library would need to be compiled with a set of options, -D_IEEE_LIBM to disable errno in the SunPro code and -DWANT_ERRNO=0 to disable errno in the new code. To enable errno everywhere, you'd set -D_POSIX_MODE to make the default value for _LIB_VERSION be _POSIX_ instead of _IEEE_.

To clean all of this up, I removed the run-time _LIB_VERSION variable and made that a compile-time choice. In combination with the earlier work to alias the __ieee754_ functions to the regular POSIX names when _IEEE_LIBM is defined, this means that the old SunPro POSIX functions now only get used when _IEEE_LIBM is not defined, and in that case the _LIB_VERSION tests always force use of the errno-setting code. In addition, I made the value of WANT_ERRNO depend on whether _IEEE_LIBM was defined, so now a single definition (-D_IEEE_LIBM) removes all of the errno handling from libm, independent of which code is in use.
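
The effect is roughly this (a sketch of the idea, not the literal header text):

/* Roughly the idea: a single -D_IEEE_LIBM now turns off errno
 * handling in both the old and the new code. */
#ifdef _IEEE_LIBM
# define WANT_ERRNO 0    /* Nagy/Dijkstra code: skip errno entirely */
#else
# define WANT_ERRNO 1    /* both code bases set errno */
#endif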

As part of this work, I added a range of errno tests for the math functions to find places where the wrong errno value was being used.

Exceptions

As an alternative to errno, C also provides for IEEE-754 exceptions through the fenv functions. These have some significant advantages, including having independent bits for each exception type and having them accumulate instead of sharing errno with a huge range of other C library functions. Plus, they're generally implemented in hardware, so you get exceptions for both library functions and primitive operations.

Well, you should get exceptions everywhere, except that the GCC soft float libraries don't support them at all. So, errno can still be useful if you need to know what happened in your library functions when using soft floats.
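
For reference, checking for exceptions with the standard <fenv.h> interface looks something like this -- plain C99, nothing picolibc-specific:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int
main(void)
{
    feclearexcept(FE_ALL_EXCEPT);

    volatile double x = -1.0;        /* volatile keeps the compiler from
                                      * folding the sqrt away at build time */
    volatile double y = sqrt(x);     /* domain error */
    (void) y;

    if (fetestexcept(FE_INVALID))
        printf("sqrt(-1) raised FE_INVALID\n");
    return 0;
}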

Newlib has recently seen a spate of fenv support being added for various architectures, so I decided that it would be a good idea to add some tests. I added tests for both primitive operations, and then tests for library functions to check both exceptions and errno values. Oddly, this uncovered a range of minor mistakes in various math functions. Lots of these were mistakes in the SunPro POSIX wrapper functions where they modified the return values from the __ieee754_ implementations. Simply removing those value modifications fixed many of those errors.

Fixing Memory Allocator bugs

Picolibc inherits malloc code from newlib which offers two separate implementations, one big and fast, the other small and slow(er). Selecting between them is done while building the library, and as Picolibc is expected to be used on smaller systems, the small and slow one is the default.

Contributed by someone from ARM back in 2012/2013, nano-mallocr reminds me of the old V7 memory allocator. A linked list, sorted in address order, holds discontiguous chunks of available memory.

Allocation is done by searching for a large enough chunk in the list. The first chunk that is large enough is selected; if it is larger than necessary, the excess is split off and left on the free list while the remainder is handed to the application. When the list doesn't have any chunk large enough, sbrk is called to get more memory.

Free operations involve walking the list and inserting the chunk in the right location, merging the freed memory with any immediately adjacent chunks to reduce fragmentation.

The size of each chunk is stored just before the first byte of memory used by the application, where it remains while the memory is in use and while on the free list. The free list is formed by pointers stored in the active area of the chunk, so the only overhead for chunks in use is the size field.
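
Put another way, the layout looks roughly like this (a sketch, not the literal nano_mallocr declarations):

#include <stddef.h>

/* Each chunk begins with its size; while the chunk is free, the first
 * bytes of the application area hold the free-list pointer. */
typedef struct chunk {
    size_t       size;    /* present whether the chunk is in use or free */
    struct chunk *next;   /* valid only on the free list; this space is
                           * handed to the application once allocated */
} chunk_t;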

Something Something Padding

To deal with the vagaries of alignment, the original nano-mallocr code would allow for there to be 'padding' between the size field and the active memory area. The amount of padding could vary, depending on the alignment required for a particular chunk (in the case of memalign, that padding can be quite large). If present, nano-mallocr would store the padding value in the location immediately before the active area and distinguish that from a regular size field by a negative sign.

The whole padding thing seems mysterious to me -- why would it ever be needed when the allocator could simply create chunks that were aligned to the required value and a multiple of that value in size? The only use I could think of was for memalign; adding this padding field would allow for less over-allocation to find a suitable chunk. I didn't feel like this one (infrequent) use case was worth the extra complexity; it certainly caused me difficulty in reading the code.

A Few Bugs

In reviewing the code, I found a couple of easy-to-fix bugs.

  • calloc was not checking for overflow in multiplication. This is something I've only heard about in the last five or six years -- multiplying the size of each element by the number of elements can wrap around to a small value; the undersized allocation may then succeed and cause the program to misbehave (see the sketch after this list).

  • realloc copied new_size bytes from the original location to the new location. If the new size was larger than the old, this would read off the end of the original allocation, potentially disclosing information from an adjacent allocation, or walk off the end of physical memory and cause a hard fault.
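
For the calloc bug, the classic defense when __builtin_mul_overflow isn't available is a division check before the multiply; a minimal sketch (the actual fix, shown later, uses the builtin):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

/* Sketch of the overflow check; the picolibc fix shown later uses
 * __builtin_mul_overflow rather than this explicit division. */
void *
checked_calloc(size_t n, size_t elem)
{
    if (elem != 0 && n > SIZE_MAX / elem) {
        errno = ENOMEM;              /* n * elem would wrap around */
        return NULL;
    }
    void *mem = malloc(n * elem);
    if (mem != NULL)
        memset(mem, 0, n * elem);
    return mem;
}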

Time For Testing

Once I had uncovered a few bugs in this code, I decided that it would be good to write a few tests to exercise the API. With the tests running on four architectures in nearly 60 variants, it seemed like I'd be able to uncover at least a few more failures:

  • Error tests. Allocate too much memory and make sure the correct errors were returned and that nothing obviously untoward happened.

  • Touch tests. Just call the allocator and validate the return values.

  • Stress test. Allocate lots of blocks, resize them and free them. Make sure, using 'mallinfo', that the malloc arena looked reasonable.
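
The arena check in the stress test amounts to something like this sketch; it assumes the allocator fills in the usual mallinfo fields, which is implementation-defined:

#include <malloc.h>
#include <stdio.h>

/* Sketch: after a burst of malloc/realloc/free calls, ask mallinfo()
 * whether the arena still looks sane. Which fields are meaningful is
 * up to the allocator; these are the common ones. */
static void
check_arena(void)
{
    struct mallinfo info = mallinfo();

    /* in-use plus free space should not exceed the arena size */
    if (info.uordblks + info.fordblks > info.arena)
        printf("arena accounting looks wrong\n");
}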

These new tests did find bugs. But not where I expected them. Which is why I'm so fond of testing.

GCC Optimizations

One of my tests was to call calloc and make sure it returned a chunk of memory that appeared to work or failed with a reasonable value. To my surprise, on aarch64, that test never finished. It worked elsewhere, but on that architecture it hung in the middle of calloc itself. Which looked like this:

void * nano_calloc(malloc_size_t n, malloc_size_t elem)
{
    ptrdiff_t bytes;
    void * mem;

    if (__builtin_mul_overflow (n, elem, &bytes))
    {
        RERRNO = ENOMEM;
        return NULL;
    }
    mem = nano_malloc(bytes);
    if (mem != NULL) memset(mem, 0, bytes);
    return mem;
}

Note the naming here -- nano_mallocr uses nano_ prefixes in the code, but then uses #defines to change their names to those expected in the ABI. (No, I don't understand why either). However, GCC sees the real names and has some idea of what these functions are supposed to do. In particular, the pattern:

foo = malloc(n);
if (foo) memset(foo, '\0', n);

is converted into a shorter and semantically equivalent:

foo = calloc(n, 1);

Alas, GCC doesn't take into account that this optimization is occurring inside of the implementation of calloc.

Another sequence of code looked like this:

chunk->size = foo;
nano_free((char *) chunk + CHUNK_OFFSET);

Well, GCC knows that the content of memory passed to free cannot affect the operation of the application, and so it converted this into:

nano_free((char *) chunk + CHUNK_OFFSET);

Remember that nano_mallocr stores the size of the chunk just before the active memory. In this case, nano_mallocr was splitting a large chunk into two pieces, setting the size of the left-over part and placing that on the free list. Failing to set that size value left whatever was there before for the size and usually resulted in the free list becoming quite corrupted.

Both of these problems can be corrected by compiling the code with a couple of GCC command-line switches (-fno-builtin-malloc and -fno-builtin-free).

Reworking Malloc

Having spent this much time reading through the nano_mallocr code, I decided to just go through it and make it easier for me to read today, hoping that other people (which includes 'future me') will also find it a bit easier to follow. I picked a couple of things to focus on:

  1. All newly allocated memory should be cleared. This reduces information disclosure between whatever code freed the memory and whatever code is about to use the memory. Plus, it reduces the effect of un-initialized allocations as they now consistently get zeroed memory. Yes, this masks bugs. Yes, this goes slower. This change is dedicated to Kees Cook, but please blame me for it, not him.

  2. Get rid of the 'Padding' notion. Every time I read this code it made my brain hurt. I doubt I'll get any smarter in the future.

  3. Realloc could use some love, improving its efficiency in common cases to reduce memory usage.

  4. Reworking linked list walking. nano_mallocr uses a singly-linked free list and open-codes all list walking. Normally, I'd switch to a library implementation to avoid introducing my own bugs, but in this fairly simple case, I think it's a reasonable compromise to open-code the list operations using some patterns I learned while working at MIT from Bob Scheifler.

  5. Discover necessary values, like alignment and the limits of the memory space, from the environment rather than having them hard-coded.

Padding

To get rid of 'Padding' in malloc, I needed to make sure that every chunk was aligned and sized correctly. Remember that there is a header on every allocated chunk, stored before the active memory, which contains the size of the chunk. On 32-bit machines, that size is 4 bytes. If the machine requires allocations to be aligned on 8-byte boundaries (as might be the case for 'double' values), we're now going to force the alignment of the header to 8 bytes, wasting four bytes between the size field and the active memory.

Well, the existing nano_mallocr code also wastes those four bytes to store the 'padding' value. Using a consistent alignment for chunk starting addresses and chunk sizes has made the code a lot simpler and easier to reason about while not using any extra memory for normal allocations. Except for memalign, which I'll cover in a later section.
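
Keeping everything consistent boils down to one rounding operation applied to both chunk addresses and sizes; a sketch of the idea, with the alignment hard-coded for illustration:

#include <stddef.h>

#define MALLOC_ALIGN    8    /* hard-coded here for illustration; the
                              * real value is derived from the system,
                              * as described under System Parameters */

/* Round up to the next multiple of MALLOC_ALIGN (a power of two);
 * applied to both chunk sizes and chunk starting addresses. */
static size_t
align_up(size_t n)
{
    return (n + MALLOC_ALIGN - 1) & ~(size_t) (MALLOC_ALIGN - 1);
}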

realloc

The original nano_realloc function was as simple as possible:

mem = nano_malloc(new_size);
if (mem) {
    memcpy(mem, old, MIN(old_size, new_size));
    nano_free(old);
}
return mem;

However, this really performs badly when the application is growing a buffer while accumulating data. A couple of simple optimizations occurred to me:

  1. If there's a free chunk just after the original location, it could be merged into the existing block, avoiding the need to copy the data.

  2. If the original chunk is at the end of the heap, call sbrk() to increase the size of the chunk.

The second one seems like the more important case; in a small system, the buffer will probably land at the end of the heap at some point, at which point growing it to the size of available memory becomes quite efficient.
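
Here's a sketch of that second case with a minimal chunk header; this is not the picolibc implementation, just the core 'does this chunk end at the current program break' check:

#include <stddef.h>
#include <unistd.h>    /* sbrk() */

struct hdr {
    size_t size;       /* total chunk size, header included */
};

/* Hypothetical helper, not the picolibc code: try to grow 'mem' in
 * place when its chunk is the last thing before the program break. */
static void *
try_grow_at_heap_end(void *mem, size_t new_user_size)
{
    struct hdr *c = (struct hdr *) mem - 1;
    size_t new_chunk_size = new_user_size + sizeof(struct hdr);

    if ((char *) c + c->size != (char *) sbrk(0))
        return NULL;                         /* not at the end of the heap */
    if (new_chunk_size <= c->size)
        return NULL;                         /* not actually growing */
    if (sbrk(new_chunk_size - c->size) == (void *) -1)
        return NULL;                         /* no more memory */
    c->size = new_chunk_size;
    return mem;                              /* grown in place, no copy */
}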

When shrinking the buffer, instead of allocating new space and copying, if there's enough space being freed for a new chunk, create one and add it to the free list.

List Walking

Walking singly-linked lists seems like one of the first things we see when learning pointer manipulation in C:

for (element = head; element; element = element->next)
    do stuff ...

However, this becomes pretty complicated when 'do stuff' includes removing something from the list:

prev = NULL;
for (element = head; element; element = element->next) {
    ...
    if (found)
        break;
    ...
    prev = element;
}

if (prev != NULL)
    prev->next = element->next;
else
    head = element->next;

An extra variable, and a test to figure out how to re-link the list. Bob showed me a simpler way, which I'm sure many people are familiar with:

for (ptr = &head; (element = *ptr); ptr = &(element->next)) {
    ...
    if (found)
        break;
}

*ptr = element->next;

Insertion is similar, as you would expect:

for (ptr = &head; (element = *ptr); ptr = &(element->next))
    if (found)
        break;

new_element->next = element;
*ptr = new_element;

In terms of memory operations, it's the same -- each 'next' pointer is fetched exactly once and the list is re-linked by performing a single store. In terms of reading the code, once you've seen this pattern, getting rid of the extra variable and the conditionals around the list update makes it shorter and less prone to errors.
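
Here's the removal pattern in a complete, compilable form; the node type and function are just for illustration:

#include <stddef.h>

typedef struct node {
    struct node *next;
    size_t       size;
} node_t;

/* Remove the first node of at least 'want' bytes using the
 * pointer-to-pointer idiom: *ptr is always the link to rewrite. */
static node_t *
take_first_fit(node_t **head, size_t want)
{
    node_t **ptr, *element;

    for (ptr = head; (element = *ptr); ptr = &(element->next))
        if (element->size >= want)
            break;

    if (element)
        *ptr = element->next;    /* unlink, no 'prev' variable needed */
    return element;
}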

In the nano_mallocr code, instead of using 'prev = NULL', it actually used 'prev = free_list', and the test for updating the head was 'prev == element', which really caught me unawares.

System Parameters

Any malloc implementation needs to know a couple of things about the system it's running on:

  1. Address space. The maximum range of possible addresses sets the limit on how large a block of memory might be allocated, and hence the size of the 'size' field. Fortunately, we've got the 'size_t' type for this, so we can just use that.

  2. Alignment requirements. These derive from the alignment requirements of the basic machine types (pointers, integers, and floating point numbers), which come from a combination of hardware constraints (some systems will fault if attempting to use memory with the wrong alignment) and a compromise between memory usage and memory system performance.

I decided to let the system tell me the alignment necessary using a special type declaration and the 'offsetof' operation:

typedef struct {
    char c;
    union {
        void *p;
        double d;
        long long ll;
        size_t s;
    } u;
} align_t;

#define MALLOC_ALIGN        (offsetof(align_t, u))

Because C requires struct fields to be stored in order of declaration, the 'u' field would have to be after the 'c' field, and would have to be assigned an offset equal to the largest alignment necessary for any of its members. Testing on a range of machines yields the following alignment requirements:

Architecture    Alignment
x86_64          8
RISC-V          8
aarch64         8
arm             8
x86             4

So, I guess I could have just used a constant value of '8' and not worried about it, but using the compiler-provided value means that running picolibc on older architectures might save a bit of memory at no real cost in the code.

Now, the header containing the 'size' field can be aligned to this value, and all allocated blocks can be allocated in units of this value.

memalign

memalign, valloc and pvalloc all allocate memory with restrictions on the alignment of the base address and length. You'd think these would be simple -- allocate a large chunk, align within that chunk and return the address. However, they also all require that the address can be successfully passed to free. Which means that the allocator needs to do some tricks to make it all work. Essentially, you allocate 'lots' of memory and then arrange that any bytes at the head and tail of the allocation can be returned to the free list.

The tail part is easy; if it's large enough to form a free chunk (which must contain the size and a 'next' pointer for the free list), it can be split off. Otherwise, it just sits at the end of the allocation being wasted space.

The head part is a bit tricky when it's not large enough to form a free chunk. That's where the 'padding' business came in handy; that can be as small as a 'size_t' value, which (on 32-bit systems) is only four bytes.

Now that we're giving up trying to reason about 'padding', any extra block at the start must be big enough to hold a free chunk, which includes the size and a next pointer. On 32-bit systems, that's just 8 bytes, which (for most of our targets) is the same as the alignment value we're using. But on 32-bit systems that use 4-byte alignment, and on 64-bit systems, it's possible that the offset between the aligned address required by the application and the start of a chunk returned by malloc is too small to hold a free chunk.

So, we just allocate a lot of extra space; enough so that we can create a block of size 'toosmall + align' at the start and create a free chunk of memory out of that.

This works, and at least returns all of the unused memory back for other allocations.

Sending Patches Back to Newlib

I've sent the floating point fixes upstream to newlib where they've already landed on master. I've sent most of the malloc fixes, but I'm not sure they really care about seeing nano_mallocr refactored. If they do, I'll spend the time necessary to get the changes ported back to the newlib internal APIs and merged upstream.