Float/String Conversion in Picolibc: Enter “Ryū”

I recently wrote about this topic having concluded that the best route for now was to use the malloc-free, but imprecise, conversion routines in the tinystdio alternative.

A few days later, Sreepathi Pai pointed me at some very recent work in this area:

This is amazing! Thirty years after the papers referenced in the previous post, Ulf Adams came up with some really cool ideas and managed to reduce the math required for 64-bit conversion to 128 bit integers. This is a huge leap forward; we were doing long multi-precision computations before, and now it's all short enough to fit in registers (ok, a lot of registers, but still).

Getting the Ryū Code

The code is available on github: https://github.com/ulfjack/ryu. Reading through it, it's very clear that the author focuses on performance with lots of tuning for common cases. Still, it's quite readable, especially compared with the newlib multi-precision based code.

Picolibc String/Float conversion interface

Picolibc has some pretty basic needs for the float/string conversion code, it wants four functions:

  1. __dtoa_engine

    int
    __dtoa_engine(double x, struct dtoa *dtoa, uint8_t max_digits, uint8_t max_decimals);
    

    This converts the double x to a string of decimal digits and a decimal exponent stored inside the 'dtoa' struct. It limits the total number of digits to max_digits and, optionally (when max_decimals is non-zero), limits the number of fractional digits to max_decimals - 1. This latter supports 'f' formats. Returns the number of digits stored, which is <= max_digits. Less if the number can be accurately represented in fewer digits.

  2. __ftoa_engine

    int
    __ftoa_engine(float x, struct ftoa *ftoa, uint8_t max_digits, uint8_t max_decimals);
    

    The same as __dtoa_engine, except for floats.

  3. __atod_engine

    double
    __atod_engine(uint64_t m10, int e10);
    

    To avoid needing to handle stdio inside the conversion function, __atod_engine receives fully parsed values, the base-10 significand (m10) and exponent (e10). The value to convert is m10 * pow(10, e10).

  4. __atof_engine

    float
    __atof_engine(uint32_t m10, int e10);
    

    The same as __atod_engine, except for floats.

With these, it can do printf, scanf, ecvt, fcvt, gcvt, strtod, strtof and atof.

Porting Ryū to Picolibc

The existing Ryū float-to-string code always generates the number of digits necessary for accurate output. I had to hack it up to generate correctly rounded shorter output when max_digits or max_decimals were smaller. I'm not sure I managed to do that correctly, but at least it appears to be passing all of the test cases I have. In normal operation, Ryū iteratively removes digits from the answer that aren't necessary to disambiguate with neighboring values.

What I changed was to keep removing digits using that method until the answer had few enough digits to fit in the desired length. There's some tricky rounding code that adjusts the final result and I had to bypass that if I'd removed extra digits.

That was about the only change necessary to the core algorithm. I also trimmed the code to only include the general case and not the performance improvements, then wrapped it with code to provide the _engine interface.

On the string-to-float side, most of what I needed to do was remove the string parsing bits at the start of the function and switch from performance-optimized to space-optimized versions of a couple of internal routines.

Correctness Results

Because these new functions are now 'exact', I was able to adjust the picolibc tests to compare all of the bits for string/float conversion instead of having to permit a bit of slop in the answers. With those changes, the picolibc test suite passes, which offers some assurance that things aren't completely broken.

Size Results

Snek uses the 32-bit float versions of the conversion routines, and for that, the size difference is:

   text    data     bss     dec     hex filename
  59068      44   37968   97080   17b38 snek-qemu-riscv-orig.elf
  59430      44   37968   97442   17ca2 snek-qemu-riscv-ryu.elf
    362

362 bytes added to gain accurate printf/strtof results seems like a good trade-off in this case.

Performance

I haven't measured performance at all, but I suspect that it won't be nearly as problematic on most platforms as the source code makes it appear. And that's because Ryū is entirely integer arithmetic with no floating point at all. This avoids using the soft fp code for platforms without hardware float support.

Pointers to the Code

I haven't merged this to picolibc master yet, it's on the ryu branch:

Review, especially of the hack above to return short results, would be greatly appreciated!

Thanks again to Ulf Adams for creating this code and to Sreepathi Pai for sending me a note about it!