Deferred formatting in C

Background

My friend recently started learning embedded rust for personal reasons, and stumbled upon the wonderful defmt crate. It manages to speed up logging by a lot, and it reduces flash size needed for log strings. It does this by not storing the format strings in flash, but instead stores a “string ID”, which it sends over RTT / UART / something else. The computer instead becomes responsible for formatting the string using printf, and displaying the results.

This is in my opinion insanely cool, and so I naturally wondered: can this be implemented in standard C?

The short answer is, not exactly. At least, if you treat your compilation / linking as a black box.

If instead, we give ourselves the ability to perform arbitrary pre/post-processing on our object files and the resulting executable, it turns out that this becomes doable.

Macros - C vs Rust

The rust crate that inspired this makes heavy use of macros and is implemented entirely within the constraints of the rust compiler. Is this possible in C? No. While rust’s macro system is complex and extremely powerful, C’s macro system is basically a text search and replace, but with a few extra steps. So we need to somehow “extend” the compilation pipeline. For example: if we can somehow construct a macro that can emit some extra metadata about the current call to printf and friends, we can do the rest of the processing outside of the confines of GCC.

Worth noting is that this should be fully possible in C++, since we can define a template function for custom compile-time string constants. We can implement a small compile-time parser for printf format specifiers within said template, and then wrap this in a few macros. This is left as an exercise for the reader.

We’re (un)fortunately stuck with C, due to it being my preferred language, and the canonical language for microcontroller programming.

Varargs and encoding

Our first problem is the metadata. Exactly what information do we need to store, and how?

Recall how functions with an arbitrary argument count (varargs) work in C. The following is the function signature for printf:

int printf(const char* format, ...);

The ... signifies that this function can take any number of extra arguments. However, the function doesn’t get to know the amount and type of the extra arguments. Depending on the ABI, some arguments live in registers and some live on the stack. Since some of these arguments potentially live on the stack, it is very important that we call va_arg for each and every argument before returning (try intentionally not doing this if you like SEGFAULTs). So, the function needs some way to determine how many arguments to grab and what type each argument is (to ensure the right amount of bytes are popped from the stack). In the case of printf, this information is baked in to the format-string.

The size of each argument matters (again, we don’t want to misalign the stack, that would be terrible). Every type smaller than int (4 bytes on most systems) gets implicitly promoted to int, while types larger than int keep their size. So we not only need to store the unique ID for a given format string, but also the sizes and types of each argument.

Storing an enum containing the entire type is a lot of information per format argument, but fortunately we don’t need that much information. First, only builtin types (like unsigned int, long long*, and such) can be passed as varargs. Secondly, there’s only two sizes of argument, 4 bytes and 8 bytes (due to the aforementioned promotion, also disregarding architectures where >8byte types exist… like x86). The code running on the microcontroller doesn’t even need to know the signedness of the type, since promotions do sign-extension implicitly.

The only thing we need to treat differently is %s, which corresponds to a char*. We can’t really send a raw char* over to the computer and expect it to be able to properly use it, since it points to memory in the microcontroller and might point to volatile memory (like the stack). So we have 3 cases:

  • 4-byte integer/FP
  • 8-byte integer/FP
  • null-terminated string

This means we can pack the data for 4 format arguments into a single byte (reserving one value to mark the end of the argument list).

For the string ID, we can borrow a page out of defmt and store it as a LEB-128 encoded integer. This is a variable length coding scheme, where the highest bit indicates more data, and the lower 7 bits are data. This wastes 1 bit per byte, but also allows the amount of bytes allocated for storing the string ID to dynamically adjust (storing 4 bytes per ID if we only have 4 log strings wastes a lot of storage)

Our implementation of printf on the microcontroller then takes a pointer to this data, and should do the following:

  • Send the string ID as-is.
  • Parse the data for each format argument, and send the corresponding raw data (using va_arg to extract it)

The plan

Now, let’s begin with the most complicated part: transforming each printf-call into metadata, and inventing a way to inject the compressed string later in the build process.

Consider the following C macro (Note that this uses some GCC specific extensions):

#define INTERN(x)                                  \
    ({                                             \
        const char __attribute__((used)) _x[] = x; \
        extern const char _y[];                    \
        _y;                                        \
    })

This will take a string constant (labelled x), and then “transform” that into a reference to the variable _y. Side note: this uses a GCC extension called “statement expressions”. The statements will be evaluated in order, and then the last statement will be the value of the entire expression. Read more here

Under normal circumstances, this would produce a linker error (after all, _y is never defined anywhere). However, if we can extract all pairs of _x and _y before linking, we can define _y ourselves, and do just about anything. Unfortunately, it turns out that this is a bit more difficult than it might look initially.

For one, we need a way to actually find the variables, and to extract their values. Notice how both _x and _y are declared as arrays instead of pointers? This makes GCC store their raw values, instead of pointers to their raw values.

Unique variable names

The above macro would only work once, in a single file. This is due to _y having the same name everywhere, and thus it can only be defined once. The linker would also whine about _x being defined multiple times. A naive way of trying to solve this would be generating variable names based on __COUNTER__ and __LINE__, but this is prone to conflicts. Another idea is to use __FILE__, but this is unfortunately not possible. __FILE__ expands to a string constant, not a symbol, and to my knowledge there is no “unstringification” operator in C. Instead, we can define a new macro, __FILE_UID__, in our compiler command-line. If we define this to be some sort of hash derived from the contents of the current file (example: md5sum file.c), we can generate globally unique variables based on this combined with __COUNTER__, like so:

#define VAR_NAME __interned_ ## __FILE_UID__ ## _ ## __COUNTER__

## is the token-pasting operator. It’s function is to take two distinct tokens in a macro and merge them, so that they are treated as a single identifier / token. This allows us to construct variable names from parts, and is one of the useful “magic operators” in C macros.

But, this doesn’t work. The token-pasting operator can only work on macro arguments, so we need the following:

#define _VAR_NAME(x, y) __interned_ ## x ## _ ## y
#define VAR_NAME _VAR_NAME(__FILE_UID__, __COUNTER__)

This also doesn’t work, and in fact VAR_NAME always expands to __interned___FILE_UID_____COUNTER__, which definitely is not what we intended. So, what gives?

The answer is argument prescan. Basically, when the C compiler is expanding macros, it first expands the arguments to the macro, then substitutes them into the macro body, then expands the resulting macro body again. However, the expansion of an argument only happens when said argument is not token-pasted or stringized. So in order to correctly expand __FILE_UID__ and __COUNTER__, we need another layer of macros, like this:

#define _VAR_NAME2(x, y) __interned_ ## x ## _ ## y
#define _VAR_NAME(x, y) _VAR_NAME2(x, y)
#define VAR_NAME _VAR_NAME(__FILE_UID__, __COUNTER__)

This works as intended.

Since we use __COUNTER__ in our macro, we need to be careful and only expand VAR_NAME once per variable group, to ensure they get the same base name. Argument prescan is important again here, since we will use VAR_NAME for token-pasting purposes, and we want it expanded.

So, now we have:

#define _INTERN2(x, var_name)                                      \
    ({                                                             \
        const char __attribute__((used)) var_name ## _data [] = x; \
        extern const char var_name ## _generated [];               \
        var_name ## _generated;                                    \
    })
    
#define _INTERN(x, var_name) _INTERN2(x, var_name)
#define INTERN(x) _INTERN(x, VAR_NAME)

Using the definition of VAR_NAME from above. However, this is still not enough.

ELF Sections

Remember one of our goals was to reduce flash usage? Well, turns out we’ve done exactly the opposite. Since we declared var_name##_data with __attribute__((used)), GCC will not remove it during optimization (which is what we want, because we need to inspect it). And because it’s a constant, it will be placed in .rodata by default, which is located in flash (on reasonable systems). We somehow need to delete the constant ourselves, before GCC has a chance to link it into the resulting binary.

We can do this by forcing each variable into a unique section (using __attribute__((section(...)))). This has the added benefit of making the variables easier to locate, since then we don’t need to manually resolve ELF relocations. We can then just remove the entire section from the resulting object file before linking.

#define _STR(x) #x
#define STR(x) _STR(x)

#define _INTERN2(x, var_name)                                     \
    ({                                                            \
        static const char                                         \
            __attribute__((used, section(STR(var_name) "_data"))) \
            var_name ## _data [] = x;                             \
        extern const char                                         \
            __attribute((section(STR(var_name) "_gen")))          \
            var_name ## _generated [];                            \
        var_name ## _generated;                                   \
    })
    
#define _INTERN(x, var_name) _INTERN2(x, var_name)
#define INTERN(x) _INTERN(x, VAR_NAME)

Call information

So we’ve seen that we can transform a single constant string into a bunch of metadata, which we can then extract from the object file. But what metadata can we really store without it ending up in flash? The basic ones like __FILE__ and __LINE__ are a given, but could we do the same with __func__?

The short answer is no, but in theory it might be possible. __LINE__ and __FILE__ are defined by GCC to be macros that expand to an integer constant and a string constant, respectively. However __func__ is different. The C preprocessor has no idea what function it expands a given macro in, it just expands macros. So __func__ is instead implemented in the compiler, where it acts like the following:

static const char __func__[] = "function_name";
void function_name() {
    ...
}

static const char __func__[] = "function_name_2";
void function_name_2() {
    ...
}

So __func__ acts like a normal variable, and not a string constant. This means that we cannot do the following:

void function_name() {
    static const char __attribute__((section("x"))) x[] = __func__;
}

Instead, we have to declare x as a char*. This makes GCC place a pointer to __func__ in section x, and __func__ gets placed in .rodata. Notice how we don’t have any control of what section __func__ gets placed in. I experimented with this, and concluded that GCC unfortunately doesn’t optimize away __func__, despite it not being referenced anywhere, so it takes up a bunch of space in flash. So for each call to printf, we can only really store __FILE__ and __LINE__ without impacting flash usage.

In theory, if you compile with -fdata-sections, you might be able to look at the section name (which would look like .rodata.__func__.x where x is a number), and delete these sections as well, but since they are referenced by relocations from the pointer-containing sections it might be difficult to do without ruining the object file.

LTO

Unfortunately this still isn’t enough. When we compile code normally, GCC spits out one object file per C file (called a Translation Unit in this context). These object files are in reality relocatable ELF-files, and contain normal assembly code, along with some annotations indicating what addresses need to be corrected by the linker when doing the final memory layout.

When compiling and linking with Link-Time Optimization (-flto) however, GCC doesn’t generate assembly for the object files anymore. Instead, it just spits out an intermediate representation (GIMPLE in this case), and lets the linker do the final optimization passes and code generation.

This is actually a great thing, as it allows inlining and constant folding for function calls between different C files, but it also unfortunately makes the object file completely opaque. GIMPLE is not documented, very internal, may change with the lunar cycle, and really is not fit for consumption by anything other than GCC. This would not be a problem since you may think it only affects code, and not constants. But alas, GIMPLE is emitted for all constant data as well.

So, our deferred formatting won’t work with -flto then? We can’t have that.

It turns out you can tell GCC to generate both GIMPLE and normal assembly for object files, if you pass -ffat-lto-objects to it. This makes the compilation time significantly longer (around 20-50% according to my experiments), but it is well worth it. Since we are only reading from the object file and using this information to generate a C file containing all the encoded string data, we thankfully never need to modify the GIMPLE representation.

Putting it together

We’ve now solved most of our problems, and can start putting it all together.

Firstly, our macro to save metadata for printf calls:

// Convert a symbol to a string (nested due to argument prescan)
#define _STR(x) #x
#define STR(x) _STR(x)

// Generate the base variable name
// __FILE_UID__ defined on the command line, __COUNTER__ is builtin
#define _VAR_NAME2(x, y) __interned_ ## x ## _ ## y
#define _VAR_NAME(x, y) _VAR_NAME2(x, y)
#define VAR_NAME _VAR_NAME(__FILE_UID__, __COUNTER__)

// COMBINE(a, b) == a_b
// This is used to generate variable names
// later
#define _COMBINE(x, y) x ## y
#define COMBINE(x, y) _COMBINE(x, y)

// Attributes for the metadata vars
// 'used' means GCC won't optimize it away
// 'section' places it in a special ELF section
// Note that ("x" "y") == "xy"
// (GCC combines nearby string constants into one)
#define METADATA_ATTRS(var_name, suffix) \
    __attribute__((used, section(STR(var_name) STR(suffix))))
    
// Generate a metadata definition
#define METADATA(type, var_name, suffix) \
    const type METADATA_ATTRS(var_name, suffix) \
    COMBINE(var_name, suffix)
    
// Prototype for our custom printf function
void interned_printf(const uint8_t* data, ...);

// Magic sauce
#define _INTERNED_PRINTF2(str, var_name, ...)             \
    do {                                                  \
        METADATA(char, var_name, _str[]) = str;           \
        METADATA(char, var_name, _file[]) = __FILE__;     \
        METADATA(int, var_name, _line) = __LINE__;        \
        extern uint8_t COMBINE(var_name, _generated[]);   \
        interned_printf(                                  \
            COMBINE(var_name, _generated)                 \
            __VA_OPT__(,) __VA_ARGS__                     \
        );                                                \
    } while (0)
    
// Wrapper to inject base variable name
// again nested due to argument prescan
#define _INTERNED_PRINTF(str, var_name, ...) \
    _INTERNED_PRINTF2(str, var_name __VA_OPT__(,) __VA_ARGS__)
#define INTERNED_PRINTF(str, ...) \
    _INTERNED_PRINTF(str, VAR_NAME __VA_OPT__(,) __VA_ARGS__)

Let’s try expanding INTERNED_PRINTF("Hello, %d!", 1337), assuming that __FILE_UID__ = abcd:

void test() {
    // This invocation of INTERNED_PRINTF
    INTERNED_PRINTF("Hello, %d!", 1337);
}

void test_expanded() {
    // Expands to this (attributes omitted for brevity)
    // The variables will be placed in unique sections
    // corresponding to their names.
    do {
        const char __interned_abcd_0_str[] = "Hello, %d!";
        const char __interned_abcd_0_file[] = __FILE__;
        const int __interned_abcd_0_line = __LINE__;
        extern const uint8_t __interned_abcd_0_generated[];
        interned_printf(
            __interned_abcd_0_generated, 1337
        );
    } while (0);
}

Great! Our string is now stored as metadata (along with the file and line number), and our printf replacement gets some new data that we can generate ourselves.

The next thing to implement is interned_printf:

#include <stdarg.h>
#include <stdint.h>

// Assume the following functions exist for emitting data
// (this could be over UART / RTT / SWO / some other interface)
// I would recommend emitting uint32_t and uint64_t in native endianness
// and translating on the computer, as this is faster
void emit_byte(uint8_t byte);
void emit_u32(uint32_t data);
void emit_u64(uint64_t data);
void emit_string(const char* str);

// data is encoded like this:
// [LEB-128 encoded id] [0-terminated list of all argument sizes]
// This example assumes each argument length is in it's own byte
// compressing this further is left as an excercise for the reader

void interned_printf(const uint8_t* data, ...) {
    // Set up varargs, see documentation for <stdarg.h>
    va_list args;
    va_start(args, data);

    // Emit the LEB-128 encoded ID directly
    while (*data & 0x80) {
        emit_byte(*data++);
    }
    emit_byte(*data++); // Last byte of LEB-128 data
    
    // data now points at the beginning of the argument lengths
    // When we encounter a 0-length argument, we're done
    while (*data) {
        uint8_t sz = *data++;
        
        switch (sz) {
            case 4: // A 4 represents a 4-byte argument
                emit_u32(va_arg(args, uint32_t));
                break;
            case 8: // An 8 represents a 8-byte argument
                emit_u64(va_arg(args, uint64_t));
                break;
            case 15: 
                // 15 represents a string argument, 
                // which needs to be emitted as a 
                // null-terminated string
                emit_string(va_arg(args, char*));
                break;
        }
    }
}

Perfect!

Now, the question is. How do we integrate this in our build process? The method I went for is to wrap the linker in a python script that does the following:

  • Parse the command line arguments for all object files
  • Find all interned strings in said object files, and extract metadata (by looking for sections named .__interned_<valid id>_str)
  • Remove all such sections (aside from __interned_<valid id>_generated) from all object files
  • Parse and encode the format string
  • Generate a file that contains all metadata in a lookup table (for example a json file)
  • Generate a C-file that contains definitions for all __interned_<valid id>_generated, corresponding to their encoded string data
  • Compile said C-file to an object file (using the same compiler options that the other object files used)
  • Link as normal, passing all arguments through to gcc or ld

We can then use Ghidra and readelf to make sure GCC doesn’t try to do anything funny.

This python script is too long to include directly in this post. I will push all the examples to Github under the MIT license at a later point, and update this post to link to said repo.

Side note: if you haven’t before, try reverse-engineering code that you wrote / have the source-code for. You pick up a lot of information about how the compiler behaves by doing this. I would recommend having some fun in godbolt, and maybe trying out some of the example code in this post to see what happens.

Results

This code, tested on the firmware at my workplace, showed approximately a 90% size reduction in flash usage for log strings, and shaved off about 1kB with even just a single log string (due to not requiring an actual printf implementation).

I’ve not measured the performance impact, but I would expect this to be significantly faster due to not having to do formatting calculations.

A more detailed performance evaluation and flash usage comparison will be released once I’ve pushed up the example code.

Wrap-up

The build process is often treated as a magical black-box, where you feed it C files and it spits out a working executable. But so many interesting solutions to difficult problems can be created when you start poking at your elf-files and object files externally. This is always a bit scary to do, especially in a production environment, but for development tools this opens some pretty interesting doors.

Satrn's Blog

CTF Writeups, Personal projects, and everything inbetween


Building an implementation of deferred formatting in C, and showcasing the wonders of not treating your build system as a black box

By Satrn, 2024-11-18


On this page: