
Implementing hot reloading of C++ code on Linux and macOS: digging deeper


* A link to the library and a demo video are at the end of the article. To understand what is going on and who all these people are, I recommend reading the previous article.


In the previous article, we looked at an approach that makes hot reloading of C++ code possible. "Code" here means functions, data, and their coordinated work together. Functions pose no particular problem: we redirect the flow of execution from the old function to the new one, and everything works. The problem lies with data (static and global variables), namely with the strategy for synchronizing it between the old and the new code. In the first implementation, this strategy was very crude: we simply copied the values of all static variables from the old code into the new code, so that the new code, referring to the new variables, worked with the values from the old code. Of course, this is incorrect, and today we will try to fix this flaw, solving a number of small but interesting problems along the way.


The article omits the details of mechanical work, such as reading symbols and relocations from ELF and Mach-O files. The emphasis is on the subtle points that I ran into during implementation and that may be useful to someone who, like me not long ago, is looking for answers.


The essence


Let's imagine that we have a class (the examples are synthetic; please do not look for meaning in them, only the code matters):


    // Entity.hpp
    #include <string>

    class Entity {
    public:
        Entity(const std::string& description);
        ~Entity();
        void printDescription();
        static int getLivingEntitiesCount();

    private:
        static int m_livingEntitiesCount;
        std::string m_description;
    };

    // Entity.cpp
    #include <iostream>
    #include "Entity.hpp"

    int Entity::m_livingEntitiesCount = 0;

    Entity::Entity(const std::string& description)
        : m_description(description)
    {
        m_livingEntitiesCount++;
    }

    Entity::~Entity() {
        m_livingEntitiesCount--;
    }

    int Entity::getLivingEntitiesCount() {
        return m_livingEntitiesCount;
    }

    void Entity::printDescription() {
        std::cout << m_description << std::endl;
    }

Nothing special, except a static variable. Now imagine that we want to change the printDescription() method to:


    void Entity::printDescription() {
        std::cout << "DESCRIPTION: " << m_description << std::endl;
    }

What happens after the code is reloaded? Besides the methods of the Entity class, the static variable m_livingEntitiesCount also ends up in the library with the new code. Nothing bad will happen if we just copy the value of this variable from the old code to the new one and keep using the new variable, forgetting about the old one, because all the methods that use this variable directly are in the library with the new code.


C++ is very flexible and rich. And even though the elegance of some C++ solutions borders on code smell, I love this language. For example, imagine that RTTI is not used in your project, yet you need an implementation of an Any class with a reasonably type-safe interface:


    class Any {
    public:
        template <typename T>
        explicit Any(T&& value) { ... }

        template <typename T>
        bool is() const { ... }

        template <typename T>
        T& as() { ... }
    };

We will not go into the implementation details of this class. What matters to us is that the implementation needs some mechanism for unambiguously mapping a type (a compile-time entity) to the value of a variable, for example a uint64_t (a run-time entity), that is, for "numbering" types. With RTTI, we have access to such things as type_info and, more fittingly, type_index. But we have no RTTI. In this case, a fairly common hack (or an elegant solution?) is a function like this:


    template <typename T>
    uint64_t typeId() {
        static char someVar;
        return reinterpret_cast<uint64_t>(&someVar);
    }

Then the implementation of the Any class will look something like this:


    class Any {
    public:
        template <typename T>
        explicit Any(T&& value)
            : m_typeId(typeId<typename std::decay<T>::type>())
            // copy or move value somewhere
        {}

        template <typename T>
        bool is() const {
            return m_typeId == typeId<typename std::decay<T>::type>();
        }

        template <typename T>
        T& as() { ... }

    private:
        uint64_t m_typeId = 0;
    };

For each type, the function is instantiated exactly once, so each instantiation has its own static variable with, obviously, its own unique address. What happens when we reload code that uses this function? Calls to the old version of the function are redirected to the new one. The new one has its own static variable, already initialized (we copied the value and the guard variable). But we are not interested in the value; we use only the address. And the address of the new variable is different. The data has thus become inconsistent: the already created instances of the Any class store the address of the old static variable, the is() method compares it with the address of the new one, and "this Any will never be the same Any again" ©.
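The property that Any relies on can be checked in isolation. The following is a minimal sketch, not the library's code: typeId() is the function from the article, while TypeTag is a hypothetical stand-in for Any that stores only the type id.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// The trick from the article: one static variable per instantiation,
// whose address serves as the identifier of the type.
template <typename T>
std::uint64_t typeId() {
    static char someVar;
    return reinterpret_cast<std::uint64_t>(&someVar);
}

// Hypothetical stand-in for Any: remembers only the id of the stored type.
class TypeTag {
public:
    template <typename T>
    static TypeTag of() {
        return TypeTag(typeId<typename std::decay<T>::type>());
    }

    template <typename T>
    bool is() const {
        return m_typeId == typeId<typename std::decay<T>::type>();
    }

private:
    explicit TypeTag(std::uint64_t id) : m_typeId(id) {}
    std::uint64_t m_typeId;
};
```

Within a single module, TypeTag::of<int>().is<int>() holds and TypeTag::of<int>().is<float>() does not. After a reload with the plain copying strategy, the id stored in an old TypeTag would be the address of the old someVar, while is() would compare it against the new one.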


Plan


To solve this problem, we need something smarter than plain copying. After a couple of evenings of googling and reading documentation, source code, and system APIs, the following plan took shape in my head:


  1. After building the new code, we walk through its relocations.
  2. From these relocations, we get all the places in the code that use static (and sometimes global) variables.
  3. Instead of the addresses of the new versions of the variables, we write the addresses of the old versions into the relocation sites.

In this case, there will be no references to the new data, and the entire application will keep working with the old versions of the variables, down to the address. This should work. It cannot fail.


Relocation


When the compiler generates machine code, at every place where a function is called or a variable's address is loaded it reserves enough bytes to hold the real address of that variable or function, and it also generates a relocation. It cannot write the real address right away, because at this stage it does not know it. After linking, functions and variables may end up in different sections and at different places within those sections, and the sections themselves may be loaded at different addresses at run time.


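A relocation records where the address must be written, which symbol's address must be written there, and the type of the relocation, which determines the formula for computing the value and the number of bytes reserved for it.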



Relocations are represented differently on different operating systems, but in the end they all work on the same principle. For example, in ELF (Linux), relocations live in special .rela sections (.rel in the 32-bit version), each of which refers to the section whose addresses need fixing (for example, .rela.text holds the relocations applied to the .text section), and each record stores information about the symbol whose address must be written at the relocation site. In Mach-O (macOS), it is somewhat the other way around: there is no separate section for relocations; instead, each section contains a pointer to the relocation table that applies to it, and each record of this table refers to the symbol being relocated.
For example, for the following code (compiled with the -fPIC option):


    int globalVariable = 10;

    int veryUsefulFunction() {
        static int functionLocalVariable = 0;
        functionLocalVariable++;
        return globalVariable + functionLocalVariable;
    }

the compiler will create the following relocation section on Linux:


    Relocation section '.rela.text' at offset 0x1a0 contains 4 entries:
    Offset            Info              Type               Symbol's Value    Symbol's Name + Addend
    0000000000000007  0000000600000009  R_X86_64_GOTPCREL  0000000000000000  globalVariable - 4
    000000000000000d  0000000400000002  R_X86_64_PC32      0000000000000000  .bss - 4
    0000000000000016  0000000400000002  R_X86_64_PC32      0000000000000000  .bss - 4
    000000000000001e  0000000400000002  R_X86_64_PC32      0000000000000000  .bss - 4

and the following relocation table on macOS:


    RELOCATION RECORDS FOR [__text]:
    000000000000001b X86_64_RELOC_SIGNED __ZZ18veryUsefulFunctionvE21functionLocalVariable
    0000000000000015 X86_64_RELOC_SIGNED _globalVariable
    000000000000000f X86_64_RELOC_SIGNED __ZZ18veryUsefulFunctionvE21functionLocalVariable
    0000000000000006 X86_64_RELOC_SIGNED __ZZ18veryUsefulFunctionvE21functionLocalVariable
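As an illustration of how the Linux listing is read, the Info column packs two fields: the symbol table index in the high 32 bits and the relocation type in the low 32 bits. The sketch below decodes it. RelaEntry mirrors the layout of Elf64_Rela from <elf.h> but is redeclared here as an assumption, so that the example also compiles on systems without that header; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Same layout as Elf64_Rela; redeclared for portability of the sketch.
struct RelaEntry {
    std::uint64_t offset;  // r_offset: where in the section the address is written
    std::uint64_t info;    // r_info: symbol index (high 32 bits) and type (low 32 bits)
    std::int64_t addend;   // r_addend: constant term of the relocation formula
};

// Numeric values of the two relocation types that appear in the listing.
const std::uint32_t kRelocPc32 = 2;      // R_X86_64_PC32
const std::uint32_t kRelocGotPcRel = 9;  // R_X86_64_GOTPCREL

// Index of the referenced symbol in the symbol table.
inline std::uint32_t symbolIndex(std::uint64_t info) {
    return static_cast<std::uint32_t>(info >> 32);
}

// Relocation type, which selects the formula applied by the linker.
inline std::uint32_t relocType(std::uint64_t info) {
    return static_cast<std::uint32_t>(info & 0xffffffffu);
}
```

For the first entry of the listing, 0000000600000009 decodes to symbol 6 and type 9 (R_X86_64_GOTPCREL); the other three entries, 0000000400000002, decode to symbol 4 and type 2 (R_X86_64_PC32).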

And this is what the veryUsefulFunction() function looks like in the object file (on Linux):


    0000000000000000 <_Z18veryUsefulFunctionv>:
       0:  55                     push   rbp
       1:  48 89 e5               mov    rbp,rsp
       4:  48 8b 05 00 00 00 00   mov    rax,QWORD PTR [rip+0x0]
       b:  8b 0d 00 00 00 00      mov    ecx,DWORD PTR [rip+0x0]
      11:  83 c1 01               add    ecx,0x1
      14:  89 0d 00 00 00 00      mov    DWORD PTR [rip+0x0],ecx
      1a:  8b 08                  mov    ecx,DWORD PTR [rax]
      1c:  03 0d 00 00 00 00      add    ecx,DWORD PTR [rip+0x0]
      22:  89 c8                  mov    eax,ecx
      24:  5d                     pop    rbp
      25:  c3                     ret

and this is what it looks like after the object file is linked into a dynamic library:


    00000000000010e0 <_Z18veryUsefulFunctionv>:
      10e0:  55                     push   rbp
      10e1:  48 89 e5               mov    rbp,rsp
      10e4:  48 8b 05 05 21 00 00   mov    rax,QWORD PTR [rip+0x2105]
      10eb:  8b 0d 13 2f 00 00      mov    ecx,DWORD PTR [rip+0x2f13]
      10f1:  83 c1 01               add    ecx,0x1
      10f4:  89 0d 0a 2f 00 00      mov    DWORD PTR [rip+0x2f0a],ecx
      10fa:  8b 08                  mov    ecx,DWORD PTR [rax]
      10fc:  03 0d 02 2f 00 00      add    ecx,DWORD PTR [rip+0x2f02]
      1102:  89 c8                  mov    eax,ecx
      1104:  5d                     pop    rbp
      1105:  c3                     ret

The function contains 4 places, each with 4 bytes reserved for the address of a real variable.
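The displacements in the linked listing can be checked by hand. During execution, rip holds the address of the next instruction, so the referenced address is the instruction address plus the instruction length plus the 32-bit displacement. The sketch below does this arithmetic; ripTarget is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Address referenced by a RIP-relative operand: the address of the next
// instruction (instrAddr + instrLen) plus the signed 32-bit displacement.
inline std::uint64_t ripTarget(std::uint64_t instrAddr, std::uint64_t instrLen,
                               std::int32_t disp) {
    return instrAddr + instrLen + static_cast<std::int64_t>(disp);
}
```

With the values from the listing, the three accesses at 0x10eb, 0x10f4 and 0x10fc (6 bytes each) all resolve to 0x4004, which is where functionLocalVariable ended up, and the 7-byte instruction at 0x10e4 resolves to 0x31f0, the slot filled in for the R_X86_64_GOTPCREL relocation of globalVariable.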


Each system has its own set of possible relocations. Linux on x86-64 has as many as 40 relocation types. macOS on x86-64 has only 9. All relocation types can be divided into 2 groups:


  1. Link-time relocations - relocations used in the process of linking object files to an executable file or dynamic library
  2. Load-time relocations - relocations used at the time of loading the dynamic library into the process memory

The second group includes relocations of exported functions and variables. When a dynamic library is loaded into the process memory, for every dynamic relocation (this includes relocations of global variables) the linker looks for the symbol's definition in all libraries already loaded, including the program itself, and uses the address of the first suitable symbol. So nothing needs to be done with these relocations: the linker itself will find the variable from our application, because the application comes earlier in its list of loaded libraries and programs, and will substitute its address into the new code, ignoring the new version of the variable.


There is a subtle point concerning macOS and its dynamic linker. macOS implements the so-called two-level namespace mechanism. Roughly speaking, when a dynamic library is loaded, the linker first looks for symbols in that library itself and only searches the others if it does not find them there. This is done for performance, so that relocations are resolved quickly, which is, on the whole, logical. But it breaks our scheme for global variables. Fortunately, ld on macOS has a special flag, -flat_namespace, and if the library is built with this flag, the symbol lookup algorithm is identical to the one on Linux.


The first group includes the relocations of static variables, which is exactly what we need. The only problem is that these relocations are absent from the linked library, since the linker has already resolved them. Therefore, we will read them from the object files from which the library was built.
The possible relocation types are also limited by whether the code is built as position-dependent or not. Since we build our code in PIC (position-independent code) mode, only relative relocations are used. In total, the relocations we are interested in are the relative link-time relocations of the code section that refer to static and global variables, such as the R_X86_64_PC32 entries against .bss in .rela.text on Linux and the X86_64_RELOC_SIGNED entries of __text on macOS shown above.



A subtle point concerns the __common section on macOS. Linux has a similar *COM* section. Global variables can end up in this section. However, while I was testing and compiling a bunch of code fragments, on Linux the relocations of symbols from the *COM* section were always dynamic, as for ordinary global variables. On macOS, by contrast, such symbols were sometimes relocated at link time, if the function and the symbol were in the same file. Therefore, on macOS it makes sense to take this section into account when reading symbols and relocations.


Great, now we have the set of all the relocations we need. What do we do with them? The logic is simple. When the linker links the library, it writes the symbol's address, computed by a specific formula, at the relocation address. For our relocations on both platforms, this formula contains the symbol's address as a term. Thus, the computed value already written into the function body has the form:


 resultAddr = newVarAddr + addend - relocAddr 

We also know the addresses of both versions of the variable: the old one, already living in the application, and the new one. All that remains is to recompute the value using the formula:


 resultAddr = resultAddr - newVarAddr + oldVarAddr 

and write it at the relocation address. After that, all functions in the new code will use the already existing versions of the variables, and the new variables will simply sit there doing nothing. Just what we need! But there is one subtle point.
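The two formulas above translate into a few lines of code. This is a minimal sketch under stated assumptions, not the library's implementation: retargetRel32 is a hypothetical helper, the relocation site is assumed to be a 4-byte field, and all addresses are assumed to be ordinary user-space addresses that fit into a signed 64-bit integer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Rewrites one 4-byte relocation site so that it refers to the old variable.
// `field` points at the bytes the linker filled in with
// newVarAddr + addend - relocAddr. Returns false and leaves the field
// untouched if the recomputed value does not fit into 32 bits.
inline bool retargetRel32(unsigned char* field, std::uint64_t newVarAddr,
                          std::uint64_t oldVarAddr) {
    std::int32_t resultAddr = 0;
    std::memcpy(&resultAddr, field, sizeof(resultAddr));

    // resultAddr = resultAddr - newVarAddr + oldVarAddr
    const std::int64_t patched = static_cast<std::int64_t>(resultAddr)
                               - static_cast<std::int64_t>(newVarAddr)
                               + static_cast<std::int64_t>(oldVarAddr);
    if (patched < std::numeric_limits<std::int32_t>::min() ||
        patched > std::numeric_limits<std::int32_t>::max()) {
        return false;
    }

    const std::int32_t narrowed = static_cast<std::int32_t>(patched);
    std::memcpy(field, &narrowed, sizeof(narrowed));
    return true;
}
```

Note that the relocation address itself is not needed: it cancels out when the new variable's address is swapped for the old one. The sketch operates on a plain buffer; in a live process the page holding the code would first have to be made writable, for example with mprotect.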


Loading the library with the new code


When the system loads a dynamic library into the process memory, it is free to place it anywhere in the virtual address space. On Ubuntu 18.04, my application is loaded at 0x00400000, and our dynamic libraries are loaded right after ld-2.27.so, at addresses around 0x7fd3829bd000. The distance between the load addresses of the program and the library is far greater than what fits into a signed 32-bit integer. And link-time relocations reserve only 4 bytes for the addresses of the target symbols.
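The size of the problem is easy to quantify. The sketch below is an approximation that compares load addresses rather than exact instruction and variable addresses; reachableWithRel32 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if `to` can be reached from `from` with a signed 32-bit displacement,
// that is, if the two addresses are within roughly 2 GiB of each other.
inline bool reachableWithRel32(std::uint64_t from, std::uint64_t to) {
    // Unsigned subtraction wraps around; reinterpreting the result as signed
    // yields the signed distance for ordinary user-space addresses.
    const std::int64_t distance = static_cast<std::int64_t>(to - from);
    return distance >= std::numeric_limits<std::int32_t>::min() &&
           distance <= std::numeric_limits<std::int32_t>::max();
}
```

For the addresses above, the distance is 0x7fd3829bd000 - 0x00400000 = 0x7fd3825bd000, while the largest positive 32-bit displacement is 0x7fffffff, so reachableWithRel32(0x00400000, 0x7fd3829bd000) is false.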


After poring over the documentation for compilers and linkers, I decided to try the -mcmodel=large option. It makes the compiler generate code without any assumptions about the distance between symbols, so all addresses are 64-bit. But this option does not get along with PIC: it appears that -mcmodel=large cannot be used together with -fPIC, at least on macOS. I still do not understand what the problem is; perhaps macOS has no suitable relocations for this situation.


In the library for Windows, this problem is solved as follows. A chunk of virtual memory, large enough to hold the necessary sections of the library, is allocated by hand near the application's load address. Then the sections are loaded into it by hand, the required access rights are set on the memory pages of the corresponding sections, all relocations are resolved by hand, and the rest is patched. I am lazy. I really did not want to do all this work with load-time relocations, especially on Linux. And why do something that the dynamic linker can already do? After all, the people who wrote it know far more than I do.


Fortunately, the documentation turned out to describe linker options that let you specify where our dynamic library should be loaded (for GNU ld, this is the -Ttext-segment=... option mentioned below).



These options must be passed to the linker when the dynamic library is linked. There are 2 difficulties.
The first concerns GNU ld. For these options to work, the requested address must be a multiple of both the page size and the alignment of the library's PT_LOAD segment.



That is, if the linker has set the alignment to 0x10000000, the library cannot be loaded at 0x10001000, even though that address is aligned to the page size. If one of these conditions is not met, the library is loaded "as usual". My system has GNU ld 2.30, and, unlike LLVM lld, it aligns the PT_LOAD segment to 0x20000 by default, which stands out sharply from the overall picture. To get around this, you need to specify -z max-page-size=0x1000 in addition to the -Ttext-segment=... option. I spent a day before I understood why the library was not being loaded where it should be.


The second difficulty is that the load address must be known when the library is linked. This is not very hard to arrange. On Linux, it is enough to parse the pseudo-file /proc/<pid>/maps, find the unmapped region closest to the program that the library will fit into, and use the start address of this region when linking. The size of the future library can be roughly estimated from the sizes of the object files, or by parsing them and summing the sizes of all sections. In the end, we need not an exact number but an approximate size with a margin.
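The search for a free region can be sketched as follows. This is an illustration under simplifying assumptions, not the library's code: findFreeRegion is a hypothetical helper, it takes the text of /proc/<pid>/maps as a string (entries there are sorted by address), and it only looks for gaps above the program, whereas a complete version would also consider gaps below it. The align parameter corresponds to the alignment requirement described earlier.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Returns the lowest aligned address above the mapping containing
// `programAddr` at which `size` bytes fit into an unmapped gap,
// or 0 if no such gap exists between the listed mappings.
inline std::uint64_t findFreeRegion(const std::string& mapsText,
                                    std::uint64_t programAddr,
                                    std::uint64_t size, std::uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);  // must be a power of two

    std::istringstream maps(mapsText);
    std::string line;
    std::uint64_t prevEnd = 0;
    bool pastProgram = false;

    while (std::getline(maps, line)) {
        // Each line starts with "<begin>-<end> " in hexadecimal.
        const std::size_t dash = line.find('-');
        if (dash == std::string::npos) continue;
        const std::uint64_t begin = std::stoull(line.substr(0, dash), nullptr, 16);
        const std::uint64_t end = std::stoull(line.substr(dash + 1), nullptr, 16);

        if (pastProgram) {
            // Round the end of the previous mapping up to the alignment.
            const std::uint64_t candidate = (prevEnd + align - 1) & ~(align - 1);
            if (candidate >= prevEnd && candidate <= begin &&
                begin - candidate >= size) {
                return candidate;
            }
        }
        if (begin <= programAddr && programAddr < end) pastProgram = true;
        prevEnd = end;
    }
    return 0;
}
```

The returned address is what would then be passed to the linker as the load address, with size being the estimate with a margin mentioned above.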


macOS has no /proc/*; instead, you are expected to use the vmmap utility. The output of the vmmap -interleaved <pid> command contains the same information as /proc/<pid>/maps. But here another difficulty arises. If the application creates a child process that runs this command and passes the identifier of the current process as <pid>, the program hangs for good. As I understand it, vmmap stops the process in order to read its memory mappings, and apparently something goes wrong if that process is the caller. In this case, you need to pass the additional -forkCorpse flag, so that vmmap creates an empty child process from our process, reads the mapping from it, and kills it, thereby not interrupting the program.


In general, this is all we need to know.


Putting it all together


With these modifications, the final code reload algorithm looks like this:


  1. Compile the new code into object files.
  2. Estimate the size of the future library from the object files.
  3. Read the relocations from the object files.
  4. Find a free region of virtual memory next to the application.
  5. Link the dynamic library with the necessary options and load it with dlopen.
  6. Patch the code according to the link-time relocations.
  7. Patch the functions.
  8. Copy the static variables that did not take part in step 6.

Only the guard variables of static variables end up in step 8, so they can be safely copied (thereby preserving the "initialized" state of the static variables themselves).


Conclusion


Since this is purely a development tool, not meant for any kind of production, the worst thing that can happen if yet another library with new code does not fit into memory, or accidentally gets loaded at a different address, is a restart of the application being debugged. When the tests are run, 31 libraries with updated code are loaded into memory one after another.


To be complete, the implementation still lacks 3 substantial pieces:


  1. Currently the library with the new code is loaded into memory next to the program, even though it may contain code from another dynamic library that was loaded far away. To fix this, we need to track which translation units belong to which libraries and to the program, and split the library with the new code when necessary.
  2. Reloading code in a multi-threaded application is still unreliable (you can safely reload only code that runs on the same thread as the library's runloop). To fix this, part of the implementation has to be moved into a separate program, which, before patching, stops the process with all its threads, performs the patching, and resumes it. I do not know how to do this without an external program.
  3. Preventing accidental application crashes after a code reload. After changing the code, you may accidentally dereference an invalid pointer in the new code, after which the application has to be restarted. It is no big deal, but still. It sounds like black magic; I am still thinking about it.

But the current implementation is already useful to me personally, and it is good enough for my main job. It takes a little getting used to, but so far so good.
If I get to these three points and find enough interesting material in their implementation, I will definitely share it.


Demo


Since the implementation allows new translation units to be added on the fly, I decided to record a short video in which I write, from scratch, an indecently simple game about a spaceship roaming the universe and shooting square asteroids. I tried not to write in the "everything in one file" style but, where possible, to keep everything neatly organized, which produced a lot of small files (that is why there is so much typing). Of course, a framework is used for drawing, input, windows, and other things, but the code of the game itself was written from scratch.
The main point is that I launched the application only 3 times: at the very beginning, when it contained only an empty scene, and twice after crashes caused by my own carelessness. The whole game was added incrementally as the code was written. Real time: about 40 minutes. In general, you are welcome to watch.



As always, I welcome any criticism. Thanks!


Link to the implementation



Source: https://habr.com/ru/post/437312/