
Under the hood of Screeps - virtualization in the MMO sandbox for programmers

In this article I will talk about a little-known technology that has found a key application in our online game for programmers. To skip the long wind-up, here is the spoiler right away: as far as we can tell, nobody before us had done the kind of shamanism in native Node.js code that we arrived at after several years of development. The isolated virtual machine engine (open source) that runs under the hood of the project was written specifically for its needs and is currently used in production by us and by one other startup. The isolation capabilities it provides are unique and deserve a proper write-up.


But let's take things in order.


Background


Do you like programming? Not the routine enterprise coding that many of us have to do 40 hours a week, fighting procrastination, pouring in liters of coffee and burning out professionally, but programming as an incomparable, magical process of turning thoughts into a working program, with the pleasure of seeing the code you just wrote come alive on the screen and start living the life its creator has given it. At such moments you want to write the word "Creator" with a capital letter, because the feeling that arises in the process sometimes comes close to awe.



It is a pity that very few real projects tied to earning a daily living can offer such feelings to their developers. Most often, in order not to lose their passion for programming, enthusiasts have to start an affair on the side: a programming hobby, a pet project, a trendy open-source library, a simple Python script to automate their smart home ... or the behavior of a character in some popular online game.


Yes, online games are often an inexhaustible source of inspiration for programmers. Even the very first games in the genre (Ultima Online, EverQuest, not to mention all kinds of MUDs) attracted quite a few craftsmen who were less interested in role-playing and enjoying the fantasy world than in applying their talents to automating anything and everything in the virtual game space. To this day it remains a special discipline in the MMO Olympics: to write a bot clever enough to go unnoticed by the administration while extracting the maximum profit compared to other players. Or compared to other bots, as in EVE Online, where trading on densely populated markets is almost entirely controlled by trading scripts, just like on real exchanges.


The idea of an online game aimed entirely at programmers from the start was in the air. A game in which writing a bot is not a punishable offense but the essence of the gameplay. Where the task is not to perform the same "kill X monsters and find Y items" actions over and over, but to write a script capable of competently performing them on your behalf. And since this is an online game in the MMO genre, you compete with other players' scripts in real time in a single shared game world.


So in 2014 the game Screeps (from the words "scripts" and "creeps") appeared: a real-time MMO strategy sandbox with a single large persistent world, in which players have no influence on what happens except by writing AI scripts for their units. All the mechanics of an ordinary strategy game (resource extraction, unit creation, base building, capturing territory, production and trade) must be programmed by the player through the JavaScript API provided by the game world. The difference from various AI-writing competitions is that the game world, as befits an online game, runs and lives its own life in real time, 24/7, and has done so for the last 4 years, running each player's AI on every game tick.


Enough about the game itself; this should be sufficient for you to understand the technical problems we encountered during development. You can get a better idea from this video, but it is optional:


Video trailer

Technical problems


The essence of the game world's mechanics is as follows: the whole world is divided into rooms, each connected to its neighbors by exits on its four sides. A room is the atomic unit of processing for the game world's state. A room may contain objects (units, for example) that have their own state and receive commands from players on each game tick. A server handler takes one room at a time, executes these commands, changing the state of the objects, and commits the new state of the room to the database. This system scales well horizontally: you can add more handlers to the cluster, and since rooms are architecturally isolated from each other, as many rooms can be processed in parallel as there are handlers running.
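As a rough illustration of why rooms parallelize so well, a handler can be modelled as a pure function from one room's state plus the commands for that tick to the room's next state. The object shape, the `move` command format, and the `processRoom` name below are assumptions made for illustration, not the engine's actual schema.

```javascript
// Hypothetical shapes: room = { name, tick, objects: [{ id, x, y }] },
// commands = { [objectId]: { type: 'move', dx, dy } }.
function processRoom(room, commands) {
  const objects = room.objects.map((obj) => {
    const command = commands[obj.id];
    if (command && command.type === 'move') {
      return { ...obj, x: obj.x + command.dx, y: obj.y + command.dy };
    }
    return obj; // no command this tick: the object's state is unchanged
  });
  // The caller would commit this new state to the database.
  return { ...room, tick: room.tick + 1, objects };
}

const before = { name: 'W1N1', tick: 100, objects: [{ id: 'c1', x: 10, y: 10 }] };
const after = processRoom(before, { c1: { type: 'move', dx: 1, dy: 0 } });
```

Because the function only touches the one room passed to it, any number of handlers can run it on different rooms at the same time.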



At the moment we have 42,060 rooms in the game. The server cluster of 36 quad-core physical machines runs 144 handlers. We use Redis for the queues, and the whole backend is written in Node.js.


That was one stage of the game tick. But where do the commands come from? The specific feature of the game is that there is no interface where you could click on a unit and tell it to go to a certain point or build a certain structure. The most you can do in the interface is place an intangible flag at the desired spot in a room. For a unit to come to that spot and perform the necessary action, your script needs to do something like the following over several game ticks:


module.exports.loop = function () {
    let creep = Game.creeps['Creep1'];
    let flag = Game.flags['Flag1'];
    if (!creep.pos.isEqualTo(flag.pos)) {
        creep.moveTo(flag.pos);
    }
}

So on each game tick you need to take the player's loop function, execute it in a full-fledged JavaScript environment belonging to this particular player (with a Game object created for them), collect the resulting set of orders for units, and pass them on to the next processing stage. It seems pretty simple.
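The per-tick flow described above can be sketched roughly as follows. The `runPlayerTick` helper, the shape of the snapshot, and the order format are hypothetical; the real game exposes `Game` as a global rather than as an argument, and its objects are far richer.

```javascript
// Build a minimal Game object whose methods record orders instead of acting.
function runPlayerTick(loop, snapshot) {
  const orders = [];
  const makePos = ({ x, y }) => ({
    x,
    y,
    isEqualTo(other) { return other.x === x && other.y === y; },
  });
  const Game = { creeps: {}, flags: {} };
  for (const [name, flag] of Object.entries(snapshot.flags)) {
    Game.flags[name] = { pos: makePos(flag) };
  }
  for (const [name, creep] of Object.entries(snapshot.creeps)) {
    Game.creeps[name] = {
      pos: makePos(creep),
      moveTo(target) {
        orders.push({ creep: name, type: 'moveTo', x: target.x, y: target.y });
      },
    };
  }
  loop(Game);    // run the player's code for this tick
  return orders; // handed over to the room-processing stage
}

const playerLoop = (Game) => {
  const creep = Game.creeps['Creep1'];
  const flag = Game.flags['Flag1'];
  if (!creep.pos.isEqualTo(flag.pos)) creep.moveTo(flag.pos);
};
const orders = runPlayerTick(playerLoop, {
  creeps: { Creep1: { x: 5, y: 5 } },
  flags: { Flag1: { x: 20, y: 30 } },
});
```

The player's code never changes the world directly; it only produces a list of orders that the room handlers apply later.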



Problems begin when it comes to implementation details. At the moment we have 1,600 active players in the world. Some players' scripts can hardly be called "scripts" any more: they contain up to 25k lines of code, are compiled from TypeScript or even from C/C++/Rust via WebAssembly (yes, we support wasm!), and implement the concept of a true miniature OS, in which players have built their own pool of game task-processes managed by a kernel that takes as many tasks as can be completed on a given game tick, executes them, and puts the unfinished ones back in the queue until the next tick. Since a player's CPU and memory resources are limited on each tick, this model works well. It is not mandatory, though: to start playing, a beginner only needs a 15-line script, which is already written for them in the tutorial.
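A minimal sketch of such a player-side kernel might look like this. The `Kernel` class is invented for the example, and the CPU meter is passed in as a `getUsedCpu` callback (in the game this role would be played by `Game.cpu.getUsed()`). It runs queued tasks until the budget for the tick is spent and leaves the rest in the queue.

```javascript
class Kernel {
  constructor(getUsedCpu, cpuBudget) {
    this.getUsedCpu = getUsedCpu; // returns the CPU spent so far this tick
    this.cpuBudget = cpuBudget;
    this.queue = [];
  }

  schedule(name, run) {
    this.queue.push({ name, run });
  }

  // Run tasks in order until the budget is used up; return the names that ran.
  // The check happens before each task, so the last one may overshoot slightly.
  tick() {
    const ran = [];
    while (this.queue.length > 0 && this.getUsedCpu() < this.cpuBudget) {
      const task = this.queue.shift();
      task.run();
      ran.push(task.name);
    }
    return ran; // anything still in this.queue waits for the next tick
  }
}

// Usage with a fake CPU meter: every task "costs" 4 units, the budget is 10.
let used = 0;
const kernel = new Kernel(() => used, 10);
for (const name of ['harvest', 'build', 'defend', 'trade']) {
  kernel.schedule(name, () => { used += 4; });
}
const firstTick = kernel.tick();  // harvest, build and defend run; trade is deferred
used = 0;                         // the meter starts from zero on the next tick
const secondTick = kernel.tick(); // trade runs now
```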


But now let's remember that a player's script must run in a real JavaScript machine. And that the game runs in real time, which means each player's JavaScript machine must exist continuously and work at a set pace so as not to slow down the game as a whole. The stage that executes player scripts and produces orders for units works on roughly the same principle as room processing: each player script is a task picked up by one handler from a pool, and many handlers work in parallel across the cluster. But unlike the room-processing stage, this one comes with many difficulties.


First, you cannot simply hand tasks out to handlers at random on each tick, as you can with rooms. A player's JavaScript machine must run without interruption: each tick is just another call of the loop function, but the global context must remain the same. Roughly speaking, the game allows you to do something like this:


let counter = 0;
let song = ['EX-', 'TER-', 'MI-', 'NATE!'];

module.exports.loop = function () {
    Game.creeps['DalekSinger'].say(song[counter]);
    counter++;
    if (counter == song.length) {
        counter = 0;
    }
}


Such a creep will sing one line of the song on each game tick. The index of the current line, counter, lives in the global context, which persists between ticks. If this player's script were executed in a new handler process every time, the context would be lost. This means all players must be assigned to specific handlers, and the assignment should change as rarely as possible. But what about load balancing? One player may spend 500 ms of execution time on a node and another only 10 ms, and this is very hard to predict in advance. If 20 players taking 500 ms each suddenly land on one node, that node will take 10 seconds to finish, during which everyone else waits idle. And rebalancing these players by moving them to other nodes means losing their context.


Second, a player's environment must be well isolated from other players and from the server environment. This is a matter not only of security but also of comfort for the users themselves. If a neighboring player running on the same cluster node as me does something wasteful, generates a lot of garbage, or otherwise misbehaves, I should not feel it. Since the CPU resource in the game is the execution time of the script (measured from the start to the end of the loop method), resources wasted on extraneous tasks during the execution of my script can hurt a lot, because that CPU time comes out of my budget.


In trying to cope with these problems, we have come up with several solutions.


First version


The first version of the game engine was based on two basic things: child processes started by each handler, and the vm module in Node.js.



It looked like this. On each machine in the cluster there were 4 handler processes for game scripts (one per core). On receiving a new task from the queue of game scripts, a handler requested the necessary data from the database and passed it to a child process, which was kept running permanently, restarted on failure, and reused by different players. The child process, being isolated from the parent (which contained the cluster's business logic), knew how to do only one thing: create a Game object from the received data and start the player's virtual machine. For that we used the vm module in Node.js.
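A simplified sketch of what such a child process might do with Node's built-in `vm` module: create one context per player, evaluate the player's code once, then call `loop` on every tick. The `createPlayerSandbox` helper and the 500 ms timeout are assumptions; the article does not show the actual engine code.

```javascript
const vm = require('node:vm');

function createPlayerSandbox(playerCode) {
  // The sandbox object becomes the global object of the new context.
  const sandbox = { module: { exports: {} }, Game: null };
  const context = vm.createContext(sandbox);
  // Evaluate the player's module once; its top-level state lives in the context.
  new vm.Script(playerCode).runInContext(context, { timeout: 500 });
  return {
    runTick(game) {
      sandbox.Game = game; // a fresh Game object for this tick
      vm.runInContext('module.exports.loop()', context, { timeout: 500 });
    },
  };
}

const player = createPlayerSandbox(`
  let counter = 0;
  module.exports.loop = function () { Game.said.push(counter++); };
`);
const said = [];
for (let tick = 0; tick < 3; tick++) {
  player.runTick({ said });
}
// said is now [0, 1, 2]: the counter survived between ticks
```

Recreating the sandbox, for instance in another process, would reset `counter` to 0, which is exactly the loss of context described earlier.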


Why was this solution imperfect? Strictly speaking, it solved neither of the two problems above.


vm works in the same single-threaded mode as Node.js itself. Therefore, to have four parallel handlers on a 4-core machine, one per core, you need 4 processes. Moving a player "living" in one process to another process leads to a complete re-creation of the global context, even if it happens within the same machine.



In addition, vm does not actually create a completely isolated virtual machine. All it does is create an isolated context, or scope, while the code executes in the same instance of the JavaScript virtual machine from which vm.runInContext is called. That is, in the same instance in which other players are running. Although players are separated into isolated global contexts, as parts of the same virtual machine they share a common heap and a common garbage collector, and they generate garbage together. If player "A" generated a lot of garbage while running their script and finished, and control then passed to player "B", a collection of all the garbage in the process may well be triggered at that moment, and player "B" will pay with their CPU time for collecting someone else's garbage. Not to mention that all contexts run in the same event loop, so it is theoretically possible for someone else's promise to execute at any moment, although we tried to prevent that. Also, vm gives no control over how much heap memory is allocated for script execution: the entire process memory is available.
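The difference between an isolated context and an isolated machine can be seen in a few lines. In this sketch (the variable names are illustrative), two `vm` contexts cannot see each other's globals, yet both hold a reference to the very same host object, because they live in one heap inside one V8 instance.

```javascript
const vm = require('node:vm');

// One host object handed to two separate contexts.
const shared = { log: [] };
const contextA = vm.createContext({ shared });
const contextB = vm.createContext({ shared });

// Player A defines a global and writes into the shared object.
vm.runInContext('var secret = "A"; shared.log.push("from A");', contextA);

// Player B cannot see A's global, but sees A's write to the shared object.
vm.runInContext('shared.log.push(typeof secret);', contextB);

// shared.log is now ['from A', 'undefined']
```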


isolated-vm


There is a wonderful man in this world named Marcel Laverdet. Some know him as the author of the node-fibers library, others as the person who hacked Facebook and was then hired to work there. And to us he is remarkable because he generously backed our very first crowdfunding campaign and remains a big fan of Screeps to this day.


Our project has been open source for several years now: the game server is published on GitHub. Although the official client is sold through Steam, alternative versions of it exist, and the server itself is available for study and modification at any scale, which we strongly encourage.


And one day Marcel wrote to us: "Guys, I have solid experience in native C/C++ development for Node.js, and I like your game, but I don't like everything about how it works. Let's write a completely new technology for running virtual machines under Node.js, specifically for Screeps?"


Since Marcel did not ask for money, we could not refuse. After a few months of our cooperation, the isolated-vm library was born. And it changed everything.


isolated-vm differs from vm in that it isolates not a context but an isolate, in V8 terms. Without going into details, this means that a full-fledged separate instance of the JavaScript machine is created, which has not only its own global context but also its own heap and garbage collector, and runs within a separate event loop. The downsides: each running machine requires a small RAM overhead (about 20 MB), and you cannot pass objects into the machine or call functions inside it directly; all exchange must be serialized. That is where the downsides end; otherwise it is simply a panacea!
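The serialization constraint can be illustrated without the library itself. The sketch below uses the standard `structuredClone` function (available in Node.js 17 and later) as a stand-in for crossing the isolate boundary. This is a substitution made for illustration, not isolated-vm's own API: plain data arrives as an independent copy, while functions cannot be copied across.

```javascript
const snapshot = { creeps: { Creep1: { x: 10, y: 20 } } };

// Data crossing the boundary is copied, not shared.
const insideCopy = structuredClone(snapshot);
insideCopy.creeps.Creep1.x = 11;
const hostUnchanged = snapshot.creeps.Creep1.x === 10;

// Functions are not cloneable, so methods cannot simply be passed across.
let functionCrossed = true;
try {
  structuredClone({ moveTo() {} });
} catch (err) {
  functionCrossed = false; // structuredClone rejects functions with a DataCloneError
}
```

In practice this means the host has to send a player's machine a data snapshot of the world and get data (for example, a list of orders) back, rather than handing over live objects.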



Now it is really possible to run each player's script in its own completely isolated space. A player has their own 500 MB of heap, and if it runs out, it is your own heap that ran out, not the shared process heap. If you generated garbage, it is your own garbage, and you are the one who collects it. Pending promises will execute only when your isolate next receives control, and not before. As for security: there is no way to get access to anything outside the isolate, unless you find a vulnerability at the V8 level.


But what about balancing? Another advantage of isolated-vm is that it runs machines from the same process but in separate threads (Marcel's experience with node-fibers came in handy here). On a 4-core machine we can create a pool of 4 threads and run 4 machines in parallel at a time. And since they are all within the same process and therefore share memory, we can move any player from one thread to another within this pool. Although each player remains tied to one specific process on one specific machine (so as not to lose the global context), balancing across 4 threads turns out to be enough to distribute "heavy" and "light" players between nodes so that all handlers finish their work simultaneously and on time.
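To see why a shared pool helps, here is a small simulation, not the engine's actual scheduler. `pinnedMakespan` models players fixed to a thread in advance, before their cost is known, while `pooledMakespan` models a pool in which whichever thread frees up first takes the next player. The function names and execution times are made up for the example.

```javascript
// Players are split into equal consecutive groups, one group per thread.
function pinnedMakespan(costsMs, threadCount) {
  const totals = new Array(threadCount).fill(0);
  const perThread = Math.ceil(costsMs.length / threadCount);
  costsMs.forEach((cost, i) => {
    totals[Math.floor(i / perThread)] += cost;
  });
  return Math.max(...totals);
}

// Each next player goes to the thread that becomes free first.
function pooledMakespan(costsMs, threadCount) {
  const busyUntil = new Array(threadCount).fill(0);
  for (const cost of costsMs) {
    const next = busyUntil.indexOf(Math.min(...busyUntil));
    busyUntil[next] += cost;
  }
  return Math.max(...busyUntil);
}

const costsMs = [500, 500, 500, 500, 10, 10, 10, 10];
const pinned = pinnedMakespan(costsMs, 4); // 1000: heavy players end up sharing threads
const pooled = pooledMakespan(costsMs, 4); // 510: heavy players are spread out
```

The pooled variant needs no prediction of how long a player will take, which matters because, as noted earlier, execution time is very hard to know in advance.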


After rolling this feature out experimentally, we received a huge amount of positive feedback from players whose scripts began to run much better, more stably and more predictably. It is now our default engine, although players can still opt in to the legacy runtime purely for backward compatibility with old scripts (some players deliberately relied on the specifics of the shared environment).


Of course, there is still room for further optimization, and there are other interesting areas of the project where we have solved various technical problems. But more about that another time.



Source: https://habr.com/ru/post/437836/