The slowest way to the fastest*** Advent Of Code solution

I got here just in time

Advent Of Code 2024 is a wonderful collection of 25 little coding challenges, allowing hundrets of thousands of aspiring and established programmers to learn something new, or just have a lot of fun solving puzzles under holiday lights.

On Day 17 - Chronospatial Computer, we were asked to calculate the output for a simple 3-bit (micro-) computer program, that was using a very limited instruction set.

Here's how I slowly created the fastest*** solution to this delightful puzzle.

Divide by 2

The following example program was provided:

Program: 0,1,5,4,3,0

These 6 numbers, together with the given instruction set descriptions decode to the following steps:

0, 1: Divide the value in register "A" by 2^1 (also known as the number 2).
5, 4: Output the value of register "A", modulo 8
3, 0: If register "A" is not zero, jump back to the first instruction (instruction 0).

Simple enough

The following C++ code shows the initial solution to the problem. A simple virtual machine implementation that parses the input program string of numbers, and performs the relevant commands:

while (pc < program.size()) {
  const auto op  = program.at(pc++);
  const auto arg = program.at(pc++);
  switch (op) {
    case OpCode::ADV: RA() = RA() / (1 << COMBO(arg));  break;
    case OpCode::BXL: RB() ^= arg;                      break;
    case OpCode::BST: RB() = COMBO(arg) & 0x7;          break;
    case OpCode::JNZ: pc = RA() != 0 ? arg : pc+1;      break;
    case OpCode::BXC: RB() ^= RC();                     break;
    case OpCode::OUT: OUT8(COMBO(arg));                 break;
    case OpCode::BDV: RB() = RA() / (1 << COMBO(arg));  break;
    case OpCode::CDV: RC() = RA() / (1 << COMBO(arg));  break;
    default: return;
  }
}

No problem! Part 1 of the Advent Of Code puzzle done. Gold stars all around!

But ...

"Limited" Instruction Set? Sign me up!

Part 2 of the Advent of Code puzzle asks for an answer, that theoretically could be derived by brute forcing the solution. In fact, the example given for Part 2 can be brute forced by running the same program using the above "virtual machine" implementation many times to find the correct solution.

The simple virtual machine above performs the steps a compiler and/or interpreter of most program languges does. It reads the program one instruction at a time and then performs the instructions on the host CPU, storing the relevant results.

But what if you didn't have to re-evaluate the program every time you want to run it?

Just-in-time compilation

Your web browser (likely) does this with JavaScript, Mono does this with C#, Android does it (did it) with ART. So let's do it for Advent of Code!

Given the limited instruction set of this toy computer, it provides an excellent example for learning about how to compile code to native instructions (x86_64 in this case), how to acquire memory and make it executable, and how to access memory shared with C++ parts of the code.

Where to, Mr. Von Neumann?

Owing to the fact that many (most?) modern comupters still utilize Von Neumann architecture, we can allocate memory and compile our littlle mircro-processor instructions directly into x86_64 microcode instructions and place them into memory.

To do this, we allocate an arbitrarily sized chunk of memory and make sure we can write to it (error handling removed from code for clarity):

auto allocateWritable() -> std::array<uint8_t, SIZE>* {
  void* mmaped = mmap(nullptr, SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return new (mmaped) std::array<uint8_t, SIZE>{};
}

This will give us a page-aligned chunk of memory to work with. And for a tiny modicum of security, this memory is not executable yet. Placing instructions there, would not allow us to run the program yet.

Input and output

The micro-computer described in Advent of Code uses three input registers (A, B and C) and produces a single series of 3-bit numbers. To make the Just-in-time (JIT) compiler easier, we are using three of the 64-bit registers that are (on Linux) used to pass arguments to a function as our A, B and C registers.

So %rdi is used as register A, %rsi as register B and finally %rdx is used for register C.

On the output side, the three bit values are simply encoded into a single 64-bit register and returned as a function return value in the %rax register.

Compiling!

Here is an example of the "OUT" instruction that we also referenced in the example above:

void emit_OUT(uint8_t operand, auto&& emit) {
  if (isLiteral(operand)) {
    emit(std::array<uint8_t, 8>{
        0x48, 0xc1, 0xe0, 0x03,             // shl    $0x3,%rax
        0x48, 0x83, 0xc8, literal(operand)  // or     <op>,%rax
    });

  } else {
    const auto ra_rb_rc = std::array<uint8_t, 3>{0xf8, 0xf0, 0xd0};
    const auto from     = ra_rb_rc[operand - 4];
    emit(std::array<uint8_t, 14>{
        0x48, 0xc1, 0xe0, 0x03,  // shl    $0x3,%rax
        0x49, 0x89, from,        // mov    <from>,%r8
        0x49, 0x83, 0xe0, 0x07,  // and    $0x7,%r8
        0x4c, 0x09, 0xc0,        // or     %r8,%rax
    });
  }
}

This instruction will either place a single 3-bit literal, or the contents of either the A, B or C registers, into the output register. Since a single 64-bit register is used for the output, the value is first left-shifted by 3 bits to make room for the new value.

Running the code

Before we can run the code, we first have to signal to the operating system kernel that the memory we have allocated above should now be made executable. In the process, we also remove the write flag so the memory can no longer be written to. This is a minimal safety measure.

mprotect(program, program->size(), PROT_READ | PROT_EXEC)

Now we're ready to run! We can now simply force a cast of the pointer to an array-of-bytes that holds our program to a pointer to a function that takes three 64-bit input values, and produces one 64-bit return value:

using chronospatial_computer = size_t (*)(size_t, size_t, size_t);

auto* compute = reinterpret_cast<chronospatial_computer>(x86);

And finally, lets run a program with the input values 10 for register A, 20 for register B and 30 for register C:

auto result = (compute)(10, 20, 30);

Are we fast now?

So, after all this and a full JIT compiler implementation, complete with unit tests later, are we fast now?

Yes! It's super fast now. I would contend, this is the fastest*** possible way to run the micro-computer program provided by the puzzle on this architecture. Sure, the assembly could be optimized more, sure, we could also compile to SIMD instructions somehow. There are many caveats to the fastest claim. But it will be way faster to execute than any implemntation trying to read the program every time.

Can we brute force now?

Uhm, no. Even if the run time of the program is now measured in microseconds, it would take many years to arrive at the given solution for part 2 of the advent puzzle in time. :)

JIT was necessary, though, right?!

Nope. Not at all.

For part 1 of the problem, the program must be executed only once. So doing a just-in-time compilation first and THEN running the program is actually slower than just running it :).

For part 2, a simple algorithm does need a few thousand runs of the program potentially to arrive at a solution, which even when interpreting every time can be done in milliseconds on modern hardware.

Was it worth it?

Absolutely! While not being helpful in the slightest for solving the actual Advent of Code puzzle, this was a fantastic opportunity to learn / brush up on the concepts of Just-in-time compilation, actually build a compiler with a tiny instruction set, learn about memory protection and brush up on x86_64 assembly (which is an absulte pain....).

Reference

Advent of Code puzzle and description:
https://adventofcode.com/2024/day/17

Full code and unit tests here:
https://github.com/int2str/advent2024_public/blob/main/17/day_17_vm.hh

@int2str.net

2024-12-29T16:39:53.187Z