Prospero with Cranelift JIT and SIMD

@aviva.gay

When I saw the prospero challenge by @mattkeeter.com, it looked right up my alley. If you haven't read that yet, go read it before proceeding. You can find the finished source code for this here.

I've been working on a Cranelift frontend for a toy programming language as part of an honors project (nothing very cool to show off yet) and immediately thought translating the code directly into Cranelift intermediate representation (IR) and evaluating it would be a fun approach. While Cranelift doesn't do a ton of optimizations compared to something like LLVM, it still has some basic ones that I expect would help here, and compared to writing the JIT assembly myself it should be fairly portable. I've also found it to have an extremely nice API.

prelude: This is my first blog post (ever?) so please be kind. I'm realizing as I write this that it's very hard to write a good one! I'm also recovering from pneumonia, so I might not be 100% on my game. I'm dysgraphic and something isn't working with the spell checker in whitewind, so please let me know about any typos.

getting started

I initially thought I'd copy my toy lang and edit it down, but it turns out that prospero is really simple from a language perspective. Each line is assigned to a variable, but the variable names are just incrementing hex values, corresponding to zero indexed line numbers. Leave your hashmaps at home! Moreover, you don't need any fancy tokenization or parsing. The provided Python just splits each line, and I decided to do the same for my impl.
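For the curious, the variable_index helper you'll see in the snippet below is nothing more than hex parsing; something like this should do it (a sketch, assuming names look like _0, _1, ... _1a):

// sketch of variable_index: strip the leading underscore, parse the rest as hex
fn variable_index(name: &str) -> usize {
    usize::from_str_radix(name.trim_start_matches('_'), 16).unwrap()
}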

Finally, there aren't any immediate values, meaning literal values used as part of an expression. For example, the equation x + 5 has the immediate value of 5. If you wanted to translate this into Cranelift, you'd either need to use the imm flavor of an instruction, like iadd_imm, or create a const and then use its value for traditional add. Speaking of, the docs page for InstBuilder, the part of Cranelift that you call methods on to insert instructions, has to be one of my favorite docs.rs pages ever—if you are working with Cranelift check it out!
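To make that concrete, here's the difference sketched with integers (x here is just some Value already in scope; as far as I can tell there's no fadd_imm, so float code always takes the const-then-add route):

let sum_a = builder.ins().iadd_imm(x, 5); // imm flavor: the 5 lives in the instruction
let five = builder.ins().iconst(types::I64, 5); // or: materialize a constant...
let sum_b = builder.ins().iadd(x, five); // ...and use the traditional add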

All that said, the code in .vm translates extremely literally into Cranelift IR, since it's already (nearly) in SSA. Some variables are actually reused, but Cranelift will automatically duplicate them as needed for true SSA. For my first impl, the core of my translation looks roughly like this:

for line in src.lines().skip(1) {
    let mut tokens = line.split_ascii_whitespace();
    let key = variable_index(tokens.next().unwrap());
    let op = tokens.next().unwrap();

    // kv maps variable index -> Cranelift Value; since names are assigned in
    // order, pushing keeps each value at its variable's index
    kv.push(match op {
        "const" => {
            let val = tokens.next().unwrap();
            builder.ins().f64const(val.parse::<f64>().unwrap())
        }
        // ...
        "add" | "sub" | "mul" | "div" | "min" | "max" => {
            let a = kv[variable_index(tokens.next().unwrap())];
            let b = kv[variable_index(tokens.next().unwrap())];
            match op {
                "add" => builder.ins().fadd(a, b),
                "sub" => builder.ins().fsub(a, b),
                "mul" => builder.ins().fmul(a, b),
                "div" => builder.ins().fdiv(a, b),
                "max" => builder.ins().fmax(a, b),
                "min" => builder.ins().fmin(a, b),
                _ => unreachable!(),
            }
        }
        _ => panic!("unhandled opcode {op}"),
    });
}

Some of the code probably isn't idiomatic Cranelift (so please don't reference this repo and blame me), but it's hard to find modern examples—and a lot has evolved. For example, Cranelift used to have a separate boolean type, but now booleans are just i8, and lots of reference code online will still use the old type.

There are some fun complexities around Cranelift JIT that I don't fully understand. When you get the finalized function for your Cranelift IR, you need to transmute it into a function pointer:

let code: *const u8 = module.get_finalized_function(func);
unsafe { mem::transmute::<_, extern "C" fn(f64, f64) -> i8>(code) }

I think this is fine? For my toy language's REPL, it posed issues since the size of the return value wasn't known, but in this case I believe the unsafe part is assuming that the function really has those params and return values (and that we actually have a pointer to a function and not some other memory location). If the signature didn't match, Cranelift would have failed to finalize, since we specify it above. I've seen more modern Cranelift code that avoids this, but haven't looked into how.
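For reference, "we specify it above" is the signature we build before defining the function, which the transmute has to agree with. Roughly (the function name and variable names here are mine, not the exact code):

let mut sig = module.make_signature();
sig.params.push(AbiParam::new(types::F64)); // x
sig.params.push(AbiParam::new(types::F64)); // y
sig.returns.push(AbiParam::new(types::I8)); // the sign check result
let func = module
    .declare_function("prospero", Linkage::Export, &sig)
    .unwrap();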

first attempt: initial src

My first attempt went very smoothly! It only took an hour or two. I'd never used the ppm format before (spec, wikipedia), and I was shocked at how simple it is to write out (that's the point of the format, it's really nice). However, it seems like ppm isn't supported in whitewind, so I converted it to png with ffmpeg: ffmpeg -i out.ppm out.png

Running it, I get a ppm file like this:

Obviously not quite right, and also it's slow! It takes about 7 seconds on my machine in debug mode. Honestly, I'm shocked we even see text on the first try. Release mode isn't much better...

❯ time ./target/release/prospero-jit-rs

________________________________________________________
Executed in    4.92 secs    fish           external
   usr time    4.90 secs  499.00 micros    4.90 secs
   sys time    0.01 secs  712.00 micros    0.01 secs

It isn't surprising that this is slow though, since it is computing each pixel in series (ugly code sorry, src):

let xy_fn = translate(src);
let bytes = (0..dim)
    .cartesian_product(0..dim)
    .map(|(x_step, y_step)| (-1f64 + step * x_step as f64, -1f64 + step * y_step as f64))
    .map(|(x, y)| if xy_fn(x, y) != 0 { 255u8 } else { 0u8 })
    .collect::<Vec<_>>();

Still, we beat the naive Python, and even the garbage collected Python, and those submissions are doing parallelism, so this might still pan out. We do cheat a little by using include_str!...
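(The cheat is just that the source gets baked into the binary at compile time instead of read at runtime, something like:)

// compile-time embedding of the source; the path is whatever your repo layout uses
let src = include_str!("../prospero.vm");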

Looking at the stepping I did, it seems that x and y will range from -1 to 0, so let's fix that.

// let step = 1f64 / dim as f64;
let step = 2f64 / dim as f64;

Now we get the full image, but it's still sideways:

I suspect cartesian product isn't reading my mind on the row vs column ordering, so let's replace it with an uglier but more explicit iterator. I had to fiddle with this for a minute; I really hate thinking about coordinate spaces if I don't have to, so instead I used swayimg's auto-reload, bacon's auto-run functionality, dropped the resolution to 256x256, and kept trying until it came out right. This does the trick, don't ask me why.

let bytes = (0..dim)
    .flat_map(|y_step| (0..dim).rev().map(move |x_step| (x_step, y_step)))
    .rev()

Or, at full 1024x1024 resolution:

This leaves us here.

going faster

Personally, getting to use Rayon is one of my favorite parts of Rust. It doesn't usually end up being as fast as thinking hard about parallelism, but it requires no thinking—you just change some iters into par iters, data races are caught at compile time, and work is distributed pretty efficiently, without splitting up tiny amounts of fast work or creating too many threads. Magic!

aside: I did actually try to beat Rayon later, but it was slower—someone with more recent parallel Rust experience can take up that quest.

All we need to do in this case is collect the iterator of (f64, f64) coordinates to check into a vector, then call into_par_iter before we map with our compiled JIT function. Rayon handles running the work in parallel, and collecting it back into the same order for us.
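Sketched out, reusing names from the earlier snippets (the exact code lives in the repo):

use rayon::prelude::*;

// same coordinates as before, just collected into a Vec first...
let coords: Vec<(f64, f64)> = (0..dim)
    .flat_map(|y_step| (0..dim).rev().map(move |x_step| (x_step, y_step)))
    .rev()
    .map(|(x_step, y_step)| (-1f64 + step * x_step as f64, -1f64 + step * y_step as f64))
    .collect();

// ...so into_par_iter can split the work across threads; collect
// puts the pixels back in their original order
let bytes: Vec<u8> = coords
    .into_par_iter()
    .map(|(x, y)| if xy_fn(x, y) != 0 { 255u8 } else { 0u8 })
    .collect();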

This is a lot better! Debug mode only takes a second:

❯ time ./target/debug/prospero-jit-rs

________________________________________________________
Executed in    1.51 secs    fish           external
   usr time    9.85 secs    0.00 micros    9.85 secs
   sys time    0.03 secs  989.00 micros    0.02 secs

Release is a bit faster...

❯ time ./target/release/prospero-jit-rs

________________________________________________________
Executed in  606.17 millis    fish           external
   usr time    8.65 secs      1.05 millis    8.65 secs
   sys time    0.02 secs      0.01 millis    0.02 secs

Normally I'd use hyperfine for this, but the execution time is large enough, and my laptop performance is noisy enough that I don't think it matters much. This is about as fast at 1024x1024 as the serial version is at 256x256, which makes sense.

Interestingly, the debug performance of this is roughly equivalent to the CuPy version by Max Bernstein, with the release version being ~3x faster—and that uses the GPU!

The prospero blog post mentions a GPU-based renderer that has flat performance up to 4096x4096, so let's try that:

❯ time ./target/release/prospero-jit-rs

________________________________________________________
Executed in    7.42 secs    fish           external
   usr time  113.50 secs  508.00 micros  113.50 secs
   sys time    0.18 secs  301.00 micros    0.18 secs

Our impl definitely doesn't have flat performance! 😅

The text does look sharper though!

r in 1024x1024 / r in 4096x4096

Increasing the number of pixels by 16x increased our runtime by ~12x, which I imagine comes from amortizing some of the flat costs and parallelization overhead? Feel free to correct me!

Just as a sanity check, we can see that our Rayon based code is using all our CPUs:

This leaves us here.

jit cost

How much time do we spend creating our JIT function pointer? Out of curiosity, I set the resolution to 1x1, removed Rayon, and ran the renderer—this obviously isn't only the time spent in JIT compilation, but it tells me that Cranelift JIT is extremely fast for this use case compared to the time we spend calling our function. We probably have breathing room to try and optimize the function before we give it to Cranelift, but I'm not up to that right now; maybe a followup.

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):      35.7 ms ±   0.6 ms    [User: 31.6 ms, System: 3.8 ms]
  Range (min … max):    34.5 ms …  37.1 ms    78 runs
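If you wanted an actual number for just the compile step rather than the 1x1 trick, timing the translate call directly would also work; something like:

use std::time::Instant;

// isolate translation + JIT compilation from rendering
let start = Instant::now();
let xy_fn = translate(src);
println!("JIT compile took {:?}", start.elapsed());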

getting sloppy

Do we really need f64 to render a good image? One thing I get good returns from during Advent of Code is using smaller datatypes when I can. Smaller datatypes are great for lots of reasons, including fitting more data in cache and more data in a single vector. I don't have a good intuition for when they speed things up though. Let's try it! The smallest float purportedly supported by Cranelift is f16, but it seems like f16 support in Rust is nightly only. Nix makes switching to nightly really easy!

However, the Rust docs say:

Note that most common platforms will not support f16 in hardware without enabling extra target features, with the notable exception of Apple Silicon (also known as M1, M2, etc.) processors. Hardware support on x86-64 requires the avx512fp16 feature, while RISC-V requires Zfh. Usually the fallback implementation will be to use f32 hardware if it exists, and convert between f16 and f32 when performing math.

In that case, let's start with f32. Since we are comparing things that have similar performance, I'm whipping out hyperfine:

f64

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):     499.4 ms ±  49.2 ms    [User: 7128.8 ms, System: 18.2 ms]
  Range (min … max):   462.4 ms … 618.9 ms    10 runs

f32

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):     520.2 ms ±  58.9 ms    [User: 7405.8 ms, System: 17.7 ms]
  Range (min … max):   448.7 ms … 582.9 ms    10 runs

I saved a golden copy of the ppm file made with f64, and it's identical to the f32 version, so assuming my f64 impl was correct (big assumption) we don't need all the precision.

I could probably close Spotify, Discord, a few hundred tabs and some other projects to bring the variation down. Anyway, both results are very similar—it's totally plausible they perform basically the same. This doesn't look promising yet, but a vectorized version might benefit more... can we make f16 work?

Switching to nightly was easy, but I did need to add an overlay—nobody who knows nix would smile when reading my flake though, so let's not discuss it further. Since we switched compiler versions, let's rerun all the different variants. (I did a paranoid diff against the gold ppm; nothing changed on nightly.)

f64

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):     482.9 ms ±  25.0 ms    [User: 6904.6 ms, System: 19.6 ms]
  Range (min … max):   462.8 ms … 549.8 ms    10 runs

f32

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):     494.8 ms ±  12.8 ms    [User: 6724.0 ms, System: 12.8 ms]
  Range (min … max):   475.6 ms … 515.9 ms    10 runs

This is when I realized switching to f16 isn't a single-line change, since it's still missing some impls on nightly as of March 24th, 2025 (stuff like FromStr and Ieee16: From<f16> are missing, which I need). It's probably solvable, but I think using a single f16 is almost certainly a dead end except on hardware that supports it.

I'll stop using nightly for the next part.

just one more lane bro

So far, we've been using parallelism at the level of threads to speed up our JITed function. But even though we are using the whole CPU, we are neglecting some of the potential of each core. What if we could get more done for each invocation of our function? SIMD (sometimes called vectorization) is a very cool but kind of scary approach where we are able to do the same operation on multiple values at once. This is perfect for us, since we run the exact same function over and over on slightly different values.

Typically, SIMD instructions are emitted by your compiler, and you never need to worry about it. Sometimes you can use them intentionally with intrinsics. At the time of writing, cross platform SIMD intrinsics for Rust are still in progress.

I've never written SIMD by hand or used SIMD intrinsics, so this will be new for me.

Conceptually, SIMD vectors are like fixed-size arrays of values, where an operation on two vectors is like zipping them, mapping with the scalar version of the operation, and collecting the results—except it all happens in a single instruction.
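In plain (and very much non-SIMD) Rust, the mental model is something like this, except a real vector add is a single instruction:

// mental model only: a lane-by-lane add over two 4-wide "vectors"
fn fadd_4_lanes(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i]; // the scalar op, applied to each lane independently
    }
    out
}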

Cranelift has support for generating SIMD instructions, so in theory the plan is simple. (We could vectorize over y instead, but I chose x.)

Three step plan:

  1. call our JIT function with a base x value
  2. instantiate the x value as a vector in Cranelift, where lane n holds x + n * step
  3. use the vector version of checking each sign

Spoiler from the future: this is missing stuff.

If you want a visualization, consider the grid of pixels we need to render:

Right now, each pixel requires its own function invocation to determine the sign of the last variable and color the pixel accordingly.

But with vectorization, we are able to make each invocation operate on multiple values at once, and compute multiple pixels.

When we call the JIT function, it creates multiple x values internally, offset by a step size, and runs on all of them at the same time. There isn't a reason beyond iteration complexity that we couldn't have each invocation compute multiple y values.

1: big steps

Update the coordinate code to take 8 steps on x at a time:

.flat_map(|y_step| {
    // this ugly code steps x by 8 at a time, since each call computes 8 x values
    (0..(dim / SIMD_WIDTH))
        .rev()
        .map(move |x_step| (x_step * SIMD_WIDTH, y_step))
})
.rev()

2: get wide

I changed my translate function to specialize the function it returns for a given step size and wrote a naive bit of code to create the vector:

"var-x" => {
    let x_base = builder.block_params(block)[0];

    // WARN: probably bugbear live here
    let incrementing = builder.func.dfg.constants.insert(
        (0..SIMD_WIDTH)
            .map(|n| n as f32 * x_stepsize)
            .flat_map(|n| n.to_ne_bytes())
            .collect(),
    );

    let step_vec = builder.ins().vconst(types::F32X8, incrementing);
    builder.ins().fadd(step_vec, x_base)
}

Cranelift is taunting me with larger vectors, but I don't think my hardware supports them—I have a Ryzen 5900HX in my laptop, which supports AVX and AVX2 (SIMD instruction sets).

This didn't work! I get a nasty unwrapped error.

Cranelift validates the IR you create before generating code, but the error is pretty inscrutable for this much IR without a better error renderer, which I don't want to set up.

...VerifierError { location: inst7761, context: Some("v7764 = fadd.f32 v7763, v263  ; v7763 = 0x1.966666p0"), message: "arg 1 (v263) has type f32x8, expected f32" }, VerifierError { location: inst7765, context: Some("v7768 = fadd.f32 v7767, v263  ; v7767 = 0x1.6cccccp0"), message: "arg 1 (v263) has type f32x8, expected f32" }, VerifierError { location: inst7771, context: Some("v7774 = fadd.f32 v7773, v244  ; v7773 = 0x1.0f5c28p0"), message: "arg 1 (v244) has type f32x8, expected f32" }, VerifierError { location: inst7793, context: Some("v7796 = fadd.f32 v7795, v263  ; v7795 = 0x1.966666p0"), message: "arg 1 (v263) has type f32x8, expected f32" }, VerifierError { location: inst7796, context: Some("v7799 = fadd.f32 v7798, v263  ; v7798 = 0x1.c00000p0"), message: "arg 1 (v263) has type f32x8, expected f32" }, VerifierError { location: inst7803, context: Some("v7806 = fadd.f32 v7805, v244  ; v7805 = 0x1.0e81ecp0"), message: "arg 1 (v244) has type f32x8, expected f32" }, VerifierError { location: inst7819, context: Some("v7822 = fadd.f32 v7821, v7  ; v7821 = 0x1.600000p1"), message: "arg 1 (v7) has type f32x8, expected f32" }, VerifierError { location: inst7822, context: Some("v7825 = fadd.f32 v7824, v7  ; v7824 = 0x1.6cccccp1"), message: "arg 1 (v7) has type f32x8, expected f32" }, VerifierError { location: inst7827, context: Some("v7830 = fadd.f32 v7829, v7  ; v7829 = 0x1.666666p1"), message: "arg 1 (v7) has type f32x8, expected f32" }, VerifierError { location: inst7835, context: Some("v7838 = fadd.f32 v7837, v7  ; v7837 = 0x1.900000p1"), message: "arg 1 (v7) has type f32x8, expected f32" }, VerifierError { location: inst7840, context: Some("v7843 = fadd.f32 v7842, v7  ; v7842 = 0x1.bcccccp1"), message: "arg 1 (v7) has type f32x8, expected f32" }, VerifierError { location: inst7847, context: Some("v7850 = fadd.f32 v7849, v7  ; v7849 = 0x1.933334p2"), message: "arg 1 (v7) has type f32x8, expected f32" }])))

The gist seems to be that instructions are expecting f32x8 and getting f32. Weird.

Since the IR for the full program is pretty massive, I created a simple program to try and deworm this, added a call to ctx.func.display() to see the IR, and pretty-printed the verifier error:

# 1/4th filled in?
_0 var-x
_1 var-y
_2 add _0 _1
_3 const 1
_4 sub _2 _3

function u0:0(f32, f32) -> i8 system_v {
    const0 = 0x3c6000003c4000003c2000003c0000003bc000003b8000003b00000000000000

block0(v0: f32, v1: f32):
    v2 = vconst.f32x8 const0
    v3 = fadd v2, v0  ; v2 = const0
    v4 = fadd v3, v1
    v5 = f32const 0x1.000000p0
    v6 = fsub v4, v5  ; v5 = 0x1.000000p0
    v7 = f32const 0.0
    v8 = fcmp lt v6, v7  ; v7 = 0.0
    return v8
}
Compilation(
    Verifier(
        VerifierErrors(
            [
                VerifierError {
                    location: inst1,
                    context: Some(
                        "v3 = fadd.f32x8 v2, v0  ; v2 = const0",
                    ),
                    message: "arg 1 (v0) has type f32, expected f32x8",
                },
                VerifierError {
                    location: inst2,
                    context: Some(
                        "v4 = fadd.f32x8 v3, v1",
                    ),
                    message: "arg 1 (v1) has type f32, expected f32x8",
                },
                VerifierError {
                    location: inst4,
                    context: Some(
                        "v6 = fsub.f32x8 v4, v5  ; v5 = 0x1.000000p0",
                    ),
                    message: "arg 1 (v5) has type f32, expected f32x8",
                },
                VerifierError {
                    location: inst6,
                    context: Some(
                        "v8 = fcmp.f32x8 lt v6, v7  ; v7 = 0.0",
                    ),
                    message: "arg 1 (v7) has type f32, expected f32x8",
                },
                VerifierError {
                    location: inst7,
                    context: Some(
                        "return v8",
                    ),
                    message: "result 0 has type i32x8, must match function signature of i8",
                },
            ],
        ),
    ),
)

Ah, it looks like I've totally misunderstood how SIMD works! This is one of those times I should have read any docs first. SIMD instructions operate between two vectors. You don't add a scalar to each lane of a vector, you add two vectors lane by lane.
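Concretely, for the x vector the fix looks something like this (a sketch using the names from the earlier snippet); the next section applies the same idea everywhere:

// broadcast the scalar x into every lane with splat, then add vector to vector
let x_vec = builder.ins().splat(types::F32X8, x_base);
builder.ins().fadd(step_vec, x_vec)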

2.5: everything needs to be wide

I'll need to create vectors for all the constants and the initial function parameters. I'm using splat to create vectors that have all lanes equal to some value. Doing this with the function parameters and all constants gets us here:

function u0:0(f32, f32) -> i8 system_v {
    const0 = 0x3c6000003c4000003c2000003c0000003bc000003b8000003b00000000000000

block0(v0: f32, v1: f32):
    v2 = splat.f32x8 v0
    v3 = vconst.f32x8 const0
    v4 = fadd v3, v2  ; v3 = const0
    v5 = splat.f32x8 v1
    v6 = fadd v4, v5
    v7 = f32const 0x1.000000p0
    v8 = splat.f32x8 v7  ; v7 = 0x1.000000p0
    v9 = fsub v6, v8
    v10 = f32const 0.0
    v11 = fcmp lt v9, v10  ; v10 = 0.0
    return v11
}

Compilation(
    Verifier(
        VerifierErrors(
            [
                VerifierError {
                    location: inst9,
                    context: Some(
                        "v11 = fcmp.f32x8 lt v9, v10  ; v10 = 0.0",
                    ),
                    message: "arg 1 (v10) has type f32, expected f32x8",
                },
                VerifierError {
                    location: inst10,
                    context: Some(
                        "return v11",
                    ),
                    message: "result 0 has type i32x8, must match function signature of i8",
                },
            ],
        ),
    ),
)

Now I need to decide how to get the data out.

3: exiting the highway

I see at least two choices here: we could generate code that returns an array of 8 bools (i8s, since we are using Cranelift, which represents booleans as i8), or we could return a single i8 where each bit maps to the sign of a given lane. I was inclined to return an array of 8 bools, since it seemed easier than adding a library for bit addressing... but when I emit that code I get the following from Cranelift:

Compilation(
    Unsupported(
        "Too many return values to fit in registers. Use a StructReturn argument instead. (#9510)",
    ),
)

So I guess we'll return an i8 and use bit addressing.
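I ended up pulling in the bitvec crate for this (shown below), but plain shifts would do the same job, assuming the lowest bit is the first lane:

// hypothetical crate-free version of the bit addressing: lane i lives in bit i
let mask = xy_fn(x, y) as u8;
let pixels: Vec<u8> = (0..SIMD_WIDTH)
    .map(|i| if (mask >> i) & 1 != 0 { 255u8 } else { 0u8 })
    .collect();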

At this point I ran into a lot of issues, but the root was the issue I flagged earlier in "bugbear live here": you can only initialize a vector with a 128-bit constant in Cranelift. For some reason, I figured assertion failed: ty.bits() <= 128 was referring to something else for a while. I was banging my head against the wall until I searched for similar problems in the Cranelift Zulip, and saw:

Alex Crichton: Cranelift generally only has support for 128-bit vectors at this time, 256+ will likely hit bugs like this

At least on x64, platforms like risc-v may fare better but in general 256+ is not tested much and probably won't work

Welp. Cranelift is still in progress. We'll need to switch to F32X4. This cuts our theoretical throughput in half, but such is life. We can explore F16X8 later.
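I believe the only change needed to the earlier vconst snippet is the width (with SIMD_WIDTH dropped to 4, the constant now fits in 128 bits):

// same incrementing constant as before, now 4 lanes so the vconst is 128 bits
let incrementing = builder.func.dfg.constants.insert(
    (0..SIMD_WIDTH)
        .map(|n| n as f32 * x_stepsize)
        .flat_map(|n| n.to_ne_bytes())
        .collect(),
);
let step_vec = builder.ins().vconst(types::F32X4, incrementing);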

We can get a mask where each bit holds the corresponding vector lane's boolean after the sign comparison using vhigh_bits, an instruction that concatenates the most significant bit of each lane into a scalar. Since fcmp sets every bit of a lane when the comparison is true, the most significant bit of each lane is exactly the boolean we want:

let zero = builder.ins().f32const(0f32);
let zero_vec = builder.ins().splat(types::F32X4, zero);
let signs_vec = builder
    .ins()
    // compare the last variable vector assigned (4 different x values) with zero
    .fcmp(FloatCC::LessThan, *kv.last().unwrap(), zero_vec);
let signs = builder.ins().vhigh_bits(types::I8, signs_vec);
builder.ins().return_(&[signs]);

And we will need to update our main function to treat the returned value as a bit vector using bitvec:

// it's not ugly and slow if it works, right?
.flat_map(|(x, y)| {
    let mask = xy_fn(x, y) as u8;
    mask.view_bits::<Lsb0>()
        .into_iter()
        .take(4)
        .map(|b| match *b {
            true => 255u8,
            false => 0u8,
        })
        .collect_vec()
})

And it works! We can render out my simple.vm test program:

# 1/4th filled in?
_0 var-x
_1 var-y
_2 add _0 _1
_3 const 1
_4 sub _2 _3

(I forgot how coordinates work, it wasn't 1/4th filled in)

Honestly, I'm again shocked this works. Unlike my initial impl, it didn't work on the first try: I got the bit order wrong and got an all-black image, but it basically worked as soon as I solved all the Cranelift assertions. I'm pretty sure I flip the order of the coordinates at some point, but as long as you flip them an even number of times it's fine, and it looks like I got lucky :3

is it faster?

Yes! At first, I forgot I was using the test program, and got extremely excited about the order of magnitude performance boost. However, vectorizing our JITed function seems to have provided a satisfying 2x speedup.

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):     268.0 ms ±  33.3 ms    [User: 2017.2 ms, System: 143.6 ms]
  Range (min … max):   243.0 ms … 325.1 ms    12 runs

But... is it correct? Diff says it's the same as our gold ppm file from earlier!

This leaves us here.

is generating SIMD code slower?

Yes. Now that we are emitting SIMD code, Cranelift takes a lot longer to generate our function. We can use black_box to try and ensure the work of compilation isn't optimized away by the Rust compiler.

let xy_fn = translate(src, step);
core::hint::black_box(xy_fn);

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):     100.6 ms ±  14.6 ms    [User: 91.6 ms, System: 8.4 ms]
  Range (min … max):    87.2 ms … 129.9 ms    33 runs

Clearly this tradeoff is worth it for overall performance, but we can say that we are now spending almost half our runtime for a 1024x1024 image JIT compiling the function, and guess that we increased the cost of JITing the function by 3-4x.

We can amortize this cost by rendering a higher resolution image, since we get to use our function more. 4096x4096 is much faster now (~2.8x faster!):

❯ hyperfine ./target/release/prospero-jit-rs
Benchmark 1: ./target/release/prospero-jit-rs
  Time (mean ± σ):      2.643 s ±  0.096 s    [User: 30.482 s, System: 2.047 s]
  Range (min … max):    2.525 s …  2.812 s    10 runs

ok but double the number of lanes on the highway so i can speed

Can we use F16X8 and get our "lost" throughput back? To avoid wasting time, I searched for f16 in the Cranelift Zulip and found a GitHub comment from the last few months by (who I believe to be) one of the primary maintainers:

bjorn3: f16 and f128 is currently basically unsupported in Cranelift. Anything beyond passing around f16 and f128 values should be assumed to not yet be implemented.

Gotcha, no doubling the number of lanes on the highway.

Since we aren't using all the bits in our u8, we could consider running the function body twice, to amortize function call overhead, but I don't think this is interesting—if you do it and find speedups, let me know on bluesky!

where does this leave us

We went from around 5 seconds to ~270 ms. We started by using Cranelift for JIT compilation, added Rayon to process pixels in parallel, tried switching to f32, and finally added vectorization to process 4 pixels per function invocation. Compared to the original Python's 15 seconds, I'd say that's pretty good. Compared to the 17.3 ms that Matt's Fidget needs to render 1024x1024? We'd need to think a lot harder about optimizing the equation and how we render the image to approach that kind of speed. We spend much longer than that just generating the SIMD JIT code.

But, given the level of effort involved (pretty low) I think it's cool that you can get pretty decent performance from ~100 lines of code to translate instructions into vectorized Cranelift IR, and JIT compile it as needed. That's all for now, let me know if you see any mistakes, and feel free to give feedback. If you made it this far, feel free to leave a drawing at the bottom of my website <3
