On LLMs and Code Optimization

David G. Andersen January 04, 2025

Prologue: Well, in the spirit of "own your own stuff", I've decided to claw back my blog onto self-ish hosted space (the blog is static, though I'm not hosting it on my own metal, so it's easy to migrate). Welcome to the new blog, same as the old blog.

Max Woolf recently asked whether repeatedly prompting a large language model (LLM) to "write better code" would make it, well, write better code. His answer - summarized briefly - was "kinda, maybe, up to a point" (you should read the original article).

But what jumped out to me in the article was an obvious missed optimization that none of the prompts to the LLM identified, and that got me curious about where an LLM would or wouldn't help in optimizing. Because his example is fresh and unlikely to yet be in the training corpus for LLMs, I'm using his problem for this post and seeing how much farther we can go -- with or without an LLM.

We'll start with the prompt Max used:

Write Python code to solve this problem:

Given a list of 1 million random integers between 1 and 100,000, 
find the difference between the smallest and the largest numbers 
whose digits sum up to 30.

For this example, I'll use GPT 4o through Copilot for all examples. Copilot generates substantially the same code for me that it did for Max, in Python:

import random

# Generate a list of 1 million random integers between 1 and 100,000
random_integers = [random.randint(1, 100000) for _ in range(1000000)]

def digit_sum(n):
    return sum(int(digit) for digit in str(n))

# Filter numbers whose digits sum up to 30
filtered_numbers = [num for num in random_integers if digit_sum(num) == 30]

if filtered_numbers:
    min_number = min(filtered_numbers)
    max_number = max(filtered_numbers)
    difference = max_number - min_number
    print(f"The difference between the largest and smallest numbers whose digits sum up to 30 is: {difference}")
else:
    print("No numbers found with digits summing up to 30.")

This code takes 520ms on my M1 MBP. Great baseline. You've already seen Max's code, so we know it can be a lot faster using a few different techniques:

Numpy / native code
Parallelization / vectorization
Reducing redundant work

Because Python introduces some artificial weirdness in approaching this problem, let's jump to Rust, where things like the effects of using JIT or native code through numpy aren't as big, and we can see effects of our performance choices a little more clearly. Here's the same LLM-written starter code in Rust:

use rand::Rng;

fn digit_sum(n: i32) -> i32 {
    n.to_string().chars().map(|c| c.to_digit(10).unwrap() as i32).sum()
}

fn main() {
    let mut rng = rand::thread_rng();
    let numbers: Vec<i32> = (0..1_000_000).map(|_| rng.gen_range(1..=100_000)).collect();

    let mut min_number = i32::MAX;
    let mut max_number = i32::MIN;

    for &number in &numbers {
        if digit_sum(number) == 30 {
            if number < min_number {
                min_number = number;
            }
            if number > max_number {
                max_number = number;
            }
        }
    }

    if min_number == i32::MAX || max_number == i32::MIN {
        println!("No numbers with digit sum of 30 found.");
    } else {
        let difference = max_number - min_number;
        println!("Difference between the largest and smallest numbers with digit sum of 30: {}", difference);
    }
}

Intriguingly, the Rust code uses the same "split the string" approach that the Python code does. This code takes 42ms to run. Speed improvement going to Rust: 12x.

But, of course, this hides a bit, because the "Explode the string" function is so slow. Let's tell the LLM to make it faster. Selecting the code and typing "make this code faster" gives a new version:

fn digit_sum(n: i32) -> i32 {
    let mut sum = 0;
    let mut num = n;
    while num > 0 {
        sum += num % 10;
        num /= 10;
    }
    sum
}

Which is correct. New time: 13.2ms. Speed improvement from better digit_sum: 3.2x. Notably, this gain is hidden in Python: moving to the better string sum doesn't change things much because, while it's better, it involves more python-level operations.

Selecting the main function and asking the LLM to make it faster fails. Copilot tries to parallelize the checking of the numbers vector but, of course, Rust won't let you have multiple threads modifying the same variable at the same time:

   Compiling initial v0.1.0 (/Users/dga/blog/code_examples/llm-digit-diff-examples/rust/initial)
error[E0594]: cannot assign to `min_number`, as it is a captured variable in a `Fn` closure
  --> src/main.rs:29:17
   |
29 |                 min_number = number;
   |                 ^^^^^^^^^^^^^^^^^^^ cannot assign

So here, alas, we have to part ways with our LLM, but we'll use it a little more later.

Here's where our first human optimization comes in, and it's a simple one that I was quite surprised that none of Max's or my attempts at LLM-based optimization caught: Check to see if the number is useful before doing the comparatively expensive digit sum test. All of the LLM-generated code goes the opposite way: Filter by digit-sum and then compare. This is a manual change but an easy one:

    for &number in &numbers {
        if number < min_number || number > max_number {
            if digit_sum(number) == 30 {
                if number < min_number {
                    min_number = number;
                }
                if number > max_number {
                    max_number = number;
                }
            }
        }
    }

New time: 10.7ms. Speedup: 1.2x. This is much smaller than the 5x speedup the same change made with the Python code, because the optimized digit_sum function is much faster in Rust. But actually it's faster than that, because we're bottlenecked on an entirely different part of the code...

One major limitation the LLM has is that it can't run a profiler on the code, but we can. (This would be a fun thing to do in the future - feed the output of perf to the LLM and say 'now optimize'). To save a little time, I'll tell you that a major cost of the Rust code is, surprisingly, in generating the vector of random numbers. Rust's rand package defaults to a cryptographically secure RNG that's not very fast. We'll address this by switching to the fastrand package. Note, however, that I'm cheating here because I picked it because it's friendly to some parallelization we'll introduce later. Our LLM makes this easy, but the change was simple on its own:

switch to using fastrand instead of rand

The changes were correct, and the only substantial change was:

    let numbers: Vec<i32> = (0..1_000_000).map(|_| fastrand::i32(1..=100_000)).collect();

New time: 2.8ms. Speedup: About 3.8x.

I tried a few generic "make it faster" prompts to the LLM but it didn't generate anything that helped - and usually didn't even compile. So I got more specific. That random generation is still a bottleneck, so let's speed it up by giving more specific instructions, selecting the "let numbers" line and saying:

Parallelize this code using rayon

And it did:

        use rayon::prelude::*;
        let numbers: Vec<i32> = (0..1_000_000).into_par_iter().map(|_| fastrand::i32(1..=100_000)).collect();

New time: 2.35ms. Speedup: About 1.2x.

At this point, you might think we're kinda done - after all, we're about 2.5x faster than the fastest, numba+numpy JIT-optimized 6ms run from Python (where presumably most of the time was spent in C).

Except we're not. We can parallelize, but we have to do so intelligently. A naive parallelization can't retain the "check first" optimization because it updates those min/max's across threads, so we now have to get human on the problem and realize where the trick is: After processing a few thousand numbers, we've found some OK candidates for max_number and min_number. They're probably not the best ones, but they let us avoid a lot of work. After we do that, then we can parallelize without having to worry about something icky like updating those with a shared lock. So we add a second loop, and it looks like this:

        const PREGEN: usize = 10_000;
        for _ in 0..PREGEN {
            let num = fastrand::i32(1..100000);
            if num < min_num || num > max_num {
                if digit_sum(num) == 30 {
                    min_num = min(min_num, num);
                    max_num = max(max_num, num);
                }
            }
        }
        let numbers: Vec<i32> = (0..(1_000_000 - PREGEN))
            .into_par_iter()
            .map(|_| fastrand::i32(1..100000))
            .filter(|&x| (x < min_num || x > max_num) && digit_sum(x) == 30)
            .collect();

        for num in numbers {
            if num < min_num || num > max_num {
                min_num = min(min_num, num);
                max_num = max(max_num, num);
            }
        }

New time: 760 microseconds. Speedup: 3.6x. Relative to the initial LLM-generated Rust version, we're about 55x faster, and about 17x faster than the "not stupid digit_sum" version.

Note that this is a very project Euler-like question. It admits additional mathematical optimization that I've deliberately ignored, because I like the pattern of the problem as a general one: Given a choice of a few ways to eliminate candidates, some of which parallelize and some of which are more tricky, find an order in which to do so.

I think it's revealing how limited the LLMs were in algorithmically optimizing this problem vs approaches like "add some parallelism" or "use numpy", for which they were quite decent in Python and somewhat less decent in Rust. It's surprising to me that unless I prompted it overly-specifically, GPT 4o didn't suggest using a faster rand implementation, as that was a pretty effective, easy step.

I suspect that part of why is that this problem turns out to be unexpectedly interesting from a parallelization perspective: Because the digit_sum function is one of the biggest costs (once you fix rand), a naive parallelization does more work than a version of the code using the fast check first: With 1M numbers, the "fast check" code only calls digit sum about 5,000 times, whereas a naively parallel version calls it 1M times. The naively parallel version is still faster than the non-parallel fast check code, but at a really high cost in total work (and therefore, energy and machine cost). That tradeoff made this problem perhaps a more interesting LLM testcase than the author had intended, but it's a very real kind of issue: Many approaches to parallelization may end up trading off work for latency, and it requires a thoughtful programmer to figure out how to navigate it.

The code written for this post can be found here.

Unfortunately, of course, if you're reading this after 2025, the LLM will probably have been trained on this blog post and will apply these optimizations for you if you try to reproduce it.