Learning to Bit Twiddle and Optimize

David G. Andersen January 14, 2025

ENTER STUDENT: You're in 15-213^[1] (or learning that material online).

It's time for the bit twiddling lab.

Here's the thing about that lab: It can be fairly straightforward or quite difficult depending on your background in things like competitive math and competitive programming. At CMU, the distribution of people's backgrounds on these topics is wide. And if you haven't had an intro to it you can spend a lot of time banging your head against the problems.

This post is an attempt at a gentle intro to the thinking in bit twiddling and optimization.

What we're doing: We're going to use simple instructions, like addition and bitwise operations like AND, OR, XOR, NOT to change values in sometimes surprising ways. One of the big tricks we'll talk about is sometimes called SWAR: SIMD within a register. (SIMD meaning "single instruction multiple data" -- applying the same operation to multiple chunks of data).

Note that in these example I'm prefixing binary numbers with "0b" for clarity. C won't actually let you do that, so you'll have to write your constants in hex instead.

This post uses many folds: Click here to unfold

I've put a lot of stuff under little folds so that you can think about it first or hide things you already know. Just click to fold or unfold.

C operator reference if useful

For reference, relevant operators in C:

Operation	C example	Binary example
bitwise AND	A & B	0101 & 0011 = 0001
logical AND	A && B	0101 && 0011 = 1
bitwise OR	A \| B	0101 \| 0011 = 0111
logical OR	A \|\| B	0101 \|\| 0011 = 1
bitwise XOR	A ^ B	0101 ^ 0011 = 0110
bitwise NOT	~A	~0101 = 1010
logical NOT	!A	!0101 = 0
left shift	A << B	0101 << 1 = 01010
right shift	A >> B	0101 >> 1 = 010

With all of these examples, try to figure out the solution for yourself first, ideally writing it down / typing it into an editor to keep yourself honest. The thinking about it is how you train your brain.

Inverting bits

Let's start with an easy one: How do you flip all of the bits of a value?

   Input:  0b00000...000
   Output: 0b11111...111

Invert Bits - remember, write your answer down FIRST

output = ~input

Explanation

The tilde operator is the bitwise inversion operator. It will invert any bit pattern, so, e.g., 0101 becomes 1010.

Thinking about Two's Complement

On x86 C, we represent negative numbers in two's complement. To generate a two's complement number, invert all of the bits (see above), and then add one.

  Convert to negative without using the - operator:
  Input: 5
  Output: -5

Convert to Negative

x = ~x + 1

Knowing how to invert a number, how would you subtract 27 from a variable x without using the subtraction operator?

Subtract 27 Without Subtraction

x += (~27 + 1);

What is the two's complement representation of zero?

Two's complement representation of zero

It's 0

Explanation: ~0 = all 1's. Add 1 to 0b111111111 and it wraps, giving 0b00000 again

SWAR Techniques

SWAR is a very powerful technique for optimizing certain low-level operations, when it applies. But it can be a little unintuitive.

Let's look at an example problem. In a 32 bit integer, x, invert the lowest 4 bits of each 8-bit byte. In other words, if the integer looks like this: 0b00000000000000000000000000000000 then the output should be 0b00001111000011110000111100001111.

To solve this problem, let's first assume that you have the variable x and another variable called invert_mask, where invert_mask = 0x0f0f0f0f. (That's the same bit pattern as 0b000111100001111..., just written in hex).

Invert high-order bits of x using invert_mask

x ^= invert_mask

Explanation

The carat operator is the bitwise XOR (exclusive-or) operator. Recall that xor 1 0 = 1; xor 0 0 = 0, etc. It outputs 1 if and only if exactly one of the input bits is one and the other is zero. And it operates bitwise, so taking `0b1010 ^ 0b1100 = 0b0110`.

Xoring x with the input mask inverts only the bits where invert_mask is 1. Otherwise it leaves them untouched.

Now we have to generate the invert_mask. Under normal programming you would simply define it like that: invert_mask = 0x0f0f0f0f; But the bit twiddling lab has a further restriction: You can't use constants greater than a single byte (0xff).

Step 1: Create 0x0f000000 using shifting

partial_mask = 0x0f << 24;

Step 2: create invert_mask of 0x0f0f0000

invert_mask = 0x0f << 24 | 0x0f << 16;

Explanation

| is the bitwise or operator (not to be confused with ||, the logical OR operator). It produces an output of 1 if either input bit is 1. We shift our 0xff left by 24 bits so that the highest 8 bits are 1, and then OR that with 0xff shifted by 16.

Interestingly, there's a way to create the invert mask 0x0f0f0000 using only ONE reference to the 0x0f constant instead of two as we did above. Can you figure it out?

create invert_mask 0x0f0f0000 using only one constant 0x0f

invert_mask = 0x0f << 16;
invert_mask |= (invert_mask << 8);

Explanation

Here we re-used invert_mask; after the first line, 0x000f0000.

And finally, that brings us to one of our core efficiency tricks in SWAR. In this part of the exercise, try to build up a 64 bit invert mask that looks like 0x0f0f0f0f0f0f0f0f. Efficiently means "in fewer operations than just shifting 0x0f left 8 times".

Efficiently create a 64-bit invert mask

invert_mask = 0x0f;
invert_mask |= invert_mask << 8;
invert_mask |= invert_mask << 16;
invert_mask |= invert_mask << 32;

Explanation

Here we first create invert_mask as 0x00000000000000f0.

Then we shift it left 8 bits and OR it into itself, producing 0x0000000000000f0f

Then we shift it left by 16 bits and again OR it into itself, producing 0x000000000f0f0f0f.

And one more of those by 32 bits gives us our final mask.

SWAR Addition

SWAR can also be used to do things like addtion in parallel within a word.

Imagine we have a 64 bit word that looks like this: 0x0001000200030004 and we want to treat it, instead of as a single 64 bit word, as four individual 16 bit words.

Observe the result of adding w + w right shifted by 16

w = 0x0001000200030004;
w = w + (w >> 16);
w now equals: 0x0001000300050007

Hey that's kinda cool: We've used a single shift and add to add multiple items simultaneously. Now, assuming we name the sub-parts of that word "abcd" (so "a" is 0x0001, b is 0x0002, etc.), let's add them so that we get a+b in the upper 32 bits and c+d in the lower 32 bits without having to worry about overflow.

SWAR add handling overflow

w = 0x0001000200030004;
mask = 0x0000ffff0000ffff;
w = (w & mask) + ((w >> 16) & mask);
w now equals 0x0000000300000007

Note that this code works properly even if the result of addition needs more than 16 bits.

This example of having a vector of four numbers doesn't benefit as much from SWAR, but, of course, you might have a 64 bit number that actually consists of 16x 4-bit numbers, or something similar, in which case the SWAR technique can be quite a bit faster.

15-213 is Carnegie Mellon's Intro to Systems class, usually taken by 1st or 2nd year undergraduates. ↩