Learning to Bit Twiddle and Optimize
David G. Andersen January 14, 2025ENTER STUDENT: You're in 15-213[1] (or learning that material online).
It's time for the bit twiddling lab.
Here's the thing about that lab: It can be fairly straightforward or quite difficult depending on your background in things like competitive math and competitive programming. At CMU, the distribution of people's backgrounds on these topics is wide. And if you haven't had an intro to it you can spend a lot of time banging your head against the problems.
This post is an attempt at a gentle intro to the thinking in bit twiddling and optimization.
What we're doing: We're going to use simple instructions, like addition and bitwise operations like AND, OR, XOR, NOT to change values in sometimes surprising ways. One of the big tricks we'll talk about is sometimes called SWAR: SIMD within a register. (SIMD meaning "single instruction multiple data" -- applying the same operation to multiple chunks of data).
Note that in these example I'm prefixing binary numbers with "0b" for clarity. C won't actually let you do that, so you'll have to write your constants in hex instead.
This post uses many folds: Click here to unfold
I've put a lot of stuff under little folds so that you can think about it first or hide things you already know. Just click to fold or unfold.C operator reference if useful
With all of these examples, try to figure out the solution for yourself first, ideally writing it down / typing it into an editor to keep yourself honest. The thinking about it is how you train your brain.
Inverting bits
Let's start with an easy one: How do you flip all of the bits of a value?
Input: 0b00000...000
Output: 0b11111...111
Invert Bits - remember, write your answer down FIRST
output = ~input
Explanation
The tilde operator is the bitwise inversion operator. It will invert
any bit pattern, so, e.g., 0101 becomes 1010.
Thinking about Two's Complement
Input: 0b00000...000
Output: 0b11111...111
Invert Bits - remember, write your answer down FIRST
output = ~input
Explanation
The tilde operator is the bitwise inversion operator. It will invert any bit pattern, so, e.g., 0101 becomes 1010.On x86 C, we represent negative numbers in two's complement. To generate a two's complement number, invert all of the bits (see above), and then add one.
Convert to negative without using the - operator:
Input: 5
Output: -5
Convert to Negative
x = ~x + 1
Knowing how to invert a number, how would you subtract 27 from a variable x without using the subtraction operator?
Subtract 27 Without Subtraction
x += (~27 + 1);
What is the two's complement representation of zero?
Two's complement representation of zero
It's 0
Explanation: ~0 = all 1's. Add 1 to 0b111111111 and it wraps, giving 0b00000 again
SWAR Techniques
SWAR is a very powerful technique for optimizing certain low-level operations, when it applies. But it can be a little unintuitive.
Let's look at an example problem. In a 32 bit integer, x
, invert the lowest 4 bits of each 8-bit byte.
In other words, if the integer looks like this: 0b00000000000000000000000000000000 then the output
should be 0b00001111000011110000111100001111.
To solve this problem, let's first assume that you have the variable x
and another variable called invert_mask
, where invert_mask = 0x0f0f0f0f
.
(That's the same bit pattern as 0b000111100001111..., just written in hex).
Invert high-order bits of x using invert_mask
x ^= invert_mask
Explanation
The carat operator is the bitwise XOR (exclusive-or) operator. Recall that xor 1 0 = 1; xor 0 0 = 0, etc. It outputs 1 if and only if exactly one of the input bits is one and the other is zero. And it operates bitwise, so taking `0b1010 ^ 0b1100 = 0b0110`.Xoring x with the input mask inverts only the bits where invert_mask is 1. Otherwise it leaves them untouched.
Now we have to generate the invert_mask.
Under normal programming you would simply
define it like that: invert_mask = 0x0f0f0f0f;
But the bit twiddling lab has a further restriction: You can't use constants greater than
a single byte (0xff).
Step 1: Create 0x0f000000 using shifting
partial_mask = 0x0f << 24;
Step 2: create invert_mask of 0x0f0f0000
invert_mask = 0x0f << 24 | 0x0f << 16;
Explanation
| is the bitwise or operator (not to be confused with ||, the logical OR operator). It produces an output of 1 if either input bit is 1. We shift our 0xff left by 24 bits so that the highest 8 bits are 1, and then OR that with 0xff shifted by 16.Interestingly, there's a way to create the invert mask 0x0f0f0000
using only ONE reference to the 0x0f constant instead of two as we did above. Can you figure it out?
create invert_mask 0x0f0f0000 using only one constant 0x0f
invert_mask = 0x0f << 16; invert_mask |= (invert_mask << 8);
Explanation
Here we re-used invert_mask; after the first line, 0x000f0000.And finally, that brings us to one of our core efficiency tricks in SWAR. In this part of the exercise, try to build up a 64 bit invert mask that looks like 0x0f0f0f0f0f0f0f0f. Efficiently means "in fewer operations than just shifting 0x0f left 8 times".
Efficiently create a 64-bit invert mask
invert_mask = 0x0f; invert_mask |= invert_mask << 8; invert_mask |= invert_mask << 16; invert_mask |= invert_mask << 32;
Explanation
Here we first create invert_mask as 0x00000000000000f0.Then we shift it left 8 bits and OR it into itself, producing 0x0000000000000f0f
Then we shift it left by 16 bits and again OR it into itself, producing 0x000000000f0f0f0f.
And one more of those by 32 bits gives us our final mask.
SWAR Addition
SWAR can also be used to do things like addtion in parallel within a word.Imagine we have a 64 bit word that looks like this: 0x0001000200030004 and we want to treat it, instead of as a single 64 bit word, as four individual 16 bit words.
Observe the result of adding w + w right shifted by 16
w = 0x0001000200030004; w = w + (w >> 16); w now equals: 0x0001000300050007
Hey that's kinda cool: We've used a single shift and add to add multiple items simultaneously. Now, assuming we name the sub-parts of that word "abcd" (so "a" is 0x0001, b is 0x0002, etc.), let's add them so that we get a+b in the upper 32 bits and c+d in the lower 32 bits without having to worry about overflow.
SWAR add handling overflow
w = 0x0001000200030004; mask = 0x0000ffff0000ffff; w = (w & mask) + ((w >> 16) & mask); w now equals 0x0000000300000007
Note that this code works properly even if the result of addition needs more than 16 bits.
This example of having a vector of four numbers doesn't benefit as much from SWAR, but, of course, you might have a 64 bit number that actually consists of 16x 4-bit numbers, or something similar, in which case the SWAR technique can be quite a bit faster.
-
15-213 is Carnegie Mellon's Intro to Systems class, usually taken by 1st or 2nd year undergraduates. ↩