Optimizing Fizz Buzz


NOTE: I’m not a professional C/C++ or assembly programmer, so the following solutions could probably be much better, and a lot of the time I’m making uneducated guesses without proper profiling. All of these programs have been measured on my current computer (i7-6700K, 3200 MHz RAM), and all of them have been compiled with GCC 13.1.1 using -O3 unless stated otherwise.

I’ll start by defining our starting program: one whose output we know is completely correct.

#include <cstdio>
#include <cstdint>
#include <string>
#include <functional>

static constexpr uint32_t ITERATIONS = 500000000; 

int main() {
	std::string accum;

	for(uint32_t i = 1; i <= ITERATIONS; i++) {
		if(i % 3 == 0 && i % 5 == 0) {
			accum += "fizzbuzz ";
		}else if(i % 3 == 0) {
			accum += "fizz ";
		}else if(i % 5 == 0) {
			accum += "buzz ";
		}else {
			accum += std::to_string(i);
			accum += " ";
		}
	}

	std::string subs = accum.substr(accum.length()-20, 20);
	printf("%s\n", subs.c_str());
	#ifdef INFO_MODE
		printf("Total size string: %zu\n", accum.length());
		printf("HASH: %zu\n", std::hash<std::string>{}(accum));
	#endif

	return 0;
}

As you can see, we are printing the last 20 characters of the fizz buzz output. Also, if we define INFO_MODE at compile time, it will hash the entire string and print the hash. This served me well for checking that the solution stays correct in the following iterations. Also, for debugging I set -fsanitize=address to check for memory errors (it saved me a few times on this journey).

The first speedup we can make is simply reserving memory for our string.

accum.reserve(ITERATIONS*8);

Although this wastes a lot of space, it should give us enough room to store the whole fizz buzz string. With this we get a speed improvement of 8-9%.


Encoding decision

The second thing to optimize is the number of modulo operations and conditional jumps. We could try to remove one modulo operation by encoding both divisibility checks in one integer variable.

for(uint32_t i = 1; i <= ITERATIONS; i++) {
		uint8_t encode = (i % 3 == 0) ? 1 : 0;
		encode += (i % 5 == 0) ? 2 : 0;

		if(encode == 0) {
			accum += std::to_string(i);
			accum += " ";
		}else if(encode == 1) {
			accum += "fizz ";
		}else if(encode == 2) {
			accum += "buzz ";
		}else if(encode == 3) {
			accum += "fizzbuzz ";
		}
	}

Not only do we remove one modulo operation in the worst case, we also give GCC more hints to optimize this sequence. Still, I couldn’t get a significant speed improvement: some tests show a 2% improvement, others none at all.


Custom String

Next, we could consider that the string concatenation is probably quite slow. I’m sure std::string’s concatenation checks whether it needs more space to store the whole string. We don’t need any checks, because we know that everything is going to fit in our pre-reserved string. So our next step should be creating a simple struct to store and concatenate the fizz buzz.

// Needs <cstring> (std::strlen, memcpy) and <cstdlib> (malloc).
struct CustomString {
	char* string;
	size_t total_size_reserved;
	size_t position_null_char; // End of string

	void initialize(size_t size_bytes) {
		string = (char*)malloc(size_bytes);
		string[0] = '\0';
		total_size_reserved = size_bytes;
		position_null_char = 0;
	}

	void concat(const char* str_to_concat) {
		size_t length = std::strlen(str_to_concat);

		memcpy(string+position_null_char, str_to_concat, length+1); // length+1 to include null terminator
		position_null_char = position_null_char+length;
	}

	void concat(const char* str_to_concat, size_t length) {
		memcpy(string+position_null_char, str_to_concat, length+1); // length+1 to include null terminator
		position_null_char = position_null_char+length;
	}
};

With this we get a nice 6-8% improvement in speed.

Remove modulo operations

Now we could try to remove those bad modulo operations. The only thing that occurs to me right now is to keep two counters, one for 3 and another for 5. When we hit ‘3’ in the first counter, we know the number is divisible by 3; the same goes for the ‘5’ counter.

uint8_t c3 = 1;
uint8_t c5 = 1;
for(uint32_t i = 1; i <= ITERATIONS; i++) {
	if(c3 == 3 && c5 == 5) {
		accum.concat("fizzbuzz ", 9);
		c3 = 0;
		c5 = 0;
	}else if(c3 == 3) {
		accum.concat("fizz ", 5);
		c3 = 0;
	}else if(c5 == 5) {
		accum.concat("buzz ", 5);
		c5 = 0;
	}else{
		accum.concat(std::to_string(i).c_str());
		accum.concat(" ", 1);
	}
	c3++;
	c5++;
}

But again, this solution performs more or less the same as our previous one. Probably because GCC was already optimizing the modulos, or because the conditional jumps are just as expensive. So this change is ruled out.


Unrolling

But you know what is faster than doing modulo operations and if statements? Not doing them. Because we know fizz buzz is a sequence with a period of 15, we can just write those 15 statements and repeat them until we have covered (ITERATIONS / 15) blocks. Then we do the remainder of the fizz buzz normally until we complete everything.

uint32_t iterToUnroll = (ITERATIONS / 15) * 15; // largest multiple of 15 <= ITERATIONS

for(uint32_t i = 1; i < iterToUnroll; i += 15) {
	// nninuinniuninnf
	accum.concat(std::to_string(i).c_str());
	accum.concat(" ", 1);

	accum.concat(std::to_string(i+1).c_str());
	accum.concat(" ", 1);

	accum.concat("fizz ", 5); // 3

	accum.concat(std::to_string(i+3).c_str());
	accum.concat(" ", 1);

	accum.concat("buzz ", 5); // 5

	accum.concat("fizz ", 5); // 6

	accum.concat(std::to_string(i+6).c_str());
	accum.concat(" ", 1);

	accum.concat(std::to_string(i+7).c_str());
	accum.concat(" ", 1);

	accum.concat("fizz ", 5); // 9

	accum.concat("buzz ", 5); // 10

	accum.concat(std::to_string(i+10).c_str());
	accum.concat(" ", 1);

	accum.concat("fizz ", 5); // 12

	accum.concat(std::to_string(i+12).c_str());
	accum.concat(" ", 1);

	accum.concat(std::to_string(i+13).c_str());
	accum.concat(" ", 1);

	accum.concat("fizzbuzz ", 9); // 15
}

for(uint32_t i = iterToUnroll+1; i <= ITERATIONS; i++) {
	uint8_t encode = (i % 3 == 0) ? 1 : 0;
	encode += (i % 5 == 0) ? 2 : 0;

	if(encode == 0) {
		accum.concat(std::to_string(i).c_str());
		accum.concat(" ", 1);
	}else if(encode == 1) {
		accum.concat("fizz ", 5);
	}else if(encode == 2) {
		accum.concat("buzz ", 5);
	}else if(encode == 3) {
		accum.concat("fizzbuzz ", 9);
	}
}

This gets us an improvement of 3-4%. Not much, which suggests the compiler was already unrolling the loop. But it seems this version gives the compiler a bit more headroom to optimize.


Custom int to String

After all of this, I don’t see anything left to optimize except std::to_string. Let’s look at a flamegraph of the program with the latest optimizations.

In this very long flamegraph we can see that calling std::to_string is taking a lot of time. So we could implement our own, with zero safeguards and a lot of foot-guns. For this I’ve created three programs implementing the same behavior. I did it for experimentation, and because the first program is very, very ugly. So here is my thought process.

We want to transform an integer into an array of chars. What is the fastest method? Of course… not doing it at all. We could just keep a buffer and increment the characters of the buffer as if it were a decimal number.

So for the 1st implementation this was my thought process:

  1. Have an initially reserved buffer that is long enough for the whole run of the application.
  2. Put a ‘0’ in the first position and a null char ‘\0’ in the second.
    [image: custom buffer string with 1]
  3. Record the index of the null char as the last position.
  4. We can update the buffer by updating the character in the last-1 position.
    [image: custom buffer string with 9]
  5. If we surpass ‘9’, we wrap the digit accordingly and update the position-1 character.
  6. If that position is invalid (position < 0), we move the entire string one place to the right and then put a ‘1’ in the first position.
    [image: custom buffer string with 10 — every digit moved one place right and a ‘1’ at index 0]
  7. We continue like this until the whole buffer is updated.

Because we never add numbers greater than 3, we don’t need to worry about updating more than one position at a time.

This algorithm works, and it even outclasses (by a really small margin) my two other algorithms, for some reason I can’t understand. It is probably because the characters are aligned at the beginning of the string.

The 2nd implementation is just an algorithm that uses modulo 10 and division to extract each character from an unsigned int.

The 3rd implementation is like the first, but instead of storing the characters that compose our number at the beginning of the buffer, I store them at the end, so we never need to displace the characters.

And the results against the loopUnroll version (with all the other optimizations) are these:

  1. 57.99% improvement
  2. 30.25% improvement
  3. 57.48% improvement

As we can see, the first and third implementations are more or less on par. All benchmarks I’ve done suggest the first implementation is faster than the third one. But for clarity and simplicity I’ll use the third implementation from here on out (the first one is also really ugly, and it even uses recursion).

Here is the whole third implementation:

struct UintInString{
	char* string;
	char* init_of_number_string;
	size_t position_null_char;
	size_t length;

	void initialize(size_t reserve_size) {
		string = (char*)malloc(reserve_size);
		for(size_t i = 0; i < reserve_size-1; i++) {
			string[i] = '0';
		}

		position_null_char = reserve_size-1;
		string[position_null_char] = '\0';
		length = 1;
		init_of_number_string = string + (position_null_char-1);
	}
	
	void update_add(uint8_t num) {
		// Can't update with numbers greater than 9
		// '0' --> 48
		// '9' --> 57
		size_t position_char = position_null_char-1;
		size_t total_length = 1;
		while(num > 0) {
			uint8_t lastChar = string[position_char];
			if(lastChar + num > '9') {
				string[position_char] = lastChar + num - 10;

				position_char--;
				num = 1;

				total_length++;
				if(total_length > length) {
					length = total_length;
					init_of_number_string--;
				}
			}else{
				string[position_char] = lastChar + num;
				num = 0;
			}
		}
	}
};

With this, we can initialize the buffer and then use update_add to add a number no greater than 3, and then use init_of_number_string and its length to concat the buffer to our custom string.

Here is the flamegraph of the third implementation:

Now the flamegraph is much cleaner: 70+ percent of the graph is just our custom concat function. I’m not really sure how to be faster than memcpy. Maybe there is a way, but for now I’m leaving it as it is.


Multithreading

But you know what time it is, right? It’s time for boring multithreading optimizations. And oh man, I’m really bad at multithreading.

uint32_t total_iters_mult_15 = (ITERATIONS / 15) * 15;
uint32_t total_chunks = total_iters_mult_15 / 15;

UintInString num_string_buffer;
num_string_buffer.initialize(12);

const uint32_t total_threads =  std::thread::hardware_concurrency();
uint32_t num_threads = std::min(total_threads, total_chunks);

std::thread* threads = new std::thread[num_threads];
CustomString* threads_accum = new CustomString[num_threads];

// Start threads
size_t chunks_per_thread = total_chunks / num_threads;
size_t remainder_chunks = total_chunks - (chunks_per_thread * num_threads);

size_t from = 0;
size_t to = chunks_per_thread * 15;

for(uint32_t i = 0; i < num_threads-1; i++) {
	threads_accum[i] = CustomString();
	threads_accum[i].initialize((ITERATIONS / num_threads) * 30);
	threads[i] = std::thread(fizzBuzz, from, to, &threads_accum[i]);
	from = to;
	to += chunks_per_thread * 15;
}
// Last thread do the remainder operations
threads_accum[num_threads-1] = CustomString();
threads_accum[num_threads-1].initialize((ITERATIONS / num_threads) * 30);
to = to+remainder_chunks*15;
threads[num_threads-1] = std::thread(fizzBuzz, from, to, &threads_accum[num_threads-1]);

// Wait for every thread to finish
for(uint32_t i = 0; i < num_threads; i++) {
	threads[i].join();
}

// Join all the strings
size_t total_length = 0;
for(uint32_t i = 0; i < num_threads; i++) {
	total_length += threads_accum[i].position_null_char + 1;
	accum.concat(threads_accum[i].string);
}

// Finish remainder operations
num_string_buffer.set_to_number(to);
for(size_t i = to+1; i <= ITERATIONS; i++) {
	uint8_t encode = (i % 3 == 0) ? 1 : 0;
	encode += (i % 5 == 0) ? 2 : 0;

	num_string_buffer.update_add(1);
	
	if(encode == 0) {
		accum.concat(num_string_buffer.init_of_number_string, num_string_buffer.length);
		accum.concat(" ", 1);
	}else if(encode == 1) {
		accum.concat("fizz ", 5);
	}else if(encode == 2) {
		accum.concat("buzz ", 5);
	}else if(encode == 3) {
		accum.concat("fizzbuzz ", 9);
	}
}

Of course, fizz buzz on multiple threads is more or less trivial. We just need to divide the workload between the threads in chunks whose sizes are divisible by 15 (so each one can be processed with an unrolled loop like before), leaving the non-divisible tail for the main thread. Or at least, that’s what I have done here.

But after benchmarking it against the other programs, we can see that it’s much slower than the single-threaded one.

[chart: fizz buzz benchmark of 500000000 numbers, GCC 13.1.1, i7-6700K, 3200 MHz DDR4]


I’m not sure why, but it’s probably because it does double the number of mallocs, one for each thread, and then we need to concatenate all of those strings into one. That probably causes a lot of cache invalidation.

I also compiled it with Clang, but it showed the same performance.

Anyway, while trying to optimize this simple problem I realized that this is not really a fizz buzz problem; it’s actually a concatenate-and-parse-integers problem.

The code is available on GitHub if you want to see all of these aberrations.