UUIDs and B-Trees and Legacy, Oh My!

I recently did a bit of work to improve the performance of a system. Normally this isn't the sort of thing I'd write about, but this particular adventure had a lot of interesting things I thought you might enjoy reading about. The system itself isn't all that special: it's a piece of middleware for an information system that sits between a user's application and a database, offering a CRUD API to clients.

I think this can serve as a bit of a case study for what I keep trying to tell people, which is that you cannot just "optimize" a system built without a care for performance at the end. While we were able to mitigate this particular performance issue, the system is still troubled. There are many deeply embedded performance issues that remain effectively unfixable. If you build software without a constant eye towards properties like performance and security throughout the process, you will end up with an insecure and/or inefficient system that you cannot just magically improve at the end. It remains structurally deficient.

With that out of the way, I hope you enjoy the read.

Identifiers (UUIDs)

The design decision that matters here is that the original authors chose to use UUIDs for record identification. All information systems require that every record be given an identifier so that it can be referenced and retrieved. This is true no matter the implementation of the system, from an analogue library, to a console video game, to something as massive as the internet itself. The identifier can be as simple as an index into an array or as complicated as a hierarchically coordinated identifier issuing agency like ISBN or IANA's IP address assignments. No matter the scale and forethought, all information systems require identifiers for their resources, and UUIDs are one way to assign them.

So how does choosing UUIDs for identifiers lead to a performance issue? Well, let's first look at what UUIDs are. UUIDs were first designed for the Apollo Network Computing System (NCS) in the 1980s. The NCS was an abstraction designed by Apollo Computer to make writing software for distributed systems resemble writing software for a single computer. UUIDs were specifically designed so that any number of computers can simultaneously generate identifiers that are guaranteed to be unique among all identifiers generated by all computers. These later made their way into the Open Software Foundation's (OSF) Distributed Computing Environment (DCE) and were standardized as ISO/IEC 11578:1996. Apollo was acquired by Hewlett-Packard in 1989, but the technology lived on to ultimately be standardized by RFC 4122 in 2005.

A UUID has a fairly common and recognizable representation as a hexadecimal string grouped with hyphens, for example 01234567-0123-0123-0123-0123456789ab. There's just one catch: UUIDs are not just one specification, but are instead generated using one of five different algorithms. The system in question uses version 4, the purely random UUID generation algorithm. In this version, the representation takes the form ________-____-4___-[89ab]___-____________ where _ represents any hexadecimal nibble ([0-9a-f]). Formally:

The version 4 UUID is meant for generating UUIDs from truly random or pseudo-random numbers.

The algorithm is as follows:

  • Set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively.
  • Set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the 4-bit version number from Section 4.1.3.
  • Set all the other bits to randomly (or pseudo-randomly) chosen values.
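The algorithm above maps directly onto Python's standard uuid module. As a quick sanity check (this snippet is just an illustration, not code from the system in question), every version 4 UUID matches the shape described earlier:

```python
import re
import uuid

# Generate a version 4 UUID; only the version and variant bits are fixed,
# everything else is (pseudo-)random.
u = uuid.uuid4()

# The canonical form matches ________-____-4___-[89ab]___-____________
pattern = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)
assert pattern.match(str(u))
assert u.version == 4
```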

Indexes (B-Trees)

Why is this important? Well, the system requires efficient lookups of this identifier to return the associated resource. The problem is that these identifiers are handed out randomly, each new identifier issued has as much a chance to be lexically early as it does to be late. Another way to say that is the chance the lead nibble is a 0 is just as likely as it is to be an f and this is true for each successive nibble. This by itself isn't an issue, but it is when you need to use an index like a B-Tree.

An index, if you're unaware, is just a data structure with efficient lookup and maintenance complexities. You're probably aware of runtime complexities like O(n). That, for example, is the complexity of the naive approach: looking through every single record for the one with a given identifier. It's not bad, but it's not fast enough when your system holds billions of records and you need latency in the millisecond range. To do better, all databases implement one or more indices. In the case of the existing database, this index was a B-Tree. B-Trees provide O(logk(n)) lookups at the cost of O(n + log(n)) additional storage and an O(n) write performance penalty.

You may wonder why we're willing to pay O(n) to write data (as well as that additional storage) when we won't pay that much to read the data. Well, it's a matter of what that n refers to. In the case of reads, that's every record that exists in the system. In the case of writes, that only represents the number of records we write at a given time. Given the system will be frequently searching for these records, the scale of the write penalty is returned by the read performance many times over. Just remember that there is a price to having an index and you should always make these trade-offs based on the circumstances and not just dogma.

Anyway, back to why B-Trees and UUIDs don't mix. The problem is that B-Trees are inherently lexicographic. Invented in 1970 by Rudolf Bayer and Edward M. McCreight while working at Boeing, a B-Tree is essentially a highly optimized binary tree. For simplicity, let me explain how a binary tree works and then how a B-Tree differs.

In a binary tree, each node has three parts:

  1. its value,
  2. a link to a child node of lower value, and
  3. a link to a child node of higher value.
Example of a binary tree.

To search for a given identifier, you start at the root node and compare the identifier you're looking for to the node's value. If the node doesn't hold that identifier, you follow the link to the higher child when your identifier is bigger than the node's value, and to the lower child when it's smaller. This subdivides the remaining nodes such that on average you only need to look at O(log2(n)) nodes to find the one you're looking for. If you then store in each of those nodes the location in the table for the record itself, you'd have a working index for your data.
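The walk just described can be sketched in a few lines of Python. Nothing here is from the real system; the node layout and names are just the three parts listed above:

```python
from typing import Optional

class Node:
    """One node of a binary search tree: a value plus lower/higher links."""
    def __init__(self, value: int, location: str,
                 lower: "Optional[Node]" = None, higher: "Optional[Node]" = None):
        self.value = value        # the identifier stored at this node
        self.location = location  # where the actual record lives in the table
        self.lower = lower        # child holding smaller identifiers
        self.higher = higher      # child holding larger identifiers

def search(node: Optional[Node], ident: int) -> Optional[str]:
    # Walk down from the root, discarding roughly half the remaining
    # nodes at each step.
    while node is not None:
        if ident == node.value:
            return node.location
        node = node.higher if ident > node.value else node.lower
    return None
```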

A B-Tree's only differences are that it's self-balancing and that each node holds many more elements. Self-balancing just means changes to the tree (like adding a new record) can cause it to spend some time moving nodes around to keep the tree at a nearly uniform depth. This ensures the search path is as short as possible for all nodes on average. The other difference, having more elements per node, is a simple concept that requires a bit more to understand.

Example of a B-tree.

Instead of a single value and lower/higher links, each node keeps an array of values and an array of links whose length is one longer than the values array. These arrays align such that they can be read interleaved: first a link to a lower node, then a link between each adjacent pair of values, and finally a link to a higher node. When searching for an identifier, you start at the root node as before, but instead scan the values array for the two elements that bound the identifier you're looking for. This picks the link that "sits between them." The nodes are designed so that all of the values in the linked node fall between the values that bound the link. This is where the k comes from in the complexity mentioned earlier: each node has k - 1 elements, so searching a node reduces the search space by a factor of k. This presents a puzzle though: doesn't this mean the computer is doing more work than it was in a binary tree?

Yes! Searching a B-Tree means at each node there's a linear scan of the values for the bounds of the identifier. This looks at way more values than you would following a strictly balanced binary tree. However, this is where reality kicks theory's butt. The value of k is chosen specifically based on the machine the B-Tree is running on to take as much advantage of things like the cache line and page size as possible. See, computers are not abstract pointer machines, but instead exhibit real performance characteristics based on data locality, a concept completely excluded from Big-O complexity models. In reality, following a link to a node elsewhere in memory is far more expensive than a linear search through a cache line's worth of data. B-Trees exploit this by packing the cache line.
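The per-node scan can be sketched with Python's bisect module. This is a toy single node, not a real B-Tree implementation; the key and child names are made up for illustration:

```python
import bisect

# A toy B-Tree node: k - 1 sorted keys and k child links, aligned so that
# children[i] holds all values between keys[i-1] and keys[i].
keys = [10, 20, 30]
children = ["c0", "c1", "c2", "c3"]

def descend(target: int):
    # Find which pair of keys bounds the target; this is the in-node scan
    # described above. An exact match means the value lives in this node.
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return ("found", target)
    return ("descend", children[i])
```

For example, descend(15) lands on the link that sits between 10 and 20, while descend(35) follows the final "higher" link.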

That doesn't really sound like a problem for UUIDs though. B-Trees should be able to handle UUIDs like any other identifier, shouldn't they? Well, the problem is what happens once you start to run out of memory. In reality this all has to live in memory on a system that only has so much. That memory is precious because of how much faster it is than the persistent storage where everything ultimately lives. Once the database starts to run out of memory for cached data, indexes, and its own process data/text, it has to make choices about what gets to sit in memory and what gets evicted to make room. Once evicted, a node has to be read back in before it can be used. In a system with random identifiers, every single node is equally likely to be needed, so the entire index has to stay in memory even when, say, 80% of the data is really old and not likely to be needed any time soon.

Picking a Path

Thus the performance issue is found. Now what? Well, now the real effort comes in picking trade-offs on how to solve it. See, UUIDs were not a great choice. Firstly, this system does not need to distribute identifier generation. Sure, the middleware runs on more than one machine, but the database is on a single server. Give the task of identifier generation to a single machine (like the database) and you don't need the complexities of universality. UUIDs also eat a significant amount of space compared to just about any other alternative. For example, you can hand out well over a billion sequential IDs using just a 32-bit integer. A UUID is at best 128 bits; however, this system isn't storing them as efficiently as possible. Instead it stores them as the hyphenated hexadecimal string you saw, which means they each take up 288 bits, or 2.25 times as much space as they really need.
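Those numbers are easy to verify with the uuid module (a quick illustration, independent of the system in question):

```python
import uuid

u = uuid.uuid4()

# Stored as the hyphenated hex string: 36 characters = 288 bits.
assert len(str(u)) == 36
# Stored as raw bytes: 16 bytes = 128 bits.
assert len(u.bytes) == 16
# The string form costs 2.25x the space of the raw form.
assert (36 * 8) / (16 * 8) == 2.25
```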

Problem is though, the system is fairly large and that identifier is part of the interface it exposes through its API. We cannot simply get rid of UUIDs without rewriting large parts of the system and the clients that depend on it. If changing the identifier is expensive, the alternative is to change the index, and our hope at this point is moving to a hash index. Unfortunately, MySQL doesn't offer any other index type for these tables, so we're going to have to change databases to do it.

PostgreSQL has technically had support for hash indexes since version 7.2, but they weren't production ready until the release of version 10. Over the years, the documentation page for indexes has had a prominent, though changing, note regarding the usage of hash indexes:

Note: Because of the limited utility of hash indexes, a B-tree index should generally be preferred over a hash index. We do not have sufficient evidence that hash indexes are actually faster than B-trees even for = comparisons. Moreover, hash indexes require coarser locks.

Note: Testing has shown PostgreSQL’s hash indexes to be similar or slower than B-tree indexes, and the index size and build time for hash indexes is much worse. Hash indexes also suffer poor performance under high concurrency. For these reasons, hash index use is discouraged.

Note: Testing has shown PostgreSQL’s hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is much worse. Furthermore, hash index operations are not presently WAL-logged, so hash indexes may need to be rebuilt with REINDEX after a database crash. For these reasons, hash index use is presently discouraged.

…the problems with hash indexes may be fixed eventually…

Caution: Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash if there were unwritten changes. Also, changes to hash indexes are not replicated over streaming or file-based replication after the initial base backup, so they give wrong answers to queries that subsequently use them. For these reasons, hash index use is presently discouraged.

With PostgreSQL 10, the warning was finally dropped. However, there are still many reasons not to use a hash index unless you really need it; a read through the new and dedicated hash index docs lists many remaining limitations.

That page is well worth a read if you're in a similar situation, because a hash index is not an upgrade but an alternative with its own considerations and trade-offs. Again, death to dogma in software development. In our case, though, it makes good sense to go this route.

As for how the hash index works, it's similar to a hash table. I have to say similar here because hash tables are still an active area of research; there are a large number of subtle variations in implementation that can deeply impact performance and utility. In PostgreSQL's case, values go through a function that projects the input to an integer. This integer is used to calculate an offset into a series of buckets. Each bucket is a collection of pages that contain some number of hashed values, each mapping to the row in the table to which it belongs. The values within a page are kept in sorted order to allow a binary search for a given value. Pages within a bucket are chained as a linked list. When a bucket gets too full, PostgreSQL automatically splits it to reduce the number of pages and values it has to search in a given bucket. A full description of the particulars can be found in PostgreSQL's Hash README, but that's essentially it: complexity of O(1) without lexicographic ordering. Sadly the entire index still has to fit in memory, but because PostgreSQL's hash indexes don't store the identifier itself, only the hashed output, they take up significantly less memory in our case.
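The bucket scheme can be sketched as a toy in Python. This is a drastically simplified illustration of the idea, not PostgreSQL's actual code: Python's built-in hash() stands in for the real hash function, and a plain list stands in for a bucket's page chain:

```python
NBUCKETS = 8

def bucket_for(value: str) -> int:
    # Hash the value to an integer, then derive a bucket number from it.
    return hash(value) % NBUCKETS

buckets = [[] for _ in range(NBUCKETS)]

def insert(value: str, row_pointer: int) -> None:
    # Store only the hash code and a pointer to the row, as described above;
    # the original value never enters the index.
    buckets[bucket_for(value)].append((hash(value), row_pointer))

def lookup(value: str) -> list[int]:
    h = hash(value)
    # Scan a single bucket. Real systems must recheck matches against the
    # table, since different values can share a hash code.
    return [ptr for (code, ptr) in buckets[bucket_for(value)] if code == h]
```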

Actually Cutting Over

So we have a plan: just up and replace the database. Easy, right? Well, easier than the alternative. In all, it took over a month of work to cut over. There are a couple of tiny differences between MySQL and PostgreSQL that required modifying the code, but all of these are internal details of the service that don't impact clients or the API.

If you've never migrated a live middleware system before, there's a fairly standard technique you use. Essentially you have your service:

  1. Begin writing to the new location on top of writing to the old.
  2. Start a process to migrate all the old data to the new location.
  3. Start reading from the new location.
  4. Stop writing to the old location.
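As a sketch, the dual-write and flag-flip parts of the steps above might look like this inside the service (all names here are hypothetical, not the real system's code):

```python
class DualWriteStore:
    """Writes go to both stores; reads honor a runtime cutover flag."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old = old_store
        self.new = new_store
        self.read_from_new = False  # flipped at cutover time (step 3)

    def write(self, key: str, value: str) -> None:
        # Step 1: write to the new location on top of the old one.
        self.old[key] = value
        self.new[key] = value

    def read(self, key: str) -> str:
        # Reads follow the flag, so the switch is a single config change.
        source = self.new if self.read_from_new else self.old
        return source[key]
```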

In our case we used a program that would copy the contents of the old database to the new, but also follow the binary log from the old database to catch up with all the changes and mirror those as well. This let us write to only the old database the whole time. We then combined the last two stages by using a runtime flag to change databases all at once. This required leaving gaps in records identified by auto-incrementing numbers so that any last-second changes to the old database would not conflict with the new ones. It worked well, but the program we used is fairly clunky and surprising, with arguably poor documentation, so I won't recommend it here.

Stringly Typed Systems

One of the changes we hoped to make in all this was reducing the degree to which this system was stringly typed, which would also reduce the amount of data stored. Stringly typed systems are systems that use strings for things that should otherwise have much finer constraints on the values they can take. In this case, encoding the UUID as a string results in over twice as much data being stored for no good reason. It also means the system doesn't enforce that these actually be UUIDs, something the unit tests of the system took copious advantage of.

In this system there are just over three hundred test cases, and changing the type of the field containing this UUID meant that nearly all of them would no longer work. See, UUIDs are huge. So huge that, given the chance, the original authors of this system took it upon themselves to use jokey strings instead of real UUIDs when creating test cases. This leads to a problem though. Reading and fixing that many tests, trying to keep straight all the different identifiers and who needs what, is a herculean task. Some test cases have dozens of these values in play and it's essential to their operation that they match. Given none of them look like UUIDs, even finding the strings that are now supposed to be UUIDs is intense. So how do you go about fixing this much code?

Well, first, add a couple of lines of code to the system that cause every test that attempts to create these broken records to fail. This was actually fairly easy, and now you know all the broken test cases. The next bit is way more fun. Write a wrapper function that you can slap around all these joke strings that hashes them into authentic UUIDs. This means you can spend a few hours just going through every broken test focused only on finding the strings that are supposed to be UUIDs, without having to worry too much about keeping track of the test's logic. Instead, wrap the strings in a friendly uuid_hash("myJokeID") and move on. So here's such a function I wrote:

import hashlib
import uuid

def uuid_hash(val: str) -> str:
	"""
	This is a hack to cover for over 300 broken tests that assumed the IDs were only strings.
	Passing any string will return a UUID v4 that is unique to the input.

	Arguments:
		val: str
			The string to hash to a UUID.

	Returns: str
		A UUID string based on the val passed.
	"""
	val_bytes = val.encode()
	val_hashed = hashlib.md5(val_bytes).digest()
	val_array = bytearray(val_hashed)

	# We need it to be a UUID v4 like: ________-____-4___-[89ab]___-____________
	val_array[6] = val_array[6] & 0b0000_1111 | 0b0100_0000  # :=      4[0-9a-f]
	val_array[8] = val_array[8] & 0b0011_1111 | 0b1000_0000  # := [89ab][0-9a-f]

	val_uuid = str(uuid.UUID(bytes=bytes(val_array)))

	return val_uuid

Magic! Several hours of mindless coding later and every test is now passing. For a bit of a breakdown on how it works, there's just three parts. Part 1:

val_bytes = val.encode()
val_hashed = hashlib.md5(val_bytes).digest()
val_array = bytearray(val_hashed)

This converts the given string into bytes and then hashes those bytes into an MD5 digest (as binary data, not a hex string). Quick note on the use of MD5: never use MD5 in security-critical code. It is completely broken against malicious inputs. Here though, it's a perfectly fine 128-bit digest. All we need is a deterministic mapping from arbitrary strings to UUIDs, and this works just fine. Next is part 2:

val_array[6] = val_array[6] & 0b0000_1111 | 0b0100_0000
val_array[8] = val_array[8] & 0b0011_1111 | 0b1000_0000

Here we set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the value 4 required by the UUID version 4 specification. We also set bits 6 and 7 of the clock_seq_hi_and_reserved to zero and one, respectively. We do this by first using a bitwise AND (&) to apply a bitmask over the byte that contains the targeted bits. Wherever the mask has a 1, that bit of the original byte is left as it is, but any 0 in the mask forces that bit to 0 because 0 & x = 0. Then we use a bitwise OR (|) to set the bits that must become 1s.
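To see the masking in action, here's the same pair of operations applied to an arbitrary sample byte:

```python
# Walk through the masking on a sample byte, 0b1111_0010:
b = 0b1111_0010

# Step 1: the AND clears the high nibble (0 & x == 0), keeping the low nibble.
assert b & 0b0000_1111 == 0b0000_0010

# Step 2: the OR forces the version nibble to 4 (1 | x == 1).
assert b & 0b0000_1111 | 0b0100_0000 == 0b0100_0010  # i.e. 0x42

# Same idea for the variant byte: the result always begins with bits 10.
assert (b & 0b0011_1111 | 0b1000_0000) >> 6 == 0b10
```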

Finally, for part 3, we use str(uuid.UUID(bytes=bytes(val_array))) to let the Python uuid module convert those bytes into a uuid.UUID() object, which str() renders in the canonical hyphenated form via its uuid.UUID.__str__() method.

There's a bit of a problem with this solution though: now there are thousands of calls to this function, and the tests take several seconds longer to run than they used to. Well, here's one of the most beautiful things about Python in my opinion: runtime introspection. Python has amazing runtime introspection, meaning with just a little more code we can have this code delete itself. Seriously, here's what we can add just before the return statement:

import inspect
import re

…

call_match = re.compile(rf"uuid_hash\(\s*(['\"]){re.escape(val)}(['\"])\s*\)")
with open(inspect.stack()[1].filename, "r+") as fptr:
	content = fptr.read()
	new_content = call_match.sub(rf"\g<1>{val_uuid}\g<2>", content)

	fptr.seek(0)
	fptr.truncate()
	fptr.write(new_content)

The inspect module allows us to get the call stack of the currently running code, which tells us what file the calling function lives in. Using this, we build a fairly simple regular expression that searches for a call to uuid_hash() with the value we actually received, and use it to replace that function call with what this function returns. Putting it all together we get:

import hashlib
import inspect
import re
import uuid

def uuid_hash(val: str) -> str:
	"""
	This is a hack to cover for over 300 broken tests that assumed the IDs were only strings.
	Passing any string will return a UUID v4 that is unique to the input.

	Arguments:
		val: str
			The string to hash to a UUID.

	Returns: str
		A UUID string based on the val passed.
	"""
	val_bytes = val.encode()
	val_hashed = hashlib.md5(val_bytes).digest()
	val_array = bytearray(val_hashed)

	# We need it to be a UUID v4 like: ________-____-4___-[89ab]___-____________
	val_array[6] = val_array[6] & 0b0000_1111 | 0b0100_0000  # :=      4[0-9a-f]
	val_array[8] = val_array[8] & 0b0011_1111 | 0b1000_0000  # := [89ab][0-9a-f]

	val_uuid = str(uuid.UUID(bytes=bytes(val_array)))

	call_match = re.compile(rf"uuid_hash\(\s*(['\"]){re.escape(val)}(['\"])\s*\)")
	with open(inspect.stack()[1].filename, "r+") as fptr:
		content = fptr.read()
		new_content = call_match.sub(rf"\g<1>{val_uuid}\g<2>", content)

		fptr.seek(0)
		fptr.truncate()
		fptr.write(new_content)

	return val_uuid

chef kiss

Python UUID Gotcha

One other side tidbit I discovered in all this is that the uuid.UUID.__init__() initializer is not sufficient for validating that a string is a valid UUID. One change a colleague added was better input validation on these strings, to ensure client applications actually send us UUIDs (something we should have always done, but I digress). To do this they added, effectively:

data = {…, "id": str, …}

try:
	uuid.UUID(data["id"])
except ValueError:
	return HTTPError(400, f"Bad id value {data['id']}.")

I had a hunch this would be pointlessly slow, given that at a minimum it deserializes the string and allocates a new object only to immediately throw it away. So I decided to do two things. First, I used timeit to compare this against a compiled regular expression for validation. The regex is almost three times as fast:

import timeit


# Case 1 ]==================================================
setup = """
import uuid

val = "56541e71-532c-4133-92c9-ef519cb460df"
"""

stmt = """
uuid.UUID(val)
"""

time1 = timeit.timeit(stmt, setup)


# Case 2 ]==================================================
setup = """
import re

val = "56541e71-532c-4133-92c9-ef519cb460df"
uuid_re = re.compile(
	"^"
	"[0-9a-f]{8}-"
	"[0-9a-f]{4}-"
	"4[0-9a-f]{3}-"
	"[89ab][0-9a-f]{3}-"
	"[0-9a-f]{12}"
	"$"
)
"""

stmt = """
uuid_re.match(val)
"""

time2 = timeit.timeit(stmt, setup)


# Conclusion ]==============================================
print(f"It's {time1 / time2 :.3} times as fast!")
It's 2.83 times as fast!

There's another issue though if we dig a bit deeper. If we read the source of the uuid module, we can see the initializer is not designed to securely verify the input. Relevant code:

hex = hex.replace('urn:', '').replace('uuid:', '')
hex = hex.strip('{}').replace('-', '')
if len(hex) != 32:
	raise ValueError('badly formed hexadecimal UUID string')
int = int_(hex, 16)

Here hex is the first argument of the initializer. This code means all sorts of non-UUID strings will parse into valid uuid.UUID() objects. For example, {e32682f1-furn:25d----4bd-9-8c21-dbb-8fd-349c18}}}uuid:. I'm sure there's code out there that assumes this can be used for input validation. There are probably a whole host of security lessons in here, like knowing what the code you write is actually doing, being clear in your documentation about what your API should and should not be used for, or not using a cannon to kill a fly. I'll leave drawing those as an exercise for the reader.
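A quick demonstration of just how lenient that parsing is, using the mangled example string from above:

```python
import uuid

# This is nowhere near a canonical UUID string, yet it parses without error:
# the initializer strips 'urn:', 'uuid:', braces, and hyphens before checking
# that 32 hex digits remain.
mangled = "{e32682f1-furn:25d----4bd-9-8c21-dbb-8fd-349c18}}}uuid:"
parsed = uuid.UUID(mangled)
assert str(parsed) == "e32682f1-f25d-4bd9-8c21-dbb8fd349c18"
```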

Update: Brand New UUID Versions?

I found a fun bit of news to share that I read literally the same day I published this. Turns out there's a draft RFC (at the time of writing) to create 3 new UUID versions, bringing the total to 8. These new versions are designed specifically to deal with this problem. Well then… My thanks to Brad Peabody and Kyzer Davis for working to standardize this.

How do they work? Well, they all start with a big-endian encoded timestamp, which means they all sort lexically by time.

UUID version 6 keeps the Gregorian timestamp from UUID version 1 (a count of 100-nanosecond intervals since 15 October 1582) but reorders its parts so it sorts properly. It also replaces the MAC address with random bits. This is designed to be the most "drop in" replacement for existing UUID implementations.

UUID version 7 is a 36-bit unix timestamp with arbitrary precision and a sequence ID, followed by random bits. 36 bits for the timestamp means it has rollover problems on August 20th, 4147 instead of in 2038 as the normal 32-bit variant does. The sequence bits ensure a single system generating multiple UUIDs at the same time keeps the resulting records in sorted order. Arbitrary precision means it supports any level of sub-second resolution, down past nanoseconds if you really want that, and it does so in a way that lets different precisions coexist within the same system. This is the version I would personally encourage people to use. It also seems to be the one the authors favour as well.
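The core idea behind version 7 is easy to demonstrate. This sketch is only an illustration of the general principle (a big-endian timestamp prefix followed by random bits), not the draft's exact bit layout, and it deliberately skips the version/variant bits and sequence field:

```python
import secrets
import uuid

def timestamp_first_uuid(unix_ms: int) -> uuid.UUID:
    """Illustrative only: a big-endian timestamp up front, random bits after."""
    ts = unix_ms.to_bytes(6, "big")    # 48 bits of timestamp, most significant first
    rand = secrets.token_bytes(10)     # remaining 80 bits random
    return uuid.UUID(bytes=ts + rand)

# Because the timestamp leads and is big-endian, string order follows time order.
earlier = timestamp_first_uuid(1_000_000)
later = timestamp_first_uuid(2_000_000)
assert str(earlier) < str(later)
```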

UUID version 8 is a format designed for use cases that need a custom timestamp format. By specification, it requires you verify you cannot use versions 1, 6, or 7 before you resort to version 8. It's similar to version 7 but doesn't require the time be a unix or Gregorian timestamp. It still has the sequence field followed by random data, but that random data can also be node identification data. I mean, if you really want to share your MAC address (or other node identifiers) despite the privacy implications. Ultimately, version 8 is a kind of roll-your-own format. It may have uses if you want identifiers to carry more meaning than simply identifying a resource, though I'd personally encourage exposing those values as attributes of a record a client can query instead of packing them into the identifier you share with everyone. That way you can better authorize who can see what attributes of a record based only on an opaque token, but your use case may vary.