September 16, 2024

Sorted Identifier Generators in Distributed Systems

Written by: Denis Selimović, Cloud Services Backend Engineer

Different methods can be used for unique identifiers as primary keys, ranging from simple auto-incrementing integers to various schemes using pseudo-random number generators, such as UUID v4. When choosing a method, we consider factors such as ease of generation, prevention of enumeration (predicting the next identifier if one is known), indexability, generation speed, etc. Sometimes, there is a need to generate unique identifiers that can be sorted chronologically. These identifiers can be used for event scheduling in time, as well as for advanced pagination methods that require database objects to be sorted by their primary key.

A seemingly simple solution is to delegate the responsibility for generating sorted unique identifiers to the database (in the form of auto-incrementing primary columns). There are several reasons why this approach should be avoided. With auto-incrementing columns, predicting the next identifier becomes easy, which can be considered a security flaw.

This approach completely fails when using a distributed database employing techniques like sharding. In this case, each database instance generates identifiers separately, which will inevitably lead to collisions, i.e., repeating the same identifier multiple times. To prevent collision problems, we would need to devise coordination between multiple instances, which is itself a more difficult problem than generating sorted identifiers.

With only one database instance and ignoring possible enumeration, the system cannot function properly when that single instance is unavailable (single point of failure). Obviously, for distributed systems, the process of generating unique identifiers that can be sorted over time must also be distributed. 

UUID v1 and UUID v6

UUID (Universally Unique Identifier) is a 128-bit sequence commonly used as an identifier in computer systems. With a proper generator, the chances of two systems generating the same UUID are negligible, though not zero. This can be achieved without a central authority or the need for coordination between multiple instances used as generators. These properties are precisely what we seek in distributed systems to maximize the efficiency of identifier generation. There are multiple versions of UUID, the most well-known probably being UUID v4, which we’ve already mentioned, and is often used as the primary key in relational databases. UUID v4 uses pseudo-random numbers when generating new values, making enumeration practically impossible. As such, they are not suitable for sorting by time.

Fortunately, there are two versions of UUID (v1 and v6) that contain a time component and can be used for such purposes.

UUID v1 uses a 48-bit MAC address of the instance to which 60 bits representing the instance’s timestamp at the time of generation are added. This timestamp counts how many 100ns intervals have passed since October 15, 1582. There is also a 16-bit clock sequence that distinguishes two UUIDs generated at the same time. This is crucial in distributed systems where multiple instances generate identifiers simultaneously, or when a single instance generates multiple identifiers in one millisecond. The remaining 4 bits are used to mark the version. The maximum number of identifiers generated with UUID v1 is 163 billion per second per instance. The 48-bit identifier, which is essentially the MAC address, guarantees that two UUID v1 identifiers from two different instances will be different given the uniqueness of these addresses. 

Format%20UUIDv1%20

UUIDv1 format

UUID v6 uses the same scheme as UUID v1, but the bits in the timestamp are reversed, allowing UUID v6 to be used as a generator of unique identifiers that can be sorted over time. One drawback of such identifiers is that it reveals the MAC address of the NIC that generated the identifier, which could be a security issue. While the 128-bit UUID scheme is shorter than some other identifiers, it’s still a significant amount of data, especially for primary keys and indexes. Although UUID v6 indexes better than UUID v4, the situation is far from ideal, especially for indexes that require sorted data (e.g., b-tree indexes). Nevertheless, they can still play a role as generators of sorted values in certain cases. Most programming languages have libraries implemented for these generators, which should definitely be utilized. 

ULID

ULID (Universally Unique Lexicographically Sortable Identifier) aims to solve all the problems associated with UUID schemes. It is a 128-bit identifier, with 48 bits allocated for a UNIX timestamp and the remaining 80 bits representing a pseudo-random number generated using some cryptographically secure method. Using the UNIX timestamp allows ULID identifiers to be sorted chronologically. For identifiers generated in the same millisecond, the order is undefined, which should be considered when implementing these generators. Although the scheme consists of 128 bits, it is often encoded as a string of 26 characters (instead of 36) using the Crockford Base32 algorithm for efficient encoding. In addition, the scheme contains no special characters (URL safe) and is compatible with UUID as they have the same format. The 80-bit pseudo-random number allows for more than one septillion (1.21e+24) different identifiers in one millisecond, though this limit is impossible to reach due to hardware constraints. 

Format%20ULID%20identifikatora%20

ULID identifiers format

A ULID identifier generator must guarantee monotonicity. If the last and current identifier are detected to be in the same millisecond, the 80-bit random number must be different for these two identifiers to ensure uniqueness.

When implementing, it’s important to ensure that uppercase and lowercase letters are not distinguished (case-insensitive format). Due to its properties, using ULID in indexes or for partitioning data in the database is much more efficient than in the case of UUID identifiers. Additionally, they can be used in distributed transactions (using sagas) to correctly determine the order of events.

The final format contains 26 characters (after Crockford encoding), with 10 characters for the UNIX timestamp and 16 characters for the random part. The simplest strategy for resolving conflicts with identifiers generated in the same millisecond is to increment the random part by 1, which guarantees uniqueness and monotonicity. The downside of this approach is that it becomes easier to predict the next identifier if it was generated in the same millisecond. Although not as popular as UUID identifiers, there is support in most programming languages. When choosing a library, make sure the generator is implemented correctly. 

Snowflake ID

Snowflake ID is both the name of a scheme for unique sorted identifiers and the name of a dedicated distributed service that generates these identifiers. The name Snowflake was chosen because every snowflake is unique. The scheme was devised by Twitter engineers (now X) about a decade ago. Although no longer used as such, it serves as the basis and inspiration for other schemes of this kind.

It is a completely distributed solution using Zookeeper to maintain the cluster of instances. Obviously, this is a complex solution not suitable for simpler needs. However, if a high-scale solution is needed, Snowflake ID or one of its derivatives is the right choice. The architecture of this dedicated service includes dedicated instances for identity discovery (service discovery). The need for such services becomes clear from the identifier format itself. The distributed architecture allows identifiers to be generated even if a certain part of the instances is currently or permanently unavailable.

It uses 64-bit identifiers, meaning that storage requirements are half that of 128-bit UUIDs and ULIDs. 41 bits are reserved for the timestamp. 10 bits are used as an identifier for the instance generating the Snowflake ID (this identifier is assigned by the dedicated identity discovery instance), of which 5 bits are reserved for the datacenter ID and 5 bits for the instance ID. 12 bits are reserved for a sequential number used in case of a collision (when two identifiers are generated in the same millisecond). Thus, a maximum of 4096 identifiers is supported in the same millisecond per instance, and the maximum number of instances is 1028, giving us a maximum of 4,210,688 identifiers in one millisecond. All of this occupies 63 bits, with the last bit saved for future applications and currently unused.

Since the timestamp is the first part of the identifier, Snowflake ID can be sorted chronologically. If two identifiers from two different instances have the same timestamp, the instance identifier is used to determine which identifier was generated first.

Format%20Snowflake%20identifikatora%20

Snowflake identifier format

The generation process ensures that identifiers are unique. The generator must ensure that the identifiers are also monotonic by using the sequential number in case of collisions. Snowflake ID requires additional architecture, so it can lead to very complex solutions. Maintaining Zookeeper is not an easy task and requires a lot of financial and hardware resources. On top of that, the separate architecture can include services for generating Snowflake server identities. Independent implementation is probably not a good idea, and one should turn to the rather good open-source libraries available. Also, the original code for Snowflake ID is publicly available to everyone.

Of course, there are other methods that may be more suitable in certain cases. The task of generation is not simple and requires a lot of knowledge about distributed systems. In the case of a distributed solution, attention must be paid to ensuring that all instances have synchronized internal clocks based on a precise NTP server. Any deviation between individual instances can lead to incorrect results.

For those interested, more complex schemes include MongoDB ObjectId, Instagram ShardID, Flake, and Sonyflake.