Generating unique IDs in a distributed environment at high scale

I was recently working on an application that had a sharded MySQL database. While working on this application, I needed to generate unique IDs that could be used as the primary keys in the tables.

When you’re working with a single MySQL database, you can simply use an auto-increment ID as the primary key, But this won’t work in a sharded MySQL database.

So I looked at various existing solutions for this, and finally wrote a simple 64-bit unique ID generator that was inspired by a similar service by Twitter called Twitter snowflake.

In this article, I’ll share a simplified version of the unique ID generator that will work for any use-case of generating unique IDs in a distributed environment, not just sharded databases.

I’ll also outline other existing solutions and discuss their pros and cons.

Existing Solutions

UUID

UUIDs are 128-bit hexadecimal numbers that are globally unique. The chances of the same UUID getting generated twice is negligible.

The problem with UUIDs is that they are very big in size and don’t index well. When your dataset increases, the index size increases as well and the query performance takes a hit.

MongoDB’s ObjectId

MongoDB’s ObjectIDs are 12-byte (96-bit) hexadecimal numbers that are made up of -

  • a 4-byte epoch timestamp in seconds,
  • a 3-byte machine identifier,
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.

This is smaller than the earlier 128-bit UUID. But again the size is relatively longer than what we normally have in a single MySQL auto-increment field (a 64-bit bigint value).

Database Ticket Servers

This approach uses a centralized database server to generate unique incrementing IDs. It’s like a centralized auto-increment. This approach is used by Flickr.

The problem with this approach is that the ticket server can become a write bottleneck. Moreover, you introduce one more component in your infrastructure that you need to manage and scale.

Twitter Snowflake

Twitter snowflake is a dedicated network service for generating 64-bit unique IDs at high scale. The IDs generated by this service are roughly time sortable.

The IDs are made up of the following components:

  • Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
  • Configured machine id - 10 bits (gives us up to 1024 machines)
  • Sequence number - 12 bits (A local counter per machine that rolls over every 4096)

The extra 1 bit is reserved for future purposes. Since the IDs use timestamp as the first component, they are time sortable.

The IDs generated by twitter snowflake fits in 64-bits and are time sortable, which is great. That’s what we want.

But If we use Twitter snowflake, we’ll again be introducing another component in our infrastructure that we need to maintain.

Distributed 64-bit unique ID generator inspired by Twitter Snowflake

Finally, I wrote a simple sequence generator that generates 64-bit IDs based on the concepts outlined in the Twitter snowflake service.

The IDs generated by this sequence generator are composed of -

  • Epoch timestamp in milliseconds precision - 41 bits. The maximum timestamp that can be represented using 41 bits is 241 - 1, or 2199023255551, which comes out to be Wednesday, September 7, 2039 3:47:35.551 PM. That gives us 69 years with respect to a custom epoch.
  • Node ID - 10 bits. This gives us 1024 nodes/machines.
  • Local counter per machine - 12 bits. The counter’s max value would be 4095.

The remaining 1-bit is the signed bit and it is always set to 0.

Your microservices can use this Sequence Generator to generate IDs independently. This is efficient and fits in the size of a bigint.

Here is the complete program -

import java.net.NetworkInterface;
import java.security.SecureRandom;
import java.time.Instant;
import java.util.Enumeration;

/**
 * Distributed Sequence Generator.
 * Inspired by Twitter snowflake: https://github.com/twitter/snowflake/tree/snowflake-2010
 *
 * This class should be used as a Singleton.
 * Make sure that you create and reuse a Single instance of SequenceGenerator per node in your distributed system cluster.
 */
public class SequenceGenerator {
    private static final int UNUSED_BITS = 1; // Sign bit, Unused (always set to 0)
    private static final int EPOCH_BITS = 41;
    private static final int NODE_ID_BITS = 10;
    private static final int SEQUENCE_BITS = 12;

    private static final int maxNodeId = (int)(Math.pow(2, NODE_ID_BITS) - 1);
    private static final int maxSequence = (int)(Math.pow(2, SEQUENCE_BITS) - 1);

    // Custom Epoch (January 1, 2015 Midnight UTC = 2015-01-01T00:00:00Z)
    private static final long CUSTOM_EPOCH = 1420070400000L;

    private final int nodeId;

    private volatile long lastTimestamp = -1L;
    private volatile long sequence = 0L;

    // Create SequenceGenerator with a nodeId
    public SequenceGenerator(int nodeId) {
        if(nodeId < 0 || nodeId > maxNodeId) {
            throw new IllegalArgumentException(String.format("NodeId must be between %d and %d", 0, maxNodeId));
        }
        this.nodeId = nodeId;
    }

    // Let SequenceGenerator generate a nodeId
    public SequenceGenerator() {
        this.nodeId = createNodeId();
    }

    public synchronized long nextId() {
        long currentTimestamp = timestamp();

        if(currentTimestamp < lastTimestamp) {
            throw new IllegalStateException("Invalid System Clock!");
        }

        if (currentTimestamp == lastTimestamp) {
            sequence = (sequence + 1) & maxSequence;
            if(sequence == 0) {
                // Sequence Exhausted, wait till next millisecond.
                currentTimestamp = waitNextMillis(currentTimestamp);
            }
        } else {
            // reset sequence to start with zero for the next millisecond
            sequence = 0;
        }

        lastTimestamp = currentTimestamp;

        long id = currentTimestamp << (NODE_ID_BITS + SEQUENCE_BITS);
        id |= (nodeId << SEQUENCE_BITS);
        id |= sequence;
        return id;
    }


    // Get current timestamp in milliseconds, adjust for the custom epoch.
    private static long timestamp() {
        return Instant.now().toEpochMilli() - CUSTOM_EPOCH;
    }

    // Block and wait till next millisecond
    private long waitNextMillis(long currentTimestamp) {
        while (currentTimestamp == lastTimestamp) {
            currentTimestamp = timestamp();
        }
        return currentTimestamp;
    }

    private int createNodeId() {
        int nodeId;
        try {
            StringBuilder sb = new StringBuilder();
            Enumeration<NetworkInterface> networkInterfaces = NetworkInterface.getNetworkInterfaces();
            while (networkInterfaces.hasMoreElements()) {
                NetworkInterface networkInterface = networkInterfaces.nextElement();
                byte[] mac = networkInterface.getHardwareAddress();
                if (mac != null) {
                    for(int i = 0; i < mac.length; i++) {
                        sb.append(String.format("%02X", mac[i]));
                    }
                }
            }
            nodeId = sb.toString().hashCode();
        } catch (Exception ex) {
            nodeId = (new SecureRandom().nextInt());
        }
        nodeId = nodeId & maxNodeId;
        return nodeId;
    }
}

The above generator uses the system’s MAC address to create a unique identifier for the Node. You can also supply a NodeID to the sequence generator. That will guarantee uniqueness.

Let’s now understand how it works. Let’s say it’s June 9, 2018 10:00:00 AM GMT. The epoch timestamp for this particular time is 1528538400000.

First of all, we adjust our timestamp with respect to the custom epoch-

currentTimestamp = 1528538400000 - 1420070400000 // 108468000000 (Adjust for custom epoch)

Now, the first 41 bits of the ID (after the signed bit) will be filled with the epoch timestamp. Let’s do that using a left-shift -

id = currentTimestamp << (10 + 12)

Next, we take the configured node ID and fill the next 10 bits with the node ID. Let’s say that the nodeId is 786 -

id |= nodeId << 12

Finally, we fill the last 12 bits with the local counter. Considering the counter’s next value is 3450, i.e. sequence = 3450, the final ID is obtained like so -

id |= sequence  // 454947766275219456

That gives us our final ID.

More Learning Resources