>With the release of UUIDv7 that offers some benefits as ULIDs and are native to Postgres as of December 2024 (see the commit here), it might be better to switch to UUIDv7 in the future if one doesn’t care about URL friendliness.
Yes, I think UUIDv7 would be a much better choice especially because you could continue to use the UUID type in postgres and not need to devolve to text. You could also choose to encode the IDs with base32/58/64 at the edge to make them shorter and more URL friendly, though that adds a complexity to your application in tracking database IDs separately from public IDs.
I wish UUID would specify a more url-friendly standard representation format beyond the hex string with dashes.
timewizard 22 days ago [-]
> encode the IDs with base32/58/64 at the edge to make them shorter
Having tried this I immediately regretted it. Storage is not costly enough to justify the additional pain points that you've correctly identified.
> UUID would specify a more url-friendly standard representation
There's always the 2.25 OID space via URN (urn:oid:2.25.12345...). In which case you encode the underlying integer directly without any grouping punctuation involved.
For the same reason above you should use probably use a single encoding for all use cases, at which point, just using the ugly 8-4-4-4-12 will save you the most trouble.
aftbit 21 days ago [-]
Not sure why your comment was flagged. I also somewhat regret using base58 UUIDs, but with sufficient DB and app-side helpers, and sufficient discipline to always convert at the edges, it became tolerable. It was the only option I could come up with to retrofit short IDs onto a system designed with UUIDs where a project owner decided our URLs needed to be shorter and prettier late in the development process.
TranquilMarmot 21 days ago [-]
I don't think they're suggesting storing them in base32/58/64, but just having that be how they're presented to the user. We do this in some of our APIs- if the URL has an ID of a specific length we try to turn it into a full UUID first before passing it on to other code. In the postgres database, the ID column is still a UUID type.
layer8 21 days ago [-]
What’s wrong with the hex-and-dash representation? It’s the textual representation recommended by RFC 9562, and it’s immediately recognizable as a UUID.
There’s also a URN namespace defined for it, if an absolute URI is needed or if one wants to be more explicit:
urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6
aftbit 21 days ago [-]
It's too long. That's the only real problem with it. The same UUID in base58 would be:
Xe22UfxT3rxcKJEAfL5373
Which is 22 characters instead of 36.
layer8 21 days ago [-]
Too long with respect to what practical requirements? It’s still shorter than the usual hex representation of a full Git hash for example, and I don’t see calls to encode those as Base58. The dashes also make for a more readable structure.
sagarm 21 days ago [-]
You get long ugly URLs. The system in work on often has 4-5 of these IDs in the URL, making working with them -- like copying and pasting them, or even extracting the particular id you care about from the path -- cumbersome.
TranquilMarmot 21 days ago [-]
+1 for this, UUIDs in URLs is such a pain. For the app we're working on we went with UUIDs and often have 4+ in the URL as well. So ugly and cumbersome.
Worst part is that you can't double-click on one to highlight the whole thing, you have to drag your cursor over it.
At a previous company, we worked _really_ hard to come up with a "4x4" ID system (i.e. a1b2-c3d4) because they'd often have to be read over the phone. Originally, we worried we'd run out of them but after 15+ years it seems like they're still going strong.
aftbit 21 days ago [-]
Here's some python code that implements what I discussed:
def decode_id(id):
if id is None:
return None
return str(uuid.UUID(bytes=base58.b58decode(id)))
def encode_id(id):
if id is None:
return None
if not isinstance(id, uuid.UUID):
id = uuid.UUID(hex=id)
return base58.b58encode(id.bytes)
def ensure_id(id):
if id is None:
return None
try:
return decode_id(id)
except Exception:
try:
encode_id(id)
except Exception:
return None
else:
return id
tttp 21 days ago [-]
we encoded the id with a base that removes most vowels (to avoid generating words, potentially offensive ones, and added a checksum to prevent copy paste mistakes
We had a Django project using ULIDs and it just caused headache after headache when interacting with Postgres, and we had all sorts of weird extensions to try to get it working that blocked a Django upgrade. I ended up ripping it all out and just using UUIDv7 everywhere, much more standard.
jerrygoyal 21 days ago [-]
why uuidv7 isn't url-friendly?
aftbit 21 days ago [-]
It is just too long. It does not include any illegal characters, but it is ~50% larger than it needs to be.
mgoetzke 21 days ago [-]
They are quite helpful, but one should be aware of information leakage if a database item id is potentially visible to readers with access to the entire data entry.
In most systems that is possible via references , and this could allow unauthorized users to deduce the timing of certain events that happened.
Whether this is of any concern OSS of course domain dependent. We will keep using v4 by default, but allow newer methods where applicable
8organicbits 22 days ago [-]
> Random bits are incremented sequentially within the same millisecond
That surprised me. This provides sub-millisecond sorting when the same generator is used (I.E. same process) but doesn't hold across different processes. So you still have unsorted sub-millisecond events in a distributed system, so the concern isn't fully eliminated. It looks like a decent performance optimization though since it reduces calls to generate random bits.
I ended up reading RFC 9562, which talks about a bunch of ideas and tradeoffs with this sort of sub-millisecond sorting.
Yeah this basically negates a lot of the advantages of a unique id IMO:
- scaling ID generation for a distributed system and avoiding a synchronization point
- optimistic locking and idempotency keys
atombender 21 days ago [-]
A few small issues:
- They call generate_ulid(now()). This returns the transaction timestamp, so all the timestamps will be the same. They should be using clock_timestamp().
- It also appears the generate_uuid() function they're using (which is not explained) is implemented with PL/PGSQL and is quite slow. There is a native C extension called pg-ulid [1] which is much faster; about 15% faster than Postgres' gen_random_uuid().
- Using EXPLAIN ANALYZE to benchmark stuff is a bad idea in generally. It will not give realistic timings, and it has a lot of overhead. EXPLAIN ANALYZE is intended to debug a query plan, not benchmark it.
Instead of using EXPLAIN, you can use COPY:
COPY (SELECT ...) TO '/dev/null' (FORMAT BINARY);
This has the advantage that it is more realistic, since the server has to actually serialize the results, so you get an approximation of that overhead. If you're using psql, you can enable timings and use \copy:
\timing on
\copy (SELECT ...) TO '/dev/null' (FORMAT BINARY);
This will transfer the data from the server to psql, so it will include network time, which makes the benchmark more realistic.
I used ULIDs for a time until i discovered snowflake ids. They are (“only”) 64 bits, but incorporate timestamps and randomness as well. They take up way less space than ULIDs for this purpose and offer acceptably rare collisions for things I’ve worked on.
beala 22 days ago [-]
The original snowflake id developed at twitter contains a sequence number so they should never collide unless you manage to overflow the sequence number in a single millisecond.
sgarland 21 days ago [-]
Also, you can store them as a BIGINT, which is awesome. So much smaller than even a binary-encoded UUID. IIRC the spec reserves the right to use the sign bit, so if you’re concerned, use BIGINT UNSIGNED (natively in MySQL, or via extension in Postgres).
I wish more people cared about the underlying tech of their storage layer – UUIDv4 as a string is basically the worst-case scenario for a PK, especially for MySQL / InnoDB.
OutOfHere 21 days ago [-]
The only ID type I like is hash based as it can be reproducibly reconstructed from a source tuple without having to look it up. Everything else requires a lookup.
kardos 21 days ago [-]
So why are they 3.27x slower to insert? Are they 3.27x longer in string form?
sgarland 21 days ago [-]
It's likely a function of the fact that `gen_random_uuid()` is implemented in C [0], and is essentially just reading from `/dev/urandom`, then modifying the variant and version bits. Whereas, assuming they're using something like what was described here [1], that's a lot of function calls within Postgres, which slows it down.
As an example, this small function that makes UUIDv4:
postgres=# CREATE OR REPLACE FUNCTION custom_uuid_v4() RETURNS uuid AS $$
SELECT encode(set_byte(set_byte(gen_random_bytes(16), 6, (get_byte(gen_random_bytes(1), 0) & 15) | 64), 8, (get_byte(gen_random_bytes(1), 0) & 63) | 128), 'hex')::uuid;
$$ LANGUAGE sql;
Took 14.5 seconds to create / insert 1,000,000 rows into a temp table, compared to 7.1 seconds for `gen_random_uuid()`.
I don't think that's right. They show in the section titled "Generating" that the performance of calling the ULID function from SQL is only very slightly slower. It's the INSERT that performs worse.
Generally, inserting sorted values (like sequential integers or in this case, ULIDs) into a B-tree index is much faster than inserting random values. This is because inserted values go into the same, highly packed B-tree nodes, whereas random inserts will need to create a lot of scattered B-tree nodes, resulting in more pages written. Random values are generally faster to query, but slower to insert.
In this case I think the insert speed differences may come down to the sizes of the keys. Postgres's native UUID type is 128 bits, or 16 bytes, whereas the ULID is stored as the "text" type, encoded as base32, resulting in a string that is 26 bytes, plus a 32-bit string length header, so 240 bits in total, or 1.87x longer. In the benchmark, the ULID insert is about 3x that of the UUID. So the overhead may be not just the extra space but the overhead of string comparisons compared to just comparing 128-bit ints.
Edit: The article doesn't actually say which ULID implementation they use. The one implemented in PL/PGSQL mentioned in one of the article's links [1] is very slow. The other [2] is quite fast, but doesn't use base32. However, this [3] native C extension is fast, about 15% faster than the UUID function on my machine.
On my machine, using pg-ulid, inserting 1M rows was on average 1.2x faster for UUID than ULID (mean: 963ms vs 1131ms). This is probably all I/O, and reflects the fact that the ULIDs are longer. Raw output here: https://gist.github.com/atombender/7adccb17a95056313d0e8ff56....
Edit 2: They don't have an index on the column in the article, so my comment about B-tree performance doesn't apply here.
I assumed that they were storing the ULIDs as binary, in the UUID column type, as link 2 in your reply. If stored as TEXT, then yes, that absolutely would make a difference.
It’s also worth noting that unlike MySQL / SQL Server, Postgres does not store tuples clustered around the PK. Indices are of course still in a B+tree.
atombender 21 days ago [-]
They show that they're storing the ULIDs as text. Quoting from the article:
CREATE TABLE ulid_test(id TEXT);
I suspect their poor results come from their choice of ULID implementation. The native C implementation I tried out is faster than the Postgres UUID type when testing computation only.
I noticed a bug in their test: They call generate_ulid() with now(). But now() is an alias for transaction_timestamp(), which is computed once at the start of the transaction, so all the timestamps will be the same. They should be using clock_timestamp().
sgarland 21 days ago [-]
Good catch to both.
bob778 22 days ago [-]
They didn’t want to make a breaking change but didn’t:
1. Use UUIDv7, which has the same sortability without breaking the ID format, or
2. Repackage the ULIDs to maintain consistency
And then broke pagination with this change?
How was this ever approved by a change control board? Or do they not have one?
Yes, I think UUIDv7 would be a much better choice especially because you could continue to use the UUID type in postgres and not need to devolve to text. You could also choose to encode the IDs with base32/58/64 at the edge to make them shorter and more URL friendly, though that adds a complexity to your application in tracking database IDs separately from public IDs.
I wish UUID would specify a more url-friendly standard representation format beyond the hex string with dashes.
Having tried this I immediately regretted it. Storage is not costly enough to justify the additional pain points that you've correctly identified.
> UUID would specify a more url-friendly standard representation
There's always the 2.25 OID space via URN (urn:oid:2.25.12345...). In which case you encode the underlying integer directly without any grouping punctuation involved.
For the same reason above you should use probably use a single encoding for all use cases, at which point, just using the ugly 8-4-4-4-12 will save you the most trouble.
There’s also a URN namespace defined for it, if an absolute URI is needed or if one wants to be more explicit:
Worst part is that you can't double-click on one to highlight the whole thing, you have to drag your cursor over it.
At a previous company, we worked _really_ hard to come up with a "4x4" ID system (i.e. a1b2-c3d4) because they'd often have to be read over the phone. Originally, we worried we'd run out of them but after 15+ years it seems like they're still going strong.
it worked quite well so far
https://dxid.tttp.eu/
In most systems that is possible via references , and this could allow unauthorized users to deduce the timing of certain events that happened.
Whether this is of any concern OSS of course domain dependent. We will keep using v4 by default, but allow newer methods where applicable
That surprised me. This provides sub-millisecond sorting when the same generator is used (I.E. same process) but doesn't hold across different processes. So you still have unsorted sub-millisecond events in a distributed system, so the concern isn't fully eliminated. It looks like a decent performance optimization though since it reduces calls to generate random bits.
I ended up reading RFC 9562, which talks about a bunch of ideas and tradeoffs with this sort of sub-millisecond sorting.
https://www.rfc-editor.org/rfc/rfc9562.html#monotonicity_cou...
- They call generate_ulid(now()). This returns the transaction timestamp, so all the timestamps will be the same. They should be using clock_timestamp().
- It also appears the generate_uuid() function they're using (which is not explained) is implemented with PL/PGSQL and is quite slow. There is a native C extension called pg-ulid [1] which is much faster; about 15% faster than Postgres' gen_random_uuid().
- Using EXPLAIN ANALYZE to benchmark stuff is a bad idea in generally. It will not give realistic timings, and it has a lot of overhead. EXPLAIN ANALYZE is intended to debug a query plan, not benchmark it.
Instead of using EXPLAIN, you can use COPY:
This has the advantage that it is more realistic, since the server has to actually serialize the results, so you get an approximation of that overhead. If you're using psql, you can enable timings and use \copy: This will transfer the data from the server to psql, so it will include network time, which makes the benchmark more realistic.[1] https://github.com/andrielfn/pg-ulid
I wish more people cared about the underlying tech of their storage layer – UUIDv4 as a string is basically the worst-case scenario for a PK, especially for MySQL / InnoDB.
As an example, this small function that makes UUIDv4:
Took 14.5 seconds to create / insert 1,000,000 rows into a temp table, compared to 7.1 seconds for `gen_random_uuid()`.[0]: https://doxygen.postgresql.org/uuid_8c.html#a6296fbc32909d10...
[1]: https://blog.daveallie.com/ulid-primary-keys/
Generally, inserting sorted values (like sequential integers or in this case, ULIDs) into a B-tree index is much faster than inserting random values. This is because inserted values go into the same, highly packed B-tree nodes, whereas random inserts will need to create a lot of scattered B-tree nodes, resulting in more pages written. Random values are generally faster to query, but slower to insert.
In this case I think the insert speed differences may come down to the sizes of the keys. Postgres's native UUID type is 128 bits, or 16 bytes, whereas the ULID is stored as the "text" type, encoded as base32, resulting in a string that is 26 bytes, plus a 32-bit string length header, so 240 bits in total, or 1.87x longer. In the benchmark, the ULID insert is about 3x that of the UUID. So the overhead may be not just the extra space but the overhead of string comparisons compared to just comparing 128-bit ints.
Edit: The article doesn't actually say which ULID implementation they use. The one implemented in PL/PGSQL mentioned in one of the article's links [1] is very slow. The other [2] is quite fast, but doesn't use base32. However, this [3] native C extension is fast, about 15% faster than the UUID function on my machine.
On my machine, using pg-ulid, inserting 1M rows was on average 1.2x faster for UUID than ULID (mean: 963ms vs 1131ms). This is probably all I/O, and reflects the fact that the ULIDs are longer. Raw output here: https://gist.github.com/atombender/7adccb17a95056313d0e8ff56....
Edit 2: They don't have an index on the column in the article, so my comment about B-tree performance doesn't apply here.
[1] https://blog.lawrencejones.dev/ulid
[2] https://blog.daveallie.com/ulid-primary-keys/
[3] https://github.com/andrielfn/pg-ulid
It’s also worth noting that unlike MySQL / SQL Server, Postgres does not store tuples clustered around the PK. Indices are of course still in a B+tree.
I noticed a bug in their test: They call generate_ulid() with now(). But now() is an alias for transaction_timestamp(), which is computed once at the start of the transaction, so all the timestamps will be the same. They should be using clock_timestamp().
1. Use UUIDv7, which has the same sortability without breaking the ID format, or
2. Repackage the ULIDs to maintain consistency
And then broke pagination with this change?
How was this ever approved by a change control board? Or do they not have one?