Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple ULIDs per millisecond not guaranteed #11

Open
dragondave opened this issue Dec 27, 2018 · 43 comments
Open

Multiple ULIDs per millisecond not guaranteed #11

dragondave opened this issue Dec 27, 2018 · 43 comments

Comments

@dragondave
Copy link

If, in the extremely unlikely event that, you manage to generate more than 2^80 ULIDs within the same millisecond, or cause the random component to overflow with less, the generation will fail.

Given the width of the random space is 80 bits, the latter is guaranteed to happen first (or at the same time).

There is also no guarantee that the number randomly generated is not very close to 2^80 -- should implementations consider a maximum number they will accept from the random number generator, and that they will change random numbers greater than to? (Or choose a number lower than)?

@alex-ilin
Copy link
Contributor

I don't mean to hijack this thread, but I have a related question that needs clarification before the @dragondave's question can be answered, so I decided against creating a separate issue. The Sorting section says that "Within the same millisecond, sort order is not guaranteed", and then the Monotonicity section tries to establish exactly that guarantee (implementation is supposed to throw exception in exactly the same case when the sort order within a millisecond is threatened by overflow). Which is correct? Is either of those sections to be considered optional? I'm confused.

If throwing exception on overflow is optional, then multiple ULIDs per millisecond are not an issue (within 2^80, of course). If throwing is mandatory, then the Sorting section is wrong, and @dragondave's question needs to be answered.

Technically, it's possible to generate 79 bits of entropy and still have plenty of space for incremented values without performance loss.

@lxcid
Copy link

lxcid commented Jan 2, 2019

nanoid have a great website to help figuring out the collision for layman like me.

https://zelark.github.io/nano-id-cc/

So I was calculating the 16 characters random component collision chances (1% probability of collision) when generating 1,000,000 id/milliseconds is about ~3 mins or about ~180,000 ms…

screen shot 2019-01-02 at 12 21 06 pm

Because its scoped by millisecond, that I believe chance is way lower… or if I am not calculating incorrectly, we'll have to generate about:

~180,000,000,000 id/milliseconds to hit 1% probability of collision.

I guess that make the uuid claim holds. 👍

@lxcid
Copy link

lxcid commented Jan 2, 2019

@alexiljin monotonicity is stateful per function created by the factory function, I would avoid reusing it on concurrency scenario.

You should generated one function per thread/concurrency unit.

https://github.com/ulid/javascript/blob/f45529c95994019125cdd5a497ab792e00ef57f2/lib/index.ts#L160-L178

Otherwise it should be fine.

@alex-ilin
Copy link
Contributor

@lxcid, I asked whether the Sorting or the Monotonicity section's claim was optional. I read your answer to mean that the Monotonicity is optional: if we have concurrency, there is no monotonicity anyway. So, basically, in the general case, sorting order within a millisecond is not guaranteed.

It seems to me, that after establishing that there is no sorting guarantee, it's a little too much trouble to detect the same-millisecond condition, which means storing the extra state between invocations of the ULID generator, and then implementing the increment with overflow on an 80-bit value, and add a new exception type for the client code to worry about, etc. The spec would be much cleaner if it stopped at saying there is no sorting guarantee between milliseconds, period. The Monotonicity sections seems more like documentation of a specific implementation, not a general spec, especially since it says "we can provide some guarantees...", which is not clear whether it should be followed by a conforming implementation, or optional.

I noticed that a lot of listed implementations don't throw the monotonicity exceptions, or even try to detect them, which prompted me to ask this question. It seems that the spec is unclear in this regard. If that's an optional feature, it should be clearly marked as such.

Shall I offer a PR?

@lxcid
Copy link

lxcid commented Jan 2, 2019

Ahh I see, I agree with your observation. The monotonicity in millisecond behaviour does introduce a bit of statefulness and extra cases to worry about and in a distributed environment (multi node/machine; horizontal scaling), that guarantee would be broken.

There might be cases where that extra guarantee for millisecond is useful but I view it as more of an edge cases. Would have been cleaner to keep this part optional as it come with a few caveats.

@RobThree
Copy link

RobThree commented Jan 9, 2019

We're having some discussion over at the .Net implementation as well (I noticed more discussion elsewhere). Quite honestly I think this monotonicity isn't well thought out and I vote for it to be removed from the spec. Arguments for why I vote for removal can be found here but I'll sum up what I've come across in different places by different people:

  • Monotonicity can cause problems in multithreading environments (though that could be solved with some locking / lock-free algorithms, it adds complexity)
  • The behaviour is undefined on what happens on an 'overflow' when lots of ULID's are generated in the same millisecond and the initial random-part is very high causing the overflow eventually Okay, my bad, the defined behaviour is that it should throw; but then each millisecond there's a different (potential!) number of ULID's you can (theoretically) produce, entirely based on chance because the start is random
  • It causes confusion on the sort order; without monotonocity the sort order is random within the same millisecond, when monotonicity is used then suddenly the sort order is defined for the entire ulid and the 'random part' isn't so 'random' anymore (one side effect is that you can gauge from looking at ID's how many of them were generated approximately in that millisecond; that may be an undesirable property; at the very least the random-part should be much more carefully documented as being 'not-so-random' when monotonicity is used)
  • It adds little to no value in my humble opinion, especially for the added complexity it adds to implement (incrementing 80-bit integers isn't exactly simple in all languages (and usually also not without drastic costs in performance!) and then we're not even talking keeping state etc.)
  • As it's currently standing there seem to be differences in opinions on the exact implementation (if at all) which could cause problems between implementations across the board / incompatibilities. I think before it's too late we need to 'nip this in the bud'. (Maybe someone can map out which implementation does/did what with monotonicity (if at all implemented)?).

I'd love to hear your thoughts.

@dragondave
Copy link
Author

My understanding from a code review of the ulid/javascript version (which I'm taking as a reference implementation):

  • The default ULID factory does not support monotonicity
  • The optional monotonicFactory does, with a few quirks:
    • Generating a ULID with an earlier timestamp generates one for the current millisecond, as if it arrived right now, which is a surprise [see section of ulid/javascript docs, below]
    • Generating a ULID with a later timestamp does change the random seed

From a very cursory code review:
Python:
mdipierro/ulid -- no monotonic behaviour observed in main file
ahawker/ulid -- no monotonic behaviour observed in main file
mdomke/python-ulid -- no monotonic behaviour observed in main file

Powershell:
PetterBomban/posh-ulid -- no monotonic behaviour observed in main file

Ruby:
rafaelsales/ulid -- no monotonic behaviour observed in main file

Go:
oklog/ulid -- not spec compliant but an interesting stab at fixing the problem of being able to trivially guess other ULIDs in the same timestamp. Not enabled by default.

// Monotonic returns an entropy source that is guaranteed to yield
// strictly increasing entropy bytes for the same ULID timestamp.
// On conflicts, the previous ULID entropy is incremented with a
// random number between 1 and inc (inclusive).
//
// The provided entropy source must actually yield random bytes or else
// monotonic reads are not guaranteed to terminate, since there isn't
// enough randomness to compute an increment number.
//
// When inc == 0, it'll be set to a secure default of math.MaxUint32.
// The lower the value of inc, the easier the next ULID within the
// same millisecond is to guess. If your code depends on ULIDs having
// secure entropy bytes, then don't go under this default unless you know
// what you're doing.

JS:
aarondcohen/id128 -- monotonic behaviour in separate file [ulid-monotonic] -- same rules as reference.

From reference notes:

const ulid = monotonicFactory()

// Strict ordering for the same timestamp, by incrementing the least-significant random bit by 1
ulid(150000) // 000XAL6S41ACTAV9WEVGEMMVRB
ulid(150000) // 000XAL6S41ACTAV9WEVGEMMVRC

// Even if a lower timestamp is passed (or generated), it will preserve sort order
ulid(100000) // 000XAL6S41ACTAV9WEVGEMMVRD 

@andrzejmartyna
Copy link

BTW. I've found a very nice article about history of unique identifiers:
https://segment.com/blog/a-brief-history-of-the-uuid/
I was surprised how rich the history is so we can for sure learn from it ;)

@itdaniher
Copy link

I actually added monotonicity to Python's ULID2 recently - https://github.com/valohai/ulid2/pull/8/files

Ran into some of the issues called out here, but found it to be worthwhile.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Jan 17, 2019

We're having some discussion over at the .Net implementation as well (I noticed more discussion elsewhere). Quite honestly I think this monotonicity isn't well thought out and I vote for it to be removed from the spec. Arguments for why I vote for removal can be found here but I'll sum up what I've come across in different places by different people:

  • Monotonicity can cause problems in multithreading environments (though that could be solved with some locking / lock-free algorithms, it adds complexity)
  • The behaviour is undefined on what happens on an 'overflow' when lots of ULID's are generated in the same millisecond and the initial random-part is very high causing the overflow eventually Okay, my bad, the defined behaviour is that it should throw; but then each millisecond there's a different (potential!) number of ULID's you can (theoretically) produce, entirely based on chance because the start is random
  • It causes confusion on the sort order; without monotonocity the sort order is random within the same millisecond, when monotonicity is used then suddenly the sort order is defined for the entire ulid and the 'random part' isn't so 'random' anymore (one side effect is that you can gauge from looking at ID's how many of them were generated approximately in that millisecond; that may be an undesirable property; at the very least the random-part should be much more carefully documented as being 'not-so-random' when monotonicity is used)
  • It adds little to no value in my humble opinion, especially for the added complexity it adds to implement (incrementing 80-bit integers isn't exactly simple in all languages (and usually also not without drastic costs in performance!) and then we're not even talking keeping state etc.)
  • As it's currently standing there seem to be differences in opinions on the exact implementation (if at all) which could cause problems between implementations across the board / incompatibilities. I think before it's too late we need to 'nip this in the bud'. (Maybe someone can map out which implementation does/did what with monotonicity (if at all implemented)?).

I'd love to hear your thoughts.

RobThree,

I believe that monotonic ULIDs are necessary to reduce and guarantee the minimum search time in highload systems that use ULIDs as primary keys of tables. So we need monotonicity within a database (not in multithreading environments).

My suggestion is here: ahawker/ulid#306 (comment)

@RobThree
Copy link

I don't understand; you still have a random part in your suggestion?

I also don't see why highload systems would have any problems with a ULID as-is? I don't see what specific problem it solves?

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Jan 19, 2019

RobThree,

I don't understand; you still have a random part in your suggestion?

Yes, there is a true random part in my suggestion. It provides global key uniqueness. It's necessary for:

  1. merging of records from different databases
  2. generating records on the client
  3. key collision avoidance during replication
  4. key collision avoidance in distributed database
  5. key collision avoidance in multithreading software
  6. links between records in different databases
  7. accurate search by ULID and queries on the Internet
  8. inability to predict the key by the attacker
  9. polymorphic associations between any database tables
  10. ability to find a record by ID without knowing the table

I also don't see why highload systems would have any problems with a ULID as-is? I don't see what specific problem it solves?

Lack of strict monotonicity cause unpredictable search time and performance issues in databases:

  1. we have to periodically recluster every table according to its primary index (i.e. ULID) (see SQL Command CLUSTER in PostgreSQL)
  2. if for a long time there was no reclustering than search time for some records may be very long

@RobThree
Copy link

RobThree commented Jan 19, 2019

Yes, there is a true random part in my suggestion. It provides global key uniqueness. It's necessary for

I understand perfectly well what the random part is for, but exactly because of having a random part you won't be able to make any guarantees which you were talking about and what is was referring to 😉 You see, as soon as a random component is involved it will be very hard to make guarantees. You will be able to make some very likely assumptions and predictions (especially when the RNG produces very well distributed values) but guarantees are... well, at the very least a risky part of the equation by then.

Also, as I said, depending on the load (assuming it's quite constant/even) you will be able to make pretty accurate predictions if the random part is evenly distributed. Only when you have very short, specific, spikes in the load where one ms you're generating thousands or more ULID's and the other ms there's little to none there could be some unpredictability in some systems (I can imagine some TSDB's don't like that); and then it will not be due to the random part but because of the 'spiking'-part.

Lack of strict monotonicity cause unpredictable search time and performance issues in databases

See, you're again on the "unpredictable search time".

Either way; I disagree. The search time will be very predictible, even with a standard ULID as-is without monotonicity. The search time will be largely, if not almost(!) entirely, dependent on the timestamp part. Only a very small part of the search space will be left for the random part and most of that will likely have been stored in the same (consecutive) page(s) of the database. If used as a primary key and the index is clustered the inserts will be a tiny bit more 'expensive' but then the end result is that the ULID's will be in order physically anyway after inserting.

To simply illustrate: Let's assume you generate 1.000.000.000 ULID's p/hour. That means you generate 277.777 ULID's per second and 277 ULID's per ms. You'll have 3.600.000 ordered 'buckets' of random ULID's, each containing only 277 randomly ordered ULID's (if not stored in a clustered manner anyway). So the searchspace is then only 277 ULID's out of a billion. That's 0,0000277%. You can increase the numbers by a factor 100 or a billion, the net result is still that each bucket will contain the same percentage of randomly ordered ULID's (again, assuming an even load during the entire hour and assuming the ULID's aren't stored in order (worst case). I think it's safe to assume the ULID's will be generated 'out of order' but stored in-order).

Unless you have a system that generates loads of ULID's in the same millisecond and then does nothing "for the rest of the day" (so you only have "spikes" in your load in over a few ms once or a few times a day) the ULIDS will be largely sorted anyway because of the timestamp. And so the random part isn't as important (in most cases maybe even negligible, depending how the database is set up, which (if any) key clustering is used etc).

And even if it turns out to be a problem in, say, PostgreSQL (or any other RDBMS or any system for that matter) but other systems are fine then isn't it just that? A problem for PostgreSQL (or whatever system)? Is it up to the ULID to account for all possible problems any random system in the world could have with a ULID and try to find a solution for it? Or would it be better, if a ULID doesn't work in a specific scenario, to advise to simply pick any other available ULID alternative that's better suited for the task/system at hand instead of trying to fit a square peg in a round hole and modify the ULID spec to accommodate for that very specific scenario and create a ULIDv1.1 introducing monotonicity and other confusion (and complexity)?

(see SQL Command CLUSTER in PostgreSQL)

Sorry, my Russian sucks 😜 I'll take a look at https://postgrespro.com/docs/postgresql/11/sql-cluster

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Jan 19, 2019

RobThree: ...exactly because of having a random part you won't be able to make any guarantees...

I reduce the random part of ULID by 3 characters, but I increase the sortable part of ULID by the same 3 characters. This allows me to guarantee that 32 * 32 * 32 = 32768 saved ULIDs per millisecond are strictly sortable, even if all of these ULIDs came at the same time at the beginning or at the end of this millisecond. Remember that Timestamp allows to save 1 strictly sortable ULID per millisecond. I think this is an impressive result, for which it’s worth sacrificing 3 characters. The lengthening of Sequence is much more efficient at peak loads than the lengthening of Timestamp (that is why ULID with Sequence between Timestamp and Randomness is appropriate for much higher load than UUIDv1 with random left part).

RobThree: ...If used as a primary key and the index is clustered the inserts will be a tiny bit more 'expensive' but then the end result is that the ULID's will be in order physically anyway after inserting.

This assumption is not proven by any experiment or even calculation.

RobThree: ...Unless you have a system that generates loads of ULID's in the same millisecond...

I mean exactly such a highload system, for example, a global payment system or monitoring system.

RobThree: ...And even if it turns out to be a problem in, say, PostgreSQL (or any other RDBMS or any system for that matter) but other systems are fine then isn't it just that? A problem for PostgreSQL (or whatever system)? Is it up to the ULID to account for all possible problems any random system in the world could have with a ULID and try to find a solution for it?

First, do not neglect PostgreSQL, which is the best open source general-purpose DBMS. Secondly, all DBMSs have similar problems with a clustered index for a non-monotonic key. Thirdly, ULIDs are designed specifically for high-load systems, and the rest can use traditional UUIDs/GUIDs.

@aarondandy
Copy link

As I understand it and as is mentioned in writings for UUID time can sometimes move backwards on systems for good reasons. In those cases it may be extra difficult to make a monotonic "guarantee" even on a single thread on a single computer. I really think y'all should consider this to be best effort within a millisecond. A spec that is reasonable to implement and with characteristics that are reasonable for most systems could take off. I'm not google scale and none of my peers are either but we definitely would like most (effectively, all) of our keys to be increasing.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Nov 3, 2019

Just look at this implemetation of guaranteed multiple increasing ULIDs per millisecond: ULID with sequence

@RobThree
Copy link

RobThree commented Nov 4, 2019

@sergeyprokhorenko I'm no Python expert but that implementation
a) doesn't (have to?) take multithreading into account
b) appears to generate a maximum of 32767 ULID's per millisecond and then just spends the rest of that millisecond waiting for it to pass
c) has no guarantees of not 'overflowing' within the same millisecond
d) appears to stop generating ULID's if an overflow occurs (see b)
e) depends on the caller to determine if a sequence should be (re)started (used_ulid argument); there's something to be said for that but on the other hand it has some concerns (again: especially in multithreaded code, but then again IIRC Python isn't commonly run in a multithreaded fashion?)

I may be mistaken and I may read the Python code incorrectly but I do have my concerns with that code...

@Sofya2003
Copy link

RobThree,

...that implementation
a) doesn't (have to?) take multithreading into account

It's correct. That implementation is a student project. It would be great to see faster forks of this implementation or even bindings to C-libraryes.

b) appears to generate a maximum of 32767 ULID's per millisecond and then just spends the rest of that millisecond waiting for it to pass

It's correct too. But I think "32767 ULID's per millisecond" is quite enough for the real world databases (the implementation is intended for surrogate keys of databases ). Even 150 ULID's per millisecond would be great. That implementation generates only 5 ULID's per millisecond on ordinary PC.

c) has no guarantees of not 'overflowing' within the same millisecond

The overflowing is almost impossible, because no database can write more than 32767 records per millisecond.

d) appears to stop generating ULID's if an overflow occurs (see b)

Not "stop", but "pause".

e) depends on the caller to determine if a sequence should be (re)started (used_ulid argument); there's something to be said for that but on the other hand it has some concerns (again: especially in multithreaded code, but then again IIRC Python isn't commonly run in a multithreaded fashion?)

The sequence does not depend on the caller. The sequence restarts automatically at the beginning of next millisecond.

@RobThree
Copy link

RobThree commented Nov 4, 2019

It's correct too. But I think "32767 ULID's per millisecond" is quite enough for the real world databases

ULID's aren't specifically for databases. They may be used in other contexts too 😉

Even 150 ULID's per millisecond would be great

For you maybe? But who knows what somebody else needs? If there's to be a limit (which is totally fine and even realistic) the spec should at least specify what that limit is then.

because no database can write more than 32767 records per millisecond.

Again: not all contexts are databases (and I also beg to differ; I'm sure there are "databases" that can handle that much writes easily; remember there's more in the world than single machine instances - what about large clusters of databases? Or MemSQL or MemCached alike stuff? I don't think it's unheard of - wether you would use a single instance ULID generator would be another story)

Not "stop", but "pause"

What's the difference? Either way it means work is queuing up and the situation is getting worse and worse by the (milli)second.

The sequence does not depend on the caller.

That's not what i said. The (re)starting of the sequence depends on the caller; that's what the used_ulid argument is used for specifically.

@ahawker
Copy link

ahawker commented Nov 11, 2020

FWIW, this PR contains an implementation which attempts to give you microsecond guarantees, instead of millisecond, and remain stateless.

This does not address concerns of clocks moving backwards in time as discussed above.

It steals 10-bits of randomness from the clock (caveat being the accuracy of your system clock) and uses standard RNG for the remaining bits. The RNG bits are truly random, so you can still run into worst case scenarios where two generated within the same microsecond are out of order, but the likely hood of such a collision has decreased with the more accurate clock. Instead of falling back to truly random, you could source those remaining bits from a lock based implementation or one that provides stronger guarantees to monotonically increase for the same microsecond.

@RobThree
Copy link

RobThree commented Nov 12, 2020

🤔 It would be pretty trivial to write a MicrosecondRng (which accepts an IUlidRng for the random part) for my implementation that 'overwrites' the leftmost 10 bits generated by the given RNG with the microsecond part and sets the 11th bit to 0 to create (or 'force'/'guarantee') enough 'headroom' so that a sequence won't overflow too soon. I like the idea 👍 and will implement this soon I think...

Come to think of it... I can even let the user optionally specify how many bits of the microseconds part they want to use by adding another constructor argument (which defaults to 10 or some other sane default).

@RobThree
Copy link

It would be pretty trivial

Well... that was a bit of an understatement. Turns out, when decoding a ULID this microseconds part will be considered 'random data' and not part of the timestamp. Which essentially isn't really a problem. But in .Net we have a DateTime(Offset) structure with nanosecond precision; you'd expect a decoded ULID to then have the microsecond part in the timestamp again. But unless I involve my IUlidRng in decoding as well the microseconds will be "lost" (be part of the 'random' part). This means that "roundtripping" a ULID -> string and back will result in weird behaviour... I need to sleep on this a bit longer.

@ahawker
Copy link

ahawker commented Nov 18, 2020

Some random thoughts:

It's been ~6 years since I've written C# but I think I'm understanding what you've written :)

That sounds similar to my implementation and what I found. When decoding a ULID, the timestamp is still millisecond precision and the microsecond bits are still stored in the randomness as I always want them to be 48bit/80bit, respectively.

In my python implementation, I have classes for Timestamp, Randomness, and ULID, the former two wrapped around the correct byte offsets of the ULID. This complicates extracting a appropriate microsecond value because the timestamp field doesn't have access to the byte offset of the randomness, e.g.:

value = ulid.parse('...')
value.timestamp().microseconds  # can't work (timestamp object doesn't have access to additional 10-bits)

But the following could work, since the ULID object does have access to all the data.

value = ulid.parse('...')
value.timestamp_microseconds()

However, the danger with exposing a way to extract a microsecond value from a ULID is that we cannot guarantee that a decoded ULID was created with a randomness implementation that uses 10-bits from a microsecond clock. Otherwise, its going to generate a value using millisecond timestamp + 10-bits of random trash. This felt like a big foot gun so I left it out (at least for now) of my implementation.

@Artoria2e5
Copy link

we cannot guarantee that a decoded ULID was created with a randomness implementation that uses 10-bits from a microsecond clock

One of the things ULID removed from UUID was the version field. Maybe that wasn't such a good idea after all...

@ssg
Copy link

ssg commented Jan 20, 2022

I also suggest the removal of monotonicity from the spec. It actually fosters a false sense of uniqueness, despite that it's only true with the single generator scenario. If only a single instance is generating ID's, why do you need timestamps at all? Just have a counter?

Also, monotonic identifiers make the random part not random, and that might violate some security-related assumptions about the predictability of the identifiers. That's one of the uses cases of UUID's over incremental identifiers, and this behavior actually harms that.

If someone's DB performance relies on sequential identifiers, they can directly use autoincrement fields or sequences native to the database. Why do you need 128-bit "universally unique" identifiers? Why do you need a library at all?

Maybe, I'm missing something. I'd love to be corrected.

@peterbourgon
Copy link

peterbourgon commented Jan 27, 2022

@ssg You're right. ULIDs are sortable/monotonic only to some precision. By default that precision is 1ms as defined by the relevant system clock, but of course there's no guarantee of that clock's accuracy. Monotonic entropy is an opt-in way for implementations to improve that sortability, but of course only on a per-host basis, and only statistically, not with strong guarantees. As you note, it's actually impossible to produce monotonic IDs in any nontrivial (i.e. distributed) system — at least, not without some form of consensus or coördination.

I'm not sure why the spec describes a specific monotonic entropy algorithm at all, IMO that's well out of scope. The spec describes the layout in memory and its semantics and I guess that should be it.

@EricGrange
Copy link

When ULIDs are generated from different hosts, time synchronization at the millisecond level cannot be guaranteed in practice. Clocks can be off by seconds (common), hours (timezone issues) or even days (time synch failures), and a clock issue needs to happen only once for a single host in a dataset's lifetime to negate the advantages of ULID for that dataset...

The weakness here is that while a timestamp field can be corrected easily in most datasets, an ULID being a reference will be much more problematic to correct (with the easier future-proof fix being... to add a timestamp field).
Another weakness of ULIDs is against an attacker seeking a collision: 80 bits is not much against a Birthday attack. A Birthday attack against a 128bit UUID will stand out like a DDoS, one against a 80bit ULID doesn't need to.

So the only strong point of ULID remains monotonicity guarantees for all ULIDs generated by a particular host, otherwise trading collision resistance for an optimistic monotonicity that you're likely to lose at some point is not worth it IMO.

@ssg
Copy link

ssg commented Dec 20, 2022

So the only strong point of ULID remains monotonicity guarantees for all ULIDs generated by a particular host

@EricGrange How is ULID "stronger" than a regular integer counter in a single generator scenario?

@EricGrange
Copy link

Higher concurrency and lower latency in contexts where you will not create isolated records, but also create references as well. With a single generator, even instantaneous, you will still have a roundtrip to query the generator before you can create/update references.

Of course that's when the order of generation is meaningful for debugging/diagnostic. When the order is irrelevant, you're IMHO better off with a random UUID plus separate timestamp field.

Though it's debatable if it wouldn't be simpler to generate a random UUID and increment it in that case, as you would have a lower risk of collision. Even when incremented 1000 times (which is a lot) you only sacrificed 10 bits, where ULIDs always sacrifices 48bits.

@ssg
Copy link

ssg commented Dec 23, 2022

@EricGrange You might have missed the fact that monotonic ULID's were also serialized.

@EricGrange
Copy link

@ssg I'm not sure to follow what you mean ? monotonicity of ULID is at best only monotonic per host (per clock reference actually) at the above millisecond level, and at best per process at the sub-millisecond level.
Since this requires a state (the timestamp of the last created ULID), this state could be used for sequential UUIDs (or any other scheme) to provide monotonicity for a given timestamp.

A single generator is usually understood as universally serialized, independently from hosts, processes or any time issues, something ULID cannot achieve. On the other hand, since it will be centrally generated, and thus subject to lag, the order of generation may not be the order of use (something ULIDs can be better at, as they are locally generated).

So ULIDs (and local monotonic IDs) can provide fine-grained ordering that a single generator can not.

@ssg
Copy link

ssg commented Dec 23, 2022

@EricGrange I am talking about a single host. Take a web service. All its requests have to block on ULID generation to create monotonic ids. GUIDs or non-monotonic ULIDs don't suffer from that problem. If you want monotonic ids from a single source, a numeric counter would work faster than a monotonic ULID. I don't understand why that's touted as a feature at all.

@EricGrange
Copy link

@ssg because a numeric counter will only work for a single process on a single host, so that's a narrow use case. And you have the non-trivial issues of the numeric counter initialization when restarting your service: a MAX() query may not be efficient (requiring a particular index, and wasting time/storage), the MAX() may not account for a still uncommitted transaction, or someone restored a db backup while the web service was still up, etc.

If you don't need monotonicity, UUID make much more sense, no false promise of time reference, high collision resistance.
If you need strict monotonicity, database generators are the only option, but come at a latency cost when cross-referencing.

ULIDs offer something in between UUID and database generators.

@ssg
Copy link

ssg commented Dec 26, 2022

@ssg because a numeric counter will only work for a single process on a single host

That's exactly how monotonic ULIDs work too. See my point?

@EricGrange
Copy link

@ssg sorry no, what do you mean by "That's exactly how monotonic ULIDs work too." ?
A numeric counter used across processes (or merely just not initialized safely) will experience collisions very easily, ULIDs won't.

@ssg
Copy link

ssg commented Dec 26, 2022

@EricGrange non-monotonic ULIDs won't, monotonic ULIDs will.

@EricGrange
Copy link

@sgg sorry, won't what and will what ?

@ssg
Copy link

ssg commented Dec 26, 2022

@EricGrange Monotonic ULID variant (not the regular one) works by serializing requests (so, the seed can be incremented without conflicts). This means that the bottleneck for non-conflicting monotonic ID generation is the same as simply using an integer. An integer could even be faster with atomic incrementing primitives.

There's no advantage of monotonic ULID over a regular integer counter because both require serialization of generation requests to avoid conflicts, otherwise both monotonic ULID and integer have pretty much the same conflicting behavior.

As a matter of fact, monotonic ULID is essentially an integer counter with extra steps. Timestamp prefix doesn't add any additional conflict avoidance properties either as a regular integer counter would perfectly preserve generation order and avoid conflicts since the generation is sequential.

My whole argument is about the monotonic ULID variant which provides no usefulness over an integer.

@peterbourgon
Copy link

peterbourgon commented Jan 2, 2023

@ssg

There's no advantage of monotonic ULID over a regular integer counter

ULIDs can be generated on arbitrarily many nodes, and will maintain time-based sort order with each other, within the precision of the system clock drift across those nodes. The timestamp of a given ULID is also parseable from the ID itself.

Monotonic entropy, properly implemented, does not increment each successive ID by 1, but rather by a random offset. This makes it much more difficult to guess or predict ULID sequences from historical data.

These properties are extremely useful in many situations, and aren't provided by integer counters.

@ssg
Copy link

ssg commented Jan 3, 2023

@peterbourgon I think you're confusing ULID with monotonic ULID too. Read the spec please.

@peterbourgon
Copy link

@peterbourgon I think you're confusing ULID with monotonic ULID too. Read the spec please.

I don't believe I am, no. And the spec is not authoritative.

@ssg
Copy link

ssg commented Jan 4, 2023

the spec is not authoritative

Is the code authoritative? Because the code behaves the same way the spec says.

@peterbourgon
Copy link

I'm not sure what you're trying to communicate.

My only point is that ULID monotonicity can produce IDs which are not predictable in the way that integer counters are predictable, and that this property, while it may not be useful for you, is useful in many circumstances.

ULIDs do not need to have monotonic entropy. Monotonic entropy does not necessarily produce unpredictable IDs. That's all fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests