-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple ULIDs per millisecond not guaranteed #11
Comments
I don't mean to hijack this thread, but I have a related question that needs clarification before the @dragondave's question can be answered, so I decided against creating a separate issue. The Sorting section says that "Within the same millisecond, sort order is not guaranteed", and then the Monotonicity section tries to establish exactly that guarantee (implementation is supposed to throw exception in exactly the same case when the sort order within a millisecond is threatened by overflow). Which is correct? Is either of those sections to be considered optional? I'm confused. If throwing exception on overflow is optional, then multiple ULIDs per millisecond are not an issue (within 2^80, of course). If throwing is mandatory, then the Sorting section is wrong, and @dragondave's question needs to be answered. Technically, it's possible to generate 79 bits of entropy and still have plenty of space for incremented values without performance loss. |
nanoid have a great website to help figuring out the collision for layman like me. https://zelark.github.io/nano-id-cc/ So I was calculating the 16 characters random component collision chances (1% probability of collision) when generating 1,000,000 id/milliseconds is about ~3 mins or about ~180,000 ms… Because its scoped by millisecond, that I believe chance is way lower… or if I am not calculating incorrectly, we'll have to generate about: ~180,000,000,000 id/milliseconds to hit 1% probability of collision. I guess that make the uuid claim holds. 👍 |
@alexiljin monotonicity is stateful per function created by the factory function, I would avoid reusing it on concurrency scenario. You should generated one function per thread/concurrency unit. Otherwise it should be fine. |
@lxcid, I asked whether the Sorting or the Monotonicity section's claim was optional. I read your answer to mean that the Monotonicity is optional: if we have concurrency, there is no monotonicity anyway. So, basically, in the general case, sorting order within a millisecond is not guaranteed. It seems to me, that after establishing that there is no sorting guarantee, it's a little too much trouble to detect the same-millisecond condition, which means storing the extra state between invocations of the ULID generator, and then implementing the increment with overflow on an 80-bit value, and add a new exception type for the client code to worry about, etc. The spec would be much cleaner if it stopped at saying there is no sorting guarantee between milliseconds, period. The Monotonicity sections seems more like documentation of a specific implementation, not a general spec, especially since it says "we can provide some guarantees...", which is not clear whether it should be followed by a conforming implementation, or optional. I noticed that a lot of listed implementations don't throw the monotonicity exceptions, or even try to detect them, which prompted me to ask this question. It seems that the spec is unclear in this regard. If that's an optional feature, it should be clearly marked as such. Shall I offer a PR? |
Ahh I see, I agree with your observation. The monotonicity in millisecond behaviour does introduce a bit of statefulness and extra cases to worry about and in a distributed environment (multi node/machine; horizontal scaling), that guarantee would be broken. There might be cases where that extra guarantee for millisecond is useful but I view it as more of an edge cases. Would have been cleaner to keep this part optional as it come with a few caveats. |
We're having some discussion over at the .Net implementation as well (I noticed more discussion elsewhere). Quite honestly I think this monotonicity isn't well thought out and I vote for it to be removed from the spec. Arguments for why I vote for removal can be found here but I'll sum up what I've come across in different places by different people:
I'd love to hear your thoughts. |
My understanding from a code review of the ulid/javascript version (which I'm taking as a reference implementation):
From a very cursory code review: Powershell: Ruby: Go: // Monotonic returns an entropy source that is guaranteed to yield JS: From reference notes:
|
BTW. I've found a very nice article about history of unique identifiers: |
I actually added monotonicity to Python's ULID2 recently - https://github.com/valohai/ulid2/pull/8/files Ran into some of the issues called out here, but found it to be worthwhile. |
RobThree, I believe that monotonic ULIDs are necessary to reduce and guarantee the minimum search time in highload systems that use ULIDs as primary keys of tables. So we need monotonicity within a database (not in multithreading environments). My suggestion is here: ahawker/ulid#306 (comment) |
I don't understand; you still have a random part in your suggestion? I also don't see why highload systems would have any problems with a ULID as-is? I don't see what specific problem it solves? |
RobThree,
Yes, there is a true random part in my suggestion. It provides global key uniqueness. It's necessary for:
Lack of strict monotonicity cause unpredictable search time and performance issues in databases:
|
I understand perfectly well what the random part is for, but exactly because of having a random part you won't be able to make any guarantees which you were talking about and what is was referring to 😉 You see, as soon as a random component is involved it will be very hard to make guarantees. You will be able to make some very likely assumptions and predictions (especially when the RNG produces very well distributed values) but guarantees are... well, at the very least a risky part of the equation by then. Also, as I said, depending on the load (assuming it's quite constant/even) you will be able to make pretty accurate predictions if the random part is evenly distributed. Only when you have very short, specific, spikes in the load where one ms you're generating thousands or more ULID's and the other ms there's little to none there could be some unpredictability in some systems (I can imagine some TSDB's don't like that); and then it will not be due to the random part but because of the 'spiking'-part.
See, you're again on the "unpredictable search time". Either way; I disagree. The search time will be very predictible, even with a standard ULID as-is without monotonicity. The search time will be largely, if not almost(!) entirely, dependent on the timestamp part. Only a very small part of the search space will be left for the random part and most of that will likely have been stored in the same (consecutive) page(s) of the database. If used as a primary key and the index is clustered the inserts will be a tiny bit more 'expensive' but then the end result is that the ULID's will be in order physically anyway after inserting. To simply illustrate: Let's assume you generate 1.000.000.000 ULID's p/hour. That means you generate 277.777 ULID's per second and 277 ULID's per ms. You'll have 3.600.000 ordered 'buckets' of random ULID's, each containing only 277 randomly ordered ULID's (if not stored in a clustered manner anyway). So the searchspace is then only 277 ULID's out of a billion. That's 0,0000277%. You can increase the numbers by a factor 100 or a billion, the net result is still that each bucket will contain the same percentage of randomly ordered ULID's (again, assuming an even load during the entire hour and assuming the ULID's aren't stored in order (worst case). I think it's safe to assume the ULID's will be generated 'out of order' but stored in-order). Unless you have a system that generates loads of ULID's in the same millisecond and then does nothing "for the rest of the day" (so you only have "spikes" in your load in over a few ms once or a few times a day) the ULIDS will be largely sorted anyway because of the timestamp. And so the random part isn't as important (in most cases maybe even negligible, depending how the database is set up, which (if any) key clustering is used etc). And even if it turns out to be a problem in, say, PostgreSQL (or any other RDBMS or any system for that matter) but other systems are fine then isn't it just that? A problem for PostgreSQL (or whatever system)? Is it up to the ULID to account for all possible problems any random system in the world could have with a ULID and try to find a solution for it? Or would it be better, if a ULID doesn't work in a specific scenario, to advise to simply pick any other available ULID alternative that's better suited for the task/system at hand instead of trying to fit a square peg in a round hole and modify the ULID spec to accommodate for that very specific scenario and create a ULIDv1.1 introducing monotonicity and other confusion (and complexity)? Sorry, my Russian sucks 😜 I'll take a look at https://postgrespro.com/docs/postgresql/11/sql-cluster |
I reduce the random part of ULID by 3 characters, but I increase the sortable part of ULID by the same 3 characters. This allows me to guarantee that 32 * 32 * 32 = 32768 saved ULIDs per millisecond are strictly sortable, even if all of these ULIDs came at the same time at the beginning or at the end of this millisecond. Remember that Timestamp allows to save 1 strictly sortable ULID per millisecond. I think this is an impressive result, for which it’s worth sacrificing 3 characters. The lengthening of Sequence is much more efficient at peak loads than the lengthening of Timestamp (that is why ULID with Sequence between Timestamp and Randomness is appropriate for much higher load than UUIDv1 with random left part).
This assumption is not proven by any experiment or even calculation.
I mean exactly such a highload system, for example, a global payment system or monitoring system.
First, do not neglect PostgreSQL, which is the best open source general-purpose DBMS. Secondly, all DBMSs have similar problems with a clustered index for a non-monotonic key. Thirdly, ULIDs are designed specifically for high-load systems, and the rest can use traditional UUIDs/GUIDs. |
As I understand it and as is mentioned in writings for UUID time can sometimes move backwards on systems for good reasons. In those cases it may be extra difficult to make a monotonic "guarantee" even on a single thread on a single computer. I really think y'all should consider this to be best effort within a millisecond. A spec that is reasonable to implement and with characteristics that are reasonable for most systems could take off. I'm not google scale and none of my peers are either but we definitely would like most (effectively, all) of our keys to be increasing. |
Just look at this implemetation of guaranteed multiple increasing ULIDs per millisecond: ULID with sequence |
@sergeyprokhorenko I'm no Python expert but that implementation I may be mistaken and I may read the Python code incorrectly but I do have my concerns with that code... |
RobThree,
It's correct. That implementation is a student project. It would be great to see faster forks of this implementation or even bindings to C-libraryes.
It's correct too. But I think "32767 ULID's per millisecond" is quite enough for the real world databases (the implementation is intended for surrogate keys of databases ). Even 150 ULID's per millisecond would be great. That implementation generates only 5 ULID's per millisecond on ordinary PC.
The overflowing is almost impossible, because no database can write more than 32767 records per millisecond.
Not "stop", but "pause".
The sequence does not depend on the caller. The sequence restarts automatically at the beginning of next millisecond. |
ULID's aren't specifically for databases. They may be used in other contexts too 😉
For you maybe? But who knows what somebody else needs? If there's to be a limit (which is totally fine and even realistic) the spec should at least specify what that limit is then.
Again: not all contexts are databases (and I also beg to differ; I'm sure there are "databases" that can handle that much writes easily; remember there's more in the world than single machine instances - what about large clusters of databases? Or MemSQL or MemCached alike stuff? I don't think it's unheard of - wether you would use a single instance ULID generator would be another story)
What's the difference? Either way it means work is queuing up and the situation is getting worse and worse by the (milli)second.
That's not what i said. The (re)starting of the sequence depends on the caller; that's what the |
FWIW, this PR contains an implementation which attempts to give you microsecond guarantees, instead of millisecond, and remain stateless. This does not address concerns of clocks moving backwards in time as discussed above. It steals 10-bits of randomness from the clock (caveat being the accuracy of your system clock) and uses standard RNG for the remaining bits. The RNG bits are truly random, so you can still run into worst case scenarios where two generated within the same microsecond are out of order, but the likely hood of such a collision has decreased with the more accurate clock. Instead of falling back to truly random, you could source those remaining bits from a lock based implementation or one that provides stronger guarantees to monotonically increase for the same microsecond. |
🤔 It would be pretty trivial to write a Come to think of it... I can even let the user optionally specify how many bits of the microseconds part they want to use by adding another constructor argument (which defaults to 10 or some other sane default). |
Well... that was a bit of an understatement. Turns out, when decoding a ULID this microseconds part will be considered 'random data' and not part of the timestamp. Which essentially isn't really a problem. But in .Net we have a |
Some random thoughts: It's been ~6 years since I've written C# but I think I'm understanding what you've written :) That sounds similar to my implementation and what I found. When decoding a ULID, the timestamp is still millisecond precision and the microsecond bits are still stored in the randomness as I always want them to be 48bit/80bit, respectively. In my python implementation, I have classes for value = ulid.parse('...')
value.timestamp().microseconds # can't work (timestamp object doesn't have access to additional 10-bits) But the following could work, since the value = ulid.parse('...')
value.timestamp_microseconds() However, the danger with exposing a way to extract a microsecond value from a ULID is that we cannot guarantee that a decoded ULID was created with a randomness implementation that uses 10-bits from a microsecond clock. Otherwise, its going to generate a value using millisecond timestamp + 10-bits of random trash. This felt like a big foot gun so I left it out (at least for now) of my implementation. |
One of the things ULID removed from UUID was the version field. Maybe that wasn't such a good idea after all... |
I also suggest the removal of monotonicity from the spec. It actually fosters a false sense of uniqueness, despite that it's only true with the single generator scenario. If only a single instance is generating ID's, why do you need timestamps at all? Just have a counter? Also, monotonic identifiers make the random part not random, and that might violate some security-related assumptions about the predictability of the identifiers. That's one of the uses cases of UUID's over incremental identifiers, and this behavior actually harms that. If someone's DB performance relies on sequential identifiers, they can directly use autoincrement fields or sequences native to the database. Why do you need 128-bit "universally unique" identifiers? Why do you need a library at all? Maybe, I'm missing something. I'd love to be corrected. |
@ssg You're right. ULIDs are sortable/monotonic only to some precision. By default that precision is 1ms as defined by the relevant system clock, but of course there's no guarantee of that clock's accuracy. Monotonic entropy is an opt-in way for implementations to improve that sortability, but of course only on a per-host basis, and only statistically, not with strong guarantees. As you note, it's actually impossible to produce monotonic IDs in any nontrivial (i.e. distributed) system — at least, not without some form of consensus or coördination. I'm not sure why the spec describes a specific monotonic entropy algorithm at all, IMO that's well out of scope. The spec describes the layout in memory and its semantics and I guess that should be it. |
When ULIDs are generated from different hosts, time synchronization at the millisecond level cannot be guaranteed in practice. Clocks can be off by seconds (common), hours (timezone issues) or even days (time synch failures), and a clock issue needs to happen only once for a single host in a dataset's lifetime to negate the advantages of ULID for that dataset... The weakness here is that while a timestamp field can be corrected easily in most datasets, an ULID being a reference will be much more problematic to correct (with the easier future-proof fix being... to add a timestamp field). So the only strong point of ULID remains monotonicity guarantees for all ULIDs generated by a particular host, otherwise trading collision resistance for an optimistic monotonicity that you're likely to lose at some point is not worth it IMO. |
@EricGrange How is ULID "stronger" than a regular integer counter in a single generator scenario? |
Higher concurrency and lower latency in contexts where you will not create isolated records, but also create references as well. With a single generator, even instantaneous, you will still have a roundtrip to query the generator before you can create/update references. Of course that's when the order of generation is meaningful for debugging/diagnostic. When the order is irrelevant, you're IMHO better off with a random UUID plus separate timestamp field. Though it's debatable if it wouldn't be simpler to generate a random UUID and increment it in that case, as you would have a lower risk of collision. Even when incremented 1000 times (which is a lot) you only sacrificed 10 bits, where ULIDs always sacrifices 48bits. |
@EricGrange You might have missed the fact that monotonic ULID's were also serialized. |
@ssg I'm not sure to follow what you mean ? monotonicity of ULID is at best only monotonic per host (per clock reference actually) at the above millisecond level, and at best per process at the sub-millisecond level. A single generator is usually understood as universally serialized, independently from hosts, processes or any time issues, something ULID cannot achieve. On the other hand, since it will be centrally generated, and thus subject to lag, the order of generation may not be the order of use (something ULIDs can be better at, as they are locally generated). So ULIDs (and local monotonic IDs) can provide fine-grained ordering that a single generator can not. |
@EricGrange I am talking about a single host. Take a web service. All its requests have to block on ULID generation to create monotonic ids. GUIDs or non-monotonic ULIDs don't suffer from that problem. If you want monotonic ids from a single source, a numeric counter would work faster than a monotonic ULID. I don't understand why that's touted as a feature at all. |
@ssg because a numeric counter will only work for a single process on a single host, so that's a narrow use case. And you have the non-trivial issues of the numeric counter initialization when restarting your service: a MAX() query may not be efficient (requiring a particular index, and wasting time/storage), the MAX() may not account for a still uncommitted transaction, or someone restored a db backup while the web service was still up, etc. If you don't need monotonicity, UUID make much more sense, no false promise of time reference, high collision resistance. ULIDs offer something in between UUID and database generators. |
That's exactly how monotonic ULIDs work too. See my point? |
@ssg sorry no, what do you mean by "That's exactly how monotonic ULIDs work too." ? |
@EricGrange non-monotonic ULIDs won't, monotonic ULIDs will. |
@sgg sorry, won't what and will what ? |
@EricGrange Monotonic ULID variant (not the regular one) works by serializing requests (so, the seed can be incremented without conflicts). This means that the bottleneck for non-conflicting monotonic ID generation is the same as simply using an integer. An integer could even be faster with atomic incrementing primitives. There's no advantage of monotonic ULID over a regular integer counter because both require serialization of generation requests to avoid conflicts, otherwise both monotonic ULID and integer have pretty much the same conflicting behavior. As a matter of fact, monotonic ULID is essentially an integer counter with extra steps. Timestamp prefix doesn't add any additional conflict avoidance properties either as a regular integer counter would perfectly preserve generation order and avoid conflicts since the generation is sequential. My whole argument is about the monotonic ULID variant which provides no usefulness over an integer. |
ULIDs can be generated on arbitrarily many nodes, and will maintain time-based sort order with each other, within the precision of the system clock drift across those nodes. The timestamp of a given ULID is also parseable from the ID itself. Monotonic entropy, properly implemented, does not increment each successive ID by 1, but rather by a random offset. This makes it much more difficult to guess or predict ULID sequences from historical data. These properties are extremely useful in many situations, and aren't provided by integer counters. |
@peterbourgon I think you're confusing ULID with monotonic ULID too. Read the spec please. |
I don't believe I am, no. And the spec is not authoritative. |
Is the code authoritative? Because the code behaves the same way the spec says. |
I'm not sure what you're trying to communicate. My only point is that ULID monotonicity can produce IDs which are not predictable in the way that integer counters are predictable, and that this property, while it may not be useful for you, is useful in many circumstances. ULIDs do not need to have monotonic entropy. Monotonic entropy does not necessarily produce unpredictable IDs. That's all fine. |
Given the width of the random space is 80 bits, the latter is guaranteed to happen first (or at the same time).
There is also no guarantee that the number randomly generated is not very close to 2^80 -- should implementations consider a maximum number they will accept from the random number generator, and that they will change random numbers greater than to? (Or choose a number lower than)?
The text was updated successfully, but these errors were encountered: