Add Monotonicity #306
Monotonicity wasn't actually part of the spec when I wrote this library. I'll take a deeper look into the details.
Andrew, Sergey
Based on my reading, you are limited to a random number of values you can generate within the same millisecond. If your first randomness value is sufficiently high, you can only generate within that limited window until you overflow. The implementation in https://github.com/oklog/ulid appears to do this, so it's really best effort, but it's unlikely you'll be able to create 2^80 values within the same millisecond. Further reading led me to ulid/spec#11, which brings up this issue/discrepancy. If we implement a best-effort approach, it would require changes to the following to use an entropy source instead of
The big addition here is going to be a stateful entropy source. The concurrency guarantees of this entropy source are still unknown and don't appear to be defined within the spec. The default JS implementation does not appear to be thread safe. Investigation of other implementations is necessary to see what guarantees other libraries are attempting to make. Based on that investigation, I can imagine this entropy source being a global instance, thread safe via locks, and created at import time.

Next steps:
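The stateful, lock-guarded entropy source described above could be sketched roughly as follows. This is a hypothetical illustration, not the library's actual API: within the same millisecond the previous 80-bit randomness is incremented (the best-effort behavior attributed to oklog/ulid), and a new millisecond draws fresh entropy.

```python
import os
import threading
import time


class MonotonicGenerator:
    """Sketch of a global, lock-guarded monotonic generator.

    The class name and layout are hypothetical. Output is the raw
    16-byte ULID value: 48-bit ms timestamp + 80-bit randomness.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_ms = -1
        self._last_rand = 0

    def generate(self) -> bytes:
        with self._lock:
            now_ms = time.time_ns() // 1_000_000
            if now_ms == self._last_ms:
                # Same millisecond: increment the previous randomness.
                self._last_rand += 1
                if self._last_rand >= 1 << 80:
                    # Best effort only: 2**80 values per ms exhausted.
                    raise OverflowError("randomness exhausted for this millisecond")
            else:
                # New millisecond: draw fresh 80-bit entropy.
                self._last_ms = now_ms
                self._last_rand = int.from_bytes(os.urandom(10), "big")
            return self._last_ms.to_bytes(6, "big") + self._last_rand.to_bytes(10, "big")
```

Because the timestamp prefix is most significant, values from one generator instance sort in generation order whether or not the millisecond ticks over between calls.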
To solve these problems, I would advise calculating a monotonic ULID as ttttttttttsssrrrrrrrrrrrrr, where
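The tttttttttt/sss/rrrrrrrrrrrrr split above corresponds to 10 + 3 + 13 Base32 characters, i.e. a 48-bit timestamp, a 15-bit sequence, and 65 bits of randomness in the 128-bit value. A minimal sketch of packing that layout (the helper name is hypothetical, not part of any library):

```python
import os
import time


def make_ulid_int(sequence: int) -> int:
    """Pack a 128-bit value as 48-bit ms timestamp | 15-bit sequence
    | 65-bit randomness, matching the 10+3+13 Base32 character split.
    Hypothetical helper for illustration only.
    """
    if not 0 <= sequence < 1 << 15:
        raise ValueError("sequence must fit in 15 bits")
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(9), "big") & ((1 << 65) - 1)
    return (ts_ms << 80) | (sequence << 65) | rand
```

With this layout, two ULIDs from the same millisecond sort by their sequence field before any random bits are compared.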
Would it be possible to just loop, creating ULIDs until the new ULID is bigger than the previous one?
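The retry-loop idea above can be sketched in a few lines (hypothetical helper, raw 16-byte values rather than Base32 strings). Note the caveat: when the previous ULID's random part is high, the loop can spin until the millisecond advances.

```python
import os
import time
from typing import Optional


def next_ulid_bytes(prev: Optional[bytes] = None) -> bytes:
    """Regenerate random ULIDs until one sorts after `prev`.

    Simple but wasteful: each attempt draws fresh entropy, and the
    loop may spin when `prev` has near-maximal randomness.
    """
    while True:
        ts = (int(time.time() * 1000)).to_bytes(6, "big")
        candidate = ts + os.urandom(10)
        if prev is None or candidate > prev:
            return candidate
```

This gives monotonicity for a single caller holding the previous value, but unlike a shared stateful generator it offers no ordering guarantee across threads or processes.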
Just look at this implemetation of guaranteed multiple increasing ULIDs per millisecond: ULID with sequence |
Just an idea, sorry if it's stupid, but since time sources can have microsecond precision, how about taking anything under the millisecond as the most significant random bits? Wouldn't it improve monotonicity by quite a lot in practice and still be very efficient and simple?
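The suggestion above can be sketched as follows, assuming `time.time_ns()` as the clock source: the 0–999 microsecond remainder (10 bits) becomes the most significant bits of the 80-bit random field, leaving 70 truly random bits. The function name is hypothetical.

```python
import os
import time


def ulid_bytes_with_micros() -> bytes:
    """Raw 16-byte ULID where the sub-millisecond clock remainder
    occupies the top 10 bits of the 80-bit random field.
    Illustrative sketch, not a library API.
    """
    ns = time.time_ns()
    ms = ns // 1_000_000
    micros = (ns // 1_000) % 1_000        # 0..999 within the millisecond
    rand70 = int.from_bytes(os.urandom(9), "big") & ((1 << 70) - 1)
    low80 = (micros << 70) | rand70
    return ms.to_bytes(6, "big") + low80.to_bytes(10, "big")
```

Within one millisecond, values generated in different microseconds then sort in generation order; only values produced in the same microsecond fall back to random ordering.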
Hi Andrea,
I guess generating a ULID takes more than a microsecond even on a fast CPU, so that covers the single thread case.
When you have multiple threads, the order of operations is already indeterminate to a point, so probably nobody would care. Furthermore, whether such a case is even possible in Python remains to be demonstrated. So microsecond precision does not solve the issue in theory, but I would say the case where it ruins somebody's day simply does not exist.
Taking into account that generating a ULID on the fly takes more than a microsecond, you should generate 32768 ULIDs with Sequence in advance into a buffer for every millisecond and use them from the buffer, including at peak load. Your proposed use of a microsecond counter as a sequence will result in gaps in the sequence of microseconds recorded to ULIDs. That is, the three characters allocated for microseconds will be used less efficiently than the same three characters allocated for the Sequence: fewer ULIDs will be generated than possible. Besides, thousands of new records would have to wait for ULIDs generated on the fly, at one ULID per microsecond or more. I think the best place for a ULID generator is the DBMS, not Python. But unfortunately DBMS developers have the wrong priorities.
I hope I am not being irritating, but this is an interesting discussion, so I will say one more thing.

The gap in the space of possible ULIDs is a non-issue. ULIDs are already very sparse and use 128 bits, while a DB sequence would very easily work with 32 bits, 64 at most. Lots of possible ULIDs will never be used just because you don't create objects every millisecond, and even if you did, the randomness would still make the space very sparse.

The bits used for microsecond precision would have roughly the same efficiency as the random bits, as they are more or less random. They would need to be 10 bits, since there are 1000 microseconds to a millisecond, but we don't need to worry that we are eating into the randomness, because those bits still reduce the chance of collision.

We accept the sparse nature of ULIDs (and UUIDs) because it's distributed ID generation: we don't need to hit a database to get one. Everything can create a ULID and expect no collision. This is also why I would not go down the sequence road: it assumes you have shared memory between consumers, which may very well not be the case with multiple processes. In Python we use processes more than threads because of the GIL.

In conclusion, I think that microseconds would work better in distributed environments, while being easier to implement and faster. The sequence approach is valid when you don't have microsecond precision available.
@neg3ntropy I had come back to this issue a week ago or so and actually started researching the same thing; definitely not a stupid idea! It's another approach to trading a bit of entropy for a value that is guaranteed to be incrementing, in this case the system clock. (Ignoring the fact that the system clock can be changed, thus breaking any monotonic property.) The reasons I have not (yet) tried implementing it:

Neither of these issues are true blockers; I just wanted to post where my current research and thinking is.
Hi Andrea, I don't care about the gap between two sequential ULIDs in the space of possible ULIDs at all. I mean exactly the gap in the sequence of microseconds recorded to ULIDs. These two kinds of gap are very different things. Imagine you are near the end of a millisecond, and suddenly peak load occurs. You don't have enough remaining microseconds in this millisecond to generate a sufficient number of ULIDs, but you do have enough variants of the Sequence in pre-generated ULIDs for this millisecond.

Ordered IDs are important for databases only to shorten search time. ULIDs with Sequence are intended for generation within a database, and ULIDs with Sequence are better than microseconds in this case (see the previous paragraph). If you merge records with ULIDs with Sequence from several databases, you should sort the records within every millisecond in the buffer or generate your own ULIDs within the target database. The sorting would be easier with microseconds (and the best available time resolution is 100 ns) instead of milliseconds plus Sequence, but the lack of a Sequence may cause problems in the source databases. If the sources are not databases, and you don't need quick search in them, then microseconds (without Sequence) would be enough.
A monotonic implementation was added in #473 with the goal of implementing it "as defined by the spec". As discussed here and in other GH issues/threads, there are some problems/questions with that spec. However, the goal at this point should be interoperability with other ULID implementations, with future effort put into proposing changes to the spec. I'll try to get a new version on PyPI today with these additions.
I have reviewed it. However, I am happy with the development, and this already covers my use case well enough.
I took into account the discussions in this thread during design, and I believe implementing a microsecond-based ULID would be straightforward with a new provider implementation. I'll probably take a swing at it one of these weekends to see if that assumption is correct.
Implementation using microseconds is in #476 |