Integrate vfs2 #661
Conversation
Signed-off-by: Cole Miller <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##           master     #661      +/-   ##
==========================================
- Coverage   77.41%   77.35%   -0.07%
==========================================
  Files         196      196
  Lines       27269    27377     +108
  Branches     5455     5519      +64
==========================================
+ Hits        21111    21178      +67
+ Misses       4297     4286      -11
- Partials     1861     1913      +52

☔ View full report in Codecov by Sentry.
This PR does a few different things (sorry):
Did a first pass over approximately half of the files. The PR is looking very good, thanks!
I just have one general comment: we need to make sure to remove all usages of the disk config prior to this PR. For example: https://github.com/canonical/dqlite/blob/master/src/db.c#L50
@@ -39,10 +39,6 @@ AM_CONDITIONAL(BUILD_SQLITE_ENABLED, test "x$enable_build_sqlite" = "xyes")
AC_ARG_ENABLE(build-raft, AS_HELP_STRING([--enable-build-raft[=ARG]], [use the bundled raft sources instead of linking to libraft [default=no]]))
AM_CONDITIONAL(BUILD_RAFT_ENABLED, test "x$enable_build_raft" = "xyes")

AC_ARG_ENABLE(dqlite-next, AS_HELP_STRING([--enable-dqlite-next[=ARG]], [build with the experimental dqlite backend [default=no]]))
AM_CONDITIONAL(DQLITE_NEXT_ENABLED, test "x$enable_dqlite_next" = "xyes")
AS_IF([test "x$enable_build_raft" != "xyes" -a "x$enable_dqlite_next" = "xyes"], [AC_MSG_ERROR([dqlite-next requires bundled raft])], [])
If dqlite-next needs bundled raft, shouldn't we remove the enable_build_raft option as well and make it the default?
EDIT: we discussed this in the daily but now I think it makes more sense to do it.
I think we should do this for the next release but need to make sure first that our known downstreams are ready to move to bundled raft (or have already done so).
@@ -22,8 +24,20 @@
#define POST(cond) assert((cond))
#define ERGO(a, b) (!(a) || (b))

#define UNHANDLED(expr) if (expr) assert(0)
Should this macro log something or is the stack trace going to be enough?
BTW thanks for this macro, I find code easier to read this way.
I introduced the macro to mark places where there's a potential failure that I didn't yet know how to handle, so at least the defective lines are easily greppable. It doesn't mean I think we should abort in all these cases (I should've explained this).
/* TODO maybe vfs2 should just accept the pages and page numbers
 * in the layout that we receive them over the wire? */
dqlite_vfs_frame *frames = sqlite3_malloc((int)sizeof(*frames) * (int)cf->frames.n_pages);
UNHANDLED(frames == NULL);
We need to have a unified approach for out of memory errors, either crash or return an error code, but right now I see code doing both things.
Yes, generally we try to return an error code for that; anything marked with UNHANDLED here is intended to be temporary until proper error handling is introduced. (Writing it this way initially helped me develop the patch faster, but I can fix the UNHANDLED call sites in this PR before it merges.)
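For reference, a minimal sketch of what a fixed-up call site could look like, returning an error code instead of aborting. The DQLITE_NOMEM value and the helper name here are assumptions for illustration, not the actual patch:

```c
#include <stdlib.h>

#define DQLITE_NOMEM 12 /* assumed value, for illustration only */

struct frame {
	unsigned long page_number;
	void *page;
};

/* Hypothetical helper: allocate the frames array and propagate OOM to the
 * caller instead of hitting UNHANDLED(frames == NULL). */
static int alloc_frames(unsigned n_pages, struct frame **out)
{
	struct frame *frames = malloc(sizeof(*frames) * n_pages);
	if (frames == NULL) {
		return DQLITE_NOMEM;
	}
	*out = frames;
	return 0;
}
```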
Can you add a TODO above the UNHANDLED definition macro then? Just so we remember to update it.
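Something like this sketch, perhaps; the TODO wording is only a suggestion, and the stderr line is an optional extra prompted by the earlier question about whether the macro should log:

```c
#include <assert.h>
#include <stdio.h>

/* TODO: UNHANDLED marks failure paths that do not have real error handling
 * yet; replace every call site with a proper error return before merging. */
#define UNHANDLED(expr)                                                    \
	do {                                                               \
		if (expr) {                                                \
			fprintf(stderr, "unhandled: %s at %s:%d\n", #expr, \
				__FILE__, __LINE__);                       \
			assert(0);                                         \
		}                                                          \
	} while (0)
```

As a side effect, the do/while wrapper also fixes the dangling-else hazard of the bare `if (expr) assert(0)` form.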
/* TODO maybe vfs2 should just accept the pages and page numbers
 * in the layout that we receive them over the wire? */
This would still mean we copy the data rather than use the pointer from the command directly, right? So it would translate into removing the "translation loop" below from one data structure to the other?
It should be possible for vfs2_apply_uncommitted to parse the frames data in exactly the format that it comes over the wire, so no copy is needed at all. Maybe I should add that to the PR while I'm at it?
It can also be part of the next PR, but if the changes are minimal and we are avoiding a copy it seems worth it to me.
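As a rough illustration of the zero-copy idea, assuming (purely for this sketch) a wire layout where each frame is a 64-bit page number followed by the page bytes; the real dqlite wire format, byte-order handling, and the vfs2_apply_uncommitted signature may differ:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical view into one frame of the wire buffer; the page pointer
 * aliases the buffer, so no sqlite3_malloc copy is needed. */
struct wire_frame_view {
	uint64_t page_number;
	const uint8_t *page;
};

/* Assumed layout: frame i starts at offset i * (8 + page_size), with the
 * page number stored first and the page data right after it. Byte-order
 * conversion is omitted for brevity. */
static void parse_frame(const uint8_t *wire, size_t page_size, size_t i,
			struct wire_frame_view *out)
{
	const uint8_t *p = wire + i * (8 + page_size);
	memcpy(&out->page_number, p, sizeof(out->page_number));
	out->page = p + sizeof(out->page_number);
}
```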
struct fsm *f = raft_malloc(sizeof *f);

(void)config;
struct fsm *f = raft_malloc(sizeof(*f));
if (f == NULL) {
	return DQLITE_NOMEM;
Example of what I meant in the previous comment regarding error handling on memory allocation.
struct dqlite_node *node = g->raft->data;
pool_t *pool = !!(pool_ut_fallback()->flags & POOL_FOR_UT)
	? pool_ut_fallback() : &node->pool;
pool_queue_work(pool, &req->work, g->leader->db->cookie,
		WT_UNORD, qb_top, qb_bottom);
Why are we removing this?
The new calls into vfs2 in this PR all happen on the main thread, so in order to not have data races I had to undo the earlier change that moved sqlite3_step calls to the thread pool. The next PR will make everything async again.
/**
 */
Missing doc.
rv = r->fsm->apply(r->fsm, buf, &result);

if (r->fsm->version >= 4 && r->fsm->apply2 != NULL) {
	bool is_mine = is_local && term == r->current_term;
I think this should be part of the documentation of the apply2 function, stating what it means for an entry to be mine. Also, let's try to think about a better name that conveys that it has to have been created by the node while it was the leader (if we cannot, then the documentation is enough and we can keep it as is_mine). For example, what about created_as_leader_current_term? (maybe too verbose)
Agreed about documentation. Maybe a more verbose name in the header file and the concise is_mine when it's referred to in .c?
That would work.
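A sketch of that compromise, with the flag computation taken from the diff above; the long name is just the candidate floated in this thread, and the raft_term typedef is assumed:

```c
#include <stdbool.h>

typedef unsigned long long raft_term; /* assumed typedef for this sketch */

/* Verbose, self-documenting form for the header: true iff the entry was
 * created by this node while it was leader in the current term. */
static bool created_as_leader_current_term(bool is_local, raft_term term,
					   raft_term current_term)
{
	return is_local && term == current_term;
}
```

In the .c file the result can still be bound to the concise local, e.g. `bool is_mine = created_as_leader_current_term(is_local, term, r->current_term);`.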
rv = uvMetadataLoad(uv->dir, &metadata, io->errmsg);
if (rv != 0) {
	return rv;
}
uv->metadata = metadata;
Why are we moving this and the timer to _start and _load?
uvMetadataLoad checks the format_version field to detect malformed metadata files. The correct format version isn't known until load or bootstrap time, so I moved the metadata loading later on. But I'm definitely less than 100% certain that this is the right way to do it. Maybe it would be a better idea to just modify raft_uv_init to accept the desired format version as an additional argument?
Let me check my understanding here: we are using the disk_mode as a toggle for all of these changes (vfs2 + new format representation), which is why by default uv->format_version = 1 and we change it when calling dqlite_node_enable_disk_mode. And we need to support both implementations for at least some time.
If all of that is true then I guess it makes sense to keep it as is, because the sequence has to be init -> enable_disk_mode -> load. Ideally we would pass the format as a parameter from the beginning so that we don't have to think about when the format changes and when to do each operation, like reading metadata, but it seems that is not possible just yet with the current design.
Reviewed another batch of files. I think we could have split this PR into the format-version changes plus the removal of the DQLITE_NEXT macro on one side, and the vfs2 changes on the other (maybe it is not that straightforward to do). Definitely not something to do now, but for future PRs it might make reviews faster.
@@ -47,6 +47,7 @@ typedef unsigned long long uvCounter;
/* Information persisted in a single metadata file. */
struct uvMetadata
Is the metadata tied to the uv-based raft_io implementation below or is it agnostic? I assume it is the former because it is in the uv.h file. If that is the case, can we add a comment on format_version like the one we have below?
/* 1 (original recipe) or 2 (with local data) */
If we need to document the format change further we could define an enum and document it there more thoroughly. The benefit of an enum would be that checks like PRE(1 <= version && version < 3) would automatically stay in sync when we add new format versions instead of having to change all occurrences.
rv = uv->transport->init(uv->transport, id, address);
if (rv != 0) {
	ErrMsgTransfer(uv->transport->errmsg, io->errmsg, "transport");
	return rv;
}
uv->transport->data = uv;

rv = uv_timer_init(uv->loop, &uv->timer);
Why is it that we are moving the timer?
{
	size_t res = 8 + /* Number of entries in the batch, little endian */
		     16 * n; /* One header per entry */
	if (with_local_data) {
#ifdef DQLITE_NEXT
	if (format_version > 1) {
This might be tricky to find in the future if the format changes in a way that drops local data for version 3 (for example). Do you think it is better to check for the format version explicitly?
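One way to make the capability explicit per version instead of open-ended (a sketch; the helper name is made up):

```c
/* Hypothetical predicate: name exactly which versions carry local data, so
 * a future version 3 that drops it is not silently matched by `> 1`. */
static int uv_format_has_local_data(int format_version)
{
	return format_version == 2;
}
```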
@@ -143,7 +143,7 @@ static void encodeAppendEntries(const struct raft_append_entries *p, void *buf)
	bytePut64(&cursor, p->prev_log_term); /* Previous term. */
	bytePut64(&cursor, p->leader_commit); /* Commit index. */

	uvEncodeBatchHeader(p->entries, p->n_entries, cursor, false /* no local data */);
	uvEncodeBatchHeader(p->entries, p->n_entries, cursor, 1 /* no local data ever */);
Why no local data ever? I assume it is because this is only used to send AppendEntries messages and we don't transmit local data. Maybe a slightly bigger comment saying something like "encodeAppendEntries is only called when sending AppendEntries messages and local data is never transmitted" is clearer. What do you think?
{
	unsigned i;
	void *cursor = buf;

	/* Number of entries in the batch, little endian */
	bytePut64(&cursor, n);

	if (with_local_data) {
#ifdef DQLITE_NEXT
	if (format_version > 1) {
Same.
@@ -391,15 +390,13 @@ int uvDecodeBatchHeader(const void *batch,
	return 0;
}

	if (local_data_size != NULL) {
#ifdef DQLITE_NEXT
	if (format_version > 1) {
Same.
@@ -456,7 +453,7 @@ static int decodeAppendEntries(const uv_buf_t *buf,
	args->prev_log_term = byteGet64(&cursor);
	args->leader_commit = byteGet64(&cursor);

	rv = uvDecodeBatchHeader(cursor, &args->entries, &args->n_entries, false);
	rv = uvDecodeBatchHeader(cursor, &args->entries, &args->n_entries, NULL, 1 /* no local data ever */);
Same nit about slightly longer documentation.
@@ -295,7 +295,7 @@ static void uvServerReadCb(uv_stream_t *stream,
				s->message.append_entries.entries,
				s->message.append_entries.n_entries,
				false);
				0, 1 /* no local data ever */);
Same nit, although in this case it might be easier to see because the function name already states that it is "receiving" something. Might still be useful for future readers though.
@@ -405,7 +405,7 @@ int uvSegmentLoadClosed(struct uv *uv,
	if (rv != 0) {
		goto err;
	}
	if (format != UV__DISK_FORMAT) {
	if (format != (uint64_t)uv->format_version) {
Nit: maybe be super defensive with uv->format_version < 0 || format != ... (just for the extra safety going from int to uint), or even a PRE. This applies to several places but it is very minor so feel free to ignore it.
Closing in favor of more focused PRs.