Data copy from one server to the other: new data is not processed #845
Related code is:
So my current speculation is that the raw place for this trip was copied over from the old data and the link up with the new data is broken somehow. However, if we copied everything, we should have copied over all of the location timestamps as well. Why was this missing? Let's first get the raw place (unnecessary fields redacted):
As expected, the enter timestamp is from the 30th, so before the copy.
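For reference, the lookup described above would be along these lines (a minimal sketch, not the original query; the edb helpers and the segmentation/raw_place key follow the usual e-mission conventions, and the ObjectId is a placeholder for the suspect place):

```python
import bson.objectid as boi
import emission.core.get_database as edb

place_id = boi.ObjectId()  # placeholder: replace with the _id of the suspect raw place
raw_place = edb.get_analysis_timeseries_db().find_one(
    {"_id": place_id, "metadata.key": "segmentation/raw_place"})
if raw_place is not None:
    # enter_ts should be before the copy (the 30th), exit_ts after it
    print(raw_place["data"]["enter_ts"], raw_place["data"].get("exit_ts"))
```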
Did all the locations not get exported properly? I was worried about that, but I checked the number of entries and it seemed to match. Let's double-check the location profile. Now that we have loaded a bunch of other data on top, our queries have to be more complicated. #$#$# trying to do this at the last minute...
Hm, searching for the last three entries before the start_ts, they are from Oct
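The "last three entries before start_ts" check would look roughly like this (a sketch; the background/filtered_location key, the user_id, and the start_ts value are placeholders/assumptions, not taken from the original query):

```python
import emission.core.get_database as edb

user_id = None         # replace with the test user's UUID
start_ts = 1638230400  # replace with the raw place's start timestamp

last_three = list(edb.get_timeseries_db().find(
    {"user_id": user_id,
     "metadata.key": "background/filtered_location",
     "data.ts": {"$lt": start_ts}}
).sort("data.ts", -1).limit(3))

for e in last_three:
    print(e["data"]["ts"], e["data"].get("fmt_time"))
```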
Let's look for everything a little after the exit_ts to see if there is a transition.
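A sketch of that "look a little after exit_ts" query (the one-hour window, user_id and exit_ts are placeholders; any key, including statemachine/transition, would show up since we don't filter on metadata.key):

```python
import emission.core.get_database as edb

user_id = None        # replace with the test user's UUID
exit_ts = 1638234000  # replace with the raw place's exit timestamp

after_exit = edb.get_timeseries_db().find(
    {"user_id": user_id,
     "data.ts": {"$gt": exit_ts, "$lt": exit_ts + 3600}}
).sort("data.ts", 1)

for e in after_exit:
    # a transition entry here would explain how the trip was ended
    print(e["metadata"]["key"], e["data"]["ts"])
```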
Checked the mongodump and confirmed that there are location entries there. I could also write a special script just for this that finds the missing entries from the timeseries and copies them over, but it seems like the more general fix addresses the underlying problem as well, and will fix this one too.
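That one-off script was never written, but for context the approach would have looked roughly like this (a sketch only; the restored database name, user id and time range are placeholders, and it assumes the dump was restored into a second local database):

```python
import pymongo
import emission.core.get_database as edb

client = pymongo.MongoClient("localhost")
restored_ts = client.Stage_database_restored.Stage_timeseries  # assumed restore target
live_ts = edb.get_timeseries_db()

user_id = None             # replace with the affected user's UUID
gap_start, gap_end = 0, 0  # replace with the missing time range

for entry in restored_ts.find({"user_id": user_id,
                               "data.ts": {"$gte": gap_start, "$lte": gap_end}}):
    # only copy entries that are not already present; keeps the original _ids
    if live_ts.find_one({"_id": entry["_id"]}) is None:
        live_ts.insert_one(entry)
```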
Checking to see whether this happens for the recreated locations as well. I'm guessing not, since the original copy seemed to work fine.
BINGO! As expected, the recreated locations have in fact been copied over correctly.
Tried the obvious solution of retrying until we get to the end, similar to
However, we get a concatenation of the ts and uc data from escts. In practice, while testing with the user above, it looks like t1 > t2, so we end up with an infinite loop.
Regardless, we need to check the timeseries and the usercache separately, which means that we need to change
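To make the failure mode concrete, here is a stripped-down illustration (names and structure are illustrative, not the actual export code): if a single "current timestamp" cursor is advanced over the concatenated timeseries + usercache batch, and the usercache tail carries older timestamps than the timeseries entries already read (the t1 > t2 case above), the cursor moves backwards and the loop never terminates. The per-collection fix is sketched after the "only principled solution" comment below.

```python
# Illustration only, not the actual export code. read_batch is assumed to
# return the concatenated (timeseries + usercache) entries newer than curr_ts.
def naive_export(read_batch, start_ts, end_ts):
    curr_ts = start_ts
    while curr_ts < end_ts:
        batch = read_batch(curr_ts, end_ts)
        if len(batch) == 0:
            break
        # If the last entry is a usercache entry whose ts is older than
        # timeseries entries already seen (t1 > t2), curr_ts moves backwards
        # here, so the same window is read over and over.
        curr_ts = batch[-1]["data"]["ts"]
```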
Added the retry flag and some more logging. And even without the retry, we do read all the points successfully while reading from a restored mongodump. Not quite sure why we didn't do this on the server - maybe we had dropped down the max limit of entries read? Alas, the server is now shut down, so I can't verify.
Or maybe there's just something very weird going on locally - we should be limiting to 250k, but the
Actually, even retrying at the
wrt #845 (comment) this is the same reason. Note that there are 601,169 entries in the timeseries and 103,488 entries in the analysis timeseries, but our final batch size is only 353,488 entries, which matches the 250,000 cap on the timeseries read plus the full analysis timeseries (250,000 + 103,488 = 353,488). The last entry is in December, but it is from the analysis timeseries (
Note also that the dump script has a
The only principled solution is to actually read the three types of entries directly from the database and to retry each of them separately.
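A sketch of that approach, under the assumption that the three types are the timeseries, the analysis timeseries, and the usercache, and that each read is capped at a maximum batch size (the function and parameter names are illustrative, not the actual export code):

```python
import emission.core.get_database as edb

def read_all_entries(coll, user_id, start_ts, end_ts, batch_size=250000):
    """Page through one collection on data.ts, retrying until exhausted.
    Note: entries sharing the boundary ts could be skipped by the strict $gt;
    the real fix needs to handle such ties. The usercache may also need to
    key on metadata.write_ts rather than data.ts."""
    curr_ts = start_ts
    entries = []
    while True:
        batch = list(coll.find(
            {"user_id": user_id,
             "data.ts": {"$gt": curr_ts, "$lte": end_ts}}
        ).sort("data.ts", 1).limit(batch_size))
        if len(batch) == 0:
            return entries
        entries.extend(batch)
        curr_ts = batch[-1]["data"]["ts"]

def export_user(user_id, start_ts, end_ts):
    # one independent cursor + retry loop per store
    return {
        "timeseries": read_all_entries(edb.get_timeseries_db(), user_id, start_ts, end_ts),
        "analysis": read_all_entries(edb.get_analysis_timeseries_db(), user_id, start_ts, end_ts),
        "usercache": read_all_entries(edb.get_usercache_db(), user_id, start_ts, end_ts),
    }
```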
We need to pass in the user_id as well. Testing done: works without retry, e-mission/e-mission-docs#845 (comment)
Porting the changes from master to the GIS branch, which is what we actually use everywhere, was a bit challenging since all the export code in the GIS branch is pulled out into a separate file (emission/export/export.py). I went through the changes carefully and copied the new code over. To test, I ran the export pipeline on both master and the GIS branch.
Re-generating to see how the logs differ:
On master:
On GIS:
So
Running the query directly against the database while on the GIS branch does give the right number
OK, so the difference is due to the
Double-checking, the new entry is in fact invalid
Also true in the diff
And the GIS branch does in fact include the
while the master branch does not
Phew! I thought I was going crazy for a minute there.
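For reference, the kind of count comparison that exposes this difference would look roughly like the following (a sketch; the background/location key and the exact form of the invalid filter are assumptions based on the "invalid key" discussion above):

```python
import emission.core.get_database as edb

user_id = None  # replace with the test user's UUID
base_query = {"user_id": user_id, "metadata.key": "background/location"}

with_invalid = edb.get_timeseries_db().count_documents(base_query)
without_invalid = edb.get_timeseries_db().count_documents(
    {**base_query, "invalid": {"$exists": False}})

# the difference should be exactly the entries marked invalid
print(with_invalid, without_invalid, with_invalid - without_invalid)
```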
As a side note, it looks like we don't have the invalid key as an index in the database. Since we include it in every single search, we should probably add it; that will improve performance.
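If we do add it, a minimal version would be a single-field index on the timeseries collection (assuming the standard collection returned by edb.get_timeseries_db()):

```python
import emission.core.get_database as edb

# one-time index creation; a no-op if the index already exists
edb.get_timeseries_db().create_index("invalid")
```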
Restored for the test user
We now have those location entries
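A quick verification sketch for the restored range (the key name, user id and timestamps are placeholders; adjust to the actual missing window):

```python
import emission.core.get_database as edb

user_id = None             # replace with the test user's UUID
gap_start, gap_end = 0, 0  # replace with the previously missing range

count = edb.get_timeseries_db().count_documents(
    {"user_id": user_id,
     "metadata.key": "background/filtered_location",
     "data.ts": {"$gte": gap_start, "$lte": gap_end}})
print(count)  # should now be non-zero
```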
And the pipeline has since run successfully
Will close this once we have copied over data for all the other participants as well.
Looking at the other participants
[1] but we have data until 12-31, missing point is from 12-30? Data from 12-31 is from the new app; the previous data copy is only until 12-05.
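To eyeball gaps like these across the other participants, a simple scan over the sorted location timestamps is enough (a sketch; the key name and the 12-hour threshold are assumptions):

```python
import emission.core.get_database as edb

def find_gaps(user_id, threshold_secs=12 * 60 * 60):
    cursor = edb.get_timeseries_db().find(
        {"user_id": user_id, "metadata.key": "background/filtered_location"},
        {"data.ts": 1}).sort("data.ts", 1)
    prev = None
    for e in cursor:
        curr = e["data"]["ts"]
        if prev is not None and curr - prev > threshold_secs:
            print("gap from %s to %s" % (prev, curr))
        prev = curr
```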
Copied over:
So 16 is weird. The missing item is not the location entry, but the place for the trip
The missing place is in from
Although it was not reset in the original mongodump.
And we are missing analysis data for the user before Jan 2022
OK, so it looks like I copied over only data from 2022-01-01 onwards. (1) One fix is to reset the pipeline completely. Let's go with (2) so that we retain the object ids in case we want to handle any more fixes.
Found another gap for user 16 from
After loading that gap, am currently at
As we have separate enclaves for different projects, we need to have mechanisms for copying data from one enclave to another.
The second example has actually occurred, and I copied over the data using a combination of bin/debug/extract_timeline_for_day_range_and_user.py and bin/debug/load_multi_timeline_for_range.py.
The data was copied over correctly, and the user confirmed that it was displayed correctly.
However, data collected after the migration is not processed correctly and always remains in draft mode. On investigating further, this is because the CLEAN_AND_RESAMPLE stage fails with the following error. We need to see why the start point is not available.