Purge + Restore user timeseries data with long-term storage #952
Conversation
When operating a server, the `Stage_timeseries` database can become quite large. Since only the `Stage_analysis_timeseries` is actually needed after the pipeline execution, the user's raw timeseries can be deleted to speed up the pipeline and reclaim some disk space.
Also added associated unit tests
Print() statements weren't being logged in AWS CloudWatch logs; logging.debug() statements are meant for this purpose. These statements may or may not show up in normal execution output depending on the configured logger level.
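For illustration, a minimal sketch of the logging pattern described above (the function and message are hypothetical):

```python
import logging

# Configure the root logger; in AWS, CloudWatch captures this output,
# whereas bare print() calls were not showing up in the logs.
logging.basicConfig(level=logging.DEBUG)

def purge_step(user_id, count):
    # Shows up (or not) depending on the configured logger level.
    logging.debug("Purging %d entries for user %s", count, user_id)
```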
Choosing JSON instead of CSV since: 1. CSV does not retain the nested, dict-like structure of MongoDB documents. 2. CSV also stores redundant empty NaN columns.
CSV export kept on hold for now as restoring from CSV is complicated due to loss of data structure. This commit includes working code for export as JSON file and import from JSON file.
The default option for now is JSON, which is easier for data restore. Export flags are provided as a boolean dictionary that calls the specific export function according to the flag that is set.
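A minimal sketch of the boolean export-flag dispatch described above; the flag names and export functions are hypothetical placeholders, not the actual PR code:

```python
import json

def export_as_json(entries, path):
    # JSON keeps the nested, dict-like structure of the MongoDB documents.
    with open(path, "w") as f:
        json.dump(entries, f, default=str)

def export_as_csv(entries, path):
    # On hold: CSV flattens nested documents and adds empty NaN columns.
    raise NotImplementedError

EXPORT_FLAGS = {"json": True, "csv": False}       # JSON always on by default
EXPORT_FUNCTIONS = {"json": export_as_json, "csv": export_as_csv}

def run_exports(entries, base_path):
    for fmt, enabled in EXPORT_FLAGS.items():
        if enabled:
            EXPORT_FUNCTIONS[fmt](entries, f"{base_path}.{fmt}")
```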
Thinking of options to be made available for the export file type: CSV, JSON, mongodump.
I'm proposing JSON as the default option, which will occur irrespective of any other export type (CSV for now). Pros of this:
Pros of having Mongodump as default option:
Currently integrated CSV export as an optional export. Once mongodump is implemented and that data can be imported as well, mongodump also becomes an export option. I have grouped all export data file types in a boolean dictionary that calls the specific export function if that flag is set. Next, working on adding mongodump as an option.
Built on and added tests for normal data operations of purge() and restore(). Added edge-case tests: 1. Loading duplicate or already-existing data by calling the restore function again. 2. Loading from an empty JSON file containing no data. Will add additional tests if needed.
Changed file path of empty json file used for testing to generic /var/tmp instead of local path.
Changed from "/tmp" to the operating system's default temporary directory, which makes this functionality work cross-platform.
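For example, a small sketch of the cross-platform temp-directory change (the file name here is hypothetical):

```python
import os
import tempfile

# tempfile.gettempdir() resolves to "/tmp" on Linux but to the platform's
# own temp location elsewhere, so tests no longer hard-code "/tmp".
temp_dir = tempfile.gettempdir()
empty_json_path = os.path.join(temp_dir, "empty_purge_restore_test.json")
```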
Latest fixes work with JSON files as the export file type, and the same file also works for importing data correctly back into the database. With CSV files, however, it's a bit more complicated, for two reasons:
With mongodump, I tried using the Python subprocess module to run the mongodump command as a terminal command, but there were issues with missing dependencies (such as mongodump itself). I then had to figure out how to run it and/or save the file locally in the Docker container, and then extract it from there if needed for analysis on one's local system. With these in mind, JSON currently seems a good option for basic export/import.
Yes, that is the export PR. We can also potentially use the associated pipeline state for incremental exports, since this is planned to be part of a running system. I would suggest using command line flags to indicate whether we want an incremental dump or a full dump.
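A minimal sketch of the suggested command line flags; the flag names and script interface are assumptions, not the actual PR arguments:

```python
import argparse

parser = argparse.ArgumentParser(description="Export + purge raw timeseries data")
parser.add_argument("user_id", help="UUID of the user whose data to purge")
group = parser.add_mutually_exclusive_group()
group.add_argument("--full", action="store_true",
                   help="export and purge everything up to the current time")
group.add_argument("--incremental", action="store_true",
                   help="export and purge in chunks starting from last_processed_ts")
args = parser.parse_args()
```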
Import file type added as command line argument with appropriate functions.
Checking for a valid cstate and the query results count before initiating purging.
As discussed earlier, this needs to use the existing export method.
The scalability considerations here, if you follow the PR chain, are that we store the raw data in the timeseries, we generate the analysis timeseries from the raw timeseries, and then we use the analysis timeseries (aka with the analysis results) in the rest of our work. So the argument is that you don't need to store the raw timeseries on an ongoing basis. It is only required when we need to reset the pipeline and/or change analysis algorithms. So it can be moved to long-term storage, but with the ability to restore before we reset the pipeline.
I started off by first understanding the differences, in terms of code changes, between the newer Purge PR and the older Export PR. A) Purge vs Export
B) Do we want to allow the user to specify the time range for export + purge? Or should it be fetched programmatically using pipeline stages, like it's being done in the Purge PR and in the
C) Output files containing exported data differ in the format of the ObjectID and UUID
This is managed via
Looking at the files involved in the export PR:
Export Data Flow:
But I got some errors: A. FileNotFoundError
To solve this issue, I had to manually create the archive directory under emission, and then execute the test file.
Error stack trace
I'm yet to ascertain the cause of this.
Main query / concern for now
One of the main things I am concerned with is that the existing export() method takes in data from three sources: But the requirement in this PR is to export only the raw data entries stored in timeseries_db, then purge the exported data entries, and restore from the exported data when required. This snippet is from
Related PRs / Issues for fetching data from three sources
The reasoning behind these changes to read from the three database sources is mentioned in this issue. This PR added these changes to fix this issue. This commit has the specific changes that were added. This commit talks about restoring dumped data. This comment has logs about restoring user data. I am going to scan these related issues again to see if there's a workaround or whether we'll need a separate export script altogether.
I'm somewhat unclear about the usage of
Detailed discussion here in the export PR:
Will need to read up on the pipeline code and pipeline states again.
Programmatically. I don't see any arguments against it. In general, the "purge" PR is what we should base our work on since it is by far the most polished option.
I don't see why you need a separate export script. It should be trivial to add a database argument to the existing script. We can also consider exporting using
Details in this comment: e-mission#952 (comment)
…ile as well. Some pointers: had to comment out logging.config.dictConfig in export_stage, as it was giving a "logging has no module named config" error. Also, running the test with and without running the intake pipeline gives different results; I believe it has to do with how start_ts is being set as part of the pipeline process. Next, will move on to purging and restoring from the .gz dump file.
Here’s an overview of the pipeline stages along with the approaches used for updating the last_processed_ts.
Having seen all these, I opted to go with the approach that fetches all trip entries first before performing the operation.
Once the specific stage's primary operations are completed, the last trip is directly indexed. I'm using the value present in
I thought of using
Took reference from emission/analysis/userinput/matcher.py
So, I decided to use
Hence, decided to go ahead with
I will take a second look and possibly use a similar
PENDING / TODO:
…cesssed_ts for restore_data
Simply using the last index value, like how it was done in purge_data.
When the validate truncate function signature still has 3 arguments, it gives an error but exports all entries. However, when I modify it to include only the location query, it passes but does not export all entries: the last curr_end_ts timestamp doesn't go all the way to the last one. Committing changes before switching to another branch.
Exporting and purging entries in 1 hour time chunks into separate files with a defined start_ts and end_ts.

start_ts is the last_processed_ts for the user_id's Purge pipeline state.
- If the pipeline has already been run for the user, then this would be a non-None value.
- If the pipeline hasn't been run, then it will be None; in this case, the earliest timestamp entry is chosen as the start ts. This helps avoid ValueErrors when adding 1 hour (3600 seconds) to start_ts for incremental export.

end_ts differs for full export and incremental export:
- Full export: the current time at which the pipeline is run will be the end ts; the value returned by pipeline_queries on initiating the pipeline stage is used.
- Incremental export: the first end_ts value would be 1 hour ahead of start_ts; this value would continue to be incremented by 1 hour as long as data exists for a user. If the value after adding 1 hour exceeds the current time, then end_ts is set to the current time itself.

The export + delete process continues as long as there is raw timeseries data for the user (a minimal sketch of this chunking loop follows after this comment).

But what does 1 hour's worth of data mean?
- In any case, the purge pipeline runs up to the current time or until no more raw timeseries entries are present in the db for the user.
- If the purge pipeline is running for the first time for a user, then it will export and purge all the timeseries data for that user from its first entry (which can be really old data, so the first purge run might take a lot of time).
- If the purge pipeline has already been run before for a user, then it will set start_ts to last_processed_ts and export data from that point.
- If the purge pipeline is run hourly, then it would eventually just have a small subset of entries.

Some points to consider:

A. Numerous files; less data quantity per file
One observation is that the current logic is creating multiple files in 1 hour chunks, which is okay. But these files don't really have a lot of entries. What could be more efficient is to perhaps store more entries until a threshold, say 5000 or 10000 (like batch_size in load_multi_timeline_for_range). If this default threshold batch size isn't reached, keep adding to the same file: keep updating the end_ts, but start_ts would remain the same. Will attempt this next step.

B. Right approach for Export + Purge?
Approach A
1. Export data in chunks to file.
2. Delete exported data from DB.
3. Repeat until all data purged.
Flow looks like: Export -> Delete -> Export -> Delete -> Export -> Delete

Approach B
1. Export data in chunks to file.
2. Repeat until all data exported.
3. Delete all exported data from DB.
Flow looks like: Export -> Export -> Export ... -> Delete

C. Do we need all 3 types of entries: locations, trips, places?
For now, commented out code from export_timeseries.py. If we only need location entries, we can simplify the code further to just work for these entries. If these are sort of like subsets of each other (location -> trip -> place), then I can safely just take location. But this is valid only if all entries contain location, and hence ts, values. If only trip entries or only place entries are present, then directly choosing the latest ts is incorrect, since trips use start_ts while places use enter_ts. Searching the codebase for references and reading through Shankari's thesis for start_ts and enter_ts. I'm getting hints that start_ts and enter_ts are analysis_timeseries entries? In that case, we can ignore these, since purge + restore is concerned with raw timeseries data only.

Trip entries are created in emission/analysis/intake/segmentation/trip_segmentation.py

Hint 1: Timeseries_Sample.ipynb
- ct_df fetches analysis/cleaned_trip entries -> analysis DB

Hint 2: bin/historical/migrations/populate_local_dt.py
- Looks like old code; some changes were last made 8 years ago.
- The collection parameter refers to some non-timeseries databases, as seen from the function calls.
- The entry['start_ts'] or entry['enter_ts'] values are then used in the find query by setting data.ts to this value.

D. Is pipeline_states export needed?
Remove pipeline_states export if not needed. Currently being used in the existing export + load scripts.
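To make the chunking described above concrete, here is a minimal, hypothetical sketch of the time-range loop; the function and constants are placeholders, not the actual pipeline code:

```python
import time

ONE_HOUR = 3600
FUZZ = 5 * 60  # assumed buffer, mirroring the "current time - fuzz" set by pipeline_queries

def next_time_ranges(last_processed_ts, earliest_entry_ts, incremental=True):
    """Yield (start_ts, end_ts) chunks for the export + purge loop."""
    now = time.time() - FUZZ
    # Start from the last purge run, or from the earliest raw entry on the first run.
    start_ts = last_processed_ts if last_processed_ts is not None else earliest_entry_ts
    if not incremental:
        yield (start_ts, now)                   # full export: a single range up to "now"
        return
    while start_ts < now:
        end_ts = min(start_ts + ONE_HOUR, now)  # clamp the final chunk to "now"
        yield (start_ts, end_ts)
        start_ts = end_ts
```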
This reverts commit 1b6833d.
…_range
Main reason for this change was to avoid dealing with the pipeline_states file.

Problem: Observed in the exported pipeline state tar file that last_processed_ts for the PURGE stage was null. But from the logs it looks like the updated pipeline state was inserted in the DB:

```
INFO:root:For stage PipelineStages.PURGE_TIMESERIES_DATA, last_ts_processed = 2015-07-23T06:40:40.069000
DEBUG:root:About to save object PipelineState({'_id': ObjectId('66c79deddbf93d53d0f61184'), 'user_id': UUID('222a6ab7-94f0-4fec-a48f-0471297a9644'), 'pipeline_stage': 19, 'curr_run_ts': None, 'last_processed_ts': 1437633640.069, 'last_ts_run': 1724358120.242778})
DEBUG:root:After saving state PipelineState({'_id': ObjectId('66c79deddbf93d53d0f61184'), 'user_id': UUID('222a6ab7-94f0-4fec-a48f-0471297a9644'), 'pipeline_stage': 19, 'curr_run_ts': None, 'last_processed_ts': 1437633640.069, 'last_ts_run': 1724358120.242778}), list is [{'_id': ObjectId('66c79deddbf93d53d0f61184'), 'user_id': UUID('222a6ab7-94f0-4fec-a48f-0471297a9644'), 'pipeline_stage': 19, 'curr_run_ts': None, 'last_processed_ts': 1437633640.069, 'last_ts_run': 1724358120.242778}]
```

Checked the DB; last_processed_ts is updated in the pipeline state db.

```
Command:
pipeline_states_list = list(edb.get_pipeline_state_db().find({"user_id": UUID(uuid_test)}))
for state in pipeline_states_list:
    print(state)

Output:
{'_id': ObjectId('66c79eaad2f85784cd19fd79'), 'user_id': UUID('b9f26d62-ef7b-489a-b814-154ea8e08dae'), 'pipeline_stage': 19, 'curr_run_ts': None, 'last_processed_ts': 1437633640.069, 'last_ts_run': 1724358309.836113}
```

Why was it added in the original script?
- Commit that added it in the export PR: e-mission@dd9ec1c. These were added in Jan 2018.
- Commit that added it in extract_timeline_for_day_range_and_user.py: e-mission@b38366b
- Commit that added it in load_multi_timeline_for_range.py: e-mission@06a0a4e

Why do we need it? Shankari:
"The raw data and the analysis results do not constitute the entire state of a pipeline. In particular, if we store only the raw + analysis results, and then we try to run the pipeline again, we will end up with two copies of the analysis results."

Do we need it for the purge PR?
- The commit message states that pipeline states were also exported / loaded so that we don't have duplicate analysis entries.
- In the purge PR, we are strictly dealing with timeseries_db data.
- Hence it can be removed from the purge_restore related code.

Something wrong with export_pipeline_states() in purge_data then?
- No; I was calling export_timeseries to export pipeline states inside the run_purge_pipeline function in purge_data.
- This was running for every export file, but at this point last_processed_ts isn't updated.
- It is only updated once the function exits and goes back to the parent function where the stage is marked as done. The final update to the pipeline state occurs on returning to the parent function purge_data():

```
if pdp.last_processed_ts is None:
    logging.debug("After run, last_processed_ts == None, must be early return")
espq.mark_purge_data_done(user_id, pdp.last_processed_ts)
```

- Hence, in all the pipeline state files, last_processed_ts has the value NULL.
- Also, we get multiple pipeline state files, since the function call to export the pipeline states is also within the run_purge_data_pipeline function that exports the timeseries data.
Now, one approach to resolve this that I thought would work:
- Move the export pipeline states call to the parent function after the stage has been marked as done.
- If we move the export pipeline states call to the parent function, only one pipelinestates file would be created; I tried using the earliest start ts and the latest end ts to name the file.
- But this had a few problems.
- First, it doesn't seem right that after a stage is marked as done we are still performing some steps. I checked the existing intake pipeline stages and none of them do this.
- Second, while it would work for full export, it would be difficult for incremental export, where each export filename has a different start and end ts corresponding to multiple time ranges.
- This is because load_multi_timeline_for_range calls load_pipeline_states, which uses the same file_prefix name as the exported timeseries filename to load the pipeline states; it just adds "pipelinestates" to the file_prefix.
- But then this file wasn't found by the load_multi_timeline function, since it loads the data from the first time range file (say start_ts_1 to end_ts_1, and so on); when it tries to load the pipeline_states file with the same prefix, it doesn't exist because we now only have one file named with earliest_start_ts to latest_end_ts.
- Hence, it gives a FileNotFoundError.
One observation is that the current logic is creating multiple files, which is okay. But these files don't really have a lot of entries. What could be more efficient is to perhaps store more entries until a threshold, say 5000 or 10000 (like batch_size in load_multi_timeline_for_range). If this default threshold batch size isn't reached, keep adding to the same file: keep updating the end_ts, but start_ts would remain the same.

Found an edge case
Incremental export is fine. Let's say we have chosen full export. In the sample data we have 1906 entries. In batch testing I'm setting batch_size_limit to 500. Now, when the code executes:
- current_end_ts will be set to initEndTs, which is the current time minus the FUZZ time, as set by the pipeline queries.
- new_entries will have all 1906 entries, which is more than the batch_size_limit.
- BOTH the batch_size_limit check and the current_end_ts check will be TRUE.
- It will export the oversized batch of more than the limit and also delete those entries.
- While this seems fine, it will cause issues when we attempt to restore data whose size exceeds the batch size.

Hence, need a way to handle this by perhaps:
- Setting current_end_ts to the ts value of the entry at the batch_size_limit - 1 index.
- Fetching entries up to this point only.
- Then fetching the next batch of entries.

Essentially, in this scenario, unlike the incremental scenario where we are incrementing current_end_ts by 3600 seconds, here we need to increment current_end_ts to the ts value of the entry at the next batch_size_limit - 1 index (see the sketch after this comment).

Working on this, but tests for it are pending. Also, the batch size is still being exceeded.
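A minimal sketch of the batch-capping idea described above, assuming entries are sorted ascending by data.ts (the helper name is hypothetical):

```python
def cap_end_ts_to_batch(new_entries, current_end_ts, batch_size_limit=500):
    """If a time range returns more entries than the batch limit,
    shrink current_end_ts to the ts of the last entry that fits in the batch."""
    if len(new_entries) <= batch_size_limit:
        return current_end_ts, new_entries
    # Keep only the first batch_size_limit entries for this chunk.
    capped = new_entries[:batch_size_limit]
    capped_end_ts = capped[-1]["data"]["ts"]  # ts of the entry at index batch_size_limit - 1
    return capped_end_ts, capped
```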
…emental
But the 1st batch has 501 entries, the 2nd has 500, the 3rd has 500, and the 4th has 405 entries.

Understood the issue. Right now we are segregating based on time ranges as well as batch sizes. For incremental export, both are in play and right now the logic is getting messed up. For full export, mainly the batch size is in play, as end_ts would initially be set to the current time. But if the batch size exceeds the limit, then we set end_ts to the current batch's last entry.

Now, while run_purge_data_pipeline() is able to stop at the batch size, the existing export() script is unable to do so. The export script just checks the timestamps and exports everything in that range. Similarly, the delete function doesn't care about the batch size and just deletes all matching entries within the time range.

A simple fix could be to limit the entries exported and deleted. For export, just returning 500 entries for now in the export script works. For delete, there is no limit flag; can try deleting only matching IDs (a sketch follows after this comment).

Trying to solve this for incremental export. But realized that we might not need the batch size at all. The batch_size default in load_multi_timeline_for_range isn't a fixed cutoff such that it only processes the limited data; it just separates the data into batches within the script itself. No need to handle it in the purge export script.

Also, can simplify the delete function in purge.

New test added for batch size.

Just committing code here for reference.
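For the "delete only matching IDs" idea, a minimal sketch assuming direct access to the raw timeseries collection via the existing edb.get_timeseries_db() helper (the wrapper function itself is hypothetical):

```python
import emission.core.get_database as edb

def delete_exported_entries(exported_entries):
    # Delete exactly the documents that were exported, instead of everything
    # in the time range, so the batch cap is respected on the delete side too.
    exported_ids = [e["_id"] for e in exported_entries]
    result = edb.get_timeseries_db().delete_many({"_id": {"$in": exported_ids}})
    return result.deleted_count
```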
Realized that we might not need the batch size at all. The batch_size default in load_multi_timeline_for_range isn't a fixed cutoff such that it only processes the limited data; it just separates the data into batches within the script itself.

Will clean up code in the next commit.
Will clean up and add more tests. Looks good for now. Need to update PR with queries now.
This reverts commit c38b82d.
This seems fine, but we can read from the DB before deletion occurs and then compare the DB entries with those read in from the export file. Will work on that next. Also pending: a temp directory for tests to store generated export files.
Added tests that assert the first and last few entries from the DB and export files, comparing object IDs only for now. Also added a temp directory for tests so that the local emission/archived directory isn't filled with export files.
Most of these latest code changes are focused on the export part, mainly to implement both full and incremental export. Some highlights from the Export PR comment: full vs incremental export discussion by kafitz. Summary of changes made since last time around (details in the comments below):
Discussed in detail here: 1.A, 1.B, 1.C, 1.D
Discussed in detail here: 2.A, 2.B
Main questions I had, which also give some reasoning around the changes I've made. Commit with questions in more detail.
Questions
Q. 1. Right approach for Export + Delete? Approach A
Approach B
Q. 2. Do we need all 3 types of entries: locations, trips, places? From export.py
For now, commented out code from export_timeseries.py that gets all three types of entries. I went through Shankari's thesis and understood that entries are broken down as per the travel diary structure:
In my current logic, I'm using
Searching the codebase for references and reading through Shankari's thesis to
In that case, can we ignore these, since the purge + restore functionality is concerned primarily with raw timeseries data?
Hint 1: Check
Saw that trips- and places-related keys were only found in
Hint 2: Timeseries_Sample.ipynb
Hint 3: bin/historical/migrations/populate_local_dt.py
Continued discussion of changes made:
Discussion from the Export PR comment: detailed discussion by kafitz on full vs incremental export + REST API
1. A. last_processed_ts explained by Shankari
1. B. What does 1 hour’s worth of data mean? This is my understanding of what it possibly means. “1 hour’s worth of data” -> different connotations:
For the export PR
For the purge PR,
Time range of exported data in Purge pipeline.
1. C. Does restore script also need to handle 1 hour data? I've based the import_timeseries.py script on the existing load_multi_timeline_for_range.py script. We shouldn’t need to handle 1 hour logic in restore_data.py since:
1. D. Exit conditions for purge pipeline
The break condition to exit the while loop would be:
While the first condition is fairly obvious, I had to consider the 2nd condition, since I came across an edge case where not all entries were exported. Example from sample data:
The first time range gets entries from 1906 to 1498. For the third time range,
Now, the next entry 1387 has a
Hence, I added the condition to check the timeseries db entries in each iteration (a minimal sketch follows below).
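For illustration, a minimal sketch of the two exit conditions under discussion, assuming the loop structure described in the earlier commit messages (the callables are placeholders, not actual pipeline functions):

```python
def run_purge_loop(count_raw_entries, export_and_delete_chunk,
                   current_end_ts, init_end_ts, one_hour=3600):
    """Export + purge in chunks until the export window is exhausted
    or no raw timeseries entries remain for the user."""
    while True:
        if count_raw_entries() == 0:
            break                                  # no more raw entries left in the db
        export_and_delete_chunk(current_end_ts)
        if current_end_ts >= init_end_ts:
            break                                  # reached the end of the export window
        current_end_ts = min(current_end_ts + one_hour, init_end_ts)
```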
Continued discussion of changes made:
This is a lot of duplicated code; eventually we can refactor and pull out common code for both the export and purge scripts. Main difference between the existing scripts:
b. import_timeseries based off load_multi_timeline_for_range
2. A. Switched to using
Usercache is the first stage and there are no trips loaded, hence we work with
Since we're dealing with raw timeseries data for purging / restoring data, we mainly have data.ts in the raw timeseries entries. So I went ahead with using
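For illustration, a minimal sketch of querying raw timeseries entries on data.ts with the existing TimeQuery helper; the wrapper function and variable names are assumptions:

```python
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt

def fetch_raw_entries(user_id, start_ts, end_ts):
    # Raw "background/*" entries carry data.ts, unlike analysis entries,
    # which use data.start_ts / data.enter_ts.
    ts = esta.TimeSeries.get_time_series(user_id)
    time_query = estt.TimeQuery("data.ts", start_ts, end_ts)
    return list(ts.find_entries(key_list=None, time_query=time_query))
```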
2. B. Removed pipeline_states_export file
Check the commit message for more details. To summarize:
Code referenced from purge_data
Problem faced
What I tried to resolve it
Pipeline states were added in Jan 2018.
Commit that added it in load_multi_timeline_for_range.py
Why we need it, as explained in the commits above
Do we need it for purge PR?
Thus, can we remove it from the purge_restore related code?
@MukuFlash03 is this currently "Ready for review" or waiting for "Questions from Shankari"? Can you put it into the correct column?
The pipeline state is part of the analysis pipeline, so it is related to analysed data.
Changes
1. Fetching only loc-like entries from the existing export data logic as the raw timeseries entries.
- Found a lot of references that trip and place entries are a part of the analysis timeseries database. Almost every place I've found uses data.start_ts for "analysis/*" metadata key entries.

In bin/debug/export_participants_trips_csv.py
```
ts = esta.TimeSeries.get_time_series(user_id)
trip_time_query = estt.TimeQuery("data.start_ts", start_day_ts, end_day_ts)
ct_df = ts.get_data_df("analysis/confirmed_trip", trip_time_query)
```

In bin/debug/label_stats.py
```
for t in list(edb.get_analysis_timeseries_db().find({"metadata.key": "analysis/inferred_trip", "user_id": sel_uuid})):
    if t["data"]["inferred_labels"] != []:
        confirmed_trip = edb.get_analysis_timeseries_db().find_one({"user_id": t["user_id"], "metadata.key": "analysis/confirmed_trip", "data.start_ts": t["data"]["start_ts"]})
```

Similarly for data.entry_ts.

On the other hand, for data.ts, timeseries_db was used, since "background/*" metadata key entries were involved:

In emission/analysis/intake/segmentation/section_segmentation.py
```
get_loc_for_ts = lambda time: ecwl.Location(ts.get_entry_at_ts("background/filtered_location", "data.ts", time)["data"])
trip_start_loc = get_loc_for_ts(trip_entry.data.start_ts)
trip_end_loc = get_loc_for_ts(trip_entry.data.end_ts)
```

In emission/analysis/intake/segmentation/trip_segmentation.py
```
untracked_start_loc = ecwe.Entry(ts.get_entry_at_ts("background/filtered_location", "data.ts", last_place_entry.data.enter_ts)).data
```

2. Refactored emission/export/export.py
- Added a separate function that returns exported entries so that it can be reused in the purge pipeline code.
- This helped remove repeated code for re-fetching exported entries.
- Also using a databases parameter for exporting data from a specific db. For the purge use case, `databases` should only have 'timeseries_db'.

3. Added raw_timeseries_only parameter to load_multi_timeline_for_range.py
- If this argument is set, then pipeline_states will not be loaded, since we don't want pipeline states to be restored while restoring raw timeseries data.

4. Cleaned up tests
- Reduced repetitive code by moving assertion tests to functions that can be reused for both full and incremental export testing.

5. Removed export_timeseries.py and import_timeseries.py
- No need to have duplicate code, since we now use the existing scripts in load_multi_timeline_for_range.py and export.py.
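A rough, hypothetical sketch of how the refactored pieces described above could fit together in the purge flow; the helper names and signatures are assumptions, not the actual refactored API:

```python
def export_raw_entries(user_id, start_ts, end_ts, file_prefix, databases):
    """Placeholder for the refactored export function that returns the exported entries."""
    raise NotImplementedError

def delete_entries_by_id(entry_ids):
    """Placeholder for deleting exactly the exported documents from timeseries_db."""
    raise NotImplementedError

def restore_raw_timeseries(file_prefix):
    """Placeholder for restoring via load_multi_timeline_for_range with
    raw_timeseries_only set, so pipeline states are not loaded."""
    raise NotImplementedError

def purge_raw_timeseries(user_id, start_ts, end_ts, file_prefix):
    # Export only the raw timeseries database, then delete exactly what was exported.
    exported_entries = export_raw_entries(user_id, start_ts, end_ts, file_prefix,
                                          databases=['timeseries_db'])
    delete_entries_by_id([e["_id"] for e in exported_entries])
    return len(exported_entries)
```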
Building on top of PR #904 with the objective of integrating this into the current codebase.
Will be analyzing the file storage options (CSV, JSON) as well as including the option for mongodump/restore.
Additionally, long-term storage to AWS will be looked at as well.