Support a schema update mode in stats runner #344
Conversation
The general direction of the PR is perfect!
Is it possible to break up this PR into 2 smaller PRs - one with changes primarily to the DB and a follow-up PR that introduces the new mode?
simple/tests/stats/runner_test.py (Outdated):

```python
_test_runner(self,
             "input_dir_driven_with_existing_old_schema_data",
             is_config_driven=False,
             input_db_file_name="sqlite_old_schema_populated.db")
```
With binary `*.db` files, the tests can feel opaque. Is it possible to have the tests take CSV files as input and compare against expected CSV files?
Sure, I can do that. Before I do so, are you able to review the non-test portions of the PR so I can make test updates all in one if necessary? Also so I can work during the time zone offset.
Yes. The overall implementation looks good; I added a couple of nits.
Haven't addressed your other comments yet, but for now: how do the *.sql files for test db setup look? I opted for that over CSV since it's not just the table contents that matter but also the schema.
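To illustrate the `*.sql` fixture approach (the fixture contents and helper name below are hypothetical, not the actual test code), a test can rebuild a full database, schema and rows alike, from a checked-in text file using sqlite3's `executescript()`:

```python
import os
import sqlite3
import tempfile

# Hypothetical fixture text; in the real tests this would be read from a
# checked-in file such as sqlite_old_schema_populated.sql.
FIXTURE_SQL = """
CREATE TABLE observations (entity TEXT, variable TEXT, value REAL);
INSERT INTO observations VALUES ('country/USA', 'Count_Person', 331000000);
"""


def db_from_sql_fixture(sql_text: str, db_path: str) -> sqlite3.Connection:
    # executescript() runs the whole dump (schema + data) in one call.
    conn = sqlite3.connect(db_path)
    conn.executescript(sql_text)
    conn.commit()
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = db_from_sql_fixture(FIXTURE_SQL, os.path.join(tmp, "test.db"))
    rows = conn.execute("SELECT variable, value FROM observations").fetchall()
    print(rows)  # [('Count_Person', 331000000.0)]
    conn.close()
```

Because the fixture is plain text, schema changes show up directly in review diffs, which is the advantage over binary `.db` or CSV inputs.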
Oh - this totally works. It's text based and a truer representation of a DB vs CSVs.
Didn't know about the `iterdump()` utility. Seems like the exact thing we need here.
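For reference, sqlite3's `iterdump()` yields the SQL statements that recreate both the schema and the rows, so a binary `.db` can be converted into a reviewable text fixture. A minimal sketch (the table here is made up for illustration):

```python
import sqlite3

# Build a small in-memory DB standing in for a real .db file.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
src.execute("INSERT INTO kv VALUES ('schema_version', '1')")
src.commit()

# iterdump() emits the schema and data as SQL text, suitable for checking in.
dump_sql = "\n".join(src.iterdump())
src.close()

# Round trip: executing the dump rebuilds an equivalent database.
dst = sqlite3.connect(":memory:")
dst.executescript(dump_sql)
print(dst.execute(
    "SELECT value FROM kv WHERE key = 'schema_version'").fetchone())  # ('1',)
```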
```python
@@ -285,8 +291,13 @@ class SqlDb(Db):

    def __init__(self, config: dict) -> None:
        self.engine = create_db_engine(config)
        self.engine.init_or_update_tables()
```
Since `clear_tables_and_indexes_for_import()` is now being called explicitly by clients, I'm wondering if `init_or_update_tables()` should be as well?
Hmm, so I tried this but it changed an unexpectedly large surface and seemed to make it easier to "hold it wrong" when creating any Db, even if it's not a SqlDb. I have tried instead to make it very clear with naming and comments what is happening. WDYT?
```python
@@ -137,35 +138,49 @@ def _get_db_config() -> dict:

    def run(self):
        try:
            # Run all data imports.
            self._run_imports()
            if self.mode == RunMode.SCHEMA_UPDATE:
```
If we require clients to call `init_or_update_tables()` (from the other comment), this mode will need to call it.

nit: For clarity, consider implementing it in a separate `_run_schema_update_mode()` to have a named method for it and use it as a container for future changes.
Opted not to add `_run_schema_update_mode` since it makes it harder to tell which operations are shared, and right now IMO there is not so much divergence that this method is hard to read. Can revisit in the future if there is more divergence!
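To make the trade-off concrete, here is a hypothetical sketch of the shared `run()` flow being discussed, where schema-update mode simply returns early after the table update. The `RunMode` values, `Runner` shape, and Db method names below are stand-ins, not the actual repo API:

```python
from enum import Enum


class RunMode(Enum):
    CUSTOM_DC = "customdc"
    SCHEMA_UPDATE = "schemaupdate"


class Runner:
    def __init__(self, db, mode: RunMode):
        self.db = db
        self.mode = mode
        self.steps = []  # records what ran, for illustration only

    def run(self):
        # Both modes create or migrate tables first.
        self.db.init_or_update_tables()
        self.steps.append("init_or_update_tables")
        if self.mode == RunMode.SCHEMA_UPDATE:
            # Schema update only: skip imports and all downstream steps.
            return
        self._run_imports()

    def _run_imports(self):
        self.steps.append("run_imports")


class FakeDb:
    def init_or_update_tables(self):
        pass


r = Runner(FakeDb(), RunMode.SCHEMA_UPDATE)
r.run()
print(r.steps)  # ['init_or_update_tables']
```

Keeping one `run()` makes the shared prefix of the two modes obvious; a separate `_run_schema_update_mode()` would trade that for a named entry point.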
If this is for ease of reviewing, OK if I restructure the commits within the PR instead? If it's for ease of bisecting issues if any come up, I can split the PR instead. Having dependent PRs is a headache with squash-merges, so I would prefer to keep this all in one PR if possible.
OK, let's keep this one as-is. For the future, let's try to break them up into smaller PRs. (And a personal note to me: do reviews more promptly!)
Thanks for the updates!
Thanks for the review – promise to break it up into smaller pieces next time!
Use the `DATA_RUN_MODE` environment variable to decide what mode to pass to run_stats.sh and whether to build embeddings. The mode `schemaupdate` for run_stats.sh is added by datacommonsorg/import#344, which this PR updates the import submodule to include. A docsite page will describe how to pass in this environment variable: datacommonsorg/docsite#527
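The branching described above could look roughly like the following; `resolve_run_config` is a hypothetical name for illustration, and the real logic lives in the cdc_data docker scripts:

```python
def resolve_run_config(env: dict) -> dict:
    # Hypothetical sketch of the DATA_RUN_MODE branching: the variable picks
    # the mode passed to run_stats.sh and decides whether embeddings build.
    # "customdc" as the default mode name is an assumption.
    mode = env.get("DATA_RUN_MODE", "customdc")
    return {
        "run_stats_mode": mode,
        # Embeddings are only rebuilt for a full data run, not a schema update.
        "build_embeddings": mode != "schemaupdate",
    }


print(resolve_run_config({"DATA_RUN_MODE": "schemaupdate"}))
# {'run_stats_mode': 'schemaupdate', 'build_embeddings': False}
```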
* update submodule for release (datacommonsorg#4681)
* update NL goldens after mixer push (datacommonsorg#4680)
* Adds logging for autocomplete responses. (datacommonsorg#4678)
  Logs the response count for autocompletion. Staging is not showing any responses. Would like to better understand where the breakdown is occurring.
* Exit cdc_services/run.sh when any background process exits (datacommonsorg#4682)
  This makes startup errors in Mixer or NL servers more obvious.
  Bug: b/374820494
  Reference: https://docs.docker.com/engine/containers/multi-service_container/#use-a-wrapper-script
* update nodejs goldens (datacommonsorg#4685)
  Goldens needed to be updated because of a bunch of recent data updates (data diffs can be seen here: datacommonsorg/mixer#1438, datacommonsorg/mixer#1439).
* Update submodules (datacommonsorg#4688)
* Pin transformers to 4.45.2 (datacommonsorg#4689)
  Also updates NL goldens.
* Support schema update mode for data docker (datacommonsorg#4686)
  Use the `DATA_RUN_MODE` environment variable to decide what mode to pass to run_stats.sh and whether to build embeddings. The mode `schemaupdate` for run_stats.sh is added by datacommonsorg/import#344, which this PR updates the import submodule to include. A docsite page will describe how to pass in this environment variable: datacommonsorg/docsite#527
* Improves typo recognition for autocomplete (datacommonsorg#4690)
  This PR modifies the scoring algorithm for place autocomplete to count a small score for non-exact matches, to account for one typo. With these changes, we will favor "San Diego" over "Dieppe" for the query "Sna Die".
  Prod: https://screenshot.googleplex.com/Bsx2BbyLZArbQuX
  Local with this change: https://screenshot.googleplex.com/9jHqKb2uHJLz37k
  Note that "Sne Die" will still go back to "Dieppe" because that's 2 typos, so San Diego is out even if it was returned by Google Maps predictions: https://screenshot.googleplex.com/9LViJoVFni3Lui6
  The typo check is done as a bag of letters, allowing at most one letter off. We do this check on top of the Google Maps predictions, which already take typo correction into account; this part just chooses the best prediction from Google Maps. Doing this as part of gaps identified in place autocomplete: https://docs.google.com/document/d/15RVckX9ck5eyyhBHW8Nb9lmxPBDPMIeLbax14HbN-GI/edit?tab=t.0
  Co-authored-by: chejennifer <[email protected]>
  Co-authored-by: Gabriel Mechali <[email protected]>
  Co-authored-by: natalie <[email protected]>
`schemaupdate` to both RunMode and the run_stats.sh mode arg. The cdc_data docker image can pass this in based on an environment variable: Support schema update mode for data docker website#4686