
Support a schema update mode in stats runner #344

Merged (6 commits) Oct 24, 2024

Conversation

@hqpho (Contributor) commented Oct 22, 2024

  • On Db object creation, continue to ensure the database exists and apply any schema updates. Pull these behaviors into named methods for clarity.
  • Move the clearing of indexes and tables in anticipation of an import out of the Db constructor and into an explicit, separate Db method. Preserve existing behavior for imports by having Runner call this method before importing.
  • Add a mode value `schemaupdate` to both `RunMode` and the run_stats.sh mode arg. The cdc_data docker image can pass this in based on an environment variable: Support schema update mode for data docker website#4686
  • Add test cases for both Db and Runner that start with an existing SQL database loaded from a file.
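A minimal sketch of how the new mode value could plug in. Only `RunMode`, `schemaupdate`, and the run_stats.sh mode arg come from this PR; the other mode name and the parsing helper are illustrative assumptions, not the repo's actual code:

```python
from enum import Enum


class RunMode(Enum):
  # Hypothetical existing mode; the real values live in the import repo.
  CUSTOM_DC = "customdc"
  # New mode added by this PR: ensure the database exists and apply
  # schema updates, without running any data imports.
  SCHEMA_UPDATE = "schemaupdate"


def parse_run_mode(arg: str) -> RunMode:
  """Illustrative helper mapping the run_stats.sh mode arg to a RunMode."""
  for mode in RunMode:
    if mode.value == arg:
      return mode
  raise ValueError(f"Unsupported mode: {arg}")
```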

@keyurva (Contributor) left a comment

The general direction of the PR is perfect!

Is it possible to break up this PR into 2 smaller PRs - one with changes primarily to the DB and a follow-up PR that introduces the new mode?

```python
_test_runner(self,
             "input_dir_driven_with_existing_old_schema_data",
             is_config_driven=False,
             input_db_file_name="sqlite_old_schema_populated.db")
```
Contributor:

With binary *.db files, the tests can feel opaque. Is it possible to have the tests take csv files as input and compare against expected csv files?

Contributor Author:

Sure, I can do that. Before I do, are you able to review the non-test portions of the PR so I can make the test updates all in one pass if necessary? That also lets me keep working across the time zone offset.

Contributor:

Yes. The overall implementation looks good, added a couple of nits.

Contributor Author:

Haven't addressed your other comments yet, but for now: how do the `*.sql` files for test db setup look? I opted for that over CSV since it's not just the table contents that matter but also the schema.

Contributor:

Oh - this totally works. It's text based and a truer representation of a DB vs CSVs.

Didn't know about the `iterdump()` utility. Seems like the exact thing we need here.
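For context, `sqlite3.Connection.iterdump()` is a Python standard-library utility that yields the SQL statements (schema plus rows) needed to reproduce a database, which is what makes a text-based `.sql` test fixture possible. A self-contained sketch; the function names, file layout, and table contents here are illustrative, not the PR's actual fixtures:

```python
import sqlite3


def dump_db_to_sql(db_path: str, sql_path: str) -> None:
  """Writes a text dump (schema + data) of a SQLite database file."""
  conn = sqlite3.connect(db_path)
  try:
    with open(sql_path, "w") as f:
      for statement in conn.iterdump():
        f.write(statement + "\n")
  finally:
    conn.close()


def load_db_from_sql(sql_path: str) -> sqlite3.Connection:
  """Recreates an in-memory database from a text dump."""
  conn = sqlite3.connect(":memory:")
  with open(sql_path) as f:
    conn.executescript(f.read())
  return conn
```

The round trip (dump to text, load from text) is what lets a test fixture live in version control as a reviewable file instead of an opaque binary `.db`.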

@@ -285,8 +291,13 @@ class SqlDb(Db):

```python
def __init__(self, config: dict) -> None:
  self.engine = create_db_engine(config)
  self.engine.init_or_update_tables()
```
Contributor:

Since `clear_tables_and_indexes_for_import()` is now being called explicitly by clients, I'm wondering if `init_or_update_tables()` should be as well?

Contributor Author:

Hmm, so I tried this, but it changed an unexpectedly large surface and seemed to make it easier to "hold it wrong" when creating any Db, even one that isn't a SqlDb. Instead, I've tried to make it very clear through naming and comments what is happening. WDYT?
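The design being described can be sketched as follows. Only `init_or_update_tables()` and `clear_tables_and_indexes_for_import()` appear in the diff; the engine stub and its methods are made up for illustration:

```python
class FakeEngine:
  """Stand-in for the real DB engine, for illustration only."""

  def __init__(self):
    self.schema_ready = False
    self.cleared = False

  def init_or_update_tables(self):
    self.schema_ready = True

  def drop_all(self):
    self.cleared = True


class SqlDb:
  def __init__(self, engine) -> None:
    self.engine = engine
    # Non-destructive, so it is safe to run on every startup:
    # creates missing tables and applies schema updates.
    self.engine.init_or_update_tables()

  def clear_tables_and_indexes_for_import(self) -> None:
    # Destructive, so callers (e.g. the Runner, just before an
    # import) must invoke it explicitly rather than getting it as
    # a side effect of constructing a Db.
    self.engine.drop_all()
```

The split keeps construction always safe while making the destructive step an explicit opt-in.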

@@ -137,35 +138,49 @@ def _get_db_config() -> dict:

```python
def run(self):
  try:
    # Run all data imports.
    self._run_imports()
    if self.mode == RunMode.SCHEMA_UPDATE:
```
Contributor:

If we require clients to call `init_or_update_tables()` (from the other comment), this mode will need to call it.

nit: For clarity, consider implementing it in a separate `_run_schema_update_mode()` to have a named method for it and to use it as a container for future changes.

@hqpho (Contributor Author) commented Oct 24, 2024

Opted not to add `_run_schema_update_mode` since it makes it harder to tell which operations are shared, and right now IMO there is not so much divergence that this method is hard to read. Can revisit in the future if there is more divergence!
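A rough sketch of the inline branching the author describes, using a stub Db. Everything here except the `schemaupdate` mode name and the clear-before-import call is illustrative, and the real `run()` may interleave these steps differently:

```python
class FakeDb:
  """Stub Db; constructing it is assumed to have applied schema updates."""

  def __init__(self):
    self.cleared = False

  def clear_tables_and_indexes_for_import(self):
    self.cleared = True


class Runner:
  def __init__(self, db, mode: str) -> None:
    self.db = db
    self.mode = mode
    self.steps = []  # records what ran, for illustration

  def run(self):
    if self.mode == "schemaupdate":
      # Creating the Db already ensured tables exist and applied any
      # schema updates, so this mode has nothing further to do.
      self.steps.append("schema_update")
      return
    # Normal import path keeps the existing behavior: clear old
    # state explicitly, then run the imports.
    self.db.clear_tables_and_indexes_for_import()
    self.steps.append("imports")
```

Keeping the branch inside one `run()` makes the shared versus mode-specific operations visible at a glance, which is the trade-off discussed above.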

@hqpho (Contributor Author) commented Oct 23, 2024

> The general direction of the PR is perfect!
>
> Is it possible to break up this PR into 2 smaller PRs - one with changes primarily to the DB and a follow-up PR that introduces the new mode?

If this is for ease of reviewing, is it OK if I restructure the commits within the PR instead? If it's for ease of bisecting issues if any come up, I can split the PR. Having dependent PRs is a headache with squash-merges, so I'd prefer to keep this all in one PR if possible.

@keyurva (Contributor) commented Oct 23, 2024

> so would prefer to keep this all in one PR if possible

OK. Let's keep this one as-is. For the future, let's try to break them up into smaller PRs.

(and a personal note to me: do reviews more promptly!)

@hqpho hqpho requested a review from keyurva October 24, 2024 21:11
@keyurva (Contributor) left a comment

Thanks for the updates!

@hqpho (Contributor Author) commented Oct 24, 2024

> Thanks for the updates!

Thanks for the review – promise to break it up into smaller pieces next time!

@hqpho hqpho merged commit 02db128 into datacommonsorg:master Oct 24, 2024
4 checks passed
@hqpho hqpho deleted the schemaUpdateMode branch October 24, 2024 21:21
hqpho added a commit to datacommonsorg/website that referenced this pull request Oct 25, 2024
Use the `DATA_RUN_MODE` environment variable to decide what mode to pass
to run_stats.sh and whether to build embeddings. The mode `schemaupdate`
for run_stats.sh is added by
datacommonsorg/import#344, which this PR updates
the import submodule to include.

A docsite page will describe how to pass in this environment variable:
datacommonsorg/docsite#527
hqpho added a commit to hqpho/dc-website that referenced this pull request Oct 29, 2024 (same commit message as above).

hqpho added a commit to hqpho/dc-website that referenced this pull request Oct 29, 2024: a batch merge that includes "Support schema update mode for data docker (datacommonsorg#4686)" along with other unrelated website commits.