
Support a schema update mode in stats runner #344

Merged (6 commits) Oct 24, 2024

Conversation

@hqpho (Contributor) commented Oct 22, 2024

  • On Db object creation, continue to ensure the database exists and apply any schema updates. Pull these behaviors into named methods for clarity.
  • Move the clearing of indexes and tables in anticipation of an import out of the Db constructor and into an explicit, separate Db method. Preserve existing behavior for imports by having Runner call this method before importing.
  • Add a mode value `schemaupdate` to both `RunMode` and the run_stats.sh mode arg. The cdc_data docker image can pass this in based on an environment variable: Support schema update mode for data docker website#4686
  • Add test cases for both Db and Runner that start with an existing SQL database loaded from a file.
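A minimal sketch of how the new mode value could plug in. Only `RunMode`, `schemaupdate`, and the run_stats.sh mode arg come from this PR; the other mode name and the parsing helper are illustrative assumptions, not the repo's actual code:

```python
from enum import Enum


class RunMode(Enum):
  # Hypothetical existing mode; the real values live in the import repo.
  CUSTOM_DC = "customdc"
  # New mode added by this PR: ensure the database exists and apply
  # schema updates, without running any data imports.
  SCHEMA_UPDATE = "schemaupdate"


def parse_run_mode(arg: str) -> RunMode:
  """Illustrative helper mapping the run_stats.sh mode arg to a RunMode."""
  for mode in RunMode:
    if mode.value == arg:
      return mode
  raise ValueError(f"Unsupported mode: {arg}")
```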

@keyurva (Contributor) left a comment

The general direction of the PR is perfect!

Is it possible to break up this PR into 2 smaller PRs - one with changes primarily to the DB and a follow-up PR that introduces the new mode?

```python
_test_runner(self,
             "input_dir_driven_with_existing_old_schema_data",
             is_config_driven=False,
             input_db_file_name="sqlite_old_schema_populated.db")
```
Contributor:

With binary *.db files, the tests can feel opaque. Is it possible to have the tests take csv files as input and compare against expected csv files?

Contributor Author:

Sure, I can do that. Before I do, are you able to review the non-test portions of the PR so I can make the test updates all in one pass if necessary? That also lets me keep working across the time zone offset.

Contributor:

Yes. The overall implementation looks good, added a couple of nits.

Contributor Author:

Haven't addressed your other comments yet, but for now: how do the `*.sql` files for test db setup look? I opted for that over CSV since it's not just the table contents that matter but also the schema.

Contributor:

Oh - this totally works. It's text based and a truer representation of a DB vs CSVs.

Didn't know about the `iterdump()` utility. Seems like the exact thing we need here.
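For context, `sqlite3.Connection.iterdump()` is a Python standard-library utility that yields the SQL statements (schema plus rows) needed to reproduce a database, which is what makes a text-based `.sql` test fixture possible. A self-contained sketch; the function names, file layout, and table contents here are illustrative, not the PR's actual fixtures:

```python
import sqlite3


def dump_db_to_sql(db_path: str, sql_path: str) -> None:
  """Writes a text dump (schema + data) of a SQLite database file."""
  conn = sqlite3.connect(db_path)
  try:
    with open(sql_path, "w") as f:
      for statement in conn.iterdump():
        f.write(statement + "\n")
  finally:
    conn.close()


def load_db_from_sql(sql_path: str) -> sqlite3.Connection:
  """Recreates an in-memory database from a text dump."""
  conn = sqlite3.connect(":memory:")
  with open(sql_path) as f:
    conn.executescript(f.read())
  return conn
```

The round trip (dump to text, load from text) is what lets a test fixture live in version control as a reviewable file instead of an opaque binary `.db`.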

@@ -285,8 +291,13 @@ class SqlDb(Db):

```python
def __init__(self, config: dict) -> None:
  self.engine = create_db_engine(config)
  self.engine.init_or_update_tables()
```
Contributor:

Since `clear_tables_and_indexes_for_import()` is now being called explicitly by clients, I'm wondering if `init_or_update_tables()` should be as well?

Contributor Author:

Hmm, so I tried this, but it changed an unexpectedly large surface and seemed to make it easier to "hold it wrong" when creating any Db, even one that isn't a SqlDb. Instead, I've tried to make it very clear through naming and comments what is happening. WDYT?
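The design being described can be sketched as follows. Only `init_or_update_tables()` and `clear_tables_and_indexes_for_import()` appear in the diff; the engine stub and its methods are made up for illustration:

```python
class FakeEngine:
  """Stand-in for the real DB engine, for illustration only."""

  def __init__(self):
    self.schema_ready = False
    self.cleared = False

  def init_or_update_tables(self):
    self.schema_ready = True

  def drop_all(self):
    self.cleared = True


class SqlDb:
  def __init__(self, engine) -> None:
    self.engine = engine
    # Non-destructive, so it is safe to run on every startup:
    # creates missing tables and applies schema updates.
    self.engine.init_or_update_tables()

  def clear_tables_and_indexes_for_import(self) -> None:
    # Destructive, so callers (e.g. the Runner, just before an
    # import) must invoke it explicitly rather than getting it as
    # a side effect of constructing a Db.
    self.engine.drop_all()
```

The split keeps construction always safe while making the destructive step an explicit opt-in.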

@@ -137,35 +138,49 @@ def _get_db_config() -> dict:

```python
def run(self):
  try:
    # Run all data imports.
    self._run_imports()
    if self.mode == RunMode.SCHEMA_UPDATE:
```
Contributor:

If we require clients to call `init_or_update_tables()` (from the other comment), this mode will need to call it.

nit: For clarity, consider implementing it in a separate `_run_schema_update_mode()` to have a named method for it and to use it as a container for future changes.

@hqpho (Contributor Author) commented Oct 24, 2024

Opted not to add `_run_schema_update_mode` since it makes it harder to tell which operations are shared, and right now IMO there is not so much divergence that this method is hard to read. Can revisit in the future if there is more divergence!
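A rough sketch of the inline branching the author describes, using a stub Db. Everything here except the `schemaupdate` mode name and the clear-before-import call is illustrative, and the real `run()` may interleave these steps differently:

```python
class FakeDb:
  """Stub Db; constructing it is assumed to have applied schema updates."""

  def __init__(self):
    self.cleared = False

  def clear_tables_and_indexes_for_import(self):
    self.cleared = True


class Runner:
  def __init__(self, db, mode: str) -> None:
    self.db = db
    self.mode = mode
    self.steps = []  # records what ran, for illustration

  def run(self):
    if self.mode == "schemaupdate":
      # Creating the Db already ensured tables exist and applied any
      # schema updates, so this mode has nothing further to do.
      self.steps.append("schema_update")
      return
    # Normal import path keeps the existing behavior: clear old
    # state explicitly, then run the imports.
    self.db.clear_tables_and_indexes_for_import()
    self.steps.append("imports")
```

Keeping the branch inside one `run()` makes the shared versus mode-specific operations visible at a glance, which is the trade-off discussed above.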

@hqpho (Contributor Author) commented Oct 23, 2024

> The general direction of the PR is perfect!
>
> Is it possible to break up this PR into 2 smaller PRs - one with changes primarily to the DB and a follow-up PR that introduces the new mode?

If this is for ease of reviewing, is it OK if I restructure the commits within the PR instead? If it's for ease of bisecting issues if any come up, I can split the PR. Having dependent PRs is a headache with squash-merges, so I'd prefer to keep this all in one PR if possible.

@keyurva (Contributor) commented Oct 23, 2024

> so would prefer to keep this all in one PR if possible

OK. Let's keep this one as-is. For the future, let's try to break them up into smaller PRs.

(and a personal note to me: do reviews more promptly!)

@hqpho hqpho requested a review from keyurva October 24, 2024 21:11
@keyurva (Contributor) left a comment

Thanks for the updates!

@hqpho (Contributor Author) commented Oct 24, 2024

> Thanks for the updates!

Thanks for the review – promise to break it up into smaller pieces next time!

@hqpho hqpho merged commit 02db128 into datacommonsorg:master Oct 24, 2024
4 checks passed
@hqpho hqpho deleted the schemaUpdateMode branch October 24, 2024 21:21
hqpho added a commit to datacommonsorg/website that referenced this pull request Oct 25, 2024
Use the `DATA_RUN_MODE` environment variable to decide what mode to pass
to run_stats.sh and whether to build embeddings. The mode `schemaupdate`
for run_stats.sh is added by
datacommonsorg/import#344, which this PR updates
the import submodule to include.

A docsite page will describe how to pass in this environment variable:
datacommonsorg/docsite#527
hqpho added a commit to hqpho/dc-website that referenced this pull request Oct 29, 2024 (same commit message as above).

hqpho added a commit to hqpho/dc-website that referenced this pull request Oct 29, 2024: a batch merge that includes "Support schema update mode for data docker (datacommonsorg#4686)" along with other unrelated website commits.