Document data docker schema update mode #527

Closed
wants to merge 4 commits into from

@hqpho hqpho commented Oct 22, 2024

This mode is added by datacommonsorg/website#4686. A subsequent PR to mixer will link the new docsite page directly from the schema check error message: datacommonsorg/mixer#1440

@hqpho hqpho changed the title Document the data docker schema update mode Document data docker schema update mode Oct 22, 2024
@hqpho hqpho requested review from keyurva and kmoscoe October 22, 2024 20:59

@keyurva keyurva left a comment

LGTM modulo the decision on the env var.


While starting Data Commons services, you may see an error that starts with `SQL schema check failed`. This means your database schema must be updated for compatibility with the latest Data Commons services.

You can update your database by running a data management job with the environment variable `SCHEMA_UPDATE_ONLY` set to `true`. This will alter your database without modifying already-imported data.
Contributor

May need to be updated based on what we decide in https://github.com/datacommonsorg/website/pull/4686/files#r1815646141

Contributor Author

Updated.

hqpho commented Oct 25, 2024

Will wait to commit this until the mode is available in the stable data container image.

hqpho added a commit to datacommonsorg/website that referenced this pull request Oct 25, 2024
Use the `DATA_RUN_MODE` environment variable to decide what mode to pass
to run_stats.sh and whether to build embeddings. The mode `schemaupdate`
for run_stats.sh is added by
datacommonsorg/import#344, which this PR updates
the import submodule to include.

A docsite page will describe how to pass in this environment variable:
datacommonsorg/docsite#527
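
The dispatch described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual contents of the data container's run script: the function name `resolve_run_mode`, the `normal` placeholder for the default mode, the output format, and the assumption that embeddings are skipped in `schemaupdate` mode are all hypothetical. Only the `DATA_RUN_MODE` variable and the `schemaupdate` value come from the commit message.

```shell
#!/bin/bash
# Hypothetical sketch: translate DATA_RUN_MODE into a mode for run_stats.sh
# and a decision on whether to build embeddings afterwards.
# "normal" is a placeholder for whatever the default mode is actually called.
resolve_run_mode() {
  local run_mode="${1:-}"
  case "$run_mode" in
    "")
      # Variable unset or empty: regular data load, then build embeddings.
      echo "stats_mode=normal build_embeddings=true"
      ;;
    schemaupdate)
      # Assumed behavior: only update the database schema, skip embeddings.
      echo "stats_mode=schemaupdate build_embeddings=false"
      ;;
    *)
      # Reject unrecognized values instead of silently running a full load.
      echo "unknown DATA_RUN_MODE: $run_mode" >&2
      return 1
      ;;
  esac
}
```

A wrapper script could call `resolve_run_mode "${DATA_RUN_MODE:-}"` and branch on the result. Under the same assumptions, the variable would be supplied through Docker's `-e` flag, for example `docker run -e DATA_RUN_MODE=schemaupdate <other options> <data image>`, where the remaining options and the image name are placeholders.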

@kmoscoe kmoscoe left a comment

I wouldn't actually include this as a separate page. It's too minor a detail. I would instead just put the text in the relevant existing sections. Can I send you an alternative PR as a proof of concept?

hqpho commented Oct 28, 2024

> I wouldn't actually include this as a separate page. It's too minor a detail. I would instead just put the text in the relevant existing sections. Can I send you an alternative PR as a proof of concept?

My thought was to have it on one page for ease of linking from an error message! Is that alright? Maybe we can remove this page from the sidenav if we don't want to feel like we're cluttering the docsite?

kmoscoe commented Oct 28, 2024

You could still link to a subsection using an anchor.

kmoscoe commented Oct 28, 2024

Another thing I don't really like is repeating in entirety the startup procedures. It makes more sense to integrate this info into the existing procedures. Let me send you a PR so you can see what I mean.

hqpho commented Oct 28, 2024

Kara, thanks for putting together an alternate approach! I understand the desire not to duplicate startup commands across multiple pages. That said, I worry that the other approach:

  • Adds cognitive overhead for people running a typical data load command, trying to copy/paste from a doc page and having to learn about a new param that is almost never relevant to them
  • Doesn't explain the mechanism by which the startup time is minimized, so it might come as a surprise to people what exactly the mode does (or rather doesn't) do

Another argument in favor of giving schema update mode its own page is that we may extend the mode variable to support additional modes in the future, in which case we can easily revise and extend that page.

@keyurva Do you want to weigh in as a decision tiebreaker here?

kmoscoe commented Oct 28, 2024 via email

keyurva commented Oct 28, 2024

Is there a middle ground where we don't have a separate page for it but are able to make it as self-sufficient as possible?

The typical workflow is likely going to be what Hannah described: a user starts the service, notices the schema failure with a link to the relevant section / page in the docsite and is primarily interested in resolving this failure in the quickest possible manner.

kmoscoe commented Oct 28, 2024 via email

hqpho added a commit to hqpho/dc-website that referenced this pull request Oct 29, 2024
hqpho added a commit to hqpho/dc-website that referenced this pull request Oct 29, 2024
* update submodule for release (datacommonsorg#4681)

* update NL goldens after mixer push (datacommonsorg#4680)

* Adds logging for autocomplete responses. (datacommonsorg#4678)

Logs the response count for autocompletion. Staging is not showing any
responses. Would like to better understand where the breakdown is
occurring.

* Exit cdc_services/run.sh when any background process exits (datacommonsorg#4682)

This makes startup errors in Mixer or NL servers more obvious.

Bug: b/374820494
Reference:
https://docs.docker.com/engine/containers/multi-service_container/#use-a-wrapper-script

* update nodejs goldens (datacommonsorg#4685)

goldens needed to be updated because of a bunch of recent data updates
(data diffs can be seen here:
datacommonsorg/mixer#1438,
datacommonsorg/mixer#1439)

* Update submodules (datacommonsorg#4688)

* Pin transformers to 4.45.2 (datacommonsorg#4689)

Also updates nl goldens

* Support schema update mode for data docker (datacommonsorg#4686)

Use the `DATA_RUN_MODE` environment variable to decide what mode to pass
to run_stats.sh and whether to build embeddings. The mode `schemaupdate`
for run_stats.sh is added by
datacommonsorg/import#344, which this PR updates
the import submodule to include.

A docsite page will describe how to pass in this environment variable:
datacommonsorg/docsite#527

* Improves Typo recognition for autocomplete (datacommonsorg#4690)

This PR modifies the scoring algorithm for place autocomplete to count a
small score for non-exact matches, to account for one typo.
With these changes, we will favor "San Diego" over "Dieppe" for the
query "Sna Die".
Prod: https://screenshot.googleplex.com/Bsx2BbyLZArbQuX
Local with this change:
https://screenshot.googleplex.com/9jHqKb2uHJLz37k

Note that "Sne Die" will still go back to "Dieppe" because that's 2
typos, so San Diego is out even if it was returned by google Maps
predictions: https://screenshot.googleplex.com/9LViJoVFni3Lui6

Typo check done as a bag of letters with at most off by one. We do this
check on top of the Google Maps Predictions which already take into
account typo correction. This part is just to choose the best prediction
from google maps.

Doing this as part of gaps identified in place autocomplete:
https://docs.google.com/document/d/15RVckX9ck5eyyhBHW8Nb9lmxPBDPMIeLbax14HbN-GI/edit?tab=t.0

---------

Co-authored-by: chejennifer <[email protected]>
Co-authored-by: Gabriel Mechali <[email protected]>
Co-authored-by: natalie <[email protected]>

kmoscoe commented Oct 29, 2024 via email

hqpho commented Oct 29, 2024

Subsumed by #530

@hqpho hqpho closed this Oct 29, 2024