Initial design for this plugin #1

Closed
simonw opened this issue Feb 10, 2022 · 6 comments

simonw commented Feb 10, 2022

This plugin will work by providing its own plugin hook that can be used to register "enrichments" - classes that can enrich data in some way, for example:

  • Geocoding text addresses and storing the resulting latitude/longitude
  • Reverse geocoding a latitude/longitude into a place description
  • Running OCR against a linked image
  • Generating a transcript of a linked audio recording
  • ... and much more

Each of these enrichments will itself be a plugin. The datasette-enrichments plugin will be responsible for tracking which enrichments are to run against which columns and tracking progress along the way.

Crucially, many enrichment implementations will be expected to run as separate processes - so this plugin will offer an API that external enrichment processes can use to ask "what do I need to do?" and to then record their results back to the Datasette instance.
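
A minimal sketch of that "what do I need to do?" loop, using an in-memory stand-in rather than a real HTTP API. Everything here (`FakeEnrichmentAPI`, `next_batch`, `record`, `run_worker`) is a hypothetical name invented for illustration; none of it is an existing Datasette endpoint or the plugin's eventual design.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEnrichmentAPI:
    """Hypothetical in-memory stand-in for the API (not a real Datasette endpoint)."""

    pending: list  # rows still waiting to be enriched
    results: dict = field(default_factory=dict)

    def next_batch(self, size):
        # "What do I need to do?" - hand out up to `size` pending rows.
        batch, self.pending = self.pending[:size], self.pending[size:]
        return batch

    def record(self, row_id, value):
        # Write one result back against the row it came from.
        self.results[row_id] = value


def run_worker(api, enrich, batch_size=2):
    """Keep asking for work until none is left, recording each result."""
    while True:
        batch = api.next_batch(batch_size)
        if not batch:
            break
        for row in batch:
            api.record(row["id"], enrich(row))


# Toy "enrichment": upper-case an address column.
api = FakeEnrichmentAPI(
    pending=[{"id": i, "address": f"{i} Main St"} for i in range(5)]
)
run_worker(api, enrich=lambda row: row["address"].upper())
```

An external process would swap the two methods for HTTP calls; the shape of the loop would stay the same.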

simonw commented Feb 10, 2022

It would be interesting if this mechanism could handle human-powered enrichments too - after all, saying "run OCR against everything in this column and write the discovered text back to this other column" isn't really any different from saying "ask a human being to type in the text from this image". They can work from the same APIs!

simonw commented Feb 10, 2022

The main things that need to be designed then are:

  • The database schema for how in-progress enrichments (and enrichment progress and results) are recorded
  • The class structure that plugins will use to implement their own custom enrichments
  • The JSON API that external enrichments will use to find out what they need to do and record their results
  • The user interface to allow Datasette users to kick off the enrichment process against tables and columns in their Datasette instance
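
As a starting point for the first item, here is a rough sketch of a table that tracks jobs and their progress, using the standard library `sqlite3` module. The table name `enrichment_jobs`, its columns and the helper functions are all assumptions made for illustration, not a settled schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per enrichment run against a table.
SCHEMA = """
CREATE TABLE enrichment_jobs (
    id INTEGER PRIMARY KEY,
    enrichment TEXT NOT NULL,      -- which enrichment plugin to run
    table_name TEXT NOT NULL,      -- table being enriched
    config TEXT NOT NULL,          -- JSON options chosen by the user
    status TEXT NOT NULL DEFAULT 'pending',
    row_count INTEGER NOT NULL,    -- rows to process in total
    done_count INTEGER NOT NULL DEFAULT 0
);
"""


def start_job(conn, enrichment, table_name, config, row_count):
    """Insert a new pending job and return its id."""
    cursor = conn.execute(
        "INSERT INTO enrichment_jobs (enrichment, table_name, config, row_count)"
        " VALUES (?, ?, ?, ?)",
        (enrichment, table_name, json.dumps(config), row_count),
    )
    return cursor.lastrowid


def record_progress(conn, job_id, n):
    """Add n processed rows and mark the job finished once all rows are done."""
    conn.execute(
        "UPDATE enrichment_jobs"
        " SET done_count = done_count + ?,"
        "     status = CASE WHEN done_count + ? >= row_count"
        "                   THEN 'finished' ELSE 'running' END"
        " WHERE id = ?",
        (n, n, job_id),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
job_id = start_job(conn, "geocode", "places", {"column": "address"}, row_count=3)
record_progress(conn, job_id, 2)
status_midway = conn.execute(
    "SELECT status FROM enrichment_jobs WHERE id = ?", (job_id,)
).fetchone()[0]
record_progress(conn, job_id, 1)
status_final, done_final = conn.execute(
    "SELECT status, done_count FROM enrichment_jobs WHERE id = ?", (job_id,)
).fetchone()
```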

simonw commented Nov 3, 2023

I'm inclined to say that enrichments that want to work in parallel should implement that themselves - so a job can only be worked on by a single worker, but that worker is welcome to grab a batch of 100 items at once and execute a massively parallel architecture of some sort to crunch through that batch as fast as possible.

Or grab 10 batches of 100 and process 1,000 items in parallel.

That way I can outsource managing that parallelism and keep the core mechanism in Datasette as simple as possible.
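
A sketch of what that could look like from the worker's side: the core only hands over a batch, and the enrichment decides how to fan it out. `process_batch` and `enrich_one` are hypothetical names, and a thread pool is just one possible choice of parallelism here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(items, enrich_one, max_workers=8):
    """Run enrich_one over a single batch in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `items`.
        return list(pool.map(enrich_one, items))


# One batch of 100 items, with squaring standing in for real enrichment work.
batch = list(range(100))
results = process_batch(batch, lambda n: n * n)
```

The core mechanism never needs to know `max_workers` exists, which is the point of pushing parallelism out to the enrichment.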

simonw added a commit that referenced this issue Nov 6, 2023

simonw commented Nov 6, 2023

The prototype now successfully handles an embedding run against OpenAI! It needs a bunch of tidying up but it's looking very promising.

Here's the table after the demo run completed:

[Screenshot: CleanShot 2023-11-05 at 21 22 41@2x]

Persisting the OpenAI API key like that is clearly not good.

I'm also not convinced I got the cost calculation right - I think rounding is throwing away too much information.

simonw commented Nov 6, 2023

I'm not sure which of these was that run:

[Screenshot: CleanShot 2023-11-05 at 21 24 15@2x]

$0.0001 / 1K tokens for 916,000 tokens is 9c so actually yeah I think I got it right, or at least close enough.
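
The same arithmetic as a quick check, using `Decimal` so nothing is lost to rounding until the final step. The price and token count are the figures quoted above; the variable names are just for illustration.

```python
from decimal import Decimal

PRICE_PER_1K_TOKENS = Decimal("0.0001")  # dollars per 1K tokens, as quoted above
tokens = 916_000

# Keep full precision here; only round when displaying.
cost_dollars = PRICE_PER_1K_TOKENS * tokens / 1000
cost_cents = cost_dollars * 100

print(f"${cost_dollars} is about {round(cost_cents)}c")
```

Storing the unrounded value and rounding only for display would avoid throwing information away on small runs.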

simonw commented Nov 6, 2023

Thoughts on the API token problem:

  • That token could be configured as a secret, at which point it's not needed here at all
  • The table can be hidden in the new _internal database inside Datasette, not exposed to users
  • Enrichment classes could have the option to run extra code at the end of their run - they could use that to delete any secrets from their configuration
  • They could also use a custom WTForm field which two-way-encrypts tokens such that the encrypted token is visible in the database but cannot be read
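
A minimal sketch of the third option, where an enrichment strips secrets from its stored configuration once the run has finished. It assumes the configuration is kept as a JSON object; `scrub_secrets` and the key names in `SECRET_KEYS` are hypothetical and not part of any existing API.

```python
import json

# Hypothetical names for config keys that should never be persisted.
SECRET_KEYS = frozenset({"api_key", "token"})


def scrub_secrets(config_json, secret_keys=SECRET_KEYS):
    """Return the stored JSON config with any secret keys removed."""
    config = json.loads(config_json)
    cleaned = {key: value for key, value in config.items() if key not in secret_keys}
    return json.dumps(cleaned)


# Placeholder values only; this is not a real key or model name.
stored = json.dumps({"model": "example-model", "api_key": "sk-not-a-real-key"})
scrubbed = scrub_secrets(stored)
```

This only helps after the run completes, so the key would still sit in the table while the job is in progress.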

simonw added this to the First alpha milestone Nov 13, 2023
simonw added a commit that referenced this issue Nov 16, 2023
simonw closed this as completed Nov 25, 2023