Design the class structure and plugin hook #3

Closed
simonw opened this issue Feb 10, 2022 · 10 comments

@simonw
Collaborator

simonw commented Feb 10, 2022

  • The class structure that plugins will use to implement their own custom enrichments

Originally posted by @simonw in #1 (comment)

@simonw simonw changed the title The design class structure Design the class structure Feb 10, 2022
@simonw
Collaborator Author

simonw commented Feb 10, 2022

Some enrichments are going to run entirely in-process - in which case the class itself will implement the code that gets run within Datasette to apply the enrichment.

Others are going to require an external partner via the API in #4.

So the class design should be able to handle both of these cases.

The external ones still need a class, because they need information about what the enrichment is called, how it should be described to the user and what settings (if any) the user can add to an enrichment run - things like the API key to use, and the input columns.

@simonw
Collaborator Author

simonw commented Jan 4, 2023

This API design needs to take async into account, since enrichments that call external HTTP APIs might want to do so using httpx in async mode.

@simonw simonw changed the title Design the class structure Design the class structure and plugin hook Jan 4, 2023
@simonw
Collaborator Author

simonw commented Jan 4, 2023

I'm going to call the plugin hook register_enrichments because it's likely to end up in Datasette core eventually and I won't want to rename it.

It will look like register_routes() and register_facet_classes().

I think this:

@hookspec
def register_enrichments(datasette):
    """A list of Enrichment subclasses"""

@simonw
Collaborator Author

simonw commented Jan 4, 2023

It might be simpler to require ALL enrichment implementations to use async def functions for the actual work that they do.

Based on the table structure in:

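| id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
|----|------------|---------------|------------|---------|-------------|------------|------|--------------|----------|
| 1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |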

This class will have a method that gets called with a batch of rows and Does Stuff to them, then returns information that helps update the done_count column.
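A minimal sketch of that batch method, not the final API. The names `Enrichment`, `Uppercase`, `enrich_batch` and `run` are placeholders here, and a plain dict stands in for the job row in the table above:

```python
import asyncio
from typing import Dict, List


class Enrichment:
    """Hypothetical base class: subclasses do their work in an async method."""

    batch_size = 2

    async def enrich_batch(self, rows: List[dict]) -> int:
        """Process one batch and return how many rows were handled."""
        raise NotImplementedError


class Uppercase(Enrichment):
    async def enrich_batch(self, rows: List[dict]) -> int:
        for row in rows:
            row["name"] = row["name"].upper()
        return len(rows)


async def run(enrichment: Enrichment, rows: List[dict], job: Dict[str, int]) -> None:
    # Walk the rows in batches, adding each batch's result to done_count
    for start in range(0, len(rows), enrichment.batch_size):
        batch = rows[start : start + enrichment.batch_size]
        job["done_count"] += await enrichment.enrich_batch(batch)


rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
job = {"start_count": len(rows), "done_count": 0}
asyncio.run(run(Uppercase(), rows, job))
```

Returning a count from each batch keeps the bookkeeping in the runner rather than in every enrichment.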

@simonw
Collaborator Author

simonw commented Jan 4, 2023

I'm going to try to implement this using datasette.client against the existing paginated table API, passing through the filters and next token. I'll use ?_shape=objects (soon to be the default) but only consider the rows and next fields.
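Roughly how that pagination loop could look. This is a sketch under assumptions: `iter_batches` and `fetch_json` are hypothetical names, and `fake_fetch` is a stand-in for the table API so the example runs without Datasette:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Optional
from urllib.parse import urlencode


async def iter_batches(
    fetch_json: Callable[[str], Awaitable[dict]],
    table_path: str,
    filters: Optional[dict] = None,
    next_token: Optional[str] = None,
) -> AsyncIterator[List[dict]]:
    """Yield each page of rows, following the next token until it runs out."""
    while True:
        params = dict(filters or {}, _shape="objects")
        if next_token:
            params["_next"] = next_token
        data = await fetch_json("{}?{}".format(table_path, urlencode(params)))
        # Only the rows and next fields are used
        yield data["rows"]
        next_token = data.get("next")
        if not next_token:
            break


# Stand-in for the table API: two pages keyed by the _next token
async def fake_fetch(path: str) -> dict:
    if "_next=p2" in path:
        return {"rows": [{"id": 3}], "next": None}
    return {"rows": [{"id": 1}, {"id": 2}], "next": "p2"}


async def collect() -> List[List[dict]]:
    return [batch async for batch in iter_batches(fake_fetch, "/data/items.json")]


pages = asyncio.run(collect())
```

Inside Datasette, `fetch_json` would presumably wrap `datasette.client.get(path)` and return the response's `.json()`. Storing the latest `next` value on the job row would let an interrupted run resume from where it stopped.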

@simonw
Collaborator Author

simonw commented Jan 4, 2023

Core class method is enrich_batch(db, rows).

Should db be a writable connection? No, I think it's a regular database that the method calls write methods on.

@simonw
Collaborator Author

simonw commented Jan 4, 2023

Where does the code live that adds the embedding column if it doesn't exist yet? Probably in some kind of initialization method that runs once at the start of the run.

Need to think about how errors will work. They need to be recorded somewhere; ideally the run should continue.
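One possible shape for both ideas, sketched with the standard library's `sqlite3` rather than Datasette's database object. The `_enrichment_errors` table, its columns, and the function names are assumptions for illustration only:

```python
import sqlite3


def initialize(conn: sqlite3.Connection, table: str, column: str) -> None:
    """Run once at the start: add the output column and an error log table."""
    existing = [row[1] for row in conn.execute("pragma table_info([{}])".format(table))]
    if column not in existing:
        conn.execute("alter table [{}] add column [{}] text".format(table, column))
    conn.execute(
        "create table if not exists _enrichment_errors "
        "(job_id integer, row_pk text, error text)"
    )


def enrich_rows(conn, job_id, table, column, rows, fn) -> int:
    """Apply fn to each (pk, value) pair; record failures and keep going."""
    done = 0
    for pk, value in rows:
        try:
            result = fn(value)
        except Exception as ex:
            conn.execute(
                "insert into _enrichment_errors values (?, ?, ?)",
                (job_id, str(pk), str(ex)),
            )
            continue
        conn.execute(
            "update [{}] set [{}] = ? where id = ?".format(table, column),
            (result, pk),
        )
        done += 1
    return done


conn = sqlite3.connect(":memory:")
conn.execute("create table docs (id integer primary key, body text)")
conn.executemany(
    "insert into docs (id, body) values (?, ?)",
    [(1, "hello"), (2, None), (3, "world")],
)
initialize(conn, "docs", "upper_body")
rows = conn.execute("select id, body from docs order by id").fetchall()
# Row 2 has a null body, so .upper() fails there and is logged instead
done = enrich_rows(conn, 1, "docs", "upper_body", rows, lambda s: s.upper())
errors = conn.execute("select row_pk from _enrichment_errors").fetchall()
```

Because `initialize` checks before altering and uses `create table if not exists`, calling it again on a later run is harmless.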

simonw added a commit that referenced this issue Nov 2, 2023
simonw added a commit that referenced this issue Nov 6, 2023
@simonw
Collaborator Author

simonw commented Nov 6, 2023

Here's the class structure for my first working OpenAI embeddings prototype:

class Embeddings(Enrichment):
    name = "OpenAI Embeddings"
    slug = "openai-embeddings"
    batch_size = 100
    description = (
        "Calculate embeddings for text columns in a table. Embeddings are numerical representations which "
        "can be used to power semantic search and find related content."
    )
    runs_in_process = True

    cost_per_1000_tokens_in_100ths_cent = 1

    async def get_config_form(self, db, table):
        choices = [(col, col) for col in await db.table_columns(table)]

        # Default template uses all string columns
        default = " ".join("${{{}}}".format(col[0]) for col in choices)

        class ConfigForm(Form):
            template = TextAreaField(
                "Template",
                description="A template to run against each row to generate text to embed. Use ${column-name} for columns.",
                default=default,
            )
            api_token = PasswordField(
                "OpenAI API token",
                validators=[DataRequired(message="The token is required.")],
            )
            # columns = MultiCheckboxField("Columns", choices=choices)

        return ConfigForm

    async def initialize(self, db, table, config):
        # Ensure table exists
        embeddings_table = "_embeddings_{}".format(table)
        if not await db.table_exists(embeddings_table):
            # Create it
            pk_names = await db.primary_keys(table)
            column_types = {
                c.name: c.type for c in await db.table_column_details(table)
            }
            sql = ["create table [{}] (".format(embeddings_table)]
            create_bits = []
            for pk in pk_names:
                create_bits.append("    [{}] {}".format(pk, column_types[pk]))
            create_bits.append("    _embedding blob")
            create_bits.append(
                "    PRIMARY KEY ({})".format(
                    ", ".join("[{}]".format(pk) for pk in pk_names)
                )
            )
            # If there's only one primary key, set up a foreign key constraint
            if len(pk_names) == 1:
                create_bits.append(
                    "    FOREIGN KEY ([{}]) REFERENCES [{}] ({})".format(
                        pk_names[0], table, pk_names[0]
                    )
                )
            sql.append(",\n".join(create_bits))
            sql.append(")")
            await db.execute_write("\n".join(sql))

    async def enrich_batch(
        self,
        db: Database,
        table: str,
        rows: List[dict],
        pks: List[str],
        config: dict,
        job_id: int,
    ):
        template = SpaceTemplate(config["template"][0])
        texts = [template.safe_substitute(row) for row in rows]
        token = config["api_token"][0]

        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.openai.com/v1/embeddings",
                headers={
                    "Authorization": f"Bearer {token}",
                    "Content-Type": "application/json",
                },
                json={"input": texts, "model": "text-embedding-ada-002"},
            )
            json_data = response.json()

        results = json_data["data"]

        # Record the cost too
        # json_data['usage']
        # {'prompt_tokens': 16, 'total_tokens': 16}
        cost_per_token_in_100ths_cent = self.cost_per_1000_tokens_in_100ths_cent / 1000
        total_cost_in_100ths_of_cents = (
            json_data["usage"]["total_tokens"] * cost_per_token_in_100ths_cent
        )
        # Round up to the nearest integer
        total_cost_rounded_up = math.ceil(total_cost_in_100ths_of_cents)
        await self.increment_cost(db, job_id, total_cost_rounded_up)

        embeddings_table = "_embeddings_{}".format(table)
        # Write results to the table
        for row, result in zip(rows, results):
            vector = result["embedding"]
            embedding = struct.pack("f" * len(vector), *vector)
            await db.execute_write(
                "insert or replace into [{embeddings_table}] ({pks}, _embedding) values ({pk_question_marks}, ?)".format(
                    embeddings_table=embeddings_table,
                    pks=", ".join("[{}]".format(pk) for pk in pks),
                    pk_question_marks=", ".join("?" for _ in pks),
                ),
                list(row[pk] for pk in pks) + [embedding],
            )
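Two details of the prototype are easy to check in isolation: the blob encoding and the cost rounding. The sketch below uses made-up values (a three-element vector, and the 16-token usage figure from the comment in the code above) and assumes the platform's native float is 4 bytes, which is what `struct`'s `"f"` format gives on common platforms:

```python
import math
import struct

# Encode: same packing as the prototype, one 4-byte float per dimension
vector = [0.5, -1.25, 2.0]
blob = struct.pack("f" * len(vector), *vector)

# Decode: the dimension count is the blob length divided by 4
decoded = list(struct.unpack("f" * (len(blob) // 4), blob))

# Cost: 16 tokens at 1/100th of a cent per 1,000 tokens, rounded up
cost_per_token_in_100ths_cent = 1 / 1000
cost = math.ceil(16 * cost_per_token_in_100ths_cent)
```

The example values are exactly representable as 32-bit floats, so they survive the round trip unchanged; real embedding values will generally lose some precision when packed as `"f"`. Since `math.ceil` is applied per batch, every batch that uses any tokens records at least one unit of cost.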

@simonw
Collaborator Author

simonw commented Nov 6, 2023

Next step: wire up the plugin hook so it actually does something, and rewrite the Uppercase example to use the new WTForms mechanism.

@simonw simonw added this to the First alpha milestone Nov 13, 2023
@simonw
Collaborator Author

simonw commented Nov 13, 2023

To help test this, I'm going to build a datasette-enrichments/example-enrichments folder full of examples, which in test mode and dev mode can be directly installed.
