Design the class structure and plugin hook #3

simonw · 2022-02-10T06:13:19Z

Design the class structure that plugins will use to implement their own custom enrichments.

Originally posted by @simonw in #1 (comment)

simonw · 2022-02-10T18:15:14Z

Some enrichments are going to run entirely in-process - in which case the class itself will implement the code that gets run within Datasette to apply the enrichment.

Others are going to require an external partner via the API in #4.

So the class design should be able to handle both of these cases.

The external ones still need a class, because they need information about what the enrichment is called, how it should be described to the user and what settings (if any) the user can add to an enrichment run - things like the API key to use, and the input columns.
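A minimal sketch of how one class shape could cover both cases. The names `Enrichment`, `settings`, `UppercaseEnrichment` and `ExternalEmbeddings` are hypothetical and only illustrate the idea: every enrichment carries its metadata, and only the in-process ones implement the work.

```python
import asyncio


class Enrichment:
    """Hypothetical base class holding the metadata every enrichment needs."""

    name = ""
    description = ""
    # Names of the settings a user can supply for a run (assumed shape).
    settings = ()
    # True: Datasette runs the code itself. False: an external partner does.
    runs_in_process = True

    async def enrich_batch(self, rows, config):
        raise NotImplementedError("this enrichment does no in-process work")


class UppercaseEnrichment(Enrichment):
    name = "Uppercase"
    description = "Convert a text column to upper case"
    settings = ("column",)

    async def enrich_batch(self, rows, config):
        column = config["column"]
        for row in rows:
            row[column] = row[column].upper()
        return len(rows)


class ExternalEmbeddings(Enrichment):
    # Metadata only: the work itself would happen outside Datasette.
    name = "Embeddings (external)"
    description = "Embeddings calculated by an external worker"
    settings = ("api_key", "input_columns")
    runs_in_process = False


demo_rows = [{"title": "hello"}]
asyncio.run(UppercaseEnrichment().enrich_batch(demo_rows, {"column": "title"}))
print(demo_rows)
```

The external class still describes itself and its settings to the user, even though it never overrides `enrich_batch`.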

simonw · 2023-01-04T02:31:14Z

This API design needs to take async into account, since enrichments that call external HTTP APIs might want to do so using httpx in async mode.

simonw · 2023-01-04T02:33:33Z

I'm going to call the plugin hook register_enrichments because it's likely to end up in Datasette core eventually and I won't want to rename it.

It will look like register_routes() and register_facet_classes().

I think this:

@hookspec
def register_enrichments(datasette):
    """A list of Enrichment subclasses"""

simonw · 2023-01-04T16:03:43Z

Might be simpler if I require ALL enrichment implementations to use async def functions for the actual work that they do.
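One way such a rule could be enforced is at class-definition time. This is only a sketch, assuming a hypothetical `Enrichment` base class with an `enrich_batch` method; it uses `inspect.iscoroutinefunction` to reject subclasses whose method is a plain `def`.

```python
import inspect


class Enrichment:
    """Hypothetical base class that insists on async implementations."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        method = cls.__dict__.get("enrich_batch")
        # Only check methods the subclass defines itself.
        if method is not None and not inspect.iscoroutinefunction(method):
            raise TypeError(
                "{}.enrich_batch must be defined with async def".format(cls.__name__)
            )


class AsyncEnrichment(Enrichment):
    async def enrich_batch(self, rows):
        return len(rows)
```

Defining a subclass with a synchronous `enrich_batch` would raise `TypeError` as soon as the class body is evaluated, rather than failing later during a run.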

Based on the table structure in:

Design database schema #2 (comment)

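| id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |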

This class will have a method that gets called with a batch of rows and Does Stuff to them, then returns information that helps update the done_count column.
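A rough sketch of that loop, under the assumption that the batch method returns the number of rows it processed. The `job` dict stands in for a row of the table above, and `run_job` and `MarkEnriched` are hypothetical names.

```python
import asyncio


class MarkEnriched:
    """Toy enrichment: flags each row and reports how many it handled."""

    async def enrich_batch(self, rows):
        for row in rows:
            row["enriched"] = True
        return len(rows)


async def run_job(enrichment, rows, batch_size, job):
    # Walk the rows in batches, adding each batch's result to done_count.
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        job["done_count"] += await enrichment.enrich_batch(batch)
    return job


job = {"start_count": 5, "done_count": 0}
rows = [{"id": i} for i in range(5)]
asyncio.run(run_job(MarkEnriched(), rows, batch_size=2, job=job))
print(job)
```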

simonw · 2023-01-04T16:06:02Z

I'm going to try to implement this using datasette.client against the existing paginated table API, passing through the filters and next token. I'll use ?_shape=objects (soon to be the default) but only consider the rows and next fields.
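The pagination loop could be sketched like this. To keep it self-contained, `fetch_json` is a hypothetical async callable standing in for a `datasette.client` request followed by `.json()`, and `fake_fetch_json` serves two canned pages; only the `rows` and `next` keys of each page are read.

```python
import asyncio
from urllib.parse import parse_qs, urlencode, urlsplit


async def iterate_rows(fetch_json, table_path, filters):
    """Yield every row of a table, following the next token page by page."""
    next_token = None
    while True:
        params = dict(filters, _shape="objects")
        if next_token:
            params["_next"] = next_token
        page = await fetch_json("{}.json?{}".format(table_path, urlencode(params)))
        for row in page["rows"]:
            yield row
        next_token = page.get("next")
        if not next_token:
            break


# Canned pages keyed by the token that requests them (illustrative data).
PAGES = {
    None: {"rows": [{"id": 1}, {"id": 2}], "next": "abc"},
    "abc": {"rows": [{"id": 3}], "next": None},
}


async def fake_fetch_json(path):
    query = parse_qs(urlsplit(path).query)
    return PAGES[query.get("_next", [None])[0]]


async def collect(filters):
    return [row async for row in iterate_rows(fake_fetch_json, "/db/docs", filters)]


print(asyncio.run(collect({"lang__exact": "en"})))
```

Storing the most recent `next` token after each page is what would let an interrupted run resume where it stopped.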

simonw · 2023-01-04T16:07:53Z

Core class method is enrich_batch(db, rows).

Should db be a writable connection? No, I think it's a regular database object that the method calls write methods on.

simonw · 2023-01-04T16:09:26Z

Where does the code live that adds the embedding column if it doesn't exist yet? Probably in some kind of initialization method that runs once at the start of the run.
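A minimal sketch of such an initialization step, written against the standard library `sqlite3` module rather than Datasette's database API. The helper name `ensure_column` and the `docs` table are assumptions for illustration.

```python
import sqlite3


def ensure_column(conn, table, column, column_type="blob"):
    """Add the column if it is missing; return True if it was added."""
    # PRAGMA table_info returns one row per column; index 1 is the name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info([{}])".format(table))}
    if column in existing:
        return False
    conn.execute("ALTER TABLE [{}] ADD COLUMN [{}] {}".format(table, column, column_type))
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
print(ensure_column(conn, "docs", "embedding"))
print(ensure_column(conn, "docs", "embedding"))
```

Because the check runs before the `ALTER TABLE`, calling it at the start of every run is safe even when the column already exists.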

Need to think about how errors will work. They need to be recorded somewhere; ideally the run should continue.
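One possible shape for this, sketched with hypothetical names (`run_batches`, `flaky_enrich`): each failing batch is recorded in a list and the loop moves on, so one bad batch does not stop the run. In a real implementation the errors would presumably be written to a table rather than kept in memory.

```python
import asyncio


async def run_batches(enrich_batch, batches):
    """Process every batch, recording failures instead of stopping."""
    done_count = 0
    errors = []
    for index, batch in enumerate(batches):
        try:
            await enrich_batch(batch)
        except Exception as ex:
            # Record which batch failed and why, then carry on.
            errors.append({"batch": index, "error": repr(ex)})
            continue
        done_count += len(batch)
    return done_count, errors


async def flaky_enrich(rows):
    # Toy enrichment that rejects any batch containing a row without text.
    if any(row.get("text") is None for row in rows):
        raise ValueError("row has no text")


batches = [[{"text": "a"}], [{"text": None}], [{"text": "b"}, {"text": "c"}]]
print(asyncio.run(run_batches(flaky_enrich, batches)))
```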

Refs #1, #2, #3, #5, #6

simonw · 2023-11-06T05:27:32Z

Here's the class structure for my first working OpenAI embeddings prototype:

datasette-enrichments/datasette_enrichments/__init__.py

Lines 213 to 324 in 06b423b

    
    class Embeddings(Enrichment):
        name = "OpenAI Embeddings"
        slug = "openai-embeddings"
        batch_size = 100
        description = (
            "Calculate embeddings for text columns in a table. Embeddings are numerical representations which "
            "can be used to power semantic search and find related content."
        )
        runs_in_process = True
        cost_per_1000_tokens_in_100ths_cent = 1

        async def get_config_form(self, db, table):
            choices = [(col, col) for col in await db.table_columns(table)]
            # Default template uses all string columns
            default = " ".join("${{{}}}".format(col[0]) for col in choices)

            class ConfigForm(Form):
                template = TextAreaField(
                    "Template",
                    description="A template to run against each row to generate text to embed. Use ${column-name} for columns.",
                    default=default,
                )
                api_token = PasswordField(
                    "OpenAI API token",
                    validators=[DataRequired(message="The token is required.")],
                )
                # columns = MultiCheckboxField("Columns", choices=choices)

            return ConfigForm

        async def initialize(self, db, table, config):
            # Ensure table exists
            embeddings_table = "_embeddings_{}".format(table)
            if not await db.table_exists(embeddings_table):
                # Create it
                pk_names = await db.primary_keys(table)
                column_types = {
                    c.name: c.type for c in await db.table_column_details(table)
                }
                sql = ["create table [{}] (".format(embeddings_table)]
                create_bits = []
                for pk in pk_names:
                    create_bits.append("    [{}] {}".format(pk, column_types[pk]))
                create_bits.append("    _embedding blob")
                create_bits.append(
                    "    PRIMARY KEY ({})".format(
                        ", ".join("[{}]".format(pk) for pk in pk_names)
                    )
                )
                # If there's only one primary key, set up a foreign key constraint
                if len(pk_names) == 1:
                    create_bits.append(
                        "    FOREIGN KEY ([{}]) REFERENCES [{}] ({})".format(
                            pk_names[0], table, pk_names[0]
                        )
                    )
                sql.append(",\n".join(create_bits))
                sql.append(")")
                await db.execute_write("\n".join(sql))

        async def enrich_batch(
            self,
            db: Database,
            table: str,
            rows: List[dict],
            pks: List[str],
            config: dict,
            job_id: int,
        ):
            template = SpaceTemplate(config["template"][0])
            texts = [template.safe_substitute(row) for row in rows]
            token = config["api_token"][0]
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    "https://api.openai.com/v1/embeddings",
                    headers={
                        "Authorization": f"Bearer {token}",
                        "Content-Type": "application/json",
                    },
                    json={"input": texts, "model": "text-embedding-ada-002"},
                )
                json_data = response.json()
            results = json_data["data"]
            # Record the cost too
            # json_data['usage']
            # {'prompt_tokens': 16, 'total_tokens': 16}
            cost_per_token_in_100ths_cent = self.cost_per_1000_tokens_in_100ths_cent / 1000
            total_cost_in_100ths_of_cents = (
                json_data["usage"]["total_tokens"] * cost_per_token_in_100ths_cent
            )
            # Round up to the nearest integer
            total_cost_rounded_up = math.ceil(total_cost_in_100ths_of_cents)
            await self.increment_cost(db, job_id, total_cost_rounded_up)
            embeddings_table = "_embeddings_{}".format(table)
            # Write results to the table
            for row, result in zip(rows, results):
                vector = result["embedding"]
                embedding = struct.pack("f" * len(vector), *vector)
                await db.execute_write(
                    "insert or replace into [{embeddings_table}] ({pks}, _embedding) values ({pk_question_marks}, ?)".format(
                        embeddings_table=embeddings_table,
                        pks=", ".join("[{}]".format(pk) for pk in pks),
                        pk_question_marks=", ".join("?" for _ in pks),
                    ),
                    list(row[pk] for pk in pks) + [embedding],
                )
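Two details of `enrich_batch` are easier to follow in isolation: how the `_embedding` blob is packed, and how the per-batch cost is rounded. The sketch below mirrors the `struct.pack` call and the cost arithmetic from the prototype. The helper names (`encode_embedding`, `decode_embedding`, `batch_cost`) are hypothetical, and the decoder is an assumption about how a reader of the table would reverse the packing.

```python
import math
import struct

# Size in bytes of one packed "f" value on this platform.
FLOAT_SIZE = struct.calcsize("f")


def encode_embedding(vector):
    """Pack a list of floats the same way the prototype does."""
    return struct.pack("f" * len(vector), *vector)


def decode_embedding(blob):
    """Reverse encode_embedding: turn the blob back into a list of floats."""
    count = len(blob) // FLOAT_SIZE
    return list(struct.unpack("f" * count, blob))


def batch_cost(total_tokens, cost_per_1000_tokens_in_100ths_cent=1):
    """Cost of one batch in 100ths of a cent, rounded up to an integer."""
    cost_per_token = cost_per_1000_tokens_in_100ths_cent / 1000
    return math.ceil(total_tokens * cost_per_token)


blob = encode_embedding([0.5, -1.25, 2.0])
print(len(blob), decode_embedding(blob))
print(batch_cost(16))
```

Because the rounding happens per batch, the 16-token usage shown in the comment (0.016 of a unit) is recorded as 1, so many small batches are charged more in total than one large batch with the same token count. Values stored with the `"f"` format are single precision, so a decoded vector can differ slightly from the floats the API returned.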

simonw · 2023-11-06T05:37:20Z

Next step: wire up the plugin hook so it actually does something, and rewrite the Uppercase example to use the new WTForms mechanism.
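Wiring up `register_enrichments` might look roughly like the sketch below. It avoids the plugin machinery to stay self-contained: plain functions stand in for hook implementations (in a real plugin they would carry Datasette's `@hookimpl` decorator), and `collect_enrichments` is a hypothetical helper that flattens the lists each plugin returns.

```python
class UppercaseEnrichment:
    slug = "uppercase"


class ReverseEnrichment:
    slug = "reverse"


# Stand-ins for two plugins' register_enrichments implementations.
def plugin_a_register_enrichments(datasette):
    return [UppercaseEnrichment]


def plugin_b_register_enrichments(datasette):
    return [ReverseEnrichment]


def collect_enrichments(hook_implementations, datasette=None):
    """Flatten each plugin's list of classes into a slug -> class dict."""
    enrichments = {}
    for implementation in hook_implementations:
        for enrichment_class in implementation(datasette) or []:
            enrichments[enrichment_class.slug] = enrichment_class
    return enrichments


registry = collect_enrichments(
    [plugin_a_register_enrichments, plugin_b_register_enrichments]
)
print(sorted(registry))
```

Keying the registry by `slug` is an assumption based on the `slug` attribute in the embeddings prototype; it would let a URL or a stored job row refer back to the class that handles it.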

simonw · 2023-11-13T22:52:00Z

To help test this, I'm going to build a datasette-enrichments/example-enrichments folder full of examples, which in test mode and dev mode can be directly installed.

Refs #1, #2, #3

simonw added the research label Feb 10, 2022

simonw changed the title ~~The design class structure~~ Design the class structure Feb 10, 2022

simonw changed the title ~~Design the class structure~~ Design the class structure and plugin hook Jan 4, 2023

simonw added a commit that referenced this issue Nov 2, 2023

Initial prototype with Uppercase enrichment

21432c7

Refs #1, #2, #3, #5, #6

simonw added a commit that referenced this issue Nov 6, 2023

Working OpenAI embeddings prototype!

06b423b

Refs #1, #2, #3, #5, #6

simonw added this to the First alpha milestone Nov 13, 2023

simonw added a commit that referenced this issue Nov 14, 2023

Implement plugin hook and some examples, refs #3, #6

d6d25bd

simonw closed this as completed Nov 14, 2023

simonw added a commit that referenced this issue Nov 16, 2023

Release 0.1a0

42709c2

Refs #1, #2, #3

simonw mentioned this issue Nov 17, 2023

initialize() and enrich_batch() need access to datasette #16

Closed
