
sqlite-utils extract could handle nested objects #239

Open
simonw opened this issue Feb 25, 2021 · 17 comments


simonw commented Feb 25, 2021

Imagine a table (imported from a nested JSON file) where one of the columns contains values that look like this:

{"email": "[email protected]", "id": "usrROSHARE0000000", "name": "Anonymous"}

The sqlite-utils extract command already uses single text values in a column to populate a new table. It would not be much of a stretch for it to be able to use JSON instead, including specifying which of those values should be used as the primary key in the new table.
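A minimal sketch of the proposed behavior, using plain sqlite3 rather than sqlite-utils itself (table and column names here are invented for illustration): parse the JSON column, populate a lookup table using the JSON's "id" key as the primary key, and leave a foreign key value behind in the original column:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reports (report TEXT, reported_by TEXT)")
db.execute(
    "INSERT INTO reports VALUES (?, ?)",
    (
        "Broken lamp",
        '{"email": "a@b.com", "id": "usrROSHARE0000000", "name": "Anonymous"}',
    ),
)

# Create a lookup table keyed on the JSON "id", then swap the original
# column's JSON blob for that foreign key value
db.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT, name TEXT)")
for rowid, value in db.execute(
    "SELECT rowid, reported_by FROM reports"
).fetchall():
    obj = json.loads(value)
    db.execute("INSERT OR IGNORE INTO users VALUES (:id, :email, :name)", obj)
    db.execute(
        "UPDATE reports SET reported_by = ? WHERE rowid = ?", (obj["id"], rowid)
    )

print(db.execute("SELECT * FROM users").fetchall())
```

After this, reports.reported_by holds "usrROSHARE0000000" and the users table holds the expanded object.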

simonw added the enhancement label Feb 25, 2021

simonw commented Feb 25, 2021

The Python .extract() method currently starts like this:

def extract(self, columns, table=None, fk_column=None, rename=None):
    rename = rename or {}
    if isinstance(columns, str):
        columns = [columns]
    if not set(columns).issubset(self.columns_dict.keys()):
        raise InvalidColumns(
            "Invalid columns {} for table with columns {}".format(
                columns, list(self.columns_dict.keys())
            )
        )
    ...

Note that it takes a list of columns (and treats a string as a single-item list). That's because it can be called with multiple columns, which it uses to populate a separate table of unique tuples of those columns' values.

So a new mechanism that can instead read JSON values from a single column needs to be compatible with that existing design.
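Here is a sketch of that existing unique-tuples design (plain sqlite3, illustrative table names) that the new JSON mechanism would need to stay compatible with:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trees (species TEXT, common_name TEXT, height REAL)")
db.executemany(
    "INSERT INTO trees VALUES (?, ?, ?)",
    [
        ("Quercus robur", "Oak", 20.0),
        ("Quercus robur", "Oak", 18.5),
        ("Pinus sylvestris", "Pine", 25.0),
    ],
)

# Unique tuples of the extracted columns become rows in the lookup table
db.execute(
    "CREATE TABLE species (id INTEGER PRIMARY KEY, species TEXT, common_name TEXT)"
)
db.execute(
    "INSERT INTO species (species, common_name) "
    "SELECT DISTINCT species, common_name FROM trees"
)
print(db.execute("SELECT * FROM species").fetchall())
```

Three input rows collapse to two lookup rows, one per unique (species, common_name) pair.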


simonw commented Feb 25, 2021

Likewise the sqlite-utils extract command takes one or more columns:

Usage: sqlite-utils extract [OPTIONS] PATH TABLE COLUMNS...

  Extract one or more columns into a separate table

Options:
  --table TEXT             Name of the other table to extract columns to
  --fk-column TEXT         Name of the foreign key column to add to the table
  --rename <TEXT TEXT>...  Rename this column in extracted table


simonw commented Feb 25, 2021

For the Python version I'd like to be able to provide a transformation callback function - which can be json.loads, but could also be anything else that accepts the value of the current column and returns a Python dictionary of columns and their values to use in the new table.
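The proposed contract is just "callable taking the column value, returning a dict". A short sketch showing that json.loads already satisfies it, alongside a hypothetical custom parser for a non-JSON format (both function names are invented):

```python
import json


def parse_reported_by(value):
    # json.loads already satisfies the contract for JSON object values
    return json.loads(value)


def parse_semicolon_pairs(value):
    # A non-JSON example of the same contract: "email=a@b.com;name=Anonymous"
    return dict(pair.split("=", 1) for pair in value.split(";"))


row_value = '{"email": "a@b.com", "id": "usr1", "name": "Anonymous"}'
print(parse_reported_by(row_value))
print(parse_semicolon_pairs("email=a@b.com;name=Anonymous"))
```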


simonw commented Feb 25, 2021

It would be OK if the CLI version only allows you to specify a single column if you are using the --json option.


simonw commented Feb 25, 2021

Maybe the Python version takes an optional dictionary mapping column names to transformation functions? It could then merge all of those results together - and maybe throw an error if the same key is produced by more than one column.

    db["Reports"].extract(["Reported by"], transform={"Reported by": json.loads})

Or it could have an option for different strategies if keys collide: first wins, last wins, throw exception, add a prefix to the new column name. That feels a bit too complex for an edge-case though.


simonw commented Feb 25, 2021

I'm going to go with last-wins - so if multiple transform functions return the same key, the last one will overwrite the others.
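In Python, last-wins falls out naturally from merging dicts in order. A sketch of the merge step (column names and data are illustrative):

```python
import json

# Hypothetical per-column transforms, applied in insertion order
transforms = {
    "author": json.loads,
    "editor": json.loads,
}
row = {
    "author": '{"name": "Alice", "role": "author"}',
    "editor": '{"name": "Bob", "role": "editor"}',
}

merged = {}
for column, fn in transforms.items():
    # Later columns win on key collisions
    merged.update(fn(row[column]))

print(merged)
```

Both transforms produce "name" and "role" keys; the editor column is applied last, so its values win.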


simonw commented Feb 25, 2021

Problem with calling this argument transform= is that the term "transform" already means something else in this library.

I could use convert= instead.

... but that doesn't instantly make me think of turning a value into multiple columns.

How about expand=? I've not used that term anywhere yet.

db["Reports"].extract(["Reported by"], expand={"Reported by": json.loads})

I think that works. You're expanding a single value into several columns of information.


simonw commented Feb 25, 2021

Here's the current implementation of .extract():

# Populate the lookup table with all of the extracted unique values
lookup_columns_definition = {
    (rename.get(col) or col): typ
    for col, typ in self.columns_dict.items()
    if col in columns
}
if lookup_table.exists():
    if not set(lookup_columns_definition.items()).issubset(
        lookup_table.columns_dict.items()
    ):
        raise InvalidColumns(
            "Lookup table {} already exists but does not have columns {}".format(
                table, lookup_columns_definition
            )
        )
else:
    lookup_table.create(
        {
            **{
                "id": int,
            },
            **lookup_columns_definition,
        },
        pk="id",
    )
lookup_columns = [(rename.get(col) or col) for col in columns]

Tricky detail here: I create the lookup table first, based on the types of the columns that are being extracted.

I need to do this because extraction currently uses unique tuples of values, so the table has to be created in advance.

But if I'm using these new expand functions to figure out what's going to be extracted, I don't know the names of the columns and their types in advance. I'm only going to find those out during the transformation.

This may turn out to be incompatible with how .extract() works at the moment. I may need a new method, .extract_expand() perhaps? It could be simpler - work only against a single column for example.

I can still use the existing sqlite-utils extract CLI command though, with a --json flag and a rule that you can't run it against multiple columns.
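One way around the chicken-and-egg problem is a two-pass approach: run the expand function over the rows first to discover the new column names and types, then create the lookup table. A sketch of the idea (plain sqlite3; a hypothetical design, not how a future .extract_expand() would necessarily work):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reports (reported_by TEXT)")
db.execute(
    "INSERT INTO reports VALUES (?)",
    ('{"email": "a@b.com", "id": "usr1", "name": "Anonymous"}',),
)

expand = json.loads
TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}

# Pass 1: transform every value to learn the column names and types
parsed = [expand(v) for (v,) in db.execute("SELECT reported_by FROM reports")]
columns = {}
for obj in parsed:
    for key, val in obj.items():
        columns[key] = TYPES.get(type(val), "TEXT")

# Pass 2: now the lookup table can be created and populated
defs = ", ".join("[{}] {}".format(k, t) for k, t in columns.items())
db.execute("CREATE TABLE users ({})".format(defs))
for obj in parsed:
    db.execute(
        "INSERT INTO users ({}) VALUES ({})".format(
            ", ".join(obj), ", ".join("?" * len(obj))
        ),
        list(obj.values()),
    )
print(db.execute("SELECT * FROM users").fetchall())
```

The cost is reading (or transforming) the column twice, or buffering all transformed values in memory as here.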


simonw commented Feb 25, 2021

WIP in a pull request.


simonw commented Feb 26, 2021

This came up in office hours!


simonw commented Feb 26, 2021

If there's no primary key in the JSON, this could use the hash_id mechanism.
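The core of the hash_id idea, sketched in simplified form (not sqlite-utils' exact algorithm): derive a stable primary key by hashing the row's content, so identical rows always get the same id.

```python
import hashlib
import json


def hash_id(record):
    # Sorting the keys makes the hash independent of key order in the JSON
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


a = {"email": "a@b.com", "name": "Anonymous"}
b = {"name": "Anonymous", "email": "a@b.com"}  # same data, different key order
print(hash_id(a))
print(hash_id(a) == hash_id(b))
```

This gives deduplication for free: re-inserting the same object produces the same primary key.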


simonw commented Feb 26, 2021

Could this handle lists of objects too? That would be pretty amazing - if the column has a [{...}, {...}] list in it, this could turn that into a many-to-many relationship.
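A hypothetical sketch of that many-to-many shape (plain sqlite3; the posts/tags table names are invented): each object in the list goes into a lookup table, with a junction table recording which rows referenced it.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT)")
db.execute(
    "INSERT INTO posts VALUES (1, ?)",
    ('[{"id": "t1", "name": "sqlite"}, {"id": "t2", "name": "json"}]',),
)

# Lookup table for the objects plus a junction table for the m2m links
db.execute("CREATE TABLE tags (id TEXT PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE posts_tags (post_id INTEGER, tag_id TEXT, "
    "PRIMARY KEY (post_id, tag_id))"
)
for post_id, value in db.execute("SELECT id, tags FROM posts").fetchall():
    for obj in json.loads(value):
        db.execute("INSERT OR IGNORE INTO tags VALUES (:id, :name)", obj)
        db.execute(
            "INSERT OR IGNORE INTO posts_tags VALUES (?, ?)", (post_id, obj["id"])
        )

print(db.execute("SELECT * FROM posts_tags").fetchall())
```

INSERT OR IGNORE means repeated objects across many rows collapse to one lookup row plus one junction row per referencing post.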


tmaier commented Nov 3, 2021

I am super interested in this feature.

After reading the other issues you referenced, I think the right way would be to use the current extract feature and then use sqlite-utils convert to expand the JSON object into individual columns.


tmaier commented Nov 3, 2021

I just wonder how I would parse the JSON value within such a lambda...

My naive approach would have been $ sqlite-utils convert demo.db statuses statuses 'return value' --multi
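The missing piece in that naive approach is parsing: 'return value' hands back the raw string, whereas --multi wants a dict so each key can become its own column. A sketch of the conversion function's job (a simplified model of the --multi contract, not the actual sqlite-utils internals; on the CLI this would likely be something like 'json.loads(value)' together with convert's module-import option):

```python
import json


def convert_multi(value):
    # Returning a dict signals "spread these keys into separate columns";
    # json.loads does exactly that for a JSON object value
    return json.loads(value)


value = '{"id": 1, "text": "hello", "user": "alice"}'
new_columns = convert_multi(value)
print(new_columns)
```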


hubgit commented Sep 3, 2022

I was looking for something like this today, for extracting columns containing objects (and arrays of objects) into separate tables.

Would it make sense (especially for the fields containing arrays of objects) to create a one-to-many relationship, where each row of the newly created table would contain the id of the row that originally contained it?

If the extracted objects have a unique id and are repeated, it could even create a many-to-many relationship, with a third table for the joins.


simonw commented Sep 3, 2022

Yeah, having a version of this that can set up m2m relationships would definitely be interesting.


gorm commented Mar 1, 2024

Could the extract CLI command also accept --flatten? E.g. if you have a structure like this:

{"batch_id": 1,"jobs": [{ "job_id": 1, "name": "job 1"}, {"job_id": 2, "name": "job 2" }]}

Then after a sqlite-utils insert, the batch table will contain a jobs column of type text with:

[{ "job_id": 1, "name": "job 1"}, {"job_id": 2, "name": "job 2"}]

Running extract on this text field will create a new table, but its content will still be of type text. With extract --flatten it could instead populate a one-to-many relationship and create job_id and name columns on the new table. If it's a nested object, it could populate a one-to-one relationship with the created columns.

I came across this great tool today and tested it, but didn't find a way to handle nested objects or arrays when inserting from JSON. But maybe there are some other ways to achieve this with the convert command?
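A hypothetical sketch of the suggested --flatten behavior on gorm's example (plain sqlite3, not an existing sqlite-utils feature): each element of the jobs array becomes a row in a jobs table with real columns and a batch_id foreign key back to the parent row.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE batch (batch_id INTEGER PRIMARY KEY, jobs TEXT)")
db.execute(
    "INSERT INTO batch VALUES (1, ?)",
    ('[{"job_id": 1, "name": "job 1"}, {"job_id": 2, "name": "job 2"}]',),
)

# One-to-many: real columns in the child table, foreign key to the parent
db.execute(
    "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, name TEXT, "
    "batch_id INTEGER REFERENCES batch(batch_id))"
)
for batch_id, value in db.execute("SELECT batch_id, jobs FROM batch").fetchall():
    for job in json.loads(value):
        db.execute(
            "INSERT INTO jobs VALUES (?, ?, ?)",
            (job["job_id"], job["name"], batch_id),
        )
db.execute("UPDATE batch SET jobs = NULL")  # raw JSON no longer needed

print(db.execute("SELECT * FROM jobs").fetchall())
```

A nested (non-list) object would be the degenerate case of this: exactly one child row per parent, i.e. the one-to-one relationship described above.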

4 participants