Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++] Apply ArraySchemaEvolution to specified timestamp #2895

Merged
merged 3 commits into from
Aug 16, 2024

Conversation

johnkerl
Copy link
Member

@johnkerl johnkerl commented Aug 14, 2024

Issue and/or context: #2879

Changes: If we've got the array open at a timestamp, and we need to evolve it, evolve from that same timestamp.

Notes for Reviewer:

Material from @nguyenv to be incorporated into unit-test material:

Script:

$ cat ts.py
import tiledb
import tiledbsoma as soma
import pyarrow as pa
import tempfile
import numpy as np
import pandas as pd
import time

with tempfile.TemporaryDirectory() as uri:
    asch = pa.schema([("foo", pa.dictionary(pa.int8(), pa.large_string()))])
    soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()

    pydict = {}
    pydict["soma_joinid"] = [0, 1, 2]
    pydict["foo"] = pd.Series(['a', 'b', 'a'], dtype='category')
    rb = pa.Table.from_pydict(pydict)

    with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
        sdf.write(rb)

    pydict = {}
    pydict["soma_joinid"] = [3, 4]
    pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
    rb = pa.Table.from_pydict(pydict)

    with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
        sdf.write(rb)

    with soma.DataFrame.open(uri) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=1) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=2) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=3) as sdf:
        table = sdf.read().concat()
        print(table)

Before:

pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[]]
foo: [  -- dictionary:
[]  -- indices:
[]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2]]
foo: [  -- dictionary:
[]  -- indices:
[0,1,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
[]  -- indices:
[0,1,0,0,0]]

After:

pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[]]
foo: [  -- dictionary:
[]  -- indices:
[]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]

Difference:

23c23
< []  -- indices:
---
> ["a","b"]  -- indices:
31c31
< []  -- indices:
---
> ["a","b"]  -- indices:

@johnkerl johnkerl marked this pull request as draft August 14, 2024 18:14
Copy link

codecov bot commented Aug 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.98%. Comparing base (9f82dad) to head (d87cca3).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2895      +/-   ##
==========================================
+ Coverage   89.83%   89.98%   +0.15%     
==========================================
  Files          37       37              
  Lines        3926     3926              
==========================================
+ Hits         3527     3533       +6     
+ Misses        399      393       -6     
Flag Coverage Δ
python 89.98% <ø> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.98% <ø> (+0.15%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@eddelbuettel
Copy link
Contributor

Good. That should also help for R where we documented this first,

@johnkerl
Copy link
Member Author

There is something going on, since

["a","b"]  -- indices:
[0,1,0,0,0]]

should have been

["a","b"]  -- indices:
[0,1,0,1,1]]

since we wrote 'b', 'b' at slots 3,4

    pydict = {}
    pydict["soma_joinid"] = [3, 4]
    pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
    rb = pa.Table.from_pydict(pydict)

@johnkerl
Copy link
Member Author

I've verified that ☝️ is a separate bug. Running this on a built checkout without this PR:

ubuntu@segge[prod][kerl/h2h][TileDB-SOMA]$ cat ts.py
import tiledb
import tiledbsoma as soma
import pyarrow as pa
import tempfile
import numpy as np
import pandas as pd
import time

with tempfile.TemporaryDirectory() as uri:

    asch = pa.schema([
        ("foo", pa.dictionary(pa.int8(), pa.large_string()))
    ])

    # Create at time 1
    soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()

    # Write at time 2
    atbl = pa.Table.from_pydict({
        "soma_joinid": [0, 1, 2],
        "foo": pd.Series(['a', 'b', 'a'], dtype='category'),
    })
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
        sdf.write(atbl)

    # Write at time 3
    atbl = pa.Table.from_pydict({
        "soma_joinid": [3, 4],
        #"foo": pd.Series(['b', 'c'], dtype='category'),
        "foo": pd.Series(['b', 'b'], dtype='category'),
    })
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
        sdf.write(atbl)

    with soma.DataFrame.open(uri) as sdf:
        table = sdf.read().concat()
        print()
        print("TABLE AT NOW")
        print(table)
        print()
        print(table.to_pandas())
        print()

I get

TABLE AT NOW
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]

   soma_joinid foo
0            0   a
1            1   b
2            2   a
3            3   a <------ note
4            4   a <------ note

@johnkerl
Copy link
Member Author

Good. That should also help for R where we documented this first,

@eddelbuettel great! If you have a ilnk for that "where" handy, please share it here -- that will be helpful.

@johnkerl
Copy link
Member Author

Given #2896 and the proof above that the issue I identified is orthogonal to this PR (and pre-dates it), I think we should move forward with this in order to facilitate ongoing R-side development. And #2896 remains open.

I'll add a Python unit-test case as noted above.

@johnkerl johnkerl changed the title [c++] Apply ArraySchemaEvolution to specified timestamp [WIP] [c++] Apply ArraySchemaEvolution to specified timestamp Aug 14, 2024
@johnkerl johnkerl marked this pull request as ready for review August 14, 2024 20:06
@johnkerl
Copy link
Member Author

Good. That should also help for R where we documented this first,

@eddelbuettel great! If you have a ilnk for that "where" handy, please share it here -- that will be helpful.

Never mind -- I found #2879

@eddelbuettel
Copy link
Contributor

@johnkerl Here is my actual test script from last week (which is not yet committed to the 'scraps' repo I use for these). With the SOMA version I have here (a few days old) it just comes up empty (no 'strings' for the enumeration). Worth testing on your patch.

!#!/usr/bin/env Rscript                                                                                                                                                            
                                                                                                                                                                                  
uri <- "/tmp/tiledb/somadf-evolve-ts"                                                                                                                                             
if (dir.exists(uri)) unlink(uri, recursive=TRUE)                                                                                                                                  
                                                                                                                                                                                  
schema <- arrow::schema(arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),                                                                                            
                        arrow::field("int", arrow::int32(), nullable = FALSE),                                                                                                    
                        arrow::field("str", arrow::dictionary(index_type = arrow::int8(), value_type = arrow::utf8(), ordered = FALSE), nullable=FALSE))                          
print(schema)                                                                                                                                                                     
ts1 <- rep(as.POSIXct(1, tz="UTC"), 2)                                                                                                                                            
sdf <- tiledbsoma::SOMADataFrameCreate(uri, schema, tiledb_timestamp=ts1[1])                                                                                                      
data <- arrow::arrow_table(soma_joinid = bit64::as.integer64(1L:5L),                                                                                                              
                           int = 101:105L,                                                                                                                                        
                           str = factor(c('a','b','b','a','b')))                                                                                                                  
cat("\n")                                                                                                                                                                         
print(tibble::as_tibble(data))                                                                                                                                                    
ts2 <- rep(as.POSIXct(2, tz="UTC"), 2)                                                                                                                                            
sdf$write(data, ts2)                                                                                                                                                              
                                                                                                                                                                                  
data <- arrow::arrow_table(soma_joinid = bit64::as.integer64(6L:10L),                                                                                                             
                           int = 106:110L,                                                                                                                                        
                           str = factor(c('c','b','c','c','b')))                                                                                                                  
cat("\n")                                                                                                                                                                         
print(tibble::as_tibble(data))                                                                                                                                                    
ts3 <- rep(as.POSIXct(3, tz="UTC"), 2)                                                                                                                                            
sdf$write(data, ts3)                                                                                                                                                              
sdf$close()                                                                                                                                                                       
                                                                                                                                                                                  
cat("\n")                                                                                                                                                                         
tiledb::tiledb_array(uri, return_as="data.table")[]                                                                                                                               
                                                                                                                                                                                  
sdf <- tiledbsoma::SOMADataFrameOpen(uri,  tiledb_timestamp=ts3)                                                                                                                  
res <- sdf$read()$concat()                                                                                                                                                        
print(res)                                                                                                                                                                        
print(res$str)                                                                                                                                                                    
#tibble::as_tibble(res)                                                                                                                                                           
                           

@johnkerl
Copy link
Member Author

Worth testing on your patch.

@eddelbuettel before & after this PR I get the following when I run that:

Schema
soma_joinid: int64 not null
int: int32 not null
str: dictionary<values=string, indices=int8> not null
Error in self$open("WRITE", internal_use_only = "allowed_use") :
  tiledb_timestamp not yet supported for WRITE mode
Calls: <Anonymous> -> <Anonymous> -> <Anonymous> -> stopifnot
Execution halted

This PR should be merged to remove a known defect from the C++ implementation and we can continue from there.

@johnkerl johnkerl merged commit 7241fef into main Aug 16, 2024
15 of 16 checks passed
@johnkerl johnkerl deleted the kerl/schevo-stamp branch August 16, 2024 14:34
github-actions bot pushed a commit that referenced this pull request Aug 16, 2024
* [c++] Apply `ArraySchemaEvolution` to specified timestamp

* Add Python unit-test coverage

* run `make format`
johnkerl added a commit that referenced this pull request Aug 16, 2024
)

* [c++] Apply `ArraySchemaEvolution` to specified timestamp

* Add Python unit-test coverage

* run `make format`

Co-authored-by: John Kerl <[email protected]>
@eddelbuettel
Copy link
Contributor

eddelbuettel commented Aug 17, 2024

@johnkerl when you said above

[...] when I run that:

where you on main or on my branch adding this feature? For me, on the branch, freshly rebased to the current main, all is now good. I altered the test script to show a tibble as we now get a correct arrow dictionary and all is well:

[ ... ]
# A tibble: 5 × 3
  soma_joinid   int str  
        <int> <int> <fct>
1           6   106 c    
2           7   107 b    
3           8   108 c    
4           9   109 c    
5          10   110 b    
$ 

But that is on the de/sc-52516/timestamprange branch.

@johnkerl
Copy link
Member Author

Good to know -- awesome @eddelbuettel ! :)

@eddelbuettel
Copy link
Contributor

eddelbuettel commented Aug 17, 2024

Branch should hopefully be ready "Real Soon Now (TM)" too.

ryan-williams added a commit that referenced this pull request Aug 21, 2024
)"

This reverts commit 7241fef.

It seems to break cellxgene_census_builder/tests/test_builder.py.
ryan-williams added a commit that referenced this pull request Aug 23, 2024
)" (#2921)

This reverts commit 7241fef.

It seems to break cellxgene_census_builder/tests/test_builder.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants