display ID registration prep, formatter registration fixes, index updates, dtype handlers, expanded data generators (#16)

* add random dataframe generator for convenience

* add logging for renderables

* more callout options; set display ID

* update black/flake8 configs

* version bump; add pyarrow

* register display ID; update black configs

* caffeine fever dream that needs a lot of cleanup later once prototyping is done

* add success!

* logging/callout updates

* callout update

* flatten multiindex series better; pass metadata in ipydisplay

* add settings.LOG_LEVEL; fix stringify columns

* version bump; add helper for generating random dataframe

* rename "default" display mode to "plain"

* add More Info user help section after truncating dataframe

* ensure temporary dx.display calls revert properly

* hotfix

* remove old code

* remove cell_id tracking

* add poetry.toml

* add loggers

* fix bug with .sample() resetting index

* cleanup

* don't mess with the original dataframe object during registration

* remove structlog; update loggers

* SQL instead of parquet/pyarrow

* more attempts at fixing column dtype wrangling

* move filtering out of main formatter

* check for dataframe subset associations before display ID registration

* push-down filter override for max display rows

* fix set_option reference

* add comms config

* add geopandas as extra install

* move utils out of formatters dir

* handle circular imports; add subset filter tracking; extra logging

* separate sampling from utils

* remove column handling from sampling

* add convenience function for adding renderable

* fix reference for testing

* handle missing key

* fix ref

* disable re-rendering for updates; pass applied filters to frontend

* pull filters from comm msg

* fix subset filter tracking; update logging

* update to default random sampling

* allow passing ipython shell for registering/getting display IDs

* pass new display id into sampling if needed

* update applied_filters assignment

* add settings_context; switch renderables to Set instead of List

* move sampling tests to their own file

* add renderables test

* remove truncating/sampling tests, fix custom index test

* assign unique name for unassigned variable tracking

* check setting before index resets

* flake8

* adjust pandas options for display/schema changes

* add structlog back in

* pull display ID registration out of get_display_id(); update logging

* update display mode with settings_context; add log level changes

* pass index flag to dataframe_info metadata; update logging

* updates for logging

* use settings_context; don't display callout with unassigned dataframe; parse df_name for .query()

* use update_display for user query

* remove hyphens from df uuid to enable sqlite filtering

* add top margin

* update logging; handle display ID register after rendering; update user query callout

* fix index and column stringification for multiindexing

* more docstrings

* clean up settings

* add pandas option transfer on row/column validation

* add flatten_index; handle index/column flattening behind settings

* add docstring

* don't reset multiindex level names

* more explicit multiindex handling

* add media type prefixes back to settings so env vars don't overwrite them all

* check for custom index before normalizing

* ugh multiindex.

* remove import

* fix log message

* update .gitignore

* break apart utils/helpers into more readable structure to handle new datatype generation, testing, and cleaning

* remove helpers

* remove function imports

* store datetime string format

* move geopandas check out of config

* add faker as extra

* remove comment

* handle flattening/expanding lists/sets/tuples

* handle extra dtype cleaning

* fix for mixed dtypes

* updates for dtype generation and testing

* change log time format

* check default index earlier; don't generate hash over and over

* add ENABLE_DATALINK setting to toggle off all the tracking/hashing/etc

* remove get_applied_filters

* separate cleaning functions between build_table_schema/hash_pandas_object/store_in_sqlite

* fix cleaning; remove extra hashing calls; update docstrings

* add geopandas and faker dev dependencies for dtype testing; add isort dev dep

* make sure we can toggle datalink setting on/off without errors

* generate display ID if not passed (datalink enabled)

* clean columns before rendering with datalink disabled

* update random_dataframe columns and testing

* verbose unit testing in github workflow

* handle no args, test for default data

* enable datalink setting by default

* use settings context

* trigger html.table_schema pandas changes on settings changes

* remove config.py

* refactor display formatter registration

* remove configs

* remove configs

* fix settings tests

* fix registering tests

* refactor dx media type formatter registration; remove configs

* updates for testing to remove redundant mediatype nesting

* more debug logging

* remove flatten_index and fix index/column flattening logic

* comms config behind datalink setting

* bump up log level, disable datalink, be done with this PR

* ugh patch this when datalink is disabled

* disable logging auto-config so it doesn't start showing other loggers

* fix log message

* don't update other loggers levels to INFO

* turn logging back on

* changelog

* this needs more work with the new display formatter registration
shouples authored Aug 21, 2022
1 parent dc9c89f commit df7cc80
Showing 32 changed files with 2,439 additions and 545 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/unit-tests.yaml
@@ -28,4 +28,4 @@ jobs:
           poetry install
       - name: Pytest - Unit tests
         run: |
-          poetry run pytest dx/tests -x
+          poetry run pytest dx/tests -xv
2 changes: 2 additions & 0 deletions .gitignore
@@ -4,4 +4,6 @@ __pycache__/
dist/

.pytest_cache
.python-versions
.venv
.vscode
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,33 @@
All notable changes will be documented here.

---
## Unreleased
_2022-08-21_

### Added
- `pd.Series` as a default renderable type (to go with the existing `pd.DataFrame` and `np.ndarray` types)
- Support for the following data types inside `pd.DataFrame` columns:
  - `type` and `np.dtype`
  - `shapely.geometry` objects
  - `pd.Timedelta` and `datetime.timedelta`
  - `pd.Period`
  - `pd.Interval`
  - `complex` numbers
  - `ipaddress.IPv4Address` and `.IPv6Address`
- Extra dataset generation functions for development/testing under `dx.utils.datatypes`
  - `dx.quick_random_dataframe(n_rows, n_columns)` to get a `pd.DataFrame` of 0.0-1.0 floats (convenience wrapper for `pd.DataFrame(np.random.rand(n_rows, n_columns))`)
  - `dx.random_dataframe()` with different boolean values to enable based on available datatypes (`dx.DX_DATATYPES`)
- `settings_context` context manager to allow temporarily changing a setting (or multiple)
- Logging via `structlog` (default level: `logging.WARNING`)

### Changed
- Default sampling method changed from `outer` to `random`

### Fixed
- Displaying a dataframe with an out-of-order index (like with `.sample()`) no longer resets the index before sending data to the frontend.
- Index/column flattening and string formatting are now behind settings and are handled more explicitly
- `dx` should no longer interfere with other media type / mime bundles (e.g. matplotlib) formatted by the existing IPython display formatter

## `1.1.3`
_2022-08-05_
### Added
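The changelog entry above lists several column value types that are not JSON-native. How `dx` cleans them is not shown in this excerpt; the following is only a minimal standard-library sketch of the general idea, and `to_jsonable` is a hypothetical helper name, not a `dx` function.

```python
import datetime
import ipaddress
import json


def to_jsonable(value):
    """Convert a few non-JSON-native values to plain types (illustrative only)."""
    if isinstance(value, type):
        # e.g. the class `int` becomes the string "int"
        return value.__name__
    if isinstance(value, complex):
        return str(value)
    if isinstance(value, datetime.timedelta):
        return value.total_seconds()
    if isinstance(value, (ipaddress.IPv4Address, ipaddress.IPv6Address)):
        return str(value)
    return value


row = {
    "kind": int,
    "z": complex(1, 2),
    "elapsed": datetime.timedelta(minutes=1, seconds=30),
    "host": ipaddress.ip_address("192.168.0.1"),
}
payload = json.dumps({key: to_jsonable(val) for key, val in row.items()})
```

Without a conversion step like this, `json.dumps(row)` would raise `TypeError` on the first unsupported value.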
8 changes: 5 additions & 3 deletions dx/__init__.py
@@ -1,9 +1,11 @@
-from .config import *
+from .comms import *
 from .dx import *
 from .formatters import *
-from .helpers import *
+from .loggers import *
 from .settings import *
+from .utils import *

-__version__ = "1.1.3"
+__version__ = "1.2.0"

+configure_logging()
 set_display_mode("simple")
37 changes: 37 additions & 0 deletions dx/comms.py
@@ -0,0 +1,37 @@
import structlog
from IPython import get_ipython

from dx.settings import get_settings

settings = get_settings()
logger = structlog.get_logger(__name__)


# ref: https://jupyter-notebook.readthedocs.io/en/stable/comms.html#opening-a-comm-from-the-frontend
def target_func(comm, open_msg):
    @comm.on_msg
    def _recv(msg):
        from dx.filtering import update_display_id

        data = msg["content"]["data"]
        if "display_id" in data:
            update_display_id(
                display_id=data["display_id"],
                pandas_filter=data.get("pandas_filter"),
                sql_filter=data.get("sql_filter"),
                filters=data.get("filters"),
                output_variable_name=data.get("output_variable_name"),
                limit=data["limit"],
            )

    comm.send({"connected": True})


ipython_shell = get_ipython()
if (
    ipython_shell is not None
    and getattr(ipython_shell, "kernel", None)
    and settings.ENABLE_DATALINK
):
    COMM_MANAGER = ipython_shell.kernel.comm_manager
    COMM_MANAGER.register_target("datalink", target_func)
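One detail in the handler: `limit` is read with `data["limit"]` while the other optional fields use `data.get(...)`, so a message that carries a `display_id` but no `limit` would raise `KeyError` inside `_recv`. A minimal sketch of a more tolerant payload parser is below; `parse_datalink_message` is a hypothetical helper that is not part of `dx`, and the field names are simply copied from the handler.

```python
from typing import Any, Dict, Optional


def parse_datalink_message(msg: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build keyword arguments for update_display_id() from a comm message.

    Hypothetical helper: returns None when the message has no display_id,
    mirroring the `if "display_id" in data` guard in the handler.
    """
    data = msg.get("content", {}).get("data", {})
    if "display_id" not in data:
        return None
    return {
        "display_id": data["display_id"],
        "pandas_filter": data.get("pandas_filter"),
        "sql_filter": data.get("sql_filter"),
        "filters": data.get("filters"),
        "output_variable_name": data.get("output_variable_name"),
        # .get() leaves a missing limit as None instead of raising KeyError
        "limit": data.get("limit"),
    }


example = {
    "content": {
        "data": {"display_id": "abc123", "sql_filter": "SELECT * FROM {table_name}"}
    }
}
kwargs = parse_datalink_message(example)
```

Since `update_display_id` already declares `limit: Optional[int] = None`, passing `None` through would fall back to `settings.DISPLAY_MAX_ROWS`.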
33 changes: 0 additions & 33 deletions dx/config.py

This file was deleted.

8 changes: 3 additions & 5 deletions dx/dx.py
@@ -5,7 +5,7 @@
 from IPython.core.interactiveshell import InteractiveShell
 from IPython.display import display as ipydisplay

-from dx.settings import set_display_mode, settings
+from dx.settings import settings_context
 from dx.types import DXDisplayMode


@@ -29,11 +29,9 @@ def display(
             raise ValueError(f"Unsupported file type: `{path.suffix}`")

     df = pd.DataFrame(data)
+    with settings_context(display_mode=mode, ipython_shell=ipython_shell):
+        ipydisplay(df)

-    orig_mode = settings.DISPLAY_MODE.value
-    set_display_mode(mode, ipython_shell=ipython_shell)
-    ipydisplay(df)
-    set_display_mode(orig_mode, ipython_shell=ipython_shell)
     return


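The `dx/dx.py` change swaps a manual save/set/restore sequence for a context manager, which matches the "ensure temporary dx.display calls revert properly" item in the commit message. The actual `settings_context` implementation is not part of this excerpt; the sketch below shows the general pattern with a hypothetical `Settings` dataclass standing in for dx's settings object.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Settings:
    # Hypothetical stand-in for dx's settings object
    DISPLAY_MODE: str = "simple"
    DISPLAY_MAX_ROWS: int = 100


settings = Settings()


@contextmanager
def settings_context(**overrides):
    """Temporarily override settings and restore the originals on exit."""
    originals = {name: getattr(settings, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(settings, name, value)
        yield settings
    finally:
        # Runs even if the body raises, unlike a manual save/restore pair
        for name, value in originals.items():
            setattr(settings, name, value)


try:
    with settings_context(DISPLAY_MODE="enhanced"):
        inside = settings.DISPLAY_MODE
        raise RuntimeError("display failed")
except RuntimeError:
    pass
after = settings.DISPLAY_MODE
```

With the removed code, an exception raised by `ipydisplay(df)` would have skipped the second `set_display_mode` call and left the temporary mode in place.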
88 changes: 88 additions & 0 deletions dx/filtering.py
@@ -0,0 +1,88 @@
from typing import Optional

import pandas as pd
import structlog
from IPython.display import update_display

from dx.formatters.callouts import display_callout
from dx.settings import get_settings, settings_context
from dx.utils.formatting import expand_sequences
from dx.utils.tracking import (
    DATAFRAME_HASH_TO_VAR_NAME,
    DISPLAY_ID_TO_DATAFRAME_HASH,
    SUBSET_TO_DATAFRAME_HASH,
    generate_df_hash,
)

logger = structlog.get_logger(__name__)

settings = get_settings()


SUBSET_FILTERS = {}


def update_display_id(
    display_id: str,
    sql_filter: str,
    pandas_filter: Optional[str] = None,
    filters: Optional[dict] = None,
    output_variable_name: Optional[str] = None,
    limit: Optional[int] = None,
) -> None:
    """
    Filters the dataframe in the cell with the given display_id.
    """
    from dx.utils.tracking import sql_engine

    global SUBSET_FILTERS

    row_limit = limit or settings.DISPLAY_MAX_ROWS
    df_hash = DISPLAY_ID_TO_DATAFRAME_HASH[display_id]
    df_name = DATAFRAME_HASH_TO_VAR_NAME[df_hash]
    table_name = f"{df_name}__{df_hash}"

    query_string = sql_filter.format(table_name=table_name)
    logger.debug(f"sql query string: {query_string}")
    new_df = pd.read_sql(query_string, sql_engine)

    # in the event there were nested values stored,
    # try to expand them back to their original datatypes
    for col in new_df.columns:
        new_df[col] = new_df[col].apply(expand_sequences)

    # this is associating the subset with the original dataframe,
    # which will be checked when the DisplayFormatter.format() is called
    # during update_display(), which will prevent re-registering the display ID to the subset
    new_df_hash = generate_df_hash(new_df)

    # store filters to be passed through metadata to the frontend
    logger.debug(f"applying {filters=}")
    filters = filters or []
    SUBSET_FILTERS[new_df_hash] = filters

    logger.debug(f"assigning subset {new_df_hash} to parent {df_hash=}")
    SUBSET_TO_DATAFRAME_HASH[new_df_hash] = df_hash

    # allow temporary override of the display limit
    with settings_context(DISPLAY_MAX_ROWS=row_limit):
        logger.debug(f"updating {display_id=} with {min(row_limit, len(new_df))}-row resample")
        update_display(new_df, display_id=display_id)

    # we can't reference a variable type to suggest to users to perform a `df.query()`
    # type operation since it was never declared in the first place
    if not df_name.startswith("unk_dataframe_"):
        # TODO: replace with custom callout media type
        output_variable_name = output_variable_name or "new_df"
        filter_code = f"""{output_variable_name} = {df_name}.query("{pandas_filter.format(df_name=df_name)}", engine="python")"""
        filter_msg = f"""Copy the following snippet into a cell below to save this subset to a new dataframe:
        <pre style="background-color:white; padding:0.5rem; border-radius:5px;">{filter_code}</pre>
        """
        display_callout(
            filter_msg,
            header=False,
            icon="info",
            level="success",
            display_id=display_id + "-primary",
            update=True,
        )
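`update_display_id` fills a `{table_name}` placeholder in the frontend's SQL template with `f"{df_name}__{df_hash}"`, and the commit message notes that hyphens were removed from the dataframe uuid "to enable sqlite filtering". The sketch below reproduces that round trip with only the standard-library `sqlite3` module. The table contents, the use of `uuid4().hex` as the hash, and the example filter are all assumptions for illustration; `dx` itself reads the result back through `pd.read_sql` and its own `sql_engine`.

```python
import sqlite3
import uuid

# Hypothetical values: in dx these come from the tracking module
df_name = "df"
df_hash = uuid.uuid4().hex  # .hex contains no hyphens
table_name = f"{df_name}__{df_hash}"

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {table_name} (id INTEGER, score REAL)")
conn.executemany(
    f"INSERT INTO {table_name} VALUES (?, ?)",
    [(1, 0.2), (2, 0.7), (3, 0.9)],
)

# The frontend sends a template; the kernel fills in the table name
sql_filter = "SELECT * FROM {table_name} WHERE score > 0.5 ORDER BY id"
rows = conn.execute(sql_filter.format(table_name=table_name)).fetchall()

# An unquoted identifier containing hyphens is a syntax error in SQLite
hyphenated = f"{df_name}__{uuid.uuid4()}"
try:
    conn.execute(f"CREATE TABLE {hyphenated} (id INTEGER)")
    hyphen_ok = True
except sqlite3.OperationalError:
    hyphen_ok = False
```

Because the table name is interpolated with `str.format` rather than bound as a parameter, keeping it to letters, digits, and underscores avoids the need for identifier quoting.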
22 changes: 18 additions & 4 deletions dx/formatters/callouts.py
@@ -2,7 +2,7 @@
 import uuid
 from typing import Optional

-from IPython.display import HTML, display
+from IPython.display import HTML, display, update_display
 from pydantic import BaseModel


@@ -37,11 +37,18 @@ def html(self):
             callout_classes.append(f"bp3-icon-{self.icon.value}-sign")
         callout_class_str = " ".join(callout_classes)

+        msg = self.message
         if self.use_header:
             heading_html = f"<h6 class='bp3-heading'>{self.level.value.title()}</h6>"
-            return f"""<div class="{callout_class_str}" style="margin-bottom: 0.5rem">{heading_html}{self.message}</div>"""
+            msg = f"{heading_html}{self.message}"

-        return f"""<div class="{callout_class_str}" style="margin-bottom: 0.5rem">{self.message}</div>"""
+        style = ";".join(
+            [
+                "margin-bottom: 0.5rem",
+                "margin-top: 0.5rem",
+            ]
+        )
+        return f"""<div class="{callout_class_str}" style="{style}">{msg}</div>"""


 def display_callout(
@@ -50,6 +57,7 @@ def display_callout(
     header: bool = True,
     icon: Optional[CalloutIcon] = None,
     display_id: str = None,
+    update: bool = False,
 ) -> None:
     callout = Callout(
         message=message,
@@ -61,4 +69,10 @@ def display_callout(

     # TODO: coordinate with frontend to replace this with a standalone media type
     # instead of rendering HTML with custom classes/styles
-    display(HTML(callout.html), display_id=display_id)
+    if update:
+        update_display(HTML(callout.html), display_id=display_id)
+    else:
+        display(
+            HTML(callout.html),
+            display_id=display_id,
+        )
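The `html` change folds two near-duplicate `return` statements into a single one by building the message and the inline style first. A standalone sketch of the same shape follows; `build_callout_html` and its parameters are hypothetical names for illustration, not the `Callout` model's API.

```python
from typing import List, Optional


def build_callout_html(
    message: str,
    classes: List[str],
    heading: Optional[str] = None,
) -> str:
    """Assemble callout markup with one return path (illustrative only)."""
    msg = message
    if heading is not None:
        # Prepend the heading instead of returning early from this branch
        msg = f"<h6 class='bp3-heading'>{heading}</h6>{message}"

    class_str = " ".join(classes)
    style = ";".join(
        [
            "margin-bottom: 0.5rem",
            "margin-top: 0.5rem",
        ]
    )
    return f'<div class="{class_str}" style="{style}">{msg}</div>'


plain = build_callout_html("hi", ["bp3-callout"])
headed = build_callout_html("hi", ["bp3-callout"], heading="Success")
```

Keeping the style declarations in a list means a further rule, such as the top margin added in this commit, is a one-line change.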