feat: ensure Series.str.len() can get length of array columns (#497)
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://togithub.com/googleapis/python-bigquery-dataframes/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕
tswast authored Mar 22, 2024
1 parent d51fa84 commit 10c0446
Showing 2 changed files with 20 additions and 2 deletions.
2 changes: 0 additions & 2 deletions tests/system/conftest.py
@@ -357,8 +357,6 @@ def nested_pandas_df() -> pd.DataFrame:
         DATA_DIR / "nested.jsonl",
         lines=True,
     )
-    tests.system.utils.convert_pandas_dtypes(df, bytes_col=True)
-
     df = df.set_index("rowindex")
     return df

20 changes: 20 additions & 0 deletions tests/system/small/operations/test_strings.py
@@ -181,6 +181,26 @@ def test_len(scalars_dfs):
     )
 
 
+def test_len_with_array_column(nested_df, nested_pandas_df):
+    """
+    Series.str.len() is expected to work on columns containing lists as well as strings.
+    See: https://stackoverflow.com/a/41340543/101923
+    """
+    col_name = "event_sequence"
+    bf_series: bigframes.series.Series = nested_df[col_name]
+    bf_result = bf_series.str.len().to_pandas()
+    pd_result = nested_pandas_df[col_name].str.len()
+
+    # One of dtype mismatches to be documented. Here, the `bf_result.dtype` is `Int64` but
+    # the `pd_result.dtype` is `float64`: https://github.com/pandas-dev/pandas/issues/51948
+    assert_series_equal(
+        pd_result.astype(pd.Int64Dtype()),
+        bf_result,
+        check_index_type=False,
+    )
+
+
 def test_lower(scalars_dfs):
     scalars_df, scalars_pandas_df = scalars_dfs
     col_name = "string_col"
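
The pandas side of the new test relies on `Series.str.len()` accepting list-valued object columns, and on the result being `float64` when a row is missing. The following is a minimal sketch of that behavior using plain pandas only. The sample values are hypothetical and are not taken from `nested.jsonl`, and the sketch does not exercise BigQuery DataFrames itself.

```python
import pandas as pd

# Hypothetical list-valued column with one missing row (not the real test data).
s = pd.Series([["a", "b"], ["c"], None, []], dtype="object")

# The .str accessor applies len() to each list; the missing row becomes NaN,
# which forces the result to float64 rather than an integer dtype.
lengths = s.str.len()
print(lengths.dtype)

# Casting to the nullable Int64 dtype, as the test does before comparing,
# keeps the missing row as <NA> and makes the lengths integers.
as_int = lengths.astype(pd.Int64Dtype())
print(as_int.dtype)
```

Casting the pandas result rather than the BigQuery DataFrames result mirrors the test above, which treats `Int64` as the expected dtype and the `float64` output as the mismatch to document.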
