Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: str.cat will align on Series #20347

Merged
merged 10 commits into from
May 2, 2018

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Mar 14, 2018

Fixes issue #18657, fixed existing tests, added new test; all pass.

After I pushed everything and thought about it some more, I realised that one may argue about the default alignment-behavior, and whether it should be changed to join=outer. The behavior as implemented is compatible with the current requirement that everything be of the same length. To me, it is more intuitive that the concatenated other is added to the current series without enlarging it, but I can also see the argument why that restriction is unnecessary.

PS. This is my first PR, tried to follow all the rules. Sorry if I overlooked something.

Edit: Also fixes #20842

@pep8speaks
Copy link

pep8speaks commented Mar 14, 2018

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 02, 2018 at 05:33 Hours UTC

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Mar 14, 2018

I just realised that due to the API-change, a what's-new entry is probably needed somewhere?

@codecov
Copy link

codecov bot commented Mar 14, 2018

Codecov Report

Merging #20347 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20347      +/-   ##
==========================================
+ Coverage   91.79%   91.81%   +0.01%     
==========================================
  Files         153      153              
  Lines       49411    49478      +67     
==========================================
+ Hits        45359    45429      +70     
+ Misses       4052     4049       -3
Flag Coverage Δ
#multiple 90.21% <100%> (+0.01%) ⬆️
#single 41.85% <4%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/strings.py 98.63% <100%> (+0.28%) ⬆️
pandas/util/testing.py 84.59% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4da79b...3f77b80. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

There are some useless commits in the appveyor-queue - how can those be cancelled? I'm guessing I don't have sufficient rights to do it. If someone who can should see this, you can cancel builds for commits (starting with): 41daef8d and d1c5543f.

@@ -2760,6 +2761,17 @@ def test_str_cat_raises_intuitive_error(self):
with tm.assert_raises_regex(ValueError, message):
s.str.cat(' ')

def test_str_cat_align(self):
# https://github.com/pandas-dev/pandas/issues/18657
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs another case where this would produce nans on some elements (e.g. the original issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add

@@ -65,6 +78,11 @@ def str_cat(arr, others=None, sep=None, na_rep=None):
If None, concatenates without any separator.
na_rep : string or None, default None
If None, NA in the series are ignored.
align : bool or None, default None
If used between two Series, determines whether they are aligned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag

UserWarning -> FutureWarning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add

and len(others) and isinstance(others, Series)):
if align is None:
align = False
warnings.warn("A future version of pandas will perform alignment "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FutureWarning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if align is None:
align = False
warnings.warn("A future version of pandas will perform alignment "
"when others is a series. To disable alignment (the "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit unclear, there is no 'previous behavior; the default is to not align, but it will change in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text is the suggestion of @TomAugspurger in the original issue #18657

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a whatsnew subsection to explain this change, also pls update io.rst

@jreback jreback added API Design Strings String extension data type and string data labels Mar 15, 2018
@jreback jreback changed the title Fixed issue 18657 API: str.cat will align on Series Mar 15, 2018
@h-vetinari
Copy link
Contributor Author

@jreback, I assume you meant text.rst and not io.rst. Added tests (changed alignment to join='outer'), added some tests, updated v0.23.0.txt and text.rst.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few doc comments.

Have you checked the log output from Travis to see if any warnings in the test suite are uncaught? They're printed at the bottom of the test output.

their indexes will be aligned before concatenation (if ``align=True``) or not (if ``align=False``). As usual, alignment will expand to the union of both
indexes, while introducing ``NaN`` for missing values in the respective other series (which can be easily handled with the ``na_rep``-keyword).

If the ``align`` keyword is not passed, the method will currently fall back to the previous behavior (i.e. ``align=False``),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this in a .. warning:: directive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part? I added one starting at "If the align keyword is not passed"

.. ipython:: python

base = Series(['a', 'b', 'c', 'd', 'e'])
s = base.reindex([0, 1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you indent all these the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

s.str.cat(t, align=True)
s.str.cat(t, align=False, na_rep='')

.. versionadded:: 0.23.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear what this versionadded is referring to. The keyword? The .str.cat method? I think it's best to leave that to the API documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

``Series.str.cat`` has gained the ``align`` kwarg
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So far, the method :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"So far, the method" -> "Previously"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So far, the method :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).
The method has now gained a keyword ``align`` which controls this behavior. If ``False``, the behavior will be as previously. If ``True`` and ``others``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"which controls this behavior" -> "to control alignment"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second sentence and the next paragraph can be simplified.

The default behavior, not aligning, has not changed. If `align` is not specified, a ``FutureWarning``
is issued and the series are not aligned.

To silence the warning and not align, specify ``align=False``. To silence the warning and align
the Series before concatenating, specify ``align=True``.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understood correctly; removed part with na_rep.

.. ipython:: python

base = Series(['a', 'b', 'c', 'd', 'e'])
s = base.reindex([0, 1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure these have the same indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@h-vetinari
Copy link
Contributor Author

@TomAugspurger @jreback Can one of you (tell me how to) remove all but the last of my commits from the appveyor-queue? It's quite far behind as it is, so no need to choke it with unnecessary commits (in Travis new commits automatically supersede old ones - why not in appveyor...?).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 15, 2018 via email

@h-vetinari
Copy link
Contributor Author

@TomAugspurger , re:appveyor: this is not the case, at least not immediately. I can see in https://ci.appveyor.com/project/pandas-dev/pandas/history that some old commits were tried (while the new ones already existed) and ran for 1-2 minutes. I asked because I saw that some other commits were explicitly cancelled by users, but I don't know how to do that.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test where others is a list of multiple Series ?

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche I wasn't even aware that's a legal signature? I'm guessing all Series would be concatenated with the same sep (and na_rep, etc.)? And it is only legal if all are Series? What about a mix of Series and ndarray?

@jorisvandenbossche
Copy link
Member

I'm guessing all Series would be concatenated with the same sep (and na_rep, etc.)?

Yes, I think so

And it is only legal if all are Series? What about a mix of Series and ndarray?

I suppose we should just process each element in the list separately, so then it does not really matter if it is a mixture.

"'align=True'", FutureWarning, stacklevel=4)
if align:
arr, others = arr.align(others, join='outer')
arrays = [list(arr), list(others)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the conversion to list here is needed (on line 60 they get converted to an array anyhow)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was part of the _get_array_list-function as I found it - I didn't investigate how it interplays with str_cat (which is the only place it called from), so I left it as it was.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more comments. don't worry about the CI.

Concatenating Series
--------------------

The method :meth:`Series.str.cat` can be used to concatenate the records of two :class:`Series`. Depending on the value given to the ``align`` keyword,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe the simpler usecase first (IOW just concat with no other, or other is a simple list). These can just be examples, doesn't have to be so much text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my defense, so far there was no description about str.cat in text.rst. But I will try to write up an overview. Where should it be placed, in your opinion? I would say directly after the splitting-section (natural opposites).

their indexes will be aligned before concatenation (if ``align=True``) or not (if ``align=False``). As usual, alignment will expand to the union of both
indexes, while introducing ``NaN`` for missing values in the respective other series (which can be easily handled with the ``na_rep``-keyword).

.. warning::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this you are already passing the align keyword. These docs should be written as if a user is seeing them w/o benefit of any past history. Just show what they should do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the warning? The text was verbatim from @TomAugspurger, but I will rework this as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not necessary here is my point.


base = Series(['a', 'b', 'c', 'd', 'e'])
s = base.reindex([0, 1, 2, 3])
t = base.reindex([3, 0, 4, 1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than showing a bunch of lines like this. Break this up into a conversation. E.g. show the construction of the Series (call it s), then do a cat with no other, then one with a list, finally with a Series.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote a long overview - conversation-style -- in text.rst

``Series.str.cat`` has gained the ``align`` kwarg
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this simpler. This should be just the first section and a previous and new section, see the other whatsnew entires for hints on structure. Add a reference to the docs in text.rst.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what "This should be just the first section and a previous and new section" means exactly, but I tried to copy the style of other whatsnew entries. Reference added.

@@ -429,6 +429,27 @@ String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.

See also :func:`~pandas.get_dummies`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a reference tag here

Copy link
Contributor Author

@h-vetinari h-vetinari Mar 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what should be added. Reference to what? And where? - In the "concatenation" section I'm writing?

@@ -35,19 +35,32 @@
_shared_docs = dict()


def _get_array_list(arr, others):
def _get_array_list(arr, others, align=True):
from pandas.core.series import Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a mini-doc string here (its an internal function so doesn't have to be full fledged, but Parameters / Returns)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current version, I'm not touching anything about _get_array_list anymore. I agree that docstrings would be good, might add a draft.

def cat(self, others=None, sep=None, na_rep=None):
def cat(self, others=None, sep=None, na_rep=None, align=None):
from pandas.core.series import Series
# FutureWarning for align=None emitted in one place only: str_cat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need any of these comments

if align is not None and align and isinstance(others, Series):
# str_cat deals with arrays only;
# make sure index is correct here as well for using _wrap_result
self._orig, others = self._orig.align(others, join='outer')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed here? is this not handled in str_cat? (which dispatched to _get_array_list)?

s = base.reindex([1, 3, 0, 2])
t = base.reindex([3, 0, 4, 1])
expect_rs_aligned = Series(['aa', 'bb', 'cc', 'dd'])
expect_rs_unaligned = Series(['ab', 'bd', 'ca', 'dc'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a blank line between cases. Add a comment to cases if needed. If you are writing things than once, pls parameterize.

Copy link
Contributor Author

@h-vetinari h-vetinari Mar 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overlooked this, will change in next commit. You mean parametrisation of things like the following example?

def rt(**kwargs):
    r.str.cat(t, **kwargs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the tests, so that they are separated by lines, commented where necessary, and much easier to read.

@h-vetinari
Copy link
Contributor Author

Is there some pandas default how often a FutureWarning shows up? I'm not getting the align-warning a second time, even I verified the code runs through the corresponding if.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Mar 16, 2018

@jorisvandenbossche
I added the functionality you wanted, but it's potentially a bit unintuitive if there's a mix of ndarray and Series (because treating the elements sequentually would mean that the lengths of an array would have to match the length of the union of the indexes of all previous seriers-like elements). In this case, I demand that the ndarray elements match in length to the Series that's calling str.cat.

It was a substantial rewrite, but the code got better through it (found some bugs, and now its cleaner).
I added a lot of tests about the expected behavior - please have a look at those tests if something is unclear. Maybe some of these examples should make it into whatsnew or text.rst, but I'm too tired now. Feedback welcome.

@h-vetinari
Copy link
Contributor Author

@jreback
Just saw your comments now (thanks!), will work on them tomorrow (if I can). Writing good docs is surely an important part.
The code changed a fair bit with the last commit -- initially, I left the structure as it was (i.e. str_cat and _get_array_list), but that was very much not optimal (and the reason for some of my comments).

Is str_cat part of the public API or just used to construct str.cat? Because I shifted all the relevant logic to str.cat now, and str_cat just provides the doc-string (i.e. align does not do anything for str_cat).

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Mar 16, 2018

@jorisvandenbossche @TomAugspurger @jreback
regarding legal inputs for others - if list of series are allowed, why not DataFrames too? It would be a very simple addition.

I added functionality and tests for it. If you like how everything so far works, I'll update the docs (just wanna leave the doc-writing for when the code has converged).

@jorisvandenbossche
Copy link
Member

Is there some pandas default how often a FutureWarning shows up? I'm not getting the align-warning a second time, even I verified the code runs through the corresponding if.

That's normal. FutureWarnings only show up the first time you use something (you can warnings.filterwarnings to have it show always)

return self._wrap_result(result, use_codes=(not self._is_categorical))
if align and isinstance(others, Series):
# str_cat deals with arrays only
data, others = data.align(others, join='outer')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should actually align using a left join. The result of s.str.cat(others) should always preserve the shape and index of s IMO.

Current behaviour:

In [19]: s = pd.Series(['a', 'b'])

In [20]: s.str.cat(pd.Series(['a', 'b'], index=[1, 2]), align=True)
Out[20]: 
0    NaN
1     ba
2    NaN
dtype: object

Copy link
Contributor Author

@h-vetinari h-vetinari Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how I started out (see OP), but it would be inconsistent with how index-alignment is handled elsewhere -- and being consistent in that trumps shape preservation, IMHO. str.cat is already special anyway, in that it is the only str-method that allows other Series as input.

And with na_rep, the behavior gets very intuitive again, IMO.

In [0]: s = pd.Series(['a', 'b'])
In [1]: t = pd.Series(['a', 'b'], index=[1, 2])
In [2]: s.astype(bool) & t.astype(bool)
Out[2]:
0    False
1     True
2    False
dtype: bool
In [3]: s.str.cat(t, align=True)
Out[3]: 
0    NaN
1     ba
2    NaN
dtype: object
In [4]: s.str.cat(t, align=True, na_rep='')
Out[4]: 
0     a
1    ab
2     b
dtype: object
In [5]: s.str.cat(t, align=True, na_rep='x')
Out[5]: 
0    ax
1    ab
2    xb
dtype: object

Copy link
Contributor Author

@h-vetinari h-vetinari Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends a lot on the application which behavior is desired. How about exposing a join="inner"|"left"|"right"|"outer" keyword, with default "left" (or "outer"...)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A keyword with join='left' as the default makes the most sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added join-keyword with default 'left'.

their indexes will be aligned before concatenation (if ``align=True``) or not (if ``align=False``). As usual, alignment will expand to the union of both
indexes, while introducing ``NaN`` for missing values in the respective other series (which can be easily handled with the ``na_rep``-keyword).

.. warning::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not necessary here is my point.

# first achieve maximum extent of data
for x in others:
data, _ = data.align(x, join='outer')
# then bring elements of others to same size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, why are you adding all of this code????

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the code additions from str_cat and _get_array_list because that was messy and for several other reasons not a good way to do it -- all the changes are now in str.cat itself.

To be able to align in all the different cases (list of Series, list of np.ndarrays, mixture of both, plus all the other variants), it's necessary to add (some variant of) the code I added. I tried to keep it clean, non-redundant, documented, and nicely recursive. I don't believe the desired functionality can be added with substantially less code (have a look at the test cases). Currently working on the requested doc changes.

@h-vetinari
Copy link
Contributor Author

@jreback @TomAugspurger @jorisvandenbossche
I added documentation (also for things I haven't touched, like _get_array_list), and wrote and extensive section in text.rst outlining the str.cat-method in general, and the expanded usability in particular.

If you would be so kind, please compile and read text.rst first, before you have a look at the code or tests; it should make for easier reading. I could document the code some more, but it should be quite understandable. Despite the lengths, I think it is not wasteful in terms of lines. It's just not that easy to cover all the desired cases plus all the alignment options.

Feedback welcome.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Mar 17, 2018

The failure in the Travis-CI is an artefact -- the linter (only in the 2.7 run) complained about pandas/_libs/tslibs/period.pyx:1358:12: W291 trailing whitespace, but neither did I touch this file, nor is there any trailing whitespace around the line where the failure supposedly happened.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Mar 18, 2018

@jorisvandenbossche

I suppose we should just process each element in the list separately, so then it does not really matter if it is a mixture.

A comment why this is not so easy (and why I align all elements in a list before starting concatenation):
If one starts concatenating right away, then one might get a problem with index changes induced by later list elements.

>>> s = pd.Series(['a', 'b', 'c', 'd'])
>>> t = pd.Series(['d', 'a', 'e', 'b'], index=[3, 0, 4, 1])
>>>
>>> # the following would be equivalent to
>>> # s.str.cat([t.values, t], align=True, join='outer', na_rep='-')
>>> # if list elements in `other` would be concatenated sequentially
>>> s.str.cat(t.values).str.cat(t, align=True, join='outer', na_rep='-')
0    ada
1    bab
2    ce-
3    dbd
4     -e  # <-- missing a NaN for first column!
dtype: object
>>> s.str.cat([t.values, t], align=True, join='outer', na_rep='-')
0    ada
1    bab
2    ce-
3    dbd
4    --e  # <-- this is more sensible behavior, IMO
dtype: object

Even worse, this would mean that any other arrays following in the list would be of the wrong length and trigger a warning...

@h-vetinari
Copy link
Contributor Author

h-vetinari commented May 2, 2018

Circle-CI failed due to #20906. Rebased onto upstream and restarted.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented May 2, 2018

Annoyingly, there are still unrelated failures in "ci/script_single.sh" of the travis "3.6, NumPy dev" job, but at least that job doesn't fail the build.

=================================== FAILURES ===================================
___________________ TestClipboard.test_round_trip_frame_sep ____________________
self = <pandas.tests.io.test_clipboard.TestClipboard object at 0x7fa9961822e8>
    def test_round_trip_frame_sep(self):
        for dt in self.data_types:
>           self.check_round_trip_frame(dt, sep=',')
pandas/tests/io/test_clipboard.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_clipboard.py:77: in check_round_trip_frame
    tm.assert_frame_equal(data, result, check_dtype=False)
pandas/util/testing.py:1304: in assert_frame_equal
    '{shape!r}'.format(shape=right.shape))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
obj = 'DataFrame', message = 'DataFrame shape mismatch', left = '(61, 3)'
right = '(5, 0)', diff = None
[...]
pandas/util/testing.py:1018: AssertionError
_____________________ TestClipboard.test_round_trip_frame ______________________
self = <pandas.tests.io.test_clipboard.TestClipboard object at 0x7fa996182ac8>
    def test_round_trip_frame(self):
        for dt in self.data_types:
>           self.check_round_trip_frame(dt)
pandas/tests/io/test_clipboard.py:91: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_clipboard.py:77: in check_round_trip_frame
    tm.assert_frame_equal(data, result, check_dtype=False)
pandas/util/testing.py:1304: in assert_frame_equal
    '{shape!r}'.format(shape=right.shape))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
obj = 'DataFrame', message = 'DataFrame shape mismatch', left = '(61, 3)'
right = '(5, 3)', diff = None
[...]
pandas/util/testing.py:1018: AssertionError
--------------------- generated xml file: /tmp/single.xml ----------------------
===== 2 failed, 64 passed, 316 skipped, 25292 deselected in 36.41 seconds ======

@h-vetinari
Copy link
Contributor Author

@TomAugspurger Green! =)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the checking code needs some work

others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
elif isinstance(others, np.ndarray) and others.ndim == 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong

i don’t think we can align a ndarray at all like this
let’s can ndarray a that are > 1 dim

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DF-constructor works as expected for a 2-dim ndarray, but I haven't checked if this is tested behaviour. (essentially, df == DataFrame(df.values, columns=df.columns, index=df.index))

I would suggest not to can 2-dim ndarrays, because they are necessary to avoid alignment on the deprecation path for join:

[...] To disable alignment (the behavior before v.0.23) and silence this warning, use .values on any Series/Index/DataFrame in others. [...]

return (los, fu_wrn)
# test if there is a mix of list-like and non-list-like (e.g. str)
elif (any(is_list_like(x) for x in others)
and any(not is_list_like(x) for x in others)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make this simpler by just checking for all is not list like (eg strings)

anything else will fail thru to the TypeError

others = list(others) # ensure iterators do not get read twice etc
if all(is_list_like(x) for x in others):
los = []
fu_wrn = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u name this parameter just warn

fu_wrn = False
while others:
nxt = others.pop(0) # list-like as per check above
# safety for iterators and other non-persistent list-likes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole section needs some work it’s way too hard to read and follow

is_legal = ((no_deep and nxt.dtype == object)
or all((isinstance(x, compat.string_types)
or (not is_list_like(x) and isnull(x))
or x is None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnull already checks for None
only 1d objects are valid here (or all scalars)

do this check up front

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 2, 2018 via email

@TomAugspurger TomAugspurger mentioned this pull request May 2, 2018
5 tasks
@TomAugspurger TomAugspurger merged commit f851699 into pandas-dev:master May 2, 2018
@TomAugspurger
Copy link
Contributor

#20922 for the followup.

Thanks!

@h-vetinari
Copy link
Contributor Author

@TomAugspurger @jreback
Will do the follow-up tonight. Thanks for the patience/reviews and helping get this across the line - guess it was a bit ambitious as a first PR... :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str.cat inconsistent for CategoricalIndex
5 participants