Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Add DataFrame.index.levels #55437

Merged
merged 28 commits into from
Oct 12, 2023
Merged

DOC: Add DataFrame.index.levels #55437

merged 28 commits into from
Oct 12, 2023

Conversation

shiersansi
Copy link
Contributor

@shiersansi shiersansi commented Oct 7, 2023

Add docstrings for MultiIndex.levels and MultiIndex.codes

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shiersansi.

I think you misunderstood the list on the description. You shouldn't ignore the existing items in the list, and add one of what you did. You need to do what the existing items say, and check them to show that you did the relevant points. If you see the CI, you'll see there are error, that you could have detected earlier with the tasks in that list. No big deal having run those things locally, but they'll be helpful to prevent the CI being red, which prevents us from merging this.

pandas/core/indexes/multi.py Show resolved Hide resolved
# create new IndexEngine
# https://github.com/pandas-dev/pandas/issues/31648
"""
Returns a tuple of Index objects representing the levels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the short summary needs to fit in one line. I wouldn't go into technical details like if the return is a tuple. The best I can think of as a short summary is Levels of the MultiIndex.. Then in the following paragraph I would explain what are levels, what they mean conceptually, how they can be seen in pandas...

Each level is an Index object containing unique values
from that level of the MultiIndex.

Returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case this is an attribute (MultiIndex.levels not MultiIndex.levels()), so for the users it doesn't return anything, it contains the value, and this section is not expected to exist. I'm not sure how we do for other attributes in the docs, they can be a good reference.

pandas/core/indexes/multi.py Outdated Show resolved Hide resolved
@datapythonista datapythonista added Docs Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 7, 2023
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this @shiersansi, amazing work, I really like your explanations here, they are super clear.

I added some comments that I think can make the examples better, but outstanding work, thanks for it.

"""
Levels of the MultiIndex.

What are levels?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to remove the questions. In isolation I think they are fine, but since we don't use them in any other API documentation page, they feel a bit strange and inconsistent. I think it's easy to understand the explanations without the questions, so no big deal.

pandas/core/indexes/multi.py Outdated Show resolved Hide resolved

Examples
--------
Example 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove the Example 1 and just show the example directly.

FrozenList([[0, 1, 2], ['green', 'purple']])

Example 2:
>>> idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reuse the index from above, so this is less verbose please.

Example 2:
>>> idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'],
... list('abcde')], names=['Person', 'Letter'])
>>> large = pd.DataFrame(data=np.random.randn(15, 2),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to just call this df instead of large

... list('abcde')], names=['Person', 'Letter'])
>>> large = pd.DataFrame(data=np.random.randn(15, 2),
... index=idx, columns=['one', 'two'])
>>> small = large.loc[['Jo'==d[0:2] for d in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show the dataframe here first, so the user can better understand the data of the example? I think you'll want to avoid using random data in the example, as it'll make your life more difficult. And if you use data that is meaningful instead of some random things, that will also help users understand better. As an idea, you can use a MultiIndex of animals with class/animal (e.g. insect/ant, insect/spides, mammal/goat...) and use a dataframe with just one column, for example number of legs (we use this example in other docs). The example is silly, but meaningful and any user can understand it very quickly, and focus on the concepts, not in understanding the data.

Then, as said, you can show the dataframe first, show the levels of the multiindex from it (i.e. df.index.levels), and then you filter the data many_leg_animals = animals[animals.num_legs > 4] to finally show that even if now there are no mammals, in the data, the multiindex still contains all the levels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

屏幕截图 2023-10-08 164751

Thank you for your suggestion. Is it easier to understand the example part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a problem that I may need your help with. When I tested the code after I committed it, four tests failed, but I couldn't figure out exactly what went wrong from the error log provided.

@datapythonista datapythonista changed the title Doc: add DataFrame.index.levels DOC: Add DataFrame.index.levels Oct 8, 2023
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shiersansi, this is getting better, but I still have some comments, as I find the examples a bit confusing. Let me know if you have any question or comment.

For the examples, you should make sure PEP-8 is respected (like having one spaces after commas...). I can show you more in detail where to see the errors in the next iteration, when you address the comments.

Thanks again for working on this!


Examples
--------
>>> idx = pd.MultiIndex.from_product([('insect', 'mammal'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do from_product we'll have all the options, like insect/goat, mammal/spider, which we don't want. There are multiple options, one is just to create a "normal" dataframe with all columns, and then set the index to create the MultiIndex. That may be the simpler, but if you find another that is simpler or as clear, feel free to use any other way. But data should be something like:

mammal - goat - 4
mammal - human - 2
insect - ant - 6
insect - spider - 8

Not things like insect - goat - 1 which I think it'll just confuse users, as the data doesn't make sense.


If the DataFrame not using all levels, this method still returns the original
levels of the MultiIndex.
When using this method on a DataFrame index, it may show "extra".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this last sentence talking about the MultiIndex returning all original levels? I personally find the it may show extra comment misleading. If you want to complement the previous sentence, what about something like For example, if a MultiIndex is created with levels A, B, C, and the DataFrame using it filters out all rows of the level C, MultiIndex.levels will still return A, B, C. Something like this should make things easier to understand than this last comment IMHO.

--------
>>> idx = pd.MultiIndex.from_product([('insect', 'mammal'),
... ('goat', 'dog','cat','ant','spides')], names=['class', 'animal'])
>>> print(idx.levels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd better show the whole dataframe instead when it's ready, and you don't need the print, just >>> df or >>> leg_num.

>>> leg_num = pd.DataFrame(data=(1,2,3,4,5,6,7,8,9,2),
... index=idx, columns=['leg'])
>>> insect_leg_num = leg_num.loc[['insect'==d for d in
... leg_num.index.get_level_values('class')]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this, why not something simpler like large_leg_num = leg_num[leg_num.num_legs > 4] or something simpler where the user doesn't need to put a log of effort to understand something that is not what this docstring is about?

>>> insect_leg_num = leg_num.loc[['insect'==d for d in
... leg_num.index.get_level_values('class')]]

>>> print(leg_num.index.levels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd show this as soon as the dataframe is ready, then continue with the filtering and showing that the levels didn't change. You don't need the print here either. And I think it'd be good to add a comment before you show the filtering, to let users know what you're going to show them, which is not obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have revised the doc content again and look forward to your new suggestions

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shiersansi, looking better. I think we should show the data so the examples are clearer, and added few more thoughts, but looks good in general.

pandas/core/indexes/multi.py Show resolved Hide resolved
FrozenList([['mammal'], ['cat', 'dog', 'goat', 'human']])

If the number of Legs is greater than 2 as the filter condition,
the levels do not change
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not clear to me, sorry. I think something like MultiIndex levels will not change even if the DataFrame using the MultiIndex does not contain all them anymore:

pandas/core/indexes/multi.py Show resolved Hide resolved
pandas/core/indexes/multi.py Show resolved Hide resolved
level in the MultiIndex and contains the unique values found in that
specific level.

if a MultiIndex is created with levels A, B, C, and the DataFrame using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if a MultiIndex is created with levels A, B, C, and the DataFrame using
If a MultiIndex is created with levels A, B, C, and the DataFrame using

Feels a bit strange to jump into an example case before explaining the concept, but I think it's easier to understand this way.

FrozenList([['mammal'], ['cat', 'dog', 'goat', 'human']])

MultiIndex levels will not change even if the DataFrame using the MultiIndex
does not contain all them anymore:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can make it even more clear (and I think it's better to leave a blank line between the comment and the code. But other than this small suggestion seems perfect, I think this is a very great addition to the docs.

Suggested change
does not contain all them anymore:
does not contain all them anymore. See How "human" is not in the DataFrame, but it is still in levels:

@datapythonista
Copy link
Member

/preview

@github-actions
Copy link
Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/55437/

@datapythonista
Copy link
Member

Looks like something is broken in the last part of the examples, you can see it here: https://pandas.pydata.org/preview/55437/docs/reference/api/pandas.MultiIndex.levels.html#pandas.MultiIndex.levels

Can you have a look please?

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @shiersansi, thanks a lot for working on this!

@shiersansi
Copy link
Contributor Author

Amazing work @shiersansi, thanks a lot for working on this!

Thank you for your help and advice. Your help has made me understand more about how to contribute to open source projects, and I have learned a lot through this docstring modification

@datapythonista
Copy link
Member

Amazing work @shiersansi, thanks a lot for working on this!

Thank you for your help and advice. Your help has made me understand more about how to contribute to open source projects, and I have learned a lot through this docstring modification

Amazing, I'm glad to hear. Feel free to continue working in other things, and please let me know if you need help. For this PR we'll wait until the CI finishes, and another core developer has time to review it, in case I missed something, but it should be ready from your side.

@mroeschke mroeschke added this to the 2.2 milestone Oct 12, 2023
@mroeschke mroeschke merged commit 7ef617e into pandas-dev:main Oct 12, 2023
33 checks passed
@mroeschke
Copy link
Member

Thanks @shiersansi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants