-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Add DataFrame.index.levels #55437
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shiersansi.
I think you misunderstood the list on the description. You shouldn't ignore the existing items in the list, and add one of what you did. You need to do what the existing items say, and check them to show that you did the relevant points. If you see the CI, you'll see there are error, that you could have detected earlier with the tasks in that list. No big deal having run those things locally, but they'll be helpful to prevent the CI being red, which prevents us from merging this.
pandas/core/indexes/multi.py
Outdated
# create new IndexEngine | ||
# https://github.com/pandas-dev/pandas/issues/31648 | ||
""" | ||
Returns a tuple of Index objects representing the levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the short summary needs to fit in one line. I wouldn't go into technical details like if the return is a tuple. The best I can think of as a short summary is Levels of the MultiIndex.
. Then in the following paragraph I would explain what are levels, what they mean conceptually, how they can be seen in pandas...
pandas/core/indexes/multi.py
Outdated
Each level is an Index object containing unique values | ||
from that level of the MultiIndex. | ||
|
||
Returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case this is an attribute (MultiIndex.levels
not MultiIndex.levels()
), so for the users it doesn't return anything, it contains the value, and this section is not expected to exist. I'm not sure how we do for other attributes in the docs, they can be a good reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work on this @shiersansi, amazing work, I really like your explanations here, they are super clear.
I added some comments that I think can make the examples better, but outstanding work, thanks for it.
pandas/core/indexes/multi.py
Outdated
""" | ||
Levels of the MultiIndex. | ||
|
||
What are levels? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to remove the questions. In isolation I think they are fine, but since we don't use them in any other API documentation page, they feel a bit strange and inconsistent. I think it's easy to understand the explanations without the questions, so no big deal.
pandas/core/indexes/multi.py
Outdated
|
||
Examples | ||
-------- | ||
Example 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove the Example 1
and just show the example directly.
pandas/core/indexes/multi.py
Outdated
FrozenList([[0, 1, 2], ['green', 'purple']]) | ||
|
||
Example 2: | ||
>>> idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you reuse the index from above, so this is less verbose please.
pandas/core/indexes/multi.py
Outdated
Example 2: | ||
>>> idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'], | ||
... list('abcde')], names=['Person', 'Letter']) | ||
>>> large = pd.DataFrame(data=np.random.randn(15, 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to just call this df
instead of large
pandas/core/indexes/multi.py
Outdated
... list('abcde')], names=['Person', 'Letter']) | ||
>>> large = pd.DataFrame(data=np.random.randn(15, 2), | ||
... index=idx, columns=['one', 'two']) | ||
>>> small = large.loc[['Jo'==d[0:2] for d in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show the dataframe here first, so the user can better understand the data of the example? I think you'll want to avoid using random data in the example, as it'll make your life more difficult. And if you use data that is meaningful instead of some random things, that will also help users understand better. As an idea, you can use a MultiIndex of animals with class/animal (e.g. insect/ant, insect/spides, mammal/goat...) and use a dataframe with just one column, for example number of legs (we use this example in other docs). The example is silly, but meaningful and any user can understand it very quickly, and focus on the concepts, not in understanding the data.
Then, as said, you can show the dataframe first, show the levels of the multiindex from it (i.e. df.index.levels
), and then you filter the data many_leg_animals = animals[animals.num_legs > 4]
to finally show that even if now there are no mammals, in the data, the multiindex still contains all the levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a problem that I may need your help with. When I tested the code after I committed it, four tests failed, but I couldn't figure out exactly what went wrong from the error log provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shiersansi, this is getting better, but I still have some comments, as I find the examples a bit confusing. Let me know if you have any question or comment.
For the examples, you should make sure PEP-8 is respected (like having one spaces after commas...). I can show you more in detail where to see the errors in the next iteration, when you address the comments.
Thanks again for working on this!
pandas/core/indexes/multi.py
Outdated
|
||
Examples | ||
-------- | ||
>>> idx = pd.MultiIndex.from_product([('insect', 'mammal'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do from_product
we'll have all the options, like insect/goat, mammal/spider, which we don't want. There are multiple options, one is just to create a "normal" dataframe with all columns, and then set the index to create the MultiIndex. That may be the simpler, but if you find another that is simpler or as clear, feel free to use any other way. But data should be something like:
mammal - goat - 4
mammal - human - 2
insect - ant - 6
insect - spider - 8
Not things like insect - goat - 1
which I think it'll just confuse users, as the data doesn't make sense.
pandas/core/indexes/multi.py
Outdated
|
||
If the DataFrame not using all levels, this method still returns the original | ||
levels of the MultiIndex. | ||
When using this method on a DataFrame index, it may show "extra". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this last sentence talking about the MultiIndex returning all original levels? I personally find the it may show extra
comment misleading. If you want to complement the previous sentence, what about something like For example, if a MultiIndex is created with levels A, B, C, and the DataFrame using it filters out all rows of the level C, MultiIndex.levels will still return A, B, C
. Something like this should make things easier to understand than this last comment IMHO.
pandas/core/indexes/multi.py
Outdated
-------- | ||
>>> idx = pd.MultiIndex.from_product([('insect', 'mammal'), | ||
... ('goat', 'dog','cat','ant','spides')], names=['class', 'animal']) | ||
>>> print(idx.levels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd better show the whole dataframe instead when it's ready, and you don't need the print, just >>> df
or >>> leg_num
.
pandas/core/indexes/multi.py
Outdated
>>> leg_num = pd.DataFrame(data=(1,2,3,4,5,6,7,8,9,2), | ||
... index=idx, columns=['leg']) | ||
>>> insect_leg_num = leg_num.loc[['insect'==d for d in | ||
... leg_num.index.get_level_values('class')]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get this, why not something simpler like large_leg_num = leg_num[leg_num.num_legs > 4]
or something simpler where the user doesn't need to put a log of effort to understand something that is not what this docstring is about?
pandas/core/indexes/multi.py
Outdated
>>> insect_leg_num = leg_num.loc[['insect'==d for d in | ||
... leg_num.index.get_level_values('class')]] | ||
|
||
>>> print(leg_num.index.levels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd show this as soon as the dataframe is ready, then continue with the filtering and showing that the levels didn't change. You don't need the print here either. And I think it'd be good to add a comment before you show the filtering, to let users know what you're going to show them, which is not obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have revised the doc content again and look forward to your new suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shiersansi, looking better. I think we should show the data so the examples are clearer, and added few more thoughts, but looks good in general.
pandas/core/indexes/multi.py
Outdated
FrozenList([['mammal'], ['cat', 'dog', 'goat', 'human']]) | ||
|
||
If the number of Legs is greater than 2 as the filter condition, | ||
the levels do not change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not clear to me, sorry. I think something like MultiIndex levels will not change even if the DataFrame using the MultiIndex does not contain all them anymore:
pandas/core/indexes/multi.py
Outdated
level in the MultiIndex and contains the unique values found in that | ||
specific level. | ||
|
||
if a MultiIndex is created with levels A, B, C, and the DataFrame using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a MultiIndex is created with levels A, B, C, and the DataFrame using | |
If a MultiIndex is created with levels A, B, C, and the DataFrame using |
Feels a bit strange to jump into an example case before explaining the concept, but I think it's easier to understand this way.
pandas/core/indexes/multi.py
Outdated
FrozenList([['mammal'], ['cat', 'dog', 'goat', 'human']]) | ||
|
||
MultiIndex levels will not change even if the DataFrame using the MultiIndex | ||
does not contain all them anymore: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can make it even more clear (and I think it's better to leave a blank line between the comment and the code. But other than this small suggestion seems perfect, I think this is a very great addition to the docs.
does not contain all them anymore: | |
does not contain all them anymore. See How "human" is not in the DataFrame, but it is still in levels: | |
/preview |
Website preview of this PR available at: https://pandas.pydata.org/preview/55437/ |
Looks like something is broken in the last part of the examples, you can see it here: https://pandas.pydata.org/preview/55437/docs/reference/api/pandas.MultiIndex.levels.html#pandas.MultiIndex.levels Can you have a look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work @shiersansi, thanks a lot for working on this!
Thank you for your help and advice. Your help has made me understand more about how to contribute to open source projects, and I have learned a lot through this docstring modification |
Amazing, I'm glad to hear. Feel free to continue working in other things, and please let me know if you need help. For this PR we'll wait until the CI finishes, and another core developer has time to review it, in case I missed something, but it should be ready from your side. |
Thanks @shiersansi |
Add docstrings for MultiIndex.levels and MultiIndex.codes