Adding notebook example for converting Pandas code to Dask #68 #70

Merged
merged 15 commits into from
Jun 25, 2019

Conversation

sephib
Contributor

@sephib sephib commented Apr 24, 2019

Hi,
Following issue #68 the notebook has the following topics:

  1. background
  2. Conceptual shift - from Update to Insert/Delete
    2.1 Rename
    2.2 Column manipulations
    2.3 Drop NA on column
    2.4 Reset Index
    2.5 Read/Save files
  3. Group By
  4. Consider using Persist / Debugging

Please feel free to amend the notebook or suggest additional topics.

sephib and others added 5 commits April 18, 2019 16:58
@mrocklin
Member

Hi @sephib , thanks for the work here. It's clear and gives several good tips.

However, I have two general concerns:

  1. In many cases these are bugs that could be fixed. I'm not sure I would want to solidify these bugs in user-facing examples, which should probably stay around for a while. Rather, I'd prefer that we just spend time fixing them.
  2. Often you choose situations that I don't see come up often. For example in the section on "Convert index into Time column" I think I've seen this come up in an issue maybe once or twice over several years. From my perspective it's not one of the major differences between the two libraries. My guess is that you chose usability issues that you yourself ran into, which makes sense, but this may not be representative of the general experience. I think that to do this effectively we would need to survey a few people to get a sense of very common differences that trip people up.

Thoughts?

@sephib
Contributor Author

sephib commented Apr 24, 2019

Sure,
I never know if it is a bug or incorrect coding...

I'll be happy to incorporate any topics that you think represent more general requirements. Unfortunately I don't have an audience to ask.
Please point out specific sections that you prefer to remove.

@mrocklin
Member

> I'll be happy to incorporate any topics that you think represent more general requirements. Unfortunately I don't have an audience to ask.
> Please point out specific sections that you prefer to remove.

Well, you could ask on a github issue and try to get people to respond there. You might ask also on the gitter channel. You could also review previous github issues to see what themes are common.

As with most teaching, I think that most of the work here isn't in preparing the notebook, it's in preparing the content that goes into it. Making example notebooks is hard.

@sephib
Contributor Author

sephib commented Apr 28, 2019

Hi,
I'll try to get some information from data.stackexchange.
I will update when I have additional information.

@sephib
Contributor Author

sephib commented May 12, 2019

Hi,
I reviewed the Stack Overflow posts tagged with both dask and pandas that have scores above 5 and came up with some additional issues. Please feel free to comment on any issue.

  1. Reading csv
    1.1. reading multiple csv files (with '*')
    1.2. reading using kwargs - all **kwargs are available, such as compression='gzip'
    1.3. reading directly from hdfs
  2. Create dataframe
    2.1. Use dd.from_pandas(df, npartitions=n)
  3. Conceptual shift - from Update to Insert/Delete
    3.1. Rename
  4. Data manipulations
    4.1. As is with Pandas - always try to vectorize
    4.2. Working with map_partitions vs apply
    4.3. Understanding meta
    4.4. Using Masks /Where
    4.5. Drop NA axis=columns
  5. Understanding index
    5.1. Index per partition
    5.2. Set/Reset Index
  6. Save files
  7. Group By
  8. Consider using Persist
  9. Debugging
    9.1. ddf.head() - only uses the first partition (not all partitions are loaded)
    9.2. errors due to corrupted DAG

"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Rename"
Member

I would remove this one. I don't consider .rename(..., inplace=True) to be a best practice, and there have been proposals to deprecate inplace in many places in pandas.

I would recommend df = df.rename(columns=...), which works for both pandas and dask.

Contributor Author

I think there is value in showing that if we use inplace=True we get an error

"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2 Column munipilations \n",
Member

Typo: manipulations

Contributor Author

Sorry - there were a few typos...

"source": [
"# Dask\n",
"ddf = ddf.assign(Time=ddf.index)\n",
"ddf['Time'] = ddf['Time'].dt.time\n",
Member

Why split this across multiple lines? Does the pandas version not work?

Contributor Author

This doesn't work on a dask dataframe

"cell_type": "markdown",
"metadata": {},
"source": [
"Dask is in a development mode\n",
Member

I don't think this is very useful. It will go out of date when the bug is fixed (it actually seems to be fixed already).

Contributor Author

Great!
I'll update the cell without the workaround

sephib added 3 commits May 15, 2019 00:56
1. rename
2. meta (including returning Series and DataFrame)
3. datetime conversion
4. dropna
5. read/write files
@sephib
Contributor Author

sephib commented May 20, 2019

waiting for your feedback in order to iterate on the notebook

@TomAugspurger
Member

It'll be a bit of time before I can go through in detail.

@sephib
Contributor Author

sephib commented May 21, 2019

OK, sure. Thx for all your input until now.

@martindurant
Member

ping @TomAugspurger here, in case this one slipped through

@TomAugspurger
Member

TomAugspurger commented May 28, 2019 via email

@sephib
Contributor Author

sephib commented Jun 4, 2019

Hi, here is an updated version which I presented @pyconil. https://github.com/sephib/dask_pyconil2019/blob/c660db3ce3e56a9241b49ca13e2163895bab3a94/dask_for_pandas-in_ETL.ipynb. Obviously I need to clean it up from the presentation style.

@martindurant
Member

@sephib , are you still planning on cleaning up your notebook?

@sephib
Contributor Author

sephib commented Jun 18, 2019

Yes!
I will remove all the presentations cells.
Do you have any other inputs / issues that you would like me to address?

@martindurant
Member

I haven't looked through in any detail, should all be good once you've responded to @TomAugspurger 's comments, although he may want another look.

@sephib
Contributor Author

sephib commented Jun 22, 2019

@TomAugspurger I've cleaned up the notebook and amended it, taking your comments into account. Thx for all your work

to enable running the entire notebook without errors
@martindurant
Member

The build reports:

nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
%%time
# Pandas
dir_path = Path(r'data/pd2dd')
concat_df = pd.concat([pd.read_csv(f) 
                       for f in list(dir_path.glob('*.csv'))])
len(concat_df)
------------------
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>
~/miniconda/envs/test/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    226                        keys=keys, levels=levels, names=names,
    227                        verify_integrity=verify_integrity,
--> 228                        copy=copy, sort=sort)
    229     return op.get_result()
    230 
~/miniconda/envs/test/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    260 
    261         if len(objs) == 0:
--> 262             raise ValueError('No objects to concatenate')
    263 
    264         if keys is None:
ValueError: No objects to concatenate

@martindurant
Member

(perhaps the path during execution is not what you thought it was)
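
One way to make this kind of failure easier to diagnose is to check the glob result before calling `pd.concat`. The sketch below assumes a temporary directory with made-up CSV files; `concat_csvs` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def concat_csvs(dir_path: Path) -> pd.DataFrame:
    """Concatenate every CSV in dir_path, failing early with a clear message."""
    csv_files = sorted(dir_path.resolve().glob("*.csv"))
    if not csv_files:
        # pd.concat([]) would raise "No objects to concatenate"; report the
        # resolved path and working directory instead so the cause is visible.
        raise FileNotFoundError(
            f"No CSV files under {dir_path.resolve()} (cwd: {Path.cwd()})"
        )
    return pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)


# Demonstration with throwaway data (hypothetical file names and columns).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for i in range(3):
        frame = pd.DataFrame({"id": [i, i + 10]})
        frame.to_csv(tmp_path / f"part_{i}.csv", index=False)
    concat_df = concat_csvs(tmp_path)

    empty_dir = tmp_path / "empty"
    empty_dir.mkdir()
    try:
        concat_csvs(empty_dir)
        empty_dir_message = ""
    except FileNotFoundError as exc:
        empty_dir_message = str(exc)

print(len(concat_df))
print(empty_dir_message)
```

Including the working directory in the message helps when a notebook is executed by a build tool from a different location than the one used interactively.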

@sephib
Contributor Author

sephib commented Jun 24, 2019

The build error is :

ChunkedEncodingError: ('Connection broken: OSError("(104, 'ECONNRESET')")', OSError("(104, 'ECONNRESET')"))
ChunkedEncodingError: ('Connection broken: OSError("(104, 'ECONNRESET')")', OSError("(104, 'ECONNRESET')"))
You can ignore this error by setting the following in conf.py:
nbsphinx_allow_errors = True
Notebook error:
CellExecutionError in applications/json-data-on-the-web.ipynb:

events.pluck('spec').frequencies(sort=True).take(20)

I've checked my notebook again and it is running smoothly - not sure what I can do about it...

@martindurant
Member

Same as #85 ?

@sephib
Contributor Author

sephib commented Jun 24, 2019

I think it is something related to travis-ci. The notebook looks OK. will try and work on it from a different computer.

@sephib
Contributor Author

sephib commented Jun 24, 2019

@martindurant well it did the trick.
@TomAugspurger please don't hesitate to amend the existing topics or suggest additional ideas for the notebook and I'll try to implement them.

@martindurant
Member

I think it may fall under the description of a "flaky" test :)

@TomAugspurger , I'm happy with how this looks, so can merge if you have no further comments.

@martindurant martindurant merged commit 18dd483 into dask:master Jun 25, 2019
@martindurant
Member

Thank you, @sephib
