[SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers #14579
Conversation
Test build #63520 has finished for PR 14579 at commit
Note:

```python
with rdd.map(lambda x: x) as x:
    ...
```

Clearly this doesn't make a lot of sense. However, I looked at the 2 options of (a) a separate context manager wrapper class returned by `persist()`/`cache()`, and (b) implementing the context manager special methods on RDD/DataFrame directly; each has its problems. So, if we want to avoid that, the only option I see is a variant of (a) above - adding a utility used as:

```python
with cached(rdd) as x:
    x.count()
```

This is less "elegant" but more explicit. Any other smart ideas for handling option (b) above, please do shout!
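Option (a)'s utility could be sketched with `contextlib.contextmanager`. This is only an illustration, not the PR's actual code; the `FakeRDD` class is a stand-in invented here just to show the call sequence a real RDD would go through:

```python
from contextlib import contextmanager

@contextmanager
def cached(rdd):
    """Cache `rdd` on entry, unpersist it when the block exits."""
    rdd.cache()
    try:
        yield rdd
    finally:
        rdd.unpersist()

class FakeRDD:
    """Stand-in for an RDD; only tracks persistence state."""
    def __init__(self):
        self.is_cached = False
    def cache(self):
        self.is_cached = True
        return self
    def unpersist(self):
        self.is_cached = False
        return self
    def count(self):
        return 0

rdd = FakeRDD()
with cached(rdd) as x:
    assert x.is_cached   # cached while the block is active
    x.count()
assert not rdd.is_cached # automatically unpersisted on exit
```

The `try`/`finally` inside the generator guarantees `unpersist()` runs even if the block raises.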
Thanks @MLnick for taking this on and for breaking down what you've found so far. I took a look through the alternatives as well. Have you taken a look at those?
@nchammas I looked at that too. As far as I can see, it doesn't cover this case cleanly.
Ah, you're right. So if we want to avoid needing magic methods in the main RDD/DataFrame classes and avoid needing a separate utility method like `cached()`, the remaining option seems to be having `persist()`/`cache()` return a subclass that implements the context manager protocol.

What do you think of that?
None of our options seems great, but if I had to rank them I would say:

Adding new internal classes for this use-case honestly seems a bit heavy-handed to me, so if we are against that then I would lean towards not doing anything.
cc @davies
One minor thing to keep in mind - the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted - but I think that is a relatively minor issue. On the whole I think modifying the base RDD and DataFrame classes (option A / option 4), which is the one @MLnick has implemented here, is probably one of the more reasonable options. But if there is a better way to do this I'd be excited to find out as well :)
How so? Wouldn't the subclass still behave like a regular RDD?
@nchammas so if we go with the subclassing approach but keep the current cache/persist interface (e.g. no special utility function), a user could easily write something like the example reformatted below, mixing the persisted object into further transformations. I don't believe that would behave as expected.
:py:meth:`cache` can be used in a 'with' statement. The RDD will be automatically
unpersisted once the 'with' block is exited. Note however that any actions on the RDD
that require the RDD to be cached, should be invoked inside the 'with' block; otherwise,
caching will have no effect.
Super minor documentation suggestion - but I was thinking maybe a versionchanged directive could be helpful to call out that it's new functionality (both in RDD and DF)?
Agreed, especially since this is technically a new Public API that we are potentially committing to for the life of the 2.x line.
Sorry, you're right. But I'm not seeing the issue with the example you posted. Reformatting for clarity:

```python
magic = rdd.persist()
with magic as awesome:
    awesome.count()
magic.map(lambda x: x + 1)
```

Are you saying the final `map` call would break?
Yeah, it would break pipelining - I don't think it will necessarily throw an error though.

So I think chaining will work, but the pipelined RDD thinks `mapped2` is the 1st transformation, while it is actually the 2nd. I think this will just be an efficiency issue rather than a correctness issue, however. We could possibly work around it with some type checking etc., but then it starts to feel like adding more complexity than the feature is worth...
Ah, I see. I don't fully understand how pipelining works here, though.

Agreed. At this point, actually, I'm beginning to feel this feature is not worth it. Context managers seem to work best when the objects they're working on have clear open/close-style semantics. File handles, network connections, and the like fit this pattern well - in fact, the docs for context managers describe exactly this kind of setup/teardown pattern. RDDs and DataFrames, on the other hand, don't have a simple open/close lifecycle.
Right, I wouldn't expect it to error with subclassing - just not pipeline successfully - but only in a very long-shot corner case. I think the try/finally with persistence is not an uncommon pattern (we have something similar happen frequently inside of Spark ML/MLlib, but it's in Scala code).
@nchammas a utility method (e.g. `cached()`) would sidestep those issues.
After looking at it and considering all the above, I would say the options are (1) do nothing; or (2) if we want to support this use case, implement a single utility method (I would say called `persisted`). Even though we could almost achieve things with subclassing, we do sort of break something and add too much risk/complexity vs the reward of the feature, IMHO.
@MLnick - Couldn't we also create a scenario (like @holdenk did earlier) where a user does something like this?

```python
persisted_rdd = persisted(rdd)
persisted_rdd.map(...).filter(...).count()
```

This would break pipelining too, no? And I think the expectation would be for it not to break pipelining, because existing common context managers in Python don't have a requirement that they must be used in a `with` block. For example, file objects work fine outside a `with` statement too.
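That property of ordinary context managers can be demonstrated with the standard library alone (a small sketch; the temporary-file dance is just to make it self-contained):

```python
import os
import tempfile

# File objects implement the context-manager protocol, but nothing
# requires them to be used in a `with` statement.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")    # used entirely outside a `with` block
f.write("hello")
f.close()              # cleanup done manually instead of via __exit__

with open(path) as g:  # the same type also works inside `with`
    assert g.read() == "hello"

os.remove(path)
```

The expectation being voiced here is that `persisted()` objects would behave the same way: usable with or without `with`.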
@nchammas to be clear - subclassing only breaks pipelining if the persisted_rdd is later unpersisted (e.g. used with a `with` block that unpersists it on exit).
Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for, so the examples are going a bit over my head. 😅)
Sure - so at a high level and not exactly on point: copying data between Python and the JVM is kind of slow, so if we have multiple Python operations we can put together in the same task then we try to do this. Since caching is handled by storing the data inside the JVM, a cached RDD can't be pipelined over, since we need to copy its result to the JVM. You can see the details in `PipelinedRDD` in rdd.py.
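The fusion idea can be illustrated with a toy sketch in plain Python (the `pipeline` helper is invented for illustration and is not PySpark's actual mechanism): consecutive map steps are composed into one function and applied in a single pass, rather than materializing an intermediate result between them.

```python
def pipeline(*funcs):
    """Compose funcs left-to-right into a single function."""
    def fused(x):
        for f in funcs:
            x = f(x)
        return x
    return fused

data = [1, 2, 3]
add_one = lambda x: x + 1
double = lambda x: x * 2

# Unfused: two full passes, with an intermediate list in between.
intermediate = [add_one(x) for x in data]
unfused = [double(x) for x in intermediate]

# Fused: one pass, no intermediate collection -- analogous to how
# consecutive Python transformations get chained within a single task.
fused = [pipeline(add_one, double)(x) for x in data]

assert fused == unfused == [4, 6, 8]
```

Caching breaks this fusion because the cached intermediate must actually be materialized (in PySpark's case, copied to the JVM) instead of being streamed through the composed function.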
Thanks for the quick overview. That's pretty straightforward, actually! I'll take a look at `PipelinedRDD`.
@nchammas to answer your question above (#14579 (comment)) - in short, no. The semantics of the utility method would be the same as for this class:

```python
class persisted:
    def __init__(self, thing):
        self.thing = thing

    def __enter__(self):
        return self.thing

    def __exit__(self, *exc_info):
        self.thing.unpersist()
```

If someone tries to use the wrapped object outside the `with` block, the same caveats apply. The "file-like" version is what is currently implemented in this PR, and it works as long as the persisted object is only used inside the `with` block.
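A quick end-to-end sketch of that wrapper, restated here with a stand-in object (`FakeRDD` is invented purely for illustration; a real PySpark RDD would take its place):

```python
class persisted:
    """Context-manager wrapper around an already-persisted object."""
    def __init__(self, thing):
        self.thing = thing
    def __enter__(self):
        return self.thing
    def __exit__(self, *exc_info):
        self.thing.unpersist()

class FakeRDD:
    """Stand-in for a PySpark RDD; only tracks persistence state."""
    def __init__(self):
        self.is_cached = False
    def persist(self):
        self.is_cached = True
        return self
    def unpersist(self):
        self.is_cached = False
        return self

rdd = FakeRDD().persist()
with persisted(rdd) as r:
    assert r.is_cached       # usable while the block is active
assert not rdd.is_cached     # unpersisted automatically on exit
```

Note that `__enter__` returns the wrapped object itself, so code inside the block works with the ordinary RDD/DataFrame API.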
```diff
@@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
         self._id = jrdd.id()
         self.partitioner = None

+    def __enter__(self):
```
Is it reasonable just to raise an error saying that the context manager is meant to work only with cached RDDs (DataFrames) if `self.is_cached` is not set to `True`? That would solve problems such as the usage of `with rdd.map(lambda x: x) as x:`.
We could do it - but users can still then do `with rdd.cache().map(...) as x:` and it would be valid. So it doesn't fully solve the issue.
Is that true? Doesn't it call `__enter__` on the instance of `rdd.cache().map(...)`, where `is_cached` is set to `False`? Quick verification:

```python
def __enter__(self):
    if self.is_cached:
        return self
    else:
        raise ValueError("r")

with rdd.cache().map(lambda x: x) as t:
    pass
```

raises a `ValueError`.
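The guard can be demonstrated end-to-end with stand-in classes (`FakeRDD` is invented here for illustration; what matters is that, as in PySpark, a transformation like `map` returns a new, uncached instance):

```python
class FakeRDD:
    """Stand-in for an RDD with an is_cached guard in __enter__."""
    def __init__(self, is_cached=False):
        self.is_cached = is_cached

    def cache(self):
        self.is_cached = True
        return self

    def map(self, f):
        # Like PySpark, a transformation yields a fresh, uncached RDD.
        return FakeRDD(is_cached=False)

    def __enter__(self):
        if self.is_cached:
            return self
        raise ValueError("context manager only works on a cached RDD")

    def __exit__(self, *exc_info):
        self.is_cached = False

rdd = FakeRDD()

# Intended usage: entering a cached RDD works.
with rdd.cache() as c:
    assert c.is_cached

# Misuse: map() returns a new, uncached RDD, so __enter__ raises.
caught = False
try:
    with rdd.cache().map(lambda x: x) as t:
        pass
except ValueError:
    caught = True
assert caught
```

So even `rdd.cache().map(...)` is rejected, because `__enter__` runs on the uncached result of `map`, not on the cached source.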
That's an interesting approach @MechCoder - I think it could be a good way to clarify to users how the context manager is expected to be used.
Yes, also known as the "if you don't know what to do, raise an error" approach :p
Hmmm, yes, this does happen to work, because most operations boil down to something like `mapPartitions`, which creates a new `PipelinedRDD` that is not cached, or a new `RDD` which is again not cached. I think it will work for `DataFrame` too, for a similar reason - most operations return a new `DataFrame` instance.
@nchammas @holdenk @davies @rxin how about the approach of @MechCoder in #14579 (comment)? I think this will work well, so we could raise an error to prevent (almost all, I think) usages outside of the intended pattern of entering the `with` block on an already-cached RDD/DataFrame.
I like it personally - if no one has a good reason why not, it seems like a very reasonable approach.

Looks good to me. 👍

@MLnick still interested in updating this? (Just looking over the older Python PRs) :)

Yup! Just been snowed under :( but will update with the approach above asap.

Just pinging to see how it's going?

I hope to get to this soon - it's just the test cases that I need to get to also!

Just a gentle ping - would be cool to add this for 2.1 if we have the time :)

Is there any reason why it is not merged yet? I personally like this too.

Do you have time to update this @MLnick, or maybe would it be OK if someone else made an updated PR based on this? It would be a nice feature to have for 2.2 :)

Gentle ping.

@MLnick Hi, are you still working on this? If so, could you fix conflicts and update please?

@MLnick - or if you don't have a chance, would it be OK for us to find someone (perhaps someone new to the project) to take this over and bring it to the finish line?

Hey all - sorry I haven't been able to focus on this. It shouldn't be tough to do, but it will need some tests. If we can find someone who wants to take it over, I think it makes a decent starter task :)
JIRA: https://issues.apache.org/jira/browse/SPARK-16921
Context managers are a natural way to capture closely related setup and teardown code in Python. It can be useful to apply this pattern to persisting/unpersisting RDDs and DataFrames.
This PR makes RDDs and DataFrames implement the context manager `__enter__` and `__exit__` methods, allowing `persist()`/`cache()` to be used within a `with` block.

How was this patch tested?

New doc tests.