[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files #6901
Conversation
Test build #35260 has finished for PR 6901 at commit
LGTM. Eventually we want to address this behavior by forcing a periodic GC (once every 30 minutes or so; it should be inexpensive). For now this is a better description to have. Merging into master, 1.4, and 1.3.
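For context, here is a minimal sketch of the periodic-GC idea mentioned above. It is not Spark's actual implementation; the object name is hypothetical, and it simply schedules a GC hint on the driver JVM so that driver-side references to old RDDs get collected and the associated shuffle files become eligible for cleanup.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: periodically hint a JVM GC on the driver. When driver-side RDD
// references are collected, Spark's ContextCleaner can then remove the
// corresponding shuffle files. The 30-minute interval follows the suggestion
// in the comment above; the object name is made up for illustration.
object PeriodicDriverGc {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(intervalMinutes: Long = 30L): Unit = {
    val task = new Runnable {
      override def run(): Unit = System.gc() // a hint; the JVM may ignore it
    }
    scheduler.scheduleAtFixedRate(task, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
  }

  def stop(): Unit = scheduler.shutdown()
}
```

Later Spark releases added a built-in setting along these lines (`spark.cleaner.periodicGC.interval`), so an application-level workaround like this sketch is normally unnecessary.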
…park apps to preserve shuffle files

Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <[email protected]>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0)
Signed-off-by: Andrew Or <[email protected]>
@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
I know this has been merged, but an annoying issue that I have found in docs (including mine, so I am guilty too) is the use of "as of Spark X". No one remembers to search for this pattern and it never gets updated. Rather, we should use markdown variables: "as of Spark {{site.SPARK_VERSION_SHORT}}".
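For illustration only, here is the line from the diff above rewritten hypothetically with the site variable the comment suggests (this is not the wording that was merged):

```
Shuffle also generates a large number of intermediate files on disk. As of Spark {{site.SPARK_VERSION_SHORT}}, these files ...
```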
In this case I think the sense was '... in 1.3 and not before', so it can stay as is. Yes, in cases where the meaning is '... as of the latest version, which is currently 1.3, and maybe beyond' then it makes sense to introduce a replacement, or just remove the text altogether.
Oh! I thought you meant it as the latter ... "as of the latest version". This is a little confusing. :/
Maybe it makes sense to remove it completely. The GC-based behavior has been present for 4 versions now, since Spark 1.0, and it's not going to change in the foreseeable future. So it's best to remove it. The only thing that may change in Spark 1.5 is that we induce GC periodically ourselves.
I agree it could be removed too, even if it probably doesn't matter at this point since we are well beyond 1.3.