[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files #6901

Closed · wants to merge 1 commit
docs/programming-guide.md (5 additions, 3 deletions)
@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
Contributor:

I know this has been merged, but an annoying issue that I have found in docs (including mine, so I am guilty too) is the use of `as of Spark X`. No one remembers to search for this pattern, so it never gets updated. Rather, we should use markdown variables: `as of Spark {{site.SPARK_VERSION_SHORT}}`.

Member Author:

In this case I think the sense was '... in 1.3 and not before', so it can stay as is. Yes, in cases where the meaning is '... as of the latest version, which is currently 1.3, and maybe beyond', it makes sense to introduce a replacement, or just remove the text altogether.

Contributor:

Oh! I thought you meant it as the latter ... "as of the latest version". This is a little confusing. :/
Maybe it makes sense to remove it completely. The GC-based behavior has been present for 4 versions now, since Spark 1.0, and it's not going to change in the foreseeable future. So it's best to remove it. The only thing that may change in Spark 1.5 is that we induce GC periodically ourselves.
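
For context on what "inducing GC periodically" could look like, here is a minimal driver-side sketch: a daemon thread that requests a full GC on a fixed schedule, so that RDDs the application no longer references get collected promptly and their shuffle files become eligible for cleanup. The object name and interval parameter are made up for the example; this is an illustration, not Spark's actual implementation.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Hypothetical helper, not part of Spark: request a full GC at a fixed
// interval so unreferenced RDDs are collected even when the driver JVM
// is otherwise idle.
object PeriodicGC {
  def start(intervalMinutes: Long): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "periodic-gc")
        t.setDaemon(true) // don't keep the JVM alive just for this thread
        t
      }
    })
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc() // a request only; the JVM may ignore it
    }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
  }
}
```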

Member Author:

I agree it could be removed too, even if it probably doesn't matter at this point since we are well beyond 1.3.

- are not cleaned up from Spark's temporary storage until Spark is stopped, which means that
- long-running Spark jobs may consume available disk space. This is done so the shuffle doesn't need
- to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the
+ are preserved until the corresponding RDDs are no longer used and are garbage collected.
+ This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
+ Garbage collection may happen only after a long period of time, if the application retains references
+ to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
+ consume a large amount of disk space. The temporary storage directory is specified by the
+ `spark.local.dir` configuration parameter when configuring the Spark context.

Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
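
To make the new wording concrete, here is a minimal sketch of the two pieces the added paragraph mentions: `spark.local.dir` controls where shuffle files are written, and dropping the last reference to an RDD is what eventually lets GC (and with it Spark's cleanup) reclaim those files. The app name, paths, and input are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleCleanupExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-cleanup-example")
      // Put Spark's scratch space (including shuffle files) on a disk
      // with enough room; hypothetical path.
      .set("spark.local.dir", "/mnt/spark-tmp")
    val sc = new SparkContext(conf)

    var counts = sc.textFile("hdfs:///data/words") // hypothetical input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // this stage writes shuffle files under spark.local.dir

    println(counts.count())

    // The shuffle files for `counts` are preserved while the driver still
    // references the RDD. Dropping the reference makes it collectible, and
    // once GC runs, Spark can remove the corresponding shuffle files.
    counts = null

    sc.stop()
  }
}
```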