
[SPARK-18658][SQL] Write text records directly to a FileOutputStream #16089

Closed
wants to merge 7 commits

Conversation

NathanHowell

What changes were proposed in this pull request?

This replaces uses of TextOutputFormat with an OutputStream, which will either write directly to the filesystem or indirectly via a compressor (if so configured). This avoids intermediate buffering.

The inverse of this (reading directly from a stream) is necessary for streaming large JSON records (when wholeFile is enabled) so I wanted to keep the read and write paths symmetric.
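
A minimal sketch of the idea (illustrative only; the method name and default codec here are assumptions, loosely following the diff excerpts later in this thread rather than the exact patch):

```scala
import java.io.OutputStream

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.util.ReflectionUtils

// Open the target file through the Hadoop FileSystem API and, when output
// compression is configured on the job, wrap the raw stream in the codec.
def createOutputStream(context: TaskAttemptContext, file: Path): OutputStream = {
  val fs = file.getFileSystem(context.getConfiguration)
  val raw: OutputStream = fs.create(file, false) // throws if the file already exists

  if (FileOutputFormat.getCompressOutput(context)) {
    val codecClass = FileOutputFormat.getOutputCompressorClass(context, classOf[GzipCodec])
    val codec = ReflectionUtils.newInstance(codecClass, context.getConfiguration)
    codec.createOutputStream(raw) // compressed records go straight to the file stream
  } else {
    raw // uncompressed records are written directly to the filesystem
  }
}
```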

How was this patch tested?

Existing unit tests.

@NathanHowell
Author

This touches a fair number of components. I also haven't done any performance testing to see what the impact is. Curious what your thoughts are.

cc/ @marmbrus @rxin @JoshRosen

@rxin
Contributor

rxin commented Nov 30, 2016

Does this work against file systems with the HDFS API (not local POSIX)? If yes, sounds good!

@NathanHowell
Author

Yep. It uses the Hadoop FileSystem class to open files, just like TextOutputFormat does.

@rxin
Contributor

rxin commented Dec 1, 2016

Yea then this is definitely fine.

@NathanHowell NathanHowell force-pushed the SPARK-18658 branch 4 times, most recently from 1260870 to 298e507 on December 1, 2016 02:06
@JoshRosen
Contributor

Jenkins, this is ok to test

@@ -194,4 +194,8 @@ private[sql] class JacksonGenerator(
writeFields(row, schema, rootFieldWriters)
}
}

private[sql] def writeLineEnding(): Unit = {
gen.writeRaw('\n')
Member

Hm, is it safe to assume the newline is always \n?

buffer.set(utf8string.getBytes)
recordWriter.write(NullWritable.get(), buffer)
writer.write(utf8string.getBytes)
writer.write('\n')
Member

Here too and for CSV as well.


import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.util.ReflectionUtils

private[spark] object CodecStreams {
Member

I guess we don't need private[spark] here, judging from the commit here (511f52f).

Author

Looks that way, I've removed it.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69449 has finished for PR 16089 at commit 298e507.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Author

Doh, forgot to run the Hive tests. Should be fixed now.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69457 has finished for PR 16089 at commit 56667bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

/** Create a new file and open it for writing.
Contributor

Can you fix the comment style here? We don't use Scaladoc-style comments in Spark.
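
Judging from the before and after excerpts quoted elsewhere in this thread, the requested change appears to be moving the description off the opening `/**` line; a small illustration (not the actual diff):

```scala
// Before: the description starts on the same line as the opening /**.
/** Create a new file and open it for writing.
 * ...
 */

// After: the description starts on its own line, matching the later revision.
/**
 * Create a new file and open it for writing.
 * ...
 */
```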

@@ -194,4 +194,8 @@ private[sql] class JacksonGenerator(
writeFields(row, schema, rootFieldWriters)
}
}

private[sql] def writeLineEnding(): Unit = {
Contributor

remove private[sql] here to be consistent with other methods

val fs = file.getFileSystem(context.getConfiguration)
val outputStream: OutputStream = fs.create(file, false)

getCompressionCodec(context, Some(file)).fold(outputStream) { codec =>
Contributor

I know you like Haskell but every time I see a fold I have to spend an extra 5 secs checking what it is doing :) Can we simplify this one?

Author

Yah, it's a terrible name (and it's not a fold). I'll replace them with .map(...).getOrElse.
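
A small sketch of the suggested rewrite (assuming the fold body simply wraps the raw stream in the codec, which the quoted diff does not show):

```scala
// Before: Option.fold takes the default value first, then the mapping function.
getCompressionCodec(context, Some(file)).fold(outputStream) { codec =>
  codec.createOutputStream(outputStream)
}

// After: map/getOrElse reads in the more conventional order.
getCompressionCodec(context, Some(file))
  .map(codec => codec.createOutputStream(outputStream))
  .getOrElse(outputStream)
```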

* If compression is enabled in the [[JobContext]] the stream will write compressed data to disk.
* An exception will be thrown if the file already exists.
*/
def getOutputStream(context: JobContext, file: Path): OutputStream = {
Contributor

getOutputStream -> createOutputStream

}
}

def getOutputStreamWriter(
Contributor

same here - createOutputStreamWriter


/** Returns the compression codec extension to be used in a file name, e.g. ".gzip"). */
def getCompressionExtension(context: JobContext): String = {
getCompressionCodec(context).fold("") { code =>
Contributor

maybe

getCompressionCodec(context).map(_.getDefaultExtension).getOrElse("")


override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal")

override protected[sql] def writeInternal(row: InternalRow): Unit = {
val utf8string = row.getUTF8String(0)
buffer.set(utf8string.getBytes)
recordWriter.write(NullWritable.get(), buffer)
writer.write(utf8string.getBytes)
Contributor (@rxin, Dec 1, 2016)

can you check UTF8String's implementation to make sure we are not creating a new byte array for each row?

Author

It is creating a new array, I'll pass the internal one through instead.

Author

Done, but I'm not 100% sure about the implementation. Can you have someone more familiar with UTF8String's internals double check it?
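
A sketch of the change being discussed (not the exact diff; `writer` is the text writer's OutputStream and `utf8string` comes from the row, as in the excerpt above):

```scala
// Before: getBytes can allocate and copy a fresh byte[] for every row.
writer.write(utf8string.getBytes)
writer.write('\n')

// After: the writeTo(OutputStream) overload added in this PR writes the
// backing bytes to the stream without the intermediate copy.
utf8string.writeTo(writer)
writer.write('\n')
```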

text.set(lines)
recordWriter.write(NullWritable.get(), text)
}
csvWriter.writeRow(rowToString(row), printHeader)
Contributor

We can probably optimize this as well - but it's not a huge deal in the first PR.

Author

The uniVocity CSV writer converts every column to a String before writing so it's (probably?) not possible to further optimize this without doing a whole bunch of work. I only did a quick scan through their code though.

Contributor

Yup, I asked this question but never had time to follow up: uniVocity/univocity-parsers#99

@srowen
Member

srowen commented Dec 1, 2016

I was going to say: hm, are we sure we want to reimplement / go around the Hadoop support for this? But in practice it looks like it actually simplifies some things. At the moment I can't think of any particular behaviors we're missing by avoiding the Input/OutputFormat.

But CC @vanzin @steveloughran for any comment.

/**
* Create a new file and open it for writing.
* If compression is enabled in the [[JobContext]] the stream will write compressed data to disk.
* An exception will be thrown if the file already exists.
Contributor

will "probably" be thrown; object stores have issues there

Author

Is this a problem with Hadoop in general? The FileSystem docs also specify this behavior:

  /**
   * Create an FSDataOutputStream at the indicated Path.
   * @param f the file to create
   * @param overwrite if a file with this name already exists, then if true,
   *   the file will be overwritten, and if false an exception will be thrown.
   */

@@ -194,4 +194,8 @@ private[sql] class JacksonGenerator(
writeFields(row, schema, rootFieldWriters)
}
}

def writeLineEnding(): Unit = {
gen.writeRaw('\n')
Contributor

TextOutputFormat actually writes the UTF-8 version of a newline; I don't know if that is relevant or not:

"\n".getBytes(StandardCharsets.UTF_8);

Author

7-bit ASCII is a subset of UTF-8; \n is the same in both.

Contributor

Presumably it's just a convoluted way to create a byte array containing the byte 0x0a, then.

Author

That is my assumption. I'm also assuming that writing a single byte is slightly more efficient than writing an array of a single byte.
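
A small illustration of the two equivalent writes (`out` is any java.io.OutputStream; both put the single byte 0x0A on the wire):

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets

val out: OutputStream = new ByteArrayOutputStream()

// TextOutputFormat's approach: materialize a one-element byte array, then write it.
out.write("\n".getBytes(StandardCharsets.UTF_8))

// This PR's approach: write the byte directly; '\n' is 0x0A in both
// 7-bit ASCII and UTF-8, so the output is identical.
out.write('\n')
```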

@@ -147,6 +147,17 @@ public void writeTo(ByteBuffer buffer) {
buffer.position(pos + numBytes);
}

public void writeTo(OutputStream out) throws IOException {
Contributor

Always good to have tests for the corner-case code paths here, as they are invariably the official home of off-by-one errors.

Author

Agreed.

Author

I've added a few tests for this method.

@steveloughran
Contributor

AFAIK, the big thing the FileOutputFormat really adds is not the compression, but the output committer and the stuff to go with that (working directories, paths, etc.). If you aren't going near that, and just want a fast write of .csv and Jackson with optional compression, well, I don't see anything in the code I'd run away from.

If you do want to think about how to write CSV files during the output of speculative work in the presence of failures, well, that's where the mapred.lib.output code really comes out to play.

Otherwise, in general PR review mode: tests? What if the code asks for a committer that isn't there, passes in null sequences in rows to write, or tries to hit the buffer corner cases? Hopefully those exist already, but if not, now is a good time to try to break things.

@NathanHowell
Author

@steveloughran Spark is handling the output committing somewhere further up the stack. The path being passed in to OutputWriterFactory.newInstance is to a temporary file, such as /private/var/folders/sq/vmncyd7506q_ch43llrwr8sn6zfknl/T/spark-3db2844b-1f3c-45c2-8bf4-8a3c81440e38/_temporary/0/_temporary/attempt_20161201081833_0000_m_000000_0/part-00000-8dd44cea-c01e-4bfe-ab03-641ebce18afb.txt.

I'll make a pass through the existing tests to see if anything obvious is missing.

@steveloughran
Contributor

I ask about committers as I'm staring at the V1 and V2 committer APIs right now related to S3 destinations; not directly related to this though.

@rxin
Contributor

rxin commented Dec 1, 2016

@srowen yea, the Hadoop format API is pretty awkward to use, and actually makes everything more complicated than needed.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69488 has finished for PR 16089 at commit 5707218.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2016

Test build #69519 has finished for PR 16089 at commit 27c102d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 2, 2016

Merging in master. Thanks.

@asfgit asfgit closed this in c82f16c Dec 2, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
@NathanHowell NathanHowell deleted the SPARK-18658 branch December 2, 2016 18:53
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 6, 2018
… not expected to be supported

## What changes were proposed in this pull request?

This PR excludes an existing UT [`writeToOutputStreamUnderflow()`](https://github.com/apache/spark/blob/master/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java#L519-L532) in `UTF8StringSuite`.

As discussed [here](apache#19222 (comment)), the behavior of this test looks surprising. The test seems to access the metadata area of the JVM object, which is reserved by `Platform.BYTE_ARRAY_OFFSET`.

This test was introduced through apache#16089 by NathanHowell, more specifically by [the commit](apache@27c102d) `Improve test coverage of UTFString.write`. However, I cannot find any discussion about this UT.

I think it would be good to exclude this UT.

```java
  public void writeToOutputStreamUnderflow() throws IOException {
    // offset underflow is apparently supported?
    final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    final byte[] test = "01234567".getBytes(StandardCharsets.UTF_8);

    for (int i = 1; i <= Platform.BYTE_ARRAY_OFFSET; ++i) {
      new UTF8String(
        new ByteArrayMemoryBlock(test, Platform.BYTE_ARRAY_OFFSET - i, test.length + i))
          .writeTo(outputStream);
      final ByteBuffer buffer = ByteBuffer.wrap(outputStream.toByteArray(), i, test.length);
      assertEquals("01234567", StandardCharsets.UTF_8.decode(buffer).toString());
      outputStream.reset();
    }
  }
```

## How was this patch tested?

Existing UTs

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#20995 from kiszk/SPARK-23882.
robert3005 pushed a commit to palantir/spark that referenced this pull request Apr 7, 2018