CSV Source
The CSV source is a specialized form of the File Source that provides a set of CSV parsing options for use with Apache Spark, Spark SQL, and DataFrames. As always, the path
option of the file source must be configured. Please review the File Source if you are not already familiar with it.
When a program stores a lot of data, it may use delimiters to separate the data values from each other. A delimiter is one or more characters used to separate text strings. The most common delimiters are commas (,), quotes (", '), braces ({}), slashes (/), and pipes (|). The delimiter option sets the field separator; it defaults to a comma (,).
CREATE STREAM data
FROM CSV
OPTIONS(
'delimiter' = '|'
);
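The CREATE STREAM syntax above is the actual interface; as an illustration of what the delimiter option does, here is a minimal Python sketch using the standard csv module (the sample data is hypothetical):

```python
import csv
import io

# Illustrative only: Python's csv module mimicking the 'delimiter' option,
# with '|' as the field separator instead of the default comma.
raw = "id|name|city\n1|Alice|Paris\n2|Bob|Lyon\n"
rows = list(csv.reader(io.StringIO(raw), delimiter="|"))
print(rows[1])  # ['1', 'Alice', 'Paris']
```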
The mode option determines how the CSV source behaves when corrupt records are encountered. By default it is PERMISSIVE.
- PERMISSIVE: tries to parse all lines; nulls are inserted for missing tokens and extra tokens are ignored.
- DROPMALFORMED: drops lines that have fewer or more tokens than expected, or tokens that do not match the schema.
- FAILFAST: aborts with a RuntimeException if any malformed line is encountered.
CREATE STREAM foo
FROM CSV
OPTIONS(
'mode'='failfast'
);
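The three modes can be sketched in a few lines of Python. This is not the engine's actual code, just an illustration of the decision each mode makes for a malformed line:

```python
# Illustrative sketch of the three parse modes for a line with the wrong
# number of comma-separated tokens.
def parse_line(line, n_cols, mode="PERMISSIVE"):
    tokens = line.split(",")
    if len(tokens) == n_cols:
        return tokens
    if mode == "PERMISSIVE":
        # pad missing tokens with None, ignore extra tokens
        return (tokens + [None] * n_cols)[:n_cols]
    if mode == "DROPMALFORMED":
        return None  # record is silently dropped
    raise RuntimeError(f"malformed line: {line!r}")  # FAILFAST

print(parse_line("a,b", 3))                   # ['a', 'b', None]
print(parse_line("a,b", 3, "DROPMALFORMED"))  # None
```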
The charset option defaults to 'UTF-8' but can be set to any other valid charset name. Avoid relying on the platform default encoding: it varies between environments and leads to inconsistent behavior in Spark.
CREATE STREAM data
FROM CSV
OPTIONS(
'charset'='UTF-8'
);
The quote option sets the character used to enclose string values. By default the quote character is ", but it can be set to any character. Delimiters inside quoted values are ignored.
"a|b" // the pipe (|) inside the quotes is not treated as a delimiter
CREATE STREAM data
FROM CSV
OPTIONS(
'quote'='"'
);
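As an illustration of quote handling, here is a Python sketch using the standard csv module with a hypothetical record: the comma inside the quoted value does not split the field.

```python
import csv
import io

# Illustrative: a delimiter inside a quoted value is part of the value,
# not a field separator.
raw = '1,"Doe, John",NY\n'
row = next(csv.reader(io.StringIO(raw), quotechar='"'))
print(row)  # ['1', 'Doe, John', 'NY']
```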
The escape option sets the character used to escape quotes inside an already quoted value. By default the escape character is \, but it can be set to any character. Escaped quote characters are ignored. A character with a Unicode value between 0 and 255 may also be represented by an octal escape, i.e. a backslash followed by a sequence of up to three octal digits. A backslash that does not start a valid escape sequence is an error.
CREATE STREAM data
FROM CSV
OPTIONS(
'escape'='\'
);
The charToEscapeQuoteEscaping option defines the character used to escape the quote-escape character itself (the one set by the escape option). If this character is undefined and the input contains an escaped escape character followed by a quote, the record cannot be parsed: the parser reads the characters [a], [\], ["], [,], [ ], [b] and throws an error because it cannot find a closing quote. Defaults to '\0' (undefined).
CREATE STREAM data
FROM CSV
OPTIONS(
'charToEscapeQuoteEscaping'='/'
);
The comment option sets the single character used to skip lines beginning with that character, for example to hide commentary or program code details in a data file. Comments are disabled by default (the option is an empty string); set it to a character such as # to enable them.
CREATE STREAM data
FROM CSV
OPTIONS(
'comment'='#'
);
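Comment skipping amounts to filtering lines before they reach the parser. A minimal Python sketch of that behavior, with hypothetical input:

```python
import csv

# Illustrative: lines beginning with the comment character are skipped
# before parsing.
raw = "# generated by a hypothetical export job\n1,Alice\n# a note\n2,Bob\n"
comment = "#"
lines = [ln for ln in raw.splitlines() if not ln.startswith(comment)]
rows = list(csv.reader(lines))
print(rows)  # [['1', 'Alice'], ['2', 'Bob']]
```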
The headerFlag option indicates whether the first line of the file contains the column names (and, on write, whether the column names are written as the first line). Default is false.
CREATE STREAM data
FROM CSV
OPTIONS(
'headerFlag'='false'
);
The inferSchemaFlag option automatically infers column types. It requires one extra pass over the data and is false by default. The value must be a boolean; anything other than true or false raises an exception.
CREATE STREAM data
FROM CSV
OPTIONS(
'inferSchemaFlag'='true'
);
The ignoreLeadingWhiteSpaceInRead option is a flag indicating whether leading whitespace should be trimmed from values being read from a CSV file. Default is true. The corresponding option for write operations is ignoreLeadingWhiteSpaceFlagInWrite.
CREATE STREAM data
FROM CSV
OPTIONS(
'ignoreLeadingWhiteSpaceInRead'='true'
);
The ignoreTrailingWhiteSpaceInRead option is a flag indicating whether trailing whitespace should be trimmed from values being read. Default is true. The corresponding option for write operations is ignoreTrailingWhiteSpaceFlagInWrite.
CREATE STREAM data
FROM CSV
OPTIONS(
'ignoreTrailingWhiteSpaceInRead'='true'
);
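The two whitespace flags are independent trims on each side of a value. A minimal Python sketch of the combined behavior:

```python
# Illustrative: independently trim leading and trailing whitespace
# from parsed values, mirroring the two read-side flags above.
def trim(values, leading=True, trailing=True):
    out = []
    for v in values:
        if leading:
            v = v.lstrip()
        if trailing:
            v = v.rstrip()
        out.append(v)
    return out

print(trim(["  a", "b  ", "  c  "]))  # ['a', 'b', 'c']
```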
Specifies an optional column name to capture the values of corrupt records. Behavior varies based on the configured mode option. This is typically useful when you want to route valid records to one location and corrupt records to another, while capturing the original corrupt value. The column that captures the corrupt record value must be declared in the schema and must be a string column.
{...}
{"key":"value"}
CREATE STREAM foo(key string, corruptData string)
FROM CSV
OPTIONS(
'columnNameOfCorruptRecord'='corruptData'
);
-- contains {"key":"value"}
CREATE TEMPORARY VIEW validRecords AS
SELECT * FROM foo WHERE corruptData IS NULL;
-- contains {...}
CREATE TEMPORARY VIEW corruptRecords AS
SELECT * FROM foo WHERE corruptData IS NOT NULL;
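The routing above can be sketched outside SQL as well. This hypothetical Python fragment shows the idea: a record that fails to parse keeps its raw line in the corrupt-record field, and the two views are just filters on that field:

```python
# Illustrative: capture the raw value of unparseable records in a
# corrupt-record field, then split the stream on that field.
def parse(line, n_cols):
    tokens = line.split(",")
    if len(tokens) == n_cols:
        return {"fields": tokens, "corruptData": None}
    return {"fields": None, "corruptData": line}  # keep the original value

records = [parse(ln, 2) for ln in ["a,b", "oops", "c,d"]]
valid = [r for r in records if r["corruptData"] is None]
corrupt = [r for r in records if r["corruptData"] is not None]
print(len(valid), len(corrupt))  # 2 1
```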
Specifies a string that indicates a null value; any fields matching this string will be set to null in the DataFrame.
CREATE STREAM data
FROM CSV
OPTIONS(
'nullValue'=''
);
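The nullValue substitution is a simple per-field comparison. A minimal Python sketch, with hypothetical values:

```python
# Illustrative: any field equal to the configured null string becomes None.
def apply_null_value(values, null_value=""):
    return [None if v == null_value else v for v in values]

print(apply_null_value(["a", "", "b"]))     # ['a', None, 'b']
print(apply_null_value(["a", "NA"], "NA"))  # ['a', None]
```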
The nanValue option sets the string that represents a NaN value. NaN stands for "Not a Number"; it is usually the result of a mathematical operation that does not make sense, e.g. 0.0/0.0.
CREATE STREAM data
FROM CSV
OPTIONS(
'nanValue'='NaN'
);
The positiveInf and negativeInf options set the string representations of positive and negative infinity for double values. The defaults are Inf
for positiveInf and -Inf
for negativeInf
CREATE STREAM data
FROM CSV
OPTIONS(
'positiveInf'='Inf'
);
The compression codec option specifies the codec to use when saving to file. It should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec, or one of the case-insensitive short names (bzip2, gzip, lz4, and snappy). It defaults to no compression when a codec is not specified. The supported values are based on the Spark compression codecs.
CREATE STREAM data
FROM CSV
OPTIONS(
'CompressionCodecs'='gzip'
);
Specifies the time zone used to interpret date and time values during processing. This is typically useful for interactive use cases where a human or a report wishes to interpret time information in a specific time zone. Data processing pipelines should always deal with UTC values so they can be interpreted by presentation layers on a case-by-case basis.
Defaults to the system's default time zone (see java.util.TimeZone.getDefault).
The dateFormat option sets the string that indicates the format to use when reading dates. Custom date formats follow the patterns of java.text.SimpleDateFormat. This applies to DateType. Default is yyyy-MM-dd
{"dt":"01-05-2018"}
CREATE STREAM foo
FROM CSV
OPTIONS(
'dateFormat'='yyyy-MM-dd'
);
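A value that does not match the configured pattern (such as the 01-05-2018 example above) fails to parse. This Python sketch illustrates the behavior; note that Java's SimpleDateFormat pattern yyyy-MM-dd corresponds to Python's %Y-%m-%d:

```python
from datetime import datetime

# Illustrative: parse a date with the configured format.
# SimpleDateFormat "yyyy-MM-dd" maps to strptime "%Y-%m-%d".
d = datetime.strptime("2018-01-05", "%Y-%m-%d").date()
print(d.year, d.month, d.day)  # 2018 1 5

# A value like "01-05-2018" does not match the format and fails to parse.
try:
    datetime.strptime("01-05-2018", "%Y-%m-%d")
except ValueError:
    print("unparseable")
```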
The timestampFormat option sets the string that indicates a timestamp format. Custom date formats follow the patterns of java.text.SimpleDateFormat. This applies to TimestampType. The default is yyyy-MM-dd'T'HH:mm:ss.SSSXXX
// value: 2018-01-03 18:41:42
// interpreted: 2018-01-03 18:41:00
{"ts": 1515004902}
CREATE STREAM foo
FROM CSV
OPTIONS(
'timestampFormat'='yyyy-MM-dd''T''HH:mm:ss.SSSXXX'
);
Specifies whether files are multi-line CSV documents. Typically a CSV file contains one record per physical line. However, a quoted value may contain line breaks, in which case a single record spans multiple lines. In these cases, multiLine should be set to true.
Defaults to false.
1,"a value that
spans two lines"
CREATE STREAM foo
FROM CSV
OPTIONS(
'multiLine'='true'
);
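A quote-aware parser handles such records correctly, while naive line splitting would see three records where there are two. A minimal Python sketch with hypothetical input:

```python
import csv
import io

# Illustrative: a quoted value containing a newline makes one record span
# two physical lines; the csv parser still returns two records.
raw = '1,"first line\nsecond line"\n2,plain\n'
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))   # 2 records, although the input has 3 physical lines
print(rows[0][1])  # the embedded newline is preserved inside the value
```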
The maxColumns option defines the maximum number of columns allowed in the CSV file being processed.
CREATE STREAM data
FROM CSV
OPTIONS(
'maxColumns'='20480'
);
Defines the maximum number of characters allowed for any given value being read. The default value is 1000000.
CREATE STREAM data
FROM CSV
OPTIONS(
'maxCharsPerColumn'='1000000'
);
The escapeQuotes option is a flag indicating whether values containing quotes should always be enclosed in quotes when writing to a CSV file. The default is true, i.e. all values containing a quote character are escaped.
CREATE STREAM data
FROM CSV
OPTIONS(
'escapeQuotes'='true'
);
The quoteAll option is a flag indicating whether all values should always be enclosed in quotes when writing. The default is false, i.e. only values containing a quote character are quoted.
CREATE STREAM data
FROM CSV
OPTIONS(
'quoteAll'='false'
);
The inputBufferSize option sets the size of the parser's input buffer. Default is 128.
CREATE STREAM data
FROM CSV
OPTIONS(
'inputBufferSize'='128'
);