CSV Source
The CSV source is a specialized form of the File Source that provides a set of CSV parsing options for use with Apache Spark, Spark SQL, and DataFrames. As always, the path
option of the file source must be configured. Please review the File Source if you are not already familiar with it.
When a program stores a lot of data, it may use delimiters to separate the data values from each other. A delimiter is one or more characters used to separate text strings. The most common delimiters are commas (,), quotes (", '), braces ({}), slashes (/), and pipes (|). The delimiter option sets the field separator; it defaults to a comma (,).
CREATE STREAM data
FROM CSV
OPTIONS(
'delimiter' = '|'
);
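The CREATE STREAM syntax above is the actual interface; as an illustration of what the delimiter option does, here is a minimal Python sketch using the standard csv module (the sample data is hypothetical):

```python
import csv
import io

# Illustrative only: Python's csv module mimicking the 'delimiter' option,
# with '|' as the field separator instead of the default comma.
raw = "id|name|city\n1|Alice|Paris\n2|Bob|Lyon\n"
rows = list(csv.reader(io.StringIO(raw), delimiter="|"))
print(rows[1])  # ['1', 'Alice', 'Paris']
```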
The mode option determines how the CSV source behaves when corrupt records are encountered. By default it is PERMISSIVE.
- PERMISSIVE: tries to parse all lines; nulls are inserted for missing tokens and extra tokens are ignored.
- DROPMALFORMED: drops lines that have fewer or more tokens than expected, or tokens that do not match the schema.
- FAILFAST: aborts with a RuntimeException if any malformed line is encountered.
CREATE STREAM foo
FROM CSV
OPTIONS(
'mode'='failfast'
);
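The three modes can be sketched in a few lines of Python. This is not the engine's actual code, just an illustration of the decision each mode makes for a malformed line:

```python
# Illustrative sketch of the three parse modes for a line with the wrong
# number of comma-separated tokens.
def parse_line(line, n_cols, mode="PERMISSIVE"):
    tokens = line.split(",")
    if len(tokens) == n_cols:
        return tokens
    if mode == "PERMISSIVE":
        # pad missing tokens with None, ignore extra tokens
        return (tokens + [None] * n_cols)[:n_cols]
    if mode == "DROPMALFORMED":
        return None  # record is silently dropped
    raise RuntimeError(f"malformed line: {line!r}")  # FAILFAST

print(parse_line("a,b", 3))                   # ['a', 'b', None]
print(parse_line("a,b", 3, "DROPMALFORMED"))  # None
```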
The charset option defaults to 'UTF-8' but can be set to any other valid charset name. Avoid relying on the platform default encoding: it varies between environments and leads to inconsistent behavior in Spark.
CREATE STREAM data
FROM CSV
OPTIONS(
'charset'='UTF-8'
);
The quote option sets the character used to enclose string values. By default the quote character is ", but it can be set to any character. Delimiters inside quoted values are ignored.
"a|b" // the pipe (|) inside the quotes is not treated as a delimiter
CREATE STREAM data
FROM CSV
OPTIONS(
'quote'='"'
);
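As an illustration of quote handling, here is a Python sketch using the standard csv module with a hypothetical record: the comma inside the quoted value does not split the field.

```python
import csv
import io

# Illustrative: a delimiter inside a quoted value is part of the value,
# not a field separator.
raw = '1,"Doe, John",NY\n'
row = next(csv.reader(io.StringIO(raw), quotechar='"'))
print(row)  # ['1', 'Doe, John', 'NY']
```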
The escape option sets the character used to escape quotes inside an already quoted value. By default the escape character is \, but it can be set to any character. Escaped quote characters are ignored. A character with a Unicode value between 0 and 255 may also be represented by an octal escape, i.e. a backslash followed by a sequence of up to three octal digits. A backslash that does not start a valid escape sequence is an error.
CREATE STREAM data
FROM CSV
OPTIONS(
'escape'='\'
);
The charToEscapeQuoteEscaping option defines the character used to escape the quote-escape character itself (the one set by the escape option). If this character is undefined and the input contains an escaped escape character followed by a quote, the record cannot be parsed: the parser reads the characters [a], [\], ["], [,], [ ], [b] and throws an error because it cannot find a closing quote. Defaults to '\0' (undefined).
CREATE STREAM data
FROM CSV
OPTIONS(
'charToEscapeQuoteEscaping'='/'
);
The comment option sets the single character used to skip lines beginning with that character, for example to hide commentary or program code details in a data file. Comments are disabled by default (the option is an empty string); set it to a character such as # to enable them.
CREATE STREAM data
FROM CSV
OPTIONS(
'comment'='#'
);
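Comment skipping amounts to filtering lines before they reach the parser. A minimal Python sketch of that behavior, with hypothetical input:

```python
import csv

# Illustrative: lines beginning with the comment character are skipped
# before parsing.
raw = "# generated by a hypothetical export job\n1,Alice\n# a note\n2,Bob\n"
comment = "#"
lines = [ln for ln in raw.splitlines() if not ln.startswith(comment)]
rows = list(csv.reader(lines))
print(rows)  # [['1', 'Alice'], ['2', 'Bob']]
```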
The headerFlag option indicates whether the first line of the file contains the column names (and, on write, whether the column names are written as the first line). Default is false.
CREATE STREAM data
FROM CSV
OPTIONS(
'headerFlag'='false'
);
The inferSchemaFlag option automatically infers column types. It requires one extra pass over the data and is false by default. The value must be a boolean; anything other than true or false raises an exception.
CREATE STREAM data
FROM CSV
OPTIONS(
'inferSchemaFlag'='true'
);
The ignoreLeadingWhiteSpaceInRead option is a flag indicating whether leading whitespace should be trimmed from values being read from a CSV file. Default is true. The corresponding option for write operations is ignoreLeadingWhiteSpaceFlagInWrite.
CREATE STREAM data
FROM CSV
OPTIONS(
'ignoreLeadingWhiteSpaceInRead'='true'
);
The ignoreTrailingWhiteSpaceInRead option is a flag indicating whether trailing whitespace should be trimmed from values being read. Default is true. The corresponding option for write operations is ignoreTrailingWhiteSpaceFlagInWrite.
CREATE STREAM data
FROM CSV
OPTIONS(
'ignoreTrailingWhiteSpaceInRead'='true'
);
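The two whitespace flags are independent trims on each side of a value. A minimal Python sketch of the combined behavior:

```python
# Illustrative: independently trim leading and trailing whitespace
# from parsed values, mirroring the two read-side flags above.
def trim(values, leading=True, trailing=True):
    out = []
    for v in values:
        if leading:
            v = v.lstrip()
        if trailing:
            v = v.rstrip()
        out.append(v)
    return out

print(trim(["  a", "b  ", "  c  "]))  # ['a', 'b', 'c']
```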
Specifies an optional column name to capture the values of corrupt records. Behavior varies based on the configured mode option. This is typically useful when you want to route valid records to one location and corrupt records to another, while capturing the original corrupt value. The column that captures the corrupt record value must be declared in the schema and must be a string column.
{...}
{"key":"value"}
CREATE STREAM foo(key string, corruptData string)
FROM CSV
OPTIONS(
'columnNameOfCorruptRecord'='corruptData'
);
-- contains {"key":"value"}
CREATE TEMPORARY VIEW validRecords AS
SELECT * FROM foo WHERE corruptData IS NULL;
-- contains {...}
CREATE TEMPORARY VIEW corruptRecords AS
SELECT * FROM foo WHERE corruptData IS NOT NULL;
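The routing above can be sketched outside SQL as well. This hypothetical Python fragment shows the idea: a record that fails to parse keeps its raw line in the corrupt-record field, and the two views are just filters on that field:

```python
# Illustrative: capture the raw value of unparseable records in a
# corrupt-record field, then split the stream on that field.
def parse(line, n_cols):
    tokens = line.split(",")
    if len(tokens) == n_cols:
        return {"fields": tokens, "corruptData": None}
    return {"fields": None, "corruptData": line}  # keep the original value

records = [parse(ln, 2) for ln in ["a,b", "oops", "c,d"]]
valid = [r for r in records if r["corruptData"] is None]
corrupt = [r for r in records if r["corruptData"] is not None]
print(len(valid), len(corrupt))  # 2 1
```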
Specifies a string that indicates a null value; any fields matching this string will be set to null in the DataFrame.
CREATE STREAM data
FROM CSV
OPTIONS(
'nullValue'=''
);
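The nullValue substitution is a simple per-field comparison. A minimal Python sketch, with hypothetical values:

```python
# Illustrative: any field equal to the configured null string becomes None.
def apply_null_value(values, null_value=""):
    return [None if v == null_value else v for v in values]

print(apply_null_value(["a", "", "b"]))     # ['a', None, 'b']
print(apply_null_value(["a", "NA"], "NA"))  # ['a', None]
```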
The nanValue option sets the string that represents a NaN value. NaN stands for "Not a Number"; it is usually the result of a mathematical operation that does not make sense, e.g. 0.0/0.0.
CREATE STREAM data
FROM CSV
OPTIONS(
'nanValue'='NaN'
);
The positiveInf and negativeInf options set the string representations of positive and negative infinity for double values. The defaults are Inf
for positiveInf and -Inf
for negativeInf
CREATE STREAM data
FROM CSV
OPTIONS(
'positiveInf'='Inf'
);
The compression codec option specifies the codec to use when saving to file. It should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec, or one of the case-insensitive short names (bzip2, gzip, lz4, and snappy). It defaults to no compression when a codec is not specified. The supported values are based on the Spark compression codecs.
CREATE STREAM data
FROM CSV
OPTIONS(
'CompressionCodecs'='gzip'
);
Specifies the time zone used to interpret date and time values during processing. This is typically useful for interactive use cases where a human or a report wishes to interpret time information in a specific time zone. Data processing pipelines should always deal with UTC values so they can be interpreted by presentation layers on a case-by-case basis.
Defaults to the system's default time zone (see java.util.TimeZone.getDefault).
The dateFormat option sets the string that indicates the format to use when reading dates. Custom date formats follow the patterns of java.text.SimpleDateFormat. This applies to DateType. Default is yyyy-MM-dd
{"dt":"01-05-2018"}
CREATE STREAM foo
FROM CSV
OPTIONS(
'dateFormat'='yyyy-MM-dd'
);
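A value that does not match the configured pattern (such as the 01-05-2018 example above) fails to parse. This Python sketch illustrates the behavior; note that Java's SimpleDateFormat pattern yyyy-MM-dd corresponds to Python's %Y-%m-%d:

```python
from datetime import datetime

# Illustrative: parse a date with the configured format.
# SimpleDateFormat "yyyy-MM-dd" maps to strptime "%Y-%m-%d".
d = datetime.strptime("2018-01-05", "%Y-%m-%d").date()
print(d.year, d.month, d.day)  # 2018 1 5

# A value like "01-05-2018" does not match the format and fails to parse.
try:
    datetime.strptime("01-05-2018", "%Y-%m-%d")
except ValueError:
    print("unparseable")
```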
The timestampFormat option sets the string that indicates a timestamp format. Custom date formats follow the patterns of java.text.SimpleDateFormat. This applies to TimestampType. The default is yyyy-MM-dd'T'HH:mm:ss.SSSXXX
// value: 2018-01-03 18:41:42
// interpreted: 2018-01-03 18:41:00
{"ts": 1515004902}
CREATE STREAM foo
FROM CSV
OPTIONS(
'timestampFormat'='yyyy-MM-dd''T''HH:mm:ss.SSSXXX'
);
Specifies whether files are multi-line CSV documents. Typically a CSV file contains one record per physical line. However, a quoted value may contain line breaks, in which case a single record spans multiple lines. In these cases, multiLine should be set to true.
Defaults to false.
1,"a value that
spans two lines"
CREATE STREAM foo
FROM CSV
OPTIONS(
'multiLine'='true'
);
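A quote-aware parser handles such records correctly, while naive line splitting would see three records where there are two. A minimal Python sketch with hypothetical input:

```python
import csv
import io

# Illustrative: a quoted value containing a newline makes one record span
# two physical lines; the csv parser still returns two records.
raw = '1,"first line\nsecond line"\n2,plain\n'
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))   # 2 records, although the input has 3 physical lines
print(rows[0][1])  # the embedded newline is preserved inside the value
```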
The maxColumns option defines the maximum number of columns allowed in the CSV file being processed.
CREATE STREAM data
FROM CSV
OPTIONS(
'maxColumns'='20480'
);
Defines the maximum number of characters allowed for any given value being read. The default value is 1000000.
CREATE STREAM data
FROM CSV
OPTIONS(
'maxCharsPerColumn'='1000000'
);
The escapeQuotes option is a flag indicating whether values containing quotes should always be enclosed in quotes when writing to a CSV file. The default is true, i.e. all values containing a quote character are escaped.
CREATE STREAM data
FROM CSV
OPTIONS(
'escapeQuotes'='true'
);
The quoteAll option is a flag indicating whether all values should always be enclosed in quotes when writing. The default is false, i.e. only values containing a quote character are quoted.
CREATE STREAM data
FROM CSV
OPTIONS(
'quoteAll'='false'
);
The inputBufferSize option sets the size of the parser's input buffer. Default is 128.
CREATE STREAM data
FROM CSV
OPTIONS(
'inputBufferSize'='128'
);