
JSON Source


The JSON source is a specific subset of the File Source. As such, file source options may also be configured on a JSON source. As always, the path option of the file source must be configured. Please review the File Source if you are not already familiar with it.
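For example, a minimal sketch of a JSON source that only sets the required path option (the directory /data/events/json is a hypothetical location):

CREATE STREAM foo
FROM JSON
OPTIONS(
  'path'='/data/events/json'
);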

Options

samplingRatio

Specifies a value greater than 0 and less than or equal to 1, representing the approximate fraction of the data to sample when inferring the JSON schema if one was not provided by the user. Typically we recommend users define their schema(s) if they are well-known and non-volatile. For schema inference to take effect, spark.sql.streaming.schemaInference must be set to true.

Defaults to 1.0.

CREATE STREAM foo
FROM JSON
OPTIONS(
  'samplingRatio'='0.25'
);
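If your deployment lets you issue Spark SQL configuration statements directly, schema inference might be enabled with something like the following (a sketch; whether this dialect accepts SET statements is an assumption):

SET spark.sql.streaming.schemaInference=true;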

primitivesAsString

Specifies whether primitive JSON data types should be interpreted as strings when inferring a JSON schema.

Defaults to false.

{"int":0,"bool":false}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'primitivesAsString'='true'
);

In the above example, the int and bool columns will both have a data type of string.

prefersDecimal

Specifies whether JSON floats and doubles should be interpreted as decimals when inferring a JSON schema.

Defaults to false.

{"double":0.0}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'prefersDecimal'='true'
);

In the above example, the double column will have a data type of decimal rather than double.

allowComments

Determines whether the parser will allow Java/C++ style comments (both /* */ and // varieties) within parsed content.

Since the JSON specification does not mention comments as a legal construct, this is a non-standard feature; however, it is used extensively in the wild. As such, the feature is disabled by default and must be explicitly enabled.

// some comment here
{"key":"value"}
{"key":"value"/*another comment block here*/}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowComments'='true'
);

allowUnquotedFieldNames

Determines whether the parser will allow unquoted field names (which are allowed by JavaScript, but not by the JSON specification).

Since the JSON specification requires double quotes around field names, this is a non-standard feature, and as such it is disabled by default.

{key:"value"}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowUnquotedFieldNames'='true'
);

allowSingleQuotes

Determines whether the parser will allow single quotes (apostrophe, character ') for quoting strings (both field names and string values), in addition to the standard double quotes.

Since the JSON specification requires double quotes, this is a non-standard feature, and as such it is disabled by default.

{'key':"value"}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowSingleQuotes'='true'
);

allowNumericLeadingZeros

Determines whether the parser will allow JSON integral numbers to start with additional (ignorable) zeroes (like 000001). If enabled, no exception is thrown, and the extra zeroes are silently ignored (and not included in the textual representation exposed via JsonParser.getText()).

Since the JSON specification does not allow leading zeroes, this is a non-standard feature, and as such it is disabled by default.

{"key":00001}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowNumericLeadingZeros'='true'
);

allowNonNumericNumbers

Allows the parser to recognize a set of "Not-a-Number" (NaN) tokens as legal floating point number values (similar to what many other data formats and programming language source code allow). The specific subset contains the values that XML Schema allows (see section 3.2.4.1, Lexical Representation); the tokens are the quoted contents, not including the quotes:

  • "INF" (for positive infinity), as well as alias of "Infinity"
  • "-INF" (for negative infinity), alias "-Infinity"
  • "NaN" (for other not-a-numbers, like result of division by zero)

Since the JSON specification does not allow such values, this is a non-standard feature, and as such it is disabled by default.

{"key":NaN}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowNonNumericNumbers'='true'
);

allowBackslashEscapingAnyCharacter

Can be enabled to accept backslash escaping of any character: if not enabled, only the characters explicitly listed by the JSON specification can be escaped (see the JSON spec for the small list of these characters).

Since the JSON specification defines only a limited set of escape sequences, this is a non-standard feature, and as such it is disabled by default.

{"key":"val\\""}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowBackslashEscapingAnyCharacter'='true'
);

allowUnquotedControlChars

Determines whether the parser will allow JSON strings to contain unquoted (unescaped) control characters (ASCII characters with a value less than 32, including tab and line feed). If this feature is disabled, an exception is thrown when such a character is encountered.

Since the JSON specification requires escaping of all control characters, this is a non-standard feature, and as such it is disabled by default.

{"k\tey":"\t"}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'allowUnquotedControlChars'='true'
);

compression

Specifies the compression codec of the JSON files, if any, based on supported Spark compression codecs.

Defaults to none.

// files are snappy compressed
{"key":"value"}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'compression'='snappy'
);

mode

Determines how this JSON source should behave when corrupt records are encountered.

  • permissive: attempts to process malformed records as long as other constraints are not violated
  • dropmalformed: ignores malformed records entirely
  • failfast: throws an exception when malformed records are encountered

Defaults to permissive.

{...} // invalid json, throws exception
CREATE STREAM foo
FROM JSON
OPTIONS(
  'mode'='failfast'
);
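Similarly, a dropmalformed sketch using the same hypothetical stream simply discards the invalid record rather than raising an error:

{...} // invalid json, silently dropped
{"key":"value"}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'mode'='dropmalformed'
);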

columnNameOfCorruptRecord

Specifies an optional column name to capture values of corrupt records. Behavior varies based on the configured mode. This is typically useful in scenarios where you want to route valid records to one location, and corrupt records to another, while capturing the original corrupt value. The column name to capture the corrupt record value must be specified in the schema and be a string column.

{...}
{"key":"value"}
CREATE STREAM foo(key string, corruptData string)
FROM JSON
OPTIONS(
  'columnNameOfCorruptRecord'='corruptData'
);

-- contains {"key":"value"}
CREATE TEMPORARY VIEW validRecords AS
SELECT * FROM foo WHERE corruptData IS NULL;

-- contains {...}
CREATE TEMPORARY VIEW corruptRecords AS
SELECT * FROM foo WHERE corruptData IS NOT NULL;

-- todo: route validRecords and corruptRecords accordingly

timezone

Specifies the time zone in which to interpret date and time values during processing. This is typically useful for interactive use cases where a human or a report wishes to interpret time information in a specific time zone. Data processing pipelines should always deal with UTC values so they can be easily interpreted by presentation layers as needed on a case-by-case basis.

Defaults to the system's default time zone (see java.util.TimeZone.getDefault).
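For example, a minimal sketch that pins the source to UTC (assuming the option accepts standard Java time zone IDs such as UTC or America/New_York):

CREATE STREAM foo
FROM JSON
OPTIONS(
  'timezone'='UTC'
);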

dateFormat

Specifies the date format used to coerce JSON values into a desired SQL date column.

Defaults to 'yyyy-MM-dd'.

{"dt":"01-01-2018"}
CREATE STREAM foo(dt date)
FROM JSON
OPTIONS(
  'dateFormat'='MM-dd-yyyy'
);

timestampFormat

Specifies the timestamp format used to coerce JSON values into a desired SQL timestamp column.

Defaults to yyyy-MM-dd'T'HH:mm:ss.SSSXXX.

// 1515004902 seconds since the epoch = 2018-01-03 18:41:42 UTC
// interpreted: 2018-01-03 18:41:00
{"ts": 1515004902}
CREATE STREAM foo(ts timestamp)
FROM JSON
OPTIONS(
  'timestampFormat'='yyyy-MM-dd HH:mm'
);

multiLine

Specifies whether files are multi-line JSON documents. Typically large JSON files follow the JSON lines standard, which is essentially one valid JSON object per line. However, you may encounter large JSON documents where a single file is a single JSON object. In these cases, multiLine should be set to true.

Defaults to false.

{
  "key":"value"
}
CREATE STREAM foo
FROM JSON
OPTIONS(
  'multiLine'='true'
);

Note that the example JSON object spans multiple lines, rather than following the JSON Lines standard of one JSON object per line, as seen below:

{"key":"value"}
{"key":"value"}