update: restructure read operations page (#447)
Co-authored-by: Sarah Haggarty <[email protected]>
sarahhaggarty and sarahhaggarty authored Sep 7, 2023
1 parent 6bb6cf7 commit 66d94fb
Showing 1 changed file with 91 additions and 52 deletions.
143 changes: 91 additions & 52 deletions docs/StardustDocs/topics/read.md
@@ -1,32 +1,41 @@
[//]: # (title: Read)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Read-->

The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, Apache Arrow input formats.
The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.

`read` method automatically detects input format based on file extension and content
The `.read()` function automatically detects the input format based on file extension and content:

```kotlin
DataFrame.read("input.csv")
```

Input string can be a file path or URL.
The input string can be a file path or URL.

## Reading CSV
## Read from CSV

All these calls are valid:
To read a CSV file, use the `.readCSV()` function.

To read a CSV file from a file:

```kotlin
import java.io.File
import java.net.URL

DataFrame.readCSV("input.csv")
// Alternatively
DataFrame.readCSV(File("input.csv"))
```

To read a CSV file from a URL:

```kotlin
import java.net.URL

DataFrame.readCSV(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
```

All `readCSV` overloads support different options.
For example, you can specify custom delimiter if it differs from `,`, charset
and column names if your CSV is missing them
### Specify delimiter

By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:

<!---FUN readCsvCustom-->

@@ -41,7 +50,9 @@ val df = DataFrame.readCSV(

<!---END-->

Column types will be inferred from the actual CSV data. Suppose that CSV from the previous
### Column type inference from CSV

Column types are inferred from the CSV data. Suppose that the CSV from the previous
example had the following content:

<table>
@@ -51,7 +62,7 @@ example had the following content:
<tr><td>89</td><td>abc</td><td>7.1</td><td>false</td></tr>
</table>

[`DataFrame`](DataFrame.md) schema we get is:
Then the [`DataFrame`](DataFrame.md) schema we get is:

```text
A: Int
@@ -60,7 +71,7 @@ C: Double
D: Boolean?
```

[`DataFrame`](DataFrame.md) will try to parse columns as JSON, so when reading following table with JSON object in column D:
[`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with JSON object in column D:

<table>
<tr><th>A</th><th>D</th></tr>
@@ -77,7 +88,7 @@ D:
C: Int
```

For column where values are lists of JSON values:
For a column where values are lists of JSON values:
<table>
<tr><th>A</th><th>G</th></tr>
<tr><td>12</td><td>[{"B":1,"C":2,"D":3},{"B":1,"C":3,"D":2}]</td></tr>
@@ -92,7 +103,7 @@ G: *
D: Int
```

### Dealing with locale specific numbers
### Work with locale-specific numbers

Sometimes columns in your CSV can be interpreted differently depending on your system locale.

@@ -102,8 +113,8 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
<tr><td>41,111</td></tr>
</table>

Here comma can be decimal or thousands separator, thus different values.
You can deal with it in two ways
Here a comma can be either a decimal or a thousands separator, so the same text can produce different values.
You can deal with it in two ways:

1) Provide locale as a parser option

@@ -132,20 +143,34 @@ val df = DataFrame.readCSV(
<!---END-->
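
To see why the locale matters for a value such as `41,111`, here is a minimal sketch that uses only the JDK's `java.text.NumberFormat`, not the DataFrame API. The helper name `parseLocalized` is hypothetical and exists only for illustration; the library's own parser options may behave differently in detail.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper (not part of the DataFrame API):
// parses a number string using the separators of the given locale.
fun parseLocalized(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()

// In Locale.US the comma is a thousands separator.
val asUs = parseLocalized("41,111", Locale.US)

// In Locale.GERMANY the comma is a decimal separator.
val asGerman = parseLocalized("41,111", Locale.GERMANY)
```

With these definitions, `asUs` is forty-one thousand one hundred eleven, while `asGerman` is roughly forty-one point one. This is the ambiguity that an explicit locale resolves.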


## Reading JSON
## Read from JSON

To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.

Note that after reading a JSON with a complex structure, you can get hierarchical
[`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.

To read a JSON file from a file:

<!---FUN readJson-->

```kotlin
val df = DataFrame.readJson(file)
```

<!---END-->

Basics for reading JSONs are the same: you can read from file or from remote URL.
To read a JSON file from a URL:

```kotlin
DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")
```

Note that after reading a JSON with a complex structure, you can get hierarchical
[`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
### Column type inference from JSON

Also note that type inferring process for JSON is much simpler than for CSV.
JSON string literals are always supposed to have String type, number literals
take different `Number` kinds, boolean literals are converted to `Boolean`.
Type inference for JSON is much simpler than for CSV.
JSON string literals are always read as `String`. Number literals
take different `Number` kinds. Boolean literals are converted to `Boolean`.

Let's take a look at the following JSON:

@@ -178,17 +203,13 @@ Let's take a look at the following JSON:
]
```

We can read it from file

<!---FUN readJson-->
We can read it from file:

```kotlin
val df = DataFrame.readJson(file)
```

<!---END-->

Corresponding [`DataFrame`](DataFrame.md) schema will be
The corresponding [`DataFrame`](DataFrame.md) schema is:

```text
A: String
@@ -200,7 +221,9 @@ D: Boolean?
Column A has `String` type because all values are string literals and no implicit conversion is performed. Column C
has `Number` type because it's the nearest common supertype of `Int` and `Double`.
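
The `Number` result for column C can be pictured with a small sketch in plain Kotlin. The helper `commonNumberType` is hypothetical and is not how the library implements inference. It only illustrates the idea that a column keeps a concrete numeric type when all values share it, and falls back to `Number` when they do not.

```kotlin
import kotlin.reflect.KClass

// Hypothetical helper (not part of the DataFrame API):
// returns the shared runtime class of the values, or Number when they differ.
fun commonNumberType(values: List<Number>): KClass<out Number> {
    val kinds = values.map { it::class }.toSet()
    return if (kinds.size == 1) kinds.single() else Number::class
}
```

For example, `commonNumberType(listOf(1, 2))` yields `Int::class`, while `commonNumberType(listOf(1, 2.5))` yields `Number::class`.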

### JSON Reading Options: Type Clash Tactic
### JSON parsing options

#### Manage type clashes

By default, if a type clash occurs when reading JSON, a new column group is created consisting of: "value", "array", and
any number of object properties:
@@ -251,9 +274,9 @@ For this case, you can set `typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS`

This option is also possible to set in the Gradle- and KSP plugin by providing `jsonOptions`.

### JSON Reading Options: Key/Value Paths
#### Specify Key/Value Paths

If you have some JSON looking like
If you have a JSON like:

```json
{
@@ -280,10 +303,10 @@ If you have some JSON looking like
}
```

you will get a column for each dog, which becomes an issue when you have a lot of dogs.
This issue is especially noticeable when generating data schemas from the JSON, as you might even run out of memory
when doing that due to the sheer number of generated interfaces.\
Instead, you can use `keyValuePaths` to specify paths to the objects that should be read as key value frame columns.
You will get a column for each dog, which becomes an issue when you have a lot of dogs.
This issue is especially noticeable when generating data schemas from JSON, as you might run out of memory
when doing that due to the sheer number of generated interfaces. Instead, you can use `keyValuePaths` to specify paths
to the objects that should be read as key value frame columns.

This can be the difference between:

@@ -342,22 +365,35 @@ Only the bracket notation of json path is supported, as well as just double quot

For more examples, see the "examples/json" module.

## Reading Excel
## Read from Excel

Add dependency:
Before you can read data from Excel, add the following dependency:

```kotlin
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
```

Right now [`DataFrame`](DataFrame.md) supports reading Excel spreadsheet formats: xls, xlsx.
To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
Excel spreadsheet formats are: xls, xlsx.

To read an Excel spreadsheet from a file:

```kotlin
val df = DataFrame.readExcel(file)
```

You can read from file or URL.
To read an Excel spreadsheet from a URL:

```kotlin
DataFrame.readExcel("https://example.com/data.xlsx")
```

### Cell type inference from Excel

Cells representing dates will be read as `kotlinx.datetime.LocalDateTime`.
Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`
Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`.

Sometimes cells can have wrong format in Excel file, for example you expect to read column of String:
Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of `String`:

```text
IDS
@@ -367,9 +403,9 @@ B100
C100
```

You will get column of Serializable instead (common parent for Double & String)
You will get a column of `Serializable` instead (common parent for `Double` and `String`).

You can fix it using convert:
You can fix it using the `.convert()` function:

<!---FUN fixMixedColumn-->

@@ -387,25 +423,28 @@ df1["IDS"].type() shouldBe typeOf<String>()

<!---END-->

## Reading Apache Arrow formats
## Read Apache Arrow formats

Add dependency:
Before you can read data from Apache Arrow format, add the following dependency:

```kotlin
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
```

<warning>
Make sure to follow [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide when using Java 9+
</warning>
To read Apache Arrow formats, use the `.readArrowFeather()` function:

[`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
<!---FUN readArrowFeather-->

```kotlin
val df = DataFrame.readArrowFeather(file)
```

<!---END-->

[`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.

> If you use Java 9+, follow the [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide.
>
{style="note"}
