# Kotlin for Apache® Spark™

Your next API to work with [Apache Spark](https://spark.apache.org/).

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Apache Spark](https://spark.apache.org/).
It allows Kotlin developers to use familiar language features such as data classes and lambda expressions as simple expressions in curly braces or method references.

We have opened a Spark Project Improvement Proposal: [Kotlin support for Apache Spark](http://issues.apache.org/jira/browse/SPARK-32530#) to work with the community towards getting Kotlin support as a first-class citizen in Apache Spark. We encourage you to voice your opinions and participate in the discussion.

## Table of Contents

- [Supported versions of Apache Spark](#supported-versions-of-apache-spark)
- [Releases](#releases)
- [How to configure Kotlin for Apache Spark in your project](#how-to-configure-kotlin-for-apache-spark-in-your-project)
- [Kotlin for Apache Spark features](#kotlin-for-apache-spark-features)
    - [Creating a SparkSession in Kotlin](#creating-a-sparksession-in-kotlin)
    - [Creating a Dataset in Kotlin](#creating-a-dataset-in-kotlin)
    - [Null safety](#null-safety)
    - [withSpark function](#withspark-function)
    - [withCached function](#withcached-function)
    - [toList and toArray](#tolist-and-toarray-methods)
- [Examples](#examples)
- [Reporting issues/Support](#reporting-issuessupport)
- [Code of Conduct](#code-of-conduct)
- [License](#license)

## Supported versions of Apache Spark

<table>
<thead>
<tr>
<th>Apache Spark</th>
<th>Kotlin for Apache Spark</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td>3.0.0</td>
<td>0.3+</td>
</tr>
</tbody>
</table>

## Releases

The list of Kotlin for Apache Spark releases is available [here](https://github.com/JetBrains/kotlin-spark-api/releases/).
The `kotlin-spark-api` artifact can be obtained from [JitPack](https://jitpack.io/#JetBrains/kotlin-spark-api).

[![](https://jitpack.io/v/JetBrains/kotlin-spark-api.svg)](https://jitpack.io/#JetBrains/kotlin-spark-api)

## How to configure Kotlin for Apache Spark in your project

You can add Kotlin for Apache Spark as a dependency to your project: `Maven`, `Gradle`, `SBT`, and `Leiningen` are supported.

Here's an example `pom.xml`:

```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<!-- The coordinates below are illustrative placeholders; check the JitPack badge above
     for the exact groupId, artifactId, and version to use. -->
<dependency>
    <groupId>com.github.JetBrains.kotlin-spark-api</groupId>
    <artifactId>kotlin-spark-api</artifactId>
    <version>0.3.0</version>
</dependency>
```
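
For Gradle users, a minimal `build.gradle.kts` sketch could look like the following; the coordinates mirror the Maven placeholder above, so check the JitPack page for the exact artifact name and version:

```kotlin
repositories {
    maven { url = uri("https://jitpack.io") }
}

dependencies {
    // Placeholder coordinates; see https://jitpack.io/#JetBrains/kotlin-spark-api for the exact values
    implementation("com.github.JetBrains.kotlin-spark-api:kotlin-spark-api:0.3.0")
}
```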

Note that `core` is being compiled against Scala version `2.12`.
You can find a complete example with `pom.xml` and `build.gradle` in the [Quick Start Guide](docs/quick-start-guide.md).

Once you have configured the dependency, you only need to add the following import to your Kotlin file:
```kotlin
import org.jetbrains.spark.api.*
```

## Kotlin for Apache Spark features

### Creating a SparkSession in Kotlin
```kotlin
val spark = SparkSession
    .builder()
    // a typical local configuration; adjust the master and app name to your environment
    .master("local[2]")
    .appName("Simple Application")
    .getOrCreate()
```


### Creating a Dataset in Kotlin
```kotlin
spark.toDS("a" to 1, "b" to 2)
```
The example above produces `Dataset<Pair<String, Int>>`.

### Null safety
There are several aliases in the API, like `leftJoin`, `rightJoin`, etc. These are null-safe by design.
For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`.
Note that we force `RIGHT` to be nullable so that you, as a developer, can handle this situation.
`NullPointerException`s are hard to debug in Spark, and we are doing our best to make them as rare as possible.
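
As an illustration, here is a minimal sketch of a null-safe `leftJoin`; the data classes and the join-condition argument are assumptions made for this example, not part of the API:

```kotlin
import org.jetbrains.spark.api.*

// Hypothetical data classes, defined just for this example
data class Employee(val name: String, val deptId: Int)
data class Department(val id: Int, val title: String)

fun main() {
    withSpark {
        val employees = dsOf(Employee("Ann", 1), Employee("Bob", 42))
        val departments = dsOf(Department(1, "Engineering"))

        // Employees without a matching department get a null right-hand side,
        // so the result is a Dataset<Pair<Employee, Department?>>
        employees.leftJoin(departments, employees.col("deptId").equalTo(departments.col("id")))
            .map { (employee, dept) -> employee.name to (dept?.title ?: "unassigned") }
            .show()
    }
}
```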


### withSpark function

We provide a useful function, `withSpark`, which accepts everything that may be needed to run Spark: properties, name, master location, and so on. It also accepts a block of code to execute inside the Spark context.

```kotlin
withSpark {
    // a minimal example: dsOf builds a Dataset<Int> from varargs
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```

`dsOf` is just one more way to create a `Dataset` (here, a `Dataset<Int>`) from varargs.


### withCached function
It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache`
method. However, it then becomes difficult to control when we are working with a cached `Dataset` and when not.
It is also easy to forget to unpersist cached data, which can break things unexpectedly or take up more memory
than intended.

To solve these problems we've added the `withCached` function:

```kotlin
withSpark {
    // a sketch of the idea: cache a Dataset, inspect it, filter it, then keep transforming the result
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            show()                         // look at the cached Dataset for debugging
            filter { it.first % 2 == 0 }   // the filtered Dataset is what the block returns
        }
        .map { it.second * 2 }             // the cache has been unpersisted by this point
        .collectAsList()
}
```

Here we're showing the cached `Dataset` for debugging purposes and then filtering it.
The `filter` method returns the filtered `Dataset`, and then the cached `Dataset` is unpersisted, so we have more memory to call the `map` method and collect the resulting `Dataset`.

### toList and toArray methods

For more idiomatic Kotlin code we've added `toList` and `toArray` methods to this API. You can still use the `collect` method as in the Scala API; however, the result should be cast to `Array`.
This is because `collect` returns a Scala array, which is not the same as a Java/Kotlin one.
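
For example, here is a minimal sketch of the difference; the helper names come from this API as described above, while the cast shown for `collect` is only illustrative:

```kotlin
import org.jetbrains.spark.api.*

fun main() {
    withSpark {
        val ds = dsOf(1, 2, 3)

        val list = ds.toList()    // idiomatic Kotlin List
        val array = ds.toArray()  // likewise for arrays

        // Plain collect() still works, but it comes back as a Scala array,
        // so an explicit (unchecked) cast is needed on the Kotlin side:
        @Suppress("UNCHECKED_CAST")
        val collected = ds.collect() as Array<Int>
    }
}
```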

## Examples

For more, check out the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples) module.
To get up and running quickly, check out this [tutorial](docs/quick-start-guide.md).

## Reporting issues/Support
Please use [GitHub issues](https://github.com/JetBrains/kotlin-spark-api/issues) for filing feature requests and bug reports.
You are also welcome to join the [kotlin-spark channel](https://kotlinlang.slack.com/archives/C015B9ZRGJF) in the Kotlin Slack.

## Code of Conduct
This project and the corresponding community are governed by the [JetBrains Open Source and Community Code of Conduct](https://confluence.jetbrains.com/display/ALL/JetBrains+Open+Source+and+Community+Code+of+Conduct). Please make sure you read it.

## License
Kotlin for Apache Spark is licensed under the [Apache 2.0 License](LICENSE).

