[BUG] Spark SQL types are not handled through basic RDD saveToOpenSearch() #473

Open
asalamon74 opened this issue Jun 6, 2024 · 2 comments
Labels: bug (Something isn't working)

@asalamon74 (Contributor)

What is the bug?

I wanted to insert some documents into OpenSearch using Spark. When I follow this suggestion ( https://github.com/opensearch-project/opensearch-hadoop/blob/main/USER_GUIDE.md#writing-3 ) and insert a simple document created on the fly, it works.

Relevant part of the code:

import org.apache.spark.sql.SparkSession
import org.opensearch.spark._

object SparkTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .config("opensearch.nodes", "HOSTNAME")
      .config("opensearch.net.http.auth.user", "admin")
      .config("opensearch.net.http.auth.pass", "admin")
      .getOrCreate()

    val sc = spark.sparkContext
    val doc1 = Map("vendor_id" -> 1)
    val doc2 = Map("vendor_id" -> 2)
    val batch = sc.makeRDD(Seq(doc1, doc2))
    batch.saveToOpenSearch("test_collection")
  }
}

but when I try to read the data from a CSV file like this:

val sqlContext = spark.sqlContext
val df = sqlContext.read.option("header", "true").csv("file.csv")
df.rdd.saveToOpenSearch("test_collection")

I get the following error:

Caused by: org.opensearch.hadoop.OpenSearchHadoopIllegalArgumentException: Spark SQL types are not handled through basic RDD saveToOpenSearch() calls; typically this is a mistake(as the SQL schema will be ignored). Use 'org.opensearch.spark.sql' package instead
        at org.opensearch.spark.serialization.ScalaValueWriter.doWriteScala(ScalaValueWriter.scala:141)
        at org.opensearch.spark.serialization.ScalaValueWriter.write(ScalaValueWriter.scala:55)
        at org.opensearch.hadoop.serialization.builder.ContentBuilder.value(ContentBuilder.java:63)
        at org.opensearch.hadoop.serialization.bulk.TemplatedBulk.doWriteObject(TemplatedBulk.java:81)
        at org.opensearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:68)
        at org.opensearch.hadoop.serialization.bulk.BulkEntryWriter.writeBulkEntry(BulkEntryWriter.java:78)
        ... 13 more

Most likely because the dataset has an org.apache.spark.sql.Dataset type. I have tried with and without the header option; same result.

I also tried to load a JSON file (as also suggested by the user guide):

val df = sqlContext.read.option("multiline","true").json("test.json")

but got the same result.

What am I missing here? Is it a bug, or am I supposed to read the JSON/CSV some other way?
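For anyone hitting the same error: a possible workaround sketch (untested against a live cluster, and assuming the connector's `org.opensearch.spark` RDD API, which accepts plain Scala Maps as shown in the first example above) is to convert each `Row` back into a `Map` before calling `saveToOpenSearch`:

```scala
import org.opensearch.spark._

// The basic RDD writer can serialize plain Scala Maps, but rejects Spark SQL Rows.
// Zip each Row's values with the DataFrame's column names to build one Map per document.
val fieldNames = df.schema.fieldNames
val mapRdd = df.rdd.map(row => fieldNames.zip(row.toSeq).toMap)
mapRdd.saveToOpenSearch("test_collection")
```

The documented path, though, is to use the `org.opensearch.spark.sql` package and call `saveToOpenSearch` on the DataFrame itself, without `.rdd`.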

How can one reproduce the bug?

Read a JSON/CSV file into a DataFrame, then call saveToOpenSearch on its RDD.

What is the expected behavior?

I expected the documents to be added to OpenSearch.

What is your host/environment?

Linux (RedHat 8)
Spark 3
Scala 2.12
OpenSearch 2.12
latest opensearch-hadoop

Do you have any screenshots?

Do you have any additional context?

@asalamon74 asalamon74 added bug Something isn't working untriaged labels Jun 6, 2024
@asalamon74 asalamon74 changed the title [BUG] [BUG] Spark SQL types are not handled through basic RDD saveToOpenSearch() Jun 6, 2024
@dblock (Member)

dblock commented Jul 1, 2024

Could be a bug/missing feature. Will need a deeper dive.

[Catch All Triage - Attendees 1, 2, 3, 4, 5]

@dblock dblock removed the untriaged label Jul 1, 2024
@Xtansia (Collaborator)

Xtansia commented Jul 24, 2024

@asalamon74 I notice that your example has df.rdd.saveToOpenSearch("test_collection"), whereas the USER_GUIDE uses df.saveToOpenSearch with no .rdd. Can you confirm whether you still experience issues with that change?

Example from guide:

import org.opensearch.spark.sql._

val df = sqlContext.read.json("examples/people.json")
df.saveToOpenSearch("spark/people")
