[BUG] Spark SQL types are not handled through basic RDD saveToOpenSearch() #473

Open
asalamon74 opened this issue Jun 6, 2024 · 2 comments
Labels: bug (Something isn't working)

@asalamon74 (Contributor)

What is the bug?

I wanted to insert some documents into OpenSearch using Spark. When I follow this suggestion ( https://github.com/opensearch-project/opensearch-hadoop/blob/main/USER_GUIDE.md#writing-3 ) and insert a simple document created on the fly, it works.

Relevant part of the code:

import org.apache.spark.sql.SparkSession
import org.opensearch.spark._

object SparkTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .config("opensearch.nodes", "HOSTNAME")
      .config("opensearch.net.http.auth.user", "admin")
      .config("opensearch.net.http.auth.pass", "admin")
      .getOrCreate()

    val sc = spark.sparkContext
    val doc1 = Map("vendor_id" -> 1)
    val doc2 = Map("vendor_id" -> 2)
    val batch = sc.makeRDD(Seq(doc1, doc2))
    batch.saveToOpenSearch("test_collection")
  }
}

but when I try to read the data from a CSV file like this:

val sqlContext = spark.sqlContext
val df = sqlContext.read.option("header", "true").csv("file.csv")
df.rdd.saveToOpenSearch("test_collection")

I get the following error:

Caused by: org.opensearch.hadoop.OpenSearchHadoopIllegalArgumentException: Spark SQL types are not handled through basic RDD saveToOpenSearch() calls; typically this is a mistake(as the SQL schema will be ignored). Use 'org.opensearch.spark.sql' package instead
        at org.opensearch.spark.serialization.ScalaValueWriter.doWriteScala(ScalaValueWriter.scala:141)
        at org.opensearch.spark.serialization.ScalaValueWriter.write(ScalaValueWriter.scala:55)
        at org.opensearch.hadoop.serialization.builder.ContentBuilder.value(ContentBuilder.java:63)
        at org.opensearch.hadoop.serialization.bulk.TemplatedBulk.doWriteObject(TemplatedBulk.java:81)
        at org.opensearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:68)
        at org.opensearch.hadoop.serialization.bulk.BulkEntryWriter.writeBulkEntry(BulkEntryWriter.java:78)
        ... 13 more

Most likely because the dataset has an org.apache.spark.sql.Dataset type. I have tried with and without the header option; same result.

I also tried to load a JSON file (as also suggested by the user guide):

val df = sqlContext.read.option("multiline","true").json("test.json")

but got the same result.

What am I missing here? Is it a bug, or am I supposed to read the JSON/CSV some other way?
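For anyone hitting the same error: a possible workaround sketch (untested against a live cluster, and assuming the connector's `org.opensearch.spark` RDD API, which accepts plain Scala Maps as shown in the first example above) is to convert each `Row` back into a `Map` before calling `saveToOpenSearch`:

```scala
import org.opensearch.spark._

// The basic RDD writer can serialize plain Scala Maps, but rejects Spark SQL Rows.
// Zip each Row's values with the DataFrame's column names to build one Map per document.
val fieldNames = df.schema.fieldNames
val mapRdd = df.rdd.map(row => fieldNames.zip(row.toSeq).toMap)
mapRdd.saveToOpenSearch("test_collection")
```

The documented path, though, is to use the `org.opensearch.spark.sql` package and call `saveToOpenSearch` on the DataFrame itself, without `.rdd`.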

How can one reproduce the bug?

Read a JSON/CSV file into a DataFrame, then call saveToOpenSearch on its RDD.

What is the expected behavior?

I expected the documents to be added to OpenSearch.

What is your host/environment?

Linux (RedHat 8)
Spark 3
Scala 2.12
OpenSearch 2.12
latest opensearch-hadoop

Do you have any screenshots?

Do you have any additional context?

@asalamon74 asalamon74 added bug Something isn't working untriaged labels Jun 6, 2024
@asalamon74 asalamon74 changed the title [BUG] [BUG] Spark SQL types are not handled through basic RDD saveToOpenSearch() Jun 6, 2024
@dblock (Member)

dblock commented Jul 1, 2024

Could be a bug/missing feature. Will need a deeper dive.

[Catch All Triage - Attendees 1, 2, 3, 4, 5]

@dblock dblock removed the untriaged label Jul 1, 2024
@Xtansia (Collaborator)

Xtansia commented Jul 24, 2024

@asalamon74 I notice that your example has df.rdd.saveToOpenSearch("test_collection"), whereas the USER_GUIDE uses df.saveToOpenSearch with no .rdd. Can you confirm whether you still experience issues with that change?

Example from guide:

import org.opensearch.spark.sql._

val df = sqlContext.read.json("examples/people.json")
df.saveToOpenSearch("spark/people")
