Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bigquery 2.9.16: Using this connector in Spark is resulting in all values in the spark dataframe being the column names #1245

Open
dannnnthemannnn opened this issue May 31, 2023 · 1 comment
Assignees
Labels
api: spanner Issues related to the googleapis/java-spanner-jdbc API.

Comments

@dannnnthemannnn
Copy link

dannnnthemannnn commented May 31, 2023

Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

Please run down the following list and make sure you've tried the usual "quick fixes":

If you are still having issues, please include as much information as possible:

Environment details

  1. Specify the API at the beginning of the title. For example, "BigQuery: ...").
    General, Core, and Other are also allowed as types
  2. OS type and version: Mac 12.0
  3. Java version: 20.0.1
  4. version(s):

Steps to reproduce

  1. Hook this connector up in a spark job with the following code:
def querySpanner[T](sqlQuery: String)(implicit spark: SparkSession, enc: org.apache.spark.sql.Encoder[T]): Dataset[T] = {
  val url = "jdbc:cloudspanner:/projects/your-project-id/instances/your-instance-id/databases/your-database-id?credentials=$jsonKeyFilePath"

  // Read data using Spark
  val df = spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", "myTable")
    .option("driver", "com.google.cloud.spanner.jdbc.JdbcDriver")
    .load()

  // Convert DataFrame to Dataset
  df.as[T]
}
  1. See that it is returning the values as if every row is the column name:
    +----+---+
    |name| id|
    +----+---+
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    +----+---+
    only showing top 20 rows

My query is: "SELECT name, id FROM OrgInfoV2"

Any additional information below

It seems similar to this issue:
https://stackoverflow.com/questions/66983401/spark-mariadb-jdbc-sql-query-returns-column-names-instead-of-column-values
or this one:
https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name

where it appears to be issues with the driver

Following these steps guarantees the quickest resolution possible.

Thanks!

@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/java-spanner-jdbc API. label May 31, 2023
@dannnnthemannnn dannnnthemannnn changed the title Using this connector in Spark is resulting in all values in the spark dataframe being the column names Bigquery 2.9.16: Using this connector in Spark is resulting in all values in the spark dataframe being the column names May 31, 2023
@olavloite
Copy link
Collaborator

@dannnnthemannnn

I'm pretty sure that this is the same as for example sparklyr/sparklyr#3196

The problem is that Spark seems to generate a query that looks like this:

select "name", "id"
from OrgInfoV2

Double quotes are used for string literals in Cloud Spanner (and BigQuery). Sparks seems to think that it is a valid way to quote column names in case any of the column names contain any spaces or are equal to reserved keywords.

A possible workaround is probably to explicitly use the MySQL dialect for your connection. MySQL uses the same type of quoting as Cloud Spanner. See https://github.com/apache/spark/blob/071feabbd4325504332679dfa620bc5ee4359370/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala#L108

@rajatbhatta rajatbhatta assigned olavloite and unassigned rajatbhatta Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/java-spanner-jdbc API.
Projects
None yet
Development

No branches or pull requests

3 participants