Geni not usable with latest Databricks #356

Open
behrica opened this issue Oct 13, 2024 · 4 comments · May be fixed by #357

Comments

@behrica
Collaborator

behrica commented Oct 13, 2024

As mentioned in #332, I have an issue with geni on Databricks.
Apparently this call

(.setCheckpointDir context checkpoint-dir))

which is executed even when using an existing Spark session / context,
fails on Databricks:

IllegalArgumentException: Path must be absolute: target/checkpoint/3f38a4a8-51e9-47fc-a1d1-7c0f3e2f2520
    at com.databricks.common.path.AbstractPath$.fromHadoopPath(AbstractPath.scala:114)
    at com.databricks.backend.daemon.data.client.DBFSV2.resolveAndGetFileSystem(DatabricksFileSystemV2.scala:148)
    at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.resolve(DatabricksFileSystemV2.scala:773)
    at com.databricks.backend.daemon.data.client.Databrick

If I understand the current code correctly, this call cannot be avoided, nor can the directory be changed.

My comments in #332 were based on the assumption that the code wrongly creates another session, but I think that's not true.
It does get the session from Databricks, but then tries to set the checkpoint directory on the existing session/context, and this now fails. Maybe it did work with older Databricks/Spark versions.

@behrica
Collaborator Author

behrica commented Nov 9, 2024

This issue makes it impossible to use geni with Databricks.

@behrica
Collaborator Author

behrica commented Nov 9, 2024

I dug into it.
I have always wondered whether setting options here:

{:configs {:spark.sql.adaptive.enabled "true"

is the "right thing" to do in general, as there might always be Spark environments that do not support some of them.

The documentation says:

Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.

So the options are always applied, even when an existing Spark session is used. To me that is questionable,
especially as it cannot be disabled.

setCheckpointDir is always called as well, even when using an existing Spark session.
And apparently my Databricks cluster does not support changing the checkpoint dir to the unchangeable default of
"target/checkpoint/" (which is clearly a development setting, and one I clearly don't want applied to my cluster).
My suggestion would be to set neither options nor checkpoint-dir by default.
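
To illustrate the suggestion, here is a minimal sketch of opt-in behaviour written directly against Spark's Java API. It is not geni's actual code: the namespace `example.session` and the helpers `config-pairs` and `get-or-create-session` are hypothetical names made up for this example. The idea is that nothing is applied unless the caller passes it explicitly, so an existing session (for example one provided by Databricks) is returned untouched by default.

```clojure
(ns example.session
  (:import (org.apache.spark.sql SparkSession)))

(defn config-pairs
  "Hypothetical helper: turns a map such as
  {:spark.sql.adaptive.enabled \"true\"} into [key value] string pairs."
  [configs]
  (map (fn [[k v]] [(name k) (str v)]) configs))

(defn get-or-create-session
  "Hypothetical helper: gets or creates a SparkSession.
  Config options and the checkpoint dir are applied only when the
  caller supplies them; there are no built-in defaults."
  [{:keys [configs checkpoint-dir]}]
  (let [builder (SparkSession/builder)]
    ;; The builder is mutable, so each .config call accumulates on it.
    (doseq [[k v] (config-pairs configs)]
      (.config builder ^String k ^String v))
    (let [session (.getOrCreate builder)]
      ;; Only touch the checkpoint dir when explicitly requested.
      (when checkpoint-dir
        (.setCheckpointDir (.sparkContext session) checkpoint-dir))
      session)))
```

With this shape, `(get-or-create-session {})` would simply return the existing session on a cluster, while a local development setup could still pass `{:checkpoint-dir "target/checkpoint/"}` explicitly.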

@behrica
Collaborator Author

behrica commented Nov 17, 2024

On my fork I made a change which fixes this:
behrica@16080dd

@behrica changed the title from "default "setCheckpointDir" fails on databricks" to "Geni not usable with latest Databricks" on Nov 17, 2024
@anthony-khong
Member

Hi Carsten @behrica, would you like to make a PR? I'm happy to merge it. Let me know if we need to change the CI workflows as well.
