Spark backend does not work on Databricks #4642

Open
HaraldVanWoerkom opened this issue May 20, 2022 · 5 comments

@HaraldVanWoerkom

When I create an estimator with a 'spark' backend on Databricks, I have to provide a path to a shared folder. On Databricks, that is a path under /dbfs/...:
est = Estimator.from_keras(model_creator=my_model_creator, backend='spark', model_dir = '/dbfs/tmp/bigdl')

Unfortunately, the fit crashes with an "operation not supported" exception. This is (apparently) because the model writing code uses random access, which is not supported by the file system.
BigDL has a workaround for this: it can write to local storage and then copy the result to the shared folder. This is triggered by setting model_dir to 'dbfs:/tmp/bigdl'. Unfortunately, this causes an exception in the save_pkl function in bigdl/orca/learn/utils.py, because 'open' does not support the dbfs: scheme (yes, in Databricks some APIs require dbfs:, while others do not support it).
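
To illustrate the write-locally-then-copy idea described above, here is a minimal sketch. It is not BigDL's actual code: `write_via_local` and its `write_fn` callback are hypothetical names introduced only for this example. The point is that any seeking happens on a local temporary file, and the shared folder only receives a plain file copy.

```python
import os
import shutil
import tempfile


def write_via_local(write_fn, target_path):
    """Run write_fn against a local temp file, then copy the result to target_path.

    Hypothetical helper for illustration; not part of BigDL.
    write_fn is any callable that takes a local file path and writes to it.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(target_path))
        # The writer may seek freely here, since this is ordinary local storage.
        write_fn(local_path)
        # Copy the finished file to the destination (e.g. a path under /dbfs/).
        shutil.copyfile(local_path, target_path)
```

A caller would pass a function that saves the model to the given local path, with target_path pointing at the shared folder.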

I patched utils.py to support dbfs (line 380):
    else:
        if path.startswith("file://"):
            path = path[len("file://"):]
        elif path.startswith("dbfs:/"):  # NEW
            path = "/dbfs/" + path[len("dbfs:/"):]  # NEW
        with open(path, 'wb') as f:
            pickle.dump(data, f)

This seems to work.
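
For anyone who wants to try the same prefix mapping outside of utils.py, it can be sketched as a standalone function. `to_local_path` is a hypothetical name, not a BigDL function, and stripping extra leading slashes (so that a dbfs:/// form also maps cleanly) is an assumption that goes slightly beyond the patch above.

```python
def to_local_path(path):
    """Map file:// and dbfs:/ style paths to paths that open() accepts.

    Hypothetical helper for illustration; not part of BigDL.
    """
    if path.startswith("file://"):
        return path[len("file://"):]
    if path.startswith("dbfs:/"):
        # Assumption: DBFS is mounted locally under /dbfs, as on Databricks.
        # lstrip("/") also tolerates forms such as dbfs:///tmp/bigdl.
        return "/dbfs/" + path[len("dbfs:"):].lstrip("/")
    # Anything else is assumed to already be a local path.
    return path


print(to_local_path("dbfs:/tmp/bigdl"))
```

Paths that already start with /dbfs/ pass through unchanged, so the function can be applied regardless of which form the caller used.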

@jason-dai
Contributor

@HaraldVanWoerkom Thanks for reporting the issue; we'll take a look.

@Le-Zheng
Contributor

@HaraldVanWoerkom Many thanks for posting! I have verified that the solution path = "/dbfs/" + path[len("dbfs:/"):] works on Databricks. I will create a fix in the source code.

@HaraldVanWoerkom
Author

@Le-Zheng Thanks for picking this up.

@Le-Zheng
Contributor

> @Le-Zheng Thanks for picking this up.

Sure. The Estimator.from_keras API does not call the save_pkl function.

@Le-Zheng
Contributor

This issue will be fixed in patch #4674.
