
SPARK-5063 error #22

Open
ramondalmau opened this issue Dec 11, 2020 · 3 comments

Comments


ramondalmau commented Dec 11, 2020

Dear all

Many thanks for this nice contribution. SparkTorch is exactly what I was looking for! :)
I am trying to attach a trained PyTorch network to a fitted ML PipelineModel. My attempt is:

trained_net = ...      # a trained PyTorch nn.Module
pipeline_model = ...   # a fitted ML PipelineModel

pipeline_with_net = attach_pytorch_model_to_pipeline(
    trained_net,
    pipeline_model=pipeline_model,
    inputCol='features',
    predictionCol='predicted',
    useVectorOut=False
)

Unfortunately, I get the following error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I also tried with the following (simple) neural network and command, and I receive exactly the same error:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x

create_spark_torch_model(
    Net(),
    useVectorOut=False
)

However, the following code runs smoothly, without errors:

network = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

create_spark_torch_model(
    network,
    useVectorOut=True
)

That is, it seems that the error is caused neither by my specific neural network nor by the pipeline.
I am using Databricks with Apache Spark 3.0.1. Do you know how to solve this issue?
Many thanks in advance

Ramon

@gunjan075

Getting the same error:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers

@gunjan075

@ramondalmau what was the solution you found 2 years ago?


srigas commented Jan 3, 2023

@gunjan075 Admittedly, I'm kind of late to the party, but I just stumbled upon this problem myself, so here's the workaround. First of all, if you check out the documentation of SparkTorch (README page), you will see the following note:

NOTE: One thing to remember is that if your network is not a sequential, it will need to be saved in a separate file and available in the python path.

In my experience, using a Sequential network is not sufficient in itself: simply adding a custom forward pass also triggers this problem.

So, the workaround in Databricks is to upload a separate file, either locally or by using the Repos functionality. One rather straightforward way of doing this is to upload the file to DBFS (this can be done manually from your Databricks notebook by selecting File -> Upload data to DBFS...) and then inform Spark with a command similar to

spark.sparkContext.addPyFile("dbfs:/FileStore/mymodel.py")

assuming your file is named mymodel.py and was uploaded to that DBFS directory. The file must contain the code that defines your NN model (as a class, a function, etc.). Then, if for example the NN model is defined through the MyNN class, you would simply add

from mymodel import MyNN

to your code. And that's pretty much it.
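To make the mechanism concrete, here is a minimal, Spark-free sketch of why the separate file helps. Python pickles classes *by reference* (module plus qualified name, not the source code), so the class must be importable wherever the bytes are deserialized; on a cluster that means on every worker, which is exactly what addPyFile arranges. A class defined directly in a notebook lives in `__main__`, which the workers cannot import. The file name `mymodel.py` and the class `MyNN` mirror the example above; a plain class stands in for a `torch.nn.Module` subclass so the sketch runs without PyTorch.

```python
# Sketch: pickling works only when the class's module is importable
# at deserialization time -- the situation addPyFile creates on workers.
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Write the model definition into its own module file.
tmp = Path(tempfile.mkdtemp())
(tmp / "mymodel.py").write_text(
    "class MyNN:\n"
    "    def forward(self, x):\n"
    "        return x * 2\n"
)

# Make the module importable, as addPyFile does on each worker.
sys.path.insert(0, str(tmp))
mymodel = importlib.import_module("mymodel")

payload = pickle.dumps(mymodel.MyNN())  # serialize on the "driver"
restored = pickle.loads(payload)        # deserialize on a "worker"
print(restored.forward(21))             # -> 42

# The payload stores the module and class *names*, not the class body:
print(b"mymodel" in payload, b"MyNN" in payload)  # -> True True
```

Had `MyNN` been defined inline in the notebook instead, the payload would reference `__main__`, and unpickling on a worker would raise the SPARK-5063-style failure seen above.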
