Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Pyspark Pipeline has py4j not found exception when lineage is enabled #441

Closed
csun-cpointe opened this issue Oct 28, 2024 · 3 comments · Fixed by #445
Closed

BUG: Pyspark Pipeline has py4j not found exception when lineage is enabled #441

csun-cpointe opened this issue Oct 28, 2024 · 3 comments · Fixed by #445
Assignees
Labels
bug Something isn't working
Milestone

Comments

@csun-cpointe
Copy link
Contributor

csun-cpointe commented Oct 28, 2024

Description

When manual starting the lineage enabled pyspark pipeline, got an Py4JError. The same issue also is found in 1.7.0 release.

2024/10/25 19:43:36 INFO PythonPipeline: STARTED: PythonPipeline driver
Creating the PipelineBase
2024/10/25 19:43:36 INFO LineageUtil: Recording lineage data...
Traceback (most recent call last):
  File "/opt/spark/jobs/pipelines/python-pipeline/python_pipeline_driver.py", line 30, in <module>
    PipelineBase().record_pipeline_lineage_start_event()
  File "/usr/local/lib/python3.11/dist-packages/python_pipeline/generated/pipeline/pipeline_base.py", line 88, in record_pipeline_lineage_start_event
    self._lineage_util.record_lineage(self._emitter, run_event)
  File "/usr/local/lib/python3.11/dist-packages/aissemble_data_lineage/util/lineage_util.py", line 228, in record_lineage
    emitter.emit_run_event(event)
  File "/usr/local/lib/python3.11/dist-packages/aissemble_data_lineage/emitter.py", line 83, in emit_run_event
    self.build_message_client()
  File "/usr/local/lib/python3.11/dist-packages/aissemble_data_lineage/emitter.py", line 63, in build_message_client
    self._emitter = MessagingClient(
                    ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aissemble_messaging/messaging_client.py", line 36, in __init__
    self.service_port = self.start_service_jvm()
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aissemble_messaging/messaging_client.py", line 170, in start_service_jvm
    return launch_gateway(
           ^^^^^^^^^^^^^^^
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 297, in launch_gateway
py4j.protocol.Py4JError: Could not find py4j jar at 

DoD

Resolve py4j jar not found issue

Steps to Reproduce

  1. Using create a new aissemble-based project using the latest archetype snapshot.
mvn archetype:generate '-DarchetypeGroupId=com.boozallen.aissemble' \
                           '-DarchetypeArtifactId=foundation-archetype' \
                           '-DarchetypeVersion=1.10.0-SNAPSHOT' \
                           '-DgroupId=org.test' \
                           '-Dpackage=org.test' \
                           '-DprojectGitUrl=test.org/test.git' \
                           '-DprojectName=Test pyspark lineage' \
                           '-DartifactId=test-lineage' \
    && cd test-lineage
  1. Set your Java version to 17 if it is not currently
  2. Under -model/src/main/resources/pipelines create PythonPipeline.json
    files with below content
{
 "name": "PythonPipeline",
 "package": "validation.test.project.pipeline",
 "type": {
   "name": "data-flow",
   "implementation": "data-delivery-pyspark"
 },
 "dataLineage": true,
 "steps": [
   {
     "name": "Step1",
     "type": "synchronous",
     "persist": {
       "type": "delta-lake"
     },
     "alerting": {
       "enabled": false
     },
     "provenance": {
       "enabled": false
     }
   }
 ]
}
  1. Fully generate the project by running mvn clean install and following manual actions
  2. Build the project without the cache and follow the last manual action.
    mvn clean install -Dmaven.build.cache.skipCache
  3. Deploy the project and wait for all services ready
    tilt up; tilt down
  4. Manually trigger the python-pipeline pod

Expected Behavior

  1. Verify that pipeline should start and complete without any error
  2. Log in to kafka to check message successfully received
  • Go into kafka container: kubectl exec -it kafka-cluster-0 -- sh
  • Check messages in the lineage-event-out topic: /opt/bitnami/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic lineage-event-out --from-beginning

Actual Behavior

Pipeline stopped because of below exectpion

2024/10/25 19:43:36 INFO PythonPipeline: STARTED: PythonPipeline driver
Creating the PipelineBase
2024/10/25 19:43:36 INFO LineageUtil: Recording lineage data...
Traceback (most recent call last):
  File "/opt/spark/jobs/pipelines/python-pipeline/python_pipeline_driver.py", line 30, in <module>
    PipelineBase().record_pipeline_lineage_start_event()
  File "/usr/local/lib/python3.11/dist-packages/python_pipeline/generated/pipeline/pipeline_base.py", line 88, in record_pipeline_lineage_start_event
    self._lineage_util.record_lineage(self._emitter, run_event)
  File "/usr/local/lib/python3.11/dist-packages/aissemble_data_lineage/util/lineage_util.py", line 228, in record_lineage
    emitter.emit_run_event(event)
  File "/usr/local/lib/python3.11/dist-packages/aissemble_data_lineage/emitter.py", line 83, in emit_run_event
    self.build_message_client()
  File "/usr/local/lib/python3.11/dist-packages/aissemble_data_lineage/emitter.py", line 63, in build_message_client
    self._emitter = MessagingClient(
                    ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aissemble_messaging/messaging_client.py", line 36, in __init__
    self.service_port = self.start_service_jvm()
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aissemble_messaging/messaging_client.py", line 170, in start_service_jvm
    return launch_gateway(
           ^^^^^^^^^^^^^^^
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 297, in launch_gateway
py4j.protocol.Py4JError: Could not find py4j jar at 

Additional Context

  • Log output
  • Screenshots (if applicable)
  • Solution Baseline Version
  • Environment details (local, cloud, Azure, AWS, etc.)
@csun-cpointe csun-cpointe added the bug Something isn't working label Oct 28, 2024
@csun-cpointe csun-cpointe self-assigned this Oct 28, 2024
@csun-cpointe
Copy link
Contributor Author

DoD completed with @carter-cundiff

@csun-cpointe csun-cpointe added this to the 1.10.0 milestone Oct 28, 2024
@csun-cpointe
Copy link
Contributor Author

OTS completed with @ewilkins-csi and @carter-cundiff

csun-cpointe added a commit that referenced this issue Oct 30, 2024
#441 set default py4j jar path for python messaging client
@csun-cpointe csun-cpointe reopened this Oct 30, 2024
@carter-cundiff
Copy link
Contributor

Testing passed:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants