BUG: Pyspark pipeline can't run the pipeline with split records #420

csun-cpointe · 2024-10-18T16:32:20Z

Description

After creating a PySpark pipeline project with combined data records and switching it to split data records, the pipeline can't start because of a split data record dependency error.

Steps to Reproduce


Create a new project on 1.10.0-SNAPSHOT.

mvn archetype:generate '-DarchetypeGroupId=com.boozallen.aissemble' \
                       '-DarchetypeArtifactId=foundation-archetype' \
                       '-DarchetypeVersion=1.10.0-SNAPSHOT' \
                       '-DgroupId=org.test' \
                       '-Dpackage=org.test' \
                       '-DprojectGitUrl=test.org/test.git' \
                       '-DprojectName=Test combine records' \
                       '-DartifactId=test-combine-records' \
&& cd test-combine-records

Set your Java version to 17 if it is not already.
Unzip resources.zip and replace the resources folder in the -pipeline-models/src/main/ directory.
Fully generate the project by running mvn clean install and following the manual actions.
Unzip krausening.zip and replace the krausening folder in the -docker/test-combine-record-spark-worker-docker/src/main/resources directory.
Build the project without the cache and follow the last manual action.
```
mvn clean install -Dmaven.build.cache.skipCache
```
In the -shared/pom.xml, use the aissemble-data-records-separate-module profile for split records:

      <configuration>
          <basePackage>com.boozallen</basePackage>
 -        <profile>aissemble-data-records-combined-module</profile>
 +        <profile>aissemble-data-records-separate-module</profile>
      </configuration>
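
For repeated reproduction runs, the profile switch above can be scripted. The following is a minimal Python sketch and not part of aiSSEMBLE: the function name `switch_records_profile` is hypothetical, and it assumes the profile element appears in the POM exactly as written in the snippet above.

```python
COMBINED = "aissemble-data-records-combined-module"
SEPARATE = "aissemble-data-records-separate-module"


def switch_records_profile(pom_text: str) -> str:
    """Return pom_text with the combined-records profile replaced by the separate one.

    Hypothetical helper: it does a plain string replacement and does not parse the XML.
    """
    old = f"<profile>{COMBINED}</profile>"
    new = f"<profile>{SEPARATE}</profile>"
    if old not in pom_text:
        # Fail loudly rather than silently leaving the POM unchanged.
        raise ValueError("combined-module profile not found; the POM may already be switched")
    return pom_text.replace(old, new)
```

To apply it, read the shared module's pom.xml with `pathlib.Path.read_text`, pass the text through the function, and write the result back with `Path.write_text`.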

Build the project without the cache and follow the last manual action.
```
mvn clean install -Dmaven.build.cache.skipCache
```
In the spark-pipeline/pom.xml, update the data-record artifact name

        <dependency>
            <groupId>${project.groupId}</groupId>
-           <artifactId>test-combine-record-data-records-java</artifactId>
+           <artifactId>test-combine-record-data-records-spark-java</artifactId>
            <version>${project.version}</version>
        </dependency>

In the pyspark-pipeline/pom.xml, update the data-record artifact name

        <dependency>
            <groupId>${project.groupId}</groupId>
-           <artifactId>test-combine-record-data-records-python</artifactId>
+           <artifactId>test-combine-record-data-records-spark-python</artifactId>
            <version>${project.version}</version>
        </dependency>

In the pyspark-pipeline/pyproject.toml, update the test-combine-record-data-records-python dependency package name to include spark, as follows:

    test-combine-record-data-records-spark-python = {path = "../../test-combine-record-shared/test-combine-record-data-records-spark-python", develop = true}
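
Because the rename touches three build files, a leftover reference to an old artifact name is easy to miss. The sketch below is one way to scan for such leftovers. It is an illustration only: `find_stale_references` is a hypothetical helper, the old names are copied from the diffs above, and a match is a hint to inspect rather than proof of a misconfiguration.

```python
from pathlib import Path

# Old artifact names, copied from the "-" lines of the diffs above.
OLD_NAMES = (
    "test-combine-record-data-records-java",
    "test-combine-record-data-records-python",
)


def find_stale_references(root, old_names=OLD_NAMES):
    """Yield (path, line_number, line) for build-file lines that still use an old name."""
    for path in sorted(Path(root).rglob("*")):
        # Only Maven and Poetry build files are inspected.
        if not path.is_file() or path.name not in ("pom.xml", "pyproject.toml"):
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if any(name in line for name in old_names):
                yield path, number, line.strip()
```

Running `list(find_stale_references("."))` from the project root returns an empty list when no build file mentions the old names. The new names are not flagged, because inserting `spark-` means the old names are not substrings of them.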

Build the project without the cache and follow the last manual action.
```
mvn clean install -Dmaven.build.cache.skipCache
```
Run tilt up to start all services.

Expected Behavior

All services are running in a ready state.

Actual Behavior

spark-worker-image fails with a split data record dependency error.



csun-cpointe added the bug label on Oct 18, 2024
