missing jar files #100

Open
ami07 opened this issue May 20, 2019 · 11 comments
Comments


ami07 commented May 20, 2019

Hi,
I am trying to generate Spark code, but sbt fails with the following error:
[info] Loading project definition from /home/ec2-user/dbtoaster/dbtoaster-backend/project
[info] Set current project to dbtoaster (in build file:/home/ec2-user/dbtoaster/dbtoaster-backend/)
[info] Updating {file:/home/ec2-user/dbtoaster/dbtoaster-backend/}lms...
[info] Resolving EPFL#lms_2.11;0.3-SNAPSHOT ...
[warn] module not found: EPFL#lms_2.11;0.3-SNAPSHOT
[warn] ==== local: tried
[warn] /home/ec2-user/.ivy2/local/EPFL/lms_2.11/0.3-SNAPSHOT/ivys/ivy.xml
[warn] ==== local-preloaded-ivy: tried
[warn] /home/ec2-user/.sbt/preloaded/EPFL/lms_2.11/0.3-SNAPSHOT/ivys/ivy.xml
[warn] ==== local-preloaded: tried
[warn] file:////home/ec2-user/.sbt/preloaded/EPFL/lms_2.11/0.3-SNAPSHOT/lms_2.11-0.3-SNAPSHOT.pom
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/EPFL/lms_2.11/0.3-SNAPSHOT/lms_2.11-0.3-SNAPSHOT.pom
[warn] ==== sonatype-snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/EPFL/lms_2.11/0.3-SNAPSHOT/lms_2.11-0.3-SNAPSHOT.pom
[info] Resolving jline#jline;2.12 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: EPFL#lms_2.11;0.3-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] EPFL:lms_2.11:0.3-SNAPSHOT (/home/ec2-user/dbtoaster/dbtoaster-backend/ddbtoaster/lms/build.sbt#L10-36)
[warn] +- ch.epfl.data:dbtoaster-lms_2.11:3.0
sbt.ResolveException: unresolved dependency: EPFL#lms_2.11;0.3-SNAPSHOT: not found

So sbt cannot resolve the dependency declared in build.sbt: "EPFL" %% "lms" % "0.3-SNAPSHOT"
How can I fix this?
P.S. The unit tests also fail with the same error when I run them.

Contributor

losmi83 commented May 20, 2019

You can try this:

git clone https://github.com/epfldata/lms.git
cd lms
git checkout booster-develop-0.3
git apply compiler_patch.txt
sbt publish-local

Save the attached compiler_patch.txt into the lms directory before applying it; it adds support for Scala 2.11.

Now you should be able to generate code in dbtoaster-backend. If not, let me know.
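
One way to confirm that the publish step worked is to look for the Ivy descriptor at the local path that sbt listed under "==== local: tried" in the log above. This is only a minimal sketch: the function name `lms_published` is hypothetical, and the default Ivy home of `~/.ivy2` is an assumption taken from that log.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the LMS snapshot is in the local Ivy repository.
# The relative path is copied from the "==== local: tried" entry in the sbt log.
lms_published() {
    ivy_home="${1:-$HOME/.ivy2}"   # assumption: default Ivy home unless one is passed in
    [ -f "$ivy_home/local/EPFL/lms_2.11/0.3-SNAPSHOT/ivys/ivy.xml" ]
}

if lms_published; then
    echo "lms_2.11 0.3-SNAPSHOT found in the local Ivy repository"
else
    echo "lms_2.11 0.3-SNAPSHOT not found; re-run sbt publish-local in the lms checkout"
fi
```

If the file is missing, the dependency resolution in dbtoaster-backend would be expected to fail in the same way as before.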

Note: You can generate Spark code only for TPC-H queries.

compiler_patch.txt

Author

ami07 commented May 20, 2019

Thank you very much.
I am now able to generate code for some of the example queries.

For a simplified TPCH query that I created (joins lineitem, ), I got this error:
Fatal error: exception Failure("Missing partitioning information for COUNTLINEITEM1_DELTA")

One other thing confuses me. In the README for the generated file, step 4 is: Compile the generated Spark program for the target execution environment.
Does this mean that I should create a new project with a build.sbt that includes the jar files from the lms directory you shared above, or that I should copy the generated code somewhere under dbtoaster-backend/ddbtoaster/lms/ and use the build.sbt there?
Thanks

Contributor

losmi83 commented May 21, 2019

> For a simplified TPCH query that I created (joins lineitem, ), I got this error:
> Fatal error: exception Failure("Missing partitioning information for COUNTLINEITEM1_DELTA")

This version does not support custom queries over the TPC-H schema. I have just updated the frontend to allow that.

cd dbtoaster-a5
git pull
make

You should now be able to generate code for custom TPC-H queries.

> One other thing confuses me. In the README for the generated file, step 4 is: Compile the generated Spark program for the target execution environment.
> Does this mean that I should create a new project with a build.sbt that includes the jar files from the lms directory you shared above, or that I should copy the generated code somewhere under dbtoaster-backend/ddbtoaster/lms/ and use the build.sbt there?

To compile the generated code, you need both the Spark jar files and the DBToaster runtime jar files on the classpath. The DBToaster jars are in the distribution under dbtoaster/lib/dbt_scala. Even better, run sbt release and you will find the latest DBToaster jar files under ddbtoaster/release/lib/dbt_scala.
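
As a rough illustration of the step above, the DBToaster runtime jars can be joined into a classpath string for compiling the generated file. This is a sketch under stated assumptions, not the project's documented procedure: the helper `jar_classpath` is hypothetical, and the `scalac` call in the trailing comment assumes a generated file named `Query.scala` and a Spark installation that keeps its jars under `$SPARK_HOME/jars`.

```shell
#!/bin/sh
# Hypothetical helper: print every *.jar file in a directory, joined by ':'.
jar_classpath() {
    classpath=""
    for jar in "$1"/*.jar; do
        [ -e "$jar" ] || continue          # skip the unexpanded pattern when no jar exists
        classpath="${classpath:+$classpath:}$jar"
    done
    printf '%s\n' "$classpath"
}

# Directory named in the comment above (populated by `sbt release`).
DBT_LIB="${DBT_LIB:-ddbtoaster/release/lib/dbt_scala}"
echo "DBToaster classpath: $(jar_classpath "$DBT_LIB")"

# Assumed usage (file name and Spark layout are placeholders):
#   scalac -classpath "$(jar_classpath "$DBT_LIB"):$(jar_classpath "$SPARK_HOME/jars")" Query.scala
```

The same jar list could instead be declared as unmanaged dependencies in a separate sbt project; which route is more convenient depends on the target environment.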

Author

ami07 commented May 30, 2019

Thank you very much for your reply, Milos.
I am finally able to compile the code. Since the unit tests called in run_spark_weak_experiments.sh and run_spark_strong_experiments.sh (which I found in dbtoaster-backend/ddbtoaster/scripts) do not work, I packaged the generated code into a jar file and am using spark-submit to execute it on the cluster.
I am now facing two issues. First, from the Spark scripts, I understand that the arguments used when calling the Spark code are: -xvm -xsc -p 2 -s 1 -w 0 -d 500gb -O3 --batch -xbs 10000000 -x -F HEURISTICS-DECOMPOSE-OVER-TABLES
I cannot find where these are explained. Second, the error I am getting right now is related to the paths set in the configuration file.

I tried creating a conf directory (similar to the one in dbtoaster-backend/ddbtoaster/spark/conf), adding a configuration file with those paths changed, and packaging it into the jar. However, it seems that the configuration parameters are set somewhere else.

I thought I would ask before changing the generated code to hard-code the location of the conf file.

Contributor

losmi83 commented May 30, 2019

See ddbtoaster/spark/conf/spark.config for the various Spark configuration parameters.

Author

ami07 commented May 30, 2019

Yes, I created a similar configuration file in my Spark project and changed the configuration parameters (the paths) to point to where my data and outputs should be. However, for some reason, when I execute the jar file I built, it still uses the default values from spark.config. So I assume these defaults also come from one of the DBToaster libraries that I included when packaging my jar file.
I think I am on the right track, so I will find a way to read the correct conf file.
Thanks

Author

ami07 commented Jun 17, 2019

Hi,
I have changed the generated code to read the conf parameters through a custom function, since the one referenced in the DBToaster libraries kept crashing. The Spark code now seems to run; however, the jobs that actually process the data fail.
I am attaching screenshots from the query execution (the query joins 3 TPCH tables). Any advice on what could be causing the error or what I might be doing wrong? The job details do not say much; they only refer to an empty queue!
Thanks
DBToasterSparkAllJobs
DBToasterSparkFailingJob9
Fq4_spark - Details for Job 9_failed.pdf

Contributor

losmi83 commented Jun 25, 2019

Hard to tell. You could try reshuffling your input so that each partition is non-empty.

Author

ami07 commented Jun 25, 2019

Thanks for your reply, Milos. My understanding is that the data are just the CSV files generated by the TPCH data generator. Or am I missing something?
I made that assumption because of this line in the README file in experiments/datasets: "Put your datasets here (e.g., 1GB/lineitem.csv) "

Contributor

losmi83 commented Jun 25, 2019

Yes, the input is standard TPCH files. If I remember correctly, the code expects the input to be randomly distributed across all nodes to avoid data skew. This would explain your error: it might be that all your data is stored in one partition while the others are empty. I would suggest running rdd.repartition after loading your input.

Author

ami07 commented Jun 25, 2019

I am storing the data in HDFS, so it should already be partitioned among the nodes of the cluster. I will try rdd.repartition.
Thanks
