Controller not listening to Pub/Sub commands #4

Open
abaumann opened this issue May 22, 2018 · 9 comments

@abaumann
Contributor

This is a great and well-documented example. Much appreciated!

I am having trouble, however, with the "Run a verification job" step. When I publish the command=start_gcs_import message to the indexercommands topic, no new input job is created. I tried the cron jobs to see if those were working, and I can see a new job created for startstatscalc only, not for the others. I feel like I must have messed up some config step along the way.

Do you have tips for how to debug what is happening? I am not seeing any sort of logging to help me debug, but that's where I'd think to look first. It feels like the controller must not be listening for the topics correctly though...

Thanks!

@datancoffee
Contributor

Yes, sorry, the controller pipeline has not yet been updated after the recent upgrade from a pre-2.x SDK to the 2.2 SDK.

For the time being, launch the IndexerPipeline directly using the example in the Release Notes for version 0.6.4: https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/releases/tag/v0.6.4

e.g. using

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID \
...

I will make a note in the README not to run the controller pipeline but to start the IndexerPipeline instead.

@abaumann
Contributor Author

abaumann commented May 22, 2018

So I tried modifying run_controljob.sh to use IndexerPipeline as the mainClass, but I'm getting this error:

	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Class interface com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipelineOptions missing a property named 'controlPubsub'.
	at org.apache.beam.sdk.options.PipelineOptionsFactory.parseObjects(PipelineOptionsFactory.java:1579)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.access$400(PipelineOptionsFactory.java:104)
	at org.apache.beam.sdk.options.PipelineOptionsFactory$Builder.as(PipelineOptionsFactory.java:291)
	at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main(IndexerPipeline.java:110)
	... 6 more

Maybe it needs some additional flags that aren't in those release notes?

If you are adding this to the README, I can check it there.
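
The exception above appears to mean that a --controlPubsub flag was passed to a pipeline whose options interface (IndexerPipelineOptions) does not declare it, so the likely cause is an extra flag rather than a missing one. As a rough sketch for spotting such leftovers, the helpers below list flag names that occur in one argument string but not in another. The function names `flag_names` and `extra_flags` and the sample argument strings are hypothetical and do not come from the repository.

```shell
# Print the names of the --name=value flags in an argument string,
# one per line, sorted and de-duplicated.
flag_names() {
  printf '%s\n' "$1" | tr -s '[:space:]' '\n' \
    | sed -n 's/^--\([A-Za-z0-9_]*\)=.*/\1/p' | sort -u
}

# Print the flags present in the first argument string but absent
# from the second.
extra_flags() {
  known=$(flag_names "$2")
  for flag in $(flag_names "$1"); do
    printf '%s\n' "$known" | grep -qxF -- "$flag" || printf '%s\n' "$flag"
  done
}

# Hypothetical argument strings, for illustration only.
control_args="--project=demo --controlPubsub=true --inputFile=gs://bucket/input/a.txt"
indexer_args="--project=demo --inputFile=gs://bucket/input/a.txt"

# Flags that the second command line does not use.
extra_flags "$control_args" "$indexer_args"
```

Any name printed this way is a candidate to remove from the script before switching the main class to IndexerPipeline.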

@datancoffee
Contributor

Just insert something like this into the shell script, and make sure the $ variables are set. (This statement is from the release notes.)

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID \
--runner=DataflowRunner \
--maxNumWorkers=10 \
--workerMachineType=n1-standard-2 \
--stagingLocation=gs://$GCS_BUCKET/staging/ \
--tempLocation=gs://$GCS_BUCKET/temp/ \
--streaming=false \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--bigQueryDataset=opinions \
--writeTruncate=true \
--processedUrlHistorySec=130000 \
--wrSocialCountHistoryWindowSec=610000 \
--ratioEnrichWithCNLP=0 \
--sourceRecordFile=true \
--inputFile=gs://$GCS_BUCKET/input/*.txt \
--indexAsShorttext=false \
"
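
The command above depends on $PROJECT_ID and $GCS_BUCKET being set. If either is empty, the arguments still expand, only to unusable values such as gs:///staging/. A minimal sketch of a guard that could run before the mvn call is shown here; the helper name `require_vars` is hypothetical and is not part of the repository's scripts.

```shell
# Hypothetical helper: return non-zero if any named variable is unset
# or empty, and report each missing name on stderr.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Intended use at the top of the launch script, before the mvn command:
#   require_vars PROJECT_ID GCS_BUCKET || exit 1
```

Failing early this way gives a clearer message than an error from a half-built argument string.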

@abaumann
Contributor Author

Oh, actually that worked. I thought I had looked closely enough to confirm that the flags there matched the existing template, but I guess not. Thanks!

@datancoffee
Contributor

Glad to hear it, and sorry for the issue with the controller. I meant to fix it earlier.

@abaumann
Contributor Author

No problem. I just wanted a working baseline that I can modify to handle our use case, so now that I've seen this working, I'm good. The only other issue I noticed, by the way, is with the instructions to run:

SELECT * FROM opinions.sentiment 
ORDER BY DocumentTime DESC
LIMIT 100

Since this table contains repeated fields, we get an error like this:
Cannot query the cross product of repeated fields Signals and Tags.GoodAsTopic

I really hate that SELECT * doesn't do some default flattening in BQ. I didn't do the work to come up with a query and just used the preview pane instead.

Thanks!

@datancoffee
Contributor

Uncheck the "Use Legacy SQL" checkbox so that BigQuery uses Standard SQL for that query. You can find the checkbox under "Show Options".

@datancoffee
Contributor

But, good point: I will add the #standardSQL tag to the query so that this sample is easier to use.
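
For reference, a sketch of what the tagged query might look like when held in a shell variable. The query text is the one quoted earlier in this thread with #standardSQL placed on its first line. The commented bq invocations assume the bq command-line tool is installed and authenticated, which this thread does not confirm.

```shell
# Query text with the #standardSQL tag on its first line, so that
# BigQuery uses Standard SQL for it.
QUERY='#standardSQL
SELECT * FROM opinions.sentiment
ORDER BY DocumentTime DESC
LIMIT 100'

# Show the query that would be submitted.
printf '%s\n' "$QUERY"

# To run it from the command line (assumption: bq CLI is set up):
#   bq query "$QUERY"
# Alternatively, pick the dialect with a flag instead of the tag:
#   bq query --use_legacy_sql=false 'SELECT ...'
```

The tag has to be the first line of the query text, so keep it ahead of any other comments.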

@abaumann
Contributor Author

🎆 I did not know that unchecking Legacy SQL would fix this. Thanks for that tip! 🎆
