Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with schemas under Redshift #18

Open
ldcasillas-progreso opened this issue Jul 7, 2014 · 8 comments
Open

Problems with schemas under Redshift #18

ldcasillas-progreso opened this issue Jul 7, 2014 · 8 comments

Comments

@ldcasillas-progreso
Copy link

I'm using cascading-jdbc-redshift:2.5.4-wip-83, and I can't get it to sink into my Redshift instance at all.

The table I'm trying to sink into is called etl_demo. If I just use that as the table name for the RedshiftTableDesc, and run the job, I get this error:

cascading.tap.TapException: SQL error code: 0 executing update statement: CREATE TABLE etl_demo ( date DATE, store BIGINT, handle_time INTEGER, count_calls INTEGER, avg_handle_time FLOAT )  DISTKEY (store)  SORTKEY (store, date) 
        at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:478)
        at cascading.jdbc.JDBCTap.createResource(JDBCTap.java:597)
        at cascading.jdbc.RedshiftTap.createResource(RedshiftTap.java:172)
        at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:395)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
        at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
        at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
        at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
        at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
        at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:498)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:300)
        at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:470)
        ... 22 more

So if I'm reading this right, Redshift doesn't like the CREATE TABLE statement because the user doesn't specify a schema to create the table in. If I try "public.etl_demo" as the table name for the RedshiftTableDesc, then I get this:

14/07/07 14:42:08 INFO jdbc.JDBCTap: executing update: CREATE TABLE public.etl_demo ( date DATE, store BIGINT, handle_time INTEGER, count_calls INTEGER, avg_handle_time FLOAT )  DISTKEY (store)  SORTKEY (store, date) 
14/07/07 14:42:09 INFO jdbc.JDBCTap: testing if table exists with DatabaseMetaData
14/07/07 14:42:09 INFO jdbc.JDBCTap: creating connection: jdbc:postgresql://something.example.com:5439/dev    
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [unable to create table: public.etl_demo]
    at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:265)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
    at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
    at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: cascading.tap.TapException: unable to create table: public.etl_demo
    at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:396)
    at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
    at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
    at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
    at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
    at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
    at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
    at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
    at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
    at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
    ... 9 more

Here the CREATE TABLE succeeds, but apparently the DatabaseMetaData check fails because it looks for a table with name public.etl_demo—but no such table exists.

And if I create the table by hand in Redshift, and use "etl_demo" in the RedshiftTableDesc, I get this (SinkMode.REPLACE):

14/07/07 14:54:42 INFO jdbc.JDBCTap: deleting table: etl_demo
14/07/07 14:54:42 INFO jdbc.JDBCTap: creating connection: jdbc:postgresql://something.example.com:5439/dev
14/07/07 14:54:42 INFO jdbc.JDBCTap: executing update: DROP TABLE etl_demo
14/07/07 14:54:43 WARN jdbc.JDBCTap: unable to drop table: etl_demo
14/07/07 14:54:43 WARN jdbc.JDBCTap: sql failure
org.postgresql.util.PSQLException: ERROR: table "etl_demo" does not exist
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:498)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:300)
        at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:470)
        at cascading.jdbc.JDBCTap.deleteResource(JDBCTap.java:623)
        at cascading.jdbc.RedshiftTap.deleteResource(RedshiftTap.java:183)
        at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:392)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
        at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
        at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
        at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
        at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
        at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [unable to drop table: etl_demo]
        at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:265)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
        at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: cascading.tap.TapException: unable to drop table: etl_demo
        at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:393)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
        at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
        at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
        at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
        at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
        ... 9 more

My take here is that it can't drop the table because it doesn't know which schema to look in, once more.

If I create the table by hand and use SinkMode.UPDATE instead then I get past that, only to fail later down when it tries an INSERT statement into the unqualified table name:

14/07/07 15:07:56 ERROR db.DBOutputFormat: unable to execute update batch [msglength: 207][totstmts: 453][crntstmts: 453][batch: 1000] Batch entry 0 INSERT INTO etl_demo (date,store,handle_time,count_calls,avg_
org.postgresql.util.PSQLException: ERROR: relation "etl_demo" does not exist
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
        at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1065)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:398)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2725)
        at cascading.jdbc.db.DBOutputFormat$DBRecordWriter.executeBatch(DBOutputFormat.java:113)
        at cascading.jdbc.db.DBOutputFormat$DBRecordWriter.close(DBOutputFormat.java:88)
        at cascading.jdbc.JDBCTapCollector.close(JDBCTapCollector.java:103)
        at cascading.flow.stream.SinkStage.cleanup(SinkStage.java:120)
        at cascading.flow.stream.StreamGraph.cleanup(StreamGraph.java:176)
        at cascading.flow.hadoop.FlowReducer.close(FlowReducer.java:157)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:471)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)

(Note that it's trying to INSERT into the Redshift table instead of staging into S3 and then using a COPY command, as it's supposed to. I don't know if this is a problem in my end.)

I think it may be possible to work around all of these issues by configuring the Redshift server to add a default search_path, but I don't think this should be necessary to use the tap.

Looking at commit 37364da, perhaps the TableDesc should allow you to specify a schema name in addition to the table name? The DatabaseMetaData call is using null for the schema name, so the following logic (from JDBCTap.java) doesn't work if a database has more than one table with the same name in different schemas:

  DatabaseMetaData dbm = connection.getMetaData();

  // [My comment: the second `null` means "in any schema", which means
  // that we can get false positives in this check.]
  tables = dbm.getTables( null, null, tableDesc.getTableName(), null );

  if ( tables.next() )
    return true;
@fs111
Copy link
Contributor

fs111 commented Jul 8, 2014

Just to be sure, can you try to run the tests against your redshift instance? I want to see if something is different compared to the one I used for testing.

You will need gradle 1.12 and you can run the redshift tests like so:

 gradle cascading-jdbc-redshift:build -i -Dcascading.jdbc.url.redshift=<your-redshift-instance>

Thanks!

@ldcasillas-progreso
Copy link
Author

The unit test gives me lots of errors like this one:

2014-07-08 08:06:36,392 INFO  jdbc.JDBCTap (JDBCTap.java:executeUpdate(466)) - executing update: CREATE TABLE testingtable ( num INT NOT NULL, lwr VARCHAR(100) NOT NULL, upr VARCHAR(100) NOT NULL ) 
2014-07-08 08:06:36,595 ERROR jdbc.JDBCTap (JDBCTap.java:createResource(601)) - unable to create table: testingtable
cascading.tap.TapException: SQL error code: 0 executing update statement: CREATE TABLE testingtable ( num INT NOT NULL, lwr VARCHAR(100) NOT NULL, upr VARCHAR(100) NOT NULL ) 
    at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:478)
    at cascading.jdbc.JDBCTap.createResource(JDBCTap.java:597)
    at cascading.jdbc.RedshiftTap.createResource(RedshiftTap.java:172)
    at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:395)
    at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
    at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
    at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
    at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
    at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
    at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
    at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
    at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
    at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:450)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:426)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:275)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:220)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:202)
    at cascading.jdbc.JDBCTestingBase.testJDBC(JDBCTestingBase.java:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:86)
    at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:49)
    at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:69)
    at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
    at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
    at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
    at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
    at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
    at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:105)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
    at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
    at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355)
    at org.gradle.internal.concurrent.DefaultExecutorFactory$StoppableExecutorImpl$1.run(DefaultExecutorFactory.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:498)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:300)
    at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:470)
    ... 64 more

@ldcasillas-progreso
Copy link
Author

Ooops, pressed the "Close and comment" button by accident. I've reopened it.

@fs111
Copy link
Contributor

fs111 commented Jul 9, 2014

I am unable to reproduce this. I started a new redshift cluster and ran the tests w/o any problems. Log is here:

https://gist.github.com/fs111/a953d7bd0c83aa85bece

Something must be different with your redshift cluster, which I am unable to reproduce.

@ldcasillas-progreso
Copy link
Author

Well, the hypothesis that leads me to is that your freshly created Redshift cluster has different settings than mine, in such a way that an unqualified table reference (tableName instead of schemaName.tableName) will succeed in your cluster, but fail in mine.

I will have a look at Redshift documentation to see which settings might affect that, but I think it's fair to say that the following items are problems:

  1. If I'm sourcing from a database that has two tables with the same name but in different schemas, I should be able to specify to the library which schema I mean.
  2. Users of the library should be able to source from and sink to tables that are not in the database user's default schema(s) or search path. For example, my database may be set up so it has a special schema called staging, and my application's database user may only have create and drop permissions on that schema.
  3. The logic I quoted above for the DatabaseMetaData existence check still is broken. See the Javadoc; metadata.getTables(null, null, tableName) means "get me all tables named tableName in all schemas". This will produce false positives if there is a table with that name in some schema that's not accessible to the user through an unqualified reference. (Note that the JDBC table existence check you're trying to do is tricky, and I don't know how to do it reliably and generically. To uniquely identify a table you need catalog + schema + table name. If all you have is table name, you need to figure out what schema that should be in... but JDBC didn't provide a [Connection#getSchema() method](http://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#getSchema%28%29%29 until Java 7, and the Postgres JDBC driver that Redshift uses doesn't implement this method.)

And note that these issues may well affect all the databases, not just Redshift.

So I think this is morphing into a more general "support specifying a database schema in TableDesc" issue.

@fs111
Copy link
Contributor

fs111 commented Jul 10, 2014

  1. Agreed, that is a limitation right now. The history of cascading-jdbc is that it gives you a quick and easy way to read or write from a DB. We do not encourage this as the one true way of processing on hadoop, since that is really a lot of overhead, but it will get you somewhere. This is not to say, we should not support it, I am just explaining why things are the way they are right now.
  2. Yes, that would be a great feature to have.
  3. Assuming this is a one-of jdbc user with one schema, it will work just fine. The more complex and even likely setup you describe is def. not supported. For Cascading 2.x we have to stay on Java 6, so we will have to work what JDBC has to offer in there.

All that being said, we are more than happy to take pull requests. I think extending it is a good idea, but we have a bit of a bandwidth problem. If you would like to give it a shot, let me know.

@Sicarus
Copy link

Sicarus commented Jul 7, 2016

I'm still running into issues using jdbc with redshift when trying to specify a schema before a table. It results in "schema does not exist". Any luck on a work around?

@fs111
Copy link
Contributor

fs111 commented Jul 8, 2016

@Sicarus This is still an open issue and requires internal changes in cascading-jdbc. I currently do not have the bandwidth to work on that, but if you are willing to give it a try, I can guide you through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants