Spark Scramble Appending #374
Thanks for letting us know. Right now, I'm leaning more toward a non-SQL approach for scramble creation. I found that Google Dataflow (backed by Apache Beam) could be a good solution for both generality and scalability. FYI, it can also easily work with files directly (as Dan suggested). Let me know what you think.
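A very rough sketch of what a Beam/Dataflow-based scramble build could look like (this is not VerdictDB's actual implementation; the bucket paths, schema, and block count below are assumptions for illustration): read the source files, tag each row with a random verdictdbblock, and write the scramble back out.

```python
# Illustrative sketch only: paths, schema, and NUM_BLOCKS are assumed.
import random

import apache_beam as beam
import pyarrow as pa
from apache_beam.options.pipeline_options import PipelineOptions

NUM_BLOCKS = 100  # assumed block count for a uniform scramble

# Output schema for the scramble; a real pipeline would mirror the source table.
SCHEMA = pa.schema([
    ("id", pa.int64()),
    ("value", pa.float64()),
    ("verdictdbblock", pa.int32()),
])

def assign_block(row):
    # Tag each row with a uniformly random block id.
    out = dict(row)
    out["verdictdbblock"] = random.randrange(NUM_BLOCKS)
    return out

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadSource" >> beam.io.ReadFromParquet("gs://example-bucket/source/*.parquet")
     | "AssignBlock" >> beam.Map(assign_block)
     | "WriteScramble" >> beam.io.WriteToParquet(
         "gs://example-bucket/scramble/part", schema=SCHEMA))
```

Run as-is this uses the local DirectRunner; pointing the pipeline options at the Dataflow runner is what would provide the scalability mentioned above.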
I agree that non-SQL is the way forward for this as much as possible.
https://beam.apache.org/documentation/io/built-in/
BigQuery is great as an option, although you will need to be aware of the query costs involved: unless a query is crafted well, they can spiral out of control pretty easily. When you have 60TB of data in there, this all of a sudden becomes an expensive query ($300 per execution; rough math below).
Avro & Parquet I/O should definitely also be considered. ORC will be important for us, but its development seems to have stalled, so we are good with BQ for now: https://issues.apache.org/jira/browse/BEAM-1861
That way you avoid any kind of 3rd-party processes, and the results can be moved to whatever query engine is required as a simple file transfer.
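For context on that figure: the $300 per execution quoted above is consistent with BigQuery's on-demand rate of roughly $5 per TB scanned (the published rate at the time; treat it as an assumption here) applied to a query that scans the full 60TB table.

```python
# Back-of-envelope check of the $300-per-query figure quoted above.
ON_DEMAND_USD_PER_TB = 5.0   # assumed BigQuery on-demand rate (USD per TB scanned)
TB_SCANNED = 60.0            # a query that scans the full 60TB table
print(f"estimated cost: ${ON_DEMAND_USD_PER_TB * TB_SCANNED:.0f}")  # -> estimated cost: $300
```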
For BQ, I think the Verdict team will use flat-rate pricing ($10K/month) in the future, so its users can basically run unlimited queries on unlimited-scale data. That said, this is still down the road.
I don't know how that would work; custodial transfer of data isn't a good idea! It should be cheap enough in BQ vs Spark if the queries are designed correctly.
Your datasets in BQ can be accessed by a third-party service if you grant access: https://cloud.google.com/bigquery/docs/share-access-views. I'm double-checking a billing-related question, i.e., which party is charged for queries.
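For reference, that kind of dataset-level grant can be made through the BigQuery client library; a small sketch (the project, dataset, and service-account names are made up):

```python
# Sketch of granting a third party read access to a dataset (names are made up).
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("example-project.example_dataset")

# Append a read-only access entry for the third-party service account.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="verdictdb-service@example-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```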
I think for data provenance reasons this won't be possible, especially in light of the new privacy laws coming in from California. We are happy to pay for the processing of our own data; we just need the tools to be able to do it :)
@voycey It would be great if you could elaborate on the data provenance issue. My naive thought was that if Google allows the operation (e.g., viewing data without copying it), it must be legal (under the assumption that the data provider grants access).
Regarding partitioning, I think the easiest way is to let you specify partition columns for scrambles. Currently, scrambles inherit the partitioning of the original table (so the number of partitions explodes in combination with verdictdbblock). I believe a combination of (date, verdictdbblock) should be good enough for ensuring speed (without state). In the future, we can reduce the total number of verdictdbblock values down to 10 or so by introducing a cleverer scheme, but it will take some time (since we are transitioning...)
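To make the (date, verdictdbblock) layout concrete, here is a rough PySpark illustration (not VerdictDB's actual scramble builder; the table and column names and the block count are assumptions):

```python
# Rough illustration of a scramble laid out by (date, verdictdbblock);
# table/column names and the block count are assumptions, not VerdictDB's API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scramble-partition-sketch").getOrCreate()

NUM_BLOCKS = 100  # could later be reduced to ~10, as discussed above

src = spark.table("source_table")  # hypothetical source table with a `date` column
scramble = src.withColumn("verdictdbblock", (F.rand() * NUM_BLOCKS).cast("int"))

(scramble.write
    .mode("overwrite")
    .partitionBy("date", "verdictdbblock")  # the (date, verdictdbblock) layout
    .saveAsTable("source_table_scramble"))
```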
There are many use cases where this would be great. However, when dealing with user information, allowing 3rd-party access to that data not only opens up a secondary attack vector, you are also technically sharing information without notifying the users, which would breach GDPR and the CCPA (I'm sure there are loads of other reasons why this wouldn't be allowed as well, but that is the first one that pops into my head).
I think if we can just handle the partitioning by date that will be great for this; as the data is only 5% of the total, there is little need for a fully granular partitioning scheme.
In the meantime, generating scrambles (in whatever fashion) in BigQuery would be great: BQ simply handles the capacity requirements, and they would probably complete very quickly!
I haven't tested scramble creation using Spark, but appending to a scramble generates an invalid query.
The following error happens..
caused by the following query (stripped out some columns to reduce the length)..
This is because Spark doesn't support column lists in insert statements.
So instead of an INSERT statement with an explicit column list, we need a plain INSERT INTO whose SELECT produces the columns in the target table's order (see the generic sketch below).
I've currently hacked around it to get it working temporarily.
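A generic PySpark illustration of the two forms (the table and column names are made up; the exact queries from this issue are not reproduced here). Older Spark SQL versions reject the explicit column list, so the SELECT has to supply the columns in the target table's order:

```python
# Generic illustration of the limitation; table/column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insert-column-list-sketch").getOrCreate()

# Rejected by Spark SQL versions without column-list support in INSERT:
# spark.sql("""
#     INSERT INTO scramble_table (col_a, col_b, verdictdbblock)
#     SELECT col_a, col_b, verdictdbblock FROM staging_table
# """)

# Accepted: no column list; the SELECT supplies columns in the target table's order.
spark.sql("""
    INSERT INTO scramble_table
    SELECT col_a, col_b, verdictdbblock FROM staging_table
""")
```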