Spark Scramble Appending #374
Thanks for letting us know. Right now, I'm leaning more toward a non-SQL approach for scramble creation. I found that Google Dataflow (backed by Apache Beam) could be a good solution for both generality and scalability. FYI, it can also easily work with files directly (as Dan suggested). Let me know what you think.
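A very rough sketch of what a Beam/Dataflow-based scramble build could look like (this is not VerdictDB's actual implementation; the bucket paths, schema, and block count below are assumptions for illustration): read the source files, tag each row with a random verdictdbblock, and write the scramble back out.

```python
# Illustrative sketch only: paths, schema, and NUM_BLOCKS are assumed.
import random

import apache_beam as beam
import pyarrow as pa
from apache_beam.options.pipeline_options import PipelineOptions

NUM_BLOCKS = 100  # assumed block count for a uniform scramble

# Output schema for the scramble; a real pipeline would mirror the source table.
SCHEMA = pa.schema([
    ("id", pa.int64()),
    ("value", pa.float64()),
    ("verdictdbblock", pa.int32()),
])

def assign_block(row):
    # Tag each row with a uniformly random block id.
    out = dict(row)
    out["verdictdbblock"] = random.randrange(NUM_BLOCKS)
    return out

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadSource" >> beam.io.ReadFromParquet("gs://example-bucket/source/*.parquet")
     | "AssignBlock" >> beam.Map(assign_block)
     | "WriteScramble" >> beam.io.WriteToParquet(
         "gs://example-bucket/scramble/part", schema=SCHEMA))
```

Run as-is this uses the local DirectRunner; pointing the pipeline options at the Dataflow runner is what would provide the scalability mentioned above.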
I agree that non-SQL is the way forward for this as much as possible.
https://beam.apache.org/documentation/io/built-in/
BigQuery is great as an option, although you will need to be aware of the query costs involved: unless a query is crafted well, they can spiral out of control pretty easily. When you have 60TB of data in there, this all of a sudden becomes an expensive query ($300 per execution; rough math below).
Avro & Parquet I/O should definitely also be considered. ORC will be important for us, but its development seems to have stalled, so we are good with BQ for now: https://issues.apache.org/jira/browse/BEAM-1861
That way you avoid any kind of 3rd-party processes, and the results can be moved to whatever query engine is required as a simple file transfer.
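For context on that figure: the $300 per execution quoted above is consistent with BigQuery's on-demand rate of roughly $5 per TB scanned (the published rate at the time; treat it as an assumption here) applied to a query that scans the full 60TB table.

```python
# Back-of-envelope check of the $300-per-query figure quoted above.
ON_DEMAND_USD_PER_TB = 5.0   # assumed BigQuery on-demand rate (USD per TB scanned)
TB_SCANNED = 60.0            # a query that scans the full 60TB table
print(f"estimated cost: ${ON_DEMAND_USD_PER_TB * TB_SCANNED:.0f}")  # -> estimated cost: $300
```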
For BQ, I think the Verdict team will use flat-rate pricing ($10K/month) in the future, so its users can basically run unlimited queries on unlimited-scale data. That said, this is still down the road.
I don't know how that would work; custodial transfer of data isn't a good idea! It should be cheap enough in BQ vs Spark if the queries are designed correctly.
Your datasets in BQ can be accessed by a third-party service if you grant access: https://cloud.google.com/bigquery/docs/share-access-views. I'm double-checking a billing-related question, i.e., which party is charged for queries.
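For reference, that kind of dataset-level grant can be made through the BigQuery client library; a small sketch (the project, dataset, and service-account names are made up):

```python
# Sketch of granting a third party read access to a dataset (names are made up).
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("example-project.example_dataset")

# Append a read-only access entry for the third-party service account.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="verdictdb-service@example-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```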
I think for data provenance reasons this won't be possible, especially in light of the new privacy laws coming in from California. We are happy to pay for the processing of our own data; we just need the tools to be able to do it :)
@voycey It would be great if you could elaborate on the data provenance issue. My naive thought was that if Google allows the operation (e.g., viewing data without copying it), it must be legal (under the assumption that the data provider grants access).
Regarding partitioning, I think the easiest way is to let you specify partition columns for scrambles. Currently, scrambles inherit the partitioning of the original table (so the number of partitions explodes in combination with verdictdbblock). I believe a combination of (date, verdictdbblock) should be good enough for ensuring speed (without state). In the future, we can reduce the total number of verdictdbblock values down to 10 or so by introducing a cleverer scheme, but it will take some time (since we are transitioning...)
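To make the (date, verdictdbblock) layout concrete, here is a rough PySpark illustration (not VerdictDB's actual scramble builder; the table and column names and the block count are assumptions):

```python
# Rough illustration of a scramble laid out by (date, verdictdbblock);
# table/column names and the block count are assumptions, not VerdictDB's API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scramble-partition-sketch").getOrCreate()

NUM_BLOCKS = 100  # could later be reduced to ~10, as discussed above

src = spark.table("source_table")  # hypothetical source table with a `date` column
scramble = src.withColumn("verdictdbblock", (F.rand() * NUM_BLOCKS).cast("int"))

(scramble.write
    .mode("overwrite")
    .partitionBy("date", "verdictdbblock")  # the (date, verdictdbblock) layout
    .saveAsTable("source_table_scramble"))
```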
There are many use cases where this would be great. However, when dealing with user information, allowing 3rd-party access to that data not only opens up a secondary attack vector, you are also technically sharing information without notifying the users, which would breach GDPR and the CCPA (I'm sure there are loads of other reasons why this wouldn't be allowed as well, but that is the first one that pops into my head).
I think if we can just handle the partitioning by date that will be great for this; as the data is only 5% of the total, there is little need for a fully granular partitioning scheme.
In the meantime, generating scrambles (in whatever fashion) in BigQuery would be great: BQ simply handles the capacity requirements, and they would probably complete very quickly!
I haven't tested scramble creation using Spark, but appending to a scramble generates an invalid query.
The following error happens..
caused by the following query (stripped out some columns to reduce the length)..
This is because Spark doesn't support column lists in insert statements.
So instead of an INSERT statement with an explicit column list, we need a plain INSERT INTO whose SELECT produces the columns in the target table's order (see the generic sketch below).
I've currently hacked around it to get it working temporarily.
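A generic PySpark illustration of the two forms (the table and column names are made up; the exact queries from this issue are not reproduced here). Older Spark SQL versions reject the explicit column list, so the SELECT has to supply the columns in the target table's order:

```python
# Generic illustration of the limitation; table/column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insert-column-list-sketch").getOrCreate()

# Rejected by Spark SQL versions without column-list support in INSERT:
# spark.sql("""
#     INSERT INTO scramble_table (col_a, col_b, verdictdbblock)
#     SELECT col_a, col_b, verdictdbblock FROM staging_table
# """)

# Accepted: no column list; the SELECT supplies columns in the target table's order.
spark.sql("""
    INSERT INTO scramble_table
    SELECT col_a, col_b, verdictdbblock FROM staging_table
""")
```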