STRUCT support in Verdict for scrambles #354

Open
voycey opened this issue Mar 14, 2019 · 5 comments

Comments

@voycey

voycey commented Mar 14, 2019

Hi guys,

One of our tables has recently started receiving data in the form of a struct (array / row).

For example:

{city=Jackson, state=WY, zip=83001, county=Teton, msa=null, country=US} 

{city=Cheyenne, state=WY, zip=82001, county=Laramie, msa=null, country=US}

{city=Gillette, state=WY, zip=82718, county=Campbell, msa=null, country=US}

I was wondering how Verdict builds its scrambles based on this kind of data? Is this a data structure you actively support? Would each of the internal items be capable of producing fast aggregations?

For example:

SELECT COUNT(DISTINCT Location.city) FROM table

Our scramble performance has dropped significantly, but we aren't sure whether this is related to the new struct column.

@pyongjoo
Member

VerdictDB should just work. One possible reason for the slowdown is that columnar formats may not be very efficient for such data types.

If you can load sample data into the cluster, we may be able to test them.

@pyongjoo
Member

@dongyoungy Can you ask someone to investigate this by comparing different compression formats for our scramble tables? Maybe we can try different formats (e.g., ORC or parquet) with different compression schemes.

@voycey
Author

voycey commented Mar 14, 2019

I'm not sure about the internals, but I agree that structs in a columnar format are probably not ideal. They seem to be the preferred approach in BigQuery (where this data originated). We are considering flattening them out as a last resort, but we would prefer to get some information on exactly how Verdict handles this before we do anything drastic :)
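
For reference, a minimal sketch of what "flattening them out" could mean, assuming each row is available as a nested record. The `id` column, the `location_` prefix, and the `flatten` helper are hypothetical names chosen for illustration; the struct values are the three sample rows quoted above. This says nothing about how VerdictDB treats structs internally. It only shows the shape of the flattened table and the equivalent of the distinct-city count on it.

```python
# Hypothetical rows: "id" is an assumed key column; "location" mirrors the
# struct values quoted earlier in this thread.
rows = [
    {"id": 1, "location": {"city": "Jackson", "state": "WY", "zip": "83001",
                           "county": "Teton", "msa": None, "country": "US"}},
    {"id": 2, "location": {"city": "Cheyenne", "state": "WY", "zip": "82001",
                           "county": "Laramie", "msa": None, "country": "US"}},
    {"id": 3, "location": {"city": "Gillette", "state": "WY", "zip": "82718",
                           "county": "Campbell", "msa": None, "country": "US"}},
]


def flatten(record, sep="_"):
    """Promote every struct field to a top-level column named parent<sep>child."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so that nested structs are flattened as well.
            for sub_key, sub_value in flatten(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


flat_rows = [flatten(r) for r in rows]

# Equivalent of COUNT(DISTINCT location.city) on the flattened column,
# ignoring NULLs as SQL does.
distinct_cities = len(
    {r["location_city"] for r in flat_rows if r["location_city"] is not None}
)
```

After flattening, each former struct field is an ordinary scalar column (`location_city`, `location_state`, and so on), which is the layout columnar formats generally handle best.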

@pyongjoo
Member

pyongjoo commented Apr 1, 2019

@dongyoungy Can you ask @Beastjoe to investigate this issue? I see two related problems:

  1. Performance when the table contains an array or struct column
  2. Possible performance degradation when samples keep being appended

@voycey
Author

voycey commented Apr 1, 2019 via email
