scale-up #103

Open
wants to merge 7 commits into base: master

Conversation


@galsalomon66 galsalomon66 commented May 15, 2022

Adding the capability to execute the same query across different input streams (CSV in this case), merge the results of each of the streams, and return them to the caller as a single result. The different execution flows can run in parallel to each other.
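Below is a minimal sketch, not the s3select code added by this PR, of the producer/consumer pattern described above and in items 2 and 4 of the list below: one producer thread per input file pushes partial results into a shared queue, and a single consumer merges them into one result. All names (shared_result_queue, run_query_on_stream, scale_up) and the std::string chunk type are illustrative assumptions.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Shared queue: multiple producers (one per input stream) push partial
// results; a single consumer pops and merges them.
class shared_result_queue {
  std::queue<std::string> q_;
  std::mutex m_;
  std::condition_variable cv_;
  size_t open_producers_;
public:
  explicit shared_result_queue(size_t producers) : open_producers_(producers) {}

  void push(std::string chunk) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(chunk)); }
    cv_.notify_one();
  }

  void producer_done() {
    { std::lock_guard<std::mutex> lk(m_); --open_producers_; }
    cv_.notify_one();
  }

  // Blocks until a chunk is available; returns false once all producers have
  // finished and the queue is drained.
  bool pop(std::string& out) {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty() || open_producers_ == 0; });
    if (q_.empty()) return false;
    out = std::move(q_.front());
    q_.pop();
    return true;
  }
};

// Stand-in for running the (non-aggregate) query over one CSV stream; the
// real flow would push the rows produced by s3select for that stream.
void run_query_on_stream(const std::string& csv_file, const std::string& query,
                         shared_result_queue& out) {
  out.push("<result rows of '" + query + "' over " + csv_file + ">\n");
}

// One producer thread per input file, a single consumer merging the chunks
// into the single result returned to the caller.
std::string scale_up(const std::vector<std::string>& files,
                     const std::string& query) {
  shared_result_queue results(files.size());
  std::vector<std::thread> producers;
  for (const auto& f : files) {
    producers.emplace_back([&results, &query, f] {
      run_query_on_stream(f, query, results);
      results.producer_done();
    });
  }

  std::string merged;
  for (std::string chunk; results.pop(chunk); ) {
    merged += chunk;
  }

  for (auto& t : producers) t.join();
  return merged;
}
```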

  1. Adding s3select_result (replacing std::string) to handle more options for result production.
  2. Adding a shared queue to handle the results of multiple parallel execution flows (the sketch above shows the producer/consumer idea).
  3. There are 2 main flows in query execution: 1) the non-aggregate flow and 2) the aggregate flow.
    3a. The non-aggregate flow is mainly about merging the results of the different execution flows.
    3b. The aggregate flow handles the complexities of aggregation queries (sum, min, count, ...). It splits the execution into 2 phases: the first processes the query, the second merges the results of all processes. For an aggregation query this means that the AST nodes behave differently depending on their phase (a two-phase sketch follows this list).
  4. s3select_scaleup simulates multi-threaded execution of a single query. The application defines a list of files as a single input data set for the query; each input file is executed on a dedicated thread, and a single consumer merges the results of all producers. (This flow exists for the sake of simulation and measurements.)
  5. As for the RGW execution, multiple requests will process a single query; from the user's perspective it is a single request.
  6. As for splitting the input (to improve scalability in some use cases), each data source (CSV, Parquet, JSON) needs a different flow of input splitting (a CSV splitting sketch follows this list).
  7. TODO: long result rows should be split into several entries in the shared queue.
  8. TODO: the result should be handled by a callback or returned to the caller (as a value).
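To illustrate item 3b: a minimal sketch, not the PR's AST-node code, of the two-phase aggregation. Each execution flow first folds its own stream into partial aggregates, and a second phase merges the partials into the final answer; note that COUNT is merged by summing the partial counts, which is why the aggregate nodes must know which phase they are running in. The struct and function names are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Partial aggregates produced by one execution flow during the first
// (processing) phase.
struct partial_agg {
  double   sum   = 0.0;
  double   min   = std::numeric_limits<double>::max();
  uint64_t count = 0;

  // First phase: each flow folds the rows of its own stream.
  void process(double value) {
    sum += value;
    min = std::min(min, value);
    ++count;
  }
};

// Second phase: merge the partial results of all flows. The merge rules
// differ from per-row processing (COUNT is merged by summing the partial
// counts), which is why the aggregate AST nodes behave differently
// depending on the phase they run in.
partial_agg merge(const std::vector<partial_agg>& parts) {
  partial_agg final_agg;
  for (const auto& p : parts) {
    final_agg.sum   += p.sum;
    final_agg.min    = std::min(final_agg.min, p.min);
    final_agg.count += p.count;   // not ++count: sum the partial counts
  }
  return final_agg;
}
```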
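For item 6, a minimal sketch of one possible CSV splitting flow, assuming the input is available as a single byte buffer: each chunk boundary is advanced to the next newline so that no row is split between two execution flows. split_csv_range is a hypothetical helper, not part of this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits `data` into up to `num_chunks` (assumed >= 1) half-open byte ranges
// [begin, end), each ending on a row (newline) boundary.
std::vector<std::pair<size_t, size_t>>
split_csv_range(const std::string& data, size_t num_chunks) {
  std::vector<std::pair<size_t, size_t>> ranges;
  const size_t approx = data.size() / num_chunks + 1;
  size_t begin = 0;
  while (begin < data.size()) {
    size_t end = std::min(begin + approx, data.size());
    // Advance to the next newline so the chunk ends on a complete row.
    while (end < data.size() && data[end] != '\n') ++end;
    if (end < data.size()) ++end;   // include the newline itself
    ranges.emplace_back(begin, end);
    begin = end;
  }
  return ranges;
}
```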

Signed-off-by: gal Salomon [email protected]

@galsalomon66 galsalomon66 changed the title adding capability to execute the same query across different input-st… scale-up May 15, 2022
…ream(CSV in this case), and to merge results of each of the streams and return it to the caller as a single one. the different executions can run in parallel to each other

Signed-off-by: gal salomon <[email protected]>
Signed-off-by: gal salomon <[email protected]>
… node (sum, max ...) has 2 states to handle: the first phase (processing the query), the second phase (aggregating the results of all participants).

Signed-off-by: gal salomon <[email protected]>
…tiple execution flows for the aggregation and non-aggregation flows. bug fixes.

Signed-off-by: gal salomon <[email protected]>
…ts to be processed simultaneously

Signed-off-by: gal salomon <[email protected]>
Signed-off-by: galsalomon66 <[email protected]>