Bulk load for greptimedb #405

Closed
killme2008 opened this issue Nov 7, 2022 · 4 comments
Labels
C-enhancement Category Enhancements

Comments

@killme2008
Contributor

killme2008 commented Nov 7, 2022

Bulk load data from sources, such as:

  • csv file
  • json file
  • parquet file
  • other tables
  • mysql table
  • ....
@killme2008 killme2008 added the C-enhancement Category Enhancements label Nov 7, 2022
@waynexia
Member

waynexia commented Nov 7, 2022

I investigated bulk loading Parquet files last week. As Parquet is our only natively supported format, we only need to supply a manifest and our specific metadata (in persistent storage and in the meta server) to make Parquet files query-able and even writable.

But what about other formats like CSV or JSON? They cannot be queried directly (for now). Two approaches I came up with are:

  • an offline converter that converts other formats into Parquet, and then ingests the converted Parquet file.
  • adding native support for those formats.
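
As a rough illustration of the first approach, the sketch below reads CSV and newline-delimited JSON into one column-oriented shape, which is the form a Parquet writer would typically consume. It is only a sketch under assumptions: the function names (`rows_from_csv`, `rows_from_json_lines`, `to_columns`) and the sample data are hypothetical, no schema or type inference is done (CSV values stay strings), and the actual Parquet write step is left out because it needs a third-party library.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Parse CSV text with a header row into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json_lines(text):
    """Parse newline-delimited JSON objects into a list of dict rows."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_columns(rows):
    """Pivot row dicts into a dict of equal-length columns.

    Fields missing from a row become None so every column lines up.
    """
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return {name: [row.get(name) for row in rows] for name in names}


# Hypothetical sample inputs; real files would be read from disk.
csv_text = "host,cpu\na,0.5\nb,0.7\n"
json_text = '{"host": "a", "cpu": 0.5}\n{"host": "b"}\n'

csv_columns = to_columns(rows_from_csv(csv_text))
json_columns = to_columns(rows_from_json_lines(json_text))
```

Both inputs end up in the same intermediate form, so a single writer could handle either source once types are resolved.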

@sunng87
Member

sunng87 commented Nov 10, 2022

> make parquet files query-able and even writable.

And in a cluster, wouldn't we also have to split the file according to the table's partition rule? This is better done in the frontend via some custom SQL like COPY INTO.

And let the frontend deal with more formats like CSV or JSON. We can convert them to Parquet internally.

@waynexia
Member

> And in a cluster we should have to split the file according to the table's partition rule as well?

Yes. We can let the frontend preprocess (split) the file and upload all the resulting pieces to OSS.
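
The split step could look roughly like the sketch below. It assumes a simple range partition rule on a single column, which this thread does not confirm. The names `partition_index` and `split_rows`, the `host_id` column, and the bounds are all hypothetical, and uploading each group to object storage is not shown.

```python
import bisect
from collections import defaultdict


def partition_index(value, upper_bounds):
    """Return the index of the range partition that `value` falls into.

    `upper_bounds` is a sorted list of exclusive upper bounds. Values at or
    above the last bound go to a final catch-all partition.
    """
    return bisect.bisect_right(upper_bounds, value)


def split_rows(rows, column, upper_bounds):
    """Group rows by partition so each group can be written as its own file."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_index(row[column], upper_bounds)].append(row)
    return dict(groups)


# Hypothetical rows and an assumed rule: host_id < 10, 10 <= host_id < 20, rest.
rows = [
    {"host_id": 3, "cpu": 0.5},
    {"host_id": 12, "cpu": 0.7},
    {"host_id": 25, "cpu": 0.1},
    {"host_id": 10, "cpu": 0.9},
]
parts = split_rows(rows, "host_id", [10, 20])
```

Each entry in `parts` would then become one file destined for the node that owns that partition.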

> And let frontend to deal with more formats like csv or json. We can convert them to parquet internally.

I also prefer converting other formats to Parquet. Supporting them directly is not complex, but considering possible future modifications, it would be better to unify the format.

@killme2008
Contributor Author

killme2008 commented May 8, 2023

Already implemented in #1038 and #1064.
