Revamp age csv loader #2044

MuhammadTahaNaveed · 2024-08-13T12:15:12Z

Allow 0 as entry_id
Use batch inserts to improve performance
- Replaced heap_insert with heap_multi_insert, which is faster than calling heap_insert() in a loop: when multiple tuples fit on a single page, one WAL record covers all of them and the page is locked and unlocked only once.
- BATCH_SIZE is set to 1000, the number of tuples inserted in a single batch. This value was chosen after some experimentation.
- Renamed some fields to avoid confusion.
Use sequence for generating ids for edge and vertex
- The sequence is not used if id_field_exists is true in the load_labels_from_file function, since the entry id is present in the CSV.
Add function to create temporary table for ids; this is only used for loading vertices
- The first time load_labels_from_file is called, a temporary table is created and populated with the vertex ids that have already been generated. A unique index on the id column ensures that new ids (generated using the entry id from the CSV) are unique. The table and index are dropped automatically when the session ends.
- Whenever a row is inserted into a label, the corresponding id is inserted into the temporary table as well.
Add functions to create graph and label automatically
- These functions check whether the graph and label exist, and create them if they don't.
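
The batching described above can be illustrated with a minimal, self-contained sketch. It is not the loader's actual code: `row_t`, `batch_t`, `batch_add`, and `batch_flush` are hypothetical names, and `batch_flush` only counts rows where the real loader would make one heap_multi_insert call. Only the `BATCH_SIZE` value of 1000 comes from the description.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 1000  /* value stated in the PR description */

/* Hypothetical stand-in for a tuple; the real loader buffers heap tuples. */
typedef struct { long id; } row_t;

typedef struct {
    row_t  rows[BATCH_SIZE];
    size_t count;    /* rows currently buffered */
    size_t flushes;  /* number of batch writes issued so far */
    size_t written;  /* total rows handed to a batch write */
} batch_t;

/* Assumption: this is where a single multi-row insert would happen.
 * Here it only updates the counters and empties the buffer. */
static void batch_flush(batch_t *b)
{
    if (b->count == 0)
        return;
    b->written += b->count;
    b->flushes++;
    b->count = 0;
}

/* Buffer one row and flush automatically once the buffer is full. */
static void batch_add(batch_t *b, row_t r)
{
    assert(b->count < BATCH_SIZE);
    b->rows[b->count++] = r;
    if (b->count == BATCH_SIZE)
        batch_flush(b);
}
```

With this shape, the caller adds one row per CSV line and calls `batch_flush` once more at end of file so that a final partial batch is not lost.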

* Allow 0 as entry_id
  - No regression tests were impacted by this change.
* Use batch inserts to improve performance
  - Replaced heap_insert with heap_multi_insert, which is faster than calling heap_insert() in a loop: when multiple tuples fit on a single page, one WAL record covers all of them and the page is locked and unlocked only once.
  - BATCH_SIZE is set to 1000, the number of tuples inserted in a single batch. This value was chosen after some experimentation.
  - Renamed some fields to avoid confusion.
* Use sequence for generating ids for edge and vertex
  - The sequence is not used if id_field_exists is true in the load_labels_from_file function, since the entry id is present in the CSV.
* Add function to create temporary table for ids
  - Creates a temporary table and populates it with the vertex ids that have already been generated. A unique index on the id column ensures that new ids (generated using the entry id from the CSV) are unique.
* Insert generated ids in the temporary table to enforce uniqueness
  - Inserts ids into the temporary table and updates the index to enforce uniqueness.
  - If the entry id provided in the CSV is greater than the current sequence value, the sequence value is updated to match the entry id. For example, suppose the current sequence value is 1 and the CSV entry id is 2. If 2 is used without updating the sequence to 2, the next CREATE clause will get 2 from the sequence as an entry id, resulting in a duplicate.
  - Updates the batch functions.
* Add functions to create graph and label automatically
  - These functions check whether the graph and label exist, and create them if they don't.
* Add regression tests
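
The sequence rule in the example above can be sketched in a few lines. This is an in-memory model only, under the assumption that the sequence hands out `last_value + 1` on each request; `seq_t`, `seq_next`, and `seq_observe` are hypothetical names and not part of the loader.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a label's id sequence. */
typedef struct { int64_t last_value; } seq_t;

/* Return the next generated id (the path taken when ids are generated). */
static int64_t seq_next(seq_t *s)
{
    return ++s->last_value;
}

/* Record an entry id taken from the CSV. If it is ahead of the sequence,
 * move the sequence up to it so a later seq_next() cannot return it again. */
static void seq_observe(seq_t *s, int64_t entry_id)
{
    if (entry_id > s->last_value)
        s->last_value = entry_id;
}
```

Starting from a sequence value of 1 and observing CSV entry id 2, the next generated id is 3 rather than a second 2. Observing an id at or below the current value leaves the sequence unchanged.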

MuhammadTahaNaveed added 7 commits August 12, 2024 19:11

Allow 0 as entry_id

90bba62

- No regression tests were impacted by this change.

Use sequence for generating ids for edge and vertex

6ce9f01

- Sequence is not used if the id_field_exists is true in load_labels_from_file function, since the entry id is present in the csv.

Add function to create temporary table for ids

6337dd8

- Creates a temporary table and populates it with the vertex ids that have already been generated. A unique index on the id column ensures that new ids (generated using the entry id from the CSV) are unique.

Add functions to create graph and label automatically

cff0cbf

- These functions will check existence of graph and label, and create them if they don't exist.

Add regression tests

2d3d783
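
The uniqueness check from the "Add function to create temporary table for ids" commit can be pictured with a small stand-in. The PR uses a temporary table with a unique index; the sketch below swaps in a sorted in-memory array with binary search purely for illustration. `id_set_t` and `id_set_insert` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the temporary id table and its unique index. */
typedef struct {
    int64_t *ids;  /* kept sorted in ascending order */
    size_t   len;
    size_t   cap;
} id_set_t;

/* Insert id in sorted position. Returns false and changes nothing if the
 * id is already present, which models a unique-index violation. */
static bool id_set_insert(id_set_t *s, int64_t id)
{
    size_t lo = 0, hi = s->len;

    /* Binary search for the first element that is >= id. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (s->ids[mid] < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < s->len && s->ids[lo] == id)
        return false;

    /* Grow the array when it is full. */
    if (s->len == s->cap) {
        size_t cap = s->cap ? s->cap * 2 : 16;
        int64_t *p = realloc(s->ids, cap * sizeof *p);
        if (p == NULL)
            abort();
        s->ids = p;
        s->cap = cap;
    }

    /* Shift the tail right by one slot and place the new id. */
    memmove(&s->ids[lo + 1], &s->ids[lo], (s->len - lo) * sizeof *s->ids);
    s->ids[lo] = id;
    s->len++;
    return true;
}
```

An id of 0 is accepted like any other value, matching the "Allow 0 as entry_id" change; a repeated id is rejected on the second insert.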

MuhammadTahaNaveed requested review from jrgemignani and rafsun42 August 13, 2024 12:15

github-actions bot added the master and override-stale (to keep issues/PRs untouched from stale action) labels Aug 13, 2024

jrgemignani approved these changes Aug 14, 2024

jrgemignani merged commit e370db3 into apache:master Aug 14, 2024
7 checks passed
