Feature / Question: better performance for data ingestion (batching?) #103

jsimonrichard · 2024-06-12T15:59:31Z

Is your feature request related to a problem?

I have some data saved in .xlsx documents (~ 10,500 rows, 4.3 MB in total) that I need to ingest into an on-disk SurrealDB instance. This is my current method:

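import asyncio

import pandas as pd


async def ingest_dicts(
    db,
    dicts,
    table_name,
    append_columns_from_dict=None,  # extra columns merged into every record
    batch_size=100,  # httpx default pool limit
):
    extra = append_columns_from_dict or {}
    for i in range(0, len(dicts), batch_size):
        # Start one create request per record in this batch...
        tasks = [
            asyncio.create_task(db.create(table_name, d | extra))
            for d in dicts[i : i + batch_size]
        ]
        # ...and wait for the whole batch before starting the next one.
        for task in tasks:
            await task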

reports = [
  ...
]

dfs = []
for report in reports:
    with pd.ExcelFile(report) as f:
        df = pd.read_excel(f, sheet_name="Sorted")
        dfs.append(df)

total = 0
for df in dfs:
    total += df.memory_usage(index=True).sum()
print(total/10**6, " MB")

dicts = []
for df in dfs:
    dicts.extend(df.to_dict(orient="records"))
print(len(dicts))

await asyncio.create_task(ingest_dicts(db, dicts, "urgd_reports"))

The last await statement takes 1 minute to complete on its own (I'm using VS Code / Jupyter, which displays the runtimes for each cell).

The Problem/Question: this seems pretty slow to me. Is there a better way to do this in Python (trying to get around the xy problem here)? Or is SurrealDB / SurrealDB.py just not that fast?

Describe the solution

If the solution for this kind of problem is batching with db.query for fewer round trips, then it would be nice to include some kind of batching functionality in db.create instead. The create function could accept an array of dicts and a kwarg specifying the batch size.

If that's not the issue, then I don't know.
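A minimal sketch of what such a batching helper could look like, written outside the library. The name create_in_batches and the RecordingDB stand-in are hypothetical. The sketch assumes that db.query(sql, vars) is awaitable and binds vars as query parameters, and that the server accepts INSERT INTO <table> $data with an array of objects; neither is confirmed in this thread.

```python
import asyncio


async def create_in_batches(db, table_name, records, batch_size=100):
    """Insert records in chunks, using one db.query round trip per chunk.

    Assumption: db.query(sql, vars) binds `vars` as query parameters and the
    server accepts `INSERT INTO <table> $data` with an array of objects.
    """
    results = []
    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        # table_name is interpolated directly, so it must come from trusted code.
        result = await db.query(f"INSERT INTO {table_name} $data", {"data": batch})
        results.append(result)
    return results


class RecordingDB:
    """Hypothetical stand-in for a connection; it only records what it was sent."""

    def __init__(self):
        self.calls = []

    async def query(self, sql, vars=None):
        self.calls.append((sql, vars))
        return len(vars["data"])
```

In a notebook this would be called as await create_in_batches(db, dicts, "urgd_reports"), which for roughly 10,500 rows and a batch size of 100 means about 105 round trips instead of one per row.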

Alternative methods

Unknown.

SurrealDB version

1.5.1 for linux on x86_64

surrealdb.py version

surrealdb 0.3.2 for linux on x86_64 using Python 3.10.12

Contact Details

[email protected]

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct


TudorAndrei-Pythia · 2024-06-16T20:57:36Z

You can use asyncio.gather to launch all the requests at the same time (see the asyncio docs).
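A small sketch of the gather pattern, for reference. fake_create is a hypothetical stand-in for db.create so the example runs without a database; whether this is actually faster depends on whether the client can handle concurrent requests on one connection, which this thread does not establish.

```python
import asyncio


async def fake_create(table_name, record):
    """Hypothetical stand-in for db.create: yields once, then echoes the record."""
    await asyncio.sleep(0)
    return {"table": table_name, **record}


async def create_batch(create, table_name, batch):
    # gather schedules every coroutine at once and returns the results
    # as a list in the same order as the inputs.
    return await asyncio.gather(*(create(table_name, record) for record in batch))
```

Compared with the create_task loop in the original post, the behavior is much the same; gather mainly removes the manual bookkeeping of the task list.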

Another approach would be to create a huge string with all the queries, and send it all at once, maybe in a transaction?
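A sketch of how such a string might be assembled. build_transaction is a hypothetical helper, and it assumes that each record is JSON-serializable and that its JSON form is accepted as a SurrealQL object literal. pandas values such as NaN or Timestamp would need converting first.

```python
import json


def build_transaction(table_name, records):
    """Join one CREATE statement per record into a single transaction string."""
    statements = ["BEGIN TRANSACTION;"]
    for record in records:
        # Assumption: json.dumps output is valid as a SurrealQL object literal.
        statements.append(f"CREATE {table_name} CONTENT {json.dumps(record)};")
    statements.append("COMMIT TRANSACTION;")
    return "\n".join(statements)
```

The result would be passed to db.query in one call. Because values are interpolated into the query text, this is only reasonable for trusted data.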

jsimonrichard · 2024-06-17T19:20:06Z

I did try using db.query to send the data all at once, but the query was too large. I haven't tried breaking up that large command yet, but that might also be a good option.
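One way to break the command up is to cap each batch by serialized size rather than by row count. This is a sketch only: batches_by_size is a hypothetical helper, the actual size limit that caused the failure is not stated in this thread, and JSON length is used as a rough proxy for the size of the request.

```python
import json


def batches_by_size(records, max_bytes):
    """Yield lists of records whose combined JSON size does not exceed max_bytes.

    A single record that is larger than max_bytes is still yielded on its own.
    """
    batch, size = [], 0
    for record in records:
        record_size = len(json.dumps(record).encode("utf-8"))
        # Close the current batch before it would grow past the limit.
        if batch and size + record_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += record_size
    if batch:
        yield batch
```

Each yielded batch could then be sent as its own db.query call, so that no single request grows past whatever limit the server or transport enforces.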

If that does end up working, it might be nice to include some kind of batching functionality in the insert function itself (i.e. maybe allow arrays to be passed to insert). That's the feature request part of this issue.

KarateSnowMachine · 2024-11-13T22:03:52Z

(Quoting TudorAndrei-Pythia's suggestions above: use asyncio.gather, or send one large query string, maybe in a transaction.)

I was trying the gather() for batch querying some data that was too big for a single request:

tasks = [db.query(f"select * from {id}") for id in ids]
out = await asyncio.gather(*tasks)

However, I'm getting a RuntimeError: cannot call recv while another coroutine is already waiting for the next message -- I haven't dug into it yet, but maybe someone here knows what the issue might be.
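That message matches the one the websockets library raises when two coroutines call recv() on the same connection at once. This suggests, as an assumption not confirmed in this thread, that the client does not multiplex concurrent queries over a single WebSocket. A possible workaround is to serialize access with a lock, sketched below; SerializedDB and OneAtATimeDB are hypothetical names, and the latter only simulates a client that rejects overlapping queries.

```python
import asyncio


class SerializedDB:
    """Wrap a client so that only one query is in flight at a time."""

    def __init__(self, db):
        self._db = db
        self._lock = asyncio.Lock()

    async def query(self, sql, vars=None):
        async with self._lock:
            return await self._db.query(sql, vars)


class OneAtATimeDB:
    """Hypothetical stand-in client that raises if two queries overlap."""

    def __init__(self):
        self._busy = False

    async def query(self, sql, vars=None):
        if self._busy:
            raise RuntimeError("concurrent query on one connection")
        self._busy = True
        try:
            await asyncio.sleep(0)  # simulate waiting for the server's reply
            return sql
        finally:
            self._busy = False


async def run_queries(db, statements):
    # gather still starts every coroutine, but the lock lets them through one by one.
    return await asyncio.gather(*(db.query(s) for s in statements))
```

With the lock in place gather no longer provides any parallelism, so this avoids the error without speeding anything up; real concurrency would need separate connections or a client that supports it.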
