Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature / Question: better performance for data ingestion (batching?) #103

Open
2 tasks done
jsimonrichard opened this issue Jun 12, 2024 · 3 comments
Open
2 tasks done

Comments

@jsimonrichard
Copy link

Is your feature request related to a problem?

I have some data saved in .xlsx documents (~ 10,500 rows, 4.3 MB in total) that I need to ingest into an on-disk SurrealDB instance. This is my current method:

async def ingest_dicts(
    db,
    dicts,
    table_name,
    append_columns_from_dict={},
    batch_size=100,  # httpx default pool limit
):
    for i in range(0, len(dicts), batch_size):
        tasks = []
        for d in dicts[i : min(i + batch_size, len(dicts))]:
            tasks.append(
                asyncio.create_task(db.create(table_name, d | append_columns_from_dict))
            )
        for task in tasks:
            await task

reports = [
  ...
]

dfs = []
for report in reports:
    with pd.ExcelFile(report) as f:
        df = pd.read_excel(f, sheet_name="Sorted")
        dfs.append(df)

total = 0
for df in dfs:
    total += df.memory_usage(index=True).sum()
print(total/10**6, " MB")

dicts = []
for df in dfs:
    dicts.extend(df.to_dict(orient="records"))
print(len(dicts))

await asyncio.create_task(ingest_dicts(db, dicts, "urgd_reports"))

The last await statement takes 1 minute to complete on its own (I'm using VS Code / Jupyter, which displays the runtimes for each cell).

The Problem/Question: this seems pretty slow to me. Is there a better way to do this in Python (trying to get around the xy problem here)? Or is SurrealDB / SurrealDB.py just not that fast?

Describe the solution

If the solution for this kind of problem is batching with db.query for fewer round trips, then it would be nice to include some kind of batching functionality in db.create instead. The create function could accept an array of dicts and a kwarg specifying the batch size.

If that's not the issue, than I don't know.

Alternative methods

Unknown.

SurrealDB version

1.5.1 for linux on x86_64

surrealdb.py version

surrealdb 0.3.2 for linux on x86_64 using Python 3.10.12

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@TudorAndrei-Pythia
Copy link
Contributor

You can use asyncio.gather to launch all the requests at the same time. Docs.

Another approach would be to create a huge string with all the queries, and send it all at once, maybe in a transaction?

@jsimonrichard
Copy link
Author

I did try using db.query to send the data all at once, but the query was too large. I haven't tried breaking up that large command yet, but that might also be a good option.

If that does end up working, it might be nice to include some kind of batching functionality in the insert function itself (i.e. maybe allow arrays to be passed to insert). That's the feature request part of this issue.

@KarateSnowMachine
Copy link

You can use asyncio.gather to launch all the requests at the same time. Docs.

Another approach would be to create a huge string with all the queries, and send it all at once, maybe in a transaction?

I was trying the gather() for batch querying some data that was too big for a single request:

tasks = [db.query(f"select * from {id}") for id in ids]
out = asyncio.gather(*tasks)

However, I'm getting a RuntimeError: cannot call recv while another coroutine is already waiting for the next message -- I haven't dug into it yet, but maybe someone here knows what the issue might be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants