
Feature: background ingestion in LOCAL mode #321

Merged: 13 commits into truefoundry:main, Sep 6, 2024

Conversation

@kunwar31 (Contributor) commented Sep 5, 2024

Solves #319

backend/indexer/indexer.py: 4 review threads (outdated, resolved)
@kunwar31 (Contributor, Author) commented Sep 6, 2024

@chiragjn The main issue is that there is no "manager" for the process I created, and I have intentionally not added one, as this is only meant for LOCAL ingestions.

  1. I can add another daemon thread to manage any running background process, which can update the status to FAILED if the process dies unexpectedly.
  2. However, that solution will again add further unwanted complexity. I agree that long term we need a proper solution (based on queues + background workers).

@kunwar31 (Contributor, Author) commented Sep 6, 2024

@chiragjn Changed to use ProcessPoolExecutor, initialized in the lifespan context manager of FastAPI.
The number of workers can be set using the PROCESS_POOL_WORKERS environment variable (default is 4 in compose.env and 1 in the Settings class).
Now, before FastAPI shuts down, it will ensure that there are no tasks running in the process pool.
However, unexpected termination of the ingestion process will still not be tracked.
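
For readers following along, a minimal sketch of what wiring a ProcessPoolExecutor into FastAPI's lifespan could look like; the module layout and exact settings handling are assumptions, not the PR's actual code:

```python
import os
from concurrent.futures import ProcessPoolExecutor
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pool size driven by the PROCESS_POOL_WORKERS env var (default 1).
    workers = int(os.getenv("PROCESS_POOL_WORKERS", "1"))
    app.state.process_pool = ProcessPoolExecutor(max_workers=workers)
    try:
        yield
    finally:
        # Block shutdown until submitted ingestion tasks have finished.
        app.state.process_pool.shutdown(wait=True)


app = FastAPI(lifespan=lifespan)
```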

@chiragjn (Member) commented Sep 6, 2024

> However, unexpected termination of the ingestion process will still not be tracked.

That's alright, we have to start somewhere :)

Comment on lines 375 to 376
from backend.server.app import process_pool
# future of this submission is ignored, ingestion failure due to process termination will not be tracked
chiragjn (Member):

Let's not cross-import. It is fine to pass the pool as an input arg that defaults to None.

@chiragjn (Member) commented Sep 6, 2024

We want to keep all the code from main as-is and use something along the lines of:

loop = asyncio.get_running_loop()
coro = loop.run_in_executor(sync_data_source_to_collection(...), pool, ...)
if pool is None:
    await coro
else:
    asyncio.create_task(coro)

@kunwar31 (Contributor, Author):

@chiragjn run_in_executor is used to run synchronous code, so sync_data_source_to_collection can't be async.
Further, asyncio.create_task(coro) will again block the event loop (it still has to wait for coro to finish at some point).
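
For context, a minimal sketch of how run_in_executor is conventionally called: the executor comes first, and the callable must be synchronous. The sync_ingest function and its argument here are placeholders, not the PR's code:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def sync_ingest(request_dict: dict) -> None:
    # Placeholder for a synchronous, picklable ingestion entry point.
    ...


async def schedule_ingestion(pool: ProcessPoolExecutor | None, request_dict: dict) -> None:
    if pool is None:
        # No pool configured: run inline (this still blocks the coroutine).
        sync_ingest(request_dict)
        return
    loop = asyncio.get_running_loop()
    # Executor first, then the sync callable and its positional args.
    await loop.run_in_executor(pool, sync_ingest, request_dict)
```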

@chiragjn (Member):

> run_in_executor is used to run synchronous code

You are correct, I am not framing it correctly; give me some time and I'll get back on this.

@kunwar31 (Contributor, Author) commented Sep 6, 2024

@chiragjn I spent a few hours thinking about it too. I know you want to run the whole ingestion in the same async event loop, but there are blocking synchronous parts in the sync_data_source_to_collection function, mainly fetching chunks from unstructured_api, which will keep blocking other tasks in the event loop.

The get_chunks method in backend.modules.parsers.unstructured_io.UnstructuredIoParser makes a very long POST call which blocks everything.

If we somehow make this request non-blocking, we would be in better shape, but there still won't be any guarantee that it doesn't block the main event loop.
Let me know your thoughts.

@kunwar31 (Contributor, Author) commented Sep 6, 2024

> Let's not cross-import. It is fine to pass the pool as an input arg that defaults to None.

@chiragjn Added a process_pool module to make this cleaner. At a later stage this can be refactored into a queue-based worker pool; all references to pool.submit will still work.

@chiragjn (Member) commented Sep 6, 2024

> I know you want to run the whole ingestion in the same async event loop

Actually, I don't want to run it in the same loop. I want to run stuff in a separate process; my only hesitation is wrapping all async coroutines to be sync again. I am just trying to see if we can somehow avoid that. Worst case, we have to wrap and just accept it.

> The get_chunks method in backend.modules.parsers.unstructured_io.UnstructuredIoParser makes a very long POST call which blocks everything.
>
> If we somehow make this request non-blocking, we would be in better shape, but there still won't be any guarantee that it doesn't block the main event loop.

Yes, this, plus the whole vector store implementation also needs to be made async.
At least the unstructured API call should be easy to change: instead of requests we just have to use async httpx or aiohttp.

> but there still won't be any guarantee that it doesn't block the main event loop.

That is true; some bad parser can still block, hence eventually we will move to a queue-and-workers system.
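
As a rough illustration of the requests-to-httpx swap suggested above, a non-blocking version of the parser's POST call might look like the sketch below; the endpoint, payload shape, and function name are assumptions, not the actual UnstructuredIoParser code:

```python
import httpx


# Hypothetical async replacement for the blocking requests.post call inside
# get_chunks; a generous timeout mirrors the "very long POST call" above.
async def post_to_unstructured(url: str, filename: str, file_bytes: bytes) -> dict:
    async with httpx.AsyncClient(timeout=httpx.Timeout(300.0)) as client:
        response = await client.post(
            url,
            files={"files": (filename, file_bytes)},
        )
        response.raise_for_status()
        return response.json()
```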

@kunwar31 (Contributor, Author) commented Sep 6, 2024

> Actually, I don't want to run it in the same loop. I want to run stuff in a separate process; my only hesitation is wrapping all async coroutines to be sync again. I am just trying to see if we can somehow avoid that. Worst case, we have to wrap and just accept it.

@chiragjn I think I've understood what you meant. I have simplified the diff and created a generic AsyncProcessPoolExecutor in the process_pool module which handles all the steps, so there is no need to wrap individual functions. The initial coroutine is still wrapped, though (see the sketch below).

The only way to avoid this wrapping hack is to make all synchronous parts of the async method asynchronous, which will be a much longer task; and if there are CPU-bound sync parts, they would still need to run in the process pool.
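
For readers following along, a minimal sketch of the kind of wrapper being described, where the async entry point is run to completion in a fresh event loop inside the worker process; this is an assumed shape, not the PR's actual implementation:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def _run_async(fn, *args, **kwargs):
    # Module-level helper so it can be pickled and shipped to the worker;
    # it runs the coroutine function to completion in its own event loop.
    return asyncio.run(fn(*args, **kwargs))


class AsyncProcessPoolExecutor(ProcessPoolExecutor):
    """ProcessPoolExecutor that accepts async callables by wrapping them
    in asyncio.run inside the worker process."""

    def submit(self, fn, *args, **kwargs):
        return super().submit(_run_async, fn, *args, **kwargs)
```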

backend/indexer/indexer.py: 1 review thread (outdated, resolved)
@chiragjn (Member) commented Sep 6, 2024

We are almost there; I like this version of the code, just some code-organization changes. Thanks for being really patient and iterating with us!

@kunwar31 (Contributor, Author) commented Sep 6, 2024

> We are almost there; I like this version of the code, just some code-organization changes. Thanks for being really patient and iterating with us!

@chiragjn Refactored:

  1. AsyncProcessPoolExecutor is now in utils.py
  2. api.state contains the process_pool, which is passed to the ingestion function as an optional arg
  3. Had to "hack" around in the router's ingest method to get the app.state, which is not ideal (see the sketch below)
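
The workaround in item 3 presumably relies on FastAPI's standard pattern of reaching application state through the Request object; a minimal sketch, with the route path and handler body being assumptions:

```python
from fastapi import APIRouter, Request

router = APIRouter()


@router.post("/ingest")
async def ingest(request: Request):
    # Reach the pool created in the lifespan handler via the app state;
    # it may be None when no pool is configured (e.g. non-LOCAL mode).
    pool = getattr(request.app.state, "process_pool", None)
    # ... pass `pool` to the ingestion entry point as the optional argument ...
    return {"status": "submitted" if pool is not None else "completed"}
```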

@kunwar31 (Contributor, Author) commented Sep 6, 2024

Tested this by running 4 ingestion jobs across 2 data sources and 2 collections; all 4 run in the background in parallel (workers=4) and the backend stays responsive.

@chiragjn (Member) left a review comment:

🎉
Thanks a lot for this contribution

@chiragjn chiragjn merged commit c58dafc into truefoundry:main Sep 6, 2024
1 check passed