refactor: unstructured ingest as a pipeline #1551
Conversation
The branch was force-pushed from b780f47 to 37813ce, then from 37813ce to 2df5fd3.
some general comments:
- Can we switch to a concurrent executor pool from `concurrent.futures` instead of a multiprocessing pool (see the sketch after this list)? The executor is better at avoiding thread locks than the pool. Given that we potentially run models with their own multiprocessing backend (e.g., the Hugging Face tokenizer uses a Rust backend), we can easily get into trouble with thread locks.
- The Rust backend from Hugging Face is also an example of why the pipeline needs careful setup: some nodes in this design inherently do their own multi-threading/multiprocessing. The `support_multiprocessing` attribute can help identify them, but it doesn't help when some of them should not be called AFTER or BEFORE Python's multiprocessing, e.g., python version hangs on encode_batch when run in subprocess huggingface/tokenizers#187.
- How does this pipeline optimize a workflow where part of the process runs on GPU and the CPU could be used for other tasks at the same time (e.g., one reformat node uses a GPU model and another runs a random forest model on the CPU, and they don't depend on each other)? Maybe we want to think about async tasks, which would likely also favor a pool of concurrent executors.
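A minimal sketch of the suggested swap, assuming a hypothetical per-document work function (`partition_doc`) and input list; none of these names come from the PR itself:

```python
from concurrent.futures import ProcessPoolExecutor


def partition_doc(path: str) -> str:
    # Placeholder for the real per-document work a pipeline node would do.
    return f"processed {path}"


def run_step(docs: list[str], max_workers: int = 4) -> list[str]:
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map keeps input order and re-raises worker exceptions when
        # results are iterated, which surfaces failures more cleanly than Pool.
        return list(executor.map(partition_doc, docs))


if __name__ == "__main__":
    print(run_step(["a.pdf", "b.pdf"]))
```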
@badGarnet valid points, though I'm not sure if some of them should be scoped to their own effort? To limit how much is changed in this PR, the pool approach is what's already being used; this only splits it up into discrete steps run by a pipeline class.
yes, I think this is a pretty reasonable expectation for the simple source-connector-only flow, which folks will likely still hit. Adding that final copy step makes sense to me.
I know we hash the download location by URL, but what do you mean by hashing ingest docs? Where do IngestDocs themselves get hashed? Or is this in the new flow, i.e., a hash used for storing data from the intermediate nodes/stages? If it's the latter, then I agree: at a minimum the hash should include any parameters at that node/stage and all of the parameters from upstream nodes/stages.
I think we'll still want a way to just let a user reprocess the whole pipeline. Maybe --reprocess still does that, and we add a new flag to specify reprocessing a specific stage? Presumably you can only specify one, since anything downstream of the earliest reprocessed node would also have to get reprocessed?
@ryannikolaidis, regarding the hash: the pipeline context that gets passed around across nodes contains an `ingest_docs_map: dict`, which gets populated via `ingest_docs_map[get_ingest_doc_hash(doc)] = doc`, where `hashed = hashlib.sha256(json_as_dict.get("filename").encode()).hexdigest()[:32]`. This allows every stage to map the content it's producing back to the original ingest doc class.
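A minimal sketch of how that hash could be derived, assuming the doc serializes through `DataClassJsonMixin`'s `to_json()`; the helper name matches the comment above, but the body is illustrative rather than the PR's exact code:

```python
import hashlib
import json


def get_ingest_doc_hash(doc) -> str:
    # Assumes doc exposes to_json() (e.g., via DataClassJsonMixin).
    json_as_dict = json.loads(doc.to_json())
    # sha256 over the serialized "filename" field, truncated to 32 hex chars.
    hashed = hashlib.sha256(json_as_dict.get("filename").encode()).hexdigest()[:32]
    return hashed
```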
got it. yea, then I agree, you definitely can't just rely on download location for this hash.
The branch was force-pushed from 247958d to 2474bd2.
given how much this changes, it's probably worth a refresh of the README and supporting diagrams to describe the entire flow in the context of these changes.
The branch was force-pushed from c9f9c31 to 4bb636b.
@ryannikolaidis, I added a new issue to update all documentation given this new approach. If I were to do that in the same PR, it would add a ton more changed files to an already large list of changes.
yep, all good. thanks!
looks good! thanks for all of the changes.
Description
As we add more and more steps to the pipeline (e.g. chunking, embedding, table manipulation), it helps to separate the responsibility of each of these into its own process, running each in parallel and using json files to share data across steps. This also helps guarantee that the data is serializable if this code is used in an actual pipeline. Following is a flow diagram of the proposed changes. As part of this change:
- Each step in the pipeline is its own node, which can optionally be run via multiprocessing if it supports it, or not. Among the possible nodes at this moment is a download step that pulls raw content locally using the `download_dir` parameter.
- Added `IngestDocJsonMixin`, which adds all the `@property` fields to the serialized json already being created via the `DataClassJsonMixin`.
- Intermediate results for each step are cached under `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`, using the `_output_filename` value of the ingest doc.
- The shared map of ingest docs is a `multiprocessing.manager.DictProxy` to make sure any interaction with it is behind a lock.
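A rough sketch of the two mechanisms above, under the assumption that nodes expose a `support_multiprocessing` flag and the shared map is created by a `multiprocessing` manager; the class and variable names other than `support_multiprocessing` are illustrative, not the PR's actual ones:

```python
import multiprocessing as mp


class PipelineNode:
    # Nodes that manage their own workers (e.g. tokenizers with a Rust backend)
    # would set this to False so the pipeline runs them in the main process.
    support_multiprocessing: bool = True

    def run(self, doc_hash: str) -> None:
        raise NotImplementedError


if __name__ == "__main__":
    manager = mp.Manager()
    # manager.dict() returns a DictProxy: access is synchronized and the map
    # can be shared safely across worker processes.
    ingest_docs_map = manager.dict()
    ingest_docs_map["abc123"] = {"filename": "example.pdf"}
    print(dict(ingest_docs_map))
```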
Minor refactors included:
- A `@property` for source metadata on the base ingest doc, adding logic to call `update_source_metadata` if it's still `None` at the time it's fetched (see the sketch below).
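A minimal sketch of that lazy-loading pattern, with a hypothetical `ExampleIngestDoc` standing in for the base ingest doc class:

```python
from typing import Optional


class ExampleIngestDoc:
    def __init__(self):
        self._source_metadata: Optional[dict] = None

    def update_source_metadata(self) -> None:
        # Placeholder for fetching metadata from the source connector.
        self._source_metadata = {"date_created": "2023-09-29"}

    @property
    def source_metadata(self) -> dict:
        # Populate lazily the first time the metadata is read while still None.
        if self._source_metadata is None:
            self.update_source_metadata()
        return self._source_metadata
```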
Additional bug fixes included:
- `ingest_doc_cls` was removed from the fields captured by the `@dataclass` decorator and added in a `__post_init__` method.
- `update_source_metadata` was misnamed as `update_source_metadata_metadata`, so it was never being called.

Flow Diagram