feat: refactor ingest #3009
This is really cool! And it's also a lot to understand. Could you add notes on how to test it out in the description?
    for s in src_v2:
        src_dict[s.name] = s
    for d in dest_v2:
        dest_dict[d.name] = d
Got it, so the CLI just defers to v2 whenever a v2 command exists.
Yep, at least right now as soon as the new connector exists, it replaces the old one.
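To make the override behavior concrete, here is a minimal sketch of a name-keyed merge in which a v2 command replaces the v1 command of the same name. It assumes only that each command object exposes a `name` attribute, as in the diff snippet; `Command`, `merge_commands`, and the `version` field are hypothetical names for illustration, not this repository's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical stand-in for a CLI command object with a `name`."""
    name: str
    version: str


def merge_commands(v1, v2):
    """Return commands keyed by name, letting v2 entries replace v1 ones."""
    merged = {c.name: c for c in v1}
    # Later writes win, so any v2 command with the same name shadows v1.
    merged.update({c.name: c for c in v2})
    return merged


src_v1 = [Command("local", "v1"), Command("s3", "v1"), Command("github", "v1")]
src_v2 = [Command("local", "v2"), Command("s3", "v2")]
src_dict = merge_commands(src_v1, src_v2)
```

Connectors without a v2 implementation (here `github`) keep their v1 command, so the set of command names the user sees does not change.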
Yea, this is actually pretty nice because it means that nothing breaks as far as users know. It also means we can add deprecation warnings for the Python users for whom it will break once we switch over.
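One possible shape for such a deprecation warning, sketched with the standard library `warnings` module. The decorator `deprecated_v1` and the function `run_local_v1` are hypothetical placeholders, not names from this PR.

```python
import functools
import warnings


def deprecated_v1(replacement):
    """Mark a v1 entry point as deprecated in favor of `replacement` (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not this wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_v1(replacement="the v2 local connector")
def run_local_v1(path):
    # Placeholder body standing in for the old connector logic.
    return f"processed {path}"
```

Python hides `DeprecationWarning` by default outside of `__main__` and test runners, so a warning like this mostly surfaces in users' test suites unless they enable it explicitly.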
Maybe worth doing that for the ones that you're bumping here? And probably bump the docs for the respective connectors?
Where do we manage the docs for connectors now? I thought that changed recently and not sure if that's still being managed by this repo.
good question! so, we now host the docs themselves in this repo: https://github.com/Unstructured-IO/docs
and it pulls source code snippets from here: https://github.com/Unstructured-IO/unstructured/tree/main/docs/source/ingest/source_connectors/code
and
https://github.com/Unstructured-IO/unstructured/tree/main/docs/source/ingest/destination_connectors/code
Overall I think there are a lot of good changes in here! This does come at the cost of quite a bit of complexity (though to be fair, some of this is me coming back in a bit cold). My biggest concern is that it's going to be tough for an average person to walk in and debug or be effective here, and currently you have to do a lot of tracing to grok what's going on. My main ask: can we equivalently (in size and scope) push the documentation? That would include a lot more comments and docstrings, as well as a comprehensive breakdown of everything happening, maybe with some diagrams as well.
This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: rbiseck3 <[email protected]>
Probably should mention in the PR description the bug fix (which is actually what caused the PR's surface area to bloat a bunch)
Thanks for the additions so far. Just noting here from conversation, more extensive documentation (mentioned in comments) will be added before we switch over from original code and current flow. In the meantime this should be non-breaking for users.
Description
This refactors the current ingest CLI process to support better granularity in how the steps are run.
Callouts
Retry configs haven't been moved over yet. This is an open question: the original intent was for them to wrap potential connection errors, but now any of the other steps that leverage an API might also run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? The latter would bloat the input params even more.
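As a rough illustration of the first option (one retry config shared by every step), here is a sketch using only the standard library. `RetryConfig`, `with_retry`, and `flaky_step` are hypothetical names invented for this example and do not reflect the existing retry implementation.

```python
import functools
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical shared retry settings; not the project's real config class."""
    max_attempts: int = 3
    base_delay: float = 0.0  # seconds; 0 keeps the example fast
    retry_on: tuple = (ConnectionError,)


def with_retry(config):
    """Wrap a step so errors listed in `config.retry_on` are retried."""
    def decorator(step):
        @functools.wraps(step)
        def wrapper(*args, **kwargs):
            for attempt in range(1, config.max_attempts + 1):
                try:
                    return step(*args, **kwargs)
                except config.retry_on:
                    if attempt == config.max_attempts:
                        raise
                    # Exponential backoff between attempts.
                    time.sleep(config.base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


shared = RetryConfig()
calls = {"count": 0}


@with_retry(shared)
def flaky_step():
    # Fails twice with a connection error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = flaky_step()
```

With this shape, a single set of retry params could be exposed on the CLI and applied to each step, and a step that needs different behavior could be decorated with its own `RetryConfig` without adding new top-level input params. Whether that trade-off fits the real pipeline is exactly the open question above.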
Testing
Migrated connectors show a v2 short help text next to their commands when running the current CLI:

    PYTHONPATH=. python unstructured/ingest/main.py --help
    Usage: main.py [OPTIONS] COMMAND [ARGS]...

    Options:
      --help  Show this message and exit.

    Commands:
      airtable
      azure
      biomed
      box
      confluence
      delta-table
      discord
      dropbox
      elasticsearch
      fsspec
      gcs
      github
      gitlab
      google-drive
      hubspot
      jira
      local          v2
      mongodb
      notion
      onedrive
      opensearch
      outlook
      reddit
      s3             v2
      salesforce
      sftp
      sharepoint
      slack
      wikipedia
You can run any of the local or s3 specific ingest tests and these should now work.