Ray based document parsing of more file types #94
@ellisonbg
OK, this is ready for review.
loader_kwargs: Optional[Dict] = None,
recursive: bool = False,
path,
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
Should we add .tsx and .txt here?
Yeah, and .jsx as well.
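For reference, here is a minimal sketch of how an extension allowlist like the one in the diff could be applied, with the .tsx, .txt, and .jsx entries suggested in this thread added. The helper name `has_indexable_extension` and the constant `EXTENSIONS` are hypothetical and are not taken from this PR; they only illustrate the suffix check.

```python
from pathlib import Path

# Extensions from the diff, plus the three suggested in this thread.
# (Assumption: the final set in the PR may differ.)
EXTENSIONS = {'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb',
              '.js', '.ts', '.tsx', '.txt', '.jsx'}


def has_indexable_extension(path, extensions=EXTENSIONS):
    """Return True if the file's final suffix is in the allowlist.

    The comparison is case-sensitive, so '.R' matches but '.r' does not.
    """
    return Path(path).suffix in extensions
```

Note that the check is case-sensitive because the set mixes cases ('.R', '.Rmd'); whether to normalize case is a design choice this sketch leaves open.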
recursive: bool = False,
path,
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
exclude={'.ipynb_checkpoints', 'node_modules', 'lib', 'build'}
Might be good to ignore .git and .DS_Store as well.
Probably
Could we parse .gitignore and use that for our denylist by default? This can be addressed in a follow-up PR, since the problem of "over-indexing" is generally not immediately obvious to end users.
Yep, let's explore this in a follow-up PR.
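As a starting point for that follow-up, here is a rough sketch of how the exclude set could be combined with patterns read from a top-level .gitignore. Everything here is an assumption for illustration: the names `DEFAULT_EXCLUDE`, `read_gitignore_patterns`, `is_excluded`, and `iter_files` are hypothetical, and the matching is a simplification that compares each pattern against file and directory base names only. It does not implement full gitignore semantics (negation with `!`, path-anchored patterns, `**`, or nested .gitignore files).

```python
import fnmatch
import os
from pathlib import Path

# Excludes from the diff, plus .git and .DS_Store as suggested above.
DEFAULT_EXCLUDE = {'.ipynb_checkpoints', 'node_modules', 'lib', 'build',
                   '.git', '.DS_Store'}


def read_gitignore_patterns(root):
    """Return simple patterns from <root>/.gitignore, or an empty set."""
    gitignore = Path(root) / '.gitignore'
    if not gitignore.is_file():
        return set()
    patterns = set()
    for line in gitignore.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and negations (not handled in this sketch).
        if not line or line.startswith('#') or line.startswith('!'):
            continue
        # Drop leading/trailing slashes so 'dist/' matches a 'dist' directory.
        patterns.add(line.strip('/'))
    return patterns


def is_excluded(name, patterns):
    """Return True if a base name matches any glob pattern (case-sensitive)."""
    return any(fnmatch.fnmatchcase(name, p) for p in patterns)


def iter_files(root, exclude=DEFAULT_EXCLUDE):
    """Yield file paths under root, skipping excluded names and directories."""
    patterns = set(exclude) | read_gitignore_patterns(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into excluded directories.
        dirnames[:] = [d for d in dirnames if not is_excluded(d, patterns)]
        for name in filenames:
            if not is_excluded(name, patterns):
                yield os.path.join(dirpath, name)
```

A dedicated gitignore-matching library would likely be a better fit than hand-rolled matching if exact Git behavior is needed; this sketch only shows where such patterns would plug into the directory walk.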
Co-authored-by: Jason Weill <[email protected]>
All points addressed or will iterate in further PRs.
* Ray based document parsing of more file types.
* Renaming to learn/ask to make for human centered.
* Improvements to the learn/ask commands.
* fix typo
  Co-authored-by: Jason Weill <[email protected]>
* improve grammar
  Co-authored-by: Jason Weill <[email protected]>
* improve wording
  Co-authored-by: Jason Weill <[email protected]>
* Adding new extensions and excludes.
* Update langchain to version 0.144.

---------

Co-authored-by: david qiu <[email protected]>
Co-authored-by: Jason Weill <[email protected]>
This is a general improvement to the file indexing capabilities: