feat/finish implementing document level language detection for other doctypes #1534

Closed
shreyanid opened this issue Sep 26, 2023 · 0 comments · Fixed by #1627
@shreyanid
Contributor

If the user does not provide the document languages, use `langdetect` to detect the language of the text. Detection runs at the document level for speed. If confidence in the result is high enough, we can assume all elements are in the detected language.

The pattern has already been established in this PR for text partitioning. It has three parts: adding the `languages` parameter for user input, detecting the document language, and adding the resulting document language to the element metadata. Apply this pattern to all other non-image-based document partitioning functions (all but pdf and image), as well as to auto partition.
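
For illustration, here is a rough sketch of that three-part pattern. It is not the code from the PR: `Element`, `detect_document_languages`, `apply_document_languages`, and the `min_confidence` threshold are hypothetical names and values chosen for this example. The only real library call assumed is `langdetect.detect_langs`, which returns candidates with `.lang` and `.prob` attributes, and it is imported lazily so a different detector can be swapped in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A detector maps text to (language code, probability) candidates.
Detector = Callable[[str], List[Tuple[str, float]]]


def langdetect_detector(text: str) -> List[Tuple[str, float]]:
    """Adapter around langdetect.detect_langs (third-party, imported lazily)."""
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return [(result.lang, result.prob) for result in detect_langs(text)]


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def detect_document_languages(
    elements: Sequence[Element],
    languages: Optional[Sequence[str]] = None,
    detector: Detector = langdetect_detector,
    min_confidence: float = 0.9,  # assumed threshold, not taken from the PR
) -> Optional[List[str]]:
    """Return user-supplied languages, or one document-level detection."""
    if languages:
        # User input wins; no detection is run.
        return list(languages)
    # Detect once over the whole document rather than per element, for speed.
    full_text = " ".join(element.text for element in elements).strip()
    if not full_text:
        return None
    candidates = detector(full_text)
    if not candidates:
        return None
    lang, prob = max(candidates, key=lambda pair: pair[1])
    # Only trust the result when confidence is high enough.
    return [lang] if prob >= min_confidence else None


def apply_document_languages(
    elements: Sequence[Element],
    languages: Optional[Sequence[str]] = None,
    detector: Detector = langdetect_detector,
    min_confidence: float = 0.9,
) -> Sequence[Element]:
    """Attach the document-level languages to every element's metadata."""
    detected = detect_document_languages(elements, languages, detector, min_confidence)
    if detected is not None:
        for element in elements:
            element.metadata["languages"] = list(detected)
    return elements
```

In this sketch a low-confidence result leaves the metadata untouched instead of guessing. Each partitioning function would call something like `apply_document_languages(elements, languages=languages)` just before returning its elements.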

@shreyanid shreyanid added the enhancement New feature or request label Sep 26, 2023
@Coniferish Coniferish self-assigned this Sep 26, 2023
github-merge-queue bot pushed a commit that referenced this issue Oct 11, 2023
### Summary
Closes #1534 and #1535
Detects the document language using the `langdetect` package.
Creates new kwargs that let the user set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`).

---------

Co-authored-by: shreyanid <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: Coniferish <[email protected]>
Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>