-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
H-3614: Introduce chonky
PDF Embeddings
#5673
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JesusFileto!
I noticed that you have used curl
which means there is no asynchronous execution of requests possible. In particular for AI calls, asynchronous calls are preferred. In addition, curl
pulls in libcurl
(and OpenSSL), while reqwest
is written in Rust and can use rustls
.
As async backend we use tokio
. If you add the "macros"
and "rt-multi-thread"
to tokio
you can use #[tokio::main]
on fn main()
and #[tokio::test]
instead of #[test]
which makes the function async
.
As this effectively makes the code async
we can also use tokio
to read files (by using the fs
feature on tokio
and tokio::fs
instead of std::fs
)
Cargo.toml
Outdated
bumpalo = { version = "=3.16.0", default-features = false } | ||
bytes = { version = "1.6.0" } | ||
clap_builder = { version = "=4.5.21", default-features = false, features = ["std"] } | ||
criterion = { version = "=0.5.1" } | ||
curl = { version = "=0.4.47" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use reqwest
instead. It's already configured in the root-manifest. You probably want to enable the json
feature so we don't need to convert JSON to string first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok have pushed out some changes that do this, but unable to solve the tokio
linter errors that occur from moves during the await
function.
libs/chonky/src/embedding.rs
Outdated
@@ -0,0 +1,451 @@ | |||
pub mod multi_modal_embedding { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's split modules into files now. This means that embedding.rs
becomes embedding/mod.rs
and multi_modul_embedding
is located in embedding/multi_modul_embedding.rs
(or probably embedding/multi-modal
as the parent already is called embedding
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be done now!
libs/chonky/src/embedding.rs
Outdated
let formatted_json = serde_json::to_string_pretty(&json!({ | ||
"instances": [ | ||
{ | ||
"image": { | ||
"bytesBase64Encoded": base64_encoded_img | ||
} | ||
} | ||
] | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason why you convert the JSON to a string? I'm not sure about curl
but reqwest
does support JSON directly (and is async
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we use reqwest
this has been removed.
chonky
PDF Embeddings
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5673 +/- ##
==========================================
+ Coverage 19.83% 23.15% +3.32%
==========================================
Files 515 561 +46
Lines 17327 18882 +1555
Branches 2548 2666 +118
==========================================
+ Hits 3437 4373 +936
- Misses 13852 14457 +605
- Partials 38 52 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Benchmark results
|
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1
|
Flame Graph |
representative_read_multiple_entities
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_property | depths: DT=255, PT=255, ET=255, E=255 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=0, E=0 | Flame Graph | |
entity_by_property | depths: DT=2, PT=2, ET=2, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=0, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=2, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=2, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=255, PT=255, ET=255, E=255 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=0 | Flame Graph | |
link_by_source_by_property | depths: DT=2, PT=2, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=2, ET=2, E=2 | Flame Graph |
representative_read_entity_type
Function | Value | Mean | Flame graphs |
---|---|---|---|
get_entity_type_by_id | Account ID: d4e16033-c281-4cde-aa35-9085bf2e7579
|
Flame Graph |
scaling_read_entity_complete_one_depth
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 50 entities | Flame Graph | |
entity_by_id | 5 entities | Flame Graph | |
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 25 entities | Flame Graph |
scaling_read_entity_linkless
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 100 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 1000 entities | Flame Graph | |
entity_by_id | 10000 entities | Flame Graph |
scaling_read_entity_complete_zero_depth
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 50 entities | Flame Graph | |
entity_by_id | 5 entities | Flame Graph | |
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 25 entities | Flame Graph |
🌟 What is the purpose of this PR?
This PR outlines the work for the embeddings that will be done in the PDF and preliminary structs with how they will be stored. The main goal is to split embeddings into multiple levels of textual information such as table embeddings. Future PRs will focus on implementing the XML metadata that will accompany these embeddings to be used for entity extraction.
🚫 Blocked by
🔍 What does this change?
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
🐾 Next steps
🛡 What tests cover this?