H-3614: Introduce chonky PDF Embeddings #5673

Open
wants to merge 4 commits into
base: main

JesusFileto
Member

🌟 What is the purpose of this PR?

This PR outlines the embedding work for PDFs and introduces preliminary structs that describe how the embeddings will be stored. The main goal is to split embeddings into multiple levels of textual information, such as table embeddings. Future PRs will implement the XML metadata that accompanies these embeddings for use in entity extraction.
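
To make the idea of level-based embeddings concrete, here is a minimal sketch of how such structs could look. It is not the code from this PR: `EmbeddingLevel`, `Embedding`, `PdfEmbeddings`, and all field names are hypothetical, and only the table level is actually named in the description above.

```rust
use std::collections::HashMap;

/// Hypothetical granularity levels. Only `Table` is mentioned in the PR
/// description; `Page` is an assumption added for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum EmbeddingLevel {
    Page,
    Table,
}

/// One embedding vector plus the (assumed) information needed to locate it.
#[derive(Debug, Clone, PartialEq)]
pub struct Embedding {
    pub level: EmbeddingLevel,
    pub page_number: usize,
    pub vector: Vec<f32>,
}

/// All embeddings of one PDF, grouped by level.
#[derive(Debug, Default)]
pub struct PdfEmbeddings {
    by_level: HashMap<EmbeddingLevel, Vec<Embedding>>,
}

impl PdfEmbeddings {
    /// Stores an embedding under its own level.
    pub fn insert(&mut self, embedding: Embedding) {
        self.by_level
            .entry(embedding.level)
            .or_default()
            .push(embedding);
    }

    /// Returns every embedding stored at `level` (empty if there are none).
    pub fn at_level(&self, level: EmbeddingLevel) -> &[Embedding] {
        self.by_level
            .get(&level)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Grouping by level keeps lookups such as "all table embeddings" cheap; how the data is ultimately persisted is still open, as noted under the next steps.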

🚫 Blocked by

  • Requires setting up HASH API keys for the Google Vertex API and including the HuggingFaceToken
  • ...

🔍 What does this change?

  • ...

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • does not modify any publishable blocks or libraries, or modifications do not need publishing

📜 Does this require a change to the docs?

The changes in this PR:

  • require changes to docs which are not made in this PR
    • The current format for API calls and setup may change with further modifications

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph

⚠️ Known issues

  • On the first API call to HuggingFace, the model is "cold" and takes 20 seconds to boot before it can be used. A redrive mechanism for this scenario will be implemented later.
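
One possible shape for such a redrive mechanism is a bounded retry loop. The sketch below is an assumption, not the planned implementation: `retry_with_delay` is a hypothetical helper, and it uses the blocking `std::thread::sleep` only to stay self-contained. In async code the sleep would be replaced by the runtime's own timer.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: calls `call` until it succeeds or `max_attempts`
/// attempts have been made, sleeping `delay` after each failed attempt.
/// The closure receives the 1-based attempt number. At least one attempt
/// is always made, even if `max_attempts` is 0.
pub fn retry_with_delay<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut call: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match call(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(error) if attempt >= max_attempts => return Err(error),
            // Otherwise wait for the model to warm up and try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

For the cold-start case described above, the delay could be chosen close to the observed 20-second boot time, so that the second attempt is likely to reach a warm model.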

🐾 Next steps

  • Add XML metadata from the PDF to further enrich the embeddings
  • Investigate how we wish to store all this data

🛡 What tests cover this?

  • New tests were added

@github-actions bot added the following labels on Nov 20, 2024:

  • area/deps: Relates to third-party dependencies (area)
  • area/libs: Relates to first-party libraries/crates/packages (area)
  • area/tests: New or updated tests
  • area/libs > chonky: Affects the `chonky` crate (library)
Member

@TimDiekmann TimDiekmann left a comment

Thanks @JesusFileto!
I noticed that you used curl, which means requests cannot be executed asynchronously. For AI calls in particular, asynchronous calls are preferred. In addition, curl pulls in libcurl (and OpenSSL), while reqwest is written in Rust and can use rustls.

As our async backend we use tokio. If you add the "macros" and "rt-multi-thread" features to tokio, you can use #[tokio::main] on fn main() and #[tokio::test] instead of #[test], which allows those functions to be async.
Since this effectively makes the code async, we can also use tokio to read files (by enabling the fs feature on tokio and using tokio::fs instead of std::fs).

Cargo.toml Outdated
bumpalo = { version = "=3.16.0", default-features = false }
bytes = { version = "1.6.0" }
clap_builder = { version = "=4.5.21", default-features = false, features = ["std"] }
criterion = { version = "=0.5.1" }
curl = { version = "=0.4.47" }
Member

We should use reqwest instead. It's already configured in the root manifest. You probably want to enable the json feature so we don't need to convert the JSON to a string first.

Member Author

OK, I have pushed some changes that do this, but I have been unable to resolve the tokio linter errors caused by moves during the await calls.

@@ -0,0 +1,451 @@
pub mod multi_modal_embedding {
Member

Let's split modules into files now. This means that embedding.rs becomes embedding/mod.rs and multi_modal_embedding is located in embedding/multi_modal_embedding.rs (or probably embedding/multi-modal, as the parent is already called embedding).

Member Author

Should be done now!

Comment on lines 29 to 37
let formatted_json = serde_json::to_string_pretty(&json!({
    "instances": [
        {
            "image": {
                "bytesBase64Encoded": base64_encoded_img
            }
        }
    ]
}))
Member

Is there a specific reason why you convert the JSON to a string? I'm not sure about curl but reqwest does support JSON directly (and is async).

Member Author

Now that we use reqwest this has been removed.

@vilkinsons changed the title from "Introduce PDF Embeddings" to "H-3614: Introduce chonky PDF Embeddings" on Nov 20, 2024

codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 23.15%. Comparing base (f185387) to head (5d9d336).
Report is 78 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5673      +/-   ##
==========================================
+ Coverage   19.83%   23.15%   +3.32%     
==========================================
  Files         515      561      +46     
  Lines       17327    18882    +1555     
  Branches     2548     2666     +118     
==========================================
+ Hits         3437     4373     +936     
- Misses      13852    14457     +605     
- Partials       38       52      +14     
| Flag | Coverage Δ |
| --- | --- |
| apps.hash-ai-worker-ts | 1.32% <ø> (-0.06%) ⬇️ |
| apps.hash-api | 1.16% <ø> (-0.01%) ⬇️ |
| blockprotocol.type-system | 46.42% <ø> (-0.98%) ⬇️ |
| local.harpc-client | 72.03% <ø> (?) |
| local.hash-backend-utils | 8.80% <ø> (ø) |
| local.hash-graph-sdk | 100.00% <ø> (ø) |
| local.hash-isomorphic-utils | 1.04% <ø> (-0.01%) ⬇️ |
| local.hash-subgraph | 24.54% <ø> (ø) |
| rust.deer | 6.66% <ø> (ø) |
| rust.error-stack | 72.51% <ø> (ø) |
| rust.sarif | 87.66% <ø> (ø) |

Contributor

Benchmark results

@rust/hash-graph-benches – Integrations

representative_read_entity

| Function | Value | Mean | Flame graphs |
| --- | --- | --- | --- |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1 | $$16.4 \mathrm{ms} \pm 222 \mathrm{μs}\left({\color{gray}-2.709 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1 | $$16.8 \mathrm{ms} \pm 204 \mathrm{μs}\left({\color{gray}2.41 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2 | $$16.5 \mathrm{ms} \pm 193 \mathrm{μs}\left({\color{gray}-0.828 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1 | $$16.5 \mathrm{ms} \pm 200 \mathrm{μs}\left({\color{gray}-3.232 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1 | $$17.0 \mathrm{ms} \pm 188 \mathrm{μs}\left({\color{red}5.54 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1 | $$16.9 \mathrm{ms} \pm 192 \mathrm{μs}\left({\color{gray}-0.340 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1 | $$15.9 \mathrm{ms} \pm 197 \mathrm{μs}\left({\color{gray}-0.994 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1 | $$16.8 \mathrm{ms} \pm 181 \mathrm{μs}\left({\color{gray}-0.762 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1 | $$16.6 \mathrm{ms} \pm 190 \mathrm{μs}\left({\color{gray}-1.167 \mathrm{\%}}\right) $$ | Flame Graph |

representative_read_multiple_entities

| Function | Value | Mean | Flame graphs |
| --- | --- | --- | --- |
| entity_by_property | depths: DT=255, PT=255, ET=255, E=255 | $$69.7 \mathrm{ms} \pm 490 \mathrm{μs}\left({\color{gray}1.23 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_property | depths: DT=0, PT=0, ET=0, E=0 | $$39.2 \mathrm{ms} \pm 256 \mathrm{μs}\left({\color{gray}-1.039 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_property | depths: DT=2, PT=2, ET=2, E=2 | $$58.2 \mathrm{ms} \pm 358 \mathrm{μs}\left({\color{gray}-0.718 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_property | depths: DT=0, PT=0, ET=0, E=2 | $$43.7 \mathrm{ms} \pm 330 \mathrm{μs}\left({\color{gray}-1.833 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_property | depths: DT=0, PT=0, ET=2, E=2 | $$49.7 \mathrm{ms} \pm 283 \mathrm{μs}\left({\color{gray}0.784 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_property | depths: DT=0, PT=2, ET=2, E=2 | $$53.8 \mathrm{ms} \pm 339 \mathrm{μs}\left({\color{gray}-1.030 \mathrm{\%}}\right) $$ | Flame Graph |
| link_by_source_by_property | depths: DT=255, PT=255, ET=255, E=255 | $$108 \mathrm{ms} \pm 657 \mathrm{μs}\left({\color{gray}-0.067 \mathrm{\%}}\right) $$ | Flame Graph |
| link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=0 | $$41.8 \mathrm{ms} \pm 299 \mathrm{μs}\left({\color{gray}0.264 \mathrm{\%}}\right) $$ | Flame Graph |
| link_by_source_by_property | depths: DT=2, PT=2, ET=2, E=2 | $$98.3 \mathrm{ms} \pm 506 \mathrm{μs}\left({\color{gray}-1.367 \mathrm{\%}}\right) $$ | Flame Graph |
| link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=2 | $$80.2 \mathrm{ms} \pm 564 \mathrm{μs}\left({\color{gray}0.646 \mathrm{\%}}\right) $$ | Flame Graph |
| link_by_source_by_property | depths: DT=0, PT=0, ET=2, E=2 | $$88.1 \mathrm{ms} \pm 404 \mathrm{μs}\left({\color{gray}0.056 \mathrm{\%}}\right) $$ | Flame Graph |
| link_by_source_by_property | depths: DT=0, PT=2, ET=2, E=2 | $$92.8 \mathrm{ms} \pm 469 \mathrm{μs}\left({\color{gray}-2.209 \mathrm{\%}}\right) $$ | Flame Graph |

representative_read_entity_type

| Function | Value | Mean | Flame graphs |
| --- | --- | --- | --- |
| get_entity_type_by_id | Account ID: d4e16033-c281-4cde-aa35-9085bf2e7579 | $$1.39 \mathrm{ms} \pm 4.61 \mathrm{μs}\left({\color{gray}-0.700 \mathrm{\%}}\right) $$ | Flame Graph |

scaling_read_entity_complete_one_depth

| Function | Value | Mean | Flame graphs |
| --- | --- | --- | --- |
| entity_by_id | 50 entities | $$259 \mathrm{ms} \pm 1.54 \mathrm{ms}\left({\color{gray}-0.026 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 5 entities | $$25.8 \mathrm{ms} \pm 245 \mathrm{μs}\left({\color{gray}2.54 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 1 entities | $$20.7 \mathrm{ms} \pm 117 \mathrm{μs}\left({\color{gray}3.17 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 10 entities | $$52.0 \mathrm{ms} \pm 294 \mathrm{μs}\left({\color{gray}0.918 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 25 entities | $$88.2 \mathrm{ms} \pm 637 \mathrm{μs}\left({\color{red}25.2 \mathrm{\%}}\right) $$ | Flame Graph |

scaling_read_entity_linkless

| Function | Value | Mean | Flame graphs |
| --- | --- | --- | --- |
| entity_by_id | 1 entities | $$1.87 \mathrm{ms} \pm 10.5 \mathrm{μs}\left({\color{gray}0.321 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 100 entities | $$2.06 \mathrm{ms} \pm 8.91 \mathrm{μs}\left({\color{gray}0.082 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 10 entities | $$1.92 \mathrm{ms} \pm 10.2 \mathrm{μs}\left({\color{gray}0.297 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 1000 entities | $$2.87 \mathrm{ms} \pm 17.9 \mathrm{μs}\left({\color{gray}-2.695 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 10000 entities | $$13.5 \mathrm{ms} \pm 77.4 \mathrm{μs}\left({\color{gray}-1.793 \mathrm{\%}}\right) $$ | Flame Graph |

scaling_read_entity_complete_zero_depth

| Function | Value | Mean | Flame graphs |
| --- | --- | --- | --- |
| entity_by_id | 50 entities | $$4.26 \mathrm{ms} \pm 47.5 \mathrm{μs}\left({\color{gray}0.381 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 5 entities | $$1.94 \mathrm{ms} \pm 8.29 \mathrm{μs}\left({\color{gray}1.55 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 1 entities | $$1.90 \mathrm{ms} \pm 14.2 \mathrm{μs}\left({\color{gray}0.667 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 10 entities | $$2.10 \mathrm{ms} \pm 16.9 \mathrm{μs}\left({\color{gray}-0.342 \mathrm{\%}}\right) $$ | Flame Graph |
| entity_by_id | 25 entities | $$2.78 \mathrm{ms} \pm 16.8 \mathrm{μs}\left({\color{lightgreen}-16.980 \mathrm{\%}}\right) $$ | Flame Graph |
