feat: new JobFile model to store UrsaDB iterator batch files #420

mickol34 · 2024-10-09T13:32:32Z

Your checklist for this pull request

I've read the contributing guideline.
I've tested my changes by building and running mquery, and testing changed functionality (if applicable)
I've added automated tests for my change (if applicable, optional)
I've updated documentation to reflect my change (if applicable)

What is the current behaviour?

Currently, a query to UrsaDB returns an iterator and its length, which are then passed to the batcher and the YARA parser to unpack.

What is the new behaviour?
Now the iterator is popped immediately after querying, and several JobFile objects are created, each storing a batch of files to run YARA on.

Test plan

The app should work the same way for various queries and rules. If another part of the system fails while YARA is executing, the JobFile instances should remain (since they are deleted only upon successful YARA execution).
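For illustration only, the cleanup behaviour described in the test plan could look roughly like the sketch below. This is not the code from this PR: it uses sqlite3 from the standard library as a stand-in for mquery's SQLModel models, and the table layout and the names create_schema, add_batch, process_batch and remaining are all hypothetical.

```python
import json
import sqlite3
from typing import Callable, List


def create_schema(db: sqlite3.Connection) -> None:
    # Hypothetical table: one row per batch of files belonging to a job.
    db.execute(
        "CREATE TABLE jobfile ("
        "id INTEGER PRIMARY KEY, job_id TEXT NOT NULL, files TEXT NOT NULL)"
    )


def add_batch(db: sqlite3.Connection, job_id: str, files: List[str]) -> int:
    # Store the batch as a JSON list and return the new row id.
    cur = db.execute(
        "INSERT INTO jobfile (job_id, files) VALUES (?, ?)",
        (job_id, json.dumps(files)),
    )
    db.commit()
    return cur.lastrowid


def process_batch(
    db: sqlite3.Connection, jobfile_id: int, scan: Callable[[List[str]], None]
) -> None:
    # Load one batch, run the scan, and delete the row only if scan succeeded.
    row = db.execute(
        "SELECT files FROM jobfile WHERE id = ?", (jobfile_id,)
    ).fetchone()
    if row is None:
        return
    scan(json.loads(row[0]))  # if this raises, the row stays in the table
    db.execute("DELETE FROM jobfile WHERE id = ?", (jobfile_id,))
    db.commit()


def remaining(db: sqlite3.Connection, job_id: str) -> int:
    # Number of batches of this job that have not been processed successfully.
    return db.execute(
        "SELECT COUNT(*) FROM jobfile WHERE job_id = ?", (job_id,)
    ).fetchone()[0]
```

With this shape, any rows left over after a job point directly at the batches whose scan did not complete.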

Closing issues

fixes #381

src/lib/ursadb.py

msm-cert · 2024-10-09T13:56:40Z

src/models/queryresult.py

+
+class QueryResult(SQLModel, table=True):
+  job_id: str = Field(foreign_key="job.internal_id", primary_key=True)
+  files: List[str] = Field(sa_column=Column(ARRAY(String)))


Don't use SQL arrays for this, it should be a regular record (file: str)

mickol34 · 2024-10-10

So this means I should create multiple objects, one for every file returned from .query()?

src/tasks.py

msm-cert · 2024-10-16T16:11:38Z

BTW, this fails because the QueryResult table does not exist in the DB. You need to create an Alembic migration for it (see also the recent PR about the enum rework). I haven't re-reviewed the rest of the code yet.

mickol34 · 2024-10-17T13:55:50Z

Referring to #381, I'm not sure ursadb.pop() is dead code, since it's used in the all_indexed_files and all_indexed_names functions. If there are any other cleanups and/or tests to add, please comment below and I'll fix them.

msm-cert · 2024-10-18T15:30:03Z

Ugh, this is tough with regards to RAM usage. Nothing wrong with this PR by itself, but I'll have to think what to suggest. Let's keep it open as a draft for now. Sorry!

By the way, in the current form storing the files in the database (QueryResult objects) is a bit pointless, since nobody ever uses them, right? 🙃

msm-cert

Ok, I think we can continue with this but with some large changes:

Restore query to the previous version (so it returns an iterator).
When querying, use that iterator to immediately add all the files from the result to the database (call pop in a loop and insert into the database; no need to handle the was_locked state).
I think QueryResult should be something like JobFile (id: int, job_id: int, file: str), so workers don't have to get all the results at once (they can just select with limit and offset).
Alternatively, insert multiple QueryResults into the database, each with a batch of files (this is easier to implement but less flexible, though probably still OK).
Importantly, when adding tasks with agent.queue.enqueue(..., don't add filenames there. That's because they're stored in Redis (== RAM) and the list can potentially be very large. Instead, store just an offset or a batch_id (depending on which of the two options above you picked).

The reasoning here is that we prefer to have files in the database instead of in the opaque and weird ursadb iterator, but we have to be careful to avoid keeping the entire set of files at once in memory. In most cases this set will be small, but we can't OOM if someone does a huge query (this was actually a problem in the past).
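A rough sketch of the first option (one row per file, workers selecting with limit and offset) might look like the following. It is only an illustration under assumptions: sqlite3 stands in for the real database, a plain Python iterator stands in for the results of calling pop on the UrsaDB iterator, and the names create_schema, store_job_files, batch_offsets and load_batch are hypothetical. In this sketch rows are not deleted while workers are still reading, since that would shift the offsets.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List


def create_schema(db: sqlite3.Connection) -> None:
    # Hypothetical JobFile-like table: one row per matched file.
    db.execute(
        "CREATE TABLE jobfile ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "job_id TEXT NOT NULL, file TEXT NOT NULL)"
    )


def chunks(files: Iterable[str], size: int) -> Iterator[List[str]]:
    # Yield lists of at most `size` items without materialising the input.
    it = iter(files)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def store_job_files(
    db: sqlite3.Connection, job_id: str, files: Iterable[str], batch_size: int
) -> int:
    # Drain the iterator chunk by chunk so only one chunk is in memory at once.
    total = 0
    for chunk in chunks(files, batch_size):
        db.executemany(
            "INSERT INTO jobfile (job_id, file) VALUES (?, ?)",
            [(job_id, name) for name in chunk],
        )
        total += len(chunk)
    db.commit()
    return total


def batch_offsets(total: int, batch_size: int) -> List[int]:
    # Only these small integers would go into the task queue, not filenames.
    return list(range(0, total, batch_size))


def load_batch(
    db: sqlite3.Connection, job_id: str, offset: int, limit: int
) -> List[str]:
    # What a worker would run after receiving (job_id, offset) from the queue.
    rows = db.execute(
        "SELECT file FROM jobfile WHERE job_id = ? ORDER BY id LIMIT ? OFFSET ?",
        (job_id, limit, offset),
    ).fetchall()
    return [row[0] for row in rows]
```

The queue payload stays constant in size per task regardless of how many files the query matched, which is the point of keeping filenames out of Redis.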

I realise this may sound a bit convoluted; feel free to contact me if you have any questions.

msm-cert reviewed Oct 9, 2024

mickol34 added 3 commits October 17, 2024 12:14

Draft: fix: rewrite query_ursadb not to use iterators

06ed883

fix: moved logic to more suitable classes and files

e5fd1a2

style: e2e logs

173fa53

mickol34 force-pushed the fix/store-job-files-in-new-model-381 branch from a3470ba to 173fa53 Compare October 17, 2024 10:16

fix: added alembic migration

a7b16a0

mickol34 marked this pull request as ready for review October 17, 2024 13:55

mickol34 requested a review from msm-cert October 17, 2024 13:56

msm-cert changed the title ~~Draft: fix: rewrite query_ursadb not to use iterators~~ Draft: rewrite query_ursadb not to use iterators Oct 18, 2024

msm-cert requested changes Oct 31, 2024

fix: create batch files to pass IDs to save Redis RAM

e81a04a

mickol34 force-pushed the fix/store-job-files-in-new-model-381 branch from caed157 to e81a04a Compare November 6, 2024 15:49

mickol34 added 2 commits November 6, 2024 16:50

fix: lint v2

2311c88

fix: rebase migration down revision

8801269

mickol34 requested a review from msm-cert November 6, 2024 15:58

mickol34 changed the title ~~Draft: rewrite query_ursadb not to use iterators~~ feat: new JobFile model to store UrsaDB iterator batch files Nov 6, 2024
