feat: new JobFile model to store UrsaDB iterator batch files #420
base: master
Conversation
src/models/queryresult.py
Outdated
class QueryResult(SQLModel, table=True):
    job_id: str = Field(foreign_key="job.internal_id", primary_key=True)
    files: List[str] = Field(sa_column=Column(ARRAY(String)))
Don't use SQL arrays for this, it should be a regular record (file: str).
So this means I should create multiple objects for every file returned from .query()?
BTW, this fails because the QueryResult table does not exist in the DB. You need to create an alembic migration for it (see also the recent PR about the enum rework). I didn't re-review the rest of the code yet.
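For context, such a migration is just a standard hand-written Alembic revision. A minimal sketch might look like the one below; the revision ids, table name, and columns are placeholders to be matched to the final model, not the actual mquery migration:

```python
"""add table for query results (illustrative sketch only)"""
import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic (placeholders here)
revision = "0000000000ab"
down_revision = "previous_revision_id"


def upgrade() -> None:
    op.create_table(
        "queryresult",
        sa.Column("id", sa.Integer(), primary_key=True),
        sa.Column("job_id", sa.String(), sa.ForeignKey("job.internal_id"), nullable=False),
        sa.Column("file", sa.String(), nullable=False),
    )


def downgrade() -> None:
    op.drop_table("queryresult")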
Force-pushed from a3470ba to 173fa53
Referring to #381, I'm not sure if
Ugh, this is tough with regards to RAM usage. Nothing wrong with this PR by itself, but I'll have to think what to suggest. Let's keep it open as a draft for now. Sorry! By the way, in the current form storing the files in the database (QueryResult objects) is a bit pointless, since nobody ever uses them, right? 🙃
Ok, I think we can continue with this but with some large changes:
- Restore query to the previous version (so return an iterator)
- When querying, use that iterator to immediately add all the files from the result to the database (call pop in a loop and insert into the database, no need to handle the was_locked state)
- I think QueryResult should be something like JobFile (id: int, job_id: int, file: str), so workers don't have to get all the results at once (they can just select with limit and offset); see the sketch at the end of this comment
- Alternatively, insert multiple QueryResults into the database, each with a batch of files (this is easier to implement but less flexible; still probably OK)
- Importantly, when adding tasks with agent.queue.enqueue(...), don't add filenames there. That's because it's stored in Redis (== RAM) and can potentially be very large. Instead, store just the offset or batch_id (depending on which option you picked above)
The reasoning here is that we prefer to have files in the database instead of in the opaque and weird ursadb iterator, but we have to be careful to avoid keeping the entire set of files at once in memory. In most cases this set will be small, but we can't OOM if someone does a huge query (this was actually a problem in the past).
I realise this may sound a bit convoluted; feel free to contact me in case of any questions.
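To make the first option concrete, here is a rough sketch of what a per-file model and the worker-side batched read could look like. This is a sketch only: the field types, the helper name get_batch, the default batch size, and the session handling are assumptions, not the final implementation (job_id is kept as str here to match the diff above):

```python
from typing import List, Optional

from sqlmodel import Field, Session, SQLModel, select


class JobFile(SQLModel, table=True):
    # One row per matched file, instead of one ARRAY-typed row per job.
    id: Optional[int] = Field(default=None, primary_key=True)
    job_id: str = Field(foreign_key="job.internal_id", index=True)
    file: str


def get_batch(session: Session, job_id: str, offset: int, limit: int = 500) -> List[str]:
    # Fetch one slice of file paths for a worker, without ever loading the
    # whole result set into memory.
    statement = (
        select(JobFile.file)
        .where(JobFile.job_id == job_id)
        .order_by(JobFile.id)
        .offset(offset)
        .limit(limit)
    )
    return list(session.exec(statement))
```

With this shape, an enqueued task only needs (job_id, offset) or (job_id, batch_id); the filenames themselves never go through Redis.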
Force-pushed from caed157 to e81a04a
Your checklist for this pull request
What is the current behaviour?
Currently, a query to UrsaDB returns an iterator and its length, which are then passed to the batcher and the YARA parser to unpack.
What is the new behaviour?
Now the iterator is popped immediately after querying, and several JobFile objects are created which store the batches of files to run YARA on.
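Roughly, the querying step now does something like the sketch below. This is a sketch only: the batch size, the session handling, and the store_job_files helper are simplified assumptions rather than the PR's actual code, and JobFile here is the batch-per-row variant described above:

```python
from typing import Iterator, List, Optional

from sqlalchemy import ARRAY, Column, String
from sqlmodel import Field, Session, SQLModel


class JobFile(SQLModel, table=True):
    # Batch-per-row variant: each row holds one batch of matched files.
    id: Optional[int] = Field(default=None, primary_key=True)
    job_id: str = Field(foreign_key="job.internal_id", index=True)
    files: List[str] = Field(sa_column=Column(ARRAY(String)))


BATCH_SIZE = 500  # illustrative, not the PR's actual value


def store_job_files(session: Session, job_id: str, matched_files: Iterator[str]) -> int:
    # Drain the UrsaDB result iterator right after the query and persist it
    # as JobFile batches. Returns the number of batches created, so tasks can
    # be enqueued with just (job_id, batch index) instead of the file names.
    batch: List[str] = []
    batches = 0
    for path in matched_files:
        batch.append(path)
        if len(batch) >= BATCH_SIZE:
            session.add(JobFile(job_id=job_id, files=batch))
            batches += 1
            batch = []
    if batch:
        session.add(JobFile(job_id=job_id, files=batch))
        batches += 1
    session.commit()
    return batches
```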
Test plan
The app should work the same way for various queries and rules. In case of a system failure while YARA is executing, JobFile instances should remain in the database (since they are deleted only upon successful YARA execution).
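For illustration, the cleanup on success could be as simple as the snippet below (again a sketch: the function name, import path, and field names are assumptions, with JobFile as sketched above):

```python
from sqlalchemy import delete
from sqlmodel import Session

from models.jobfile import JobFile  # import path assumed for illustration


def finish_batch(session: Session, job_id: str, batch_id: int) -> None:
    # Delete the processed batch only after YARA ran successfully; if the
    # worker crashes earlier, the JobFile row stays behind for a retry.
    session.execute(delete(JobFile).where(JobFile.job_id == job_id, JobFile.id == batch_id))
    session.commit()
```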
Closing issues
fixes #381