-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shadowserver dynamic config #2372
Conversation
@elsif2 the license check is failing |
From the log:
License added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long waiting for the review. I have noted a few things that are important for me, mainly from the perspective of the IntelMQ user. I have marked with 🔴 things that are the most critical for me.
Generally, I like separating the schema from IntelMQ releases. I have unfortunately doubts in some decisions as I'd like to better control when and if the update is happening. I'd also recommend taking a look on how other bots handle such topic, e.g. ASN expert: https://github.com/certtools/intelmq/tree/bcccbf501ffa8b64c2ea011d457c3409ad1dd4cd/intelmq/bots/experts/asn_lookup
I have also identified a few points, where the code can break e.g. in our environment.
In addition, I'd recommend adding the regular update to the crontab installed with native packages: https://github.com/certtools/intelmq/tree/develop/contrib/cron-jobs
If you have any questions or need any help, you can hit me on the ShadowServer MM or email (but unfortunately I go on the vacation and be back at 21.08).
In addition, I'd like to ask if you can publish a policy on changes in the schema, the most important for me is in which cases you can change classification config (e.g. identifier, taxonomy) |
I haven't seen a discussion of this architectural change or and IEP yet. Or was it discussed on a non-public list? |
77ac732
to
3d6a87e
Compare
I started a Schema contract section to address concerns in the |
@sebix I had a good chat with elsif2 today and we said
So, TL;DR: don't worry :) we are adressing concerns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share a lot of concerns with @kamil-certat here, but my main concern is: is there a way to be as little disruptive as possible? Or let me re-formulate: let's think hard how we can achieve this.
Because most of the time intelmq runs unattended, automatically, at night.
People don't notice it in the logs if something dynamic (like a config update ) happened.
Nor will they debug it.
We agreed in a chat today that
- I would try out the change
- We iterate over it and improve it
- We document it
- We document a type of "contract" between users/-admins and the dynamic update service
- We need to have a opt-in/opt-out process for this.
I want to re-iterate that I think it is very very beneficial if intelmq users can easily get the latest and best shadowserver parsers. We just need to make sure it gets done right.
CHANGELOG.md
Outdated
|
||
#### Parsers | ||
- `intelmq.bots.parsers.shadowserver._config`: | ||
- Reset detected `feedname` at shutdown to re-detect the feedname on reloads (PR#2361 by @elsif2, fixes #2360). | ||
- Switch to dynamic configuration to decouple report schema changes from IntelMQ releases. Please see intelmq/bots/parsers/shadowserver/README.md for details. (PR#2372) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also add a section in the official intelmq docs for the shadowserver feeds, since they are so crucial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The README is not the right place to document the bot for the users.
user documentation is in docs/user/bots.rst
(https://intelmq.readthedocs.io/en/latest/user/bots.html#shadowserver) and intelmq/etc/feeds.yml
(https://intelmq.readthedocs.io/en/latest/user/feeds.html). For upgrade steps, use NEWS.md
.
Please move the information from the README there.
For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision. | ||
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set. | ||
|
||
For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - what happens when people don't update in air-gapped environments? Is the parsing fail-safe then ? Like HTML ignores fields and element attributes it does not understand?
'classification.identifier': 'spam-url', | ||
}, | ||
} | ||
functions = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't check this huge diff, but I assume that's valid since automatically generated stuff gets removed and moved to the dynamic confgen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior is unchanged from the existing release. Errors are logged for new report types. Fields added to existing reports are ignored.
} | ||
|
||
|
||
def reload(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- and have a mechanism to test the new config automatically. If it fails, the admin needs to know. And the developers need to know.
For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision. | ||
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set. | ||
|
||
For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with sebix here: this would be a good initial step which helps it to work implicitly.
destination_queues: | ||
_default: [shadowserver-parser-queue] | ||
file_format: csv | ||
api_key: "$API_KEY_received_from_the_shadowserver_foundation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to also to give a specific query (as in the shadowserver API ) here? like asn:12345?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to also to give a specific query (as in the shadowserver API ) here? like asn:12345?
No. Some organizations do have separate report sets that can be handled independently using the reports option.
This is a good step forward!
I'm not worried personally, as I'm not directly affected, but many users are. I want to contribute my experience to get the change into operation smoothly. Needless to say, the proposed approach has its advantages. Here are some thoughts:
|
Let me try to address some topics mentioned.
nope, since you can always fetch the newest mappings manually and review. So, I don't see a problem here.
Again see above, it will also work in airgapped mode and pulling the latest mappings will be similar to doing an intelmq update manually.
This is a very relevant point. I think this needs to be in the contract that historic fields don't get remapped or changed or deleted. |
Thanks for changes, @elsif2 I see most of my technical concerns have been addressed :) About the general discussion: re: delegation of schema outside IMQ I see this as a trade-off problem, but I think ShadowServer has a very good reason to delegate it outside as their reports are developing much quicker than IntelMQ. In addition, I see a possibility for automated upgrades of this schema as a big facilitation for admins - even with a very quick adoption in IMQ packages, you still need to upgrade the whole IntelMQ instead of just the schema. Yes, it has some cons, but I think the pros are stronger in this case. re: air-gaped systems This is why I'd suggest keeping default, regularly updated (e.g. once a month) schema in the repo. Then your air-gaped system can still work after installation, just without additional configuration it would have a delay in adoption of schema changes. re: historical schema This is a good point. But I'd just suggest keeping the history of schemas, in to ways:
As I understand, the schema is versioned by time, so it should be easy to choose the right one if you need to perform a reparsing of old reports - assuming that ShadowServer would ensure that date of creation a schema means, that starting this or the next day the schema is used (please clarify it in the contract) |
|
||
__config.feedname_mapping.clear() | ||
__config.filename_mapping.clear() | ||
for schema_file in [__config.schema_file, __config.schema_base]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain a little why load schema_base
, that looks to be a test schema, every time? If the only usage is for testing, I'd suggest loading it just during tests by specifying bot's parameters.
If this has more usage, I'd suggest loading base first. Otherwise, the base can override the updated schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test_mode to switch between schema files.
For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision. | ||
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set. | ||
|
||
For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure which is better (starting schema vs. download when missing), but I can accept any of them.
The complete revision history of the schema files will be available at https://github.com/The-Shadowserver-Foundation/report_schema. |
d435ea3
to
516d534
Compare
516d534
to
61c756d
Compare
Thank you! |
…can_exchange report.
ec0369d
to
ac04471
Compare
Already reviewed some parts. Major parts are reviewed by Kamil and Aaron
looks like my requests have been met. |
I wanted to cancel the pending review for myself requested by Aaron, but it looks like I instead canceled one of Aaron's reviews 🤔 Anyway, I won't review this PR again, as I already reviewed some parts of it (core, docs). The major and more relevant parts have already been reviewed by Kamil and Aaron. I don't want to duplicate their efforts. I shared my thoughts on the change in general already above (#2372 (comment)) |
@sebix all good, I am still looking at it. It's a mega mega mega commit. Need to do a smaller one the next time to make the review process faster. Most of us have other duties / daytime jobs etc. |
This branch is currently in production usage on our instance, so far I do not see any issues. |
We would really like to see this feature merged in IntelMQ. We know the pain of manually merging the _config.py whenever new reports are added too well. The changes made to host(name) on the 18th of November showed the problem where we would require a different config from one day to another. The dynamic config would fix those kinds of issues as well. We do however see a problem. We run a custom config with a couple of changes compared to the upstream config:
We would not be surprised if other organizations have their own changes that they made to the config. |
Hey, we indeed have a couple of custom changes (e.g. filters by tags, own identifiers etc.), but we use a sieve bot for that - and as long as the |
Thanks for the pointer. We are now looking into improving our setup using the sieve bot. The only thing it cannot solve is the change of the fields for sandbox_url (you cannot add a new field with a value from another field). Maybe those fields should be changed in the upstream schema itself. However, that is not relevant for this PR. |
The change occurred between 3.0.2 and 3.1.0. I will correct this in the schema. 3.0.2
3.1.0
|
That is true. As a workaround, you could use the Jinja2 Template Expert or the Modify Expert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been possibly the longest running PR in intelmq.
But it's also the most complex one I have ever seen.
I think @elsif2 managed to stun us all.
Checking and testing this rigorously is pretty much out of the question.
But what I tested and was running on my setup, it looks good.
Thanks a ton for this important PR and sorry that it took months. Godspeed, shadowserver!
And all checks have passed :) merging in... |
- `intelmq.bots.collectors.shadowserver.collector_reports_api`: | ||
- The 'json' option is no longer supported as the 'csv' option provides better performance. | ||
|
||
#### Parsers | ||
- `intelmq.bots.parsers.shadowserver._config`: | ||
- Reset detected `feedname` at shutdown to re-detect the feedname on reloads (PR#2361 by @elsif2, fixes #2360). | ||
- `intelmq.bots.parsers.shadowserver._config`: | ||
- Switch to dynamic configuration to decouple report schema changes from IntelMQ releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changelog entry is in the wrong section (3.2.0) instead of 3.2.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be fixed with #2447
Dynamic configuration
Many thanks to Eric Halil for collaboration and testing on this update.
Parsers
intelmq.bots.parsers.shadowserver._config
:Collectors
intelmq.bots.collectors.shadowserver.collector_reports_api
: