bgpalerter seems to have lost visibility ~ April 8 - RIS issues ? #535

Closed
mfld-pub opened this issue Apr 11, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@mfld-pub
Contributor

It took me a few days to report this because I wanted to make sure it was not a local issue:

Since April 8, ~12:00 UTC, I have not seen any monitored events being triggered. The RIPE service status page indicates all is well, so RIS Live should be functioning.

Running on RHEL 8 as a systemd service. About a year in prod. Worked fine after updating to v1.27.1.

I checked that notifications work with the -t flag. They do. It spams Email and Telegram when I use the -t flag. I checked that the process has sufficient resources and permissions - all good. I checked bgpalerter's reports.log and it is indeed empty but I know I created plenty of mayhem "events" :P

I then tried creating a new prefixes.yml list and config.yml by stopping the service, renaming the existing ones, and executing the binary manually once with bgpalerter-linux-x64 generate -a ASN -o prefixes.yml -i -m

This completed without errors and created sensible files. I restarted the service, withdrew a monitored prefix, and ran tail -f reports.log. Nothing.

I hijacked my prefix from a lab ASN. Nothing.

Final sanity check before reaching out: I spun up a new Ubuntu server, installed docker and created the bgpalerter docker container. Created config, started it, withdrew a prefix. This instance, too, does not "see" an event.

I do see this in error.log of the original prod instance, around the time things stopped working:

2021-04-10T14:34:44+00:00 info: ris connector connected
2021-04-10T14:39:45+00:00 info: ris connector connected
2021-04-10T15:05:15+00:00 error: Error: Unexpected server response: 500
2021-04-10T15:05:15+00:00 error: It was not possible to establish a connection with RIPE RIS
2021-04-10T15:06:20+00:00 error: Error: Unexpected server response: 500
2021-04-10T15:06:20+00:00 error: It was not possible to establish a connection with RIPE RIS
2021-04-10T15:07:25+00:00 info: ris connector connected

But these 500 responses have occurred from time to time in the past. Of note: during my prefix withdrawal tests, the last entry in error.log indicated that we were connected at that time: info: ris connector connected

Did RIPE change anything in RIS Live that could be breaking me ?

Unrelated, I did have some 530 responses from Cloudflare when using cloudflare as my vrpProvider for rpki which broke RPKI detection but I had switched back to ntt since and it worked fine.

@mfld-pub mfld-pub added the bug Something isn't working label Apr 11, 2021
@massimocandela
Member

Hi @mfld-pub,

I'm checking your ticket.

In the meantime, what does the uptimeApi say? If any problem occurs in BGPalerter, a warning appears in the API. Was the process monitored at the time of the incident?
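
For anyone who wants to script that check, here is a minimal sketch of how a status payload could be inspected. The payload shape (a boolean `warning` field and a `connectors` list with `name` and `connected` keys) is an assumption for illustration, as is the helper name `summarize_status`; check the BGPalerter documentation for what your version actually returns.

```python
import json


def summarize_status(raw: str) -> list[str]:
    """Return human-readable problems found in a status payload.

    The field names used here are assumptions, not a confirmed schema.
    """
    status = json.loads(raw)
    problems = []
    if status.get("warning"):
        problems.append("warning flag is set")
    for connector in status.get("connectors", []):
        if not connector.get("connected", False):
            name = connector.get("name", "?")
            problems.append(f"connector {name} is disconnected")
    return problems


# Hypothetical payload, not captured from a real instance.
sample = '{"warning": true, "connectors": [{"name": "ConnectorRIS", "connected": false}]}'
print(summarize_status(sample))
```

An empty list means nothing was flagged in the payload.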

> I then tried creating a new prefixes.yml list and config.yml by stopping the service, renaming the existing ones, executing the binary manually once with bgpalerter-linux-x64 generate -a ASN -o prefixes.yml -i -m

I'm doing the same with your AS, I'll let you know

> I do see in error.log of the original prod instance around the time things stopped working:
>
> 2021-04-10T14:34:44+00:00 info: ris connector connected
> 2021-04-10T14:39:45+00:00 info: ris connector connected
> 2021-04-10T15:05:15+00:00 error: Error: Unexpected server response: 500
> 2021-04-10T15:05:15+00:00 error: It was not possible to establish a connection with RIPE RIS
> 2021-04-10T15:06:20+00:00 error: Error: Unexpected server response: 500
> 2021-04-10T15:06:20+00:00 error: It was not possible to establish a connection with RIPE RIS
> 2021-04-10T15:07:25+00:00 info: ris connector connected
>
> But these 500 responses have occurred from time to time in the past. Of note is that during my testing of prefix withdrawals the last entry in error.log indicated that we were connected at that time: info: ris connector connected

This should not matter; connection errors can occur. Everything is OK as long as the process is able to re-establish the connection within a short time.
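
To put a number on how long a reconnect took, a log excerpt like the one quoted above can be measured with a few lines of Python. This is only an illustrative sketch, not part of BGPalerter: the helper name `outage_seconds` is made up, and it assumes the line layout shown in the excerpt (ISO timestamp, then level and message).

```python
from datetime import datetime

# Excerpt copied from the error.log lines quoted in this thread.
LOG = """\
2021-04-10T14:39:45+00:00 info: ris connector connected
2021-04-10T15:05:15+00:00 error: Error: Unexpected server response: 500
2021-04-10T15:05:15+00:00 error: It was not possible to establish a connection with RIPE RIS
2021-04-10T15:06:20+00:00 error: Error: Unexpected server response: 500
2021-04-10T15:06:20+00:00 error: It was not possible to establish a connection with RIPE RIS
2021-04-10T15:07:25+00:00 info: ris connector connected
"""


def outage_seconds(log: str) -> list[float]:
    """Return the length of each error-to-reconnect gap, in seconds."""
    gaps = []
    outage_start = None
    for line in log.splitlines():
        if not line.strip():
            continue
        timestamp, _, message = line.partition(" ")
        when = datetime.fromisoformat(timestamp)
        if message.startswith("error:") and outage_start is None:
            # First error after a healthy period marks the start of an outage.
            outage_start = when
        elif message == "info: ris connector connected" and outage_start is not None:
            gaps.append((when - outage_start).total_seconds())
            outage_start = None
    return gaps


print(outage_seconds(LOG))  # prints [130.0]
```

For this excerpt the gap between the first 500 response and the next successful connection is 130 seconds.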

> Unrelated, I did have some 530 responses from Cloudflare when using cloudflare as my vrpProvider for rpki which broke RPKI detection but I had switched back to ntt since and it worked fine.

I'm not sure if you have done it already, but I suggest using preCacheROAs: true. This pre-caches a VRP dump instead of relying on online queries.

@mfld-pub
Contributor Author

It must have been a RIS issue of some sort?! As of 09:14 UTC today it seems to work again on both my RHEL 8 prod instance and the Ubuntu Docker container that I set up for this ticket.

@massimocandela
Member

Yes. I was able to reproduce your issue, so I contacted the main developer behind RIS, who did some digging.
Somebody was flooding the service with connections (now banned); as a result, new legitimate connections were slow to be served. You spotted this because you were one of the unlucky ones. We were already connected, so we did not notice.

We are planning some improvements, including monitoring for missing or delayed messages in both BGPalerter and RIS. You will see a PR linked to this issue soon. In the meantime, a new rule limiting the number of connections per user has been set in RIS (since one connection can carry unlimited prefix subscriptions, there is no reason at all to open multiple connections... just a lack of reading-the-doc skills).
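
To illustrate the point about one connection carrying many subscriptions, here is a minimal sketch that builds one RIS Live `ris_subscribe` message per prefix, all intended to be sent over the same WebSocket. The helper name `subscription_messages` and the example prefixes (documentation ranges) are assumptions; the sketch only builds the JSON payloads and does not open a connection.

```python
import json


def subscription_messages(prefixes: list[str]) -> list[str]:
    """Build one ris_subscribe message per prefix, all meant for a single socket."""
    return [
        json.dumps(
            {
                "type": "ris_subscribe",
                # moreSpecific asks for updates on sub-prefixes as well.
                "data": {"prefix": prefix, "moreSpecific": True},
            }
        )
        for prefix in prefixes
    ]


# Documentation prefixes used as placeholders.
for message in subscription_messages(["192.0.2.0/24", "2001:db8::/32"]):
    print(message)
```

Each string would be sent as a separate text frame on the one open connection, rather than opening a new connection per prefix.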

Thanks for reporting this!!

@mfld-pub
Contributor Author

Thanks for cruising github issues on your Sunday <3

I think they also want us to send a user-agent with our connections in the ?client= parameter of the URI. Something like

wss://ris-live.ripe.net/v1/ws/?client=AS174-basement-lab

Would it make sense to make this a configurable option in BGPalerter?
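
For a hand-rolled RIS Live client, attaching the identifier could look like the sketch below. It only builds the URL with the standard library; the helper name `with_client` is made up for illustration, and the client string is the example from above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

RIS_LIVE_WS = "wss://ris-live.ripe.net/v1/ws/"


def with_client(base_url: str, client: str) -> str:
    """Return base_url with a URL-encoded ?client= query parameter."""
    scheme, netloc, path, _, fragment = urlsplit(base_url)
    query = urlencode({"client": client})
    return urlunsplit((scheme, netloc, path, query, fragment))


print(with_client(RIS_LIVE_WS, "AS174-basement-lab"))
# prints wss://ris-live.ripe.net/v1/ws/?client=AS174-basement-lab
```

Using `urlencode` keeps the identifier safe if it ever contains spaces or other reserved characters.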

@massimocandela
Member

> Thanks for cruising github issues on your Sunday <3

You are welcome :)

> I think they also want us to send a user-agent with our connections in the ?client= parameter of the URI. Something like
>
> Would it make sense to make this a configurable option in BGPalerter?

We agreed with the RIS staff on which user agent to send for all BGPalerter clients. BGPalerter already does this, and it should not be configured per instance. Each user also has an additional random connection ID.

@massimocandela
Member

As promised, in addition to the fix on the RIS side reported above, in the next release of BGPalerter there will be a check for silent socket sessions.
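
The general idea behind such a check can be sketched as a small watchdog that records when the last message arrived and flags the session once it has been quiet for too long. This is not BGPalerter's implementation; the class name `SilenceWatchdog`, the injectable clock, and the 60-second threshold are all assumptions for illustration.

```python
import time


class SilenceWatchdog:
    """Flag a socket session that is connected but has stopped delivering messages."""

    def __init__(self, max_silence_seconds: float, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen = clock()

    def message_received(self) -> None:
        """Call this for every message read from the socket."""
        self._last_seen = self._clock()

    def is_silent(self) -> bool:
        """True once no message has arrived for longer than the threshold."""
        return self._clock() - self._last_seen > self._max_silence


# Usage with a fake clock (hypothetical 60-second threshold).
now = [0.0]
watchdog = SilenceWatchdog(60, clock=lambda: now[0])
now[0] = 61.0
print(watchdog.is_silent())  # prints True
```

A client would typically reconnect or raise an alert when `is_silent()` returns true. The threshold has to allow for periods where the monitored prefixes are legitimately quiet.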
