
deployment monitoring and epic progress dashboard #4999

Closed
synctext opened this issue Dec 9, 2019 · 29 comments

@synctext
Member

synctext commented Dec 9, 2019

To better organise ourselves we need more critical information in one place.

In the coming time we aim to finally close #1. Our progress towards this goal, and how stable we are, should be captured in a Tribler-at-a-Glance dashboard. Examples from Jenkins:

[image] https://medium.com/kj187/jenkins-job-dashing-widget-cc72feeed654

[image] https://www.level-up.one/6-of-my-favorite-jenkins-plugins/

[image] https://www.datadoghq.com/blog/monitor-jenkins-datadog/

Tribler critical information candidates:

  • stability issues
    • crash reports from the wild (24h, last week, last month) latest devel, latest stable version and all versions
    • Application tester with random clicker number of faults (24h, last week, last month) latest devel, latest stable version
    • burn-in testing of running Tribler for 1 week: total CPU cycle, peak memory, total disk IO (crash or run-away resource usage)
    • issues pending
  • performance monitoring
    1. Anonymous end-to-end download performance (latest devel, latest stable version)
    2. Exit node based download speed
    3. Start of download delay, (non-)anonymous mode
    4. First time startup time
  • deployment monitoring
    • Trustchain explorer, growth of blocks (last minute; last hour; last day; last week; all time)
    • exit node status (CPU, connections, idle slots, memory?)
    • traffic stats
    • metadata status: keyword searches, channel gossip community
@devos50 devos50 added this to the Backlog milestone Dec 10, 2019
@devos50
Contributor

devos50 commented Dec 10, 2019

Interesting visualisations! Somewhat related to #3508 (at least the TrustChain deployment monitoring).

e2e anonymous download is an excellent candidate for performance monitoring and should not take long to set up. I think @ichorid actually addressed this a while ago, but it has not been actively monitored since. In fact, making ourselves (more) aware of failing tests/validation experiments is becoming a necessity as the number of different tests that run at fixed time intervals grows.

I think we have to address this issue sooner rather than later. The problem is that if we do not, we will have a proliferation of different tools. Currently, we have the TrustChain explorer, Tribler user statistics, the error reporter, and all tests/monitors on Jenkins. There might be some opportunity to merge some of these tools, which would ease maintenance.

metadata status: keyword searches, channel gossip community

This might be a dangerous one to monitor and could violate one's privacy expectations of Tribler.

@synctext
Member Author

Please look at FileCoin's slipped roadmap. After Release 7.5, I'm considering that we work together on the first Jenkins dashboard for 2 weeks:

  • arrange hardware monitors with obscene awesomeness, due to size (@synctext)
  • Anonymous end-to-end download performance (latest devel, latest stable version) (@egbertbouman)
  • crash reports from the wild (24h, last week, last month) latest devel, latest stable version and all versions (@ichorid)
  • Application tester with random clicker number of faults (24h, last week, last month) latest devel, latest stable version (@devos50)
  • IPv8 traffic stats with total of unique number of public keys in last (24h, last week, last month) within discovery community (heard about only, responsive) (@qstokkink)
  • Trustchain explorer, growth of blocks (last minute; last hour; last day; last week; all time) (@grimadas)
    [image]

@qstokkink
Contributor

qstokkink commented Mar 31, 2020

Can we decide on some software/library to use (or to make) to graph all of this data? All sorts of dashboard creation tools exist.

For example: https://dzone.com/articles/build-beautiful-console-dashboards-with-sampler

@devos50
Contributor

devos50 commented Mar 31, 2020

Most of this data can be extracted either from our existing Jenkins jobs using the API, or from our running TrustChain explorer backend, also via API requests. One of the questions we should also answer is whether we want a dedicated website for this. Jenkins unfortunately does not provide the tools for such real-time data, and integrating this dashboard into Jenkins would just mean a new job with a succeed/fail status.
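Extracting build status from Jenkins via its JSON API could look like the sketch below. This is a minimal, hypothetical example: `lastCompletedBuild/api/json` is Jenkins's standard remote-access endpoint, but the job name and server URL would need to match our actual setup (e.g. `jenkins-ci.tribler.org` and one of its jobs).

```python
import json
from urllib.request import urlopen

def parse_build(build: dict) -> dict:
    """Extract the fields a dashboard widget needs from Jenkins build JSON."""
    return {
        "result": build.get("result"),                  # "SUCCESS", "FAILURE", ...
        "duration_s": build.get("duration", 0) / 1000,  # Jenkins reports milliseconds
        "timestamp_ms": build.get("timestamp"),
    }

def last_build_status(base_url: str, job: str) -> dict:
    """Fetch <base_url>/job/<job>/lastCompletedBuild/api/json and summarise it."""
    url = f"{base_url}/job/{job}/lastCompletedBuild/api/json"
    with urlopen(url) as resp:
        return parse_build(json.load(resp))
```

A dashboard backend could then call, for example, `last_build_status("https://jenkins-ci.tribler.org", "Test_BootstrapServers")` periodically and render the result.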

arrange hardware monitors with obscene awesomeness, due to size

We should secure a prominent spot at the coffee machine ☕️

@qstokkink
Contributor

I propose starting with something "easy": exposing GitHub events through tribler.org:

  • Add a webhook to GitHub for the Tribler repository (sending POST requests to the tribler.org domain).
  • Add a new page (tribler.org/githubevents?) which renders all GitHub events (possibly with websockets for live updates).

The idea is that we can reuse the resulting backend for another (bigger and better) dashboard, and we'll have something to look at in the meantime.
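A minimal sketch of such a webhook receiver, using only the standard library. The secret, port, and in-memory event store are placeholders; GitHub really does sign webhook payloads with an `X-Hub-Signature-256` HMAC header, which the handler verifies before accepting an event.

```python
import hashlib
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SECRET = b"webhook-secret"  # placeholder; the real secret is configured in the GitHub webhook UI
EVENTS = []                 # in-memory store; a real backend would persist and stream these

def valid_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header (HMAC-SHA256 of the raw body)."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if not valid_signature(SECRET, body, self.headers.get("X-Hub-Signature-256", "")):
            self.send_response(403)
            self.end_headers()
            return
        EVENTS.append({"type": self.headers.get("X-GitHub-Event"),
                       "payload": json.loads(body)})
        self.send_response(204)
        self.end_headers()

# To run the receiver (blocks until interrupted):
# HTTPServer(("", 8080), WebhookHandler).serve_forever()
```

The events page would then only need to read from the same store that `EVENTS` stands in for here.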

@devos50
Contributor

devos50 commented Jul 15, 2020

One way to get more insight into our user count is by analysing the crawled TrustChain data. The plot below is generated from our current dataset, with over 80,000 users and 123 million records. The (major) releases of Tribler are annotated. Note how our 7.5.0 release resulted in an increase in new user count.

[image: identities_per_day]

Parsing this 97GB database, however, is computationally intensive, so it could, for example, be done on a daily basis. A dashboard could include this static image.

@synctext
Member Author

In 2006-2009 we had initial deployment monitoring, included in Zeilemaker's master thesis.
[image]

@xoriole
Contributor

xoriole commented Sep 4, 2020

Based on data we already have:
[screenshot: 2020-09-04]

@kozlovsky
Contributor

Yesterday I did a little research on this topic, and now I want to suggest a way to show anonymized performance statistics. It could be the following set of technologies:

  1. Custom client-side code to prepare anonymized statistics
  2. Dedicated server with custom API as an entry point
  3. InfluxDB for storing anonymized data
  4. Grafana for displaying beautiful graphs

The most popular tool for gathering and processing metrics is Prometheus. It has a big community and is widely used for gathering server metrics. Prometheus is often compared with InfluxDB (see the comparison in the official Prometheus docs). While Prometheus is more popular, in my opinion InfluxDB is better suited to our needs, for the following reasons:

  1. Prometheus pulls metrics from a known number of server instances. In our case, we cannot pull statistics from client machines and want to push instead. While it is possible to use Prometheus with additional tools like the Prometheus Aggregation Gateway, this in some way goes against the Prometheus philosophy. On the other hand, InfluxDB expects data to be pushed, which better suits our needs.

  2. Prometheus data storage is ephemeral and not intended for long-term retention. InfluxDB data are persistent and can be used to compare changes in gathered statistics over long time intervals.

Grafana is a very popular open-source tool for graph visualization, which can be used with Prometheus, InfluxDB, and multiple other data sources. It allows constructing powerful dashboards with different types of graphs and charts.

[image]

If we decide to use this set of tools, I think I can take on this task. I see the following sub-tasks here to be implemented:

  1. client-side code for preparing anonymized statistics
  2. client-side code to send the gathered statistics to our dedicated server
  3. a custom server API to collect anonymized statistics
  4. server code which implements the API mentioned above, aggregates the collected data, and puts it into an InfluxDB instance
  5. deploy a dedicated server with the statistics-gathering API, and deploy an InfluxDB instance (probably on a different machine)
  6. deploy Grafana instance
  7. make Grafana dashboard

Later we can use Grafana to display all graphs, not only user statistics but also server builds, etc.
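The client side of sub-tasks 1 and 2 could be sketched as follows. Everything here is an assumption: the endpoint URL, the payload fields, and the `install_id` scheme are placeholders for whatever the dedicated statistics server would actually define.

```python
import json
import uuid
from urllib.request import Request, urlopen

# Placeholder endpoint; the real path and payload schema would be defined
# by the dedicated statistics server (sub-tasks 3 and 4).
STATS_URL = "https://example.org/api/v1/stats"

def prepare_stats(version: str, uptime_s: int, download_count: int) -> dict:
    """Sub-task 1: build an anonymized snapshot with no identifying data."""
    return {
        "install_id": str(uuid.uuid4()),  # random token; a real client would persist
                                          # one per install rather than per report
        "version": version,
        "uptime_s": uptime_s,
        "downloads": download_count,      # counts only, no infohashes or peer addresses
    }

def push_stats(stats: dict, url: str = STATS_URL) -> int:
    """Sub-task 2: POST the snapshot to the collection server."""
    req = Request(url, data=json.dumps(stats).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.status
```

The key design choice is that the client only ever sends aggregate counters tied to a random token, never identities or network addresses, so the server-side InfluxDB only sees anonymized time series.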

What do you think?

@synctext
Member Author

synctext commented Sep 11, 2020

  • Health monitoring of client state
  • Health monitoring of our website, Github, statistics servers
  • Health monitoring of bootstrap servers and crawler servers

Pitfall: everything we want within our self-organising research project is easier to do with a central server... Primarily use our crawlers as early-warning infrastructure! (IPv8 is designed for network health monitoring.) Then we need to emphasise crawler intelligence and stats aggregation.

Are we not re-creating this from scratch? https://jenkins-ci.tribler.org/job/Test_BootstrapServers/lastSuccessfulBuild/artifact/walk_rtts.png

First, anonymity is our existential feature. How do we do this? (True anonymity might be impossible; OFF switch by default.)
We could optionally show the user, inside the debug panel, the exact history and records which will be shared in private with our debug servers. Can we protect against Internet address leakage? Many steps in the future, I guess, to re-use our Tor-like stuff while debugging our Tor-like stuff :-)

This needs to be opt-in for production releases and can hopefully be opt-out for nightly builds and beta versions. What about release candidates?

InfluxDB: 34,082 commits, 19.5k stars on GitHub. This is a general time-series database solution; we would still need to write custom code for deployment monitoring?

This seems like quite complex tooling. I'm afraid of overengineering for the user community we currently have. However, deployment monitoring is something we really need to do more of, and get right.

@xoriole
Contributor

xoriole commented Sep 11, 2020

InfluxDB and Grafana are indeed good choices.

1. Custom client-side code to prepare anonymized statistics
2. Dedicated server with custom API as an entry point
3. InfluxDB for storing anonymized data
4. Grafana for displaying beautiful graphs

I have done some work on 1 and 2. I'm extending https://release.tribler.org/docs to receive anonymized data from the client. That can be the entry point for further processing using InfluxDB and visualization in Grafana.

@kozlovsky
Contributor

We probably can use InfluxDB Jenkins plugin to put deployment statistics into the InfluxDB:
https://wiki.jenkins.io/display/JENKINS//InfluxDB+Plugin

@synctext
Member Author

synctext commented Sep 11, 2020

Change of plans :-)
By 25 September we aim to have plots in Jenkins. The PopularityCommunity is crawled and health statistics are refreshed every few minutes or half an hour. After this test project we determine what we need and a roadmap. The next step could be a fix of the PopularityCommunity code plus algorithm, then deploy, monitor, etc.

Our current methodology:

  • undocumented algorithm
  • exclusively rely on unit tests
  • manually test end-to-end whether the desired feature works
  • no health monitoring of protocol deployment

Tribler is a bottomless pit of problems. (stolen quote)
Our work methodology should become relentlessly data-driven: there is direct evidence that we need better crawling, and no evidence of client monitoring beyond the debug screen and crash reporting (might change; agile).

@devos50
Contributor

devos50 commented Sep 16, 2020

exclusively rely on unit tests

I think a key metric is the stability of our unit tests. Currently, unstable unit tests (on both devel and our release branches) are delaying the development process. Converting the test suite to pytest, which should make debugging errors in the tests easier, is much more work than I anticipated.

My suggestion would be to continuously run all unit tests on a dedicated machine and include in the upcoming dashboard how stable they are (e.g., % of runs failing during the last day).
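The suggested stability metric could be computed with something like the sketch below. The `(timestamp, passed)` run log is a hypothetical format; in practice it would come from whatever the dedicated machine records per test-suite run.

```python
from datetime import datetime, timedelta

def failure_rate(runs, window=timedelta(days=1), now=None):
    """Fraction of test-suite runs failing within the given window.

    `runs` is a list of (timestamp, passed) tuples, e.g. as logged by a
    dedicated machine looping over the test suite (hypothetical format).
    """
    now = now or datetime.utcnow()
    recent = [passed for ts, passed in runs if now - ts <= window]
    if not recent:
        return 0.0
    # `passed` booleans sum to the number of successes.
    return 1 - sum(recent) / len(recent)
```

The dashboard would then show, say, `failure_rate(runs)` as "% of runs failing during the last day" per branch.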

@synctext
Member Author

Related work: https://stats.goerli.net/
[image]

@synctext
Member Author

Impressive progress! Our .yml files and servers are getting into much better shape. We can even see the upgrade speed in real time. Learned something new: users upgrade quite fast. In previous years we never had this.
[screenshot: 2021-01-12]

@synctext
Member Author

Yeah! More pretty graphs. Exit node peak: 121 GiB per second.
[image]

@one-two-my-gad

cool

@synctext
Member Author

Example: https://data.syncthing.net/
File sync with central discovery servers and no spam measures. Great deployment monitoring!

@synctext
Member Author

@kozlovsky Could you please duplicate these specific https://data.syncthing.net/ graphs and wrap up the Grafana work?
This is quite a useful and simple graph to have.
"Users Joining and Leaving per Day": this is the total number of unique users joining and leaving per day. A user is counted as "joined" on the first day their unique ID is seen, and as "left" on the last day the unique ID was seen before an absence of two weeks or longer. "Bounced" refers to users who joined and left on the same day.
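The joined/left/bounced definition quoted above can be sketched directly from crawler sightings. The input format (a mapping of unique IDs to sorted lists of dates seen) is an assumption about what our crawler could export.

```python
from collections import Counter
from datetime import date, timedelta

ABSENCE = timedelta(days=14)  # the "two weeks or longer" threshold from the definition

def daily_churn(sightings):
    """Compute joined/left/bounced counts per day.

    `sightings` maps a unique user ID to a sorted list of dates on which
    the ID was seen (hypothetical crawler output).
    """
    joined, left, bounced = Counter(), Counter(), Counter()
    for uid, days in sightings.items():
        # Split each ID's history into sessions separated by >= two weeks.
        sessions, start, prev = [], days[0], days[0]
        for d in days[1:]:
            if d - prev >= ABSENCE:
                sessions.append((start, prev))
                start = d
            prev = d
        sessions.append((start, prev))
        for s, e in sessions:
            joined[s] += 1   # first day of the session
            left[e] += 1     # last day seen before the absence
            if s == e:
                bounced[s] += 1  # joined and left on the same day
    return joined, left, bounced
```

Plotting the three counters per day would reproduce the syncthing-style graph.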

@synctext
Member Author

synctext commented Oct 5, 2021

To better organise ourselves we need more critical information in 1 place.

Mature network alerts and deployment monitoring. The mission is to put everything in one place. The big danger is to put everything together only partially, and actually create the (n+1)-th place, called Grafana, where data is fragmented. Full user experience pipeline:

  1. Keyword search performance for Tribler (locked somewhat into Google)
  2. Website visits to Tribler.org (Github hosting statistics export)
  3. Crawling of our .exe download stats from https://githubdownloads.com/?username=tribler&repository=tribler Tribler-7.10.0.dmg (60.20 MiB) - downloaded 3,241 times. Last updated on 2021-07-14
  4. How many active users are our network health crawlers chatting to?
  5. How many incoming introduction-requests are our network health crawlers getting? (27 Sep, 6AM event)
    [image]
  6. What are the various exit nodes self-reporting in total traffic?
  7. How many daily users are using our various initial bootstrap nodes?

A single page with graphs for the health of each step in our user journey would help to identify faults. We learned a lot from our recent "unknown user drop" incident. Like:
[image]
It took the team 5 days to figure out we had a suspicious memory dip at 06:00 daily.
[image]

@synctext
Member Author

synctext commented Sep 6, 2022

When we have hired more developers we can revisit this issue. We need to focus on putting everything inside the application tester and existing code. Example: the IPFS people on DHT health.

@synctext
Member Author

synctext commented Oct 20, 2022

The IPFS people have a nice uptime-monitoring script (DHT level only):
[image]

Epic 2015 ticket on monitoring with Niels' statistics. User community insight using an improved crawler:
[image]

@synctext
Member Author

synctext commented Oct 27, 2022

We take screenshots; it takes a few clicks to find them (application tester on Jenkins).
[image]
Plus smooth GitHub Actions: https://github.com/Tribler/tribler/actions/runs/3330189428/jobs/5508351618
[screenshot: 2022-10-27]

@synctext
Member Author

synctext commented Jul 10, 2023

Complex monitoring. Numerous statistics systems, all connected together, and almost all down now 😿

The network was not functioning optimally these days. The Tor-like network was running out of capacity. The root cause of the failure was a memory leak which went unnoticed. Grafana did not alert. No Slack alarm post. The testers did not alert. InfluxDB is not recording anymore. The Prometheus-Grafana data feed is down. The dream of a single dashboard with health status should have caught this. Another system was brought live in a few hours:

This duplicates Jenkins monitoring: https://jenkins-ci.tribler.org/job/Test_BootstrapServers/lastSuccessfulBuild/artifact/summary.png
We lack a single vision and a minimal-maintenance platform for alerts. ToDo after the big release.

@drew2a
Contributor

drew2a commented Jul 10, 2023

This is yet another indication that choosing Grafana+Prometheus may not have been the best decision for our "new" dashboard. We already have ample sources of information, so adding another unique source doesn't seem optimal. What we really need is a single place that integrates all existing information.

From my perspective, here's what we should do (with a rough time estimation):

  1. Select a tool capable of integrating data from all existing information sources (1w).
  2. Install the tool (1d).
  3. Ensure the selected tool can analyze the entirety of this information and present it using a simple traffic-light-style indicator: 🟢 🟡 🔴 (3d).
  4. Consolidate all these sources into a single dashboard (1m-2m).
  5. Make this tool easily accessible for all developers:
    1. Provide easy access to a web page (1d).
    2. Dispatch daily summary notifications (1d).
    3. Send out alerts concerning critical incidents (1d).
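Step 3's traffic-light roll-up could be as simple as a worst-case rule over per-source statuses. The source names and statuses here are placeholders; each source (Jenkins, Sentry, Grafana, ...) would need its own adapter that maps its health to one of the three lights.

```python
GREEN, YELLOW, RED = "🟢", "🟡", "🔴"

def aggregate(statuses):
    """Roll per-source statuses up into one traffic-light indicator.

    `statuses` maps a source name (e.g. "jenkins", "sentry") to its own
    light; any red source makes the dashboard red, any yellow makes it
    yellow, otherwise green (a simple worst-case rule).
    """
    lights = set(statuses.values())
    if RED in lights:
        return RED
    if YELLOW in lights:
        return YELLOW
    return GREEN
```

The same function can drive both the web page and the daily/critical notifications of step 5.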

Our information sources:

  1. Jenkins (Experiments, Application Tester, Release Builds)
  2. Download statistics: https://release.tribler.org/dashboard/
  3. Grafana: https://dashboard.tribler.org
  4. Infrastructure monitoring tools
  5. Sentry
  6. Metabase (crawlers)

(did I miss something?)

@kozlovsky
Contributor

I think that of all the services we use for dashboards and monitoring (Prometheus, InfluxDB, Grafana), Prometheus is the most reliable (and can display monitoring graphs without Grafana), while the most problematic has been InfluxDB; most dashboard outages were caused by it.

It may be worth spending time to set up Prometheus alerts, as it should cover most of the current problems.
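A sketch of what such a Prometheus alerting rule could look like. The group name, threshold, and `job` label are hypothetical; `process_resident_memory_bytes` is the standard process metric exported by Prometheus client libraries, but the exit nodes would have to actually export it.

```yaml
# Hypothetical rule file; adjust metric names and thresholds to what
# the exit nodes really export.
groups:
  - name: tribler-exit-nodes
    rules:
      - alert: ExitNodeMemoryLeak
        expr: process_resident_memory_bytes{job="exit-node"} > 4e9
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Exit node {{ $labels.instance }} has used > 4 GB RSS for 30 minutes"
```

A rule like this, routed through Alertmanager to Slack, would have flagged the memory leak described above long before the capacity ran out.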

For persistent time series data, the most convenient data storage may be TimescaleDB, which can replace InfluxDB and fix most problems.

But trying something simpler like Graphite is also possible.

@xoriole
Contributor

xoriole commented Jan 25, 2024

Since Grafana is currently used for deployment monitoring and as far as I understand there is no immediate priority to work on an alternative, I'm unassigning myself from this ticket.

@xoriole xoriole removed their assignment Jan 25, 2024
@qstokkink qstokkink removed this from the Backlog milestone Aug 23, 2024
@qstokkink
Contributor

Indeed, we have a solution in place. This issue is, at the very least for now, resolved. If there are specific alternatives that we want to explore in the future, another issue can be opened.
