add blog around operator history and scope
Showing 1 changed file with 223 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,223 @@ | ||
--- | ||
author: "Hubert Stefański" | ||
date: 2023-12-05 | ||
title: "Grafana-Operator - A small subproject that made it big" | ||
linkTitle: "Grafana-Operator - A small subproject that made it big" | ||
description: "A history of the Grafana Operator and its growth, and why it's so 'quiet'" | |
--- | ||
|
||
# Grafana-Operator - A small subproject that made it big | ||
|
||
This blog post describes the journey of the grafana-operator, from a small subproject within a Red Hat product | |
offering to one quietly used by some of the biggest companies worldwide, as well as where it's going next! | |
|
||
## The Origin | ||
|
||
**Disclaimer: Most of the information in this section predates my involvement with the project and is written to the | |
best of my knowledge.** | |
|
||
The Grafana-Operator was initially created as part of the monitoring stack used in Red Hat Managed Integration (RHMI), | ||
or "Integr8ly" for its open-source name. Peter Braun created it back in 2019, giving it its start in open source. | ||
|
||
Unsurprisingly, feature development for the operator was driven mostly by the requirements derived from Integr8ly, | |
that is, simple management of dashboards and datasources (ironically, saying "simple" oversimplifies the work that | |
went into it). As time would show, these few bare requirements were something that a multitude of other teams and | |
companies also had. This serves as a prime example of how a relatively small overhead in operator development can yield | |
massive benefits for its users. Granted, at the time we didn't know how big the eventual user base would be. | |
|
||
## Growth - How and Why? | ||
|
||
"Slow and steady" is the most suitable description of the operator's growth over the past three years. Granted, this | |
assertion comes mostly from what we see in terms of stars, visitors and contributions to our git repository, all of which | |
have been pretty linear over time. However, judging the true size of an open source project by these criteria isn't | |
entirely accurate. Let me explain how and why that is, and why the project is in fact much bigger than it would appear | |
from just browsing our git repository. | |
|
||
### How did the operator grow? | ||
|
||
I don't think there's a special recipe for how we grew the operator, in terms of either features or community. | |
I guess as long as you just make stuff that works, try not to break anything from version to version (easier said than | |
done, right?), and respond to user questions and support requests, that's all it takes. | |
Our development story really isn't that complicated: most of it was prompted by user questions and feedback (complaints | |
as well), and acting on those is generally what we did. | |
|
||
One of the key milestones was definitely V5, where we decided to completely rewrite the operator using a new approach. | |
We did this in the hope of improving the developer experience, as previous versions suffered greatly from code creep: | |
pretty much every single controller was written in a different style, with different ways of handling the same cases | |
for different resources. That was something which really blocked potential contributors from, well, | |
contributing. Even the core maintainer group would have to constantly refresh their memory on how and why something | |
worked the way it did. | |
Realising this was definitely a first step in the right direction, at least laying the foundations for the eventual | |
"upstreamification". | |
|
||
## Why is the Grafana Operator so widely used, yet relatively small on GitHub? | ||
|
||
As I've said in the introduction of this blog post, the operator is, in my opinion, a "quiet giant". That could be due to | |
a number of reasons. | |
The following list is just observations, not complaints ;) | |
|
||
### 1. The git repository itself is largely meaningless to the vast majority of users | |
|
||
This is not to say that grafana-operator users don't care about the operator, but rather that they don't need to care, and | |
ironically, I see this as a pretty positive sign that the project is in good shape. This is mainly because most of the | ||
users opt to install the operator through <insert the most popular installation method in whatever year you're reading | ||
this>. Joking aside, users value ease of use, so it's no indictment of anyone to just want stuff to work, whether | |
they install it through OLM, OperatorHub, Helm or whatever else. Frankly, this is the way the vast majority of users will | |
interact with the operator, rather than installing from source through our git repository; there simply is no reason for | |
them to do so. | |
|
||
The convenience of OLM/OperatorHub and Helm means that that's where a potential user will first come across an operator. | ||
A fitting comparison would be that you'll probably go to buy milk in the nearest shop rather than find the farm from | |
which it originates. Does that mean you like your milk more or less? You probably don't care! | |
|
||
For one, this makes it easier than ever to just get an operator out there, and have it gather users. | ||
|
||
### 2. Variety of support channels | ||
|
||
Generally, when problems inevitably arise, we try to point people to existing issues (which we try to gradually work | |
through and close, time permitting). However, synchronous human interaction seems to be the way most people prefer to | |
resolve their problems nowadays, and that's also valid. Most of our issues tend to be reported through our k8s.io Slack | |
channel, and close to 90% of those tend to be general configuration and PEBKAC errors (yes, we can always do better | |
on the docs side! So really, it's our fault in the end). | |
|
||
The tendency to use Slack as the primary support channel definitely means less engagement on the repository itself. | |
Yet again, though, we can't blame users for doing what's easiest for them! It does mean that we have roughly twice | |
as many Slack channel members as we do stars on the project. | |
|
||
There's another aspect to this, which is largely by virtue of how Red Hat does business: Peter and I often get contacted | |
through our corporate email by Red Hat Technical Account Managers and consultants with support queries from | |
their respective customers, and we also answer a great many cases on that end. Those customers are often high | |
profile. I'll expand on this later in this blog post. | |
|
||
### 3. If it works, you just don't hear about it as often (as a developer) | ||
|
||
Generally, people don't come to open source repositories simply to praise the project, although there sometimes are | |
a few kind souls who do! By that nature, it's more likely that a user will find us to ask for support or report | |
a bug than to simply leave a star on GitHub. | |
|
||
### 4. We don't really advertise the operator as much as some others might | ||
|
||
The good old adage about a great project being only as good as how you can sell it holds true. The core maintainer group | |
isn't really that social. We've done a few presentations here and there and written a few blog posts, but it's unlikely | |
you'll find any of us posting about the operator every day on LinkedIn etc. (Edvin generally posts on major | |
milestones!) | |
This is likely both a characteristic of our nature as software engineers and a result of the fact that this really is a | |
side gig for most of the maintainer group. | |
|
||
Most of the maintainers on the project aren't really involved with Kubernetes/OpenShift on a daily basis anymore. For | ||
example, it's been close to 3 years since working with OpenShift was one of my main responsibilities; now, | |
outside the Grafana-Operator, this type of work shows up once every few weeks at best. | |
|
||
### So, why is the Grafana-Operator a "quiet giant"? | ||
|
||
This is mainly down to the fact that hardly any Grafana-Operator users will ever visit our git repository, as most will | |
install through OLM/OperatorHub. In reality, our best estimate of usage comes down to how many container image downloads | |
happen on a daily/weekly/whatever basis (if you know how we could get these metrics from OLM, or others, please let us know!). | |
|
||
From our available metrics, V5 has been downloaded just over **3.1 million** times (as of November 27th) since | |
**Jun 9th 2023** (the first available V5 branch release), giving an average of **18k daily downloads**, which isn't an | |
insignificant number! | |
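
As a quick sanity check on that average, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: it assumes the "November 27th" snapshot refers to 2023 and rounds the total down to exactly 3.1 million, and the variable names are purely illustrative.

```python
from datetime import date

# Figures quoted in the post. The year of the "November 27th" snapshot
# is an assumption (2023, the year the post was written).
first_v5_release = date(2023, 6, 9)
snapshot = date(2023, 11, 27)
total_downloads = 3_100_000  # "just over 3.1 million", rounded down

# Number of days between the first V5 release and the snapshot.
days_elapsed = (snapshot - first_v5_release).days

# Simple mean: total downloads divided by elapsed days.
average_per_day = total_downloads / days_elapsed

print(days_elapsed)            # 171
print(round(average_per_day))  # 18129
```

Under those assumptions the period is 171 days, which gives roughly 18.1k downloads per day, in line with the 18k figure quoted above.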
|
||
As mentioned previously, we get quite a few private emails/messages from users who don't exactly want to advertise | |
that they're using a specific project, and from some who simply don't advertise regardless. | |
Within this "group" of users, we've got to know sizeable banks (national and international ones), automotive | |
manufacturers, stock exchanges, cloud providers etc.; the list goes on and on. | |
I won't divulge the specifics of these users, but our list of known or assumed users (based on who forked and starred | |
our repository) is pretty large and varied. | |
|
||
Granted, a lot of the success is due to the popularity of Grafana itself, but the added benefit of the | |
"operationalization" of the management of Grafana instances is something that has significant value for users. We're well | |
aware of the fact that many of our users have enterprise contracts with Grafana, and we've also adapted the operator to | |
be able to manage resources on "external" Grafanas. This means you can still have an enterprise contract with | |
Grafana, but allow your monitoring team to take a GitOps-based approach to defining your monitoring stack, without | |
having to manage the Grafana instance yourself (be it through the operator, or through a managed Grafana as a service). | |
|
||
If I were to express my personal thought on why users find the operator valuable, it would probably be exactly because | ||
the operator just makes it easier to manage Grafana at scale, in a workflow that is familiar to many DevOps/platform | ||
engineers. | ||
As with all things, the management of software is a balance of compromises: in order to reap the benefit of a piece of | |
software, you have to accept some of the overhead that comes with it. At the end of the day, a user is most likely to | |
go with a solution that works well for their use case and doesn't introduce a complex layer unnecessarily. I firmly | |
believe that the Grafana-Operator is one of these projects: we add a small overhead (you have to use | |
our Custom Resource Definitions as a wrapper for your Grafana resources), but we make up for it in the overhead we | |
remove from the operations side. | |
|
||
The paragraph above massively simplifies the entire debate and decision process that goes into selecting a bit of | ||
software, as it rarely is as clear-cut as "just choosing the easiest solution". However, I would still stand by the | ||
sentiment that the balance of compromises falls in favour of using the Grafana-Operator over self-managed instances. | |
|
||
## Moving upstream | ||
|
||
It really isn't a secret that the operator is stagnating a bit right now, which the maintainers are well aware | |
of. Our day-to-day jobs take a significant amount of time and energy, and most of what we are able to dedicate to the | |
operator is a 30-minute call once a week, plus answering questions/issues on our GitHub on a best-effort basis. However, | |
it is important to note that the move upstream is not a result of us wanting to offload the maintenance of the operator | ||
to someone else. All maintainers currently active in the project are planning on staying around and continuing our | ||
involvement. | ||
|
||
Internally, within our small maintainer group, we had a few conversations around what could be the next major step for | ||
the operator. We did bounce around the idea of reaching out and maybe having the project adopted by the CNCF or Grafana. | |
However, we never really made it a hard goal or took any steps towards either of those options. | |
Luckily, upstream Grafana folks reached out to us first (which was a very welcome surprise to us). Seeing the | |
initiative from upstream in adopting the operator and migrating it into the official Grafana Labs GitHub organization | |
was a great motivational boost. For numerous reasons (which I'll clarify below), this seemed like the most logical step | |
for the operator, so we fully embraced the idea. | ||
|
||
### Why move? | ||
|
||
I know, the answer to this question seems self-evident: almost all open source projects would look forward to being | |
incorporated or acknowledged to a somewhat "official" extent upstream. | |
Still, I think it's important to highlight the reasons why it is a logical choice. | |
|
||
As a general rule, I believe all the points I'm about to make below are based on wanting the best for the operator and | |
its users. We believe that for the operator to continue to grow, be successful and continue to deliver value to users, | |
it must improve in community engagement (even though we have come a long way, there's still a ways to go!). | |
|
||
Being a downstream project does have its downsides. As a software engineer, I often share the same sentiments that I know | |
others have: there's always an element of "carefulness" when proposing or integrating a new project which is | |
relatively unknown or hosted in a small repository. We know that feeling. It's not based on "distrust" per se; it's | |
just hesitancy about a project that doesn't really have "apparent validity". | |
Moving into an official Grafana GitHub organization would give the operator this "validity". Deep down, all this really | ||
means is that users might be more convinced to use it based on the sheer fact that it's hosted in a known, public and | ||
popular repository. | ||
|
||
An extension of the point above is that it's hard to grow a community when the community doesn't really know how to | |
find you, meaning that there's a slope of sorts which, without "backing" from an official organization, is hard to | |
overcome. | |
|
||
Community growth is what keeps the operator alive, and we don't see a better way of facilitating that growth than | |
associating ourselves with an upstream. This comes with a range of possible benefits (whether those will come to | |
fruition, time will tell). By becoming part of the Grafana organization, we can start to offer more to existing users by | |
being more involved with and closer to the actual product on which we operate. | |
As the community grows, so will the feature set, as demand and the number of possible contributors creating these | |
features increase. This will also bring clearer goals to the operator's development path: the community will make its | |
expectations known, and development can be led by those expectations rather than by anticipation of user | |
needs, which is mainly how development has been happening up until this point. "Mainly", but not "solely". | |
|
||
The move upstream is prompted both by acknowledgement of what the operator has accomplished up until this | |
point (after all, it must have a real benefit if Grafana wants to "adopt" it) and by acknowledgement of the current | |
stagnation, where the current maintainers cannot consistently devote time and effort to meeting user needs, despite best | |
efforts and intentions. | |
|
||
All in all, the decision is driven purely by a desire to grow and improve, for the sake of the community and our users. | |
|
||
## Personal Reflection | ||
|
||
My engagement with the operator began with trying to fix a bug for my previous team and project, and by pure chance I | |
happened to spot a few areas that could be improved, so I began contributing those fixes. Back then, I couldn't have | |
imagined where this would take the operator, or that I'd lay at least a few bricks along the way. | |
The work, albeit sometimes feeling unimportant or insignificant in terms of the size of contribution, definitely has its | |
rewarding side when a company/user reaches out and says "hey, thanks, we find it useful", or "thanks for fixing that". | |
I've had the chance to interact with a wide variety of people, from within and outside of Red Hat: everyday DevOps | |
engineers, tech leads, architects, consultants, TAMs, and CTOs of sizeable companies, all of whom at some point needed | |
help with some aspect of the operator. | |
|
||
I look forward to continuing this journey, and I am super thankful to all the other maintainers who make it all happen | |
day to day: | |
**Peter**, **Edvin** and **Igor**, and hopefully many more to come! | ||
|
||
Extra reading: https://grafana-operator.github.io/grafana-operator/blog/2023/11/21/moving-upstream/ <- For more context | ||
|
||
Feel free to reach out if you've got any questions or feedback! | ||