Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add blog around operator history and scope #1328

Merged
merged 1 commit into from
Dec 6, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
223 changes: 223 additions & 0 deletions docs/blog/history-and-operator-reach.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,223 @@
---
author: "Hubert Stefański"
date: 2023-12-05
title: "Grafana-Operator - A small subproject that made it big"
linkTitle: "Grafana-Operator - A small subproject that made it big"
description: "A History of the grafana operator and its growth, and why it's so 'quiet' "
---

# Grafana-Operator - A small subproject that made it big

This blog post will describe the journey that the grafana-operator underwent, from a small subproject from within a Red
Hat product offering, to one silently being used by some of the biggest companies worldwide, as well as where it's going
to next!

## The Origin

**Disclaimer: Most of the information written in this section is from before my time being involved with the project,
written to the best of my knowledge.**

The Grafana-Operator was initially created as part of the monitoring stack used in Red Hat Managed Integration (RHMI),
or "Integr8ly" for its open-source name. Peter Braun created it back in 2019, giving it its start in open source.

Unsurprisingly, the feature development for the operator was driven mostly by the requirements derived from Integr8ly,
that is, simple management of dashboards and datasources (ironically saying "simple" is oversimplifying the work that
went into it). And as time would show, these bare few requirements were something that a multitude of other teams and
companies also had. This serves as a prime example of how a relatively small overhead in operator development can yield
massive benefits for its users, granted, we didn't know how big the eventual user base would be, at the time.

## Growth - How and Why?

Slow and steady, is the most suitable description of the growth of the operator over the past three years. Granted, this
assertion comes mostly from what we see in terms of stars/visitors and contributions to our git repository, that being
pretty linear over time. However, judging the true size of an open source project based on these criteria isn't entirely
the most accurate. Let me explain how and why that is, and why the project is in fact much bigger than it would appear
by just browsing our git repository.

### How did the operator grow?

I don't think there's a special recipe in how we grew the operator, both in terms of features and community.
I guess as long as you just make stuff that works, try not to break anything from version to version (Easier said than
done, right?), and respond to user questions and support, then that's all that it takes?
Our development story really isn't that complicated, most of it was prompted by user questions, feedback (complaints as
well) and generally that's what we did.

One of the key milestones was definitely V5, where we decided to completely re-write the operator using a new approach.
Which we've done in hopes of improving the developer experience, as previous versions suffered greatly from code creep,
pretty much every single controller was written in a different style, with different ways of handling the same cases,
but for different resources. And that was something which really blocked potential contributors from, well,
contributing. Even the core maintainer group would have to constantly refresh their memory on how and why something
worked the way it did.
Realising this, was definitely a first step in the right direction, at least setting the foundations for the eventual
"upstreamification".

## Why is the Grafana Operator so widely used, yet relatively small on GitHub?

As I've said in the introduction of this blogpost, the operator, in my opinion, is a "quiet giant". That could be due to
a number of reasons:
The following list is just observations, not complaints ;)

### 1. The git repository itself is largely meaningless when it comes to a vast majority of the users.

This is not to say that grafana-operator users don't care about the operator, but rather, they don't need to care, and
ironically, I see this as a pretty positive sign that the project is in good shape. This is mainly because most of the
users opt to install the operator through <insert the most popular installation method in whatever year you're reading
this>. Joking aside, users value ease of use, so it's no indictment against anyone for just wanting stuff to work, be it
they install it through OLM/Operator Hub/Helm/whatever else. Frankly, this is the way the vast majority of users will
interact with the operator, rather than installing from source through our git repository, there simply is no reason for
them to do so.

The convenience of OLM/OperatorHub and Helm means that that's where a potential user will first come across an operator.
A fitting comparison would be that you'll probably go to buy milk in the nearest shop, rather than find the farm from
which it originates, does that mean you like your milk more or less? You probably don't care!

For one, this makes it easier than ever to just get an operator out there, and have it gather users.

### 2. Variety of support channels

Generally, we try to point people to existing issues (which we try to gradually work through and close, time permitting)
when they inevitably arise. However, synchronous human interaction seems to be the way most people prefer to resolve
their issues nowadays, and that's also valid. Most of our issues tend to be reported through our k8s.io Slack based
channel, and close to 90% of those just tend to be general configuration and PEBKAC errors (yes, we can always do better
on the docs side! So really, it's our fault in the end).

The tendency to use Slack as the primary support channel definitely means less engagement on the repository itself,
however, yet again we can't blame a user for doing what's easiest for them! But it does mean that we have roughly twice
as many Slack channel members, than we do stars on the project.

There's another aspect to this, that is largely a virtue of how RH does business, Peter and I, often get contacted
through our corporate email from Red Hat Technical Account Managers and consultants, with regard to support queries from
their respective customers, and we also answer a great deal of cases on that end. The customers of whom are often high
profile. I'll expand on this later on in this blog post.

### 3. If it works, you just don't hear about it as often (as a developer)

Generally, people don't come to open source repositories to simply praise the project, although, there sometimes are
those kind few souls that do! By that nature, it's more likely that we'll have a user find us to ask for support/report
a bug or something else of that nature, rather than to simply leave a star on GitHub.

### 4. We don't really advertise the operator as much as some others might

The good old adage about a great project being only as good as how you can sell it holds true. The core maintainer group
isn't really that social, We've done a few presentations here and there, written a few blogposts, but it's unlikely
you'll find any of us, posting about the operator every day on LinkedIn etc. (Edvin generally posts on major
milestones!)
This is both likely a characteristic of our nature as software engineers, and the fact that this really is a side-gig
for most of the maintainer group.

Most of the maintainers on the project aren't really involved with Kubernetes/OpenShift on a daily basis anymore. For
example, it's been close to 3 years now, that working with OpenShift was one of my main responsibilities, while now,
outside the Grafana-Operator, this type of work shows up once every few weeks, at best.

### So, why is the Grafana-Operator a "quiet giant"?

This is mainly down to the fact that rarely any Grafana-Operator users will visit our git repository, as most will
install through OLM/OperatorHub, so in reality our best guess is down to how many container image downloads happen on a
daily/weekly/whatever basis (if you know how we could get these metrics from OLM, or others, please let us know!).

From our available metrics, V5 has been downloaded just over **3.1 million** (as of November 27th) times since **Jun 9th
2023** (the first available V5 branch release) giving an average of **18k daily downloads**! Which isn't an
insignificant
number!

As mentioned previously, we get quite a few private emails/messages, from users whom don't exactly want to advertise
that they're using a specific project. Some of who simply don't advertise regardless.
Within this "group" of users, we've got to know sizeable banks (National and International ones), Automotive
manufacturers, Stock Exchanges, cloud providers etc., the list goes on and on.
I won't divulge the specifics of these users, but our list of known, or assumed users (based on who forked and starred
our repository) is pretty large and varied.

Granted, a lot of the success is due to the popularity of Grafana itself, but the added benefit of the "
operationalization" of the management of Grafana instances is something that has significant value for users. We're well
aware of the fact that many of our users have enterprise contracts with Grafana, and we've also adapted the operator to
be able to manage resources on "external" Grafanas. Which now means, you can still have an enterprise contract with
Grafana, but allow your monitoring team to have a GitOps-based approach to defining your monitoring stack, without
having to manage the Grafana instance yourself (be it through the operator, or through a managed Grafana as a service).

If I were to express my personal thought on why users find the operator valuable, it would probably be exactly because
the operator just makes it easier to manage Grafana at scale, in a workflow that is familiar to many DevOps/platform
engineers.
As with all things, the management of software is a balance of compromises, in order to reap the benefit of a piece of
software, you have to accept some of the overhead that comes with it. And at the end of the day a user is most likely to
go with a solution that works well for their use case, and doesn't introduce a complex layer unnecessarily. I firmly
believe that the Grafana-Operator classifies itself as one of these projects, we add a small overhead (you have to use
our Custom Resource Definitions as a wrapper for your Grafana resources), but we make up for it in the overhead we
remove from the operations side.

The paragraph above massively simplifies the entire debate and decision process that goes into selecting a bit of
software, as it rarely is as clear-cut as "just choosing the easiest solution". However, I would still stand by the
sentiment, that the balance of compromises falls in favour of using the Grafana-Operator over self-managed instances.

## Moving upstream

It really isn't a secret that the operator is in a bit of a stagnation right now, which the maintainers are well aware
of. Our day-to-day jobs take a significant amount of time and energy, and most of what we are able to dedicate to the
operator is a 30-minute call once a week, and answering questions/issues on our GitHub on a best effort basis. However,
it is important to note that the move upstream is not a result of us wanting to offload the maintenance of the operator
to someone else. All maintainers currently active in the project are planning on staying around and continuing our
involvement.

Internally, within our small maintainer group, we had a few conversations around what could be the next major step for
the operator. We did bounce the idea of reaching out and maybe having the project be adopted by the CNCF or Grafana.
However, we never really made it a hard goal or made any steps towards either one of those options.
Luckily, upstream Grafana folks reached out to us first, (which was a very welcome surprise to us). Seeing the
initiative from upstream in adopting the operator and migrating it into the official Grafana Labs GitHub organization
was a great motivational booster. For numerous reasons (which I'll clarify below) this seemed like the most logical step
for the operator, so we fully embraced the idea.

### Why move?

I know, this question is self-explanatory, almost all open source projects would look forward to being somehow
incorporated or acknowledged to a somewhat "official" extent upstream.
Although, I think it's important to highlight the reasons why it is a logical choice.

As a general rule I believe all the points I'm about to make below are based on wanting the best for the operator and
its users. We believe that for the operator to continue to grow, be successful and continue to deliver value to users,
it must improve in community engagement (even though we have gone a long way, there's still a ways to go!)

Being a downstream project does have its downsides, as a software engineer I often share the same sentiments that I know
others have. That is, there's always an element of "carefulness" when proposing or integrating a new project which is
relatively unknown or hosted in a small repository. We know that feeling, it's not based on "distrust" per se, but it's
just being hesitant about a project that doesn't really have "apparent validity'.
Moving into an official Grafana GitHub organization would give the operator this "validity". Deep down, all this really
means is that users might be more convinced to use it based on the sheer fact that it's hosted in a known, public and
popular repository.

An extension to the point above, is that, it's hard to grow a community, when the community doesn't really know how to
find you, meaning that there's a slope of sorts, which without "backing" from an official organization is hard to
overcome.

Community growth is what keeps the operator alive, and we don't see a better way of facilitating that growth without
associating ourselves with an upstream. This comes with a range of possible benefits (whether those will come to
fruition, time will tell). By becoming part of the Grafana organization we can start to offer more to existing users, by
now being more involved and closer to the actual product on which we operate.
As the community grows, so will the feature set, as demand and the number of possible contributors creating these
features increases. This will also bring clearer goals into the operators' development path, the community will make its
expectations known, and the eventual development can be led by those expectations, rather than anticipation of user
needs. Which is mainly how development has been happening up until this point. "Mainly", but not "solely".

The move to upstream is prompted by both acknowledgement of accomplishment of what the operator has done up until this
point, after all it must have a real benefit if Grafana wants to "adopt" it. As well as the acknowledgement of current
stagnation, where the current maintainers cannot devote time and effort consistently to meet user needs, despite best
efforts and intentions.

All-in-all, the decision is driven purely by a desire to grow and improve, for the sake of the community and our users.

## Personal Reflection

My engagement with the operator began by trying to fix a bug in for my previous team and project, and by pure chance I
happened to spot a few areas that could be improved, so I began contributing those fixes. Back then, I couldn't have
imagined where this would land the operator, and how I'd at least put a few bricks to that.
The work albeit sometimes feeling unimportant, or insignificant in terms of the size of contribution, definitely has its
rewarding side, when a company/user reaches out and says "hey, thanks, we find it useful", or "thanks for fixing that".
I've had the chance to interact with a wide variety of people, from within and outside of Red Hat, everyday DevOps
engineers, tech leads, architects, consultants, TAMs , and CTOs of sizeable companies, all of whom at some point needed
help with some aspect of the operator.

I look forward to continuing this journey, and I am super thankful for all the other maintainers that make it all happen
day-to-day:
**Peter**, **Edvin** and **Igor**, and hopefully many more to come!

Extra reading: https://grafana-operator.github.io/grafana-operator/blog/2023/11/21/moving-upstream/ <- For more context

Feel free to reach out if you've got any questions or feedback!