Spike: Reconsider the way we organize fleets #385

Closed
marvinmarnold opened this issue Mar 2, 2022 · 11 comments

@marvinmarnold
Contributor

marvinmarnold commented Mar 2, 2022

Balena fleets are organized as helium-HARDWARE-FREQ. This increases operational complexity: build processes become more elaborate, and there are more fleets to browse when trying to locate a miner. As we start adding support for other manufacturers, the number of fleets risks ballooning if we keep this pattern.

At the same time, there are good reasons for keeping things the way they are:

  • Unable to mount devices that don't exist (e.g. /dev/i2c-1): This should be tested, Balena may have fixed this.
  • Balena does not perform well with huge fleets: Pages can take a long time to load, and releases are more likely to have issues.
  • Reduce blast radius: We can easily feature flag using fleet environment variables, and the impact is smaller even if an entire fleet gets accidentally deleted.
  • Unable to detect concentrator region at application level: @pritamghanghas recently suggested using geolocation to determine this. That would work in most cases, but may still be problematic in a manufacturing/RMA context.
  • Lo-fi reporting: Fleet size and high-level health are easy to see from the Balena dashboard. This is a weak benefit because we should have a more robust analytics framework for getting the same information.

Related to:

Acceptance criteria

  • Make a proposal and get consensus among the team
  • Create follow up tickets for work to be done
@marvinmarnold
Contributor Author

@pritamghanghas @MuratUrsavas I pulled this into the sprint FYI

@MuratUrsavas
Contributor

Chose 3 as the estimate. This spike includes a lot of manual tests.

@MuratUrsavas
Contributor

Test Nr. 1
Subject:
Trying to run the firmware with dummy (non-existent) devices listed in the device-compose.yml file.

Result:
Failed. Balena still has problems with non-existent devices. A dummy device, /dev/i2c-5, made the hm-miner and hm-config containers stall. Removing it immediately solved the problem.
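
To see which declared devices are actually missing on a given unit, a small check like the one below can help. This is only a sketch: `missing_devices` is a hypothetical helper, not part of the firmware, and it assumes it runs somewhere the host's /dev is visible.

```python
import os
from typing import Callable, Iterable, List


def missing_devices(
    declared: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the declared device paths that are not present on this host."""
    return [path for path in declared if not exists(path)]


if __name__ == "__main__":
    # Hypothetical list; /dev/i2c-5 is the dummy device used in this test.
    declared = ["/dev/i2c-1", "/dev/i2c-5"]
    for path in missing_devices(declared):
        print(f"declared but not present: {path}")
```

The `exists` parameter is only there so the check can be exercised without real hardware.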

@MuratUrsavas
Contributor

Test Nr. 2
Subject:
Removing FREQ parameter from fleet and device environment variables

Result:
Passed. The device resumed normal operation without problems after the env parameter was removed and all services were restarted. Note that this was an onboarded device. The result would be different with a non-onboarded device, but we don't expect it to work in that state anyway.

@MuratUrsavas
Contributor

Test Nr. 3
Subject:
Removing VARIANT parameter from fleet and device environment variables.

Result:
Failed. The gateway-config and packet-forwarder containers depend on this variant definition. We have to keep the variant at least at the device level.

@MuratUrsavas
Contributor

MuratUrsavas commented Mar 10, 2022

There is something we need to fix firmware-wide. We call the device type "VARIANT", but in other places we use the same term to differentiate indoor and outdoor types. We should rename the all-caps "VARIANT" to something more descriptive, like "BASE_TYPE" or "DEVICE_TYPE".
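
If the rename goes ahead, a transition shim could read the new name first and fall back to the old one. This is a sketch only: it assumes "DEVICE_TYPE" is the name chosen, and `get_device_type` is a hypothetical helper that does not exist in the firmware today.

```python
import os
from typing import Mapping, Optional


def get_device_type(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the proposed DEVICE_TYPE, fall back to the legacy VARIANT."""
    # An empty string is treated the same as an unset variable.
    return env.get("DEVICE_TYPE") or env.get("VARIANT") or None


if __name__ == "__main__":
    device_type = get_device_type()
    if device_type is None:
        print("neither DEVICE_TYPE nor VARIANT is set")
    else:
        print(f"device type: {device_type}")
```

Containers that still set only "VARIANT" would keep working during the migration.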

I'm not sure whether we need to separate indoor and outdoor at the fleet level. To me, if the device is different, then it's a device type difference. Beyond that, this information should be kept in manufacturing variant records, not in the fleets.

I don't have a way to test Balena with large fleets, but I'm pretty sure it could not handle them properly. Apart from our test fleets, the existing fleets are already a pain to load.

From my point of view, taking the tests into consideration, we should have fleets like this:

  • Helium-Nebra
  • Helium-RockPi
  • Helium-Rak
  • Helium-Pisces

and so on. Because of the non-existent device issue and the VARIANT definition requirement, we need separate fleets for different device types. But we don't need to split them by indoor/outdoor or frequency.
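
To make the difference in fleet count concrete, here is a rough sketch comparing the two naming schemes. The function names are hypothetical, and the frequency values are placeholders, not our real list.

```python
from itertools import product
from typing import List, Sequence


def current_fleets(hardware: Sequence[str], freqs: Sequence[str]) -> List[str]:
    """helium-HARDWARE-FREQ: one fleet per hardware/frequency pair."""
    return [f"helium-{hw}-{freq}" for hw, freq in product(hardware, freqs)]


def proposed_fleets(device_types: Sequence[str]) -> List[str]:
    """Helium-DEVICETYPE: one fleet per device type."""
    return [f"Helium-{dt}" for dt in device_types]


if __name__ == "__main__":
    device_types = ["Nebra", "RockPi", "Rak", "Pisces"]
    freqs = ["868", "915"]  # placeholder frequency values
    print(len(current_fleets(device_types, freqs)))  # 8
    print(len(proposed_fleets(device_types)))        # 4
```

The current scheme grows with the product of hardware and frequency, while the proposed one grows only with the number of device types.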

@MuratUrsavas
Contributor

A small point. We need to remove frequency from the diagnostics report (root) or get this info from elsewhere.

@MuratUrsavas
Contributor

So, what's the final verdict? Should we create a giant "helium" fleet or create device type specific fleets like I mentioned two comments above?

I'm still defending the latter. It wouldn't take much effort compared to the existing setup. Also, having just one big fleet makes me nervous 😅

@MuratUrsavas
Contributor

MuratUrsavas commented Mar 11, 2022

@marvinmarnold I ran a test to answer your question about Balena having problems with non-existent devices.

It comes down to privileged mode. A container in privileged mode can verify for itself that the device does not exist and move on. A container without it gets stuck, because it cannot check this against system resources.

We have three options:

  1. Elevate the privileges of the hm-miner container (confirmed this on one of my test devices)
  2. Open the /dev folder for access and define C rules for wildcard (not sure about this one, since Balena has diverged from the original Docker)
  3. Keep the existing style

@marvinmarnold
Contributor Author

@MuratUrsavas can this be closed now that you have created the follow-up work?

@MuratUrsavas
Contributor

Sure!
