Spike: Reconsider the way we organize fleets #385

marvinmarnold · 2022-03-02T15:50:10Z

Balena fleets are organized as helium-HARDWARE-FREQ. This increases operational complexity: build processes become more elaborate, and there are more fleets to browse when trying to locate a miner. As we start adding support for other manufacturers, the number of fleets risks ballooning if we keep this pattern.

At the same time, there are good reasons for keeping things the way they are:

Unable to mount devices that don't exist (e.g. /dev/i2c-1): This should be tested; Balena may have fixed this.
Balena does not perform well with huge fleets: Can take a long time for pages to load, releases more likely to have issues.
Reduce blast radius: Easily able to feature flag using fleet environment variables, and reduced impact even if an entire fleet gets accidentally deleted.
Unable to detect concentrator region at application level: @pritamghanghas recently suggested using geolocation to determine this. That would work in most cases, but may still be problematic in a manufacturing/RMA context.
Lofi reporting: Easily able to see from the Balena dashboard fleet size and high level health. This is a weak benefit because we should have a more robust analytics framework for getting the same information.

Related to:

https://github.com/NebraLtd/helium-miner-software/pull/387/files#r820696341

Acceptance criteria

Make a proposal and get consensus among the team
Create follow up tickets for work to be done


marvinmarnold · 2022-03-07T16:22:24Z

@pritamghanghas @MuratUrsavas I pulled this into the sprint FYI

MuratUrsavas · 2022-03-09T08:54:14Z

Chose 3 as the estimate. This spike includes a lot of manual tests.

MuratUrsavas · 2022-03-10T10:47:03Z

Test Nr. 1
Subject:
Trying to run the firmware with dummy (foo) devices listed in the device-compose.yml file.

Result:
Failed. Balena still has problems with nonexistent devices. A dummy device entry, /dev/i2c-5, made the hm-miner and hm-config containers stall. Removing it immediately solved the problem.
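
One way to guard against this failure mode is a pre-flight check that compares the device nodes a compose file expects with what actually exists on the host. The sketch below is only an illustration: `missing_devices` is a hypothetical helper, not part of Balena or the Nebra firmware, and the device list is made up for the example.

```python
import os
from typing import Iterable, List


def missing_devices(device_paths: Iterable[str]) -> List[str]:
    """Return the device paths that do not exist on this host."""
    return [path for path in device_paths if not os.path.exists(path)]


if __name__ == "__main__":
    # Hypothetical device list; /dev/i2c-5 is the dummy entry used in the test.
    wanted = ["/dev/i2c-1", "/dev/i2c-5"]
    for path in missing_devices(wanted):
        print(f"device not present: {path}")
```

A check like this only reports the mismatch. It does not change how Balena itself reacts to a missing device.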

MuratUrsavas · 2022-03-10T10:50:36Z

Test Nr. 2
Subject:
Removing FREQ parameter from fleet and device environment variables

Result:
Passed. The device resumed normal operation without a problem after the env parameter was removed and all services were restarted. Note that this was an onboarded device. The result would be different with a non-onboarded device, but we don't expect it to work in that state anyway.

MuratUrsavas · 2022-03-10T10:56:50Z

Test Nr. 3
Subject:
Removing VARIANT parameter from fleet and device environment variables.

Result:
Failed. The gateway-config and packet-forwarder containers depend on this variant definition. We have to keep the variant at least at the device level.
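
Taken together, Tests 2 and 3 suggest that FREQ can be treated as optional while VARIANT must be present. A minimal sketch of that rule follows. `load_settings` and `MissingVariantError` are hypothetical names, and the real containers may read these variables differently.

```python
from typing import Mapping, Optional, Tuple


class MissingVariantError(RuntimeError):
    """Raised when the VARIANT variable is absent or empty."""


def load_settings(env: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Return (variant, freq); VARIANT is required, FREQ is optional."""
    variant = env.get("VARIANT", "").strip()
    if not variant:
        raise MissingVariantError("VARIANT must be set at fleet or device level")
    # An empty or missing FREQ is normalized to None.
    freq = env.get("FREQ", "").strip() or None
    return variant, freq


if __name__ == "__main__":
    # "example-variant" is a placeholder, not a real variant name.
    print(load_settings({"VARIANT": "example-variant"}))
```

Failing early on a missing VARIANT makes the dependency explicit instead of letting a container stall later.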

MuratUrsavas · 2022-03-10T11:14:09Z

There is something we need to fix firmware-wide. We call the device type "VARIANT", but in other places we use this term to differentiate indoor and outdoor types. We should rename the all-caps "VARIANT" to something more descriptive, like "BASE_TYPE" or "DEVICE_TYPE".

I'm not sure whether we need to separate indoor and outdoor at the fleet level. To me, if the device is different, then it's a device type difference. Otherwise, this information should be kept in manufacturing variant records, not in the fleets.

I don't have a way to test Balena with large fleets, but I'm pretty sure it could not handle them properly. The existing fleets, other than our test fleets, are still a pain to load.

From my point of view, taking the tests into consideration, we should have fleets like this:

Helium-Nebra
Helium-RockPi
Helium-Rak
Helium-Pisces

and so on. Due to the nonexistent-device issue and the VARIANT definition requirement, we need separate fleets for different device types. But we don't need any indoor/outdoor or frequency differentiation within them.
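
As a rough illustration of how dropping the frequency segment shrinks the fleet list, the sketch below groups names of the form helium-HARDWARE-FREQ under helium-HARDWARE. It assumes every name follows that pattern exactly. The fleet names and the `collapse_fleets` helper are made up for the example and do not reflect the real fleet list, and the sketch does not cover the indoor/outdoor merge.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def collapse_fleets(fleet_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group helium-HARDWARE-FREQ fleet names under helium-HARDWARE."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name in fleet_names:
        # Split off only the last segment so hardware names may contain hyphens.
        base, _, _freq = name.rpartition("-")
        grouped[base].append(name)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical fleet names.
    old = ["helium-hwa-868", "helium-hwa-915", "helium-hwb-868"]
    for new_name, members in collapse_fleets(old).items():
        print(new_name, "<-", members)
```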

MuratUrsavas · 2022-03-10T12:38:17Z

A small point: we need to remove frequency from the diagnostics report (root) or get this info from elsewhere.

MuratUrsavas · 2022-03-11T06:18:46Z

So, what's the final verdict? Should we create a giant "helium" fleet or create device type specific fleets like I mentioned two comments above?

I'm still defending the latter. It wouldn't need too much effort compared to the existing style. Also, having just one big fleet makes me nervous 😅

MuratUrsavas · 2022-03-11T06:31:51Z

@marvinmarnold I ran a test to answer your question about Balena having problems with nonexistent devices.

It's due to privileged mode. If the container is in privileged mode, it can confirm for itself that the device does not exist and move on. If not, it gets stuck because it cannot verify this by accessing system resources.

We have three options:

Elevate the privileges of hm-miner container (Confirmed this on one of my test devices)
Open /dev folder for access and define C rules for wildcard (Not sure about this one, since Balena has diverged from original Docker)
Keep the existing style

marvinmarnold · 2022-03-17T13:07:03Z

@MuratUrsavas can this be closed now that you have created follow-up work?

MuratUrsavas · 2022-03-17T13:21:57Z

Sure!

shawaj mentioned this issue Mar 3, 2022

Test docker-compose.yml with multiple interfaces exposed #386

Open

MuratUrsavas mentioned this issue Mar 9, 2022

Introduced how_to_add_new_hotspot.md guide #384

Merged

4 tasks

MuratUrsavas self-assigned this Mar 9, 2022

This was referenced Mar 11, 2022

Implement fleet naming style change in helium-miner-software #395

Closed

Implement fleet naming style change in hm-diag NebraLtd/hm-diag#315

Open

MuratUrsavas closed this as completed Mar 17, 2022
