Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with JSON decode on vgs combined stream ouput #247

Closed
kro-cat opened this issue Aug 15, 2023 · 10 comments · Fixed by #250 or #261
Closed

Issue with JSON decode on vgs combined stream ouput #247

kro-cat opened this issue Aug 15, 2023 · 10 comments · Fixed by #250 or #261

Comments

@kro-cat
Copy link
Contributor

kro-cat commented Aug 15, 2023

I recently ran into a bug which is causing the lvm driver to crash unexpectedly on my systems.

I've traced it to pkg/lvm/lvm_util.go line 666, in ListLVMVolumeGroup()

STDERR and STDOUT are combined via cmd.CombinedOutput(). The vgs program may produce non-critical warning messages on STDERR which don't affect the operation of this driver. Warning messages also don't affect the return code of the vgs program. Any warning messages printed to STDERR will fail the JSON decode step in decodeVgsJSON().

@kro-cat
Copy link
Contributor Author

kro-cat commented Aug 15, 2023

Quick and dirty implementation I did to mitigate this temporarily.

diff --git a/pkg/lvm/lvm_util.go b/pkg/lvm/lvm_util.go
index 0bb4877..e9b52ed 100644
--- a/pkg/lvm/lvm_util.go
+++ b/pkg/lvm/lvm_util.go
@@ -17,6 +17,7 @@ limitations under the License.
 package lvm

 import (
+       "bytes"
        "encoding/json"
        "fmt"
        "os"
@@ -663,11 +664,16 @@ func ListLVMVolumeGroup(reloadCache bool) ([]apis.VolumeGroup, error) {
                "--units", "b",
        }
        cmd := exec.Command(VGList, args...)
-       output, err := cmd.CombinedOutput()
+       var execOut bytes.Buffer
+       // var execErr bytes.Buffer
+       cmd.Stdout = &execOut
+       // cmd.Stderr = &execErr // do something with this useful information
+       err := cmd.Run()
        if err != nil {
                klog.Errorf("lvm: list volume group cmd %v: %v", args, err)
                return nil, err
        }
+       output := execOut.Bytes()
        return decodeVgsJSON(output)
 }

@abhilashshetty04
Copy link
Contributor

@kro-cat Thank you for reporting the issue. Can it be reproduced easily? If yes, please help us with the steps.

Since you have found the fix for it also. Would you like to raise a PR for this?

@kro-cat
Copy link
Contributor Author

kro-cat commented Aug 16, 2023

@abhilashshetty04 Of course! I'll have a PR up in a bit.

I'm convinced this is a niche issue. My current hypothesis is that it's due to the fact that the host machine has a different lvm system_id than the lvm-driver container. When lvm detects a foreign system id on a volume group, it will print a warning on STDERR; this is the kind of message that led me to this issue.

Since the system id can be set when using vgcreate, this shouldn't necessarily cause problems for volume groups meant to be managed by lvm-localpv so long as I'm aware of the correct system id to set (I think, by default, it's an empty string).

I haven't tested that yet, so I can't be certain that it's entirely correct.

@kro-cat
Copy link
Contributor Author

kro-cat commented Aug 17, 2023

By "in a bit" I guess I really meant to say "when I'm confident it actually solves the problem I'm having instead of just hiding it"

@abhilashshetty04
Copy link
Contributor

@kro-cat , Thanks for replying. We pre-provision vg (volume group) anyway right? So what do you mean by lvm-driver detects vg not managed by itself?

@kro-cat
Copy link
Contributor Author

kro-cat commented Aug 22, 2023

@abhilashshetty04
The daemonset can see the any of the host's lvm devices due to its /dev hostPath mount.

I'll try my best to explain why that might cause a problem.

In my particular use case, the host uses lvm logical volumes for its own filesystem (/var, /home, stuff like that). These volume groups cannot be managed by lvm-localpv, so I don't include them in its configuration. Additionally, the host's lvm.conf has been configured beyond defaults; notably, the hosts each have a unique systemid configured. These are the kinds of volumes whose volume groups create this issue.

When the lvm-driver pod executes vgs, it's intention is to discover the volume groups it's configured to manage. However, because the host's own volume group (and lvm.conf) are configured outside of lvm-driver, this has the potential to cause vgs (which shows all volume groups in /dev) to display warnings or errors about the differences or incompatibilities (foreign systemid, no lvmlockd, and others).

@kro-cat
Copy link
Contributor Author

kro-cat commented Aug 22, 2023

I think I should follow-up on my previous comment. The issue here is that lvm-driver crashes unexpectedly when warning messages are mixed with JSON output from vgs.
I've put a lot of effort into explaining why the warnings messages are produced, but that's not the point of this issue, nor is that necessarily an issue with lvm-localpv itself.
The PR for this issue addresses the unexpected crashing: split the STDERR output from vgs and print it to the log. This provides much-needed diagnostic information while not affecting the health of the daemonset.

Abhinandan-Purkait pushed a commit that referenced this issue Aug 31, 2023
)

* fix(lvm): change calling operation to avoid garbage in json object

Drop the STDERR stream contents from `output` to avoid JSON mangling.
The `vgs` command may print non-critical warnings to STDERR. Warnings
may not necessarily result in a failure return code, which allows the
program to continue with marshalling the JSON-formatted output.
Combining this stream with STDIN will cause the next step at
`decodeVgsJSON()` to fail due to garbage mixed in the JSON.

Fixes #247

Signed-off-by: Kara <[email protected]>

* fix(lvm): add PR changelog - suppress STDERR when calling vgs

Add PR changelog

Signed-off-by: Kara <[email protected]>

* refactor(lvm): refactor exec to separate function, add log verbosity

Move exec code into separate function which returns STDOUT and STDERR
streams separately. This is to facilitate future occurences of that same
pattern within lvm_util.go . Additionally, add log messages to assist with
debugging lvm and json.

Signed-off-by: Kara <[email protected]>

* feat(ci): integrate test to simulate foreign lvm systemid

This test aims to determine whether lvm-driver is capable of operating
with a volume group whose systemid is foreign to the kubernetes host
system.
The test ensures the program in pkg/lvm/lvm_utils.go is capable of
operating in a foreign lvm environment. This is important when the
kubernetes host system has a different lvm configuration than the
lvm-driver container. Differences in configuration may cause unforseen
consequences when the host machine provisions a volume group which is to
be consumed by the lvm-driver container.
Additionally, this feature includes teardown code in the form of the
cleanup functions which aims to aid with cleaning up ci resources in a
local development environment. To make the cleanup actions compatible
with the original operation of the ci-test.sh script, the feature is
only enabled through the use of environment variables.
While this commit fixes issue #249, it's intent was as an integration
test for issue #247.

Fixes #249

Signed-off-by: Kara <[email protected]>

* feat(ci): clean up volume groups when applicable

To avoid volume group caching, explicitly remove volume groups (and
their pvs) before removing the disk and loopback device.

Signed-off-by: Kara <[email protected]>

* fix(ci): use -n instead of ! -z, change last statement to return true

Fix linter errors: use -n instead of ! -z in test statements. Last
statement in file is used as the return code of the file, so it must
return true; change `[ ! -z ... ] && ...` into `[ -z ... ] || ...` to
achieve this.

Signed-off-by: kro <[email protected]>

---------

Signed-off-by: Kara <[email protected]>
Signed-off-by: kro <[email protected]>
@Mjolkhare
Copy link

@kro-cat @Abhinandan-Purkait I have the same issue with lvs and pvs. I think all commands should use RunCommandSplit instead of CombindOutput.

@kro-cat
Copy link
Contributor Author

kro-cat commented Sep 28, 2023

@kro-cat @Abhinandan-Purkait I have the same issue with lvs and pvs. I think all commands should use RunCommandSplit instead of CombindOutput.

@Mjolkhare I think I'll set up a PR this weekend for that since I never did that refactor in pull #250, unless of course someone beats me to it.

@Abhinandan-Purkait
Copy link
Member

@kro-cat Thanks for being on top of this.

kro-cat added a commit to kro-cat/lvm-localpv that referenced this issue Oct 8, 2023
The LVM system may sometimes produce non-critical warnings which are
written to STDERR without formatting. Combining STDERR with STDIN may
cause failures when the output is being formatted or otherwise
interpreted.

Following pull openebs#250, it's been requested that all lvm commands be
refactored to use a split output in order to resolve this issue under
non-tested scenarios. See issue openebs#247.

The definition for RunCommandSplit has been moved above all uses of the
function, and any Command using CombinedOutput and an lvm command (s.a.:
lvs, vgs, lvcreate, &c.) has been refactored to instead use
RunCommandSplit to obtain the command's output.

If anything is written to STDERR by the lvm commands, RunCommandSplit
prints the message to the log as a warning.

Signed-off-by: kro <[email protected]>
Abhinandan-Purkait pushed a commit that referenced this issue Oct 18, 2023
…#261)

* refactor(ci): rename env variables and update vgcreate command

Env vars LVM_SYSTEMID and LVM_CONFIG are used to describe a potentially
foreign lvm system on the kubernetes host. The lvm system is only meant
to be foreign to the lvm-localpv containers.

Due to the way the ci tests are written, the kubernetes host and the
lvm-localpv conatiners must have identical lvm configurations or ci
tests might fail.

The variables LVM_SYSTEMID and LVM_CONFIG have been renamed to
FOREIGN_LVM_SYSTEMID and FOREIGN_LVM_CONFIG, respectively. This is
helpful when determining their roles at a glance.

A secondary change was made to the foreign pv creation: the lvm option
`--config` has been used in place of `--systemid` to hopefully minimize
unintended side effects of using the same lvm config (with the exception
of the system id) during volume group creation.

Signed-off-by: kro <[email protected]>

* refactor(lvm): update all lvm commands to use split output

The LVM system may sometimes produce non-critical warnings which are
written to STDERR without formatting. Combining STDERR with STDIN may
cause failures when the output is being formatted or otherwise
interpreted.

Following pull #250, it's been requested that all lvm commands be
refactored to use a split output in order to resolve this issue under
non-tested scenarios. See issue #247.

The definition for RunCommandSplit has been moved above all uses of the
function, and any Command using CombinedOutput and an lvm command (s.a.:
lvs, vgs, lvcreate, &c.) has been refactored to instead use
RunCommandSplit to obtain the command's output.

If anything is written to STDERR by the lvm commands, RunCommandSplit
prints the message to the log as a warning.

Signed-off-by: kro <[email protected]>

---------

Signed-off-by: kro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants