
Bazel does not generate coverage details when tests are cached #1613

Open
guibou opened this issue Oct 15, 2021 · 7 comments
Comments

@guibou
Contributor

guibou commented Oct 15, 2021

Describe the bug

Bazel caches tests. In the context of bazel coverage, this means that the Haskell coverage results are incomplete when some tests are cached.

It is unclear whether this is a Bazel bug or a rules_haskell bug.

To Reproduce

Run bazel coverage //tests/... in the rules_haskell repository. Run it several times; the results differ when tests are cached.
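
A rough sketch of how one might observe this (hedged: the bazel-testlogs layout and coverage.dat names are Bazel defaults, and this is not a polished repro):

# First run from a clean state: every test executes.
bazel clean && bazel coverage //tests/...
find -L bazel-testlogs -name coverage.dat | sort > /tmp/coverage.first
# Second run: tests now come from the cache; the set of results differs.
bazel coverage //tests/...
find -L bazel-testlogs -name coverage.dat | sort > /tmp/coverage.second
diff /tmp/coverage.first /tmp/coverage.second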

Expected behavior

Coverage results should be independent of caching.

Environment

  • OS name + version: NixOS
  • Bazel version: 4.1
  • Version of the rules: current master
@aherrmann
Member

Thanks for reporting this. To me this sounds more likely to be a Bazel issue. Can you test whether the missing files are already missing from the set of Mix file paths, or whether they are listed there but then missing in the runfiles tree? IIRC we're only passing directories to hpc, so missing files would not cause errors; they would just be silently missing.
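
For reference, one quick (hypothetical) check is to snapshot which coverage inputs Bazel actually materializes after a cold run versus a cached run:

# List every .mix/.tix file under bazel-out; compare the output of this
# across a cold and a cached `bazel coverage` invocation.
find -L bazel-out -name '*.mix' -o -name '*.tix' | sort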

Did you find any upstream issues about this? I found a few issues that seem potentially related, but nothing that matches this exactly: bazelbuild/bazel#12013, bazelbuild/bazel#12592.
I've also found reference to the --experimental_fetch_all_coverage_outputs flag. Have you tried that flag in your use-case?

@lfpino

lfpino commented Nov 8, 2021

(Found this issue through bazelbuild/bazel#12592)

I've experienced the same issue of coverage builds not working with remote caching (although I don't have a small repro). I don't think this is a rules_haskell issue but a Bazel one. I've been meaning to write a small repro, but if you have one it'd be great to file it upstream.

@aherrmann
Member

@lfpino Thanks for reaching out. I don't have a repro for this myself. @guibou do you have a repro that you could share? Could you point out where specifically you see differences with bazel coverage //tests/...? I tried to reproduce it on some of the coverage-compatible targets but didn't encounter the issue.

@TLATER

TLATER commented Feb 3, 2022

I've encountered this too; I tried looking into a proper fix, but ended up unable to reproduce it reliably. At this point I only have anecdotal evidence that I've seen empty coverage after cached tests ran.

Is anyone here still able to reproduce this with a recent Bazel version, or at least see the behavior reliably?

@guibou
Contributor Author

guibou commented Feb 3, 2022

In short: disable Bazel caching for tests; most of the time Bazel is not doing what it should. (At work we had tests failing in CI every day because of changes merged earlier that bazel test did not rerun since they were OK in the cache, or tests that were not rerun because Bazel had cached their failure: bazelbuild/bazel#9389.)

Now, if you really want to suffer with Bazel's caching behavior (and actually you do, because I've come to like arbitrary CI results in 10 s over correct CI results in 10 minutes), here are some details, as well as a partial workaround.

When running bazel test //..., Bazel only runs the tests that are not cached and ignores the others.

rules_haskell does not correctly use the Bazel infrastructure for coverage aggregation. Instead, it just scans the test output directory for test results, extracts the Haskell information hidden in the XML test results, and does its own aggregation. This leads to two defects in how rules_haskell handles coverage:

  • The bug we are discussing now: when a test is not run, its result may not appear in the outputs ("may" because it depends on when you last ran bazel clean).
  • The report we generate with rules_haskell does not aggregate results; instead it returns as many coverage results as there are tests, which is not super convenient.

I was able to work around this by using a combination of --combined_report=lcov and tricking Bazel into thinking that the hpc reports are lcov.

You can see part of the setup in my MR #1434, which actually solves the current problem but forces the output to lcov.

I locally have an ad hoc setup which provides the same. I have neither the time nor the motivation to upstream it (it's ad hoc code, and consider my long track record of stalled PRs, closed issues, design disagreements, ... whenever my proposals were not perfect according to arbitrary standards. Sorry, grumpy dev here).

Here are the things that I'm using right now:

This script is supposed to run the coverage report:

#!/usr/bin/env bash

# break after any error
set -e

# remove the previous coverage reports
rm -rf bazel-bin/coverage-reports

# get the target scope passed to this script, default is "//..."
scope=${1:-"//..."}

# Find which targets are enabled for coverage.
# The -repl and @repl targets seem not to generate coverage reports,
# so they are filtered out.
compatible_targets=$(bazel query "attr(\"tags\", \"has_coverage_report\", ${scope})" | grep -v -- "-repl" | grep -v "@repl")

echo "$compatible_targets"

# Run the coverage analysis.
# Notes:
# - The test timeout is increased: coverage adds a lot of overhead, and most tests were failing otherwise.
bazel coverage \
--test_timeout=1000 \
--test_env=LCOV_MERGER=$(pwd)/scripts/lcov_merger \
--combined_report=lcov \
--javabase=@local_jdk//:jdk \
--host_javabase=@local_jdk//:jdk \
--coverage_report_generator=@blork//buildlib:coverage_report \
${compatible_targets}

shopt -s globstar nullglob

COVERAGE_DIR=bazel-out/_coverage/_coverage_report.dat/
# Walk the coverage directory to look for TIX files
# (collected into an array; globstar is enabled above, so ** recurses).
TIX_FILES=("$COVERAGE_DIR"/**/*.tix)

# Walk the coverage directory to look for directories containing MIX files.
# They are named '*_.hpc'.
mix_args=()
while IFS= read -r t
do
    mix_args+=("--hpcdir=$t")
done < <(find "$COVERAGE_DIR" -name '*_.hpc' -type d)

# Sum all the reports.
# TODO: Main modules are ignored, because all the "Main" modules (as many as
# there are test entrypoints) would conflict.
# We could fix that with a pre-processing step that qualifies the module names.
hpc sum --union --exclude Main "${TIX_FILES[@]}" > bazel-out/_coverage/union.tix

# Text report
hpc report bazel-out/_coverage/union.tix "${mix_args[@]}"

# Generate the html report
LANG=C.UTF-8 hpc markup --verbosity=0 bazel-out/_coverage/union.tix "${mix_args[@]}" --destdir=bazel-bin/coverage-reports/

echo "Coverage results are in bazel-bin/coverage-reports/hpc_index.html"

The --test_env=LCOV_MERGER=$(pwd)/scripts/lcov_merger option forces Bazel to merge coverage results using the following script:

#!/usr/bin/env bash

# This script locates all the .mix and .tix files resulting from a
# Haskell coverage run and stores everything as a tar archive at the
# coverage output location.
# (bash, not sh: 'shopt' and 'pipefail' are bashisms.)

set -euo pipefail
shopt -s globstar nullglob
tar cvhf "$COVERAGE_OUTPUT_FILE" **/*.mix **/*.tix

In theory, you should be able to pass this script using the --coverage_support argument, but I have never been able to get it to work, and again, I'm too tired of Bazel to spend any more time finding a more robust solution (this one has been robust enough to work on two monorepos for two years).

Note that the coverage merger passes the files along as a tar archive. This archive is cached by bazel test, which is why it fixes the main problem.
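
To see what a given test contributed, one can list the archive content (hedged sketch: bazel-testlogs/<target>/coverage.dat is Bazel's default per-test output location, and //tests/my_test is a hypothetical target):

# With this setup, the per-test coverage.dat is a plain tar archive of
# .mix/.tix files rather than an lcov trace.
tar tvf bazel-testlogs/tests/my_test/coverage.dat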

Then, I use --combined_report=lcov. It does not actually produce lcov here, but I observed that it triggers the use of LCOV_MERGER and --coverage_report_generator, so I'm keeping it.

The --javabase and --host_javabase settings are there because Bazel internally uses Java for some coverage tooling (for C++ among others), and even though it does not use these tools for Haskell, it will try to fetch them and fail (I never investigated why; again, this thing drained my motivation).

Final step, the --coverage_report_generator=@blork//buildlib:coverage_report points to:

sh_binary(
  name = "coverage_report",
  srcs = ["@coverage_report//:bin/coverage_report"],
)

This in turn points to a nixpkgs_package which provides:

  # The coverage_report.sh script is used by bazel coverage to merge the
  # results of all the coverage runs.
  # Unfortunately, it is a really hardcoded thing inside Bazel, so it is
  # impossible to pass paths to dependencies (such as gnutar here) to the
  # script using the traditional Bazel `data` argument.
  # So instead, I'm generating the file with hardcoded paths directly with Nix.
  coverage_report = prev.symlinkJoin {
    name = "coverage_report";
    paths = [
      (prev.writeShellScriptBin "coverage_report" ''
          export PATH=${final.coreutils}/bin:${final.gnutar}/bin

          set -e

          # Clean the _coverage_report.dat directory
          # This directory is the official output for bazel coverage data
          # In theory, it should be a file (an lcov result), but in bazel 4.1
          # it works fine with a directory, so we exploit this fact to store
          # the uncompressed mix/tix files archives.
          rm -rf bazel-out/_coverage/_coverage_report.dat
          mkdir -p bazel-out/_coverage/_coverage_report.dat
          cd bazel-out/_coverage/_coverage_report.dat

          # Uncompress all the archives
          while read i
          do
              if ! [[ $i =~ baseline ]]
              then
                  tar xvf ../../../$i
              fi
          done < ../../../bazel-out/_coverage/lcov_files.tmp
        ''
      )
      # And I also need to generate a bazel BUILD file so I can use it ;)
      (prev.writeTextFile {
        name = "BUILD.bazel";
        destination = "/BUILD.bazel";
        text = ''
           exports_files(["bin/coverage_report"])
        '';
      }
      )];
  };

This takes the content of a manifest file, lcov_files.tmp (which points to all our tar archives containing the mix and tix files), and unpacks everything into the _coverage_report.dat directory. I'm tricking Bazel here: _coverage_report.dat is supposed to be a file, but I'm using it as a directory.

It should have been possible to copy all the archives into that directory and do the unpacking outside of Bazel; I don't remember why I did it this way.

Note the indirection through the Nix script. We use Bazel's strict_action_env so builds are a bit more reproducible; because of that, the script is unable to locate tar and friends, and because Bazel does not respect data on that script, we cannot pass these tools any other way. You may remove this indirection if you don't use strict_action_env.

Note that this final script could also contain all the hpc calls that appear in the first script; that way everything would be doable with a single bazel coverage call. I don't remember why I didn't do it that way.

@TLATER

TLATER commented Feb 3, 2022

Thanks for that write-up! Coming back to specifically:

The bug we are discussing now: when a test is not run, its result may not appear in the outputs ("may" because it depends on when you last ran bazel clean).

If I understand correctly, you believe Bazel combines all lcov files in $COVERAGE_DIR; if a test is cached, it may not have a coverage file in that directory. And this happens when the files stored there are not lcov files, because Bazel doesn't cache non-lcov files correctly? I'm trying to root-cause this specific issue :)

Have you come across --experimental_split_coverage_postprocessing and --experimental_fetch_all_coverage_outputs? It's possible that I have stopped seeing these cache issues after starting to use those flags, albeit with other rules, but I'm not sure because coverage doesn't work reliably in remote execution without them, which is a different can of worms.
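
For reference, a sketch of trying both flags together (untested here; availability and behavior depend on the Bazel version):

# Run coverage postprocessing in its own spawn and fetch all coverage
# outputs locally, even for cached/remote tests.
bazel coverage //tests/... \
    --experimental_split_coverage_postprocessing \
    --experimental_fetch_all_coverage_outputs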

@guibou
Copy link
Contributor Author

guibou commented Feb 4, 2022

If I understand correctly, you believe Bazel combines all lcov files in $COVERAGE_DIR - if a test is cached, it may not have a coverage file in that directory. This happens if the files stored there are not lcov files, because Bazel doesn't cache non-lcov files correctly? I'm trying to root-cause specifically this issue :)

That's the idea. Bazel caches the lcov files it knows about, and hence caches nothing when there are no lcov files.

Have you come across --experimental_split_coverage_postprocessing and --experimental_fetch_all_coverage_outputs? It's possible that I have stopped seeing these cache issues after starting to use those flags, albeit with other rules, but I'm not sure because coverage doesn't work reliably in remote execution without them, which is a different can of worms.

I haven't experimented with those, sorry. I admit that now that I have something "working", I'd rather not touch it again.
