Add Fuzzlyn to CI #60344

Merged
merged 5 commits into from
Oct 18, 2021

jakobbotsch
Member

@jakobbotsch jakobbotsch commented Oct 13, 2021

Add support for Fuzzlyn in the exploratory pipeline files.

  • We use the pipeline name to determine which tool to use, since that
    seems to be the easiest way to have this available during template
    expansion.
  • All the .yml files are shared, and the setup script is also shared
    (renamed to fuzzer_setup.py). However the summarize and run scripts
    are not shared.
  • The summarize scripts now use the AZDO feature that allows outputting
    a markdown file that shows up rendered under the pipeline results.
    These can be seen on the "Extensions" tab of AZDO.
  • For Fuzzlyn, we automatically reduce silent bad codegen examples found
    and include these in the summary (but we do not reduce examples if we
    are over time). Assertion errors are not reduced, but the
    documentation in exploratory.md contains some information on how to
    reduce these manually. This should just be a temporary measure until
    we can more efficiently reduce these.
  • The issue zips are now part of the issues artifact (and I removed the
    "summary" part of the name) since the Fuzzlyn summarize script reads
    the reduced examples from the zip, and I feel it's simpler to have all
    the info in one artifact.

I have also included a small fix for the superpmi download display progress so that we do not display a larger size than the file downloaded.
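
The summary mechanism from the third bullet can be sketched roughly as follows. This is a minimal sketch, not the summarize script from this PR: `write_markdown_summary`, `upload_summary_command`, the file name `summary.md`, and the issue data are all hypothetical. It relies on the Azure Pipelines `##vso[task.uploadsummary]` logging command, which attaches a markdown file to the run results.

```python
import os
import tempfile


def write_markdown_summary(issues, out_dir):
    """Write a markdown summary and return its path (hypothetical helper)."""
    path = os.path.join(out_dir, "summary.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# {} unique issues found\n\n".format(len(issues)))
        for title, count in sorted(issues.items()):
            f.write("* `{}` ({} hits)\n".format(title, count))
    return path


def upload_summary_command(path):
    """Build the Azure Pipelines logging command that attaches the file."""
    return "##vso[task.uploadsummary]{}".format(os.path.abspath(path))


if __name__ == "__main__":
    # Made-up data, for illustration only.
    example_issues = {"Assertion failed 'example'": 3}
    summary_path = write_markdown_summary(example_issues, tempfile.mkdtemp())
    # The build agent picks the command up from standard output.
    print(upload_summary_command(summary_path))
```

Printing the command from a pipeline step is what makes the rendered markdown appear with the run results.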

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 13, 2021
@ghost

ghost commented Oct 13, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@jakobbotsch jakobbotsch force-pushed the fuzzlyn-in-ci branch 2 times, most recently from 5d010b9 to 33945c4 Compare October 15, 2021 12:20
@jakobbotsch
Member Author

/azp run Antigen, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Member Author

The arm/arm64 Antigen failures in run 1423235 are strange. The helix step failed to download some artifacts and the logs seem to be truncated in exactly the same place: Partition0, Partition1

@jakobbotsch
Member Author

/azp run Antigen, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch jakobbotsch marked this pull request as ready for review October 18, 2021 14:34
@jakobbotsch
Member Author

jakobbotsch commented Oct 18, 2021

Similar failure in the Fuzzlyn run on linux arm partition 2. The results seem to indicate that it did run the subprocess (there is a Fuzzlyn log file here), but the console output and the output from the script seem to be truncated: see here.

@kunalspathak Any ideas what could be going on and why we don't see output from the "run" scripts?

EDIT: The Fuzzlyn log file that is attached to the results is also truncated, strangely.

@kunalspathak
Member

> @kunalspathak Any ideas what could be going on and why we don't see output from the "run" scripts?

Yes, I have seen those failures. My understanding was that this happens because of long-running Python scripts, but the one that you are running is just an hour long. FYI - @MattGal

Member

@kunalspathak kunalspathak left a comment

Good progress. Added a few questions/comments.

@@ -38,11 +38,11 @@

<!-- For Scheduled= 3 hours. For PRs= 1 hour -->
<PropertyGroup Condition=" '$(RunReason)' == 'Scheduled' ">
- <WorkItemTimeout>3:15</WorkItemTimeout>
+ <WorkItemTimeout>3:30</WorkItemTimeout>
Member

It is not a big deal, but I'm curious: why is this increased by 15 minutes?

Member Author

The reason is that I added the reduction of silent bad codegen examples in the background for Fuzzlyn. The Fuzzlyn run will not start reducing examples after the 1 hour is up, but it might start reducing an example after 00:59:59. Reducing a silent bad codegen example typically doesn't take more than a few minutes, but for very large programs it might use more than 15 minutes, especially for some of the platforms where we have weaker hardware in CI.
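
The timing rule described above can be sketched like this. It is an illustration under assumptions rather than the logic in the Fuzzlyn run script: `RUN_BUDGET` and `may_start_reduction` are hypothetical names, and the one-hour budget is the figure mentioned in the comment above.

```python
from datetime import datetime, timedelta

# One-hour budget, as described in the comment above.
RUN_BUDGET = timedelta(hours=1)


def may_start_reduction(run_start, now, budget=RUN_BUDGET):
    """Return True if a new reduction may still be started.

    A reduction that starts just before the deadline is allowed to run
    to completion, which is why the work item timeout needs slack on top
    of the budget.
    """
    return now - run_start < budget


if __name__ == "__main__":
    start = datetime(2021, 10, 18, 12, 0, 0)
    # One second before the deadline: a reduction may still begin.
    print(may_start_reduction(start, start + timedelta(minutes=59, seconds=59)))
    # At the deadline: no new reductions are started.
    print(may_start_reduction(start, start + timedelta(hours=1)))
```

Because the check only gates when a reduction starts, the worst case for total run time is the budget plus the longest single reduction.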

displayName: ${{ format('Print unique issues ({0})', parameters.osGroup) }}
continueOnError: true
- script: $(PythonScript) $(Build.SourcesDirectory)/src/coreclr/scripts/$(SummarizeScript) -issues_directory $(IssuesLocation) -arch $(archType) -platform $(osGroup)$(osSubgroup) -build_config $(buildConfig)
displayName: ${{ format('Summarize ({0})', parameters.osGroup) }}
Member

Could you also include parameters.archType in the displayName?

Member Author

Will do.

@@ -426,7 +426,7 @@ def download_progress_hook(count, block_size, total_size):
block_size (int) : size of a block
total_size (int) : total size of a payload
"""
- sys.stdout.write("\rDownloading {0:.1f}/{1:.1f} MB...".format(count * block_size / 1024 / 1024, total_size / 1024 / 1024))
+ sys.stdout.write("\rDownloading {0:.1f}/{1:.1f} MB...".format(min(count * block_size, total_size) / 1024 / 1024, total_size / 1024 / 1024))
Member

Is this a related change?

Member Author

No, it's unrelated, but it's so small I didn't want to do separate PR/CI runs for it.
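
For context, here is a self-contained sketch of why the clamp matters. It assumes the hook is used as a `reporthook` for `urllib.request.urlretrieve`, which calls it with `(count, block_size, total_size)`; `format_progress` is a hypothetical helper split out so the formatting can be checked in isolation.

```python
import sys


def format_progress(count, block_size, total_size):
    """Format a download progress line, clamping to the payload size."""
    mib = 1024 * 1024
    # The last block is usually only partially filled, so
    # count * block_size can overshoot total_size.
    downloaded = min(count * block_size, total_size)
    return "\rDownloading {0:.1f}/{1:.1f} MB...".format(downloaded / mib, total_size / mib)


def download_progress_hook(count, block_size, total_size):
    """Report hook with the (count, block_size, total_size) signature."""
    sys.stdout.write(format_progress(count, block_size, total_size))
    sys.stdout.flush()


if __name__ == "__main__":
    # Three 1 MiB blocks reported for a 2.5 MiB payload.
    download_progress_hook(3, 1024 * 1024, 2621440)
    sys.stdout.write("\n")
```

Note that `urlretrieve` may report a total size of -1 when the server does not provide one; this sketch does not handle that case.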

"Fuzzlyn": "https://github.com/jakobbotsch/Fuzzlyn.git",
}

repo_url = repo_urls[coreclr_args.tool_name]
Member

Could you add a check that repo_url is not None?

Member Author

Will do.

Member Author

Actually this one is already verified above in setup_args
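
One detail relevant to this thread: indexing a dict with an unknown key raises `KeyError` rather than returning `None`, so the validation either has to happen before the lookup (as `setup_args` is said to do) or use `dict.get`. A minimal sketch, where `resolve_repo_url` is a hypothetical helper and only the Fuzzlyn entry shown in the diff is filled in:

```python
REPO_URLS = {
    # Other entries are elided in the quoted diff and are left out here.
    "Fuzzlyn": "https://github.com/jakobbotsch/Fuzzlyn.git",
}


def resolve_repo_url(tool_name, repo_urls=REPO_URLS):
    """Return the clone URL for tool_name or raise a descriptive error."""
    repo_url = repo_urls.get(tool_name)
    if repo_url is None:
        raise ValueError("Unknown tool name '{}'; expected one of: {}".format(
            tool_name, ", ".join(sorted(repo_urls))))
    return repo_url


if __name__ == "__main__":
    print(resolve_repo_url("Fuzzlyn"))
```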


The basics of both tools are the same: they generate random programs using Roslyn and execute them with `corerun.exe` in a baseline and a test mode.
Typically, baseline uses the JIT with minimum optimizations enabled while the test mode has optimizations enabled.
Antigen also sets various `COMPlus_*` variables in its test mode to turn off different stress modes.
Member

Suggested change
- Antigen also sets various `COMPlus_*` variables in its test mode to turn off different stress modes.
+ Antigen also sets various `COMPlus_*` variables in its test mode to turn on different stress modes or turn on/off different optimizations.


## Getting test examples from Antigen runs

For Antigen runs the summary will show the assertion errors that were hit.
Member

Suggested change
- For Antigen runs the summary will show the assertion errors that were hit.
+ For Antigen runs, the summary will show the assertion errors that were hit.


# Turned off since the output does not seem particularly useful
# if len(remaining_issues) > 0:
# f.write("# {} uncategorized issues found\n", len(remaining_issues))
Member

I would still print the "uncategorized issues found" line so we can come back and investigate.

Member Author

Will do.
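
If the commented-out line is re-enabled, note that a file object's `write` accepts a single string, so the count has to be formatted into the string first. A minimal sketch, where `write_uncategorized_header` is a hypothetical helper and `remaining_issues` stands in for the list from the quoted snippet:

```python
import io


def write_uncategorized_header(f, remaining_issues):
    """Write a markdown heading with the number of uncategorized issues."""
    if len(remaining_issues) > 0:
        # write() takes one string argument, so format before writing.
        f.write("# {} uncategorized issues found\n".format(len(remaining_issues)))


if __name__ == "__main__":
    buffer = io.StringIO()
    write_uncategorized_header(buffer, ["issue-a", "issue-b"])
    print(buffer.getvalue(), end="")
```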

@MattGal
Member

MattGal commented Oct 18, 2021

> > @kunalspathak Any ideas what could be going on and why we don't see output from the "run" scripts?
>
> Yes, I have seen those failures. My understanding was that this happens because of long-running Python scripts, but the one that you are running is just an hour long. FYI - @MattGal

It's hard to actually say what's going on here. We don't try to stream the output from the docker container continuously because in prototyping this caused issues, so this got as far as it got with its std out buffer and this is how much output it copied.

Some thoughts about this problem:

  • While older, the image is still in heavy usage daily (something like 526411 successful work items used it in the past month) so it's not likely specific to the image, rather what's running inside it.
  • At least for this instance of the problem, we don't actually know if / how far it's getting past the "install dependencies on this container so I can use Helix functionality" stage or not. It probably is, because it didn't time out, and pip would log some number of errors if it had exited w/ code 1 (unfortunately, 1 is many executables' favorite generic exit code including XUnit.) - Writing any old file to $HELIX_WORKITEM_UPLOAD_ROOT as the first thing you do should answer that question.
  • You can get rid of or reduce this stage in execution by updating the image to latest; every time someone revs a dependency in the helix scripts this gets the docker images' preinstalled dependencies slightly out of whack with what's needed. In this case you'd update to mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-20211018132704-c537e64

I think the best thing to do is to get a matching device to run this on, run the exact payload directly from an interactive bash session, and see where it actually hangs. When you get to this stage, you can ping @ilyas1974 for some help with it.
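
The suggestion in the list above to write a file to `$HELIX_WORKITEM_UPLOAD_ROOT` before doing anything else could look like the following. This is a sketch under assumptions: `write_start_marker` and the file name `started.txt` are hypothetical, and only the environment variable name comes from the comment.

```python
import os
import tempfile
from datetime import datetime, timezone


def write_start_marker(upload_root=None, name="started.txt"):
    """Write a small marker file into the Helix upload directory.

    Returns the marker path, or None if no upload root is available.
    """
    upload_root = upload_root or os.environ.get("HELIX_WORKITEM_UPLOAD_ROOT")
    if not upload_root:
        return None
    os.makedirs(upload_root, exist_ok=True)
    path = os.path.join(upload_root, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("script started at {}\n".format(datetime.now(timezone.utc).isoformat()))
    return path


if __name__ == "__main__":
    # Outside Helix the variable is normally unset, so use a temp dir here.
    print(write_start_marker(tempfile.mkdtemp()))
```

If the marker shows up among the uploaded results, the work item got past the dependency installation stage and reached the script.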

@kunalspathak
Member

While you are here, can you also modify the following to log the output?

if output:
    of.write(output.strip().decode("utf-8") + "\n")

if output:
    print(output.strip().decode("utf-8") + "\n")
    of.write(output.strip().decode("utf-8") + "\n")
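
A self-contained version of this "echo and log" pattern might look like the sketch below. `run_and_tee` is a hypothetical helper, not the code in the run script, and flushing after every line is an assumption about what could make truncated output easier to diagnose, not something this thread establishes as a fix.

```python
import os
import subprocess
import sys
import tempfile


def run_and_tee(command, log_path):
    """Run command, echoing each output line to stdout and to log_path."""
    with open(log_path, "w", encoding="utf-8") as of:
        proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        for raw_line in proc.stdout:
            line = raw_line.decode("utf-8", errors="replace").rstrip()
            # Flush both sinks so partial progress survives an abrupt stop.
            print(line, flush=True)
            of.write(line + "\n")
            of.flush()
        return proc.wait()


if __name__ == "__main__":
    log_file = os.path.join(tempfile.mkdtemp(), "example.log")
    exit_code = run_and_tee([sys.executable, "-c", "print('hello from child')"], log_file)
    print("exit code:", exit_code)
```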

@jakobbotsch
Member Author

/azp run Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@kunalspathak kunalspathak left a comment

LGTM!

@jakobbotsch
Member Author

Looks like the latest CI run found a silent bad codegen example. I opened #60597 for it.

@MattGal

> At least for this instance of the problem, we don't actually know if / how far it's getting past the "install dependencies on this container so I can use Helix functionality" stage or not. It probably is, because it didn't time out, and pip would log some number of errors if it had exited w/ code 1 (unfortunately, 1 is many executables' favorite generic exit code including XUnit.) - Writing any old file to $HELIX_WORKITEM_UPLOAD_ROOT as the first thing you do should answer that question.

The output does seem to indicate that the Python script started to execute. The results here have a Fuzzlyn-linux-arm-Partition2.log file. This file is created by the Python script running on the partition. Note that this file is truncated too, which seems strange.

Anyway, I will merge this PR for now and then see if I can find some time to investigate the failure further.

@jakobbotsch jakobbotsch merged commit 730d1f4 into dotnet:main Oct 18, 2021
@jakobbotsch jakobbotsch deleted the fuzzlyn-in-ci branch October 18, 2021 23:31
@MattGal
Member

MattGal commented Oct 18, 2021

> Note that this file is truncated too, which seems strange.

@jakobbotsch that is very interesting because for any result file like this, we directly mount the outer volume into the Helix Docker container, so no matter how badly things go that file should represent as far as execution got before "the bad thing" happened. I'd definitely suggest investigating this from that perspective, as it'd be harder for a file to be accidentally partially written in this configuration than losing some std out.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 18, 2021