Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDMS implementation for runtime I/O collection #834

Merged
merged 47 commits into from
Dec 21, 2023

Conversation

Snell1224
Copy link
Contributor

This includes LDMS API code that collects I/O event data in real time.
This PR is a work in progress and will be updated with the necessary test script and documentation.

@github-actions github-actions bot added CI continuous integration pydarshan labels Nov 16, 2022
@Snell1224
Copy link
Contributor Author

Snell1224 commented Nov 16, 2022

Branch is now updated with a CI github test. This tests the Darshan-LDMS integrated code by running a simple mpi-io-test application and checks that both Darshan and LDMS are collecting I/O events. It then checks for completion of the run and generated Darshan log file when LDMS is not functioning properly (i.e. env variables not set, cannot connect to daemon).

Right now, the CI test clones and installs Darshan from my repository (i.e. Snell1224/darshanConnector) to test the LDMS implemented code.

Once this PR has been reviewed and/or approved to merge into the main branch, I will update the CI test accordingly (i.e. install Darshan from the main branch of this repository). If it's better to leave this PR as a separate branch and not merge the changes, that's fine too. I will also update the CI test accordingly (i.e. test will trigger and the Darshan will be installed from the darshanConnector branch of this repository).

Documentation on how to run the Darshan-LDMS Integrated code (i.e. darshanConnector) can be found here: https://ovis-hpcreadthedocs.readthedocs.io/en/latest/ldms-streams.html

Copy link
Contributor

@shanedsnyder shanedsnyder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this LDMS functionality to a new darshan-ldms.c source file, rather than in the DXT source here?

The DXT module currently just piggybacks off of our POSIX and MPI-IO modules to provide more detailed tracing, similar to what you want to with the LDMS connector. But, there isn't really any integration between LDMS/DXT that I can see here (all calls to LDMS are in the POSIX module, in the MPI-IO module, etc.), so I'm not sure this is the best spot for the code responsible for talking with LDMS. Pushing as much code to a standalone LDMS source file as possible and trying to limit LDMS-specific code in other source files will probably make maintenance a little easier on our end.

@Snell1224
Copy link
Contributor Author

The LDMS functionality has been moved to a new darshan-ldms.c source file. Please let me know if there are any issues with these changes.

@shanedsnyder
Copy link
Contributor

Thanks for the updates! I'll do my best to make another pass over this this week and see if I have more comments related to the timer issue.

darshan-runtime/lib/darshan-core.c Outdated Show resolved Hide resolved
darshan-runtime/lib/darshan-core.c Outdated Show resolved Hide resolved
darshan-runtime/lib/darshan-core.c Outdated Show resolved Hide resolved
darshan-runtime/lib/darshan-ldms.c Outdated Show resolved Hide resolved
darshan-runtime/lib/darshan-ldms.c Outdated Show resolved Hide resolved
darshan-runtime/lib/darshan-ldms.c Outdated Show resolved Hide resolved
@shanedsnyder
Copy link
Contributor

I just added a bunch of review comments. Mostly just another pass to see how to tidy up the integration overall, but let me know if you have any questions or if anything seems unreasonable.

I still need to think more on the timespec issue to see what might be the cleanest solution there, but I'm sure we can figure something out.

tylerjereddy and others added 3 commits March 8, 2023 13:42
* this allows `pip install -e .` to produce a working editable
install locally

* the basis for the `setuptools` pin here is based on reading
through pypa/setuptools#3548; in short,
looks like a behavior change in `setuptools` so let's just pin it
for now

* since this only affects Python developers for our project I've not
added any "regression test"/"CI test" for it, but feel free to add
one editable job to the CI if you want
@github-actions github-actions bot removed the pydarshan label Mar 8, 2023
@Snell1224
Copy link
Contributor Author

Snell1224 commented Mar 8, 2023

Made the necessary changes to jobid, uid, and exename and cleaned up the LDMS code.

The DARSHAN_LDMS_HOST is different from gethostname().
DARSHAN_LDMS_HOST: Used to connect to the ldmsd (I.e. what host the ldms daemon is running on)
gethostname(): Used for identifying the producer name (i.e. what host the application is running on).

However, after looking at setup_connection(), you are correct that gethostname() should not be here. Not only is this the wrong function call (needs to find the host of the ldms daemon and not the application) but the darshan_ldms_connector_initialized() already checks if DARSHAN_LDMS_HOST is set before calling setup_connection(). This chunk of code has been removed

Thank you for reviewing this! Please let me know if there is any other ldms code that can be cleaned up or doesn't make sense.

darshan-runtime/lib/darshan-hdf5.c Outdated Show resolved Hide resolved
Snell1224 and others added 4 commits November 28, 2023 15:51
 * move various LDMS code so it's contiguous where possible
 * various indentation/whitespace changes
 * remove unused dC vars
 * fix source copyright
@shanedsnyder
Copy link
Contributor

I pushed some small updates over the last day or so, and think I'm ready to sign off on things. For the source code, I mostly did some refactoring to further reduce the diff (e.g., moving the definition of the extern dC variable into the LDMS header so it doesn't have to be repeated everywhere), and also made some whitespace changes to make sure indentations were consistent. I also made some relatively minor updates to the autotools files to simplify and cleanup the output from configure.

As far as the github workflow goes, I've only briefly scanned it and the most recent run from my latest commit and noticed that it's not building the LDMS module currently. It looks like a stage of the workflow is explicitly cloning our main branch, whereas I think we want it to just build the code associated with the workflow (in this case, this PR). Maybe have a look at the existing runtime_ci.yml example to see how we are building Darshan directly there?

At any rate, it'd be great if we could get the workflow to run on an LDMS-enabled version of Darshan before we merge so I can sanity check that it's not causing issues while running examples. We should also think about how we could modify the existing workflow to trip some sort of error if the Darshan-LDMS integration isn't working properly (e.g., the workflow is passing now even without LDMS support being built in Darshan). Doing the typical inspection of Darshan logs in this case doesn't tell much other than that Darshan didn't crash and produced some output. Is there anyway to "check" LDMS to see that it has ingested Darshan data? Maybe at the very least, we could enable the verbose option you have for the LDMS module and check that you see some of that output at runtime on stdout/stderr? Any other ideas?

@Snell1224
Copy link
Contributor Author

Snell1224 commented Dec 5, 2023

Doing the typical inspection of Darshan logs in this case doesn't tell much other than that Darshan didn't crash and produced some output. I

Thank you for making these changes and fixing the whitespaces.

Yes, right now this Darshan-LDMS github workflow is set to clone the main Darshan branch because, at the time when I first created this PR, I figured the code would be merged into the main branch once the PR was approved so we would want to test that branch instead. I have been checking this workflow by running it in a separate branch (identical to this one) where the LDMS integrated Darshan code is cloned. However, I do agree that we should test in this PR to make sure we are all on the same page about the Darshan LDMS code so I will update this.

As a general description of the workflow file, the LDMS and Darshan codes are cloned, installed and the environment variables are set. Then 3 tests are performed where the goal is to check that the application and Darshan are unaffected by any sort of LDMS related issue/error:

  1. LDMS streams daemon is connected and env variables are set
  2. LDMS env variables are not set
  3. LDMS streams daemon does not exist (not connected or started)

As for checking if LDMS is collecting Darshan data, the LDMS streams daemon that's created is also configured to store the Darshan data to a .csv file and this is checked if it exists and not empty (a new change that I need to update in this branch). There is also a log file that's created when and LDMS daemon is started so I can also cat that file so we can see if any errors show up.

Now for checking if Darshan is working, I'm not currently doing anything else other than checking that the .darshan log files exist but I'll look into the runtime_ci.yml and see if there's anything there I could refer to. If you have any other suggestions then please let me know!

As I mentioned earlier, there are some changes that I made to the workflow that I haven't updated in this branch. The reason why is because I've recently ran into an LDMS related issue with the latest LDMS code and have been trying to debug this since. So once that is all sorted out I'll update the workflow with these changes and have it clone this PR instead of the main branch.

Thank you for the suggestions and input!

@shanedsnyder
Copy link
Contributor

Thanks for the update, @Snell1224 ! That all sounds great. The key thing I was interested in checking was that there was a way to ensure that LDMS is capturing the Darshan data properly. Sounds like you have a couple of different files/logs you can check for which is perfect.

Now for checking if Darshan is working, I'm not currently doing anything else other than checking that the .darshan log files exist but I'll look into the runtime_ci.yml and see if there's anything there I could refer to. If you have any other suggestions then please let me know!

I think this is sufficient, for now, probably. We have a more advanced regression testing system for darshan-runtime in the repo, but it's not currently running in any of our workflows (this is on my TODO list) so there's not an example to point to. I can always come back and modify this test to use that more thorough testing if we want, but I think a test that does the following is more than fine:

  • ensure application runs fine, terminates without error
  • ensure LDMS properly collects data/logs as expected
  • ensure Darshan generates a log and that darshan-parser can parse it without error

The main thing I think that would be useful to include here from runtime_ci.yml is the code for making sure we build the current branch (rather than main):

- name: Install darshan-runtime

(That would allow the workflow to run against this PR now, but once it's merged into main, subsequent workflows will target main, if that makes sense.)

Let me know when you have the workflow ready, and I can review and approve it to run to ensure we're getting the expected result on this branch. After that, I think we can go ahead and merge all of this into main.

@Snell1224
Copy link
Contributor Author

@shanedsnyder Okay the test script is updated and ready to be reviewed! The script now builds the Darshan-LDMS integrated code and is triggered whenever this is pull/push on this branch.

Apologies for the delay, that LDMS issue took a bit longer than expected to debug but I finally found out the problem and have a temporary workaround until it's addressed by the other LDMS developers.

I've added some additional ls -ltrch commands for the log file directory (towards the bottom of the file) so that we can see all the Darshan log files, their parsed output files, LDMS output and LDMS csv file. Please feel free to make any edits to this .yaml file and/or let me know if there's any issues or questions!

shanedsnyder
shanedsnyder previously approved these changes Dec 15, 2023
@shanedsnyder
Copy link
Contributor

Awesome, I just reviewed and everything looks good. I made a few smaller simplifications (like removing HDF5 and python dependencies that aren't used), but that's all. Looks like the LDMS workflow is running and passing now, which is great to confirm. There are some workflow failures elsewhere, but those appear to be related to some PyDarshan issues not introduced in this PR -- I'm working on getting a fix in for those now in a separate PR.

So, I went ahead and approved this -- I don't think I need any other specific changes from you, unless there was something you still wanted to add? I will add a note to our darshan-runtime docs about the new configure argument to enable LDMS (and will link to the docs you shared previously), but that's my last planned change before merging this PR in.

Thanks a lot for the contribution and for working with us to get the code ready for merge! This is a really neat capability that I hope will lead to some interesting analysis of I/O behavior at runtime.

@Snell1224
Copy link
Contributor Author

Snell1224 commented Dec 18, 2023

So, I went ahead and approved this -- I don't think I need any other specific changes from you, unless there was something you still wanted to add? I will add a note to our darshan-runtime docs about the new configure argument to enable LDMS (and will link to the docs you shared previously), but that's my last planned change before merging this PR in.

Thanks a lot for the contribution and for working with us to get the code ready for merge! This is a really neat capability that I hope will lead to some interesting analysis of I/O behavior at runtime.

Okay sounds good! And no there's nothing else that I need to add so it can be merged whenever you're ready!

About the docs, I just emailed a version of the darshan-runtime doc with the added LDMS info so if you haven't started editing the document yet then feel free to refer to this!

No problem! It was a great learning experience and thank you for reviewing and cleaning up the code! Yes, I'm also excited to see what analyses we can produce with this capability.

@Snell1224
Copy link
Contributor Author

Snell1224 commented Dec 19, 2023

EDIT: Pushed the changes. Everything is ready to go now.

@shanedsnyder I actually did have one more thing to add which is a env "switch" so that users can choose to turn off/on the LDMS capability. Right now it automatically initializes a connection to the LDMS daemon (when enabled during Darshan build). I'm going to add the commit in a few to this branch and updating the darshan-runtime document that was emailed earlier.

@shanedsnyder
Copy link
Contributor

Looks good, only suggestion is that I think we want to use the new env variable you just added (DARSHAN_LDMS_ENABLE) in test case 3 for your github workflow, as in test case 1? To ensure nothing bad happens when trying to connect to an LDMS daemon that has shut down?

I'll try to push in the docs updates later today, thanks for sending those to me!

@Snell1224
Copy link
Contributor Author

Snell1224 commented Dec 20, 2023

Looks good, only suggestion is that I think we want to use the new env variable you just added (DARSHAN_LDMS_ENABLE) in test case 3 for your github workflow, as in test case 1? To ensure nothing bad happens when trying to connect to an LDMS daemon that has shut down?

The DARSHAN_LDMS_ENABLE is already set for test case 3 because it's under the same step name: "Test Preparation and Run" and it is not being unset in test case 2. From what I understand, github actions keeps the exported variables until that step is complete. I did re-run the test to double check this and, for test case 3, LDMS threw a connection error (Error 111) and exited but there were no runtime issues with the program or Darshan.

Although I still exported it in test case 3 for ease of mind.

Please let me know if you have any questions about the documentation part!

@shanedsnyder
Copy link
Contributor

All sounds good. I just pushed in the docs changes you sent me with a couple small tweaks. I also fixed some of the unrelated CI failures in another PR, so hopefully this passes all checks. Assuming so, I'll go ahead and merge it in. Thanks again for the contribution!

@shanedsnyder shanedsnyder merged commit cd84ee5 into darshan-hpc:main Dec 21, 2023
17 of 18 checks passed
@shanedsnyder shanedsnyder added this to the 3.4.5 milestone Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI continuous integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants