
GSoC 2024 | Benchmark | First objective | Israel Roldan #2565

Merged

Conversation

@airvzxf (Contributor) commented Mar 29, 2024

📝 Description

Type: 🚀 feature

Hi, @andrewfullard, @atharva-2001, and @aoifeboyle.

This pull request contains my work for the GSoC 2024 application; it is for the benchmark project. Thanks in advance.

GitHub Action to store more than 5 commits of benchmarks.

The first goal is: “You should improve the GitHub Action to store more than just the recent 5 commits of benchmarks to provide an extended history”.

The solution to this problem is described in the application that I will share with the mentors. While working on the first goal I ran into some issues, so I decided to create a development environment.

Development environment

Commit: edb0e8f

I tried to install all the requirements and noticed that it is quite difficult; some packages had to be installed with Conda. I would rather not mix my personal environment with the TARDIS development environment. For this reason, I created a series of Docker setups providing three types of environment: a static one, where the code lives inside the container; a fragmented (mixed) one; and a fully live one, which uses your local computer (host) to read and store files.

For more details, check the file Dockerfile-benchmarks.bash. Furthermore, since one video is worth more than ten thousand words, I created a YouTube video explaining how to use this development environment.

Docker

Static

Pros:

  • Easy setup.
  • Straight usage.
  • Isolate the results and publish them for ASV.

Cons:

  • Modifications on the local machine (host) are not reflected in the container unless the image is built again.

Fragmented

Pros:

  • It takes the live source code.
  • The “.asv/results” and “.asv/html” folders are created in the code folder (live).
  • The “pkg” and “.asv/env” folders are created in the image and used dynamically by the container.

Cons:

  • If you want to test or introduce changes to “asv.conf.json”, it is not reflected in the code folder; you need to copy and modify this file again.
  • With more than one container, the ASV results and published output are shared in the same place (the live code folder).

Live

Pros:

  • It takes the live source code.
  • “asv.conf.json” can be changed at any moment, and the change is reflected in the containers.

Cons:

  • The “pkg” and “.asv/env” folders are created in the code folder (live). They are gigabytes in size.
  • With more than one container, the ASV folders will be shared in the same place (the live code folder).

Expand the framework and create benchmarks.

Commit: 65ae093

  • You should plan to expand the framework to make it easy for us to add benchmarks to different parts of the TARDIS.
  • You should use the improved framework to add more benchmarks that cover other areas of the TARDIS code.

Results.

The results are in my application, but I converted roughly 56% of the unit tests into time benchmarks. Furthermore, my application lists the challenges of migrating all the unit tests to benchmarks.
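To illustrate the kind of migration described above, here is a minimal, hypothetical sketch (not code from this pull request): the same call is written once as a PyTest test and once as an ASV time benchmark, which is where the duplication discussed later in this thread comes from. load_config and CONFIG_PATH are placeholders, not real TARDIS names.

# Hypothetical illustration of migrating a unit test to a time benchmark.
# `load_config` and CONFIG_PATH are placeholders, not real TARDIS helpers.
CONFIG_PATH = "tardis_configv1_verysimple.yml"


def load_config(path):
    return {"path": path}  # stands in for the function under test


# PyTest version: runs the call once and asserts on the result.
def test_load_config():
    config = load_config(CONFIG_PATH)
    assert config is not None


# ASV version: the same setup and call, but timed instead of asserted.
class BenchmarkConfigReader:
    def time_load_config(self):
        load_config(CONFIG_PATH)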

TARDIS Benchmarking and Performance Improvement.

I ran all the benchmarks against the release tags, then ran some specific benchmarks to find bottlenecks. In the end, the idea is to check whether ASV is useful for detecting performance issues. Spoiler alert: I am not entirely sure that ASV detects them.

I created and uploaded the results of my benchmarks at these links:


All benchmarks in the home page


All benchmarks in the list page


All benchmarks in the regression page


Specific benchmarks page


📌 Resources

N/A.

🚦 Testing

How did you test these changes?

  • Testing pipeline
  • Other method: Create time benchmarks.
  • My changes can't be tested (explain why)

☑️ Checklist

  • I requested two reviewers for this pull request
  • I updated the documentation according to my changes
  • I built the documentation by applying the build_docs label

Note: If you are not allowed to perform any of these actions, ping (@) a contributor.

@andrewfullard (Contributor):

Thank you for this extremely comprehensive pull request. It may take us some time to review this!

codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 0%, with 1 line in your changes missing coverage. Please review.

Project coverage is 67.18%. Comparing base (7db7ef0) to head (558d57c).

❗ Current head 558d57c differs from pull request most recent head c32a7bf. Consider uploading reports for the commit c32a7bf to get more accurate results

Files                                         Patch %   Lines
tardis/io/model/readers/blondin_toymodel.py   0.00%     1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2565   +/-   ##
=======================================
  Coverage   67.18%   67.18%           
=======================================
  Files         173      173           
  Lines       14570    14570           
=======================================
  Hits         9789     9789           
  Misses       4781     4781           

☔ View full report in Codecov by Sentry.

@airvzxf (Contributor, Author) commented Apr 2, 2024

I removed the Dockerfiles in commit 25c5dfd, as requested.

@andrewfullard andrewfullard requested review from aoifeboyle, andrewfullard and AlexHls and removed request for aoifeboyle April 2, 2024 14:15
@tardis-bot (Contributor) commented Apr 2, 2024

*beep* *bop*
Hi human,
I ran benchmarks as you asked comparing master (7db7ef0) and the latest commit (c32a7bf).
Here are the logs produced by ASV.
Results can also be downloaded as artifacts here.
Significantly changed benchmarks:

· No results found

All benchmarks:

· No results found

@andrewfullard (Contributor):

The visualization_convergence_plots benchmarks are empty in your screenshot, is that a bug?

benchmarks/benchmark_base.py (outdated, resolved)
f"{self.example_configuration_dir}/tardis_configv1_verysimple.yml"
)

class RegressionData:
Contributor:

This seems to be a partial duplicate of the existing RegressionData class, why is that?

Contributor Author:

The main problem, in general, is that the unit tests, which are based on PyTest, rely on some specific behavior (tricks). One of these is fixtures, which can provide special functionality for a test class or function; another is the ability to access the parameters passed as arguments. For this reason, most of the PyTest classes were copied and modified.

I remember that I tried to use tardis/tests/fixtures/regression_data.py, but it didn't work.

The best approach would be to improve ASV so it can measure the PyTest tests directly, rather than reworking them into new benchmarks.

ASV discussion from the 2022 summit: airspeed-velocity/asv#1219 “Improving Python benchmarking tooling”, opened by @asmeurer on Oct 6, 2022 (18 comments).

@oscarbenjamin commented on Oct 18, 2022
This actually extends beyond benchmarking to unit tests as well. There is no real need for unit tests and benchmarks to be different things.

My personal comments:

  • Another thing I noticed when migrating the unit tests from PyTest to ASV was that the effort and code were largely duplicated. Basically, my benchmarks are a copy of the PyTest tests adapted to ASV.
  • Therefore, I see more potential in using ASV to create functionality benchmarks instead of replicating unit tests.
  • If ASV can be extended to benchmark unit tests in PyTest, then it would be incredible.

Contributor:

That's very useful information. My intent with the unit test first objective was that our unit tests provide suggested input for functions that make benchmarking them easier for applicants without prior TARDIS knowledge. If this leads to extensive code duplication, then more "integrated" benchmarks may be preferable, that measure multiple functions at once.

@asmeurer:

I don't know the context here but since I was mentioned I will say that in general, benchmarks and tests are very similar in the way they work, so they can probably share some common frameworks. But I would caution against saying that they are the same thing in general. Tests and benchmarks have different goals and trying to squeeze both into one is a bad idea. A good test may be a bad benchmark and a good benchmark may be a bad test.

If anything, a benchmark is like a "test" against performance instead of functionality, but even then, that leads to different requirements. Just as an example, asv benchmarks should work against every historical version of the code, whereas a test only needs to run against the latest version.

Contributor:

Thanks for your insight, Aaron!

Contributor Author:

Thanks, Aaron!

@airvzxf (Contributor, Author) commented Apr 19, 2024

I feel lost among these approaches because I don't know which setup would give the most useful benchmark results.

I was reviewing the article about Pandas regression testing with ASV benchmarking that @epassaro shared. With all of this information, I have these comments:

I was thinking carefully and concluded that we can learn from those who have already walked the path, so I took on the task of searching the Pandas repository to see how they handle benchmarks (for example, the benchmarks versus the unit tests for memory_usage), but I didn't find a pattern to follow. Apparently, the benchmarks are very specific and simple: they normally test a method with parameters, although I didn't see many parameterized benchmarks. Perhaps a good idea would be to schedule a meeting with the Pandas maintainers to see if they would give us 30 minutes or an hour to talk about how to implement our benchmarks.

The second option is to apply all the approaches and learn along the way over months or years. The approaches could be all three of these together:

  • Create benchmarks based on the unit tests.
  • Create some specific unit tests for some specific methods.
  • As you mentioned, Andrew, create benchmarks for flows of functionality (“integrated” benchmarks); a rough sketch of this idea follows this list.
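For illustration only, an “integrated” benchmark along the lines of the third approach might look roughly like the sketch below. This is not code from this pull request; it assumes run_tardis and the verysimple example configuration referenced elsewhere in this thread, and the configuration path is a placeholder.

# Rough sketch of an "integrated" benchmark that times a whole simulation
# flow instead of a single function. The configuration path is a placeholder.
from tardis import run_tardis


class BenchmarkRunTardis:
    def setup(self):
        # setup() runs outside the timed region, so file lookup is not measured
        self.config_path = "tardis_configv1_verysimple.yml"  # placeholder path

    def time_run_tardis(self):
        run_tardis(self.config_path)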

ASV benchmarking page: https://asv-runner.github.io/asv-collection/pandas/

return atomic_data_fname

@property
def tardis_model_config_nlte_root(self):
Contributor:

I would prefer similar properties to be grouped in the class definition order, e.g. all of the tardis_model_config properties together.

Contributor Author:

Yes, they should be grouped into classes.

I noticed that PyTest has utility classes for atomic data and other groups. I put everything in one class for time reasons and because I submitted this as a draft rather than a final pull request.

Contributor:

To be clear, I mean that the order of the properties in the class definition as written in the file would be better in a clear order. Right now they are not clearly ordered by usage or name. If it can be broken up into multiple classes, even better.
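For illustration, the grouping could look roughly like the sketch below. This is hypothetical, not the actual benchmark_base.py; only the property names come from this review, and the cached_property mechanics and class names are assumptions.

# Illustrative sketch of grouping related properties into a helper class.
# The class names are hypothetical; only the property names come from this review.
from functools import cached_property


class TardisConfigGroup:
    """Keeps the tardis config-related properties in one place."""

    def __init__(self, example_configuration_dir):
        self.example_configuration_dir = example_configuration_dir

    @cached_property
    def tardis_config_verysimple(self):
        return f"{self.example_configuration_dir}/tardis_configv1_verysimple.yml"

    @cached_property
    def tardis_model_config_nlte_root(self):
        ...  # placeholder for the existing property body


class BenchmarkBase:
    @cached_property
    def config(self):
        # Benchmarks would access grouped properties as self.config.<name>
        return TardisConfigGroup("path/to/example/configuration")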

Contributor Author:

Could you check my demo proposal in commit 3aa02eb?

If it is useful and there are no comments, I'll continue to refactor the benchmark_base.py and all the benchmarks with this structure.


The benchmarks ran as expected:

· Creating environments                                                                                                             12:20:35 [8/8]
· Discovering benchmarks
· Running 12 total benchmarks (1 commits * 1 environments * 12 benchmarks)
[ 0.00%] · For tardis commit b7e07720 <master>:
[ 0.00%] ·· Benchmarking mamba-py3.12
[ 4.17%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_config_hdf--).....
[25.00%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_lu_config--).
[29.17%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_root_config--).
[33.33%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_section_config--).....
[54.17%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_config_hdf                                         69.4±2ms
[58.33%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_convergence_section_parser                      1.27±0.02μs
[62.50%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_from_config_dict                                   58.2±1ms
[66.67%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_model_section_config                               128±10ms
[70.83%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_root_exc_section_config               22.3±0.4s
[75.00%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_lu_config                     21.7±0.1s
[79.17%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_root_config                  21.4±0.01s
[83.33%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_section_config                                    ok
[83.33%] ··· ================= ============
               Plasma values
             ----------------- ------------
              initial_t_inner   60.9±0.3ms
               initial_t_rad     61.7±1ms
             ================= ============

[87.50%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_spectrum_section_config                            64.3±1ms
[91.67%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_supernova_section_config                            131±2ms
[95.83%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_coll_exc_deexc_matrix                                                 ok
[95.83%] ··· ============================================================================================================================== =============
                                                                         Matrix
             ------------------------------------------------------------------------------------------------------------------------------ -------------
                          {'coll_exc_coeff_values': [1, -2, 3], 'coll_deexc_coeff_values': [4, 9, 10], 'number_of_levels': 3}                1.63±0.09ms
              {'coll_exc_coeff_values': [0.21, 0.045, 0.1234], 'coll_deexc_coeff_values': [0.7865, 0.987, 0.00123], 'number_of_levels': 3}   1.57±0.04ms
             ============================================================================================================================== =============

[100.00%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_prepare_bound_bound_rate_matrix                                  140±2ms

real    5m32.771s
user    5m40.852s
sys     1m52.319s

benchmarks/benchmark_base.py (resolved)
return self.RegressionData()

@property
def packet(self):
Contributor:

This and similar properties duplicate a lot of pytest fixtures. Could they reference them instead? This could become unmaintainable.

Contributor Author:

I remember that I tried to reuse them, but it was not successful for a reason that I don't remember. Shall we try again with more context and knowledge?

Contributor:

Please do

Contributor Author:

This is my best proposal for reusing the PyTest RegressionData class: ab98e61.

The big problem is that when PyTest runs, it gathers many values and a lot of information that is passed to and used by the fixtures. It is challenging to obtain these values when running ASV; we need to create fake objects and hard-code some values.

For this demo, I ran ASV with asv run --bench "time_prepare_bound_bound_rate_matrix" --bench "time_coll_exc_deexc_matrix". It works well: basically, it takes the regression data, does some processing, and compares the results of the processing with the regression data.

The output was:

· Creating environments
· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For tardis commit b7e07720 <master>:
[ 0.00%] ·· Benchmarking mamba-py3.12
[25.00%] ··· Running (plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_coll_exc_deexc_matrix--)..
[75.00%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_coll_exc_deexc_matrix                                                 ok
[75.00%] ··· ============================================================================================================================== =============
                                                                         Matrix
             ------------------------------------------------------------------------------------------------------------------------------ -------------
                          {'coll_exc_coeff_values': [1, -2, 3], 'coll_deexc_coeff_values': [4, 9, 10], 'number_of_levels': 3}                1.60±0.04ms
              {'coll_exc_coeff_values': [0.21, 0.045, 0.1234], 'coll_deexc_coeff_values': [0.7865, 0.987, 0.00123], 'number_of_levels': 3}   1.63±0.04ms
             ============================================================================================================================== =============

[100.00%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_prepare_bound_bound_rate_matrix                                  140±1ms

What are the main changes?

In the file benchmarks/benchmark_base.py, I created an inner class CustomPyTestRequest, which replaces the request argument of the function def regression_data(request) in the file tardis/tests/fixtures/regression_data.py.

In the base benchmark, I created an instance of the class CustomPyTestRequest and passed the parameters for module name, node name, regression data directory and path.

When a benchmark has parameters (@parameterize), it gets messier because we need to work out which parameter set the benchmark is currently running with. In the demo, I created a condition to detect the ID of the parameter:

@parameterize({'Matrix': [
    {
        "coll_exc_coeff_values": [1, -2, 3],
        "coll_deexc_coeff_values": [4, 9, 10],
        "number_of_levels": 3,
    },
    {
        "coll_exc_coeff_values": [0.21, 0.045, 0.1234],
        "coll_deexc_coeff_values": [0.7865, 0.987, 0.00123],
        "number_of_levels": 3,
    },
]})
def time_coll_exc_deexc_matrix(self, matrix):
    coll_exc_coeff_values = matrix["coll_exc_coeff_values"]
    coll_deexc_coeff_values = matrix["coll_deexc_coeff_values"]
    number_of_levels = matrix["number_of_levels"]
    test_parameter_id = 0
    if coll_exc_coeff_values != [1, -2, 3]:
        test_parameter_id = 1

Then, the parameter ID is substituted into the node name:

custom_request = self.CustomPyTestRequest(
    f"test_coll_exc_deexc_matrix__coll_exc_coeff_values{test_parameter_id}-coll_deexc_coeff_values{test_parameter_id}-3__",
    …
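For reference, a minimal stand-in for the PyTest request object could look roughly like the sketch below. This is hypothetical and not the code from commit ab98e61; it assumes the fixture only needs the node name, the module name, and the regression-data locations mentioned above, and all concrete values are placeholders.

# Hypothetical sketch of a stand-in for PyTest's `request` object, built from
# the four values mentioned above: module name, node name, regression-data
# directory and path. The attribute names mimic what fixtures commonly read
# (request.node.name, request.node.module.__name__); the exact attributes
# needed by tardis/tests/fixtures/regression_data.py may differ.
from types import SimpleNamespace


class CustomPyTestRequest:
    def __init__(self, node_name, module_name, regression_data_dir, regression_data_path):
        module = SimpleNamespace(__name__=module_name)
        self.node = SimpleNamespace(name=node_name, module=module)
        self.regression_data_dir = regression_data_dir
        self.regression_data_path = regression_data_path


# Usage, mirroring the node-name substitution shown above (values are placeholders):
test_parameter_id = 0
custom_request = CustomPyTestRequest(
    node_name=(
        f"test_coll_exc_deexc_matrix__coll_exc_coeff_values{test_parameter_id}"
        f"-coll_deexc_coeff_values{test_parameter_id}-3__"
    ),
    module_name="plasma_nlte_excitation_module",  # placeholder module name
    regression_data_dir="path/to/tardis-regression-data",  # placeholder
    regression_data_path="path/to/tardis-regression-data/expected_output",  # placeholder
)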

Contributor:

Can you put this comment into a GitHub discussion to preserve it?

benchmarks/benchmark_template.py (outdated, resolved)
tardis/io/model/readers/artis.py (resolved)
tardis/io/model/readers/stella.py (resolved)
tardis/io/model/readers/blondin_toymodel.py (outdated, resolved)
tardis/util/base.py (resolved)
@airvzxf (Contributor, Author) commented Apr 15, 2024

The visualization_convergence_plots benchmarks are empty in your screenshot, is that a bug?

No. ASV usually keeps the test names: if I don't delete all the .asv folders and run from scratch, it shows these blank tests because I deleted or skipped them.

@airvzxf (Contributor, Author) left a comment:

I responded to all the comments in the review.

benchmarks/asv_by_release.bash (outdated, resolved)
benchmarks/benchmark_template.py (outdated, resolved)
benchmarks/asv_by_release.bash (outdated, resolved)
benchmarks/benchmark_base.py (outdated, resolved)
benchmarks/benchmark_base.py (outdated, resolved)

tardis/io/model/readers/artis.py (resolved)
tardis/io/model/readers/stella.py (resolved)
tardis/io/model/readers/blondin_toymodel.py (outdated, resolved)
tardis/util/base.py (resolved)
@airvzxf force-pushed the gsoc-benchmark-first-objective-iarv branch from 6e0fd31 to 27dddf9 on April 19, 2024 00:56
@airvzxf force-pushed the gsoc-benchmark-first-objective-iarv branch from 27dddf9 to 44287de on April 19, 2024 00:59
@andrewfullard (Contributor):

Looks like the PR is still failing codestyle. Test failures are less of a concern because of the reference data changes.

@andrewfullard andrewfullard self-requested a review April 24, 2024 17:31
@andrewfullard andrewfullard marked this pull request as ready for review April 25, 2024 14:46
@airvzxf (Contributor, Author) commented Apr 30, 2024

Looks like the PR is still failing codestyle. Test failures are less of a concern because of the reference data changes.

Done in the commit: 6021a76.

Done in the commit: 558d57c.

@airvzxf (Contributor, Author) left a comment:

I have addressed all the comments and suggestions from the review. If I missed anything, or if more work is needed, feel free to let me know and I will be happy to make the changes.

@andrewfullard (Contributor) previously approved these changes on May 2, 2024, leaving this comment:

Thanks for all the hard work on this!

@andrewfullard andrewfullard self-requested a review May 3, 2024 14:42
@Knights-Templars (Member) left a comment:

Please remove all the benchmarks except montecarlo_numba and run_tardis. Remove any files that are commented out.

@airvzxf (Contributor, Author) commented May 6, 2024

Please remove all the benchmarks except montecarlo_numba and run_tardis. Remove any files that are commented out.

Done in the commit: c32a7bf.

Questions:

  • I didn't delete the file asv_by_release.bash. Do you want it deleted?

Updates:

  • I deleted three Monte Carlo Numba benchmarks: two with the suffix _x, which were not implemented, and one with the suffix _p, which means partially implemented.
  • I deleted run_tardis_full.py and kept only run_tardis.py.
  • I didn't clean up benchmark_base.py because it is a good starting point for the parent class of the benchmarks.

@Knights-Templars Knights-Templars enabled auto-merge (squash) May 6, 2024 14:20
r_packet_transport.move_packet_across_shell_boundary(
packet, delta_shell, no_of_shells
)
assert packet.status == r_packet.PacketStatus.EMITTED
Member:

do we want assert statements?

@Knights-Templars Knights-Templars merged commit 328ec77 into tardis-sn:master May 6, 2024
9 checks passed