
GSoC 2024 | Benchmark | First objective | Israel Roldan #2565

Merged

Conversation

@airvzxf (Contributor) commented Mar 29, 2024

📝 Description

Type: 🚀 feature

Hi, @andrewfullard, @atharva-2001, and @aoifeboyle.

This pull request contains my work for the GSoC 2024 application; it is for the benchmark project. Thanks in advance.

GitHub Action to store more than 5 commits of benchmarks.

The first goal is: “You should improve the GitHub Action to store more than just the recent 5 commits of benchmarks to provide an extended history”.

The solution to this problem is described in the application that I will share with the mentors. While working on the first goal I ran into some issues, so I decided to create a development environment.

Development environment

Commit: edb0e8f

I tried to install all the requirements and noticed that it is quite difficult; some packages had to be installed with Conda. I would rather not mix my personal environment with the TARDIS development environment. For this reason, I created a series of Docker setups providing three types of environment: a static one, where the code lives inside the container; a fragmented (mixed) one; and a fully live one, which uses your local computer (host) to read and store files.

For more details, check the file Dockerfile-benchmarks.bash. Furthermore, since one video is worth more than ten thousand words, I created a YouTube video explaining how to use this development environment.

Docker

Static

Pros:

  • Easy setup.
  • Straight usage.
  • Isolate the results and publish them for ASV.

Cons:

  • Modifications on the local machine (host) are not reflected in the container unless the image is built again.

Fragmented

Pros:

  • It takes the live source code.
  • The “.asv/results” and “.asv/html” folders are created in the code folder (live).
  • The “pkg” and “.asv/env” folders are created in the image and used dynamically by the container.

Cons:

  • If you want to test or introduce changes to “asv.conf.json”, it is not reflected in the code folder; you need to copy and modify this file again.
  • With more than one container, the ASV results and published output are shared in the same place (the live code folder).

Live

Pros:

  • It takes the live source code.
  • “asv.conf.json” can be changed at any moment, and the change is reflected in the containers.

Cons:

  • The “pkg” and “.asv/env” folders are created in the code folder (live). They are gigabytes in size.
  • With more than one container, the ASV folders will be shared in the same place (the live code folder).

Expand the framework and create benchmarks.

Commit: 65ae093

  • You should plan to expand the framework to make it easy for us to add benchmarks to different parts of the TARDIS.
  • You should use the improved framework to add more benchmarks that cover other areas of the TARDIS code.

Results.

The results are in my application, but I converted roughly 56% of the unit tests into time benchmarks. Furthermore, my application lists the challenges of migrating all the unit tests to benchmarks.
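To illustrate the kind of migration described above, here is a minimal, hypothetical sketch (not code from this pull request): the same call is written once as a PyTest test and once as an ASV time benchmark, which is where the duplication discussed later in this thread comes from. load_config and CONFIG_PATH are placeholders, not real TARDIS names.

# Hypothetical illustration of migrating a unit test to a time benchmark.
# `load_config` and CONFIG_PATH are placeholders, not real TARDIS helpers.
CONFIG_PATH = "tardis_configv1_verysimple.yml"


def load_config(path):
    return {"path": path}  # stands in for the function under test


# PyTest version: runs the call once and asserts on the result.
def test_load_config():
    config = load_config(CONFIG_PATH)
    assert config is not None


# ASV version: the same setup and call, but timed instead of asserted.
class BenchmarkConfigReader:
    def time_load_config(self):
        load_config(CONFIG_PATH)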

TARDIS Benchmarking and Performance Improvement.

I ran all the benchmarks against the release tags, then ran some specific benchmarks to find bottlenecks. In the end, the idea is to check whether ASV is useful for detecting performance issues. Spoiler alert: I am not entirely sure that ASV detects them.

I created and uploaded the results of my benchmarks at these links:


All benchmarks in the home page


All benchmarks in the list page


All benchmarks in the regression page


Specific benchmarks page


📌 Resources

N/A.

🚦 Testing

How did you test these changes?

  • Testing pipeline
  • Other method: Create time benchmarks.
  • My changes can't be tested (explain why)

☑️ Checklist

  • I requested two reviewers for this pull request
  • I updated the documentation according to my changes
  • I built the documentation by applying the build_docs label

Note: If you are not allowed to perform any of these actions, ping (@) a contributor.

@andrewfullard (Contributor):

Thank you for this extremely comprehensive pull request. It may take us some time to review this!

codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 0%, with 1 line in your changes missing coverage. Please review.

Project coverage is 67.18%. Comparing base (7db7ef0) to head (558d57c).

❗ Current head 558d57c differs from pull request most recent head c32a7bf. Consider uploading reports for the commit c32a7bf to get more accurate results

Files                                         Patch %   Lines
tardis/io/model/readers/blondin_toymodel.py   0.00%     1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2565   +/-   ##
=======================================
  Coverage   67.18%   67.18%           
=======================================
  Files         173      173           
  Lines       14570    14570           
=======================================
  Hits         9789     9789           
  Misses       4781     4781           

☔ View full report in Codecov by Sentry.

@airvzxf (Contributor, Author) commented Apr 2, 2024

I removed the Dockerfiles in commit 25c5dfd, as requested.

@andrewfullard andrewfullard requested review from aoifeboyle, andrewfullard and AlexHls and removed request for aoifeboyle April 2, 2024 14:15
@tardis-bot (Contributor) commented Apr 2, 2024

*beep* *bop*
Hi human,
I ran benchmarks as you asked comparing master (7db7ef0) and the latest commit (c32a7bf).
Here are the logs produced by ASV.
Results can also be downloaded as artifacts here.
Significantly changed benchmarks:

· No results found

All benchmarks:

· No results found

@andrewfullard (Contributor):

The visualization_convergence_plots benchmarks are empty in your screenshot, is that a bug?

benchmarks/benchmark_base.py (outdated, resolved)
f"{self.example_configuration_dir}/tardis_configv1_verysimple.yml"
)

class RegressionData:
Contributor:

This seems to be a partial duplicate of the existing RegressionData class, why is that?

Contributor Author:

The main problem, in general, is that the unit tests, which are based on PyTest, rely on some specific behavior (tricks). One of these is fixtures, which can provide special functionality for a test class or function; another is the ability to access the parameters passed as arguments. For this reason, most of the PyTest classes were copied and modified.

I remember that I tried to use tardis/tests/fixtures/regression_data.py, but it didn't work.

The best approach would be to improve ASV so it can measure the PyTest tests directly, rather than reworking them into new benchmarks.

ASV discussion from the 2022 summit: airspeed-velocity/asv#1219 “Improving Python benchmarking tooling”, opened by @asmeurer on Oct 6, 2022 (18 comments).

@oscarbenjamin commented on Oct 18, 2022
This actually extends beyond benchmarking to unit tests as well. There is no real need for unit tests and benchmarks to be different things.

My personal comments:

  • Another thing I noticed when migrating the unit tests from PyTest to ASV was that the effort and code were largely duplicated. Basically, my benchmarks are a copy of the PyTest tests adapted to ASV.
  • Therefore, I see more potential in using ASV to create functionality benchmarks instead of replicating unit tests.
  • If ASV can be extended to benchmark unit tests in PyTest, then it would be incredible.

Contributor:

That's very useful information. My intent with the unit test first objective was that our unit tests provide suggested input for functions that make benchmarking them easier for applicants without prior TARDIS knowledge. If this leads to extensive code duplication, then more "integrated" benchmarks may be preferable, that measure multiple functions at once.

@asmeurer:

I don't know the context here but since I was mentioned I will say that in general, benchmarks and tests are very similar in the way they work, so they can probably share some common frameworks. But I would caution against saying that they are the same thing in general. Tests and benchmarks have different goals and trying to squeeze both into one is a bad idea. A good test may be a bad benchmark and a good benchmark may be a bad test.

If anything, a benchmark is like a "test" against performance instead of functionality, but even then, that leads to different requirements. Just as an example, asv benchmarks should work against every historical version of the code, whereas a test only needs to run against the latest version.

Contributor:

Thanks for your insight, Aaron!

Contributor Author:

Thanks, Aaron!

@airvzxf (Contributor, Author) commented Apr 19, 2024

I feel lost among these approaches because I don't know which setup would give the most useful benchmark results.

I was reviewing the article about Pandas regression testing with ASV benchmarking that @epassaro shared. With all of this information, I have these comments:

I was thinking carefully and concluded that we can learn from those who have already walked the path, so I took on the task of searching the Pandas repository to see how they handle benchmarks (for example, the benchmarks versus the unit tests for memory_usage), but I didn't find a pattern to follow. Apparently, the benchmarks are very specific and simple: they normally test a method with parameters, although I didn't see many parameterized benchmarks. Perhaps a good idea would be to schedule a meeting with the Pandas maintainers to see if they would give us 30 minutes or an hour to talk about how to implement our benchmarks.

The second option is to apply all the approaches and learn along the way over months or years. The approaches could be all three of these together:

  • Create benchmarks based on the unit tests.
  • Create some specific unit tests for some specific methods.
  • As you mentioned, Andrew, create benchmarks for flows of functionality (“integrated” benchmarks); a rough sketch of this idea follows this list.
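For illustration only, an “integrated” benchmark along the lines of the third approach might look roughly like the sketch below. This is not code from this pull request; it assumes run_tardis and the verysimple example configuration referenced elsewhere in this thread, and the configuration path is a placeholder.

# Rough sketch of an "integrated" benchmark that times a whole simulation
# flow instead of a single function. The configuration path is a placeholder.
from tardis import run_tardis


class BenchmarkRunTardis:
    def setup(self):
        # setup() runs outside the timed region, so file lookup is not measured
        self.config_path = "tardis_configv1_verysimple.yml"  # placeholder path

    def time_run_tardis(self):
        run_tardis(self.config_path)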

ASV benchmarking page: https://asv-runner.github.io/asv-collection/pandas/

return atomic_data_fname

@property
def tardis_model_config_nlte_root(self):
Contributor:

I would prefer similar properties to be grouped in the class definition order, e.g. all of the tardis_model_config properties together.

Contributor Author:

Yes, they should be grouped into classes.

I noticed that PyTest has utility classes for atomic data and other groups. I put everything in one class for time reasons and because I submitted this as a draft rather than a final pull request.

Contributor:

To be clear, I mean that the order of the properties in the class definition as written in the file would be better in a clear order. Right now they are not clearly ordered by usage or name. If it can be broken up into multiple classes, even better.
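For illustration, the grouping could look roughly like the sketch below. This is hypothetical, not the actual benchmark_base.py; only the property names come from this review, and the cached_property mechanics and class names are assumptions.

# Illustrative sketch of grouping related properties into a helper class.
# The class names are hypothetical; only the property names come from this review.
from functools import cached_property


class TardisConfigGroup:
    """Keeps the tardis config-related properties in one place."""

    def __init__(self, example_configuration_dir):
        self.example_configuration_dir = example_configuration_dir

    @cached_property
    def tardis_config_verysimple(self):
        return f"{self.example_configuration_dir}/tardis_configv1_verysimple.yml"

    @cached_property
    def tardis_model_config_nlte_root(self):
        ...  # placeholder for the existing property body


class BenchmarkBase:
    @cached_property
    def config(self):
        # Benchmarks would access grouped properties as self.config.<name>
        return TardisConfigGroup("path/to/example/configuration")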

Contributor Author:

Could you check my demo proposal in commit 3aa02eb?

If it is useful and there are no comments, I'll continue to refactor the benchmark_base.py and all the benchmarks with this structure.


The benchmarks ran as expected:

· Creating environments                                                                                                             12:20:35 [8/8]
· Discovering benchmarks
· Running 12 total benchmarks (1 commits * 1 environments * 12 benchmarks)
[ 0.00%] · For tardis commit b7e07720 <master>:
[ 0.00%] ·· Benchmarking mamba-py3.12
[ 4.17%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_config_hdf--).....
[25.00%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_lu_config--).
[29.17%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_root_config--).
[33.33%] ··· Running (io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_section_config--).....
[54.17%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_config_hdf                                         69.4±2ms
[58.33%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_convergence_section_parser                      1.27±0.02μs
[62.50%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_from_config_dict                                   58.2±1ms
[66.67%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_model_section_config                               128±10ms
[70.83%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_root_exc_section_config               22.3±0.4s
[75.00%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_lu_config                     21.7±0.1s
[79.17%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_nlte_section_root_config                  21.4±0.01s
[83.33%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_plasma_section_config                                    ok
[83.33%] ··· ================= ============
               Plasma values
             ----------------- ------------
              initial_t_inner   60.9±0.3ms
               initial_t_rad     61.7±1ms
             ================= ============

[87.50%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_spectrum_section_config                            64.3±1ms
[91.67%] ··· io_configuration_config_reader.BenchmarkIoConfigurationConfigReader.time_supernova_section_config                            131±2ms
[95.83%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_coll_exc_deexc_matrix                                                 ok
[95.83%] ··· ============================================================================================================================== =============
                                                                         Matrix
             ------------------------------------------------------------------------------------------------------------------------------ -------------
                          {'coll_exc_coeff_values': [1, -2, 3], 'coll_deexc_coeff_values': [4, 9, 10], 'number_of_levels': 3}                1.63±0.09ms
              {'coll_exc_coeff_values': [0.21, 0.045, 0.1234], 'coll_deexc_coeff_values': [0.7865, 0.987, 0.00123], 'number_of_levels': 3}   1.57±0.04ms
             ============================================================================================================================== =============

[100.00%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_prepare_bound_bound_rate_matrix                                  140±2ms

real    5m32.771s
user    5m40.852s
sys     1m52.319s

benchmarks/benchmark_base.py (resolved)
return self.RegressionData()

@property
def packet(self):
Contributor:

This and similar properties duplicate a lot of pytest fixtures. Could they reference them instead? This could become unmaintainable.

Contributor Author:

I remember that I tried to reuse them, but it was not successful for a reason that I don't remember. Shall we try again with more context and knowledge?

Contributor:

Please do

Contributor Author:

This is my best proposal for reusing the PyTest RegressionData class: ab98e61.

The big problem is that when PyTest runs, it gathers many values and a lot of information that is passed to and used by the fixtures. It is challenging to obtain these values when running ASV; we need to create fake objects and hard-code some values.

For this demo, I ran ASV with asv run --bench "time_prepare_bound_bound_rate_matrix" --bench "time_coll_exc_deexc_matrix". It works well: basically, it takes the regression data, does some processing, and compares the results of the processing with the regression data.

The output was:

· Creating environments
· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For tardis commit b7e07720 <master>:
[ 0.00%] ·· Benchmarking mamba-py3.12
[25.00%] ··· Running (plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_coll_exc_deexc_matrix--)..
[75.00%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_coll_exc_deexc_matrix                                                 ok
[75.00%] ··· ============================================================================================================================== =============
                                                                         Matrix
             ------------------------------------------------------------------------------------------------------------------------------ -------------
                          {'coll_exc_coeff_values': [1, -2, 3], 'coll_deexc_coeff_values': [4, 9, 10], 'number_of_levels': 3}                1.60±0.04ms
              {'coll_exc_coeff_values': [0.21, 0.045, 0.1234], 'coll_deexc_coeff_values': [0.7865, 0.987, 0.00123], 'number_of_levels': 3}   1.63±0.04ms
             ============================================================================================================================== =============

[100.00%] ··· plasma_nlte_excitation_x.BenchmarkPlasmaNlteExcitation.time_prepare_bound_bound_rate_matrix                                  140±1ms

What are the main changes?

In the file benchmarks/benchmark_base.py, I created an inner class CustomPyTestRequest, which replaces the request argument of the function def regression_data(request) in the file tardis/tests/fixtures/regression_data.py.

In the base benchmark, I created an instance of the class CustomPyTestRequest and passed the parameters for module name, node name, regression data directory and path.

When a benchmark has parameters (@parameterize), it gets messier because we need to work out which parameter set the benchmark is currently running with. In the demo, I created a condition to detect the ID of the parameter:

@parameterize({'Matrix': [
    {
        "coll_exc_coeff_values": [1, -2, 3],
        "coll_deexc_coeff_values": [4, 9, 10],
        "number_of_levels": 3,
    },
    {
        "coll_exc_coeff_values": [0.21, 0.045, 0.1234],
        "coll_deexc_coeff_values": [0.7865, 0.987, 0.00123],
        "number_of_levels": 3,
    },
]})
def time_coll_exc_deexc_matrix(self, matrix):
    coll_exc_coeff_values = matrix["coll_exc_coeff_values"]
    coll_deexc_coeff_values = matrix["coll_deexc_coeff_values"]
    number_of_levels = matrix["number_of_levels"]
    test_parameter_id = 0
    if coll_exc_coeff_values != [1, -2, 3]:
        test_parameter_id = 1

Then, the parameter ID is substituted into the node name:

custom_request = self.CustomPyTestRequest(
    f"test_coll_exc_deexc_matrix__coll_exc_coeff_values{test_parameter_id}-coll_deexc_coeff_values{test_parameter_id}-3__",
    …
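For reference, a minimal stand-in for the PyTest request object could look roughly like the sketch below. This is hypothetical and not the code from commit ab98e61; it assumes the fixture only needs the node name, the module name, and the regression-data locations mentioned above, and all concrete values are placeholders.

# Hypothetical sketch of a stand-in for PyTest's `request` object, built from
# the four values mentioned above: module name, node name, regression-data
# directory and path. The attribute names mimic what fixtures commonly read
# (request.node.name, request.node.module.__name__); the exact attributes
# needed by tardis/tests/fixtures/regression_data.py may differ.
from types import SimpleNamespace


class CustomPyTestRequest:
    def __init__(self, node_name, module_name, regression_data_dir, regression_data_path):
        module = SimpleNamespace(__name__=module_name)
        self.node = SimpleNamespace(name=node_name, module=module)
        self.regression_data_dir = regression_data_dir
        self.regression_data_path = regression_data_path


# Usage, mirroring the node-name substitution shown above (values are placeholders):
test_parameter_id = 0
custom_request = CustomPyTestRequest(
    node_name=(
        f"test_coll_exc_deexc_matrix__coll_exc_coeff_values{test_parameter_id}"
        f"-coll_deexc_coeff_values{test_parameter_id}-3__"
    ),
    module_name="plasma_nlte_excitation_module",  # placeholder module name
    regression_data_dir="path/to/tardis-regression-data",  # placeholder
    regression_data_path="path/to/tardis-regression-data/expected_output",  # placeholder
)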

Contributor:

Can you put this comment into a GitHub discussion to preserve it?

benchmarks/benchmark_template.py (outdated, resolved)
tardis/io/model/readers/artis.py (resolved)
tardis/io/model/readers/stella.py (resolved)
tardis/io/model/readers/blondin_toymodel.py (outdated, resolved)
tardis/util/base.py (resolved)
@airvzxf (Contributor, Author) commented Apr 15, 2024

The visualization_convergence_plots benchmarks are empty in your screenshot, is that a bug?

No. ASV usually keeps the test names: if I don't delete all the .asv folders and run from scratch, it shows these blank tests because I deleted or skipped them.

@airvzxf (Contributor, Author) left a comment:

I responded to all the comments in the review.

benchmarks/asv_by_release.bash (outdated, resolved)
benchmarks/benchmark_template.py (outdated, resolved)
benchmarks/asv_by_release.bash (outdated, resolved)
benchmarks/benchmark_base.py (outdated, resolved)
benchmarks/benchmark_base.py (outdated, resolved)

tardis/io/model/readers/artis.py (resolved)
tardis/io/model/readers/stella.py (resolved)
tardis/io/model/readers/blondin_toymodel.py (outdated, resolved)
tardis/util/base.py (resolved)
@airvzxf force-pushed the gsoc-benchmark-first-objective-iarv branch from 6e0fd31 to 27dddf9 on April 19, 2024 00:56
@airvzxf force-pushed the gsoc-benchmark-first-objective-iarv branch from 27dddf9 to 44287de on April 19, 2024 00:59
@andrewfullard (Contributor):

Looks like the PR is still failing codestyle. Test failures are less of a concern because of the reference data changes.

@andrewfullard andrewfullard self-requested a review April 24, 2024 17:31
@andrewfullard andrewfullard marked this pull request as ready for review April 25, 2024 14:46
@airvzxf (Contributor, Author) commented Apr 30, 2024

Looks like the PR is still failing codestyle. Test failures are less of a concern because of the reference data changes.

Done in the commit: 6021a76.

Done in the commit: 558d57c.

@airvzxf (Contributor, Author) left a comment:

I have addressed all the comments and suggestions from the review. If I missed anything, or if more work is needed, feel free to let me know and I will be happy to make the changes.

@andrewfullard (Contributor) previously approved these changes on May 2, 2024, leaving this comment:

Thanks for all the hard work on this!

@andrewfullard andrewfullard self-requested a review May 3, 2024 14:42
@Knights-Templars (Member) left a comment:

Please remove all the benchmarks except montecarlo_numba and run_tardis. Remove any files that are commented out.

@airvzxf (Contributor, Author) commented May 6, 2024

Please remove all the benchmarks except montecarlo_numba and run_tardis. Remove any files that are commented out.

Done in the commit: c32a7bf.

Questions:

  • I didn't delete the file asv_by_release.bash. Do you want it deleted?

Updates:

  • I deleted three Monte Carlo Numba benchmarks: two with the suffix _x, which were not implemented, and one with the suffix _p, which means partially implemented.
  • I deleted run_tardis_full.py and kept only run_tardis.py.
  • I didn't clean up benchmark_base.py because it is a good starting point for the parent class of the benchmarks.

@Knights-Templars Knights-Templars enabled auto-merge (squash) May 6, 2024 14:20
r_packet_transport.move_packet_across_shell_boundary(
packet, delta_shell, no_of_shells
)
assert packet.status == r_packet.PacketStatus.EMITTED
Member:

do we want assert statements?

@Knights-Templars Knights-Templars merged commit 328ec77 into tardis-sn:master May 6, 2024
9 checks passed