
Tool meister work #6

Closed

Conversation

Maxusmusti

Test PR from new working setup, includes V1 prometheus and node-exporter local work. Next steps include:

  1. Add begin and end steps to the TM workflow
  2. Rework the node-exporter tool to start at begin and end at end
  3. Use group data to modify the Prometheus configuration (making it work on all remote hosts/jobs)
  4. Optimize the efficiency of the Prometheus launch through the podman Python package
  5. Alter node-exporter to no longer require a local installation

Maxusmusti and others added 30 commits June 17, 2020 02:42
Use setuptools to package up the pbench Python modules,
classes, and scripts. This makes it easier for users and
developers to consume pbench.

This commit also does the following:

- Set up and run Python unittests with tox
- Move server bits to the pbench namespace
- Move configtools to the pbench package
- Move s3backup to the pbench package
- Remove $PYTHONPATH usage
- Update unittests due to the changes
- Address some flake8 problems found

Signed-off-by: Charles Short <[email protected]>
Added pbench-server to be executed by systemctl.
Additional changes required on RPM build spec.

feat: agent side of transfer REST

feat: moved util-scripts to own utils dir

final prototype for move_results

feat: moved read in chunks and post to copy_results_tb

reverted util-scripts changes

fixed revert

feat: refactored pbench-make-result-tb to python

feat: converted copy-results-tb

added get on pbench-server for check sum

fix: formatting on logger

feat: added first test for pbench-server

added test for post

added tests plus mod gitignore

feat: fixed and moved tests around for pbench-server REST

feat: fixes to tests plus metadata.log fixture cleanup

general refactoring

feat: refactored and added tests

feat: general tests refactoring

removed config invocation for agent

added host_info entrypoint test

feat: tests refactoring

feat: added pytests to travis

fix: travisyml conflicts

fix: removed new line on init

fix: travis yml python path

fix: travis yml python script

fix: added missing libraries for travis

fix: added werkzeug library to travis install

fix: moved python rpm package to be systems one

fix: added system site packages for travis virtualenv

feat: added tests and pytest.ini

fix: log formatting plus added tests

fix: fixes to TCRT tests

Addressing review comments

Moved agent utils to lib/pbench and config amendments

fix to travis.yml and tests references

general refactoring

Added libs to pythonpath for travis

removed unused import

fix to travis configtools import

Added _PBENCH_*_CONFIG env variables to travis

Fix for fixtures paths

fix on test fixture variables

fix for travis env yaml

fix for server tests

Refactored config on server side to accommodate pytest

moved logger to init

fixed configtools refactor

fix for config fixture install_dir

Fix for test copy results

fix to server tests

Fixes to server tests

fixed flake8 issues

black lint fixes

made config methods

blacked

Fixes to config

Added argparse for move_results plus removed check entrypoint for server

Addressing comments from PR

Fixes to agent tests

fix for test copy result

Fixes to test move results fixtures

fix to tox pythonpath

fix for pytest execution

Fixed requirements/removed pytest-flask

running pytest via python

Fixes to pythonpath for tox

fixes to tox pythonpath

moved pythonpath to envdir in tox

Added build-system to pyproject.toml

added requires to build-system

Fix configtools import

fix configtools import

moved rpm package to PyPI

removed sitepackages from tox

Fixes to fixtures

removed rpm-py-installer

Removed rpm for host info

modified env from VIRTUAL_ENV to PYTHONPATH

Fix for server test setup

removed fixtures scope plus redundant packages on test-requirements

Added fixture decorator

Added pytest-flask package

fixes to server conftest

blacked

fixes to server tests

Final fixes to tests

blacked

fixes to add-metalog-option

additional fixes
Fixes issue distributed-system-analysis#1530

With this there can be a logging level specific to any script; we
just need to mention it in the config file. If nothing is mentioned,
then a default logging level is used.

Test for checking the working of logging-level is also added in this
commit.
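The per-script lookup with a default fallback described above can be sketched with `configparser`; the section and option names here are illustrative, not the actual pbench config schema:

```python
import configparser
import logging

DEFAULT_LEVEL = "INFO"

def get_logging_level(config: configparser.ConfigParser, script: str) -> int:
    """Return the logging level for `script`, falling back to the default."""
    # A script-specific section wins; otherwise use the global default.
    level_name = config.get(
        script, "logging_level",
        fallback=config.get("logging", "logging_level", fallback=DEFAULT_LEVEL),
    )
    return getattr(logging, level_name.upper(), logging.INFO)

cfg = configparser.ConfigParser()
cfg.read_string("""
[logging]
logging_level = WARNING

[pbench-index]
logging_level = DEBUG
""")
```

`ConfigParser.get` with `fallback=` covers both a missing option and a missing section, which is what makes the two-level fallback compact.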
Fixes distributed-system-analysis#1585

This PR deals with using f-strings wherever possible in the
following scripts:

* pbench-report-status.py
* pbench-base.py
* pbench-index.py
* pbench-satellite-state-change.py
* pbench-verify-backup-tarballs.py
* pbench-backup-tarballs.py
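For illustration, a before/after of the kind of conversion this change makes; the variable names are hypothetical, not from the scripts listed:

```python
name, count = "pbench-index", 3

# Old-style interpolation and str.format ...
old_style = "%s processed %d tarballs" % (name, count)
format_style = "{} processed {} tarballs".format(name, count)

# ... replaced by the equivalent f-string.
f_style = f"{name} processed {count} tarballs"

assert old_style == format_style == f_style
```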
This commit does several things at once:

- Simplify tox a bit more. Remove the default py3 unittests and
  split out the py3-agent and py3-server environments. Updated
  the corresponding entries in .travis.yml.
- Move the existing Python server bits to their own directory. This
  was done so the code is better structured.
- Stubbed in a pbench-cli unified CLI for pbench. Added click as a
  requirement (https://click.palletsprojects.com/en/7.x/). Made the
  command installable via setuptools.
- Moved the agent bits to their own directory. This was done so we
  can separate the server and agent code.
- Centralize exception handling
- Make Python scripts installable via setuptools.
- Fix agent unittest failures.

Signed-off-by: Charles Short <[email protected]>
Unify the server/agent getconf.py so that we aren't re-using the same
code with minimal difference. Also add unittests to test the output.

Signed-off-by: Charles Short <[email protected]>
Use pbench-config instead of getconf.py in agent scripts.

Signed-off-by: Charles Short <[email protected]>
Use pbench-config to read configuration file with shell scripts for the
pbench-server.

Signed-off-by: Charles Short <[email protected]>
Added unit tests for the AgentConfig class. Also raise a BadConfig
exception when neither the pbench-agent section nor the results section
of the configuration file is found.

Signed-off-by: Charles Short <[email protected]>
Search for the agent configuration if no configuration is specified.
If the configuration is not specified, we check the _PBENCH_AGENT_CONFIG
environment variable, then check for the configuration in
agent/config, and otherwise error out. Once we have loaded a
configuration file, make sure that it is valid.
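The lookup order described above can be sketched as follows; the function name, default filename, and directory layout are illustrative, not the actual pbench-agent API:

```python
import os

class BadConfig(Exception):
    """Raised when no valid agent configuration can be found."""

def find_agent_config(cfg_path=None, default_dir="agent/config"):
    # 1. An explicitly specified configuration wins.
    if cfg_path:
        return cfg_path
    # 2. Fall back to the _PBENCH_AGENT_CONFIG environment variable.
    env_cfg = os.environ.get("_PBENCH_AGENT_CONFIG")
    if env_cfg:
        return env_cfg
    # 3. Look in the default agent/config location, else error out.
    candidate = os.path.join(default_dir, "pbench-agent.cfg")
    if os.path.exists(candidate):
        return candidate
    raise BadConfig("no pbench-agent configuration file found")
```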

Signed-off-by: Charles Short <[email protected]>
Centralize click options so they can be shared with the backwards
compatible commands.

Signed-off-by: Charles Short <[email protected]>
Put both agent results classes in a single place so it's easier to follow.

Signed-off-by: Charles Short <[email protected]>
Consolidate pbench-cli to make it easier to backport commands
so that the workload scripts can keep on working as
we convert more bash scripts to Python.

Signed-off-by: Charles Short <[email protected]>
Detect the configuration file that the user is going to use for an
individual run. If no configuration file is specified, one will be
chosen for the user.

Also allow turning debugging on and off. Remove some cruft as well.

Signed-off-by: Charles Short <[email protected]>
* Refactor the pbench namespace so that we can easily split out the
  server/agent and libraries to make things more portable in the future.
* Split out functional tests into their own tox environment so we can
  run them separately from unit tests. Also a part of the
  diabolical scheme to replace what we have for agent unittests.
* Drop the pbench-cli methods and classes; simply a case of cart before
  the horse, and it was over-complicated as well.
* Change the behaviour of setup.cfg so that we
  can simplify the binary (Python scripts) installation.

Signed-off-by: Charles Short <[email protected]>
We also add a `detox` script which is called before tests are run in a tox
environment, reworking the `tox.ini` file so that it is a bit DRYer.
Sub-classing of `object` in Python3 is now implicit for all classes.

As a result of this change, we also fix up `flake8` errors for two datalog
Python3 modules.
This lets us use `tox -e util-scripts -- test-51` so that we can
run individual tests in each directory where we have unit tests.
The logging mechanism for the server is not really code specific to the
server.  We want to use it for the agent, so we are just doing a code
move first.
We restructure the `PbenchConfig` class to be a base class, and move the
pbench server specific components to a `PbenchServerConfig` sub-class.

We also remove the `ServerConfig` implementation, and fix up the pytest
environment so that it uses the real `pbench-server-default.cfg` file.
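The base-class/sub-class split described above can be sketched as follows; the attribute and option names are illustrative stand-ins, not the real `PbenchConfig` API:

```python
import configparser

class PbenchConfig:
    """Shared base class: loads the configuration and common options."""
    def __init__(self, cfg_text: str):
        self.conf = configparser.ConfigParser()
        # Sketch: read from a string instead of a file path.
        self.conf.read_string(cfg_text)

class PbenchServerConfig(PbenchConfig):
    """Server-specific components moved out of the base class."""
    def __init__(self, cfg_text: str):
        super().__init__(cfg_text)
        self.archive = self.conf.get("pbench-server", "pbench-archive-dir")

server = PbenchServerConfig(
    "[pbench-server]\npbench-archive-dir = /srv/pbench/archive\n"
)
```

The same pattern later lets an agent-side sub-class reuse the base class without duplicating the loading logic.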
To leverage more shared code, we have created a `PbenchAgentConfig` class
which is a sub-class of the common `PbenchConfig` class, allowing us to
share more code between the server and agent environments.

Having the `PbenchConfig` class shared with the agent side as well allows
us to introduce the logging infrastructure, previously only available to
the server side, to the agent side as well.

Along the way, we address a number of issues in the Python3 code found
while re-working unit tests, including the proper use of temporary
directories.
The goal of the "Tool Meister" is to encapsulate the starting and
stopping of tools into a wrapper daemon which can be started once on a
node for the duration of a benchmark script.  Instead of the start/stop
tools scripts using SSH to start/stop tools on local or remote hosts, a
Redis server is used to communicate with all the started Tool
Meisters, which execute the start/stop operations via messages they
receive via Redis's publish/subscribe pattern.

The Redis server location is passed as a set of parameters (host & port)
to the tool meister, along with the name of a "key" in the Redis server
which contains the tool meister's initial operating instructions for the
given benchmark script's execution:

  * What Redis pub/sub channel to use
  * What tool group to use, describing the tools and their options

The tool meister then runs through a three-phase life-cycle until it is
told to terminate: "`start`", "`stop`", and "`send`".  The initial phase
is "`start`", where it waits to be told when to start its tools running
via a published message on the channel to which it is subscribed. Once it
starts a tool in the background via `screen`, it waits for a "`stop`"
message to invoke its tool's `stop` action.  It then waits for a "`send`"
message to transmit tool data (if any).
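The three-phase life-cycle above can be sketched as a message loop; a real Tool Meister would receive these messages from a Redis pub/sub channel, but here the channel is simulated with a plain list so the control flow is visible:

```python
def tool_meister(messages):
    """Run the start -> stop -> send life-cycle until told to terminate."""
    actions = []
    phase = "start"
    for msg in messages:
        if msg["action"] == "terminate":
            break
        if msg["action"] != phase:
            continue  # ignore out-of-phase messages
        actions.append((phase, msg.get("iteration")))
        # Advance: start -> stop -> send -> start (next iteration).
        phase = {"start": "stop", "stop": "send", "send": "start"}[phase]
    return actions

msgs = [
    {"action": "start", "iteration": "iter-0"},
    {"action": "stop", "iteration": "iter-0"},
    {"action": "send"},
    {"action": "terminate"},
]
```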

We add a simple "data sink" to allow for sending tool data results as
tar balls back to the main controller via HTTP to facilitate the "send"
operation.

** Removal of remote tool registration **

This work IMPLIES that we no longer need to record registered tools
remotely.  We only need to start a tool meister remotely for each host,
passing the initial data it needs at start time via Redis.

Now that pbench-register-tool[-set] supports the ability for a caller to
register a tool [or tool set] for a list of hosts, we can consider
keeping all the tool data locally on the pbench "controller" node where
the pbench-agent's user registers tools.

By doing this, we remove the need to manage a distributed data set
across multiple hosts, allowing for a "late" binding of tools to be run
on a set of hosts.  In other words, the tool registration can be done
without a host being present, with the understanding that it must be
present when a workload is run.

This is particularly powerful for environments like OpenStack and
OpenShift, where installation of tool software is provided by
container images, VM images (like qcow2), and other automated
installation environments.

This is an invasive change, as knowledge about how tool data is
represented on disk was spread out across different pieces of code. We
have attempted to consolidate that knowledge; future work might be
required to adhere to the DRY principle.

NOTES:

  * The `start` and `stop` messages contain the iteration name (string)
    to use

  * The tool meister invokes the existing tools in `tool-scripts` as
    they operate today without any changes

  * This commit DOES NOT remove pbench-start/stop-tools yet; those
    interfaces are maintained for compatibility initially

  * This work requires further work to eliminate the need to run the
    post-processing step after stopping tools

    - Eliminating the post-processing step is not a requirement; we
      could easily add a "post-processing" phase to the above three
      phases ... but the reason to remove post-processing now is to
      move tool data collection to more of a pure collection process;
      this is arguable, as we may want some kind of post-processing to
      happen before sending the data back.

TO DO:

Work that is still needed for this commit:

 - [x] Add pbench-tool-meister-start, which will start a Redis server,
       load up all the tool data for the given group into the Redis
       server, start the pbench-tool-data-sink, and start all the tool
       meisters remotely

 - [x] Add pbench-tool-meister-stop, which will stop all the tool
       meisters, stop the data sink, and stop the Redis server

 - [x] Write pbench-tool-meister-client, which will implement the CLI
       method of sending messages via Redis to the tool meisters to
       take a particular action

 - [x] How to get pbench-agent configuration data into python

   - Will likely use the same method we do on the server side

 - [x] Actually invoke the tool via `screen`

 - [ ] The "`send`" phase is not implemented yet, as we don't have the
       small object store implementation to handle that phase yet

 - [ ] Unit tests for this infrastructure

 - [ ] Rewrite pbench-tool-trigger into a python application that
       talks directly to the Redis server to initiate the start, stop,
       send messages

 - [ ] Add support for the tool-meister-start/stop to initiate the
       sysinfo-dump operations
Owner

@portante portante left a comment

Ah, well, 2,273 files changed seems like a bit much. =)

Perhaps you can rebase your work properly on the "containerized-tools" branch?

@Maxusmusti Maxusmusti closed this Jul 6, 2020
portante pushed a commit that referenced this pull request Aug 24, 2022
Replaces previous contributed data aggregation files with the following:

pbench_combined_data.py

This contains the PbenchCombinedData class, which serves as a wrapper object for processing the data sources given and storing them along with diagnostic information. It has methods for adding run, result, client name, and disk host data. These methods use the provided Filter objects that specify the processing.

This also contains the Filter abstract class, which serves as a template for custom filters on any data sources. The Filter class allows you to specify required and optional fields from the JSON docs for run and result data. It enables the specification of a diagnostic method to run more complex validation checks on the data source, and an apply_filter method that filters the data source down to the required components to be returned.
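A minimal sketch of such a Filter abstract class; the method names follow the description above (diagnostic, apply_filter), but the field layout and the example subclass are illustrative, not the actual pbench_combined_data.py code:

```python
from abc import ABC, abstractmethod

class Filter(ABC):
    """Template for custom filters over JSON-like run/result docs."""
    required_fields: list = []
    optional_fields: list = []

    @abstractmethod
    def diagnostic(self, doc: dict) -> bool:
        """Return True if the doc passes this filter's validation check."""

    def apply_filter(self, doc: dict) -> dict:
        """Reduce the doc to the required (and any present optional) fields."""
        out = {f: doc[f] for f in self.required_fields}
        out.update({f: doc[f] for f in self.optional_fields if f in doc})
        return out

class HasRunId(Filter):
    """Hypothetical concrete filter: the doc must carry a run_id."""
    required_fields = ["run_id"]

    def diagnostic(self, doc: dict) -> bool:
        return "run_id" in doc
```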

This also contains the PbenchCombinedDataCollection class, which serves as a wrapper class for multiple PbenchCombinedData objects. While the PCD class is used for one run and its associated result data, this collection class is used for the processing and storing of all data to be aggregated. It has methods that keep statistics on valid and invalid data, and it implements multiprocessing techniques to process the data. It has a method for outputting the data in CSV files. The method that gets called to perform the aggregation, aggregate_data, takes in a list of months to aggregate data for. It processes run data first, then stores base result data onto a queue. Multiple worker processes read off the queue, add run, diskhost, and client name data, and put the result on another queue. Multiple worker processes then read from that queue and add sos data, and these complete data are stored and then output.
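The two-stage queue pipeline described above can be sketched as follows, using threads in place of the worker processes so it is easy to run; the enrichment lambdas are stand-ins for the real run/diskhost/client and sos steps:

```python
import queue
import threading

def stage(in_q, out_q, enrich):
    """Worker: read items from in_q, enrich them, and pass them downstream."""
    while True:
        item = in_q.get()
        if item is None:          # sentinel: propagate and stop
            out_q.put(None)
            break
        out_q.put(enrich(item))

def run_pipeline(base_results):
    q1, q2, done = queue.Queue(), queue.Queue(), queue.Queue()
    # Stage 1 stands in for adding run/diskhost/client data;
    # stage 2 stands in for adding sos data.
    t1 = threading.Thread(target=stage,
                          args=(q1, q2, lambda r: {**r, "hostdisk": "added"}))
    t2 = threading.Thread(target=stage,
                          args=(q2, done, lambda r: {**r, "sos": "added"}))
    t1.start(); t2.start()
    for r in base_results:        # base result data goes onto the first queue
        q1.put(r)
    q1.put(None)
    out = []
    while (item := done.get()) is not None:
        out.append(item)          # complete data is stored, then output
    t1.join(); t2.join()
    return out
```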

sos_collection.py

This contains the Sosreport and SosCollection classes, which respectively wrap the Sosreport data processing given the sos tarball, and the processing for all the sosreports. The Sosreport class has methods for extracting the desired information out of the exploded tarball; it takes in the path to the tarball on the local system and does the exploding and extracting. Since the sosreports need to be downloaded from a remote server, the SosCollection class has methods for opening an ssh and sftp client to download the desired Sosreport and then perform the processing, given the data of a PbenchCombinedData object.

aggregate_data.py

This is the script that is called to execute the aggregation. It creates a PbenchCombinedDataCollection object, calls its aggregate_data method on a default of the last 12 months, and then calls its emit_csv method to output the data. The months can be changed with CLI flags.

The command to run the processing is:

./aggregate_data.py elasticsearch.intlab.perf-infra.lab.eng.rdu2.redhat.com 10081 http://pbench.perf.lab.eng.bos.redhat.com intlab-011.ctrl.perf-infra.lab.eng.rdu2.redhat.com

This will run it with all available CPUs for the multiprocessing and aggregate the last 12 months' worth of data.

I recommend adding the --months_e flag, which specifies how many months prior to now to end the aggregation. Add it with enough months so that the aggregation ends in 2021-12, because from what I've noticed, 2022 has all invalid data according to Hifza's checks and millions of records to go through, which causes the program to use a lot of memory doing nothing, inevitably crashing because it's out of memory.

NOTE: the processing here is slightly different from the one Hifza was using. The use of the ClientCount filter is the difference, and it dramatically reduces the set of 'valid' data. If you would like Hifza's original processing, remove ClientCount from the list for run on line 267 of pbench_combined_data.py, and comment out line 961 of the same file.

----
The work was performed in a series of 21 commits, what follows are the
set of messages from the original commits.

1st commit message:
----

Use argparse for cli parsing & Add record limiting functionality

Refactor naive cli argument parsing to use python's argparse module.
Add cli option to limit number of records, which defaults to 10, and
modify code accordingly.
Note: With this implementation there is no way to
run for all records as of yet.

commit message #2:
----

Refactor PbenchRun class into PbenchCombinedData and PbenchCombinedDataCollection

Since ultimate goal is to store all data in one object type, renamed class
accordingly. Created a corresponding collection object that tracks various
statistics about the objects inserted into it.

Update PbenchCombinedDataCollection to separately store invalid records

Add an internal dictionary for storing invalid records, so that diagnostic
data stored on it isn't lost.

Add some diagnostic check classes for result-data, refactor structure of abstract DiagnosticCheck and inheriting concrete classes

Fixed DiagnosticCheck and concrete inheriting class' structures

Update structure, so that tracker_names are now retrievable from a
DiagnosticCheck object without the need to pass in a doc to perform
the check on. That means we can pass the check classes into the Collection
object, get all the different issues we will track as objects are inserted,
and initialize the trackers for all such issues.

Successfully reimplements run data processing and adding, as it was originally, in the new structure

Use check class instances instead of class names in check list

Since some checks require extra information to perform the check, apart from
the 'doc' passed in as input to the diagnostic method, decided to have those
checks take more args in their constructor, to preserve the ability to simply
loop over checks and perform them in the compact manner used so far.
This means that the list of check classes must now contain instances of those
classes, so that the checks that require args can be given them.

Add basic result data merging without disknames,hostnames,clientnames to new class structure

Added methods that now handle the error-checking and merging of 'pure' result
data from a result doc to the PbenchCombinedData and Collection version
classes. Thus we can now simply call the add_run and add_result methods on a
collection object, and the error-checking, stats tracking, and merging all
happen internally in the class objects.
Note: error-checking for result data needs to be done in the collection
class, as compared to other checks which happen in the PbenchCombinedData
class, because we need to know the associated PCD object which has the
associated run ahead of time, and the existence of such an object is part of
the checks performed.

Reimplement ability to process all data for the past year & set limit default to all

Set the default limit value to -1, meaning all. Now the default is all data
processed; to limit, pass the --limit flag on the cli.
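The --limit handling described above can be sketched with argparse; the flag name mirrors the commit message, while the program name and help text are illustrative:

```python
import argparse

def parse_arguments() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="aggregate_data")
    parser.add_argument(
        "--limit", type=int, default=-1,
        help="max number of records to process; -1 means all records",
    )
    return parser

# -1 (the default) is interpreted as "no limit".
args = parse_arguments().parse_args(["--limit", "10"])
limit = None if args.limit == -1 else args.limit
```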

Implement host and disk name data collection

Add a method in the PbenchCombinedData class to add host and disk names for
the data collected. This should only be called after run and base result data
have already been added. This method is called in the collection's add_result
method after result data is added, to ensure the correct order of insertion.
Associated check classes are implemented and performed accordingly.

Implement client name adding in new class structure

Add a method in the PbenchCombinedData class to add client name data
to the object. This needs to be performed after run and base result data
have already been added. Call this method in the collection's add_result
method, after base result and disk and host names are added. Associated
check classes are implemented and performed accordingly.

Add Documentation and deletes Old code

Adds documentation for all new classes created: PbenchCombinedData,
PbenchCombinedDataCollection, DiagnosticCheck and concrete inheritors of
DiagnosticCheck. Also adds documentation for remaining functions in
merge_sos_and_perf_parallel.py. All documentation adheres to the
numpy/scipy style. All old code functionality reimplemented under new
structure so deleted.

commit message #3:
----

Internalize run and result data collection for given month.

Make merging and collecting of run and result data for a given month
the collect_data method in the PbenchCombinedDataCollection class.
Make the es_data_gen function also a method in this class. Rename
merge_sos_and_perf_parallel to aggregate_data. aggregate_data now
only gives the collection object a month to collect data on and all
processing happens inside the Class object.

commit message #4:
----

Fix for comments by Peter on aggregate_data.py

Change the _month_gen function signature, definition, and docs to reflect
2 params: one for end_time, and a start_months_prior duration specifying
how many months prior to start data collection from. Update the function
call in main appropriately.

Add type hints and documentation to main function. Remove comments
regarding multiprocessing. Fix user agent string to reflect filename
by using argparse.

Add an optional argument for argparse to specify months_prior to use for
_month_gen, with a default of 12 months.

Fix program name retrieval in main

Change parse_arguments to return an ArgumentParser object instead of a
Namespace, and create the Namespace args in the main function, because
the prog name is only accessible from the parser.

add check to main to ensure record limiting works

Add print_report and emit_csv methods to Collection Object

Adds a print_report method to print tracker information, instead of
using print on the object. Adds an emit_csv method, which writes the
important data collected to CSV files in a subdirectory. Writes valid,
invalid, trackers, diskhost, and client data to separate CSV files.

make csv folder using current working directory instead of hardcoding it

Finish Fixing comments by peter, add some comments to code

commit message #5:
----

Implement multiprocessing capability

Uses pathos because dill can better serialize objects. Creates
a ProcessPool and stores it in an attribute of the collection object.
Makes the collection object take in the number of CPUs to use at
initialization, so the pool can be initialized. Adds methods to add one
month asynchronously, or a list of months. Adds methods to merge
dictionaries in a special way, and to combine another collection object's
data into its own. This is because self is returned from the worker
processes.

commit message #6:
----

fix jenkins errors

commit message #7:
----

Fix 1 multiprocessing issue

Change the default cpu count to 0, which means all but 1 of the system's cpus.
Change the ProcessPool initialization to work when the number of cpus to use is 1.

commit message #8:
----

start fio result hostname refactor

commit message #9:
----

Verify aggregation behavior

commit message #10:
----

Implement ClientCount check

This should replace the ClientHostAggregateResultCheck and
the last check of SosreportCheck, where it checks whether the
sos reports have different hostnames. Invalid run ids are those
that have more than 2 measurement_idx values for any measurement_title.
It also checks for run ids not in any result data (i.e. with no
corresponding results, and thus invalid).

commit message #11:
----

Reverse order of month string generation in _month_gen

YYYY-MM strings are now generated starting with the most recent month and
going backwards, ending start_months_prior months before.

commit message #12:
----

Change * use in result index generation to a loop

commit message #13:
----

Use asyncio module to replace pathos

Use async function calls to process each month, as opposed to
using multiple processes. This is to see if record limiting can
work under this method instead.

commit message #14:
----

Table Multiprocessing/asyncio on month processing

Tables the use of multiprocessing/asyncio to speed up the processing
of months by doing it in parallel, deferring it for later, since we
ran into a myriad of issues.

commit message #15:
----

Test ssh and sftp client & Begin refactoring of sos collection

commit message #16:
----

Finish Sosreport collection refactor - no changes to processing logic

commit message #17:
----

print timings

commit message #18:
----

Modify Check class into Filter class

The Filter class implements improved handling of custom required and
optional field/property specifications from run and result docs.
It factors out the filtering logic, making it easily changeable.

commit message #19:
----

Add documentation and wrap up changes

commit message #20:
----

Finish documentation and changes

commit message #21:
----

Address flake8 errors and warnings
portante pushed a commit that referenced this pull request Aug 24, 2022
Replaces previous contributed data aggregation files with the following:

pbench_combined_data.py

This contains the PbenchCombinedData class which serves as a wrapper object for processing data sources given and storing it along with diagnostic information. Has methods for adding run, result, client name, and disk host data. These methods use the Filter objects provided that specify the processing.

This also contains the Filter abstract class that serves as a template for custom filters on any data sources. Filter class allows you to specify required and optional fields from JSON docs for run and result data. Enables the specification of a diagnostic method to run more complex validation checks on the data source. And a apply_filter method that filters down the data source to the required components to be returned.

This also contains the PbenchCombinedDataCollection class which serves as a wrapper class for multiple PbenchCombinedData objects. While the PCD class is used for 1 run and it’s associated result data, this collection class is used for the processing and storing of all data to be aggregated. It has methods that keep statistics of valid and invalid data, and implements multiprocessing techniques to process the data. It has a method for outputting the data in csv files. The method that gets called to perform the aggregation is called aggregate_data that takes in a list of months to aggregate data for. It processes run data first. Then stores base result data onto a queue. Multiple worker processes read off the queue and add run, diskhost and client name data to it and put it on another queue. Multiple worker processes then read from that queue and add sos data to it, and these complete data are stored and then outputted.

sos_collection.py

This contains the Sosreport and SosCollection classes that respectively wrap the Sosreport data processing given the sos tarball and processing for all the sosreports. The Sosreport class has methods for extracting the desired information out of the exploded tarball, and takes in the path to the tarball on the local system and does the exploding and extracting. Since the sosreports need to be downloaded from a remote server, the SosCollection class has methods for opening a ssh and sftp client to download the desired Sosreport and then perform the processing, given the data of a PbenchCombinedData object.

aggregate_data.py

This is the script that is called to execute the aggregation. It creates a PbenchCombinedDataCollection object, and calls its aggregate_data method on a default if the last 12 months and then it’s emit_csv method to output the data. The months can be changed with cli flags.

The command to run the processing is:

./aggregate_data.py elasticsearch.intlab.perf-infra.lab.eng.rdu2.redhat.com 10081 http://pbench.perf.lab.eng.bos.redhat.com intlab-011.ctrl.perf-infra.lab.eng.rdu2.redhat.com

This will run it with all available CPU’s for the multiprocessing and aggregate the last 12 months worth of data.

I recommend adding the —months_e flag that specifies how many months prior to now to end the aggregation. Add it with as many months so that it ends in 2021-12. Because from what I’ve noticed 2022 has all invalid data according to hifza’s checks and millions of records to go through, which cause the program to use a lot of memory doing nothing inevitably crashing because it’s out of memory.

NOTE: the processing here is slightly different from the one Hifza was using. The use of the ClientCount filter is the difference and dramatically reduces the set of ‘valid’ data. If you would like to have Hifza’s original processing, remove ClientCount from list for run from line 267 of pbench_combined_data.py. And comment line 961 of the same file.

----
The work was performed in a series of 21 commits, what follows are the
set of messages from the original commits.

1st commit message:
----

Use argparse for cli parsing & Add record limiting functionality

Refactor naive cli argument parsing to use python's argparse module.
Add cli option to limit number of records, which defaults to 10, and
modify code accordingly.
Note: With this implementation there is no way to
run for all records as of yet.

commit message #2:
----

Refactor PbenchRun class into PbenchCombinedData and PbenchCombinedDataCollection

Since ultimate goal is to store all data in one object type, renamed class
accordingly. Created a corresponding collection object that tracks various
statistics about the objects inserted into it.

Update PbenchCombinedDataCollection to separately store invalid records

Add an internal dictionary for storing invalid records, so that diagnostic
data stored on it isn't lost.

Add some diagnostic check classes for result-data, refactor structure of abstract DiagnosticCheck and inheriting concrete classes

Fixed DiagnosticCheck and concrete inheriting class' structures

Update structure so that tracker_names is now retrievable from a
DiagnosticCheck object, without the need to pass in a doc to perform
the check on. That means we can pass the check classes into the Collection object,
get all the different issues we will track as objects are inserted, and
initialize the trackers for all such issues.

Successfully reimplements run data processing and adding, as it originally was, in the new structure

Use check class instances instead of class names in check list

Since some checks require extra information to perform the check, apart from
the 'doc' passed in as input to the diagnostic method, decided to have those
checks take more args in their constructor, to preserve the ability to simply
loop over checks and perform them in the compact manner used so far.
This means that the list of check classes must now hold instances of those
classes, so that the checks that require args can be given them.
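The pattern described in the last two commits might look roughly like the sketch below; the concrete class and tracker names (SeenCheck, duplicate_id) are invented for illustration:

```python
from abc import ABC, abstractmethod

class DiagnosticCheck(ABC):
    @property
    @abstractmethod
    def tracker_names(self) -> list:
        """Issue names this check tracks, known without any doc."""

    @abstractmethod
    def diagnostic(self, doc: dict) -> dict:
        """Map each tracker name to True if the doc fails that check."""

class SeenCheck(DiagnosticCheck):
    def __init__(self, seen_ids: set):
        self.seen_ids = seen_ids  # extra state supplied via the constructor

    @property
    def tracker_names(self) -> list:
        return ["duplicate_id"]

    def diagnostic(self, doc: dict) -> dict:
        return {"duplicate_id": doc["id"] in self.seen_ids}

checks = [SeenCheck({"run-1"})]  # instances, so constructor args can be supplied
# Trackers can be initialized up front, before any doc is seen:
trackers = {name: 0 for check in checks for name in check.tracker_names}
for check in checks:
    for name, failed in check.diagnostic({"id": "run-1"}).items():
        trackers[name] += failed
print(trackers)  # {'duplicate_id': 1}
```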

Add basic result data merging without disknames,hostnames,clientnames to new class structure

Added methods that now handle the error-checking and merging of 'pure' result data
from a result doc into the PbenchCombinedData and Collection classes. Thus one can now simply
call the add_run and add_result methods on a Collection object, and the error-checking,
stats tracking, and merging all happen internally in the class objects.
Note: error-checking for result data needs to be done in the Collection
class, unlike the other checks, because we need to know the associated PCD object
(which has the associated run) ahead of time, and the existence of such an
object is part of the checks performed.

Reimplement ability to process all data for the past year & set limit default to all

Set the default limit value to -1, meaning all. Now the default is that all data
is processed; to limit, pass the --limit flag on the cli.
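A small sketch of the -1-means-all convention, assuming records arrive as an iterable:

```python
from itertools import islice

def limited(records, limit: int):
    """Yield at most `limit` records; a limit of -1 means all of them."""
    return iter(records) if limit == -1 else islice(records, limit)

print(list(limited(range(5), 2)))   # [0, 1]
print(list(limited(range(5), -1)))  # [0, 1, 2, 3, 4]
```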

Implement host and disk name data collection

Add method in PbenchCombinedData class to add host and disk names for data collected.
This should only be called after run and base result data have already been added. This
method is called in the collection's add_result method after result data is added, to
ensure correct order of insertion. Associated check classes implemented and performed accordingly.

Implement client name adding in new class structure

Add method in PbenchCombinedData class to add client name data
to object. Needs to be performed after run and base result data
already added. Call this method in the collection's add_result
method, after base result and disk and host names added. Associated check
classes implemented and performed accordingly.

Add Documentation and deletes Old code

Adds documentation for all new classes created: PbenchCombinedData,
PbenchCombinedDataCollection, DiagnosticCheck and concrete inheritors of
DiagnosticCheck. Also adds documentation for remaining functions in
merge_sos_and_perf_parallel.py. All documentation adheres to the
numpy/scipy style. All old code functionality reimplemented under new
structure so deleted.

commit message #3:
----

Internalize run and result data collection for given month.

Make merging and collecting of run and result data for a given month
the collect_data method in the PbenchCombinedDataCollection class.
Make the es_data_gen function also a method in this class. Rename
merge_sos_and_perf_parallel to aggregate_data. aggregate_data now
only gives the collection object a month to collect data on and all
processing happens inside the Class object.

commit message #4:
----

Fix for comments by Peter on aggregate_data.py

Change the _month_gen function's signature, definition, and docs to reflect
2 params: one for end_time, and a start_months_prior duration specifying
how many months prior to start data collection from. Update the function
call in main appropriately.

Add type hints and documentation to main function. Remove comments
regarding multiprocessing. Fix user agent string to reflect filename
by using argparse.

Add optional argument for argparse to specify the months_prior to use for
_month_gen, with a default of 12 months.

Fix program name retrieval in main

Change parse_arguments to return the ArgumentParser object instead of a
Namespace, and create the Namespace args in the main function, because the
prog name is only accessible from the parser.

add check to main to ensure record limiting works

Add print_report and emit_csv methods to Collection Object

Adds print_report method to print tracker information, instead of
using print on object. Adds emit_csv method, which writes important
data collected to csv files in a subdirectory. Writes valid, invalid,
trackers, diskhost, client data to separate csv files.
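A hedged sketch of what an emit_csv-style method could do with Python's csv module; the data layout here (a dict of named row lists) is an assumption, not the collection's actual internal representation:

```python
import csv
from pathlib import Path

def emit_csv(tables: dict, out_dir: Path) -> None:
    """Write each named list of row-dicts to its own CSV file under out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in tables.items():
        if not rows:
            continue
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

# The subdirectory is created relative to the current working directory:
emit_csv({"valid": [{"run_id": "a", "ok": 1}]}, Path.cwd() / "csv_output")
```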

make csv folder using current working directory instead of hardcoding it

Finish fixing comments by Peter, add some comments to code

commit message #5:
----

Implement multiprocessing capability

Uses pathos because dill can better serialize objects. Creates
a ProcessPool and stores it in an attribute of the collection object.
Makes the Collection object take in the number of CPUs to use at initialization,
so the pool can be initialized. Adds methods to add one month asynchronously,
or a list of months. Adds methods to merge dictionaries in a special way,
and to combine another Collection object's data into its own. This is because
self is returned from the worker processes.
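The "special" dictionary merge for folding a worker's Collection back into the main one plausibly amounts to summing nested counters; this sketch assumes that shape:

```python
def merge_tracker_dicts(base: dict, other: dict) -> dict:
    """Recursively merge `other` into `base`, summing leaf counters."""
    for key, val in other.items():
        if isinstance(val, dict):
            base[key] = merge_tracker_dicts(base.get(key, {}), val)
        else:
            base[key] = base.get(key, 0) + val
    return base

main = {"run": {"valid": 3, "invalid": 1}}
worker = {"run": {"valid": 2, "invalid": 4}, "result": {"valid": 5}}
print(merge_tracker_dicts(main, worker))
# {'run': {'valid': 5, 'invalid': 5}, 'result': {'valid': 5}}
```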

commit message #6:
----

fix jenkins errors

commit message #7:
----

Fix 1 multiprocessing issue

Change the default CPU count to 0, which means all but one of the system's CPUs.
Change ProcessPool initialization to work when the number of CPUs to use is 1.

commit message #8:
----

start fio result hostname refactor

commit message #9:
----

Verify aggregation behavior

commit message #10:
----

Implement ClientCount check

This should replace the ClientHostAggregateResultCheck and
the last check of SosreportCheck, where it checks whether the
sos reports have different hostnames. Invalid run ids are those
that have more than 2 measurement_idx values for any measurement_title.
It also checks for run ids not in any result data (i.e. with no corresponding
results and thus invalid).
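Taking the rule above literally, a standalone version of the measurement_idx part of the check could look like this; the doc field names are assumed from the description, not taken from the actual index mapping:

```python
from collections import defaultdict

def invalid_run_ids(result_docs) -> set:
    """Run ids where any measurement_title has more than 2 distinct
    measurement_idx values (the rule described above)."""
    idxs = defaultdict(set)  # (run_id, title) -> distinct measurement_idx values
    for doc in result_docs:
        idxs[(doc["run_id"], doc["measurement_title"])].add(doc["measurement_idx"])
    return {run_id for (run_id, _), vals in idxs.items() if len(vals) > 2}

docs = [
    {"run_id": "r1", "measurement_title": "t", "measurement_idx": i} for i in range(3)
] + [{"run_id": "r2", "measurement_title": "t", "measurement_idx": 0}]
print(invalid_run_ids(docs))  # {'r1'}
```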

commit message #11:
----

Reverse order of month string generation in _month_gen

YYYY-MM strings are now generated starting with the most recent month
and going backwards, ending start_months_prior months before.

commit message #12:
----

Change * use in result index generation to a loop

commit message #13:
----

Use asyncio module to replace pathos

Use async function calls to process each month, as opposed to
using multiple processes. This is to see if record limiting can
work under this method instead.
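In outline, the asyncio approach replaces a process pool's map with asyncio.gather; this toy version stands in for real Elasticsearch I/O with asyncio.sleep:

```python
import asyncio

async def collect_month(month: str) -> dict:
    # A real implementation would await Elasticsearch queries here.
    await asyncio.sleep(0)
    return {month: "collected"}

async def collect_all(months) -> dict:
    # Schedule every month concurrently, then merge the partial results.
    results = await asyncio.gather(*(collect_month(m) for m in months))
    merged: dict = {}
    for partial in results:
        merged.update(partial)
    return merged

print(asyncio.run(collect_all(["2021-11", "2021-12"])))
# {'2021-11': 'collected', '2021-12': 'collected'}
```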

commit message #14:
----

Table Multiprocessing/asyncio on month processing

Tables the use of multiprocessing/asyncio to speed up the processing of
months by doing it in parallel, leaving it for later,
since it was running into a myriad of issues.

commit message #15:
----

Test ssh and sftp client & Begin refactoring of sos collection

commit message #16:
----

Finish Sosreport collection refactor - no changes to processing logic

commit message #17:
----

print timings

commit message #18:
----

Modify Check class into Filter class

Filter class implements improved handling of custom required and
optional field/property specifications from run and result docs.
Factors out filtering logic, making it easily changeable.
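A minimal sketch of such a Filter template; the field names and the exact method contract are assumptions based on the description above, not the class's actual API:

```python
from abc import ABC, abstractmethod

class Filter(ABC):
    """Template for filters over run/result docs: declare required and
    optional fields, validate, then project a doc down to those fields."""
    required_fields: tuple = ()
    optional_fields: tuple = ()

    @abstractmethod
    def diagnostic(self, doc: dict) -> bool:
        """Return True if the doc passes this filter's validity checks."""

    def apply_filter(self, doc: dict) -> dict:
        if not self.diagnostic(doc):
            raise ValueError("document failed diagnostic checks")
        filtered = {field: doc[field] for field in self.required_fields}
        filtered.update(
            {field: doc[field] for field in self.optional_fields if field in doc}
        )
        return filtered

class RunFilter(Filter):
    required_fields = ("run_id", "controller")
    optional_fields = ("config",)

    def diagnostic(self, doc: dict) -> bool:
        return all(field in doc for field in self.required_fields)

doc = {"run_id": "r1", "controller": "c1", "config": "x", "extra": "dropped"}
print(RunFilter().apply_filter(doc))
# {'run_id': 'r1', 'controller': 'c1', 'config': 'x'}
```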

commit message #19:
----

Add documentation and wrap up changes

commit message #20:
----

Finish documentation and changes

commit message #21:
----

Address flake8 errors and warnings
portante pushed a commit that referenced this pull request Aug 25, 2022
Replaces previous contributed data aggregation files with the following:

pbench_combined_data.py

This contains the PbenchCombinedData class, which serves as a wrapper object for processing the given data sources and storing them along with diagnostic information. It has methods for adding run, result, client name, and disk host data. These methods use the Filter objects provided, which specify the processing.

This also contains the Filter abstract class, which serves as a template for custom filters on any data source. The Filter class allows you to specify required and optional fields from the JSON docs for run and result data, enables the specification of a diagnostic method to run more complex validation checks on the data source, and provides an apply_filter method that filters the data source down to the required components to be returned.

This also contains the PbenchCombinedDataCollection class, which serves as a wrapper class for multiple PbenchCombinedData objects. While the PCD class is used for one run and its associated result data, this collection class is used for the processing and storing of all data to be aggregated. It has methods that keep statistics on valid and invalid data, and it implements multiprocessing techniques to process the data. It has a method for outputting the data in csv files. The method called to perform the aggregation is aggregate_data, which takes in a list of months to aggregate data for. It processes run data first, then stores base result data on a queue. Multiple worker processes read off the queue, add run, diskhost, and client name data to each record, and put it on another queue. Multiple worker processes then read from that queue and add sos data, and these complete records are stored and then output.

sos_collection.py

This contains the Sosreport and SosCollection classes, which respectively wrap the processing of a single Sosreport (given the sos tarball) and the processing of all the sosreports. The Sosreport class takes in the path to the tarball on the local system, explodes it, and has methods for extracting the desired information from the exploded tarball. Since the sosreports need to be downloaded from a remote server, the SosCollection class has methods for opening an SSH and SFTP client to download the desired Sosreport and then perform the processing, given the data of a PbenchCombinedData object.

aggregate_data.py

This is the script that is called to execute the aggregation. It creates a PbenchCombinedDataCollection object, calls its aggregate_data method on a default of the last 12 months, and then calls its emit_csv method to output the data. The months can be changed with cli flags.

portante pushed a commit that referenced this pull request Jan 17, 2023
Consider a small change to re-use `mk_dirs`