This script depends on Python 3 and the `nvidia-smi`, `awk`, and `ps` commands.
pip install -r requirements.txt
If a package is missing, please install it yourself using pip. You may also need to adjust some settings for your own environment.
Both `gpu_status_server.py` and `gpu_info_sender.py` accept many options for changing settings, and you can also use a .yaml file to override the arguments. Use `-h` to see all arguments, and use the `--local_settings_yaml_path` option to point at the override file.
The .yaml file can be written like:
# every key must be in capital letters
# this IP address is the server address that the clients send the GPU information to
IP: "192.168.0.1"
# server's open port
PORT_NUM: 8080
MAIN_PAGE_TITLE: "AWESOME GPUs"
MAIN_PAGE_DESCRIPTION: "awesome description"
TABLE_PAGE_TITLE: "AWESOME Table"
TABLE_PAGE_DESCRIPTION: "awesome description"
# DD/MM/YYYY
TIMESTAMP_FORMAT: "DMY"
# this filters which networks are allowed
# it is fed into Python's re.search(), so you can use regular expressions
VALID_NETWORK: "192.168.11.(129|1[3-9][0-9]|2[0-5][0-9])"
# this allows 192.168.11.129~255
...
An example is in `example/local_settings.yaml`.
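For instance, to start the server with that file (a usage sketch; adjust the path to your own settings file):

```
python gpu_status_server.py --local_settings_yaml_path example/local_settings.yaml
```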
The output format of `nvidia-smi` has changed across driver versions, so you need to specify a parsing version for the client script (which sends the GPU information). Specify the format version (1 or 2) using `--nvidia-smi_parse_version`, or set `NVIDIA_SMI_PARSE_VER` in the local .yaml file.
Version 1 is for the following format:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16163 C python 240MiB |
| 1 16163 C python 8522MiB |
+-----------------------------------------------------------------------------+
Version 2 is for the following format (this is now the default):
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 24898 C python 17939MiB |
| 1 N/A N/A 24899 C python 17063MiB |
+-----------------------------------------------------------------------------+
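For example, to run the client against the newer format (a usage sketch):

```
python gpu_info_sender.py --nvidia-smi_parse_version 2
```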
You can use the simple wrappers.
For the server:
python gpu_status_server.py
For the nodes:
python gpu_info_sender.py
For automation, systemd or crontab will do the job.
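For example, a crontab entry like the following would post GPU information every five minutes (the interpreter and script paths are assumptions; adjust them to your setup):

```
*/5 * * * * /usr/bin/python3 /path/to/vesta/gpu_info_sender.py
```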
To get GPU information from a terminal, use curl to access http://<server_address>/?term=true. You will get something like:
$ curl "http://0.0.0.0:8080/?term=true"
+------------------------------------------------------------------------------+
| vesta ver. 1.2.4 gpu info. |
+------------------+------------------------+-----------------+--------+-------+
| host | gpu | memory usage | volat. | temp. |
+------------------+------------------------+-----------------+--------+-------+
|mau_local | 0:GeForce GTX 1080 Ti | 8018 / 11169 | 92 %| 80 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
+------------------+------------------------+-----------------+--------+-------+
|mau_local_11a7c5eb| 0:GeForce GTX 1080 Ti | 2400 / 11169 | 78 %| 79 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
+------------------+------------------------+-----------------+--------+-------+
|mau_local_ac993634| 0:GeForce GTX 1080 Ti | 8897 / 11169 | 98 %| 82 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
| | 2:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 36 °C|
| | 3:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 40 °C|
+------------------+------------------------+-----------------+--------+-------+
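To keep monitoring from the terminal, you could wrap this in `watch` (a sketch reusing the example address above):

```
watch -n 10 'curl -s "http://0.0.0.0:8080/?term=true"'
```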
If you want to see detailed information, add the detail option, as in http://<server_address>/?term=true&detail=true. You will get something like:
$ curl "http://0.0.0.0:8080/?term=true&detail=true"
vesta ver. 1.2.4
#### mau_local :: 127.0.0.1 ####################################################
last update: 24/03/2019 20:27:10
--------------------------------------------------------------------------------
┌[ gpu:0 GeForce GTX 1080 Ti 2019/03/24 20:00:00.000 ]─────────────────────┐
│ memory used memory available gpu volatile temperature │
│ 8018 / 11169MiB 3151MiB 92% 80°C │
│ │
│ mem [/////////////////////////////////////////// ] 71% │
│ ├── train1 6400MiB user1 │
│ └── train2 1618MiB user1 │
└──────────────────────────────────────────────────────────────────────────┘
┌[ gpu:1 GeForce GTX 1080 Ti 2019/03/24 20:00:00.000 ]─────────────────────┐
│ memory used memory available gpu volatile temperature │
│ 2 / 11172MiB 11170MiB 0% 38°C │
│ │
│ mem [ ] 0% │
└──────────────────────────────────────────────────────────────────────────┘
________________________________________________________________________________
.
.
.
The server also lets you access host data as JSON. Access http://<server_address>/states/, or http://<server_address>/states/<host_name>/ to specify a host. You can use the URL parameter fetch_num=<number> to choose how many logs to fetch.
To browse instead of using the API, just access http://<server_address>/.
The JSON response from http://<server_address>/states/ looks like:
{
"host1":{
# data is in ascending time order
"data":
# logs for this host are stored in an array
[
{ # each GPU is denoted by "gpu:<device_num>"
'gpu_data':{
'gpu:0':{'available_memory': '10934',
'device_num': 0,
'gpu_name': 'GeForce GTX 1080 Ti',
'gpu_volatile': 92,
'processes': [
{
'name': 'train1',
'pid': "31415",
'used_memory': 6400,
'user': 'user1'
},
{
'name': 'train2',
'pid': "27182",
'used_memory': 1618,
'user': 'user1'
}
],
'temperature': 80,
'timestamp': '2019/03/24 20:00:00.000',
'total_memory': 11169,
'used_memory': 8018,
'uuid': 'GPU-...'
},
'gpu:1':{
'available_memory': '11170',
'device_num': '1',
'gpu_name': 'GeForce GTX 1080 Ti',
'gpu_volatile': '0',
'processes': [],
.
.
.
}
},
"timestamp": 20181130232947 # server recorded timestamp YYYYMMDDhhmmss
}
],
"ip_address": 127.0.0.1 # host IP address
},
"host2":{...}
}
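A minimal sketch of consuming this endpoint from Python; the address and port reuse the examples above, and the field names follow the response shown:

```python
import json
import urllib.request

# fetch only the latest log entry per host
url = "http://0.0.0.0:8080/states/?fetch_num=1"
with urllib.request.urlopen(url) as resp:
    states = json.load(resp)

for host, info in states.items():
    latest = info["data"][-1]  # data is in ascending time order
    for gpu_key, gpu in latest["gpu_data"].items():
        print(f'{host} {gpu_key}: {gpu["used_memory"]} / {gpu["total_memory"]} MiB')
```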
If you set up Slack's webhook and bot settings, you can receive notifications via Slack. To specify the Slack settings, use `--slack_webhook`, `--slack_bot_token`, and `--slack_bot_post_channel` with `gpu_status_server.py`, or use the .yaml file; see `example/local_settings.yaml`.
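In the .yaml file, the Slack settings might look like the following sketch; the exact key names are assumptions based on the uppercase convention above, so verify them against example/local_settings.yaml:

```yaml
# assumed key names; check example/local_settings.yaml for the exact ones
SLACK_WEBHOOK: "https://hooks.slack.com/services/..."
SLACK_BOT_TOKEN: "xoxb-..."
SLACK_BOT_POST_CHANNEL: "gpu-notifications"
```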
The topology is a very simple master (server) and slave (each local machine) style, but it is ad hoc. The server just waits for the slaves to post their GPU information.
The `machine` table is a lookup table from hash_code (id) to host name.
The table fields are:
id (TEXT) | name (TEXT) | ip_address (TEXT) |
---|---|---|
hash_code_1 | host_1 | host_1_ip |
hash_code_2 | host_2 | host_2_ip |
... | ... | ... |
hash_code_n | host_n | host_n_ip |
The hash_code is generated by the Python code:
hash_code = random.getrandbits(128)
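A sketch of how registering and looking up a host could work with sqlite3 (the database file name is an assumption; the table layout follows the fields above):

```python
import random
import sqlite3

con = sqlite3.connect("vesta.db")  # the file name is an assumption

# register a new host under a fresh hash_code
hash_code = str(random.getrandbits(128))
con.execute("INSERT INTO machine (id, name, ip_address) VALUES (?, ?, ?)",
            (hash_code, "host_1", "192.168.11.129"))
con.commit()

# look up the host name for a given hash_code
row = con.execute("SELECT name FROM machine WHERE id = ?", (hash_code,)).fetchone()
print(row[0])  # -> "host_1"
```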
Each host has its own table for logging.
The table fields are:
timestamp (INTEGER) | data (BLOB) |
---|---|
timestamp_1 | data_1 |
timestamp_2 | data_2 |
... | ... |
timestamp_n | data_n |
`timestamp` is based on the server's time zone, in the style "YYYYMMDDhhmmss".
`data` is a Python dict object that has been serialized and compressed with Python's pickle and bz2.
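A minimal sketch of packing and unpacking such a blob, assuming it is a bz2-compressed pickle as described:

```python
import bz2
import pickle

def pack(data: dict) -> bytes:
    # serialize the dict with pickle, then compress with bz2
    return bz2.compress(pickle.dumps(data))

def unpack(blob: bytes) -> dict:
    # reverse the process: decompress, then unpickle
    return pickle.loads(bz2.decompress(blob))
```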