[ add ] telemetry for CPU/GPU #42 & pin processes for experiments that use CPU (#48)

* [ add ] Wall clock for dashboard

* [ refactor ] dashboard / common

* [ add ] time info for failed experiments

* [ modified ] include timing information in data.json

* [ modified ] style

* [ merge request ] README.md

Co-Authored-By: Steven S. Lyubomirsky <[email protected]>

* [ modified ] do not create data.json for failed runs

* [ fix ] data.json generating logic

* [ refactor ] extract code for getting timing info

* [ modified ] code logic

* [ add ] data collector for telemetry

* [ upd ] integrate telemetry into dashboard

* [ impl ] telemetry for cpu & gpu

* [ impl ] graph generation

* [ impl ] graph generation & [ refactor ] record elapsed time

* [ impl ] pin processes for trials that use CPU

* [ add ] top-level config of telemetry

* [ modified ] handle the case where commands run inside the method take more time than expected

* [ modified ] include last run in the graphs used on the website

* [ remove ] unused lines

* [ remove ] unused lines

* [ refactor ] factor out code for telemetry process

* [ modified ] flexibility of telemetry rate for each experiment

* [ refactor ] move telemetry results to subsystem dir & timeout for telemetry process

* [ refactor ] use subprocess.run for timeout

* [ upd ] modify telemetry data directory

* [ upd ] switch for telemetry and process pinning

* [ add ] accessor for telemetry statistics

* [ remove ] logging...

* [ refactor ] use shared library

* [ modified ] avoid ignoring non-trivial exceptions

* [ modified ] separate switches for CPU and GPU

* floating point seconds

* [ add ] docs for telemetry data

Co-authored-by: Steven S. Lyubomirsky <[email protected]>
AD1024 and slyubomirsky committed Jan 10, 2020
1 parent b71d546 commit 347e2fe
Showing 11 changed files with 386 additions and 24 deletions.
18 changes: 18 additions & 0 deletions README.md
@@ -40,6 +40,9 @@ The top-level dashboard config.json may contain the following fields:
- `tmp_data_dir` (str, mandatory): Directory for storing experiment raw data (we hope to move this to cloud storage eventually), which are zipped CSV files
- `backup_dir` (str, mandatory): Directory for storing compressed copies of dashboard backups AKA dumping zip files (we hope to move this to cloud storage too)
- `setup_dir` (str, mandatory): Directory for storing persistent setup files for experiments (this probably should stay local)
- `run_cpu_telemetry` (boolean, optional): Top-level switch for CPU logging for all experiments (can be overridden by individual experiment configurations; default: false)
- `run_gpu_telemetry` (boolean, optional): Top-level switch for GPU logging for all experiments (can be overridden by individual experiment configurations; default: false)
- `telemetry_rate` (integer, optional): The interval (in seconds) at which the telemetry process collects data from `sensors` and `nvidia-smi` (e.g., setting this to 30 makes the telemetry process collect data once every 30 seconds). The default value is 15. To disable the telemetry process, set this field to a negative integer. A sample configuration is shown after this list.
- `randomize` (boolean, optional): Whether to randomize the experiment order. Defaults to true. If false, experiments will be run based on their specified priority (ties broken by lexicographic order by name).
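
For illustration, a top-level `config.json` that enables telemetry might contain entries like the following (all paths and values here are hypothetical):

```json
{
    "tmp_data_dir": "/dashboard/tmp",
    "backup_dir": "/dashboard/backups",
    "setup_dir": "/dashboard/setup",
    "run_cpu_telemetry": true,
    "run_gpu_telemetry": false,
    "telemetry_rate": 30,
    "randomize": true
}
```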

Example configurations for the dashboard and every experiment and subsystem are given in `sample-dashboard-home/`.
@@ -69,6 +72,12 @@ Experiment `config.json` files may contain, in addition to any fields specific t
- `tvm_remote` (optional, string): TVM fork to use for tvm_branch's functionality
- `tvm_branch` (optional, string): If indicated, the experiment will check out the specified branch from the `tvm_remote` repo and build that variant of TVM for the experiment
- `rerun_setup` (optional, boolean): If indicated and the experiment has a `setup.sh`, this will force the setup to be rerun regardless of whether the experiment has changed. Defaults to false.
- `process_pinning` (optional, dict): Configuration of process pinning for the experiment (see the example after this list)
  - `enable` (mandatory, boolean): Switch for process pinning
  - `cores` (mandatory, passed to `taskset`): Bitmask, CPU list, etc.; see `man taskset` for more information
- `run_cpu_telemetry` (optional, boolean): Switch for CPU logging for the current experiment. If given, this value overrides the top-level configuration for this experiment (default: the top-level value).
- `run_gpu_telemetry` (optional, boolean): Switch for GPU logging for the current experiment. If given, this value overrides the top-level configuration for this experiment (default: the top-level value).
- `telemetry_rate` (optional, integer): If given, this value overrides the interval between two consecutive data collections by the telemetry process; otherwise, the value from the top-level dashboard configuration is used.
- `priority` (optional, int): If the dashboard is not set to run experiments in random order, the priority will be used to decide the experiment ordering. If unspecified, the priority will default to 0. The highest-priority experiments will run first. Ties will be broken by lexicographic order by experiment directory name. (This mechanism is included primarily for debugging purposes, like determining if the experiment ordering affects the results. Experiments should not rely on running in any particular order, however.)
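
As an illustration, an experiment `config.json` that pins its runs to CPUs 0-3 and overrides the top-level telemetry settings might look like this (all values are hypothetical):

```json
{
    "active": true,
    "priority": 1,
    "process_pinning": {
        "enable": true,
        "cores": "0-3"
    },
    "run_cpu_telemetry": true,
    "telemetry_rate": 10
}
```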

Each script will be executed from its own directory so they don't have to use absolute addresses everywhere.
Expand Down Expand Up @@ -98,6 +107,15 @@ Subsystems will have config options as follows:

*(Meta-note: Something that became clear in the process of developing the subsystems is that the experiments themselves can be handled as a single subsystem that is configured to run first. This might reduce some duplicated logic in the core infrastructure but would take a lot of engineering effort to properly implement and may not be worthwhile.)*

### Telemetry Record
If the telemetry switch is enabled for an experiment, the telemetry process collects data from the CPU and/or GPU (as configured by the user), and the main process parses the data into JSON files (separate files for CPU and GPU) and stores them in `DASHBOARD_HOME/results/subsystem/telemetry/EXP_NAME`, where `DASHBOARD_HOME` is the user-configured home directory and `EXP_NAME` is the experiment name. For the `vis_telemetry` subsystem to work, the parsed GPU and CPU telemetry files must be in a specific format. The structure of the JSON file for GPU telemetry is:
1. A timestamp
2. Topic names mapped to an object with a `data` field and a `unit` field. The `data` field is a list of pairs whose first element is the time elapsed since the beginning of the experiment and whose second element is the value collected by the telemetry process. The `unit` field is the unit of the data; if no unit is applicable, its value is `null`. An illustrative example follows.
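
For example, a GPU telemetry file might look like the following sketch (the topic names and all values are hypothetical; actual topics depend on what is collected from `nvidia-smi`):

```json
{
    "timestamp": "01-10-2020-0000",
    "utilization.gpu": {
        "data": [[0.5, 23], [15.5, 87], [30.5, 91]],
        "unit": "%"
    },
    "fan.speed": {
        "data": [[0.5, 35], [15.5, 60], [30.5, 62]],
        "unit": "%"
    }
}
```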

The structure of the JSON file for CPU telemetry is:
1. A timestamp
2. Adapter names mapped to an object whose keys are the names of the adapter's sensors and whose values are lists of pairs, where the first element is the time elapsed since the beginning of the experiment and the second element is the value collected by the telemetry process. An illustrative example follows.
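
Similarly, a CPU telemetry file might look like this sketch (adapter and sensor names come from the machine's `sensors` output and will differ across systems; all values here are made up):

```json
{
    "timestamp": "01-10-2020-0000",
    "coretemp-isa-0000": {
        "Core 0": [[0.5, 41.0], [15.5, 63.5], [30.5, 65.0]],
        "Core 1": [[0.5, 40.0], [15.5, 62.0], [30.5, 64.5]]
    }
}
```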

## Implementation Details

### Dependencies
63 changes: 51 additions & 12 deletions dashboard/dashboard.py
@@ -11,6 +11,7 @@
from common import (check_file_exists, idemp_mkdir, invoke_main, get_timestamp,
prepare_out_file, read_json, write_json, read_config, validate_json, print_log)
from dashboard_info import DashboardInfo
from telemetry_util import start_telemetry, process_telemetry_statistics


def validate_status(dirname):
@@ -121,15 +122,22 @@ def target_precheck(root_dir, configs_dir, target_name,
return ({'success': True, 'message': ''}, target_info)


def experiment_precheck(info, experiments_dir, exp_name):
def experiment_precheck(info, experiments_dir, exp_name, default_telemetry_rate, run_cpu_telemetry, run_gpu_telemetry):
return target_precheck(
experiments_dir, info.exp_configs, exp_name,
{
'active': False,
'priority': 0,
'rerun_setup': False,
'tvm_remote': 'origin',
'tvm_branch': 'master'
'tvm_branch': 'master',
'telemetry_rate': default_telemetry_rate,
'run_cpu_telemetry': run_cpu_telemetry,
'run_gpu_telemetry': run_gpu_telemetry,
'process_pinning': {
'enable': False,
'cores': None
}
},
['run.sh', 'analyze.sh', 'visualize.sh', 'summarize.sh'])

@@ -198,7 +206,8 @@ def copy_setup(experiments_dir, setup_dir, exp_name):
cwd=exp_dir)


def run_experiment(info, experiments_dir, tmp_data_dir, exp_name):
def run_experiment(info, experiments_dir, tmp_data_dir, exp_name, pin_process=False, cores=None,
run_cpu_telemetry=False, run_gpu_telemetry=False):

to_local_time = lambda sec: time.asctime(time.localtime(sec))
exp_dir = os.path.join(experiments_dir, exp_name)
@@ -213,7 +222,11 @@ def run_experiment(info, experiments_dir, tmp_data_dir, exp_name):
start_msg = f'Experiment {exp_name} starts @ {to_local_time(start_time)}'
print_log(start_msg)
# run the run.sh file on the configs directory and the destination directory
subprocess.call([os.path.join(exp_dir, 'run.sh'), exp_conf, exp_data_dir],
if pin_process and cores:
subprocess.call(['taskset', '--cpu-list', f'{cores}', os.path.join(exp_dir, 'run.sh'), exp_conf, exp_data_dir],
cwd=exp_dir)
else:
subprocess.call([os.path.join(exp_dir, 'run.sh'), exp_conf, exp_data_dir],
cwd=exp_dir)
end_time = time.time()
delta = datetime.timedelta(seconds=end_time - start_time)
@@ -229,6 +242,8 @@ def run_experiment(info, experiments_dir, tmp_data_dir, exp_name):
status['start_time'] = to_local_time(start_time)
status['end_time'] = to_local_time(end_time)
status['time_delta'] = str(delta)
status['run_cpu_telemetry'] = run_cpu_telemetry
status['run_gpu_telemetry'] = run_gpu_telemetry
# not literally copying because validate may have produced a status that generated an error
info.report_exp_status(exp_name, 'run', status)
return status['success']
@@ -337,10 +352,10 @@ def summarize_experiment(info, experiments_dir, exp_name):
}
info.report_exp_status(exp_name, 'summary', status)


def run_all_experiments(info, experiments_dir, setup_dir,
tmp_data_dir, data_archive,
time_str, randomize=True):
time_str, telemetry_script_dir,
run_cpu_telemetry=False, run_gpu_telemetry=False, telemetry_interval=15, randomize=True):
"""
Handles logic for setting up and running all experiments.
"""
@@ -353,7 +368,8 @@ def run_all_experiments(info, experiments_dir, setup_dir,
# do the walk of experiment configs, take account of which experiments are
# either inactive or invalid
for exp_name in info.all_present_experiments():
precheck, exp_info = experiment_precheck(info, experiments_dir, exp_name)
precheck, exp_info = experiment_precheck(info, experiments_dir, exp_name, telemetry_interval,
run_cpu_telemetry, run_gpu_telemetry)
info.report_exp_status(exp_name, 'precheck', precheck)
exp_status[exp_name] = 'active'
exp_confs[exp_name] = exp_info
@@ -401,8 +417,27 @@ def run_all_experiments(info, experiments_dir, setup_dir,
tvm_hash = get_tvm_hash()

tvm_hashes[exp] = tvm_hash

success = run_experiment(info, experiments_dir, tmp_data_dir, exp)
pin_process = exp_confs[exp].get('process_pinning', None)
exp_run_cpu_telemetry = exp_confs[exp]['run_cpu_telemetry']
exp_run_gpu_telemetry = exp_confs[exp]['run_gpu_telemetry']
run_telemetry = exp_run_cpu_telemetry or exp_run_gpu_telemetry
enabled = pin_process.get('enable', False) if pin_process else False
cores = pin_process.get('cores', None) if enabled else None
telemetry_interval = exp_confs[exp].get('telemetry_rate', telemetry_interval)
telemetry_process = start_telemetry(telemetry_script_dir, exp,
exp_run_cpu_telemetry,
exp_run_gpu_telemetry,
tmp_data_dir,
interval=telemetry_interval) if run_telemetry else None
success = run_experiment(info, experiments_dir, tmp_data_dir, exp,
pin_process=enabled, cores=cores,
run_cpu_telemetry=exp_run_cpu_telemetry, run_gpu_telemetry=exp_run_gpu_telemetry)
# Telemetry can be disabled
if run_telemetry and telemetry_process:
telemetry_process.kill()
# Gather stat collected by the telemetry process
process_telemetry_statistics(info, exp, tmp_data_dir, time_str)
if not success:
exp_status[exp] = 'failed'

@@ -489,7 +524,7 @@ def run_all_subsystems(info, subsystem_dir, time_str):
success = run_subsystem(info, subsystem_dir, subsys)


def main(home_dir, experiments_dir, subsystem_dir):
def main(home_dir, experiments_dir, subsystem_dir, telemetry_script_dir):
"""
Home directory: Where config info for experiments, etc., is
Experiments directory: Where experiment implementations are
@@ -537,12 +572,16 @@ def main(home_dir, experiments_dir, subsystem_dir):
if 'randomize' in dash_config:
randomize_exps = dash_config['randomize']

telemetry_rate = dash_config.get('telemetry_rate', 15)
run_cpu_telemetry = dash_config.get('run_cpu_telemetry', False)
run_gpu_telemetry = dash_config.get('run_gpu_telemetry', False)
run_all_experiments(info, experiments_dir, setup_dir,
tmp_data_dir, data_archive,
time_str, randomize=randomize_exps)
time_str, telemetry_script_dir, run_cpu_telemetry=run_cpu_telemetry, run_gpu_telemetry=run_gpu_telemetry,
telemetry_interval=telemetry_rate, randomize=randomize_exps)

run_all_subsystems(info, subsystem_dir, time_str)


if __name__ == '__main__':
invoke_main(main, 'home_dir', 'experiments_dir', 'subsystem_dir')
invoke_main(main, 'home_dir', 'experiments_dir', 'subsystem_dir', 'telemetry_script_dir')
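
`telemetry_util.py` itself is not among the files shown in this commit view. Based purely on the call sites above, a minimal sketch of the interface `dashboard.py` relies on might look like the following (the script name, flags, and all other details below are assumptions, not the actual implementation):

```python
import os
import subprocess

def start_telemetry(script_dir, exp_name, cpu, gpu, data_dir, interval=15):
    """Launch the telemetry collector in the background and return the
    process handle so the caller can kill it when the experiment ends.
    The script name and flags here are hypothetical."""
    args = ['python3', os.path.join(script_dir, 'telemetry.py'),
            '--exp-name', exp_name,
            '--output-dir', data_dir,
            '--interval', str(interval)]
    if cpu:
        args.append('--cpu')
    if gpu:
        args.append('--gpu')
    # Popen (rather than run) keeps the collector alive while run.sh executes
    return subprocess.Popen(args)
```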
3 changes: 2 additions & 1 deletion dashboard/run_dashboard.sh
@@ -15,6 +15,7 @@ cd "$(dirname "$0")"
script_dir=$(pwd)
experiments_dir=$script_dir/../experiments
subsystem_dir=$script_dir/../subsystem
telemetry_dir=$script_dir/../telemetry
rebuild_dashboard_tvm=true
if [ "$#" -ge 2 ]; then
rebuild_dashboard_tvm="$2"
@@ -71,4 +72,4 @@ source $BENCHMARK_DEPS/bash/common.sh
include_shared_python_deps

cd $script_dir
python3 dashboard.py --home-dir "$dashboard_home" --experiments-dir "$experiments_dir" --subsystem-dir "$subsystem_dir"
python3 dashboard.py --home-dir "$dashboard_home" --experiments-dir "$experiments_dir" --subsystem-dir "$subsystem_dir" --telemetry-script-dir "$telemetry_dir"
2 changes: 1 addition & 1 deletion experiments/relay_to_vta/summarize.py
@@ -6,7 +6,7 @@
SIM_TARGETS = {'sim', 'tsim'}
PHYS_TARGETS = {'pynq'}
METADATA_KEYS = {'timestamp', 'tvm_hash',
'start_time', 'end_time', 'time_delta'}
'start_time', 'end_time', 'time_delta', 'run_cpu_telemetry', 'run_gpu_telemetry'}

def main(data_dir, config_dir, output_dir):
config, msg = validate(config_dir)
2 changes: 1 addition & 1 deletion experiments/relay_to_vta/visualize.py
@@ -15,7 +15,7 @@
'vta': 'Mobile CPU w/ FPGA'
}
METADATA_KEYS = {'timestamp', 'tvm_hash',
'start_time', 'end_time', 'time_delta'}
'start_time', 'end_time', 'time_delta', 'run_cpu_telemetry', 'run_gpu_telemetry'}

def generate_arm_vta_comparisons(data, output_prefix):
comparison_dir = os.path.join(output_prefix, 'comparison')
30 changes: 29 additions & 1 deletion shared/python/common.py
@@ -139,7 +139,8 @@ def traverse_fields(entry, ignore_fields=None):
Set ignore_fields to a non-None value to avoid the defaults.
"""
ignore_set = {'timestamp', 'tvm_hash', 'detailed',
'start_time', 'end_time', 'time_delta', 'success'}
'start_time', 'end_time', 'time_delta', 'success',
'run_cpu_telemetry', 'run_gpu_telemetry'}
if ignore_fields is not None:
ignore_set = set(ignore_fields)

@@ -192,3 +193,30 @@ def invoke_main(main_func, *arg_names):

def render_exception(e):
return logging.Formatter.formatException(e, sys.exc_info())

def process_cpu_telemetry(stat: dict) -> list:
'''
Returns a list containing a timestamp followed by tuples of the form
(adapter_name, topic_name, data_unit, list_of_data),
where `list_of_data` is a list of [time_elapsed, data] pairs.
This keeps the data consistent with the output of `process_gpu_telemetry`.
'''
result = [stat.get('timestamp')]
for adapter, stat_dict in stat.items():
if adapter != 'timestamp':
for (topic_name, data) in stat_dict.items():
result.append((adapter, f'{adapter}_{topic_name}', '', data))
return result

def process_gpu_telemetry(stat: dict) -> list:
'''
Returns a list containing a timestamp followed by tuples of the form
('GPU', topic_name, data_unit, list_of_data),
where `list_of_data` is a list of [time_elapsed, data] pairs.
This puts the data in a fixed, easily accessible form.
'''
result = [stat.get('timestamp')]
for topic, stat_dict in stat.items():
if topic != 'timestamp':
result.append(('GPU', topic, stat_dict.get('unit'), stat_dict.get('data')))
return result
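
As a usage sketch, feeding `process_gpu_telemetry` a parsed stat dictionary (the field names and values below are invented for illustration) produces the flattened form described in its docstring:

```python
gpu_stat = {
    'timestamp': '01-10-2020-0000',
    'utilization.gpu': {'unit': '%', 'data': [[0.5, 23], [15.5, 87]]},
}
print(process_gpu_telemetry(gpu_stat))
# ['01-10-2020-0000', ('GPU', 'utilization.gpu', '%', [[0.5, 23], [15.5, 87]])]
```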
22 changes: 14 additions & 8 deletions shared/python/dashboard_info.py
@@ -76,14 +76,15 @@ def __init__(self, home_dir):
# better to generate the fields and accessors than have a full file of boilerplate
# info type, system type, singular, plural
dashboard_fields = [
(InfoType.config, SystemType.exp, 'config', 'configs'),
(InfoType.results, SystemType.exp, 'data', 'data'),
(InfoType.results, SystemType.exp, 'status', 'statuses'),
(InfoType.results, SystemType.exp, 'graph', 'graphs'),
(InfoType.results, SystemType.exp, 'summary', 'summaries'),
(InfoType.config, SystemType.subsys, 'config', 'configs'),
(InfoType.results, SystemType.subsys, 'status', 'statuses'),
(InfoType.results, SystemType.subsys, 'output', 'output')
(InfoType.config, SystemType.exp, 'config', 'configs'),
(InfoType.results, SystemType.exp, 'data', 'data'),
(InfoType.results, SystemType.exp, 'status', 'statuses'),
(InfoType.results, SystemType.exp, 'graph', 'graphs'),
(InfoType.results, SystemType.exp, 'summary', 'summaries'),
(InfoType.config, SystemType.subsys, 'config', 'configs'),
(InfoType.results, SystemType.subsys, 'status', 'statuses'),
(InfoType.results, SystemType.subsys, 'output', 'output'),
(InfoType.results, SystemType.subsys, 'telemetry', 'telemetry'),
]

# we need to have a function return the lambda for proper closure behavior
@@ -134,6 +135,11 @@ def read_exp_config(self, exp_name):
def read_subsys_config(self, subsys_name):
return read_config(self.subsys_config_dir(subsys_name))

def exp_cpu_telemetry(self, exp_name):
return os.path.join(self.subsys_telemetry_dir(exp_name), 'cpu')

def exp_gpu_telemetry(self, exp_name):
return os.path.join(self.subsys_telemetry_dir(exp_name), 'gpu')

def exp_active(self, exp_name):
return self.exp_config_valid(exp_name) and self.read_exp_config(exp_name)['active']
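
The `subsys_telemetry_dir` accessor used by `exp_cpu_telemetry` above is generated from the `dashboard_fields` table rather than written by hand. A simplified, self-contained sketch of the closure pattern the in-code comment alludes to (not the actual implementation; the field tuples and path layout are stand-ins):

```python
import os

# simplified stand-ins for the (InfoType, SystemType, singular, plural) tuples
fields = [('results', 'exp', 'data'), ('results', 'subsys', 'telemetry')]

class InfoSketch:
    def __init__(self, home_dir):
        self.home_dir = home_dir
        for (info_type, sys_type, singular) in fields:
            # the helper returns the lambda so each accessor closes over its
            # own tuple; a bare lambda here would share the loop variables
            # and every accessor would point at the last field
            def make_accessor(info_type, sys_type, singular):
                return lambda name: os.path.join(self.home_dir, info_type,
                                                 sys_type, singular, name)
            setattr(self, f'{sys_type}_{singular}_dir',
                    make_accessor(info_type, sys_type, singular))

info = InfoSketch('/dashboard')
print(info.subsys_telemetry_dir('exp1'))  # /dashboard/results/subsys/telemetry/exp1
```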