[ add ] telemetry for CPU/GPU #42 & pin processes for experiments that use CPU (#48)

* [ add ] Wall clock for dashboard

* [ refactor ] dashboard / common

* [ add ] time info for failed experiments

* [ modified ] include timing information in data.json

* [ modified ] style

* [ merge request ] README.md

Co-Authored-By: Steven S. Lyubomirsky <[email protected]>

* [ modified ] do not create data.json for failed runs

* [ fix ] data.json generating logic

* [ refactor ] extract code for getting timing info

* [ modified ] code logic

* [ add ] data collector for telemetry

* [ upd ] integrate telemetry into dashboard

* [ impl ] telemetry for cpu & gpu

* [ impl ] graph generation

* [ impl ] graph generation & [ refactor ] record elapsed time

* [ impl ] pin processes for trials that use CPU

* [ add ] top-level config of telemetry

* [ modified ] handle the case where commands run inside the method take more time than expected

* [ modified ] include last run in the graphs used on the website

* [ remove ] unused lines

* [ remove ] unused lines

* [ refactor ] factor out code for telemetry process

* [ modified ] flexibility of telemetry rate for each experiment

* [ refactor ] move telemetry results to subsystem dir & timeout for telemetry process

* [ refactor ] use subprocess.run for timeout

* [ upd ] modify telemetry data directory

* [ upd ] switch for telemetry and process pinning

* [ add ] accessor for telemetry statistics

* [ remove ] logging...

* [ refactor ] use shared library

* [ modified ] avoid ignoring non-trivial exceptions

* [ modified ] separate switches for CPU and GPU

* floating point seconds

* [ add ] docs for telemetry data

Co-authored-by: Steven S. Lyubomirsky <[email protected]>
AD1024 and slyubomirsky committed Jan 10, 2020
1 parent b71d546 commit 347e2fe
Showing 11 changed files with 386 additions and 24 deletions.
18 changes: 18 additions & 0 deletions README.md
@@ -40,6 +40,9 @@ The top-level dashboard config.json may contain the following fields:
- `tmp_data_dir` (str, mandatory): Directory for storing experiment raw data (we hope to move this to cloud storage eventually), which are zipped CSV files
- `backup_dir` (str, mandatory): Directory for storing compressed copies of dashboard backups AKA dumping zip files (we hope to move this to cloud storage too)
- `setup_dir` (str, mandatory): Directory for storing persistent setup files for experiments (this probably should stay local)
- `run_cpu_telemetry` (boolean, optional): Top-level switch for CPU logging for all experiments (can be overridden by individual experiment configurations; default: false)
- `run_gpu_telemetry` (boolean, optional): Top-level switch for GPU logging for all experiments (can be overridden by individual experiment configurations; default: false)
- `telemetry_rate` (integer, optional): The interval (in seconds) at which the telemetry process collects data from `sensors` and `nvidia-smi` (e.g., setting this to 30 makes the telemetry process collect data once every 30 seconds). The default value is 15. To disable the telemetry process, set this field to a negative integer. A sample configuration is shown after this list.
- `randomize` (boolean, optional): Whether to randomize the experiment order. Defaults to true. If false, experiments will be run based on their specified priority (ties broken by lexicographic order by name).
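
For illustration, a top-level `config.json` that enables telemetry might contain entries like the following (all paths and values here are hypothetical):

```json
{
    "tmp_data_dir": "/dashboard/tmp",
    "backup_dir": "/dashboard/backups",
    "setup_dir": "/dashboard/setup",
    "run_cpu_telemetry": true,
    "run_gpu_telemetry": false,
    "telemetry_rate": 30,
    "randomize": true
}
```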

Example configurations for the dashboard and every experiment and subsystem are given in `sample-dashboard-home/`.
@@ -69,6 +72,12 @@ Experiment `config.json` files may contain, in addition to any fields specific t
- `tvm_remote` (optional, string): TVM fork to use for tvm_branch's functionality
- `tvm_branch` (optional, string): If indicated, the experiment will check out the specified branch from the `tvm_remote` repo and build that variant of TVM for the experiment
- `rerun_setup` (optional, boolean): If indicated and the experiment has a `setup.sh`, this will force the setup to be rerun regardless of whether the experiment has changed. Defaults to false.
- `process_pinning` (optional, dict): Configuration of process pinning for the experiment (see the example after this list)
  - `enable` (mandatory, boolean): Switch for process pinning
  - `cores` (mandatory, passed to `taskset`): Bitmask, CPU list, etc.; see `man taskset` for more information
- `run_cpu_telemetry` (optional, boolean): Switch for CPU logging for the current experiment. If given, this value overrides the top-level configuration for this experiment (default: the top-level value).
- `run_gpu_telemetry` (optional, boolean): Switch for GPU logging for the current experiment. If given, this value overrides the top-level configuration for this experiment (default: the top-level value).
- `telemetry_rate` (optional, integer): If given, this value overrides the interval between two consecutive data collections by the telemetry process; otherwise, the value from the top-level dashboard configuration is used.
- `priority` (optional, int): If the dashboard is not set to run experiments in random order, the priority will be used to decide the experiment ordering. If unspecified, the priority will default to 0. The highest-priority experiments will run first. Ties will be broken by lexicographic order by experiment directory name. (This mechanism is included primarily for debugging purposes, like determining if the experiment ordering affects the results. Experiments should not rely on running in any particular order, however.)
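
As an illustration, an experiment `config.json` that pins its runs to CPUs 0-3 and overrides the top-level telemetry settings might look like this (all values are hypothetical):

```json
{
    "active": true,
    "priority": 1,
    "process_pinning": {
        "enable": true,
        "cores": "0-3"
    },
    "run_cpu_telemetry": true,
    "telemetry_rate": 10
}
```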

Each script will be executed from its own directory so they don't have to use absolute addresses everywhere.
Expand Down Expand Up @@ -98,6 +107,15 @@ Subsystems will have config options as follows:

*(Meta-note: Something that became clear in the process of developing the subsystems is that the experiments themselves can be handled as a single subsystem that is configured to run first. This might reduce some duplicated logic in the core infrastructure but would take a lot of engineering effort to properly implement and may not be worthwhile.)*

### Telemetry Record
If the telemetry switch is enabled for an experiment, the telemetry process collects data from the CPU and/or GPU (as configured by the user), and the main process parses the data into JSON files (separate files for CPU and GPU) and stores them in `DASHBOARD_HOME/results/subsystem/telemetry/EXP_NAME`, where `DASHBOARD_HOME` is the user-configured home directory and `EXP_NAME` is the experiment name. For the `vis_telemetry` subsystem to work, the parsed GPU and CPU telemetry files must be in a specific format. The structure of the JSON file for GPU telemetry is:
1. A timestamp
2. Topic names mapped to an object with a `data` field and a `unit` field. The `data` field is a list of pairs whose first element is the time elapsed since the beginning of the experiment and whose second element is the value collected by the telemetry process. The `unit` field is the unit of the data; if no unit is applicable, its value is `null`. An illustrative example follows.
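
For example, a GPU telemetry file might look like the following sketch (the topic names and all values are hypothetical; actual topics depend on what is collected from `nvidia-smi`):

```json
{
    "timestamp": "01-10-2020-0000",
    "utilization.gpu": {
        "data": [[0.5, 23], [15.5, 87], [30.5, 91]],
        "unit": "%"
    },
    "fan.speed": {
        "data": [[0.5, 35], [15.5, 60], [30.5, 62]],
        "unit": "%"
    }
}
```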

The structure of the JSON file for CPU telemetry is:
1. A timestamp
2. Adapter names mapped to an object whose keys are the names of the adapter's sensors and whose values are lists of pairs, where the first element is the time elapsed since the beginning of the experiment and the second element is the value collected by the telemetry process. An illustrative example follows.
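
Similarly, a CPU telemetry file might look like this sketch (adapter and sensor names come from the machine's `sensors` output and will differ across systems; all values here are made up):

```json
{
    "timestamp": "01-10-2020-0000",
    "coretemp-isa-0000": {
        "Core 0": [[0.5, 41.0], [15.5, 63.5], [30.5, 65.0]],
        "Core 1": [[0.5, 40.0], [15.5, 62.0], [30.5, 64.5]]
    }
}
```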

## Implementation Details

### Dependencies
63 changes: 51 additions & 12 deletions dashboard/dashboard.py
@@ -11,6 +11,7 @@
from common import (check_file_exists, idemp_mkdir, invoke_main, get_timestamp,
prepare_out_file, read_json, write_json, read_config, validate_json, print_log)
from dashboard_info import DashboardInfo
from telemetry_util import start_telemetry, process_telemetry_statistics


def validate_status(dirname):
@@ -121,15 +122,22 @@ def target_precheck(root_dir, configs_dir, target_name,
return ({'success': True, 'message': ''}, target_info)


def experiment_precheck(info, experiments_dir, exp_name):
def experiment_precheck(info, experiments_dir, exp_name, default_telemetry_rate, run_cpu_telemetry, run_gpu_telemetry):
return target_precheck(
experiments_dir, info.exp_configs, exp_name,
{
'active': False,
'priority': 0,
'rerun_setup': False,
'tvm_remote': 'origin',
'tvm_branch': 'master'
'tvm_branch': 'master',
'telemetry_rate': default_telemetry_rate,
'run_cpu_telemetry': run_cpu_telemetry,
'run_gpu_telemetry': run_gpu_telemetry,
'process_pinning': {
'enable': False,
'cores': None
}
},
['run.sh', 'analyze.sh', 'visualize.sh', 'summarize.sh'])

@@ -198,7 +206,8 @@ def copy_setup(experiments_dir, setup_dir, exp_name):
cwd=exp_dir)


def run_experiment(info, experiments_dir, tmp_data_dir, exp_name):
def run_experiment(info, experiments_dir, tmp_data_dir, exp_name, pin_process=False, cores=None,
run_cpu_telemetry=False, run_gpu_telemetry=False):

to_local_time = lambda sec: time.asctime(time.localtime(sec))
exp_dir = os.path.join(experiments_dir, exp_name)
@@ -213,7 +222,11 @@ def run_experiment(info, experiments_dir, tmp_data_dir, exp_name):
start_msg = f'Experiment {exp_name} starts @ {to_local_time(start_time)}'
print_log(start_msg)
# run the run.sh file on the configs directory and the destination directory
subprocess.call([os.path.join(exp_dir, 'run.sh'), exp_conf, exp_data_dir],
if pin_process and cores:
subprocess.call(['taskset', '--cpu-list', f'{cores}', os.path.join(exp_dir, 'run.sh'), exp_conf, exp_data_dir],
cwd=exp_dir)
else:
subprocess.call([os.path.join(exp_dir, 'run.sh'), exp_conf, exp_data_dir],
cwd=exp_dir)
end_time = time.time()
delta = datetime.timedelta(seconds=end_time - start_time)
@@ -229,6 +242,8 @@ def run_experiment(info, experiments_dir, tmp_data_dir, exp_name):
status['start_time'] = to_local_time(start_time)
status['end_time'] = to_local_time(end_time)
status['time_delta'] = str(delta)
status['run_cpu_telemetry'] = run_cpu_telemetry
status['run_gpu_telemetry'] = run_gpu_telemetry
# not literally copying because validate may have produced a status that generated an error
info.report_exp_status(exp_name, 'run', status)
return status['success']
@@ -337,10 +352,10 @@ def summarize_experiment(info, experiments_dir, exp_name):
}
info.report_exp_status(exp_name, 'summary', status)


def run_all_experiments(info, experiments_dir, setup_dir,
tmp_data_dir, data_archive,
time_str, randomize=True):
time_str, telemetry_script_dir,
run_cpu_telemetry=False, run_gpu_telemetry=False, telemetry_interval=15, randomize=True):
"""
Handles logic for setting up and running all experiments.
"""
@@ -353,7 +368,8 @@ def run_all_experiments(info, experiments_dir, setup_dir,
# do the walk of experiment configs, take account of which experiments are
# either inactive or invalid
for exp_name in info.all_present_experiments():
precheck, exp_info = experiment_precheck(info, experiments_dir, exp_name)
precheck, exp_info = experiment_precheck(info, experiments_dir, exp_name, telemetry_interval,
run_cpu_telemetry, run_gpu_telemetry)
info.report_exp_status(exp_name, 'precheck', precheck)
exp_status[exp_name] = 'active'
exp_confs[exp_name] = exp_info
@@ -401,8 +417,27 @@ def run_all_experiments(info, experiments_dir, setup_dir,
tvm_hash = get_tvm_hash()

tvm_hashes[exp] = tvm_hash

success = run_experiment(info, experiments_dir, tmp_data_dir, exp)
pin_process = exp_confs[exp].get('process_pinning', None)
exp_run_cpu_telemetry = exp_confs[exp]['run_cpu_telemetry']
exp_run_gpu_telemetry = exp_confs[exp]['run_gpu_telemetry']
run_telemetry = exp_run_cpu_telemetry or exp_run_gpu_telemetry
enabled = pin_process.get('enable', False) if pin_process else False
cores = pin_process.get('cores', None) if enabled else None
telemetry_interval = exp_confs[exp].get('telemetry_rate', telemetry_interval)
telemetry_process = start_telemetry(telemetry_script_dir, exp,
exp_run_cpu_telemetry,
exp_run_gpu_telemetry,
tmp_data_dir,
interval=telemetry_interval) if run_telemetry else None
success = run_experiment(info, experiments_dir, tmp_data_dir, exp,
pin_process=enabled, cores=cores,
run_cpu_telemetry=exp_run_cpu_telemetry, run_gpu_telemetry=exp_run_gpu_telemetry)
# Telemetry can be disabled
if run_telemetry and telemetry_process:
telemetry_process.kill()
# Gather stat collected by the telemetry process
process_telemetry_statistics(info, exp, tmp_data_dir, time_str)
if not success:
exp_status[exp] = 'failed'

@@ -489,7 +524,7 @@ def run_all_subsystems(info, subsystem_dir, time_str):
success = run_subsystem(info, subsystem_dir, subsys)


def main(home_dir, experiments_dir, subsystem_dir):
def main(home_dir, experiments_dir, subsystem_dir, telemetry_script_dir):
"""
Home directory: Where config info for experiments, etc., is
Experiments directory: Where experiment implementations are
@@ -537,12 +572,16 @@ def main(home_dir, experiments_dir, subsystem_dir):
if 'randomize' in dash_config:
randomize_exps = dash_config['randomize']

telemetry_rate = dash_config.get('telemetry_rate', 15)
run_cpu_telemetry = dash_config.get('run_cpu_telemetry', False)
run_gpu_telemetry = dash_config.get('run_gpu_telemetry', False)
run_all_experiments(info, experiments_dir, setup_dir,
tmp_data_dir, data_archive,
time_str, randomize=randomize_exps)
time_str, telemetry_script_dir, run_cpu_telemetry=run_cpu_telemetry, run_gpu_telemetry=run_gpu_telemetry,
telemetry_interval=telemetry_rate, randomize=randomize_exps)

run_all_subsystems(info, subsystem_dir, time_str)


if __name__ == '__main__':
invoke_main(main, 'home_dir', 'experiments_dir', 'subsystem_dir')
invoke_main(main, 'home_dir', 'experiments_dir', 'subsystem_dir', 'telemetry_script_dir')
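
`telemetry_util.py` itself is not among the files shown in this commit view. Based purely on the call sites above, a minimal sketch of the interface `dashboard.py` relies on might look like the following (the script name, flags, and all other details below are assumptions, not the actual implementation):

```python
import os
import subprocess

def start_telemetry(script_dir, exp_name, cpu, gpu, data_dir, interval=15):
    """Launch the telemetry collector in the background and return the
    process handle so the caller can kill it when the experiment ends.
    The script name and flags here are hypothetical."""
    args = ['python3', os.path.join(script_dir, 'telemetry.py'),
            '--exp-name', exp_name,
            '--output-dir', data_dir,
            '--interval', str(interval)]
    if cpu:
        args.append('--cpu')
    if gpu:
        args.append('--gpu')
    # Popen (rather than run) keeps the collector alive while run.sh executes
    return subprocess.Popen(args)
```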
3 changes: 2 additions & 1 deletion dashboard/run_dashboard.sh
@@ -15,6 +15,7 @@ cd "$(dirname "$0")"
script_dir=$(pwd)
experiments_dir=$script_dir/../experiments
subsystem_dir=$script_dir/../subsystem
telemetry_dir=$script_dir/../telemetry
rebuild_dashboard_tvm=true
if [ "$#" -ge 2 ]; then
rebuild_dashboard_tvm="$2"
@@ -71,4 +72,4 @@ source $BENCHMARK_DEPS/bash/common.sh
include_shared_python_deps

cd $script_dir
python3 dashboard.py --home-dir "$dashboard_home" --experiments-dir "$experiments_dir" --subsystem-dir "$subsystem_dir"
python3 dashboard.py --home-dir "$dashboard_home" --experiments-dir "$experiments_dir" --subsystem-dir "$subsystem_dir" --telemetry-script-dir "$telemetry_dir"
2 changes: 1 addition & 1 deletion experiments/relay_to_vta/summarize.py
@@ -6,7 +6,7 @@
SIM_TARGETS = {'sim', 'tsim'}
PHYS_TARGETS = {'pynq'}
METADATA_KEYS = {'timestamp', 'tvm_hash',
'start_time', 'end_time', 'time_delta'}
'start_time', 'end_time', 'time_delta', 'run_cpu_telemetry', 'run_gpu_telemetry'}

def main(data_dir, config_dir, output_dir):
config, msg = validate(config_dir)
2 changes: 1 addition & 1 deletion experiments/relay_to_vta/visualize.py
@@ -15,7 +15,7 @@
'vta': 'Mobile CPU w/ FPGA'
}
METADATA_KEYS = {'timestamp', 'tvm_hash',
'start_time', 'end_time', 'time_delta'}
'start_time', 'end_time', 'time_delta', 'run_cpu_telemetry', 'run_gpu_telemetry'}

def generate_arm_vta_comparisons(data, output_prefix):
comparison_dir = os.path.join(output_prefix, 'comparison')
30 changes: 29 additions & 1 deletion shared/python/common.py
@@ -139,7 +139,8 @@ def traverse_fields(entry, ignore_fields=None):
Set ignore_fields to a non-None value to avoid the defaults.
"""
ignore_set = {'timestamp', 'tvm_hash', 'detailed',
'start_time', 'end_time', 'time_delta', 'success'}
'start_time', 'end_time', 'time_delta', 'success',
'run_cpu_telemetry', 'run_gpu_telemetry'}
if ignore_fields is not None:
ignore_set = set(ignore_fields)

@@ -192,3 +193,30 @@ def invoke_main(main_func, *arg_names):

def render_exception(e):
return logging.Formatter.formatException(e, sys.exc_info())

def process_cpu_telemetry(stat: dict) -> list:
'''
Returns a list containing a timestamp followed by tuples of the form
(adapter_name, topic_name, data_unit, list_of_data),
where `list_of_data` is a list of [time_elapsed, data] pairs.
This keeps the data consistent with the output of `process_gpu_telemetry`.
'''
result = [stat.get('timestamp')]
for adapter, stat_dict in stat.items():
if adapter != 'timestamp':
for (topic_name, data) in stat_dict.items():
result.append((adapter, f'{adapter}_{topic_name}', '', data))
return result

def process_gpu_telemetry(stat: dict) -> list:
'''
Returns a list containing a timestamp followed by tuples of the form
('GPU', topic_name, data_unit, list_of_data),
where `list_of_data` is a list of [time_elapsed, data] pairs.
This puts the data in a fixed, easily accessible form.
'''
result = [stat.get('timestamp')]
for topic, stat_dict in stat.items():
if topic != 'timestamp':
result.append(('GPU', topic, stat_dict.get('unit'), stat_dict.get('data')))
return result
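
As a usage sketch, feeding `process_gpu_telemetry` a parsed stat dictionary (the field names and values below are invented for illustration) produces the flattened form described in its docstring:

```python
gpu_stat = {
    'timestamp': '01-10-2020-0000',
    'utilization.gpu': {'unit': '%', 'data': [[0.5, 23], [15.5, 87]]},
}
print(process_gpu_telemetry(gpu_stat))
# ['01-10-2020-0000', ('GPU', 'utilization.gpu', '%', [[0.5, 23], [15.5, 87]])]
```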
22 changes: 14 additions & 8 deletions shared/python/dashboard_info.py
@@ -76,14 +76,15 @@ def __init__(self, home_dir):
# better to generate the fields and accessors than have a full file of boilerplate
# info type, system type, singular, plural
dashboard_fields = [
(InfoType.config, SystemType.exp, 'config', 'configs'),
(InfoType.results, SystemType.exp, 'data', 'data'),
(InfoType.results, SystemType.exp, 'status', 'statuses'),
(InfoType.results, SystemType.exp, 'graph', 'graphs'),
(InfoType.results, SystemType.exp, 'summary', 'summaries'),
(InfoType.config, SystemType.subsys, 'config', 'configs'),
(InfoType.results, SystemType.subsys, 'status', 'statuses'),
(InfoType.results, SystemType.subsys, 'output', 'output')
(InfoType.config, SystemType.exp, 'config', 'configs'),
(InfoType.results, SystemType.exp, 'data', 'data'),
(InfoType.results, SystemType.exp, 'status', 'statuses'),
(InfoType.results, SystemType.exp, 'graph', 'graphs'),
(InfoType.results, SystemType.exp, 'summary', 'summaries'),
(InfoType.config, SystemType.subsys, 'config', 'configs'),
(InfoType.results, SystemType.subsys, 'status', 'statuses'),
(InfoType.results, SystemType.subsys, 'output', 'output'),
(InfoType.results, SystemType.subsys, 'telemetry', 'telemetry'),
]

# we need to have a function return the lambda for proper closure behavior
@@ -134,6 +135,11 @@ def read_exp_config(self, exp_name):
def read_subsys_config(self, subsys_name):
return read_config(self.subsys_config_dir(subsys_name))

def exp_cpu_telemetry(self, exp_name):
return os.path.join(self.subsys_telemetry_dir(exp_name), 'cpu')

def exp_gpu_telemetry(self, exp_name):
return os.path.join(self.subsys_telemetry_dir(exp_name), 'gpu')

def exp_active(self, exp_name):
return self.exp_config_valid(exp_name) and self.read_exp_config(exp_name)['active']
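
The `subsys_telemetry_dir` accessor used by `exp_cpu_telemetry` above is generated from the `dashboard_fields` table rather than written by hand. A simplified, self-contained sketch of the closure pattern the in-code comment alludes to (not the actual implementation; the field tuples and path layout are stand-ins):

```python
import os

# simplified stand-ins for the (InfoType, SystemType, singular, plural) tuples
fields = [('results', 'exp', 'data'), ('results', 'subsys', 'telemetry')]

class InfoSketch:
    def __init__(self, home_dir):
        self.home_dir = home_dir
        for (info_type, sys_type, singular) in fields:
            # the helper returns the lambda so each accessor closes over its
            # own tuple; a bare lambda here would share the loop variables
            # and every accessor would point at the last field
            def make_accessor(info_type, sys_type, singular):
                return lambda name: os.path.join(self.home_dir, info_type,
                                                 sys_type, singular, name)
            setattr(self, f'{sys_type}_{singular}_dir',
                    make_accessor(info_type, sys_type, singular))

info = InfoSketch('/dashboard')
print(info.subsys_telemetry_dir('exp1'))  # /dashboard/results/subsys/telemetry/exp1
```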