EuroSys'22 Paper Reviews and Rebuttal
Paper #207 Unicorn: Reasoning about Configurable System Performance through the lens of Causality
- Updated: 18 Jan 2022 6:02:25pm CET
Since configuration options can have a huge effect on performance, this paper introduces causal performance profiling to systematically search across the configuration space and provide causal performance models. These models are accurate, transfer to different execution environments, and scale well. The Unicorn system is thoroughly evaluated on many real-world systems and compares well against existing state-of-the-art performance debugging techniques.
- Comprehensive evaluation and comparison with related systems, across many applications and system stacks.
- Causality-based performance debugging is more principled and structured than ad-hoc black-box methods.
- The solution is transferable to a good degree across environments and scalable w.r.t. the number of configuration parameters and dimensions.
- Not sure of novelty w.r.t. other causal analysis work such as CADET (NeurIPS Sys-ML 2020).
- Is the input/workload exactly the same across all trials (i.e., is Unicorn an offline solution)? If so, why are multiple trials needed for the same configuration? (Stage 2 in Section 3)
- Are the configuration knobs assumed to be all independent, or can there be constraints among them? E.g., one cannot increase swappiness if there is no swap space at all.
- What is the structure, depth, and "complexity" of the causal models? In the example in Fig. 14, it seems one parameter dominates. I assume that such single-parameter misconfigurations would be the common case?
Causal performance debugging is a powerful technique to reason about and diagnose performance problems. This paper applies causality to the problem of performance being affected by system configuration knobs. Since this is a large search space, the principled methods developed in this paper help reason about which combination of knobs is most effective. The resultant system, UNICORN, is a practical, powerful tool for understanding and diagnosing the performance of modern software systems.
The experimental evaluation is thorough, leaves little doubt, and compares many related systems such as CBI, delta debugging, EnCore, BugDoc, SMAC, and PESMO.
The transferability experiment result shows that a causal performance model developed for one system can be used on another system too. The paper applies a model developed for one hardware platform to another similar one: pretty impressive!
Novelty and related work:
- Given the vast number of techniques in causal performance analysis, some more qualitative comparison with systems such as Coz would be appreciated.
- Statistical debugging is dismissed by claiming that it "may produce correlated predicates that lead to incorrect explanations". This should be justified and explained.
The system design and implementation (Section 3) is too short and missing too many details; it reads like a "black box": configuration options and performance measurements go into a standard causal-model toolbox, and the causal model comes out.
Writing and structure:
- [minor] The abstract seems too long
- Some subtle grammar issues
- Fig 10: Skeleton spelling
- Stage vs. Phase inconsistency in Section 3
- Text seems repetitive across abstract, introduction, and sections 2 and 3. Specifically, the motivation around causality-based performance analysis.
- Figure 13 seems unreferenced. SHD not defined? It seems important since it compares the obtained causal model vs. ground truth.
I have read the response carefully and taken it into account.
- Strong but narrow appeal. Thought provoking only for people already working in this particular topic.
- Good. The evidence is not bullet-proof but is acceptable for papers in the area.
- Adequate
- Weak accept (OK paper, but I am not enthusiastic)
• The paper targets performance degradation in systems, specifically focusing on misconfigurations as a cause of this degradation. The main proposal is to use causal models to identify and correct the misconfigurations.
• Relying on performance models alone is not sufficient to correct degradation in system performance. Correlation-based methods cannot predict unseen environments reliably, and they can produce incorrect explanations. Instead, the paper suggests incorporating causality to ensure consistent explanations that work even in a new environment.
• It introduces Causal Performance Models, which combine causality with performance modelling for computer systems to identify misconfigurations and correct them.
• Well-written paper that motivates and pushes for causal modelling to debug system performance.
• A limited theoretical contribution. The contribution is using existing causality modelling tools to model and explore interventions to correct performance degradation in a system case study. It is unclear how this will generalise, how to apply it to other case studies, or what the reasoning is behind the causal model specified as a prior.
• The evaluation section is weak for the field (see extended comments): the baseline choice is minimal, a questionable metric is reported, and no established benchmark is used.
• Missing several comparisons to related work such as Causal Bayesian Optimisation.
• What limitations do causal models have? Any increase in prediction time compared to standard models? Can they model every type of distribution?
• Can you provide more insight on the type and the range of the variables that were intervened on and how the causal model was derived?
• How would these models explore better configurations in new hardware? It is also unclear what defines a bad configuration that triggers "bug fixing"/optimisation.
The paper advocates using causal modelling and combining it with performance modelling to derive a causal performance model. These models are used for debugging (and optimising) system faults. The idea is well-motivated and solid. Causal graphs provide an easy way to inject expert knowledge into the system, and combining them with causal testing ensures the causal graph reflects the underlying system and learns from its observations. Pushing for causality is a compelling direction. However, the paper would benefit from a better evaluation and a comparison to more recent and relevant work, and it falls slightly short in terms of novelty.
The contribution itself feels very slim. The work uses out-of-the-box causal modelling tools to model the system's performance and uses do-calculus to predict the impact of interventions. Neither is a new contribution. Applying these tools to a systems context has a limited theoretical contribution, since the paper does not motivate why system problems are different from typical causal modelling problems, nor state the limitations of the out-of-the-box tools that this paper corrects. Fleshing out the evaluation section could help sell the idea further.
First, Figure 6 reports the minimum of the p99 latency improvement, which is already the very tail end of the latency distribution; the minimum is not very meaningful there. Often in the optimisation literature, the median (with min/max) is provided to better grasp the algorithm's stability. The minimum is bounded by several factors that will not show the instability of the application.
Secondly, the evaluation does not use an established benchmark for evaluating the system when several benchmarks exist. UNICORN is a debugging/optimisation tool, and therefore, not using an established benchmark is very concerning.
Finally, the configuration space that is being corrected using UNICORN is unclear. Please report the configurations and their ranges since different approaches yield different results depending on the nature of the configuration space.
To draw a better comparison, please compare against the state of the art in optimisation techniques, especially Bayesian Optimisation [2], which has been shown to be very efficient and is very much the state of the art. Consider methods such as BOHB, BoTorch, TPE, and Causal Bayesian Optimisation [3] (especially the latter, which uses a Gaussian process for the do-calculus and has ideas very similar to UNICORN).
Please cite CADET [1] and contrast UNICORN with it.
Cosmetic: Line 792 CausalML is overflowing out of the column, same with line 1228.
[1] Krishna, R., Iqbal, M.S., Javidian, M.A., Ray, B. and Jamshidi, P., 2020. CADET: Debugging and Fixing Misconfigurations using Counterfactual Reasoning. arXiv preprint arXiv:2010.06061.
[2] Shahriari, B., Swersky, K., Wang, Z., Adams, R.P. and De Freitas, N., 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), pp.148-175.
[3] Aglietti, V., Lu, X., Paleyes, A. and González, J., 2020. Causal Bayesian Optimization. In International Conference on Artificial Intelligence and Statistics (pp. 3155-3164). PMLR.
- Weak. May provoke some new thoughts, but not many (even for people already working in this topic).
- Marginal. The paper presents weak evidence to demonstrate its main claims.
- Well-written
- Weak reject (This paper should be rejected, but I'll not fight strongly)
Configurations make a huge difference on the performance of large-scale software systems. The paper proposes UNICORN, a framework that uses causal models to reason about the impact of configurations on large software systems. Crucially, the paper argues that causal models are better suited for performance predictions (re: root-cause debugging, as well as optimizing to find better config options) than regression-based models. The paper outlines the design of UNICORN, and performs an extensive evaluation of the system against similar tools from the literature.
Relevant problem for real systems; well-explained methodology, and quite thorough evaluation with good results compared to similar existing systems from literature.
Some parts of the paper could use a bit of work in the writing for a system audience, as well as some details about the experimental setup.
Thank you for your submission to Eurosys! I found your paper quite an interesting read, and enjoyed the thorough evaluation you performed of your system. A few thoughts:
- Could you please provide more details about your experimental platforms? You mention TX1, TX2, and XAVIER throughout, but I can't find a detailed description of their hardware specs. This would be important for readers to understand (particularly the differences between the platforms) re: cache sizes, memory size, architectural differences, etc. This is especially true as you study generalization between the platforms.
- I would maybe spend a bit more time describing graphical models and how they model distributions interacting together for a systems audience without a ton of background in ML. Giving people a better intuitive understanding of how/why graphical models can provide causal reasoning is quite important for people to understand the significance of the approach.
- In Section 5 (page 9, ~ line 925) you talk about "root cause". Which root cause is this? The config parameter with the biggest contribution? Performance metric values are a continuum based on config options; there is often no single culprit of a perf problem, but a combination of many things.
- While I think it's interesting that you compared generalization across hardware platforms, I'm more curious about the limitations of your approach. When does causal reasoning with UNICORN stop performing so well? What are the edge cases? What are the difficulties of the general approach? How far does this methodology scale? It's great that you looked at it up to > 10k options, but what about 100k (for example...)? Is that even feasible?
- Page 10, line 1039, "... reaches near-optimal configuration..."; what is this optimal w.r.t.? The best-found configuration from your dataset? Did you run an exhaustive search to find the absolute best configuration?
- Why do the lines for Unicorn in Figure 17 stop after only a few samples? I get that Unicorn is already outperforming the other baselines, but it would be useful/interesting to see what kind of performance it could achieve with a similar number of samples.
- I found Figure 5 to be quite difficult to understand and draw anything from (and I can't easily find if/where it's referenced in the text). I'm guessing the high-level point is the coefficient difference between the source and target models; however, without any real idea of how the coefficients affect the output of the model, it's hard to draw anything meaningful from it other than "there is variation"; you could capture the same info by quoting diffs in text or something similar.
- I also found Figure 11 to not add much to the story. My takeaway is that over time your algo gets better at minimizing both latency and energy (presumably as you get a better handle on the distributions and you move from an exploration phase to more of an exploitation phase). You get the same takeaway from some of the graphs in your evaluation section later on. The blue+yellow+red plot does not provide much information though (and is visually difficult to read at the beginning since the x axes don't match, and even in color I found the red squares impossible to find; I thought you were talking about the red edges in the figure underneath). I would reclaim the page space and perhaps make better use of it.
- Ditto with Figure 9; the idea of co-optimizing 2 objectives at the same time doesn't seem difficult to understand; I'm not sure the cartoon-y diagram helps much in driving that point home.
- Perhaps this is a bit out of scope for your paper, but could you maybe also comment on how your causal model approach compares to queueing-theoretic performance model approaches? (particularly w.r.t. answering "what if?" type questions)
- Re: understanding perf (as stated in Fig. 1 as a possible question), could you comment further on how this could be extracted from your models? Would this be something similar to the ACE %s from Figure 14?
- The figures throughout the paper need some re-arranging to work better with the text; I had to read a print copy of the paper and follow along with the figures on my monitor to avoid flipping back and forth every 2 sentences.
- Moderately thought-provoking to a wide audience. Many conference attendees will be glad that they saw this paper.
- Good. The evidence is not bullet-proof but is acceptable for papers in the area.
- Adequate
- Weak accept (OK paper, but I am not enthusiastic)
Existing performance models that rely on predictive ML models suffer from high cost (they require a large number of configuration samples for accurate predictions) and unreliable predictions (they do not transfer well for predicting performance behavior in a new environment).
Unicorn is a new methodology that initially learns a Causal Performance Model to reliably capture intricate interactions between options across the software-hardware stack, then uses it to explain how such interactions causally impact the variation in performance. Unicorn iteratively updates the learned performance model by estimating the causal effects of configuration options, then selecting the highest-impact options.
Unicorn is evaluated on six highly configurable systems, including three on-device ML systems. It is compared with state-of-the-art configuration optimization and debugging methods.
(+) It provides a strong motivation showing the limitations of regression-based models.
(+) The use of three on-device ML systems as examples is exciting, given the depth of the stack and the multi-layer configuration.
(+) Source code is available.
(+) Deep evaluation --- effectiveness, transferability, scalability; 2000 samples, 15 deployment settings, 5 systems, 3 hardware platforms.
(-) The structural learning is a bit sketchy, but I think this is the most important part of the paper.
(-) While the paper has a deep evaluation, it only shows one case study (Section 4).
- On page 4: "performance influence models could not reliably predict performance in unseen environments" -- This is true if the model does not take the environmental settings as the input. If we take extra information as inputs, would the statement stay true?
- The structural learning is a bit sketchy, but I think this is the most important part of the paper. You also mention that Unicorn is like a human-in-the-loop approach, but at times I am confused where the human part plays a role here. Who decides what all the nodes are? After we decide all the nodes, will the FCI method automatically prune the causal structure?
This is a great systems paper; it checks all the boxes. There are just a few things that are a bit unclear.
Figure 2 is a great example to motivate the paper. It makes sense that different policies will result in different scatter plots. However, combining data from different policies and making a general correlation based on that is a wrong practice. I wonder if there are other more subtle examples that show the limitation of regression-based models within a ‘clear’ dataset that is not mixed with different settings.
Figures 4 and 5 need a friendlier explanation for non-ML experts.
Figure placement can be improved.
On page 4: “ performance influence models could not reliably predict performance in unseen environments” -- This is true if the model does not take the environmental settings as the input. If we take extra information as inputs, would the statement stay true?
Going back to the example in Figure 2, the incorrect correlation in Fig 2a happens because it does not incorporate the cache policy as input.
In Figure 7, I also personally like explainable models that are more structural, as opposed to deep models that take all possible inputs. You mention that each function node can be a polynomial model or any other functional node such as a neural network. I wonder, though, whether there could be an argument that we don't need the causal relations, because a stronger, deeper (single) model that again incorporates all necessary inputs might alleviate the need for a two-layer learning scheme like what you propose.
My understanding is that Unicorn is like a two-level learning approach, where the first level is the structure learning using Fast Causal Inference and the second is the functional-node-level learning. Is this correct?
The structural learning is a bit sketchy, but I think this is the most important part of the paper. You also mention that Unicorn is like a human-in-the-loop approach but at times I am confused where the human part plays a role here. Who decides what all the nodes are? After we decide all the nodes, will the FCI method automatically prune the causal structure?
Can you show a real output of the FCI method? I’d like to see all the nodes and the edges connecting the nodes. While the figures show structure that is easily readable, I wonder if a real system can have such a simple, clear structure.
If there are important steps where the human is in the loop, please make sure to put “Operator” as the subject, as opposed to using “Unicorn” as the subject all the time.
Causal model update (Figure 12) -- same thing here. Who does the update? Human? FCI?
Figure 11 -- great illustration.
Case study -- only one?
- Moderately thought-provoking to a wide audience. Many conference attendees will be glad that they saw this paper.
- Good. The evidence is not bullet-proof but is acceptable for papers in the area.
- Well-written
- Accept (Good paper, I will advocate for it)
- Updated: 10 Jan 2022 8:50:25pm CET
This paper focuses on soft bugs in software systems: those that happen due to misconfigurations and don't necessarily show up as crashes but as performance faults. Contemporary systems have a large space of configurations, and these configurations for different components of a system interact with each other in complex ways, making it challenging to understand and resolve such faults. This paper proposes to build causal models through UNICORN that learn these interactions of configurations and their impact on performance, and help debug such performance or non-functional faults.
- The paper studies an important problem: configuring large systems is challenging, and it is only getting more so with the increasing complexity of contemporary systems.
- The paper is also well-written and easy to follow.
- The evaluation shows the generalizability of the approach, which is very important for applying the proposed framework in various contexts.
- I am not super familiar with the work in causal modeling, but reading the paper I wasn't sure how the modeling used in UNICORN differs from existing causal models, such as, for instance, Dynamic Causal Bayesian Optimization from NeurIPS 2021.
- While generalizability to unseen environments was studied in the paper, I wasn't sure how changes in the systems over time (software upgrades, hardware upgrades, or application-level knobs) can be handled by UNICORN.
- Could you please clarify the contributions w.r.t. the prior work in causal modeling? This will definitely help in placing the work in that context.
- I wasn't clear about how UNICORN can handle various changes over time, including software/hardware upgrades and changes in application-level knobs or requirements. Do these models need to be retrained? Will it need a complete set of new data to retrain or update the models? How much overhead would that be, in terms of time and the cost of generating this data and training the models?
Thank you for submitting your work to EuroSys'22. The paper studies an interesting and important problem. The paper is well-written!
The use of causality in the proposed way to understand and resolve performance issues in systems makes a lot of sense, and raising causality to be a first-class citizen is definitely a convincing argument. However, I am not very familiar with the space of causal models, and hence was looking to understand the contributions in the context of existing work. The paper in its current form doesn't explicitly explain this aspect. Please consider clarifying the novelty in this respect.
I was also not sure how UNICORN can handle changes in systems over time: as applications and underlying systems continue to evolve, what parts/stages/phases of UNICORN will need to be redone? How can we assess when such changes are needed and how expensive the repeated effort is? While the generalizability to different environments sheds some light on this, it wasn't particularly clear how UNICORN reacts to such changes in the systems over time. Please clarify.
In terms of writing, while the paper is well-written, I had a bit of a hard time understanding the description of Causal Performance Models in Section 2, which is key to the paper. Maybe simplify this for better readability.
Finally, I was hoping to find any limitations noted about the proposed approach. I believe handling changes in the systems over time will be a major one if, in its current form, UNICORN isn't able to accommodate such changes. But other than that, I was wondering how time-consuming stage 1 is, given that it is currently manual. The paper does mention that there are ways to automate this, but I wonder how effective this automation would be. Please comment on this.
Thank you for responding to the questions! I have read the response carefully and taken it into account.
- Moderately thought-provoking to a wide audience. Many conference attendees will be glad that they saw this paper.
- Good. The evidence is not bullet-proof but is acceptable for papers in the area.
- Adequate
- Weak accept (OK paper, but I am not enthusiastic)
- Updated: 20 Jan 2022 4:31:22am CET
The paper introduces a framework, Unicorn, that uses causality in building performance models (predicting end-to-end performance based on configurations). Instead of learning a performance model end-to-end, Unicorn first generates a causal performance model in the form of a graph structure that learns how different intermediate features (e.g., performance counters) affect the end-to-end metrics that the model aims to predict or optimize. By doing so, the framework significantly outperforms a range of prior methods.
- The idea of applying causality in a performance modeling context is powerful and could have large impact. A large portion of ML for systems work reduces to such models.
- Unicorn substantially outperforms a number of baselines of different types.
- The evaluation is thorough and evaluates a range of different trade-offs.
- The paper is very dense and sometimes a bit difficult to follow. In particular, some of the figures (e.g., Figure 9) are very difficult to understand due to being very small and containing a large amount of information.
I liked this paper! There has been a large amount of work on ML for performance modeling and, more broadly, ML for auto-tuning and choosing configuration parameters. These approaches generally have problems with transferability, requiring them to acquire many new samples when faced with a new scenario. They are also not explainable.
Unicorn addresses these challenges by building a causal model. The key idea is to not only record configuration parameters and end-to-end metrics but also latent variables such as program counters. Armed with this information, Unicorn applies existing causal modeling tools to learn a graph formulation that learns the connection between these different variables. This makes the resulting model more transferable and explainable.
There is a lot I like about this approach. It makes some inroads into the long-standing problem that learned systems models are often insufficiently explainable. I think this idea has applicability beyond performance models, and this paper could spark a substantial amount of follow-on work.
I also like that the paper has a very thorough evaluation. It creates a large dataset by automatically injecting faults and then compares to four different baseline methods that range from delta debugging to correlation analysis to Bayesian optimization. This is a strong comparison since each of these approaches has different strengths and weaknesses, and Unicorn outperforms them all.
There is some room for improvement in terms of clarity: I had to re-read some parts of the paper since it is very dense and some of the figures are not very clear (e.g., Figure 9). I think an editing pass to simplify some of the language and figures could be helpful here. To give one example, line 140 includes a formula but does not explain what it means (or what the different variables stand for); explaining the intuition without the formula would have made the paragraph cleaner. This applies throughout the paper.
Overall, however, these are really just small issues. I am very positive about the paper overall and think it could have significant impact.
Post-rebuttal update: I have read the response carefully and taken it into account.
- Very thought-provoking to a wide audience. This paper will create a buzz: most people will be talking about it during the session breaks.
- Outstanding. The paper presents strong evidence (experimental data or proofs) to support its main claims.
- Needs improvement
- Accept (Good paper, I will advocate for it)
R-Revision Response by Md Shahriar Iqbal [email protected] (2179 words)
######## QUESTIONS ########
We thank the reviewers for their feedback. In the following, in addition to answering reviewers’ questions, we address some concerns that further clarify our work.
These are multiple measurements of the same configuration, taken to remove the threat to the validity of results from measurement noise, which typically arises due to various system events and/or sensor interactions [Grebhahn2019, Guo2013, Jamshidi2016, Nair2017].
UNICORN captures the constraints (and interactions) among configuration options—we characterize them as structural constraints, and a violation will result in an invalid configuration, preventing it from being measured (line-492).
In theory, causal model discovery is an NP-hard problem [Chickering2004]. Its complexity is bounded by the largest degree [Spirtes2000], so the algorithms tend to converge asymptotically in polynomial time. Let k be the maximal degree of any vertex and n be the number of vertices; the number of conditional independence tests required in the worst case grows with n and k.
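For reference, the bound commonly cited for PC/FCI-style constraint-based discovery [Spirtes2000] is, in this notation (a standard result from that literature, not a formula specific to UNICORN):

$$2\binom{n}{2}\sum_{i=0}^{k}\binom{n-1}{i} \;\le\; \frac{n^{2}(n-1)^{k-1}}{(k-1)!}$$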
We found 411/494 misconfigurations resulting from incorrectly setting five or more configuration parameters/knobs (lines:924-927). While a single configuration parameter may be dominant (higher ACE value, see Figure 14), this is uncommon. It is also crucial to identify the interactions between parameters to fix misconfigurations. Such interactions not only happen between software options but also between software-hardware knobs (more difficult to detect), which motivates our cross-stack solution.
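For clarity, the average causal effect (ACE) referred to above is the standard interventional contrast; in our notation (which may differ slightly from the paper's), for a configuration option $X$ and a performance objective $Y$:

$$\mathrm{ACE}(X \to Y) \;=\; \mathbb{E}\big[Y \mid do(X = x_1)\big] \;-\; \mathbb{E}\big[Y \mid do(X = x_0)\big]$$

where $x_1$ and $x_0$ are the two option values being compared.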
Existing out-of-the-box causal graph discovery algorithms like FCI remain ambiguous when data is insufficient and return partially directed edges. For highly configurable systems, gathering high-quality data is challenging. To address this issue, we develop a novel pipeline for causal model discovery by combining FCI with entropic causality, an information-theoretic approach to causality [Kocaoglu2017] that takes the direction across which the entropy is lower as the causal direction. Such an approach helps reduce ambiguity and thus allows the graph to converge faster. Note that estimating a theoretical guarantee for convergence is out of scope, as having a global view of the entire configuration space is infeasible. Moreover, the presence of too many confounders can affect the correctness of the causal models, and this error may propagate along the structure if the dimensionality is high. Therefore, we use a greedy refinement strategy to update the causal graph incrementally with more samples; at each step, the resultant graph may be approximate and incomplete, but asymptotically it will be refined to its correct form given enough time and samples.
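To make the edge-orientation step concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of how an edge left ambiguous by FCI could be oriented with an entropy-based criterion in the spirit of [Kocaoglu2017]: estimate the exogenous-noise entropy for each direction via the greedy minimum-entropy coupling of the conditional distributions, and keep the lower-scoring direction. Function and variable names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def greedy_min_entropy_coupling(dists, eps=1e-12):
    """Greedy approximation of the minimum-entropy coupling of several
    discrete distributions (in the spirit of Kocaoglu et al., 2017)."""
    dists = [np.array(d, dtype=float) for d in dists]
    masses = []
    remaining = 1.0
    while remaining > eps:
        idx = [int(np.argmax(d)) for d in dists]   # largest remaining mass per distribution
        m = min(d[i] for d, i in zip(dists, idx))
        if m <= eps:
            break
        masses.append(m)
        for d, i in zip(dists, idx):
            d[i] -= m                              # consume that mass jointly
        remaining -= m
    return entropy(masses)

def direction_score(x, y):
    """Score the hypothesis x -> y as H(X) plus the (approximate) entropy
    of the exogenous noise needed to generate y from x."""
    values = np.unique(x)
    conditionals = []
    for v in values:
        _, counts = np.unique(y[x == v], return_counts=True)
        conditionals.append(counts / counts.sum())
    p_x = np.array([(x == v).mean() for v in values])
    return entropy(p_x) + greedy_min_entropy_coupling(conditionals)

def orient_ambiguous_edge(x, y):
    """Orient an edge FCI left undirected toward the lower entropic score."""
    return "x -> y" if direction_score(x, y) <= direction_score(y, x) else "y -> x"
```

The inputs are NumPy arrays of discrete observations for the two variables joined by the ambiguous edge; the real pipeline additionally has to handle continuous counters (e.g., via discretization), which this sketch omits.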
Depending on the model size, the prediction time may increase compared with polynomial regression models. We observed that discovering performance regression models takes, on average, 1.3 times less time than discovering causal performance models (CPMs) while keeping the maximum degree of interactions the same across both. However, the difference in model discovery time is negligible compared with the cost of measuring a single configuration (for tasks like debugging with CPMs).
Causal models are essentially composed of multiple models (one for each intervention). Therefore, they are capable of modeling the observational distribution as well as interventional distributions.
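As a standard illustration (notation ours, not taken from the paper), an interventional distribution can be computed from observational data via back-door adjustment once a valid adjustment set $Z$ (e.g., system events that confound an option $X$ and an objective) has been identified from the causal graph:

$$P\big(\text{latency} \mid do(X = x)\big) \;=\; \sum_{z} P\big(\text{latency} \mid X = x, Z = z\big)\, P(Z = z)$$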
All the configuration options, their values, and a walkthrough example of discovering causal models are provided in the appendix (link at line 197: appendix_unicorn.pdf).
The configuration options remain the same when an environment changes (only the ranges of values change). Causal mechanisms are invariant across multiple environments [Mitrovic2020]. Therefore, the whole causal mechanism will not change; there are still commonalities. The CPM discovered in one environment can be reliably used to explore better configurations in another using the active learning approach in UNICORN.
We could learn multiple performance influence models, one for each environment, but this has drawbacks. First, we would need to maintain multiple models as opposed to a single unified model such as a CPM. Second, there is no synergy between the performance models, which increases the cost of learning, requires many samples, and makes the approach not scalable to large configuration spaces. A CPM, however, allows for synergy between multiple environments; essentially, the combination of multiple models into one unified CPM enables more accurate models with fewer samples, as shown in the experimental results.
CPMs can be corrected by humans via feedback, and such feedback has been used in other contexts in causality. We incorporate such feedback into the causal models via external constraints, e.g., independence of configuration options.
See B1
If reused, no further training is needed. Figure 18 shows an example where a causal model developed on Xavier is reused on TX2 with no additional retraining cost and achieves a 69% gain. Another common strategy is to allocate a small budget to update the causal model. In this paper, we updated the causal model from Xavier with 25 additional samples in TX2; this required 18 minutes of additional training and achieved an 81% gain. A causal model developed from scratch in TX2 requires 36 minutes to achieve an 83% gain. Therefore, updating the causal model saved 18 minutes while sacrificing only 2% gain.
######## REMAINING RESPONSES ########
CADET is a performance debugging approach, whereas UNICORN can be used as a central tool for multiple performance tasks such as performance optimization, debugging misconfigurations, etc. They also differ in the causal model discovery approach: CADET uses a combination of the FCI and NOTEARS algorithms, whereas UNICORN combines FCI with an information-theoretic approach [Kocaoglu2017] that uses the entropy of the causal direction for correctness. There are several major differences in the experimental evaluations too. In particular, CADET only evaluates the accuracy of finding performance issues and the gains from the fixes in performance debugging tasks, while UNICORN evaluates accuracy and performance gain on both performance debugging and performance optimization tasks. In addition, UNICORN evaluates the transferability of the CPMs across environments and demonstrates scalability to exponentially large configuration spaces.
A4 (Comparison with Coz): We think that UNICORN and Coz are complementary approaches. UNICORN is targeted at highly configurable systems, where the root causes of performance issues arise from interactions across the system stack, mainly between the software and the deployment environment. In addition, UNICORN is used by users of a system who might not have access to the code but are interested in resolving performance issues on their own by changing the configuration of the system. In these scenarios, we assume that the user intends to resolve the performance issues without much reliance on the developers. In fact, developers might not have direct access to the users' deployment environments, and in some scenarios developers cannot disturb certain environments, such as production. Given these differences (i.e., highly configurable systems across the stack and users vs. developers), after discussing this with the authors of Coz, we came to the conclusion that such a comparison may not provide additional insights.
See B1 and A0.
To our knowledge, reporting the minimum latency value (in Figure 16) is the more useful and common practice in the optimization (minimization) literature [Wang2017, Gardner2014]. While the median might demonstrate the stability of the model, we are interested in finding the minimum latency within a given budget for latency optimization tasks.
To our knowledge, configuration debugging is an uncharted field, and no benchmark is available for performance analysis of configurations. To address this problem, we use DeepStream, an established, extensively used application benchmark, and create our own configuration-performance dataset. Further, we used some model/workload settings from MLPerf benchmarks, e.g., Xception for image processing, BERT for NLP, DeepSpeech for speech, SQLite for database, and x264 for video processing tasks.
For single-objective and multi-objective problems, we used SMAC and PESMO, respectively, as to our knowledge they represent the state of the art. Causal BO [PMLR'20] solves the problem of finding an optimal intervention in a DAG by modeling the intervention functions with a GP as a surrogate model. Dynamic CBO [NeurIPS'21] extends CBO by assuming that the underlying system may change over time; this means that the input/output of the surrogate GP model may have temporal evolution. First, we use do-calculus to identify causal conclusions from observational data; note that the cost of interventions is orders of magnitude higher than that of observations, not only in computer systems but also in other domains. Second, neither CBO nor Dynamic CBO supports multi-objective optimization. Finally, even though online optimization of systems is an interesting problem, and Dynamic CBO is indeed relevant to it, the online setting is outside the scope of our current approach.
See appendix (link in line 197).
The appendix walks through the stages involved in UNICORN, from causal model discovery to counterfactual query evaluation, using an example.
Root causes are the configuration options that change their values from the misconfiguration to the configuration that fixed it.
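To illustrate this definition, here is a minimal, hypothetical sketch (option names and values are made up, not taken from the paper): the root causes are simply the options whose values differ between the faulty configuration and the repaired one.

```python
def root_causes(misconfig: dict, fix: dict) -> dict:
    """Options whose values changed between the misconfiguration and the fix."""
    return {opt: (misconfig[opt], fix[opt])
            for opt in misconfig
            if misconfig[opt] != fix[opt]}

# Illustrative example (hypothetical option names and values):
faulty   = {"swappiness": 60, "cpu_frequency": "2.0GHz", "memory_growth": 0.9}
repaired = {"swappiness": 10, "cpu_frequency": "2.0GHz", "memory_growth": 0.5}
print(root_causes(faulty, repaired))
# -> {'swappiness': (60, 10), 'memory_growth': (0.9, 0.5)}
```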
See B1
Here, the optimal configuration is found by exploring the configuration space with 10 times more budget than the budget allocated for optimization.
C6 (Stops after a few samples): Since Figure 17 shows results for debugging, once the fault is fixed, active learning in UNICORN stops. We agree that it would be useful to show results for the same number of samples for each approach, although this would have to be done for optimization tasks.
C10 (Queueing-theoretic performance models): This is a great idea. Both queueing (network) models and causal models can be considered white-box approaches. However, queueing models require domain knowledge about the system, so building accurate models requires expertise and may be expensive. Our approach, on the other hand, is data-driven and only requires appropriate system instrumentation to collect the relevant data.
D1 (More examples): The appendix contains two more examples where regression-based models incorrectly identify spurious correlations.
D5 (Incorrect correlation): It would not be possible to recover such relationships without doing interventions, even if we incorporate the cache policy as input, as correlation-based models are observational and cannot identify confounding variables (e.g., cache policy).
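As a self-contained illustration of the kind of spurious correlation discussed here (a hypothetical simulation added for clarity, with made-up numbers, not data from the paper): a latent cache-policy confounder can flip the sign of the pooled cache-size/latency correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data-generating process: the cache policy influences both the
# cache size and the latency, while a larger cache slightly reduces latency.
policy = rng.integers(0, 2, n)                        # latent confounder (0/1)
cache_mb = 64 + 192 * policy + rng.normal(0, 16, n)   # policy drives cache size
latency = 50 + 40 * policy - 0.05 * cache_mb + rng.normal(0, 2, n)

# Pooled correlation is positive (spurious): "bigger cache => higher latency".
print("pooled:", np.corrcoef(cache_mb, latency)[0, 1])

# Within each policy, the true (negative) effect of cache size appears.
for p in (0, 1):
    mask = policy == p
    print("policy", p, ":", np.corrcoef(cache_mb[mask], latency[mask])[0, 1])
```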
D9 (Output of the FCI method): We put the output of the FCI method in the appendix. The full model might not be readable. In practice, we focus on a particular part of the model, and there are approaches that summarize the graphical model into simple rules.
D7 (2-level learning): Yes, absolutely.
E0 (Dynamic Causal Bayesian Optimization): See B7.
E0 (Generalizability): See B3.
C11 (Understanding performance): The dependence relationships in the causal model provide explainability of the influence of configuration options on performance objectives. The model also shows developers how one or more configuration options influence a performance objective (directly, or indirectly via some system events), or whether performance objectives are influenced by system events alone (not by any config options), which would indicate other running processes or daemons. The causal effects can help us identify the most influential options. One can also perform interventions using the causal model for better reasoning.
We agree with the issue raised by the reviewer and will update the paper.
References:
[Grebhahn2019] Grebhahn, A., Siegmund, N., and Apel, S. Predicting performance of software configurations: There is no silver bullet. arXiv preprint arXiv:1911.12643 (2019).
[Guo2013] Guo, J., Czarnecki, K., Apel, S., Siegmund, N., and Wasowski, A. Variability-aware performance prediction: A statistical learning approach. In Proc. Int’l Conf. Automated Software Engineering (ASE) (2013), IEEE
[Jamshidi2016] Jamshidi, P., and Casale, G. An uncertainty-aware approach to optimal configuration of stream processing systems. In Proc. Int’l Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) (2016), IEEE.
[Nair2017] Nair, V., Menzies, T., Siegmund, N., and Apel, S. Faster discovery of faster system configurations with spectral learning. arXiv:1701.08106 (2017)
[Spirtes2000] Spirtes, P., Glymour, C. N., Scheines, R., & Heckerman, D. (2000). Causation, prediction, and search. MIT press.
[Chickering2004] Chickering, D. M., Heckerman, D., and Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. The Journal of Machine Learning Research, 5, 1287–1330.
[Mitrovic2020] Mitrovic, J., McWilliams, B., Walker, J., Buesing, L., and Blundell, C. Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922 (2020).
[Spirtes1999] P. Spirtes, T. Richardson, and C. Meek. Causal discovery in the presence of latent variables and selection bias. In G. Cooper and C. Glymour, editors, Computation, Causality, and Discovery, pages 211–252. AAAI Press, 1999.
[Wang2017] Wang Z, Jegelka S. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning 2017 Jul 17 (pp. 3627-3635). PMLR.
[Gardner2014] Gardner JR, Kusner MJ, Xu ZE, Weinberger KQ, Cunningham JP. Bayesian Optimization with Inequality Constraints. In ICML 2014 Jun 18 (Vol. 2014, pp. 937-945).
[Kocaoglu2017] Kocaoglu, M., Dimakis, A. G., Vishwanath, S., and Hassibi, B. Entropic causality and greedy minimum entropy coupling. In 2017 IEEE International Symposium on Information Theory (ISIT) (2017), IEEE, pp. 1465-1469.
The reviewers felt that this was an interesting problem area and application of causal modelling, and hope the paper will generate some interesting discussion during the conference. However, we strongly encourage the authors to carefully read through all the reviewer comments and improve the quality of the paper. Congratulations on your EuroSys paper! :)