-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC 122: Remove browser specific failures graph #122
base: master
Are you sure you want to change the base?
Conversation
I agree that there is likely a better way to leverage these metrics, and it seems like an outright improvement to developer utility if the graph is replaced with links that display queries of all BSFs for a given browser. |
I think we should remove the graph from the top of /results, but I don't think we should just remove it. We have triaged Chrome-only failures to keep our BSF number under 500, and I see it might be time to do that again. And based on PRs from @gsnedders to the metrics code I assume they've looked at it too. /insights already has "Anomalies" which allows getting to views for browser-specific failures, like this one: (Although it's buggy, I filed web-platform-tests/wpt.fyi#2964.) If I can make a wishlist, it would be:
|
Maybe a view like this would be the most friendly: |
My view is that if people want to use the concept of browser specific failures as an internal tool for understanding areas of interop difficulty that's good, and I fully support that. But I don't think we have widespread agreement on its use as a public-facing metric, and the reasoning in the RFC suggests that the lack of curation makes the numbers difficult to interpret. If specific vendors want a number to look at I think it's reasonable to make that number an internal metric instead. That has the additional advantage that it allows some customisation e.g. it allows filtering the inputs to exclude tests that aren't considered a priority/problem for whatever reason, or dividing up the score into team-specific metrics rather than just having one top-level number. That isn't something we can do with a purely shared metric. |
While I've certainly looked at the metric, it's far from the only data derived from WPT results that I've looked at. I think I otherwise agree with @jgraham here. |
To be clear, as the RFC says, there are a variety of biases with this metric, and some of these get quite extreme: Looking at the Safari data, I don't personally believe 40.04% of Safari's "incompatibility" or "web developer pain" (or however we want to define the goal of the BSF metric) is down to those two features. If we look at the graph over the past year with those two directories removed, we see a very different graph: |
@gsnedders thanks, that clearly demonstrates the outsized impact of test suites with lots of individual tests. For comparison/posterity, here's the current BSF graph on wpt.fyi: A few options for improving the metric:
I disagree with deleting the graphs outright, but would be happy with both moving it to |
I think a proposal for a new interop metric, even if based on BSF, would clearly be something for the Interop team to consider. |
Improving the BSF metric seems like a worthwhile goal, either through ideas like the ones Sam and Philip propose or through a reimagined Interop metric based on BSF as James suggests. I would encourage the Interop team to explore that path. However, since we don't have that yet, removing the metric entirely would be a step backwards. In Chromium we do pay attention to the overall score and invest considerably in improving interoperability over time. Hiding that number in favor of team-specific metrics will regress that effort. It will reduce visibility of Chromium interoperability issues at the organizational level and will pass the burden to individual teams with different priorities. From my perspective, removing things that are currently in use without a suitable replacement is wrong. But perhaps moving the graph to /insights as an interim step before we have an improved metric would be a reasonable compromise. |
Not fully matured idea: If the graph is a kind of barometer on web technologies support across browsers, would it make sense to have there things which are only supported (standard positions) uniformly by the 3 browsers represented in the graph? |
@karlcow I've also toyed with the idea of allowing filtering by spec status or implementation status, and I think that would be valuable. I think at least the following filters would be worth trying out:
I would not describe the current graph as a barometer on web technologies support across browsers. Rather the idea is to surface browser-specific failures, problems that occur in just one of the 3 tested browsers, which would ideally trend towards zero. A barometer of cross-browser support should instead be growing as the size of the interoperable web platform grows. It's an old presentation by now, but I looked at that in The Interop Update, where I teamed up with @miketaylr. If we work on filtering and weighting we'll have to see which defaults then make the most sense, but I think it's important to be able to see Chrome-only failures over time that includes features Chrome hasn't implemented at all, such as https://wpt.fyi/results/storage-access-api, MathML (until recently) or |
What is(are) the audience(s) for the graph? And depending on that what are the useful views for each specific audience? |
The audience is senior leaders who are making sure Chromium remains interoperable and competitive with other browser engines over time. The current view of overall browser specific failures is still useful in that task. |
Whilst I'm happy that Chrome's leadership are finding the graph useful, that usefulness as a metric is not a consensus position among browser vendors, and therefore it seems more appropriate to host it at a Chromium-specific location. |
And, as I think the above slightly-modified graph shows, the experience of WebKit leadership has been that understanding the graph has been very difficult. There's no intuitive way to discover that those two directories account for such a disproportionate weight of the metric. If you look at a view of WPT such as this, seeing Safari has fixed over 10k browser-specific-failures (2102 tests (10512 subtests)) over the past year, it seems reasonable to ask "why has the score continued to creep upwards, with no notable improvement at any point?". On the face of it, there's a number of potential explanations:
Of these:
Even from all these, it's hard to understand how we end up at the graph currently on the homepage. [Edited very slightly later to actually use properly aligned runs] |
While still supporting improvements to the graph, I will say that adding up the numbers in your three bullets above seems to reasonably explain the lack of impact of the improvements you made. |
The number of subtests do, yes. But that's a complete coincidence, given the "normalisation" to test. If you look at the actual directory-level diff, it becomes very apparent the overwhelming majority of the change is in Again, the problem is to a large degree weighting all the tests the same. |
For anyone confused, I believe we (the WPT Core Team) decided to defer this RFC until the Interop Team had time to consider it. |
We never resolved (merged) #120 but indeed that seems like the best way to handle this. |
Rendered.