Might be nice to see MEGNet on matbench #333
https://matbench.materialsproject.org/ (I'm unaffiliated)

Comments
We don't have an objection to MEGNet being on the website; in fact, MEGNet results are actually reported in the matbench paper. I have no idea why the results are not reported on the website itself, since those results clearly exist. See https://www.nature.com/articles/s41524-020-00406-3. In any case, I have a philosophical ambivalence about this kind of comparison, since I don't think pure MAE or classification performance is the most important metric for materials science. An MAE of 20 meV/atom or 30 meV/atom in the formation energy is not significant at all, and different CV splits can easily cause the errors to fluctuate by that amount.
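As a rough illustration of that last point, here is a minimal sketch (synthetic data and a generic regressor standing in for a real formation-energy dataset and model) of how per-fold MAE can vary across a 5-fold split:

```python
# Minimal sketch: per-fold MAE spread across a 5-fold split.
# Synthetic data and a generic regressor stand in for a real
# formation-energy dataset and model (hypothetical example).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=2000, n_features=50, noise=10.0, random_state=0)

fold_maes = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print("MAE per fold:", np.round(fold_maes, 2))
print(f"mean +/- std: {np.mean(fold_maes):.2f} +/- {np.std(fold_maes):.2f}")
```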
@shyuep I have a vague memory of seeing MEGNet on matbench before, so I'm glad you mentioned it being in the original matbench paper. Digging a bit further, this matbench issue explains that MEGNet is (temporarily) no longer on matbench due to file corruption, but most of the conversation seems to have happened in an internal email thread between @ardunn and @chc273.
I think you have a great point about the limits of using a single error-based metric (e.g., MAE) to quantify performance in a materials science context. I think the creator of matbench would probably agree that it doesn't tell the full story, but instead gives a general sense of performance across a wide range of small/large problems for different domains and modalities. As one of the leaders in the field, what metrics do you think are most important for materials science? (If you'll excuse me asking a rather broad question.) Related to this, I opened some discussion about incorporating uncertainty quantification quality metrics, which can have a big effect on "suggest next best experiment" algorithms such as Bayesian optimization.
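For concreteness, one simple UQ quality check is observed vs. nominal coverage of the predicted intervals (a miscalibration-area-style number). A minimal sketch, assuming Gaussian predictive distributions and using purely synthetic predictions and uncertainties:

```python
# Minimal sketch of one UQ quality metric: observed vs. nominal coverage of
# central prediction intervals. Assumes Gaussian predictive distributions;
# all numbers below are synthetic placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
y_pred = y_true + rng.normal(scale=0.3, size=1000)  # hypothetical predictions
y_std = np.full(1000, 0.5)                          # hypothetical (too-wide) uncertainties

nominal = np.linspace(0.05, 0.95, 19)
observed = [
    np.mean(np.abs(y_true - y_pred) <= stats.norm.ppf(0.5 + p / 2) * y_std)
    for p in nominal
]

# Average gap between nominal and observed coverage (0 = perfectly calibrated).
print(f"miscalibration ~ {np.mean(np.abs(nominal - np.array(observed))):.3f}")
```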
One application that seems popular in a materials discovery context is the use of formation energy to calculate decomposition energy (or a similar quantity such as energy above hull) as a measure of stability (Bartel et al.); a rough sketch of that calculation is included below. If I remember correctly, I've seen articles with stability filtering criteria for the related metric,
I agree that the choice of CV splits often causes fluctuations on the same order of magnitude. Interestingly, in the case of the matbench formation energy task, the fluctuation across the 5 CV splits seems to be fairly controlled for some models and large for others. I don't necessarily advocate
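On the decomposition-energy point above, one common route from formation energies to a stability measure is an energy above hull computed from a phase diagram. A rough sketch using pymatgen's phase-diagram tools (the compositions and energies below are made up purely for illustration):

```python
# Rough sketch: energy above hull from (composition, energy) pairs using
# pymatgen's phase-diagram tools. Compositions and energies are made up here;
# a real workflow would use DFT- or ML-predicted energies.
from pymatgen.analysis.phase_diagram import PDEntry, PhaseDiagram
from pymatgen.core import Composition

entries = [
    PDEntry(Composition("Li"), 0.0),     # elemental references
    PDEntry(Composition("O2"), 0.0),
    PDEntry(Composition("Li2O"), -6.0),  # hypothetical energies (eV per formula unit)
    PDEntry(Composition("Li2O2"), -5.0),
]
pdgm = PhaseDiagram(entries)

candidate = PDEntry(Composition("Li2O2"), -5.0)
decomp, e_above_hull = pdgm.get_decomp_and_e_above_hull(candidate)
print(f"E_above_hull = {e_above_hull:.3f} eV/atom")
```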
I think there are many questions that are important for ML in materials science. But for me, the most practical question right now is how it can be used for actually discovering and designing new materials. Despite lots of papers claiming incremental improvements in the accuracy of ML algos, you will find that actual papers with new discoveries and experimental confirmation are few and far between. That is why I prefer to focus on things like BOWSR, which addresses a critical bottleneck for that purpose, i.e., how do you get structures when all you have is a theoretical ensemble of atoms? We have been working on better options than just Bayesian optimization from the energy.

Another important question is basically ensuring the extrapolability and "physicality" of ML models. You mentioned matbench being the "ImageNet" of materials. Fundamentally, ML in materials science in most cases cannot be compared to ML in image recognition or other domains. In MatSci, we know there are inviolable physical and chemical laws. We know the extrapolation limits imposed by them, e.g., what happens in an ideal homogeneous electron gas and the relationship with electron density, what happens when you pull two atoms far apart, etc. The same can't be said for things like "how do I tell an image of a cat from a dog".
I agree 100%. The primary purpose of matbench is not blindly chasing lower and lower errors, but rather to provide some sort of common platform for comparing the strengths and weaknesses of various models to be used downstream in real applications. Of course, a single benchmark can't cover all, or even a majority, of use cases - but we hoped to make it broad enough to at least be of some research use (e.g., "I want to predict some formation energies; I wonder how various models perform on the same datasets").
Yet our field regularly adopts and modifies model architectures, training procedures, etc. from other domains in matsci ML work. Similarly, matbench takes the general idea of an ML benchmark and applies it to the materials domain. I'd argue that the lack of a direct correlate between
Matbench has the former (see the Full Benchmark Data pages). The latter would of course be better, though more computationally expensive. Even better is some additional UQ on each prediction, which, as @sgbaird mentioned, we are considering putting into matbench. Of course, exactly how that is done is open to revision... I think the best case is having some easily-accessible community benchmark that is as representative of real matsci engineering problems as possible - whether it's matbench or something else doesn't really matter.
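For reference, the per-fold recording workflow (as I understand it from the matbench docs) looks roughly like the sketch below; the "model" is just a train-set-mean baseline so the example stays self-contained:

```python
# Rough sketch of the matbench per-fold workflow (based on my reading of the
# docs); a train-set-mean baseline stands in for a real model.
import numpy as np
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        mean_pred = float(np.mean(train_outputs))  # stand-in for a real model

        test_inputs = task.get_test_data(fold, include_target=False)
        task.record(fold, [mean_pred] * len(test_inputs))

# Per-fold scores (what the Full Benchmark Data pages show) end up in the results file.
mb.to_file("mean_baseline_benchmark.json.gz")
```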
Yeah, so to clarify, the original MEGNet results were done by a postdoc who left our group a couple of years ago, and when I was putting the results onto the leaderboard, the only file she still had access to was corrupted... I know, I am equally disappointed and surprised lol. Adding a newer version of MEGNet to matbench has been something I'd wanted to do myself for a while but hadn't gotten around to it :/
@ardunn Just to clarify, I am not denying that matbench is useful. I am merely stating my own ambivalence towards chasing performance metrics. In the end, current models are "good enough" on certain properties, and there are bigger problems to deal with. I would also argue that certain datasets are nowhere near large or diverse enough to even be a useful basis for comparison, e.g., datasets with only hundreds of data points. I am pretty sure that if you dive into the details of the data, you will see that the dataset is biased in some way.
@shyuep @ardunn, thank you for your comments!

@shyuep I appreciate you mentioning the benefits to the field of less focus on incremental improvements in accuracy and more focus on actual materials discovery campaigns (and, I would add, successful or not). I'm excited to hear about the follow-up work to BOWSR when it becomes available 🙂

Extrapolability, interpretability, and physicality: these certainly seem to be (at least a few of) the differentiators between other domains ("cats vs. dogs", Netflix movie recommenders) and materials informatics. For extrapolability, it seems like some performance metrics can be implemented, such as leave-one-cluster-out cross-validation from Meredig et al. (a rough sketch is at the end of this comment), a holdout of the top 1% (Kauwe et al.; disclaimer: from my group), adaptive design from a list of candidates, or a made-up "ground truth" model (forgive the oxymoron). For the last case, there were some interesting, limited results (in my opinion) claiming that a Gaussian process had better adaptive design results over 100 iterations than other, more accurate models (e.g., a neural network ensemble and a random forest; interestingly, the ground truth was chosen to be the trained neural network ensemble). For interpretability, the more common approaches seem to be either symbolic regression or determination of feature importances based on physical descriptors. I'm glad you bring up the physicality aspect, especially the consideration of physical laws. If you know of MatSci work that explicitly incorporates physical laws into an ML model rather than relying on physical descriptors alone, I'd be really interested to hear about it.

The Bartel paper and the comments in this thread have gotten me thinking about structure vs. composition more. Structure-based formation energy ML models have gotten really accurate (e.g. MEGNet and ALIGNN, down to ~

Again, thank you for the discussion!
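Coming back to LOCO-CV, here's a minimal sketch of the mechanics (k-means clusters on synthetic features stand in for chemically meaningful clusters, and a generic regressor stands in for a real model):

```python
# Minimal sketch of leave-one-cluster-out CV: cluster the feature space, then
# hold out one cluster at a time. Synthetic features and a generic regressor
# stand in for real composition/structure descriptors and a real model.
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=clusters):
    model = GradientBoostingRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print(f"held-out cluster {clusters[test_idx][0]}: MAE = {mae:.2f}")
```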