Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/eckit stats #154

Draft
wants to merge 149 commits into
base: develop
Choose a base branch
from
Draft

Feature/eckit stats #154

wants to merge 149 commits into from

Conversation

pmaciel
Copy link
Member

@pmaciel pmaciel commented Dec 9, 2024

Mirco, I've ported 90% of the mir statistics functionality here, it's not a full PR to be merged before much discussion but rather a draft.

For MIR, the requirements were:

  • supporting finite and NaN missing values;
  • Calculate statistics in time (for weather statistics and climatotlogical studies) and/or in space (supporting a variety of interpolation methods);
  • Do online statistics, in partitioned objects that can be summed for global results (for parallel execution, although this was never taken advantage off);
  • Support comparisons for pgen (which is a lot); And
  • To support eccodes namespace "statistics" (grib_ls -n statistics), which is why there are spectral statistics too, but this isn't a requirement

I'm sure for other packages it will be different, let's see if they converge to something useful :-)

pmaciel and others added 23 commits November 19, 2021 18:00
This reverts commit 52a2178.

# Conflicts:
#	src/mir/data/MIRField.cc
…op-convert, cppcoreguidelines-avoid-non-const-global-variables, readability-qualified-auto, bugprone-fold-init-type, bugprone-macro-parentheses
…-make-unique, modernize-use-auto, readability-container-data-pointer, readability-implicit-bool-conversion, readability-size-empty
@pmaciel
Copy link
Member Author

pmaciel commented Dec 9, 2024

Again please do not merge this, before we have a wider discussion with other eckit devs.

@codecov-commenter
Copy link

codecov-commenter commented Dec 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.74%. Comparing base (9d2807d) to head (74e8b31).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #154   +/-   ##
========================================
  Coverage    63.74%   63.74%           
========================================
  Files         1066     1066           
  Lines        55360    55360           
  Branches      4086     4086           
========================================
  Hits         35289    35289           
  Misses       20071    20071           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@simondsmart simondsmart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions in here. Looks good overall.

namespace eckit::stats {


static util::recursive_mutex* local_mutex = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be a recursive_mutex?

Can we not use the idiom, used elsewhere, with a static instance() method on the Factory type, and the mutex/map being local members. It's a bit cleaner than using posix once.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a wrapper around the eckit types that have been superceded by stl; There's a define above that controls this, and I've put it in one place only so it's easy to change.

I called it recursive mutex because that's what the stl calls it, and what the eckit::Mutex really is. But this can of course change, at the moment this is only for controlling access to the factories. On that note, the factories could move to a eckit-based implementation -- which I recently changed to behave correctly for concurrent access, but the change is non-trivial. I think this is what should likely happen first? In any case, happy to update

protected:
// -- Members

const Parametrisation& parametrisation1_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a touch dangerous. For general use, can we be confident that the supplied Parametrisation will outlive the Comparator instance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I'm not happy about this. Currently the Comparator is a class supporting comparing two statistics (say mean of two fields), but its use is quite specific to pgen-compare fields1 fields2 (which is in fact mir-compare). I probably should take the Comparator hierarchy out? I don't see that MultIO would use this.

namespace eckit::stats {


static util::once_flag once;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment RE instance() method rather than static global variables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed -- maybe factories refactoring come first though.

namespace eckit::stats {


class Distribution {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does a Distribution do in this context. Not immediately obvious from the name - I think this needs some documentation comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right -- this is a class to generate noise onto a field according to specific distributions. It might not have a direct use here, but it is important functionality to support statistics testing (incl. downstream packages -- I use this in MIR to generate training data, although this isn't the greatest use). I used this to test the statistics when developing, and I haven't ported those tests as they should be done anew (they aren't useful as is)...

using MIRValuesVector = std::vector<double>;


class Field {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit odd. It isn't declared as an abstract base, but all of the methods are NOTIMP, and it doesn't carry any data. Huh?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you caught on quickly -- this was meant for Mirco to have a look for MultIO first and see what the interface should look like. I've just pushed an update supporting owned and referenced fields, should be much more adequate for MultIO, and fully implemented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is to serve as an argument to calculate field statistics in (for a definition of Field.) You're just too early on the PR as I intended to discuss it beforehand -- but now I've done a proper implementation that can be used in both MIR and MultIO


// -- Methods

virtual std::string execute(const Field&, const Field&) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string is a very non-obvious return type for a Comparator. Unless it is doing something weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this sholuld change -- it is a list of reasons why a comparison failed, and if empty is it a successful comparison. I should bring in more (a lot more) semantics here. Good catch!

min_(std::numeric_limits<double>::quiet_NaN()),
max_(std::numeric_limits<double>::quiet_NaN()),
hasMissing_(hasMissing),
hasLowerLimit_(lowerLimit_ == lowerLimit_),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks odd. I assume it is essentially testing if things are NaN, or similar. But I worry that the equality comparison could be optimised away to true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this level, to support the statistics, they operate on real values and NaN if not assigned/not-definable (there's no "finite missing value"). The assumption here is the NaN support -- and you're right, this can be compiled out. I know that MIR and Atlas will fail in specific places to to similar optimisations. Maybe I should protect this with a static_assert? But the problem is widespread.



static const MethodBuilder<MethodT<detail::AngleT<double, detail::AngleScale::DEGREE, detail::AngleSpace::ASYMMETRIC>>>
__stats1("angle.degree.asymmetric");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like the variable names starting with __. Those are supposed to be reserved symbols.

If you really want to ensure non-clashing, you can put them in an anonymous namespace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed

namespace eckit::stats::util {


#if defined(ECKIT_GEO_ECKIT_THREADS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very odd, and somewhat worrying block of code. Why do we need to do special stuff. Why us std::mutex insufficient. And what is the logic here.

If it really is necessary, this needs to be heavily documented as to why (in comments).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to the current threading model supprted in eckit (and the rest of our stacks), to be it's own implementation based on eckit::Mutex. This provides a #define that can change the threading model to stl, but for the stack we should do this in one go -- to me this is a separate discussion. It doesn't work, mixing the models. I've raised this issue over the last few years, several times, but gave up...


constexpr double EPS = 1e-6;

#define EXPECTV(a) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding these to Test.h

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants