
Separate out processing permanent counts from countmatch.reader #32

Closed
6 tasks done
cczhu opened this issue Jan 8, 2020 · 5 comments
Assignees
Labels
refactoring Reworking code structure without (major) API changes

Comments


cczhu commented Jan 8, 2020

Based on comments in #25 and discussions with @aharpalaniTO, we'll be revising how we treat permanent counts. To do this, we can no longer identify and process permanent count locations while reading in data year-by-year, since data imputation and outlier detection for permanent count stations requires data from other years.

To resolve this:

  • Separate out permanent count processing methods from AnnualCount. These will eventually go into
  • Rename AnnualCount to RawAnnualCount, since it's only used by Reader and gets discarded as soon as multiple years are combined into single multi-index tables.
  • Rework Reader so that it no longer distinguishes between PTCs and STTCs - all locations will be handled the same way.
  • Refactor growthfactor.py into permcount.py, which will handle all permanent count processing, including outlier detection. We'll move methods into their own files in a subsequent issue if permcount.py gets too long. This will also resolve one point in Clean up and expand countmatch.growthfactor #26.
  • Refactor the test suite to handle these changes. We might need to wait until Permanent count ratio imputer #33 is finished before finalizing all tests.
  • Handle Refactor countmatch.reader to something less jumbled #21 in the process.
@cczhu cczhu self-assigned this Jan 8, 2020
@cczhu cczhu added the refactoring Reworking code structure without (major) API changes label Jan 8, 2020
@cczhu cczhu changed the title Separate out processing permanent counts from countmatch.reader Jan 8, 2020
@cczhu cczhu mentioned this issue Jan 8, 2020
3 tasks
cczhu pushed commits that referenced this issue Jan 10, 2020

cczhu commented Jan 11, 2020

Had a rethink of this task, and I now realize why Arman was reluctant to start mucking around with data imputation for permanent count stations, especially since he already had PECOUNT working to fill in ~2-3 months of missing data.

  • TEPs-I's PRTCS algorithm flags a year as a PTC year if it has data in all 12 months and on at least 3/4 of the days of the year. Derived properties, like the ratio between daily traffic and AADT, are only calculated for those years.
  • Following TEPs-I's algorithm exactly, then imputing the few ratio matrices with missing values, would speed up CountMatch: it would guarantee that, e.g., any given day of week within a month has a ratio to use, so we wouldn't have to search within a matrix for the closest day of week or closest year. It would not affect the predictive accuracy of the model, though.
  • Imputation of derived properties is hard, since:
    • We'd have to keep everything as self-consistent as possible - imputing the MADT, AADT, and ratios independently of each other may lead to inconsistencies.
    • How should AADT even be imputed? Should we use scikit-learn's IterativeImputer, TEPs-I's growth model, or something completely different (such as averaging daily_traffic * D_ijd over all available data in that year, which is what CountMatch suggests doing, or just using the MADT patterns from the closest year)?
    • If we impute the raw data instead, how should we go about it? Isn't this what PECOUNT is for? Should we use GPR instead?
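To make the ratio discussion concrete, here's a minimal sketch of how a month x day-of-week ratio matrix (the D_ijd above) might be computed with pandas. The column names ('Date', 'Daily Count') and the exact ratio definition are assumptions for illustration, not countmatch's actual schema; the NaN cells this produces are exactly the holes that imputation or a closest-cell search would have to fill.

```python
import numpy as np
import pandas as pd


def ratio_matrix(daily_counts: pd.DataFrame) -> pd.DataFrame:
    """Month x day-of-week matrix of MADT / mean daily count.

    Cells with no data for that (month, day of week) pair come out NaN,
    which is why CountMatch would otherwise have to search for the
    closest available cell at match time.
    """
    df = daily_counts.copy()
    df['Month'] = df['Date'].dt.month
    df['DoW'] = df['Date'].dt.dayofweek
    # Monthly average daily traffic (MADT), one value per month.
    madt = df.groupby('Month')['Daily Count'].mean()
    # Mean daily count per (month, day of week) cell.
    dom = df.groupby(['Month', 'DoW'])['Daily Count'].mean().unstack('DoW')
    # Ratio = MADT / cell mean, broadcasting MADT across each month's row.
    return dom.rdiv(madt, axis=0)
```

A constant count series yields a matrix of ones, which makes the definition easy to sanity-check.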

@cczhu
Copy link
Contributor Author

cczhu commented Jan 11, 2020

Revised class design:

  • A DerivedVals base class that stores methods for calculating derived properties with no imputation. Since imputation is likely to be a strategy and not a single step in a data reduction pipeline, more sophisticated derived value calculations could subclass this base class.
  • A GrowthFactor class that stores methods for calculating the growth factor, and results of the particular fit.
  • A PermCount class that stores both processed data and the results/diagnostics of the DerivedVals and GrowthFactor instances within it. If it turns out we don't need to store anything from DerivedVals or GrowthFactor other than the results, we can mount them in PermCountProcessor instead of here (in which case PermCount will only be a useful indicator that the data structure is different from a regular count).
    • Processed values need some measure of data accuracy, and an indicator of whether it was imputed or not. For daily counts, simply having an Imputed column is enough (we could also use a masked DataFrame, but storing the mask would be more expensive than storing the column, and it would look different than for the derived properties). For derived properties that are averaged over multiple days, store the number of days, and -1 if the value was imputed.
  • A PermCountProcessor class that cycles through every count, first determining if it meets the requirements for being a PTC, then running the derived value and growth factor calculations. If it turns out we don't need to store anything from DerivedVals or GrowthFactor other than the results, we can mount them here rather than in PermCount.
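The four classes above might relate roughly as in this skeleton. Method bodies are placeholders standing in for the real calculations, and all names other than the four class names are made up for illustration:

```python
import pandas as pd


class DerivedVals:
    """Base class: derived-property calculations with no imputation.
    More sophisticated (imputing) strategies would subclass this."""

    def calculate_derived_vals(self, ptc):
        # Placeholder: a real version would compute MADT, AADT, and
        # day-of-week ratios, and store diagnostics on the PermCount.
        ptc.adts['MADT'] = (
            ptc.data.groupby(ptc.data['Date'].dt.month)['Daily Count'].mean())


class GrowthFactor:
    """Calculates the year-over-year growth factor and keeps the fit."""

    def fit_growth(self, ptc):
        # Placeholder for, e.g., an exponential fit of AADT vs. year.
        ptc.growth_fit = {'growth_factor': 1.0}


class PermCount:
    """Processed permanent-count data plus results/diagnostics from the
    DerivedVals and GrowthFactor runs."""

    def __init__(self, count_id, data):
        self.count_id = count_id
        self.data = data        # daily counts, with an 'Imputed' flag column
        self.adts = {}          # derived values keyed by name
        self.growth_fit = None


class PermCountProcessor:
    """Cycles through counts, promotes those meeting the PTC criteria,
    then runs derived-value and growth-factor calculations on them."""

    def __init__(self, dvc, gfc):
        self.dvc, self.gfc = dvc, gfc

    def process(self, ptc):
        self.dvc.calculate_derived_vals(ptc)
        self.gfc.fit_growth(ptc)
```

Mounting DerivedVals and GrowthFactor on the processor (as here) rather than on PermCount reflects the "if we only need the results" branch of the design above.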

Revised work plan:

  • Code up the framework for processing permanent counts, including default derived property processing.
  • Code up tests for the new framework.
  • Create a basic imputer that only fills gaps in derived values after calculating them, to speed up CountMatch.
  • Determine how many more location-years could become permanent count location-years if we relaxed the criteria for being a permanent station. If a small relaxation would qualify many more locations, or many more years at current permanent count locations, that would justify further work on imputation.
  • Create a slightly more advanced imputer that handles a small number of missing months using Bagheri's algorithm (e.g. estimating AADT using conversion factors from the nearest year).
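For the basic gap-filling imputer in the third bullet, one possible strategy is to fill holes in a ratio matrix from within-month means first, then across-month means. This is a sketch of one such strategy under those assumptions, not TEPs-I's or Bagheri's method:

```python
import pandas as pd


def fill_ratio_gaps(ratios: pd.DataFrame) -> pd.DataFrame:
    """Fill holes in a month x day-of-week ratio matrix so CountMatch
    never has to search for the closest available cell at match time.

    First pass: fill each missing cell with that month's mean across
    days of week.  Second pass: months with no data at all get the
    column (day-of-week) mean across the other months.
    """
    out = ratios.copy()
    out = out.apply(lambda row: row.fillna(row.mean()), axis=1)
    out = out.fillna(out.mean(axis=0))
    return out
```

Filling after the derived values are calculated (rather than imputing raw data) keeps this step cheap and avoids the self-consistency problems raised earlier in the thread.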

Outstanding question: if DerivedVals and GrowthFactor classes have lots of method-specific flags, how should we control for these in config.py? Nested dicts?
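One answer to the config question is a nested dict in config.py, with one sub-dict per processing class so that method-specific flags stay namespaced. The keys and values below are illustrative only, not actual countmatch settings:

```python
# Hypothetical config.py layout: one sub-dict per class, with a
# 'method' key selecting the subclass/strategy and the remaining keys
# passed through as that strategy's flags.
cm = {
    'derived_vals': {
        'method': 'standard',       # which DerivedVals subclass to use
        'impute_ratios': True,
    },
    'growth_factor': {
        'method': 'exponential',    # or 'linear', etc.
        'min_years': 2,
    },
}


def make_derived_vals(cfg):
    """Each processor unpacks only its own block, so adding a flag to
    one class never collides with another class's settings."""
    settings = cfg['derived_vals'].copy()
    method = settings.pop('method')
    return method, settings
```

The pop-the-method, pass-the-rest pattern keeps strategy selection and strategy flags in one place.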

cczhu pushed commits that referenced this issue Jan 14, 2020
@cczhu
Copy link
Contributor Author

cczhu commented Jan 15, 2020

As I'm refactoring to allow for imputation and other multi-year preprocessing of permanent counts, I've created a design where all data from a count location across multiple years are stored in the same object. The amount of data taken at a location can drastically change from year to year, and the permanent count criteria are defined in terms of a single year's data, so currently each location's data is checked year by year using PermCountProcessor.partition_years. The output of this function is perm_years, a list of years that meet the permanent count criteria. Once a PermCount object is created for the count location, perm_years is stored within it, and is used in a number of subsequent methods for calculating derived values and growth factors.
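The year-by-year check described above might look something like this sketch (the column name and standalone-function form are assumptions; the real partition_years is a PermCountProcessor method operating on count objects):

```python
import pandas as pd


def partition_years(daily_counts: pd.DataFrame) -> list:
    """Return the years meeting the permanent count criteria: data in
    all 12 months and on at least 3/4 of the days of the year."""
    perm_years = []
    for year, grp in daily_counts.groupby(daily_counts['Date'].dt.year):
        n_days = 366 if pd.Timestamp(int(year), 1, 1).is_leap_year else 365
        has_all_months = grp['Date'].dt.month.nunique() == 12
        enough_days = grp['Date'].nunique() >= 0.75 * n_days
        if has_all_months and enough_days:
            perm_years.append(int(year))
    return perm_years
```

Because the criteria are per-year, a location with ten years of data can contribute anywhere from zero to ten permanent years, which is why perm_years must travel with the count object.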

There are a number of issues with this design:

  • perm_years can freely be modified by the user, even though it's a property governed by the criteria they set.
  • partition_years is in PermCountProcessor because that class handles partitioning counts into STTCs and PTCs, but I'm not a fan of a class instance's fundamental properties being calculated by other classes.
  • AADT can only be calculated for years satisfying the permanent count conditions, but MADT can be calculated for months outside of permanent count years. This is not transparent to the user, and makes further code development susceptible to bugs.
  • It's unlikely, but I can foresee a situation where perm_years changes due to iterative imputation.

Alternative design:

  • Make perm_years a (lazy?) property of PermCount, and partition_years a method.
    perm_years should be calculated in __init__, which should raise an error if perm_years has zero length. from_count_object should then initialize an instance partway through using self = cls(...).
  • PermCountProcessor should try to initialize a permanent count instance for every count location in a try/except block, and create a short-term count if the initialization fails.
  • Just like with some growth factor methods, if a derived values method only calculates a property for permanent years, its name should include _py.
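Points 1 and 2 together could look roughly like this. The exception name and helper are made up for illustration, and the data layout is simplified to a year -> day-records dict; the point is that perm_years is fixed at construction time and the processor falls back to a short-term count on failure:

```python
class NotPermanentError(ValueError):
    """Raised when a count has no years meeting the permanent criteria."""


class PermCount:
    def __init__(self, count_id, data):
        self.count_id = count_id
        self.data = data  # simplified here: {year: list of day records}
        # Determined at construction, so it can't drift out of sync with
        # the criteria or be silently overwritten by the user.
        self.perm_years = self._partition_years()
        if not self.perm_years:
            raise NotPermanentError(
                "count {} has no permanent years".format(count_id))

    def _partition_years(self):
        # Placeholder criterion; the real check tests 12-month coverage
        # and data on at least 3/4 of the days of each year.
        return [yr for yr, days in self.data.items() if len(days) >= 274]

    @classmethod
    def from_count_object(cls, count):
        # Initialize via cls(...) so subclasses construct correctly.
        return cls(count.count_id, count.data)


def classify(count):
    """Point 2's pattern: try to make a PTC, fall back to short-term."""
    try:
        return PermCount.from_count_object(count)
    except NotPermanentError:
        return count  # keep as a short-term count
```

Raising from __init__ keeps "is this a permanent count?" and "construct a permanent count" as a single atomic step.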

@cczhu
Copy link
Contributor Author

cczhu commented Jan 15, 2020

Decided to adopt points 1 and 3, but not 2. perm_years is now an instance property, but it is still initialized by PermCountProcessor. This is because practically everything else is initialized by PermCountProcessor (all derived values and growth factors, for example), so it would be strange to use an entirely different pattern to attempt initialization of perm_years, especially since it needs to communicate back to its parent whether it succeeded in making a permanent count. We can always refactor later.

cczhu pushed a commit that referenced this issue Jan 16, 2020
cczhu added a commit that referenced this issue Jan 17, 2020
cczhu pushed a commit that referenced this issue Jan 21, 2020
@cczhu
Copy link
Contributor Author

cczhu commented Jan 21, 2020

Resolved by #37

@cczhu cczhu closed this as completed Jan 21, 2020
@cczhu cczhu mentioned this issue Jan 21, 2020
2 tasks