
Possibly useful utility functions? #15

Open
neishm opened this issue Dec 3, 2018 · 3 comments

Comments

@neishm
Contributor

neishm commented Dec 3, 2018

I have some optimized routines from the fstd2nc tool which I've found helpful when scanning through many (hundreds of) files on a routine basis, where even a small per-file overhead adds up to a noticeable delay in the script. Some of them might be useful within python-rpn as utility functions? Below is a brief description of them.

all_params

The all_params method extracts all the record parameters at once, and returns a vectorized dictionary of the result. It scrapes the information directly out of the librmn data structures, and avoids the overhead of repeatedly calling fstprm. Example:

In [1]: import os, sys

In [2]: import rpnpy.librmn.all as rmn

In [3]: from fstd2nc.extra import all_params

In [4]: ATM_MODEL_DFILES = os.getenv('ATM_MODEL_DFILES').strip()

In [5]: fileId = rmn.fstopenall(ATM_MODEL_DFILES+'/bcmk', rmn.FST_RO)

In [6]: p = all_params(fileId)

In [7]: print p
{'deet': array([900, 900, 900, ..., 900, 900, 900], dtype=int32), 'typvar': array(['P ', 'P ', 'P ', ..., 'P ', 'P ', 'X '], dtype='|S2'), 'lng': array([3762, 3762, 3762, ..., 3762, 3762,   13], dtype=int32), 'ni': array([200, 200, 200, ..., 200, 200,   1], dtype=int32), 'nj': array([100, 100, 100, ..., 100, 100,   1], dtype=int32), 'nbits': array([12, 12, 12, ..., 12, 12, 32], dtype=int8), 'swa': array([   2335,    6097,    9859, ..., 4040669, 4044431, 4048193],
      dtype=uint32), 'datyp': array([1, 1, 1, ..., 1, 1, 5], dtype=uint8), 'xtra1': array([354514400, 354514400, 354514400, ..., 354525200, 354525200,
       354525200], dtype=uint32), 'xtra2': array([0, 0, 0, ..., 0, 0, 0], dtype=uint32), 'xtra3': array([0, 0, 0, ..., 0, 0, 0], dtype=uint32), 'ip2': array([ 0,  0,  0, ..., 12, 12,  0], dtype=int32), 'ip3': array([0, 0, 0, ..., 0, 0, 0], dtype=int32), 'ip1': array([       0, 97642568, 97738568, ..., 93423264, 93423264, 44140192],
      dtype=int32), 'key': array([      1,    1025,    2049, ..., 2145288, 2146312, 2147336],
      dtype=int32), 'ubc': array([0, 0, 0, ..., 0, 0, 0], dtype=uint16), 'npas': array([ 0,  0,  0, ..., 48, 48, 48], dtype=int32), 'nk': array([1, 1, 1, ..., 1, 1, 1], dtype=int32), 'ig4': array([0, 0, 0, ..., 0, 0, 0], dtype=int32), 'ig3': array([0, 0, 0, ..., 0, 0, 0], dtype=int32), 'ig2': array([   0,    0,    0, ...,    0,    0, 1600], dtype=int32), 'ig1': array([  0,   0,   0, ...,   0,   0, 800], dtype=int32), 'nomvar': array(['P0  ', 'TT  ', 'TT  ', ..., 'UU  ', 'VV  ', 'HY  '], dtype='|S4'), 'datev': array([354514400, 354514400, 354514400, ..., 354525200, 354525200,
       354525200], dtype=int32), 'dateo': array([354514400, 354514400, 354514400, ..., 354514400, 354514400,
       354514400], dtype=int32), 'etiket': array(['G133K80P    ', 'G133K80P    ', 'G133K80P    ', ...,
       'G133K80P    ', 'G133K80P    ', 'G133K80P    '], dtype='|S12'), 'grtyp': array(['G', 'G', 'G', ..., 'G', 'G', 'X'], dtype='|S1'), 'dltf': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)}

This can be combined with pandas to get a convenient table of parameters:

In [8]: import pandas as pd

In [9]: p = pd.DataFrame(p)

In [10]: print p
          dateo      datev  datyp  deet  dltf  ...   typvar ubc      xtra1  xtra2  xtra3
0     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
1     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
2     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
3     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
4     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
...         ...        ...    ...   ...   ...  ...      ...  ..        ...    ...    ...
2783  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2784  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2785  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2786  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2787  354514400  354525200      5   900     0  ...       X    0  354525200      0      0

[2788 rows x 28 columns]

Having the parameters in this pandas.DataFrame structure provides a more powerful tool for analysing the data. For instance, the pivot method could be used to quickly organize the records into multidimensional time/level structures.
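As a minimal, self-contained sketch of that pivot idea (using a made-up miniature record table, not real FST output):

```python
import pandas as pd

# Hypothetical miniature version of an all_params-style table:
# one row per record, with a date, a level, and a record handle.
p = pd.DataFrame({
    'date':  ['2009-04-27 00:00', '2009-04-27 00:00',
              '2009-04-27 12:00', '2009-04-27 12:00'],
    'level': [0.5, 1.0, 0.5, 1.0],
    'key':   [101, 102, 201, 202],
})

# pivot() reshapes the flat record list into a 2-D (date x level) table;
# each cell holds the handle of the record at that date/level.
table = p.pivot(index='date', columns='level', values='key')
```

Selecting `table.loc['2009-04-27 00:00', 0.5]` then gives back the handle of the record at that date and level.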
The method is also about 20x faster than looping over fstprm:

In [11]: %timeit map(rmn.fstprm, rmn.fstinl(fileId))
10 loops, best of 3: 84.3 ms per loop

In [12]: %timeit pd.DataFrame(all_params(fileId))
100 loops, best of 3: 4.46 ms per loop

80ms isn't much, but it adds up if you're scanning over hundreds (or thousands) of files.

maybeFST

The maybeFST function is a more compact version of isFST which avoids any librmn calls such as c_wkoffit. That library function can incur some overhead since it's testing for many different formats, not just FST.

When combined, the maybeFST and all_params functions allow a user to very quickly scan over many hundreds of files and get a snapshot of all the records inside.
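A rough sketch of what such a quick pre-filter can look like (the magic-byte check below reflects my understanding that RPN standard files carry the signature b'STDR' at byte offset 12 — verify against the actual maybeFST before relying on it; the scan_candidates helper is hypothetical):

```python
import os

def maybe_fst(filename):
    """Cheap plausibility check for an FST file: look for the b'STDR'
    signature at byte offset 12, without calling into librmn."""
    try:
        with open(filename, 'rb') as f:
            magic = f.read(16)
    except IOError:
        return False
    return len(magic) == 16 and magic[12:16] == b'STDR'

def scan_candidates(dirname):
    """Return the files in dirname that pass the quick FST check,
    e.g. as input for an all_params-style bulk scan."""
    paths = (os.path.join(dirname, f) for f in sorted(os.listdir(dirname)))
    return [p for p in paths if maybe_fst(p)]
```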

stamp2datetime

The stamp2datetime function converts an array of RPN date stamps into datetime objects. Useful in conjunction with all_params to get date information quickly, e.g.:

In [13]: from fstd2nc.mixins.dates import stamp2datetime

In [14]: print stamp2datetime(p['datev'])
['2009-04-27T00:00:00.000000000' '2009-04-27T00:00:00.000000000'
 '2009-04-27T00:00:00.000000000' ... '2009-04-27T12:00:00.000000000'
 '2009-04-27T12:00:00.000000000' '2009-04-27T12:00:00.000000000']

In [15]: p['date'] = stamp2datetime(p['datev'])

In [16]: print p
          dateo      datev  datyp  deet         ...              xtra1 xtra2 xtra3                date
0     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
1     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
2     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
3     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
4     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
...         ...        ...    ...   ...         ...                ...   ...   ...                 ...
2783  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2784  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2785  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2786  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2787  354514400  354525200      5   900         ...          354525200     0     0 2009-04-27 12:00:00

[2788 rows x 29 columns]
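For reference, the stamps shown above are "new-style" (post-1980) CMC date stamps, which can be decoded without librmn: each stamp unit is 4 seconds counted from 1980-01-01, offset by the constant 123200000. This is my own reconstruction covering only that case (the real stamp2datetime also handles other stamp encodings):

```python
import numpy as np

def decode_new_style_stamps(stamps):
    """Decode 'new-style' RPN/CMC date stamps (dates after 1980) to
    numpy datetime64: 4-second units since 1980-01-01, offset 123200000."""
    stamps = np.asarray(stamps, dtype='int64')
    seconds = (stamps - 123200000) * 4
    return np.datetime64('1980-01-01T00:00:00') + seconds.astype('timedelta64[s]')
```

With the stamps from the transcript, decode_new_style_stamps([354514400, 354525200]) gives 2009-04-27T00:00:00 and 2009-04-27T12:00:00, matching the output above.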

decode_ip1

The decode_ip1 function quickly decodes the levels from an array of ip1 values. For example:

In [18]: import numpy as np

In [19]: from fstd2nc.mixins.vcoords import decode_ip1

In [20]: print decode_ip1(p['ip1'])
[array([(2, 0.)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(5, 0.000125)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(5, 0.000221)], dtype=[('kind', '<i4'), ('level', '<f4')]) ...
 array([(5, 1.)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(5, 1.)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(2, 0.1)], dtype=[('kind', '<i4'), ('level', '<f4')])]

In [21]: levels = np.concatenate(decode_ip1(p['ip1']))

In [22]: print levels
[(2, 0.00e+00) (5, 1.25e-04) (5, 2.21e-04) ... (5, 1.00e+00) (5, 1.00e+00)
 (2, 1.00e-01)]

In [23]: print levels.dtype
[('kind', '<i4'), ('level', '<f4')]

In [24]: p['level'] = levels['level']

In [25]: p['kind'] = levels['kind']

In [26]: print p
          dateo      datev  datyp  deet  dltf  ...  xtra2 xtra3                date     level  kind
0     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000000     2
1     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000125     5
2     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000221     5
3     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000382     5
4     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000635     5
...         ...        ...    ...   ...   ...  ...    ...   ...                 ...       ...   ...
2783  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.995000     5
2784  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.995000     5
2785  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  1.000000     5
2786  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  1.000000     5
2787  354514400  354525200      5   900     0  ...      0     0 2009-04-27 12:00:00  0.100000     2

[2788 rows x 31 columns]
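For reference, the new-style ip1 values above can be decoded with plain bit arithmetic. The layout below is my reconstruction of the CONVIP-style encoding from the values in the transcript, covering positive levels only; the real decode_ip1 handles more cases (old-style ip1 values below 32768, negative levels):

```python
import numpy as np

def decode_new_style_ip1(ip1):
    """Decode 'new-style' ip1 values into (kind, level) arrays.
    Assumed layout (positive levels only): bits 24-27 hold the kind,
    bits 20-23 a base-10 exponent e, bits 0-19 a mantissa m, and
    level = m * 10**(4 - e)."""
    ip1 = np.asarray(ip1, dtype='int64')
    kind = (ip1 >> 24) & 0xF
    expo = (ip1 >> 20) & 0xF
    mantissa = ip1 & 0xFFFFF
    level = mantissa * 10.0 ** (4 - expo)
    return kind, level
```

Decoding the transcript's values 97642568, 93423264 and 44140192 reproduces (5, 0.000125), (5, 1.0) and (2, 0.1).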

Example

For completeness, here's an example using the information from all the above steps to get multidimensional structures for a field. First, get the list of variables:

In [32]: print pd.unique(p.nomvar)
['P0  ' 'TT  ' 'ES  ' 'HU  ' 'MX  ' 'LA  ' 'LO  ' 'ME  ' 'WE  ' 'GZ  '
 'HR  ' 'WW  ' 'PN  ' 'TD  ' 'QC  ' 'FC  ' 'FQ  ' 'IO  ' 'OL  ' 'UE  '
 'EN  ' 'GL  ' 'LH  ' 'MF  ' 'MG  ' 'TM  ' 'VG  ' 'RC  ' 'AL  ' 'I0  '
 'I1  ' 'SD  ' 'I6  ' 'I7  ' 'I8  ' 'I9  ' 'AB  ' 'AG  ' 'AH  ' 'AI  '
 'AP  ' 'AR  ' 'AS  ' 'AU  ' 'AV  ' 'AW  ' 'EV  ' 'FS  ' 'FV  ' 'IC  '
 'IE  ' 'IH  ' 'IV  ' 'NF  ' 'SI  ' 'I2  ' 'I3  ' 'I4  ' 'I5  ' 'DN  '
 'NR  ' 'HF  ' 'FL  ' 'N0  ' 'O1  ' 'RT  ' 'RY  ' 'PR  ' 'PE  ' 'FR  '
 'RN  ' 'SN  ' 'PC  ' 'FB  ' 'N4  ' 'CX  ' 'FI  ' 'AD  ' 'FN  ' 'EI  '
 'H   ' 'J9  ' 'TG  ' 'Z0  ' 'NT  ' 'K6  ' 'K4  ' 'U9  ' 'AE  ' 'RZ  '
 'RR  ' 'U4  ' 'L7  ' 'L8  ' 'IY  ' 'UU  ' 'VV  ' 'HY  ' '>>  ' '^^  '
 'VF  ' 'GA  ' 'J1  ' 'J2  ' 'Y7  ' 'Y8  ' 'Y9  ' 'ZP  ' 'META' 'TS  '
 'TP  ' 'HS  ' 'LG  ' 'ICEL' 'EMIB']

Pick UU:

In [46]: p = p.loc[p.dltf==0]   # Ignore deleted records

In [47]: uu = p.loc[p.nomvar=='UU  ']  # Select UU records

In [48]: print uu
          dateo      datev  datyp  deet  dltf  ...  xtra2 xtra3                date     level  kind
922   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000125     5
924   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000221     5
926   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000382     5
928   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000635     5
930   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.001010     5
...         ...        ...    ...   ...   ...  ...    ...   ...                 ...       ...   ...
2777  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.961000     5
2779  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.974000     5
2781  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.985000     5
2783  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.995000     5
2785  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  1.000000     5

[160 rows x 31 columns]

Organize the records by date/level:

In [49]: uu = uu.pivot(index='date',columns='level')  # Organize by date/level

In [51]: print uu['ip1']
level                0.000125  0.000221  0.000382    ...     0.985000  0.995000  1.000000
date                                                 ...                                 
2009-04-27 00:00:00  97642568  97738568  97899568    ...     95356840  95366840  93423264
2009-04-27 12:00:00  97642568  97738568  97899568    ...     95356840  95366840  93423264

[2 rows x 80 columns]

In [52]: print uu['ip2']
level                0.000125  0.000221  0.000382    ...     0.985000  0.995000  1.000000
date                                                 ...                                 
2009-04-27 00:00:00         0         0         0    ...            0         0         0
2009-04-27 12:00:00        12        12        12    ...           12        12        12

[2 rows x 80 columns]

In [53]: print uu['key']  # The keys/handles for these records
level                0.000125  0.000221  0.000382    ...     0.985000  0.995000  1.000000
date                                                 ...                                 
2009-04-27 00:00:00   1730561   1732609   1734657    ...      2150401   2152449   2154497
2009-04-27 12:00:00   1721352   1723400   1725448    ...      2141192   2143240   2145288

[2 rows x 80 columns]
@meteokid
Owner

meteokid commented Dec 3, 2018

Would you mind making a pull request for this? Thanks!

@meteokid
Owner

Pulled into rpnpy_2.1-fstfast-branch, will need some more time to review this one before merging into main dev branch.

I would consider splitting this one into 3 files...

- vectorize could be used elsewhere and would probably be best by itself
- the "hack-ish" functions, relying on librmn internal structure knowledge... these I would not import in "utils.all"
- the other functions
In any case, thanks for these optimizations.

@neishm
Contributor Author

neishm commented Dec 17, 2018

I agree some of these should not go into "utils.all". Splitting them sounds like a good idea.
