MARGE
Machine learning Algorithm for Radiative transfer of Generated Exoplanets
===============================================================================
Author : Michael D. Himes Morgan State University/NASA GSFC (prev. University of Central Florida)
Contact: [email protected]
Advisor: Joseph Harrington University of Central Florida
Uncredited contributors: Adam D. Cobb
- plan-net code inspired NN class & training
David C. Wright
- containerized the code
Zaccheus Scheffer
- assisted with refactoring the codebase that led to MARGE
Acknowledgements
----------------
This research was supported by the NASA Fellowship Activity under NASA Grant
80NSSC20K0682. We gratefully thank Nvidia Corporation for the Titan Xp GPU
that was used throughout development of the software.
Summary
=======
MARGE is a Python package that abstracts the process of optimizing and training
neural network (NN) models on labeled data sets. Users can supply a function to
interface MARGE with another code to be used for data generation. The original
use case for MARGE was to generate exoplanet spectra across a defined parameter
space using the Bayesian Atmospheric Radiative Transfer (BART) code and train
an NN surrogate model for radiative transfer (RT).
MARGE comes with complete documentation as well as a user manual to assist
in its usage. Users can find the latest MARGE User Manual at
https://exosports.github.io/MARGE/doc/MARGE_User_Manual.html.
(Note: it is currently outdated, but updates are in progress.)
MARGE is an open-source project that welcomes improvements from the community,
submitted via pull requests on GitHub. To be accepted, such improvements
must be generally consistent with the existing coding style, and all changes
must be reflected in the associated documentation.
MARGE is released under the Reproducible Research Software License. Users are
free to use the software for personal reasons. Users publishing results from
MARGE or modified versions of it in a peer-reviewed journal are required to
publicly release all necessary code/data to reproduce the published work.
Modified versions of MARGE are required to also be released open source under
the same Reproducible Research Software License. For more details, see the
text of the license.
Files & Directories
===================
MARGE contains various files and directories, described here.
doc/ - Contains documentation for MARGE. The User Manual contains
the information in the README with more detail, as well as a
walkthrough for setup and running an example.
environment.yml - Conda environment for MARGE.
example/ - Contains example configuration files for MARGE.
lib/ - Contains the classes and functions of MARGE.
callbacks.py - Contains Keras Callback classes.
datagen/ - Contains files related to data generation
BART/ - Files necessary for data generation/processing with BART.
BARTfunc.py - Modified MCMC function that saves out full spectra.
datagen.py - Contains functions for generating/processing data with BART.
datagen_pypsg.py - As above, but for the pypsg format
(see https://github.com/mdhimes/pypsg).
loader.py - Contains functions related to loading processed data.
NN.py - Contains the NN model class, and a driver function for model
training/validating/testing.
plotter.py - Contains plotting functions.
stats.py - Contains functions related to statistics.
utils.py - Contains utility functions used for internal processing.
Makefile - Handles setting up BART, and creating a TLI file.
MARGE.py - The executable driver for MARGE. Accepts a configuration file.
modules/ - Contains necessary modules for data generation.
Initially empty; will contain BART via the Makefile.
README - This file!
requirements.txt - Additional packages for the conda environment installed via
pip.
Note that all .py files have complete documentation; consult a specific file
for more details.
Installation
============
To build the included conda environment, enter
    conda env create -f environment.yml
into a terminal. Then, activate it via
    conda activate marge
At present, MARGE only supports using BART for data generation. For details,
see the associated user manual in the modules/BART/doc directory.
If running MARGE in data generation mode, users must first compile BART's
submodules. This can be done by entering
    make bart
into a terminal while in the MARGE directory.
This requires SWIG, which is not included in the default conda environment.
If the user does not have SWIG installed, they can add SWIG to the
current conda environment by entering
    conda install swig
into the terminal.
Note that the conda environment uses tensorflow-gpu, which assumes the user has
Nvidia drivers installed. If this is not the case, after building the
environment, non-GPU users must enter
    conda remove -n marge tensorflow-gpu
and follow the prompts. This will remove the requirement for Nvidia drivers.
Be aware that MARGE will take significantly longer to execute without a GPU.
For more details, see the MARGE User Manual.
Executing MARGE
===============
MARGE has 2 execution modes: data generation, and ML model training.
After setting up the MARGE configuration file, run MARGE via
    ./MARGE.py <path/to/configuration file>
Data generation is handled via a user-provided function to interface with
some other code. At present, MARGE only includes such a function for BART.
Note that in some cases (e.g., the quick_example case in the example/
subdirectory), it is simpler to generate the data without using MARGE, then
ingest that data into MARGE.
Before data generation via BART can begin, users must first generate a TLI
file via pylineread, BART's line list-processing code. To do so, enter
    make TLI cfile=<path/to/configuration file>
to call pylineread with a configuration file. For details on how to set this
up, see the pylineread directory within BART's RT module, transit. After
generating a TLI file, data generation can begin.
Note that data generation via BART requires a BART configuration file. In order
to save the data, the BART config file must have the 'savemodel' parameter set to
<name>.npy, where <name> is some user-chosen prefix. This config file must also
have the 'modelper' parameter set; this sets the number of iterations per
chain between saves of each model file, which ensures that the evaluated models
fit in memory. If the models will fit in memory as a single array, set modelper=0
to save out a single file with all models. If modelper>0, the generated .npy
files will be saved to a subdirectory within BART's output directory, named
<name> from the 'savemodel' parameter.
For a detailed example walkthrough, see the README in the example/BART_example/
subdirectory or the user manual.
ML Model Training: Data
-----------------------
If training an ML model using MARGE, data must be formatted a particular way.
At present, MARGE requires that the files are Numpy zipped archives (NPZ).
Each NPZ file must contain named data sets of 'x' for inputs and 'y' for
outputs. All other named data sets are ignored at present. The 'x' and 'y'
arrays must be at least 2D, where the 0th axis corresponds to each data case.
For example, arr['x'][0] would be the inputs for one data case in the
data set. Each data case can contain as many dimensions as necessary.
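As a minimal sketch of the layout described above, the following writes and
re-reads an NPZ file with 'x' and 'y' data sets. The file name, array sizes,
and random values are placeholders chosen for illustration; they are not
taken from MARGE.

```python
import numpy as np

# Hypothetical sizes: 100 data cases, 4 inputs and 50 outputs per case.
rng = np.random.default_rng(0)
x = rng.uniform(size=(100, 4))   # axis 0 indexes the data cases
y = rng.uniform(size=(100, 50))

# Store the arrays under the names 'x' and 'y'.
np.savez("example_data.npz", x=x, y=y)

# Reload the archive to confirm the layout.
arr = np.load("example_data.npz")
first_inputs = arr["x"][0]       # inputs for one data case
first_outputs = arr["y"][0]      # outputs for the same data case
```

Any extra named arrays stored in the same archive would simply be ignored,
per the description above.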
Users are encouraged to extend loader.py to support additional
data formats, which would be specified in the configuration file.
Please submit any such modifications via pull requests on GitHub.
Setting Up a MARGE Configuration File
=====================================
A MARGE configuration file contains many options, which are detailed in this
section. For examples, see MARGE*.cfg files in the example subdirectories
and their associated READMEs with instructions on execution.
Directories
-----------
inputdir : str. Directory containing MARGE inputs.
outputdir : str. Directory containing MARGE outputs.
plotdir : str. Directory to save plots.
If relative path, subdirectory with respect to `outputdir`.
datadir : str. Directory to store generated data.
If relative path, subdirectory with respect to `outputdir`.
preddir : str. Directory to store validation and test set predictions and
true values.
If relative path, subdirectory with respect to `outputdir`.
Datagen Parameters
------------------
datagen : bool. Determines whether to generate data.
datagenfile: str. File containing functions to generate and/or process
data. Do NOT include the .py extension!
See the files in the lib/datagen directory for examples.
User-specified files must have identically-named
functions with an identical set of inputs. If an
additional input is required, the user must modify the
code in MARGE.py accordingly.
Please submit a pull request if this occurs!
cfile : str. Name of the configuration file for data generation.
Can be absolute path, or relative path to `inputdir`.
processdat : bool. Determines whether to process the generated data.
preservedat: bool. Determines whether to preserve the unprocessed data after
completing data processing.
Note: if False, it will PERMANENTLY DELETE the original,
unprocessed data!
Neural Network (NN) Parameters
------------------------------
nnmodel : bool. Determines whether to use an NN model.
resume : bool. Determines whether to resume training a previously-trained
model.
seed : int. Random seed.
trainflag : bool. Determines whether to train an NN model.
validflag : bool. Determines whether to validate an NN model.
testflag : bool. Determines whether to test an NN model.
TFR_file : str. Prefix for the TFRecords files to be created.
buffer : int. Number of batches to pre-load into memory.
ncores : int. Number of CPU cores to use to load the data in parallel.
normalize : bool. Determines whether to normalize the data by its mean and
standard deviation.
scale : bool. Determines whether to scale the data to be within a range.
scalelims : floats. The min, max of the range of the scaled data.
It is recommended to use -1, 1
weight_file: str. File containing NN model weights.
NOTE: MUST end in .h5
input_dim : int. Dimensionality of the input to the NN.
output_dim : int. Dimensionality of the output of the NN.
ilog : bool. Determines whether to take the log10 of the input data.
Alternatively, specify comma-, space-, or newline-separated
integers to selectively take the log of certain inputs.
olog : bool. Determines whether to take the log10 of the output data.
Alternatively, specify comma-, space-, or newline-separated
integers to selectively take the log of certain outputs.
gridsearch : bool. Determines whether to perform a gridsearch over
architectures.
architectures: strings. Name/identifier for a given model architecture.
Names must not include spaces!
For multiple architectures, separate with a space or
indented newlines.
nodes : ints. Number of nodes per layer with nodes.
layers: strings. Type of each hidden layer.
Options: dense, conv1d, maxpool1d, avgpool1d, flatten, dropout
lay_params: list. Parameters for each layer (e.g., kernel size).
For no parameter or the default, use None.
activations: strings. Type of activation function per layer with nodes.
Options: linear, relu, leakyrelu, elu, tanh, sigmoid,
exponential, softsign, softplus, softmax
act_params: list. Parameters for each activation.
Use None for no parameter or the default value.
Values specified only apply to LeakyReLU and ELU.
epochs : int. Maximum number of iterations through the training data set.
patience : int. Early-stopping criterion; stops training after `patience`
epochs of no improvement in validation loss.
batch_size : int. Mini-batch size for training/validation steps.
lengthscale: float. Minimum learning rate.
max_lr : float. Maximum learning rate.
clr_mode : str. Specifies the function to use for a cyclical learning rate
(CLR).
'triangular' linearly increases from `lengthscale` to
`max_lr` over `clr_steps` iterations, then linearly decreases
back to `lengthscale`.
'triangular2' performs similarly to 'triangular', except that
the `max_lr` value is decreased by a factor of 2 every
complete cycle, i.e., every 2 * `clr_steps` iterations.
'exp_range' performs similarly to 'triangular2', except that
the amplitude decreases according to an exponential based
on the epoch number, rather than the CLR cycle.
clr_steps : int. Number of steps through a half-cycle of the learning rate.
E.g., if using clr_mode = 'triangular' and clr_steps = 4,
every 8 epochs will have the same learning rate.
It is recommended to use an even value.
For more details, see Smith (2015), Cyclical Learning Rates
for Training Neural Networks.
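To illustrate the 'triangular' and 'triangular2' schedules, here is a minimal
sketch of the formula from Smith (2015). It is not MARGE's callback code: the
function name `clr` and its arguments are hypothetical, with `base_lr` playing
the role of `lengthscale` and `step_size` the role of `clr_steps`. In this
form, 'triangular2' halves the amplitude above the minimum learning rate each
cycle. The 'exp_range' mode is omitted.

```python
import math

def clr(iteration, base_lr, max_lr, step_size, mode="triangular"):
    """Learning rate at a given iteration, following Smith (2015)."""
    # Index of the current cycle, starting at 1; a cycle is 2 * step_size long.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, in units of step_size.
    x = abs(iteration / step_size - 2 * cycle + 1)
    if mode == "triangular":
        scale = 1.0
    elif mode == "triangular2":
        scale = 1.0 / 2 ** (cycle - 1)   # halve the amplitude each cycle
    else:
        raise ValueError("mode must be 'triangular' or 'triangular2'")
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale

# With step_size = 4, the rate returns to base_lr every 8 iterations.
rates = [clr(i, base_lr=0.001, max_lr=0.01, step_size=4) for i in range(17)]
```

Here `rates` rises from `base_lr` to `max_lr` over the first 4 iterations and
falls back over the next 4, matching the clr_steps = 4 example above.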
Plotting Parameters
-------------------
xvals : str. .NPY file with the x-axis values corresponding to
the NN output.
xlabel : str. X-axis label for plots.
ylabel : str. Y-axis label for plots.
plot_cases : ints. Specifies which cases in the test set should be
plotted vs. the true spectrum.
Note: must be separated by spaces or indented newlines.
Statistics Files
----------------
fmean : str. File name containing the mean of each input/output.
Assumed to be in `inputdir`.
fstdev : str. File name containing the standard deviation of each
input/output.
Assumed to be in `inputdir`.
fmin : str. File name containing the minimum of each input/output.
Assumed to be in `inputdir`.
fmax : str. File name containing the maximum of each input/output.
Assumed to be in `inputdir`.
rmse_file : str. Prefix for the file to be saved containing the root mean
squared error of predictions on the validation & test data.
Saved into `outputdir`.
r2_file : str. Prefix for the file to be saved containing the
coefficient of determination (R^2) of predictions on the
validation & test data.
Saved into `outputdir`.
filters : strings. (optional) Paths/to/filter files that define a
bandpass over `xvals`.
Two columns, wavelength and transmission.
Used to compute statistics for band-integrated values.
filt2um : float. (default: 1.0) Factor to convert the filter files'
X values to microns.
Only used if `filters` is specified.
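For reference, the two statistics named above can be sketched with their
standard definitions. This assumes each statistic is computed separately for
every output over all data cases; the README does not state how MARGE
aggregates them, and the function names and sample values here are
hypothetical.

```python
import numpy as np

def rmse_per_output(y_true, y_pred):
    """Root mean squared error of each output, taken over the data cases."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2, axis=0))

def r2_per_output(y_true, y_pred):
    """Coefficient of determination (R^2) of each output."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

# Three data cases with two outputs each; predictions carry a constant offset.
y_true = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y_pred = y_true + np.array([0.1, -0.2])

rmse = rmse_per_output(y_true, y_pred)
r2 = r2_per_output(y_true, y_pred)
```

An R^2 of 1 indicates perfect predictions, while an RMSE of 0 does the same
in the units of the output.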
Versions
========
MARGE was developed on a Unix/Linux machine using the following
versions of packages:
- Python 3.12.8
- Keras 3.7.0
- Numpy 1.26.4
- Matplotlib 3.10.0
- Scipy 1.14.1
- sklearn 1.6.0
- Tensorflow 2.17.0
- CUDA 9.1.85
- cuDNN 7.5.00
MARGE also requires a working MPI distribution if using BART for
data generation. MARGE was developed using MPICH version 3.3.2.
Be kind
=======
Please cite this paper if you found this package useful for your research:
Himes et al. 2022, PSJ, 3, 91
https://iopscience.iop.org/article/10.3847/PSJ/abe3fd/meta
@ARTICLE{2022PSJ.....3...91H,
author = {{Himes}, Michael D. and {Harrington}, Joseph and {Cobb}, Adam D. and {G{\"u}ne{\c{s}} Baydin}, At{\i}l{\i}m and {Soboczenski}, Frank and {O'Beirne}, Molly D. and {Zorzan}, Simone and {Wright}, David C. and {Scheffer}, Zacchaeus and {Domagal-Goldman}, Shawn D. and {Arney}, Giada N.},
title = "{Accurate Machine-learning Atmospheric Retrieval via a Neural-network Surrogate Model for Radiative Transfer}",
journal = {\psj},
keywords = {Exoplanet atmospheres, Bayesian statistics, Posterior distribution, Convolutional neural networks, Neural networks, 487, 1900, 1926, 1938, 1933},
year = 2022,
month = apr,
volume = {3},
number = {4},
eid = {91},
pages = {91},
doi = {10.3847/PSJ/abe3fd},
adsurl = {https://ui.adsabs.harvard.edu/abs/2022PSJ.....3...91H},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
Thanks!