-
Notifications
You must be signed in to change notification settings - Fork 3
Machine learning Algorithm for Radiative transfer of Generated Exoplanets
License
exosports/MARGE
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
MARGE Machine learning Algorithm for Radiative transfer of Generated Exoplanets =============================================================================== Author : Michael D. Himes Morgan State University/NASA GSFC (prev. University of Central Florida) Contact: [email protected] Advisor: Joseph Harrington University of Central Florida Uncredited contributors: Adam D. Cobb - plan-net code inspired NN class & training David C. Wright - containerized the code Zaccheus Scheffer - assisted with refactoring the codebase that led to MARGE Acknowledgements ---------------- This research was supported by the NASA Fellowship Activity under NASA Grant 80NSSC20K0682. We gratefully thank Nvidia Corporation for the Titan Xp GPU that was used throughout development of the software. Summary ======= MARGE is a Python package that abstracts the process of optimizing and training neural network (NN) models on labeled data sets. Users can supply a function to interface MARGE with another code to be used for data generation. The original use case for MARGE was to generate exoplanet spectra across a defined parameter space using the Bayesian Atmospheric Radiative Transfer (BART) code and train an NN surrogate model for radiative transfer (RT). MARGE comes with complete documentation as well as a user manual to assist in its usage. Users can find the latest MARGE User Manual at https://exosports.github.io/MARGE/doc/MARGE_User_Manual.html. (Note: it is currently outdated, but updates are in progress.) MARGE is an open-source project that welcomes improvements from the community to be submitted via pull requests on Github. To be accepted, such improvements must be generally consistent with the existing coding style, and all changes must be updated in associated documentation. MARGE is released under the Reproducible Research Software License. Users are free to use the software for personal reasons. Users publishing results from MARGE or modified versions of it in a peer-reviewed journal are required to publicly release all necessary code/data to reproduce the published work. Modified versions of MARGE are required to also be released open source under the same Reproducible Research Software License. For more details, see the text of the license. Files & Directories =================== MARGE contains various files and directories, described here. doc/ - Contains documentation for MARGE. The User Manual contains the information in the README with more detail, as well as a walkthrough for setup and running an example. environment.yml - Conda environment for MARGE. example/ - Contains example configuration files for MARGE. lib/ - Contains the classes and functions of MARGE. callbacks.py - Contains Keras Callback classes. datagen/ - Contains files related to data generation BART/ - Files necessary for data generation/processing with BART. BARTfunc.py - Modified MCMC function that saves out full spectra. datagen.py - Contains functions for generating/processing data with BART. datagen_pypsg.py - As above, but for the pypsg format (see https://github.com/mdhimes/pypsg). loader.py - Contains functions related to loading processed data. NN.py - Contains the NN model class, and a driver function for model training/validating/testing. plotter.py - Contains plotting functions. stats.py - Contains functions related to statistics. utils.py - Contains utiity functions used for internal processing. Makefile - Handles setting up BART, and creating a TLI file. MARGE.py - The executable driver for MARGE. Accepts a configuration file. modules/ - Contains necessary modules for data generation. Initially empty; will contain BART via the Makefile. README - This file! requirements.txt- Additional packages for the conda environment installed via pip. Note that all .py files have complete documentation; consult a specific file for more details. Installation ============ To build the included conda environment, enter conda env create -f environment.yml into a terminal. Then, activate it via conda activate marge At present, MARGE only supports using BART for data generation. For details, see the associated user manual in the modules/BART/doc directory. If running MARGE in data generation mode, users must first compile BART's submodules. This can be done by entering make bart into a terminal while in the MARGE directory. This requires SWIG, which is not included in the default conda environment. If the user does not have SWIG installed, they can easily add SWIG to the current conda environment by entering conda install swig into the terminal. Note that it uses tensorflow-gpu, which assumes the user has Nvidia drivers installed. If this is not the case, after building the environment, non-GPU users must enter conda remove -n marge tensorflow-gpu and follow the prompts. This will remove the requirement for Nvidia drivers. Be aware that not using a GPU will require significantly longer times to execute MARGE. For more details, see the MARGE User Manual. Executing MARGE =============== MARGE has 2 execution modes: data generation, and ML model training. After setting up the MARGE configuration file, run MARGE via ./MARGE.py <path/to/configuration file> Data generation is handled via a user-provided function to interface with some other code. At present, MARGE only includes such a function for BART. Note that in some cases (e.g., the quick_example case in the example/ subdirectory), it is simpler to generate the data without using MARGE, then ingest that data into MARGE. Before data generation via BART can begin, users must first generate a TLI file via pylineread, BART's line list-processing code. To do so, input make TLI cfile=<path/to/configration file> to call pylineread with a configuration file. For details on how to set this up, see the pylineread directory within BART's RT module, transit. After generating a TLI file, data generation can begin. Note that data generation via BART requires a BART configuration file. In order to save the data, the BART config file must have the 'savemodel' parameter set to <name>.npy, where <name> is some user-chosen prefix. This config file must also have the 'modelper' parameter set; this controls the number of iterations per chain to save out each model file, to ensure that the evaluated models will fit in memory. If the models will fit in memory as a single array, set modelper=0 to save out a single file with all models. If modelper>0, the generated .npy files will be saved to a subdirectory within BART's output directory, named <name> from the 'savemodel' parameter. For a detailed example walkthrough, see the README in the example/BART_example/ subdirectory or the user manual. ML Model Training: Data ----------------------- If training an ML model using MARGE, data must be formatted a particular way. At present, MARGE requires that the files are Numpy zipped archives (NPZ). Each NPZ file must contain named data sets of 'x' for inputs and 'y' for outputs. All other named data sets are ignored at present. The 'x' and 'y' arrays must be at least 2D, where the 0th axis corresponds to each data case. For example, arr['x'][0] would be the inputs for one data case in the data set. Each data case can contain as many dimensions as necessary. Users are encouraged to introduce additions to loader.py to allow additional data formats, which would be specified in the configuration file. Please submit any such modifications via pull requests on Github. Setting Up a MARGE Configuration File ===================================== A MARGE configuration file contains many options, which are detailed in this section. For examples, see MARGE*.cfg files in the example subdirectories and their associated READMEs with instructions on execution. Directories ----------- inputdir : str. Directory containing MARGE inputs. outputdir : str. Directory containing MARGE outputs. plotdir : str. Directory to save plots. If relative path, subdirectory with respect to `outputdir`. datadir : str. Directory to store generated data. If relative path, subdirectory with respect to `outputdir`. preddir : str. Directory to store validation and test set predictions and true values. If relative path, subdirectory with respect to `outputdir`. Datagen Parameters ------------------ datagen : bool. Determines whether to generate data. datagenfile: str. File containing functions to generate and/or process data. Do NOT include the .py extension! See the files in the lib/datagen directory for examples. User-specified files must have identically-named functions with an identical set of inputs. If an additional input is required, the user must modify the code in MARGE.py accordingly. Please submit a pull request if this occurs! cfile : str. Name of the configuration file for data generation. Can be absolute path, or relative path to `inputdir`. processdat : bool. Determines whether to process the generated data. preservedat: bool. Determines whether to preserve the unprocessed data after completing data processing. Note: if False, it will PERMANENTLY DELETE the original, unprocessed data! Neural Network (NN) Parameters ------------------------------ nnmodel : bool. Determines whether to use an NN model. resume : bool. Determines whether to resume training a previously-trained model. seed : int. Random seed. trainflag : bool. Determines whether to train an NN model. validflag : bool. Determines whether to validate an NN model. testflag : bool. Determines whether to test an NN model. TFR_file : str. Prefix for the TFRecords files to be created. buffer : int. Number of batches to pre-load into memory. ncores : int. Number of CPU cores to use to load the data in parallel. normalize : bool. Determines whether to normalize the data by its mean and standard deviation. scale : bool. Determines whether to scale the data to be within a range. scalelims : floats. The min, max of the range of the scaled data. It is recommended to use -1, 1 weight_file: str. File containing NN model weights. NOTE: MUST end in .h5 input_dim : int. Dimensionality of the input to the NN. output_dim : int. Dimensionality of the output of the NN. ilog : bool. Determines whether to take the log10 of the input data. Alternatively, specify comma-, space-, or newline-separated integers to selectively take the log of certain inputs. olog : bool. Determines whether to take the log10 of the output data. Alternatively, specify comma-, space-, or newline-separated integers to selectively take the log of certain outputs. gridsearch : bool. Determines whether to perform a gridsearch over architectures. architectures: strings. Name/identifier for a given model architecture. Names must not include spaces! For multiple architectures, separate with a space or indented newlines. nodes : ints. Number of nodes per layer with nodes. layers: strings. Type of each hidden layer. Options: dense, conv1d, maxpool1d, avgpool1d, flatten, dropout lay_params: list. Parameters for each layer (e.g., kernel size). For no parameter or the default, use None. activations: strings. Type of activation function per layer with nodes. Options: linear, relu, leakyrelu, elu, tanh, sigmoid, exponential, softsign, softplus, softmax act_params: list. Parameters for each activation. Use None for no parameter or the default value. Values specified only apply to LeakyReLU and ELU. epochs : int. Maximum number of iterations through the training data set. patience : int. Early-stopping criteria; stops training after `patience` epochs of no improvement in validation loss. batch_size : int. Mini-batch size for training/validation steps. lengthscale: float. Minimum learning rate. max_lr : float. Maximum learning rate. clr_mode : str. Specifies the function to use for a cyclical learning rate (CLR). 'triangular' linearly increases from `lengthscale` to `max_lr` over `clr_steps` iterations, then decreases. 'triangular2' performs similar to `triangular', except that the `max_lr` value is decreased by 2 every complete cycle, i.e., 2 * `clr_steps`. 'exp_range' performs similar to 'triangular2', except that the amplitude decreases according to an exponential based on the epoch number, rather than the CLR cycle. clr_steps : int. Number of steps through a half-cycle of the learning rate. E.g., if using clr_mode = 'triangular' and clr_steps = 4, Every 8 epochs will have the same learning rate. It is recommended to use an even value. For more details, see Smith (2015), Cyclical Learning Rates for Training Neural Networks. Plotting Parameters ------------------- xvals : str. .NPY file with the x-axis values corresponding to the NN output. xlabel : str. X-axis label for plots. ylabel : str. Y-axis label for plots. plot_cases : ints. Specifies which cases in the test set should be plotted vs. the true spectrum. Note: must be separated by spaces or indented newlines. Statistics Files ---------------- fmean : str. File name containing the mean of each input/output. Assumed to be in `inputdir`. fstdev : str. File name containing the standard deviation of each input/output. Assumed to be in `inputdir`. fmin : str. File name containing the minimum of each input/output. Assumed to be in `inputdir`. fmax : str. File name containing the maximum of each input/output. Assumed to be in `inputdir`. rmse_file : str. Prefix for the file to be saved containing the root mean squared error of predictions on the validation \& test data. Saved into `outputdir`. r2_file : str. Prefix for the file to be saved containing the coefficient of determination (R\^2) of predictions on the validation \& test data. Saved into `outputdir`. filters : strings. (optional) Paths/to/filter files that define a bandpass over `xvals`. Two columns, wavelength and transmission. Used to compute statistics for band-integrated values. filt2um : float. (default: 1.0) Factor to convert the filter files' X values to microns. Only used if `filters` is specified. Versions ======== MARGE was developed on a Unix/Linux machine using the following versions of packages: - Python 3.12.8 - Keras 3.7.0 - Numpy 1.26.4 - Matplotlib 3.10.0 - Scipy 1.14.1 - sklearn 1.6.0 - Tensorflow 2.17.0 - CUDA 9.1.85 - cuDNN 7.5.00 MARGE also requires a working MPI distribution if using BART for data generation. MARGE was developed using MPICH version 3.3.2. Be kind ======= Please cite this paper if you found this package useful for your research: Himes et al. 2022, PSJ, 3, 91 https://iopscience.iop.org/article/10.3847/PSJ/abe3fd/meta @ARTICLE{2022PSJ.....3...91H, author = {{Himes}, Michael D. and {Harrington}, Joseph and {Cobb}, Adam D. and {G{\"u}ne{\c{s}} Baydin}, At{\i}l{\i}m and {Soboczenski}, Frank and {O'Beirne}, Molly D. and {Zorzan}, Simone and {Wright}, David C. and {Scheffer}, Zacchaeus and {Domagal-Goldman}, Shawn D. and {Arney}, Giada N.}, title = "{Accurate Machine-learning Atmospheric Retrieval via a Neural-network Surrogate Model for Radiative Transfer}", journal = {\psj}, keywords = {Exoplanet atmospheres, Bayesian statistics, Posterior distribution, Convolutional neural networks, Neural networks, 487, 1900, 1926, 1938, 1933}, year = 2022, month = apr, volume = {3}, number = {4}, eid = {91}, pages = {91}, doi = {10.3847/PSJ/abe3fd}, adsurl = {https://ui.adsabs.harvard.edu/abs/2022PSJ.....3...91H}, adsnote = {Provided by the SAO/NASA Astrophysics Data System} } Thanks!
About
Machine learning Algorithm for Radiative transfer of Generated Exoplanets
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published