add matched groups documentation #207

Open
rosemccollum opened this issue Apr 22, 2024 · 2 comments

@rosemccollum (Collaborator)

/home/feczk001/shared/code/internal/utilities/automated_subset_analysis/HowToMakeSubsets.md

@LuciMoore (Contributor)

I submitted an issue to update the README for the automated subset codebase DCAN-Labs/automated-subset-analysis#2

For CDNI Brain, we can just add a link for now and I'll add MSI-specific info as I go.

@LuciMoore (Contributor)

Here is the information specific to MSI that can be added to CDNI's Brain. The rest has been integrated into the repo's README on GitHub. Some of the formatting might need updating, since I just picked out the parts relevant to MSI from different sections of the documentation. It would also be good to review the contents for accuracy (for instance, I'm not sure whether the errors described at the end are still an issue).


Creating Matched Demographic Subgroups

For full information on how to run automated_subset_analysis.py and which arguments/options to run it with, see the README.md files in the automated subset analysis repository on GitHub, especially the top-level README.md.

Location on the MSI

Several copies and subsets of the ABCD ARMS demographics files are kept in the directory at the path below:

/home/feczk001/shared/projects/ABCD/core_task_activation_study/code/demographics/

I typically use the group1_demo_original.csv and group2_demo_original.csv demographics files in the directory above.

How to Generate Group Average Matrices

You will need either

  1. two averaged matrix .pconn.nii files, one for ARM-1 and another for ARM-2; or
  2. two .conc files listing .pconn.nii file paths. Each .conc file must list the path to the matrix file for every individual subject in its ARM.
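If you do not already have .conc files, you can build one per ARM by listing each subject's matrix file, one absolute path per line. A minimal sketch, assuming a directory that holds one .pconn.nii file per subject (the directory path below is a placeholder; adjust it and the glob to match your data):

```shell
# Hypothetical directory holding one .pconn.nii per subject for ARM-1
dir_gp1="/path/to/ARM1_pconns"

# Write one absolute path per line -- no header line, unlike the
# demographics .csv files.
ls "${dir_gp1}"/*.pconn.nii > group1_matrices.conc

# Sanity check: the line count should equal the number of subjects
wc -l group1_matrices.conc
```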

Example

To generate matched subsets using the existing subset analysis script on the MSI, run automated_subset_analysis.py with the Bash code provided below.

In the code below, change the variable declarations so they match (a) the data that you can access and (b) what you are trying to do. I filled in the paths to the input data I normally use.

# Parent directory containing ABCD data, including input files for this script
dir_ABCD="/home/feczk001/shared/projects/ABCD";

# Directory containing average brain scan .pconn.nii files 
dir_pconns="${dir_ABCD}/conan_subset_analysis/gp-avg-pconns";

# Directory containing demographics .csv files
dir_demo="${dir_ABCD}/core_task_activation_study/code/demographics";

# Directory containing subset analysis code
dir_ASA="/home/feczk001/shared/code/internal/utilities/automated_subset_analysis";

# Directory to save subset analysis output files into
dir_output="./test/output_dir/"; 

# How many participants should be in each subset? Do you want to generate
# subsets of multiple sizes, e.g. 50 participants in one and 500 in another?
subjects_per_subset="50 100 500";

# How many subset pairs of each subset size do you want to generate? 
pairs_to_make=2;

# Run the subset analysis script
python3 ${dir_ASA}/automated_subset_analysis.py \
  ${dir_demo}/group1_demo_original.csv \
  ${dir_demo}/group2_demo_original.csv \
  --group-1-avg-file ${dir_pconns}/gp1_pconns_AVG.pconn.nii  \
  --group-2-avg-file ${dir_pconns}/gp2_pconns_AVG.pconn.nii  \
  --subset-size ${subjects_per_subset} \
  --n-analyses ${pairs_to_make} \
  --output ${dir_output}  

Output Files

The script will save the demographically matched subset pairs into .csv files in the --output directory. Each file will be named subset_{x}_with_{y}_subjects.csv, where x ranges from 1 to the --n-analyses value and y is each value in the --subset-size list.
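With the example values above (--subset-size "50 100 500" and --n-analyses 2), the naming scheme can be enumerated like this:

```shell
subjects_per_subset="50 100 500"
pairs_to_make=2

# Enumerate the expected subset file names in the --output directory
for y in ${subjects_per_subset}; do
  for x in $(seq 1 ${pairs_to_make}); do
    echo "subset_${x}_with_${y}_subjects.csv"
  done
done
# → subset_1_with_50_subjects.csv, subset_2_with_50_subjects.csv, ... (6 names total)
```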

Warnings and Troubleshooting

Script Printing Too Much Text

In its current form, the subset analysis script prints a lot of text to the terminal. If you would rather hide that text, append >/dev/null 2>&1 to the end of the command in the Example section above. (The order matters: stdout is redirected to /dev/null first, and stderr is then pointed at the same place.)
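The same redirection works for any command; here a stand-in command that writes to both streams produces nothing on the terminal:

```shell
# stdout goes to /dev/null; 2>&1 then sends stderr to the same place,
# so neither line appears on the terminal.
{ echo "stdout text"; echo "stderr text" >&2; } >/dev/null 2>&1
```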

Takes Too Much Time to Run

Several factors can make the automated_subset_analysis.py script take longer than necessary to generate subset pairs:

  • If one of the --subset-size values is below 50, it can take much longer to randomly generate a demographically matched subset pair. I think that is mostly because a smaller number of subjects usually means a higher standard deviation in the demographic variables.
    • Subset sizes below 25 (or so) take an impossibly long time to generate. If the subsets are taking far too long to generate, you may need to either (a) try a higher subset size or (b) use --no-matching to skip demographic matching on family ID.
  • If you provide .conc files (--matrices-conc-1 and --matrices-conc-2) instead of average .pconn.nii files (--group-1-avg-file and --group-2-avg-file), then the subset analysis script will first generate an average .pconn.nii file for each ARM. That takes significantly longer than using existing average .pconn.nii files.
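If average files are unavailable and you must use .conc files, the invocation from the Example section changes only in its matrix arguments. A sketch of that variant, reusing the variables defined above (the .conc file names and their location are placeholders):

```shell
# Same as the Example invocation, but with .conc files instead of
# --group-1-avg-file / --group-2-avg-file. Expect a longer runtime,
# since the script must first compute each ARM's average matrix.
python3 ${dir_ASA}/automated_subset_analysis.py \
  ${dir_demo}/group1_demo_original.csv \
  ${dir_demo}/group2_demo_original.csv \
  --matrices-conc-1 ${dir_demo}/group1_matrices.conc \
  --matrices-conc-2 ${dir_demo}/group2_matrices.conc \
  --subset-size ${subjects_per_subset} \
  --n-analyses ${pairs_to_make} \
  --output ${dir_output}
```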

Common Errors

Not enough subjects... or ValueError

Full error:

Not enough subjects in population to randomly select a sample with {X} subjects, because {Y} subjects cannot be randomly swapped out from a pool of {Z} subjects

or

ValueError: Cannot take a larger sample than population when 'replace=False'

Problem: At least one of the --subset-size values is too high.

Solutions:

  1. Reduce the largest --subset-size value to, at most, about 45% of the smallest ARM's size.
  2. Include the --no-matching flag to skip family matching.

Explanation:

  1. All --subset-size values must be large enough to demographically match the other ARM, but small enough that any participant whose inclusion is invalid for any reason (e.g. they have family members outside the subset) can be swapped out. For example, if the smallest ARM has 3000 subjects, then errors will occur unless you keep the --subset-size numbers under 1500. If errors still occur, try reducing the largest --subset-size further. The new subset size must be less than about 45% of the smallest ARM's size, excluding participants with NaNs in the demographics file. The number and percentage of participants with NaNs in each group is printed right after the script begins.
  2. By default, automated_subset_analysis.py checks that every subset (a) has the same proportion of twins/triplets as the other ARM, and (b) excludes anyone with family members outside the subset. The --no-matching flag turns both checks off. It lets you generate subsets smaller than 25 participants or larger than half the ARM size, and it also speeds up subset generation and checking.
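As a quick sanity check, the rough ceiling from point 1 can be computed from the smallest ARM's usable size (after excluding participants with NaNs). The counts below are hypothetical; substitute the numbers the script prints when it starts:

```shell
# Hypothetical counts for the smaller ARM
smallest_arm=3000   # total subjects in the smaller ARM
nan_subjects=200    # subjects with NaNs in its demographics file

usable=$(( smallest_arm - nan_subjects ))
max_subset=$(( usable * 45 / 100 ))   # ~45% guideline from above
echo "Largest safe --subset-size: ${max_subset}"
# → Largest safe --subset-size: 1260
```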

FileNotFoundError: No such file or no access: '/home/exacloud/...

Problem: Incorrect .pconn.nii paths were read from the group demographics file(s), from the --matrices-conc-1 file, or from the --matrices-conc-2 file.

Solutions:

  1. If you only want to generate subset pairs, then you can simply ignore this error: the subset pairs were already generated.
  2. Ensure that every line/row in the last column of the group 1 and 2 demographics files contains one path to an existing, readable .pconn.nii file.
  3. If you can't/won't change the demographics file(s), then use the --matrices-conc-1 and --matrices-conc-2 flags with a path to a valid .conc file after each.

Explanation:

  1. Even using the code I gave in the Example section above, the script may still crash with this error message after generating your subsets. That is fine if you only need the subset pairs. (A later version of the script should include a specific option to only generate subsets.)
  2. The automated_subset_analysis code was originally written to run on OHSU's Rushmore and Exacloud servers. It has directory paths on those servers hardcoded. So, if the script cannot find a .pconn.nii file at the given path, it will look for the file using the hardcoded Exacloud directory path.
  3. The --matrices-conc-1 (-conc1) and --matrices-conc-2 (-conc2) arguments exist to replace the paths in the last column of the demographics files. If those columns contain no valid paths, you can provide -conc1 and -conc2, each pointing to a .conc file with one valid path to an existing .nii file per line. Each line of a .conc file must refer to the same subject as the corresponding line of its demographics file. Because the demographics files should have header lines but the .conc files should not, line 1 of the group 1 .conc file corresponds to line 2 of the group 1 demographics .csv, line 2 of the .conc to line 3 of the .csv, and so on.
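Because of that one-line header offset, a quick way to check that a .conc file lines up with its demographics file is to compare line counts. The file names below are placeholders:

```shell
demo_csv="group1_demo_original.csv"   # has a header line
conc_file="group1_matrices.conc"      # no header line

demo_rows=$(( $(wc -l < "${demo_csv}") - 1 ))   # subtract the header
conc_rows=$(wc -l < "${conc_file}")

if [ "${demo_rows}" -eq "${conc_rows}" ]; then
  echo "Line counts match: ${conc_rows} subjects"
else
  echo "Mismatch: ${demo_rows} demographics rows vs ${conc_rows} .conc lines" >&2
fi
```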
