Alternative download link for PhyloP46way & PhastCons46Way missing? #6

KANGseungseok · 2025-02-10T08:00:00Z

Hello,

I encountered the following error while running the analysis:

OSError: Failed to initialize a BedReader instance. 'CWAS/analysis/annotation-data/vertebrate.phastCons46way.hg19ToHg38.over02.bed.gz' does not exist.

After investigating, I found that this file had not been downloaded. The README mentions an alternative download link for the PhyloP46way and PhastCons46Way BED files, but I was unable to find the actual link in the repository.

Could you provide the correct download link or guidance on how to access these files?

Thank you for your time and assistance!


randrover · 2025-02-11T00:28:22Z

Thank you for your interest.

Here are the download links (Dropbox) for the files you requested:

vertebrate.phastCons46way.hg19ToHg38.over02.bed.gz: Link

vertebrate.phyloP46way.hg19ToHg38.over2.bed.gz: Link

If you encounter any issues accessing the links, please let me know.

KANGseungseok · 2025-02-11T10:37:30Z

Thank you very much for promptly providing the requested data.

I am currently running CWAS-Plus and have encountered a few challenges, so I would appreciate your guidance.

I encountered the following error: ValueError: could not create iterator for region 'chrX'. Upon investigation, I found that in /cwas/core/preparation/annotation.py, the chromosome list is defined as chroms = [f"chr{n}" for n in range(1, 23)], which excludes chrX and chrY from parsing. After modifying the code to include chrX and chrY, the annotation process proceeded successfully. Is this modification acceptable, or could this cause any potential issues?
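A minimal sketch of the modification described above, assuming the list in annotation.py is built exactly as quoted. The name `autosomes` is introduced here for illustration only and does not come from the CWAS-Plus source:

```python
# Autosomes only, as in the quoted line from annotation.py.
autosomes = [f"chr{n}" for n in range(1, 23)]

# Hypothetical extension: append the sex chromosomes so that
# regions on chrX and chrY are parsed as well.
chroms = autosomes + ["chrX", "chrY"]

print(len(chroms), chroms[-3:])
```

Appending to the existing list keeps the autosome order unchanged, so only the two new regions are added to whatever loop consumes `chroms`.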

Additionally, when attempting to run the burden analysis, I encountered another error: ValueError: The sample IDs from the adjustment factor list are not the same as the sample IDs from the categorization result. I have verified that my sample information, adjustment factor, and categorization result files contain the same sample names. However, the CWAS-Plus documentation states that the sample information file must contain three columns but explicitly names only two, SAMPLE and PHENOTYPE. Could you clarify what the third column should be? Furthermore, the burden test option -s, --sample_info specifies that the file must have three columns (SAMPLE, FAMILY, PHENOTYPE) with these exact names. Could you explain what the FAMILY column represents and whether it is required for all analyses?

I truly appreciate your time and assistance and look forward to your response.

randrover · 2025-02-14T08:44:03Z

"After modifying the code to include chrX and chrY, the annotation process proceeded successfully."
-> Thank you for your detailed report!
I'm glad that modifying the code to include chrX and chrY allowed the annotation process to complete. I appreciate you pointing this out.

Regarding the error related to sample ID mismatches, it seems to originate from the function below, which checks whether two lists of SAMPLE IDs are identical:

def cmp_two_arr(array1: np.ndarray, array2: np.ndarray) -> bool:
    """ Return True if two arrays have the same items regardless of the order.
    Otherwise, it returns False.
    """
    if len(array1) != len(array2):
        return False

    array1_item_set = set(array1)

    for item in array2:
        if item not in array1_item_set:
            return False

    return True

You can check if your dataset passes this function using the following:

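# Excerpts from the CWAS-Plus code that loads both tables and compares them.
# `self` is the enclosing class instance; the excerpts rely on these imports:
import pandas as pd
import zarr

# Load the categorization result, indexed by sample ID.
root = zarr.open(self.cat_path, mode='r')
self._categorization_result = pd.DataFrame(
    data=root['data'],
    index=root['metadata'].attrs['sample_id'],
    columns=root['metadata'].attrs['category']
)
self._categorization_result.index.name = 'SAMPLE'

# Load the adjustment factors, indexed by sample ID (read as strings).
self._adj_factor = pd.read_table(
    self.adj_factor_path, index_col="SAMPLE", dtype={"SAMPLE": str}, sep="\t"
)

# Compare the two indexes using cmp_two_arr, defined above.
def _contain_same_index(table1: pd.DataFrame, table2: pd.DataFrame) -> bool:
    return cmp_two_arr(table1.index.values, table2.index.values)

_contain_same_index(self._categorization_result, self.adj_factor)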

If the issue persists, please verify that the sample names in your categorization result and adjustment factor files are completely identical, including case sensitivity and possible trailing spaces.
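To narrow down where two ID lists differ, something like the following sketch may help. It is not part of CWAS-Plus: `diff_sample_ids` and the keys of the report are hypothetical names, and the example IDs are made up. It uses only the standard library, so the IDs can come from any source, for example the `index` of each table.

```python
from collections import Counter
from typing import Dict, Iterable, List


def diff_sample_ids(cat_ids: Iterable[str], adj_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Report how two collections of sample IDs differ (hypothetical helper)."""
    cat_list = [str(s) for s in cat_ids]
    adj_list = [str(s) for s in adj_ids]
    cat_set, adj_set = set(cat_list), set(adj_list)

    only_cat = sorted(cat_set - adj_set)
    only_adj = sorted(adj_set - cat_set)

    # IDs that would match after trimming whitespace and ignoring case.
    adj_normalized = {s.strip().casefold() for s in only_adj}
    near_misses = [s for s in only_cat if s.strip().casefold() in adj_normalized]

    # Duplicated IDs change the list length, which cmp_two_arr checks first.
    duplicates = sorted(
        s for s, n in (Counter(cat_list) + Counter(adj_list)).items()
        if cat_list.count(s) > 1 or adj_list.count(s) > 1
    )

    return {
        "only_in_categorization": only_cat,
        "only_in_adjustment": only_adj,
        "near_misses": near_misses,
        "duplicates": duplicates,
    }


# Made-up IDs: the third one differs by case and a trailing space.
report = diff_sample_ids(["S1", "S2", "s3 "], ["S1", "S2", "S3"])
for key, values in report.items():
    print(key, [repr(v) for v in values])
```

Printing each ID with `repr` makes trailing spaces and other invisible characters visible in the output.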

As for your question about the third column in the sample information file, thank you for pointing that out! The FAMILY column was previously used to indicate family IDs, but it is no longer required for burden testing. The documentation has not been updated yet, but we will correct it promptly.

Please let me know if you need further clarification!
