Alternative download link for PhyloP46way & PhastCons46Way missing? #6

Open
KANGseungseok opened this issue Feb 10, 2025 · 3 comments

Comments

@KANGseungseok

Hello,

I encountered the following error while running the analysis:

OSError: Failed to initialize a BedReader instance. 'CWAS/analysis/annotation-data/vertebrate.phastCons46way.hg19ToHg38.over02.bed.gz' does not exist.
After investigating, I noticed that this file is not downloaded. Additionally, the README mentions an alternative download link for the PhyloP46way and PhastCons46Way BED files, but I was unable to find the actual link in the repository.

Could you provide the correct download link or guidance on how to access these files?

Thank you for your time and assistance!

@randrover
Member

Thank you for your interest.

Here are the download links (Dropbox) for the files you requested:

vertebrate.phastCons46way.hg19ToHg38.over02.bed.gz: Link

vertebrate.phyloP46way.hg19ToHg38.over2.bed.gz: Link

If you encounter any issues accessing the links, please let me know.

@KANGseungseok
Author

Thank you very much for promptly providing the requested data.

I am currently running CWAS-Plus and have encountered a few challenges, so I would appreciate your guidance.

I encountered the following error: ValueError: could not create iterator for region 'chrX'. Upon investigation, I found that in /cwas/core/preparation/annotation.py, the chromosome list is defined as chroms = [f"chr{n}" for n in range(1, 23)], which excludes chrX and chrY from parsing. After modifying the code to include chrX and chrY, the annotation process proceeded successfully. Is this modification acceptable, or could this cause any potential issues?
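For reference, the change described above might look like the following. This is only a sketch of the edit to the list comprehension quoted from annotation.py; the `autosomes` and `sex_chroms` names are introduced here for illustration and are not taken from the CWAS-Plus source.

```python
# Autosomes chr1..chr22, exactly what the original list comprehension produces.
autosomes = [f"chr{n}" for n in range(1, 23)]

# Hypothetical extension: append the sex chromosomes so they are parsed as well.
sex_chroms = ["chrX", "chrY"]
chroms = autosomes + sex_chroms

print(len(chroms), chroms[-2:])
```

Whether downstream steps expect sex-chromosome variants is a separate question that this sketch does not address.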

Additionally, when attempting to run the burden analysis, I encountered another error: ValueError: The sample IDs from the adjustment factor list are not the same as the sample IDs from the categorization result. I have verified that my sample information, adjustment factor, and categorization result files contain the same sample names. However, the CWAS-Plus documentation states that the sample information file must contain three columns but explicitly names only two, SAMPLE and PHENOTYPE. Could you clarify what the third column should be? Furthermore, in the burden test options, -s, --sample_info specifies that the file must have three columns (SAMPLE, FAMILY, PHENOTYPE) with these exact names. Could you explain what the FAMILY column represents and whether it is required for all analyses?

I truly appreciate your time and assistance and look forward to your response.

@randrover
Member

"After modifying the code to include chrX and chrY, the annotation process proceeded successfully."
-> Thank you for your detailed report!
I'm glad that modifying the code to include chrX and chrY allowed the annotation process to proceed successfully. I appreciate you pointing this out.

Regarding the error related to sample ID mismatches, it seems to originate from the function below, which checks whether two lists of SAMPLE IDs are identical:

import numpy as np

def cmp_two_arr(array1: np.ndarray, array2: np.ndarray) -> bool:
    """ Return True if two arrays have the same items regardless of the order.
    Otherwise, it returns False.
    """
    if len(array1) != len(array2):
        return False

    array1_item_set = set(array1)

    for item in array2:
        if item not in array1_item_set:
            return False

    return True

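You can check whether your dataset passes this function with the following, adapted from the CWAS-Plus code (set cat_path and adj_factor_path to your own categorization result and adjustment factor files):

import pandas as pd
import zarr

# Load the categorization result (zarr), using the stored sample IDs as the index.
root = zarr.open(cat_path, mode='r')
categorization_result = pd.DataFrame(
    data=root['data'],
    index=root['metadata'].attrs['sample_id'],
    columns=root['metadata'].attrs['category']
)
categorization_result.index.name = 'SAMPLE'

# Load the adjustment factors, reading the SAMPLE column as strings.
adj_factor = pd.read_table(
    adj_factor_path, index_col="SAMPLE", dtype={"SAMPLE": str}, sep="\t"
)

def _contain_same_index(table1: pd.DataFrame, table2: pd.DataFrame) -> bool:
    return cmp_two_arr(table1.index.values, table2.index.values)

# True means the two tables have matching sample IDs.
print(_contain_same_index(categorization_result, adj_factor))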

If the issue persists, please verify that the sample names in your categorization result and adjustment factor files are completely identical, including case sensitivity and possible trailing spaces.
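To narrow down which IDs differ and why, a small diagnostic along these lines may help. It is a minimal sketch, not part of CWAS-Plus: `diagnose_id_mismatch` is a hypothetical helper, and the example IDs are made up. It also flags type mismatches (for example, an integer 103 versus the string "103"), which fail an exact set comparison just as case or whitespace differences do.

```python
from typing import Dict, Iterable, List


def diagnose_id_mismatch(ids_a: Iterable, ids_b: Iterable) -> Dict[str, List[str]]:
    """Report which sample IDs differ between two collections (hypothetical helper)."""
    set_a, set_b = set(ids_a), set(ids_b)

    def norm(value) -> str:
        # Cast to str, strip surrounding whitespace, and ignore case.
        return str(value).strip().lower()

    normalized_b = {norm(x) for x in set_b}
    missing_from_b = set_a - set_b

    return {
        # repr() keeps type and whitespace visible, e.g. 103 vs '103' vs '103 '.
        "only_in_a": sorted(repr(x) for x in missing_from_b),
        "only_in_b": sorted(repr(x) for x in set_b - set_a),
        # IDs in A that would match something in B after normalization.
        "match_after_normalizing": sorted(
            repr(x) for x in missing_from_b if norm(x) in normalized_b
        ),
    }


# Made-up example: trailing space, case difference, and int-vs-str mismatch.
cat_ids = ["S1", "S2", "103"]
adj_ids = ["S1 ", "s2", 103]
report = diagnose_id_mismatch(cat_ids, adj_ids)
print(report)
```

With real data, the two arguments would be the index values of the categorization result and the adjustment factor table. Any entry under "match_after_normalizing" points to a formatting or type difference rather than a genuinely missing sample.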

As for your question about the third column in the sample information file: thank you for pointing that out! The FAMILY column was previously used to indicate family IDs, but it is no longer required for burden testing. The documentation has not been updated yet, but we will correct it promptly.

Please let me know if you need further clarification!
