
feat: add interval logic for l2g features #812

Open · wants to merge 62 commits into base: dev
Conversation


@xyg123 xyg123 commented Oct 3, 2024

✨ Context

Adding interval based features to the l2g model, based on the feature list (opentargets/issues#3521).
opentargets/issues#3512

🛠 What does this PR implement

  • Implementation of PCHIC-based interval features for the L2G gene prediction model.
  • Added back interval processing steps into the L2G feature generation step.

🙈 Missing

More features from the Andersson and Thurman interval sources.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@addramir addramir requested a review from ireneisdoomed October 5, 2024 07:21
# feature will be the same for any gene associated with a studyLocus)
local_max.withColumn(
"regional_maximum",
f.max(local_feature_name).over(Window.partitionBy("studyLocusId")),
Why is it the maximum? According to the table and what we discussed, shouldn't it be the mean?
https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388

@ireneisdoomed ireneisdoomed left a comment


Thank you for the changes Jack!!!

The logic to build the features looks good! Please see my comments; they are mostly about how we process the interval data in the L2G step.
I suggested processing all interval sources together to make the process simpler, but since the code is set up to take source names and paths individually and changing that would be messy, it's also fine to leave it as is, as long as the interval_paths parameter is correctly configured.

The implemented changes wouldn't run, because an Intervals dataset is created with a mismatched schema. I would encourage you to:

  • Add any new features to the test_l2g_feature_matrix.py suite, to make sure the code doesn't crash
  • In the same file, add a semantic test for the common logic
  • Update the documentation pages
  • Pull the dev branch to bring in the changes to the feature matrix step

@github-actions github-actions bot added size-L and removed size-M labels Oct 17, 2024
@project-defiant project-defiant self-requested a review February 17, 2025 06:40
"""
return Intervals(
_df=(
self.df.alias("interval")

Given that the intervals are not big datasets (my assumption), is there a possibility to broadcast them before the join?

).alias("vi"),
on=[
f.col("vi.vi_chromosome") == f.col("interval.chromosome"),
f.col("vi.position").between(

This is not an equi-join, so the join will be less optimized.

@project-defiant project-defiant self-requested a review February 17, 2025 08:20
features_input_loader = L2GFeatureInputLoader(
variant_index=variant_index,
colocalisation=coloc,
study_index=studies,
study_locus=credible_set,
target_index=target_index,
intervals=intervals,

You need to make sure that if the intervals are not passed (None), then:

  1. If the feature_list contains the interval features, they should be dropped, since we cannot compute them
  2. A warning should be logged that the interval features cannot be computed.
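A minimal sketch of that guard, assuming the feature list is a plain list of names; the feature names, function, and logger setup are hypothetical, not gentropy's actual identifiers:

```python
import logging

# Hypothetical set of features that require the intervals dataset.
INTERVAL_FEATURES = {"pchicMean", "pchicMeanNeighbourhood"}


def drop_unsupported_features(feature_list, intervals, logger=None):
    """Drop interval-based features when no intervals dataset is provided.

    Returns the (possibly reduced) feature list and logs a warning for
    every feature that had to be removed.
    """
    logger = logger or logging.getLogger(__name__)
    if intervals is not None:
        return list(feature_list)
    kept = [name for name in feature_list if name not in INTERVAL_FEATURES]
    dropped = sorted(set(feature_list) - set(kept))
    if dropped:
        logger.warning(
            "Intervals dataset not provided; dropping features: %s", dropped
        )
    return kept
```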

if target_index_path
else None
)
self.intervals = (

Similarly here: if the intervals or the target index are not provided, all features depending on these datasets should be filtered out before running predictions or training.

f.col("variantInLocus.posteriorProbability").alias("posteriorProbability"),
)
# Filter for PP > 0.001
.filter(f.col("posteriorProbability") > 0.001)

Make this value an optional parameter with a default of 0.001 in the function signature.

@@ -216,6 +216,17 @@ class LDBasedClumpingConfig(StepConfig):
_target_: str = "gentropy.ld_based_clumping.LDBasedClumpingStep"


@dataclass

You have to register this step to be able to use it in the command line interface. See the gentropy.config.Config.register_config method

@@ -18,10 +18,16 @@
"nullable": false,
"type": "string"
},
{

@project-defiant project-defiant Feb 17, 2025


The variants and genes could be kept in a nested structure, so we preserve storage. Similar to the pseudocode below:

ArrayType(
    StructType(
        [
            StructField("geneId", StringType()),
            StructField("variantId", ArrayType(StringType())),
        ]
    )
)

@project-defiant

@xyg123 there are still some comments to be addressed before we can try to set this up for production.

  • tests passing
  • pack the variant and gene into a nested structure in the schema
  • move the files to the correct locations
  • descriptive feature names
  • register the configuration for the IntervalStep

Moreover, I want to explore the performance of the non-equi join on intervals × variantIndex, since it could be a bottleneck for the process.

There are a few enhancements that could be done depending on the size of the intervals itself:

  • broadcast the smaller (right) DataFrame to all executors (provided the intervals without variants are small enough and we can still tweak the configuration)
  • make the sequence of all possible positions for each interval and run an equi-join (a join with an equals sign, so Spark uses a sort-merge join)

@project-defiant project-defiant linked an issue Feb 20, 2025 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Add the intervals features