
BPNet reading note 1

  1. Enhancers are non-coding DNA sequences which, when bound by specific proteins, increase the level of gene transcription. Enhancers activate distinct gene expression patterns in cells of different types or under different conditions. They are key contributors to gene regulation, and causative variants that affect quantitative traits in humans and mice have been located in enhancer regions.
  2. cis-regulatory element: A noncoding DNA sequence in or near a gene required for proper spatiotemporal expression of that gene, often containing binding sites for transcription factors. Often used interchangeably with enhancer.
  3. somatic mutation: In multicellular organisms, mutations can be classed as either somatic or germ-line:
  • Somatic mutations – occur in a single body cell and cannot be inherited (only tissues derived from the mutated cell are affected)
  • Germline mutations – occur in gametes and can be passed on to offspring (every cell in the entire organism will be affected)
  • Zhihu: https://www.zhihu.com/question/38765318

(image: somatic mutation)

  1. Cooperative TF binding: TF complex
  • direct binding: TF binds directly to DNA
  • indirect binding: TF binds another TF, which in turn binds DNA
  1. Goal: learn predictive patterns from raw DNA sequences to maximize accuracy across the whole genome
  • Output: Binary, Yes (1) / No (0), is TF bound?
  • Input: a 4×L matrix; each column is the one-hot encoding of one nucleotide.
  • BPNet acts like a motif pattern detector: it scores sequences using the learned weights of its neurons (convolutional filters); see the sketch below.
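To make the input representation concrete, here is a minimal NumPy sketch (not BPNet's actual code) of one-hot encoding a sequence into a 4×L matrix and scanning it with a hand-made PWM-style filter, which is roughly what a learned first-layer convolutional filter does. The `one_hot` helper and the PWM values are illustrative assumptions.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a 4 x L matrix (rows = A, C, G, T)."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        if base in BASES:               # unknown bases (e.g. N) stay all-zero
            mat[BASES.index(base), j] = 1.0
    return mat

# A toy position weight matrix (PWM) acting as a single "motif detector".
# Real BPNet filters are learned; this PWM is made up for illustration.
pwm = np.array([[2.0, -1.0, -1.0, -1.0],   # A
                [-1.0, 2.0, -1.0, -1.0],   # C
                [-1.0, -1.0, 2.0, -1.0],   # G
                [-1.0, -1.0, -1.0, 2.0]])  # T  -> prefers the motif "ACGT"

x = one_hot("TTACGTTT")
L, w = x.shape[1], pwm.shape[1]
# Slide the filter across the sequence: one score per position (valid convolution).
scores = np.array([np.sum(pwm * x[:, i:i + w]) for i in range(L - w + 1)])
print(scores)   # peaks where the subsequence matches the motif
```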
  1. Motivation
  • Cons of traditional statistics-based peak calling methods:
    • difficult to tell whether overlapping peaks are driven by the same or different sequence elements
    • different peak-calling methods often give different answers
  • Instead, the DNN-based approach models the raw read counts directly from sequence, producing base-resolution binding profiles
  1. Training:
  • input: DNA sequence
  • label: ChIP-seq or ChIP-nexus raw count data
  • note: the biggest gains in deep learning come not from architecture engineering but from clever design of a loss function that respects the nature of the noise observed in the data.
  • joint loss function: to model total occupancy (total counts), the best loss is the MSE of the log of the total counts
  • plus a multinomial loss to capture the profile shape (how the reads are probabilistically distributed across positions in the profile)
  • Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model. Such approaches offer advantages like improved data efficiency, reduced overfitting through shared representations, and fast learning by leveraging auxiliary information.
  1. Multinomial distribution: conditioned on the total count, the per-base read counts across a region are modeled as a multinomial over positions (see the sketch below).
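Below is a minimal NumPy sketch of the joint loss idea described above: a multinomial negative log-likelihood for the profile shape plus an MSE on the log of the total counts. The function name, the log1p transform, and the `count_weight` parameter are illustrative assumptions, not BPNet's exact implementation.

```python
import numpy as np

def joint_loss(true_counts, pred_profile_logits, pred_log_total, count_weight=1.0):
    """Toy BPNet-style joint loss (illustrative, not the official code).

    true_counts         : observed read counts per position, shape (L,)
    pred_profile_logits : predicted logits over positions, shape (L,)
    pred_log_total      : predicted log(total counts), scalar
    """
    # Profile-shape term: multinomial negative log-likelihood.
    # Softmax turns the logits into a probability distribution over positions.
    logits = pred_profile_logits - pred_profile_logits.max()
    log_p = logits - np.log(np.exp(logits).sum())
    profile_nll = -np.sum(true_counts * log_p)   # log-factorial terms are constant
                                                 # w.r.t. the model and are dropped

    # Total-count term: MSE of log(1 + total counts).
    total_mse = (np.log1p(true_counts.sum()) - pred_log_total) ** 2

    return profile_nll + count_weight * total_mse

# Tiny example with made-up numbers.
counts = np.array([0., 2., 7., 1., 0.])
logits = np.array([-1.0, 0.5, 2.0, 0.0, -1.0])
print(joint_loss(counts, logits, pred_log_total=np.log1p(10.0)))
```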
  1. Generalization: different chromosomes of the same cell type

TF footprint shape is largely driven by TF-DNA contact with the local sequence. Hence, profile 'shapes' can be predicted extremely accurately from local sequence alone.

  • what are profile shapes? Peak/valley?
  • is TF footprint shape a metric of relative TF binding strength?

Chromatin state + distal interactions contribute to the total strength of measured local ChIP TF occupancy. Hence, total counts can only be predicted reasonably well by local sequence alone.

  • what is chromatin state?
  • what is distal?
  • what interactions do they have?
  • is total strength of occupancy determined by total counts? is it an absolute metric of TF binding?
  1. Step 1: Take any DNA sequence as input and predict the profile
  2. Step 2: take the predicted profile and backpropagate it through the network to get the contribution of each neuron in each layer, all the way back to the input, yielding a contribution score for every nucleotide in the input sequence that tells you how much it contributes to that output (see the sketch after this list).
  3. A profile-wide importance score
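To illustrate the backpropagation-to-input idea in Step 2, here is a small PyTorch sketch that computes gradient-times-input contribution scores for a toy convolutional profile model. The `TinyProfileNet` architecture, the scalar summary (sum of profile logits), and gradient × input itself are illustrative assumptions; the BPNet paper uses DeepLIFT/DeepSHAP-style contribution scores rather than raw gradients.

```python
import torch
import torch.nn as nn

# A tiny stand-in network (2 conv layers), NOT the real BPNet architecture.
class TinyProfileNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(4, 8, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(8, 1, kernel_size=5, padding=2)  # per-base profile logits

    def forward(self, x):               # x: (batch, 4, L) one-hot sequence
        return self.conv2(torch.relu(self.conv1(x))).squeeze(1)  # (batch, L)

torch.manual_seed(0)
model = TinyProfileNet()

# One-hot encoded input sequence of length L = 20 (random here, for illustration).
x = torch.zeros(1, 4, 20)
x[0, torch.randint(0, 4, (20,)), torch.arange(20)] = 1.0
x.requires_grad_(True)

profile = model(x)                      # predicted profile logits, shape (1, 20)
# Summarize the profile into a scalar (here: the sum of logits) and backpropagate it.
profile.sum().backward()

# Gradient-times-input gives a per-nucleotide contribution score; only the
# observed base at each position contributes, since the input is one-hot.
contribution = (x.grad * x).sum(dim=1)  # shape (1, 20): one score per position
print(contribution)
```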

TF cooperativity

TF-cooperativity-1.pdf TF-cooperativity-2.pdf summary.pdf caveat.pdf kipio.pdf