After cloning the repo, start a virtual environment with something like:
python3 -m venv ./.env
source ./.env/bin/activate
and then:
pip install -r requirements.txt
to get the required modules and then:
jupyter-notebook
to open up this directoy in a jupyter notebook page in your browser.
Work in progress notebook reproducing the work from: https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy
import pandas as pd
pd.__version__
'0.19.1'
Download the human genome annotation file in GFF3 format.
!wget ftp://ftp.ensembl.org/pub/release-85/gff3/homo_sapiens/Homo_sapiens.GRCh38.85.gff3.gz
--2017-01-16 14:28:59-- ftp://ftp.ensembl.org/pub/release-85/gff3/homo_sapiens/Homo_sapiens.GRCh38.85.gff3.gz
=> 'Homo_sapiens.GRCh38.85.gff3.gz'
Resolving ftp.ensembl.org... 193.62.203.85
Connecting to ftp.ensembl.org|193.62.203.85|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-85/gff3/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.85.gff3.gz ... 38469475
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.85.gff3.gz ... done.
Length: 38469475 (37M) (unauthoritative)
Homo_sapiens.GRCh38 100%[===================>] 36.69M 2.42MB/s in 13s
2017-01-16 14:29:13 (2.78 MB/s) - 'Homo_sapiens.GRCh38.85.gff3.gz' saved [38469475]
It is about 37 MB, a very small file compared to the information content of a human genome, which is about 3 GB in plain text. That’s because the GFF3 file only contains the annotation of the sequences, while the sequence data is usually stored in another file format called FASTA.
columnNames = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
df = pd.read_csv(
'Homo_sapiens.GRCh38.85.gff3.gz',
compression='gzip',
sep='\t',
comment='#',
low_memory=False,
header=None,
names=columnNames
)
df.head()
seqid | source | type | start | end | score | strand | phase | attributes | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | GRCh38 | chromosome | 1 | 248956422 | . | . | . | ID=chromosome:1;Alias=CM000663.2,chr1,NC_00000... |
1 | 1 | . | biological_region | 10469 | 11240 | 1.3e+03 | . | . | external_name=oe %3D 0.79;logic_name=cpg |
2 | 1 | . | biological_region | 10650 | 10657 | 0.999 | + | . | logic_name=eponine |
3 | 1 | . | biological_region | 10655 | 10657 | 0.999 | - | . | logic_name=eponine |
4 | 1 | . | biological_region | 10678 | 10687 | 0.999 | + | . | logic_name=eponine |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601849 entries, 0 to 2601848
Data columns (total 9 columns):
seqid object
source object
type object
start int64
end int64
score object
strand object
phase object
attributes object
dtypes: int64(2), object(7)
memory usage: 178.7+ MB
df.seqid.unique()
array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
'2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9',
'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1',
'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1',
'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1',
'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1',
'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1',
'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1',
'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1',
'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1',
'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1',
'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1',
'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1',
'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1',
'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1',
'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1',
'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1',
'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1',
'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1',
'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1',
'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1',
'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1',
'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1',
'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1',
'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1',
'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1',
'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1',
'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1',
'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1',
'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1',
'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1',
'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1',
'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1',
'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1',
'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1',
'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1',
'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1',
'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1',
'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1',
'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1',
'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1',
'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1',
'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1',
'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1',
'KI270757.1', 'MT', 'X', 'Y'], dtype=object)
The seqids starting with KI and GL are DNA sequences – or scaffolds – in the genome that have not been successfully assembled into the chromosomes.
Although the first human genome draft came out more than 15 years ago, the current human genome is still incomplete. The difficulty in assembling these sequences is largely due to complex repetitive regions in the genome.
df.source.value_counts()
havana 1441093
ensembl_havana 745065
ensembl 228212
. 182510
mirbase 4701
GRCh38 194
insdc 74
Name: source, dtype: int64
### How Much of the Genome Is Incomplete?
gdf = df[df.source == 'GRCh38']
gdf.seqid.unique().shape
(194,)
- About 0.37% of the human genome is still incomplete even though the first draft came out over 15 years ago.
- There are about 42,000 genes in the human genome based on this particular GFF3 file we used.
- The length of a gene can range from a few dozen to over two million bases.
- Genes are not evenly distributed among the chromosomes. Overall, the larger the chromosome, the more genes it hosts, but for a subset of the chromosomes, the correlation can be negative.
- The GFF3 file is very rich in annotation information, and we have just scratched the surface. If you are interested in further exploration, here are a few questions you can play with:
- How many transcripts does a gene typically have ?
- What percentage of genes have more than 1 transcript ?
- How many isoforms does a transcript typically have ?
- How many exons, CDS, and UTRs does a transcript typically have? What sizes are they ?
- Is it possible to categorize the genes based on their function as described in the description column ?
How many transcripts does a gene typically have ?
What percentage of genes have more than 1 transcript ?
How many isoforms does a transcript typically have ?
How many exons, CDS, and UTRs does a transcript typically have? What sizes are they ?
Is it possible to categorize the genes based on their function as described in the description column ?