GitHub

After cloning the repo, start a virtual environment with something like:

python3 -m venv ./.env
source ./.env/bin/activate

and then:

pip install -r requirements.txt

to get the required modules and then:

jupyter-notebook

to open up this directoy in a jupyter notebook page in your browser.

Work in progress notebook reproducing the work from: https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy

and then extending the suggested questions at the end of the post

import pandas as pd

pd.__version__

'0.19.1'

Download the human genome annotation file in GFF3 format.

!wget ftp://ftp.ensembl.org/pub/release-85/gff3/homo_sapiens/Homo_sapiens.GRCh38.85.gff3.gz

--2017-01-16 14:28:59--  ftp://ftp.ensembl.org/pub/release-85/gff3/homo_sapiens/Homo_sapiens.GRCh38.85.gff3.gz
           => 'Homo_sapiens.GRCh38.85.gff3.gz'
Resolving ftp.ensembl.org... 193.62.203.85
Connecting to ftp.ensembl.org|193.62.203.85|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-85/gff3/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.85.gff3.gz ... 38469475
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.85.gff3.gz ... done.
Length: 38469475 (37M) (unauthoritative)

Homo_sapiens.GRCh38 100%[===================>]  36.69M  2.42MB/s    in 13s     

2017-01-16 14:29:13 (2.78 MB/s) - 'Homo_sapiens.GRCh38.85.gff3.gz' saved [38469475]

It is about 37 MB, a very small file compared to the information content of a human genome, which is about 3 GB in plain text. That’s because the GFF3 file only contains the annotation of the sequences, while the sequence data is usually stored in another file format called FASTA.

columnNames = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']

df = pd.read_csv(
    'Homo_sapiens.GRCh38.85.gff3.gz',
    compression='gzip',
    sep='\t', 
    comment='#', 
    low_memory=False,
    header=None, 
    names=columnNames
)

df.head()

	seqid	source	type	start	end	score	strand	phase	attributes
0	1	GRCh38	chromosome	1	248956422	.	.	.	ID=chromosome:1;Alias=CM000663.2,chr1,NC_00000...
1	1	.	biological_region	10469	11240	1.3e+03	.	.	external_name=oe %3D 0.79;logic_name=cpg
2	1	.	biological_region	10650	10657	0.999	+	.	logic_name=eponine
3	1	.	biological_region	10655	10657	0.999	-	.	logic_name=eponine
4	1	.	biological_region	10678	10687	0.999	+	.	logic_name=eponine

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601849 entries, 0 to 2601848
Data columns (total 9 columns):
seqid         object
source        object
type          object
start         int64
end           int64
score         object
strand        object
phase         object
attributes    object
dtypes: int64(2), object(7)
memory usage: 178.7+ MB

df.seqid.unique()

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
       '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9',
       'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1',
       'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1',
       'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1',
       'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1',
       'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1',
       'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1',
       'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1',
       'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1',
       'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1',
       'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1',
       'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1',
       'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1',
       'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1',
       'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1',
       'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1',
       'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1',
       'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1',
       'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1',
       'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1',
       'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1',
       'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1',
       'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1',
       'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1',
       'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1',
       'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1',
       'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1',
       'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1',
       'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1',
       'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1',
       'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1',
       'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1',
       'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1',
       'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1',
       'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1',
       'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1',
       'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1',
       'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1',
       'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1',
       'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1',
       'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1',
       'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1',
       'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1',
       'KI270757.1', 'MT', 'X', 'Y'], dtype=object)

The seqids starting with KI and GL are DNA sequences – or scaffolds – in the genome that have not been successfully assembled into the chromosomes.

Although the first human genome draft came out more than 15 years ago, the current human genome is still incomplete. The difficulty in assembling these sequences is largely due to complex repetitive regions in the genome.

df.source.value_counts()

havana            1441093
ensembl_havana     745065
ensembl            228212
.                  182510
mirbase              4701
GRCh38                194
insdc                  74
Name: source, dtype: int64

### How Much of the Genome Is Incomplete?

gdf = df[df.source == 'GRCh38']

gdf.seqid.unique().shape

(194,)

In summary

About 0.37% of the human genome is still incomplete even though the first draft came out over 15 years ago.
There are about 42,000 genes in the human genome based on this particular GFF3 file we used.
The length of a gene can range from a few dozen to over two million bases.
Genes are not evenly distributed among the chromosomes. Overall, the larger the chromosome, the more genes it hosts, but for a subset of the chromosomes, the correlation can be negative.
The GFF3 file is very rich in annotation information, and we have just scratched the surface. If you are interested in further exploration, here are a few questions you can play with:

Next Questions

How many transcripts does a gene typically have ?
What percentage of genes have more than 1 transcript ?
How many isoforms does a transcript typically have ?
How many exons, CDS, and UTRs does a transcript typically have? What sizes are they ?
Is it possible to categorize the genes based on their function as described in the description column ?

How many transcripts does a gene typically have ?

What percentage of genes have more than 1 transcript ?

How many isoforms does a transcript typically have ?

How many exons, CDS, and UTRs does a transcript typically have? What sizes are they ?

Is it possible to categorize the genes based on their function as described in the description column ?

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
genomics_scipy_example.ipynb		genomics_scipy_example.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Work in progress notebook reproducing the work from: https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy

and then extending the suggested questions at the end of the post

In summary

Next Questions

About

Releases

Packages

Languages

greglever/genomics

Folders and files

Latest commit

History

Repository files navigation

Work in progress notebook reproducing the work from: https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy

and then extending the suggested questions at the end of the post

In summary

Next Questions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages