Tools to mine plasmid sequences, build Gateway consensus sites and make a web application to host plasmid database of Gateway plasmids

dgruano/GateWayMine

GateWayMine 🚀

This repository contains the code and data for extracting Gateway sequence sites from AddGene and SnapGene plasmids, and for using them to:

  • Extract consensus sequences from the Gateway sites.
  • Create a web application to visualize the Gateway plasmids. It can be accessed at https://gatewaymine.netlify.app/.

Summary of the analysis 📊

---
config:
  layout: elk
---
flowchart LR
SnapGene ==> Plasmids[~14k<br>plasmids]
AddGene ==> Plasmids
Plasmids ==> Sites[extracted<br>att sites]
Sites ==> CombinatorialSites[combinatorial<br>att sites]
Plasmids ==> SequenceFeatures[extracted<br>sequence features]
CombinatorialSites ==> Alignments[aligned<br>att sites]
CombinatorialSites ==> GatewayMine[GatewayMine]
SequenceFeatures ==> GatewayMine
Alignments ==> Consensus[consensus<br>sequences]

Data mining ⛏️

The scripts in this repository were used to download plasmids from AddGene and to copy the Gateway plasmids from the SnapGene collection (present in the SnapGene installation folder), creating a collection of Gateway plasmids (attached as a release artifact).

Additional information about AddGene plasmids was also mined, such as whether they are part of kits, and related publication links.

Formatting 🔄

The plasmid files were read, and if their annotation contained Gateway sequence sites (identified by a label attXn, where X is the att site and n is the version), the sequences of those sites were extracted. The file results/plasmid_site_dict.json contains a dictionary of all the plasmids in the collection, with the Gateway sequence sites they contain.
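As a rough illustration of the attXn labelling convention, the sketch below parses a feature label into its site type and number. It assumes X is one of B, P, L or R and that n is a plain integer; the function name and the regular expression are hypothetical and are not taken from the repository scripts.

```python
import re

# Assumption: labels look like "attB1", "attL2", etc. (site type B/P/L/R plus a number).
ATT_LABEL = re.compile(r"^att([BPLR])(\d+)$")


def parse_att_label(label):
    """Return (site_type, number) for an attXn label, or None for other features."""
    match = ATT_LABEL.match(label.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_att_label("attB1"))  # a Gateway site label
print(parse_att_label("AmpR"))   # an unrelated feature label
```

Labels that do not follow this exact pattern would need a more permissive expression.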

This file was then used to generate a collection of all versions of each Gateway sequence site (e.g. all the versions of attP1, attB1, etc.), contained in the file results/att_sites.json. Because sites can be recombined with each other (e.g. attP1 + attB1 -> attL1), the sites found in plasmids were recombined in all possible combinations to yield an even larger collection of sites, contained in the file results/att_sites_combinatorial.json. These two files were used for alignment in the next step.
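The recombination idea can be sketched as an exchange of the arms that flank a shared overlap sequence. This is a minimal illustration with toy sequences, assuming the overlap is given as an exact string that occurs once in each site; the function name is hypothetical and the repository scripts may work differently.

```python
def recombine(site_a, site_b, overlap):
    """Swap the arms of two sites around a shared overlap sequence.

    Assumption: `overlap` is an exact substring of both sites.
    """
    a_left, a_found, a_right = site_a.partition(overlap)
    b_left, b_found, b_right = site_b.partition(overlap)
    if not a_found or not b_found:
        raise ValueError("overlap not found in both sites")
    # Each product keeps the left arm of one site and the right arm of the other.
    return a_left + overlap + b_right, b_left + overlap + a_right


# Toy sequences, not real att sites.
product_1, product_2 = recombine("AAAGTACAAACCC", "TTTGTACAAAGGG", "GTACAAA")
print(product_1, product_2)
```

Applying such a function to every compatible pair of sites is one way to obtain a combinatorial collection.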

In addition, the mined plasmid information was used to create a summary dataset, contained in the file results/plasmid_features.json. This file contains a dictionary of all the plasmids in the collection, listing:

  • Their name
  • The Gateway sequence sites they contain
  • The sequence features they contain
  • Whether they were extracted from SnapGene or AddGene
  • If they are from AddGene:
    • Their AddGene ID
    • Kit to which they belong (if any)
    • Publication links (if any)

This dataset was made queryable as a web application, of which the source code is in the web_app folder.
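A query over such a dataset could look like the sketch below. The record layout is an assumption made for illustration: the key "att_sites" and the toy plasmid names are hypothetical, and the actual structure of results/plasmid_features.json may differ.

```python
def plasmids_with_sites(features, required_sites):
    """Return the names of plasmids that contain every site in `required_sites`.

    Assumption: `features` maps plasmid name -> record, and each record lists
    its Gateway sites under the hypothetical key "att_sites".
    """
    required = set(required_sites)
    return sorted(
        name
        for name, record in features.items()
        if required <= set(record.get("att_sites", []))
    )


# Toy data with made-up plasmid names.
toy_features = {
    "plasmid_a": {"att_sites": ["attL1", "attL2"], "source": "addgene"},
    "plasmid_b": {"att_sites": ["attR1", "attR2"], "source": "snapgene"},
}
print(plasmids_with_sites(toy_features, ["attL1", "attL2"]))
```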

Alignment and consensus 🧬

To generate the consensus sequences, the Gateway sequence sites were aligned using Clustal Omega. The alignment files are contained in the alignments folder (generated only with the sites found in plasmids) and alignments_combinatorial (using also the combinatorial sites).

From an alignment, consensus sequences were generated by removing flanking positions that contain spacers and/or all four ACGT nucleotides, and then using ambiguous nucleotide codes to represent the consensus, as in the dummy example below:

seq1      TGCTAATA
seq2      -GCTCCTT
seq3      TGCTCCCG
seq4      TGCTGACC

consensus  GCTVMY
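
The rule illustrated above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code of make_consensus_sites.py: sequences are taken to be upper case and of equal length, and internal columns that contain a spacer are reported as N, which is a choice made here because the text does not say how they are handled.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def uninformative(column):
    """A column with a spacer, or with all four nucleotides, carries no information."""
    return "-" in column or column == frozenset("ACGT")


def consensus(aligned):
    """Trim uninformative flanking columns and encode the rest with IUPAC codes."""
    columns = [frozenset(col) for col in zip(*aligned)]
    start, end = 0, len(columns)
    while start < end and uninformative(columns[start]):
        start += 1
    while end > start and uninformative(columns[end - 1]):
        end -= 1
    # Assumption: an internal column containing a spacer is written as N.
    return "".join("N" if "-" in col else IUPAC[col] for col in columns[start:end])


print(consensus(["TGCTAATA", "-GCTCCTT", "TGCTCCCG", "TGCTGACC"]))  # GCTVMY
```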

The consensus sequences are contained in the files results/consensus_sites.tsv and results/consensus_sites_combinatorial.tsv, which were created from only the sites found in plasmids and from those plus the combinatorial sites, respectively.

In addition, even more permissive consensus sequences were generated. What gives specificity for recombination of a pair of sites is the overlap sequence conserved in all attB, P, L and R sites with the same number (e.g. all attX1 sites contain twtGTACAAAaaa as the overlap sequence). An alignment of all sites of a given type (e.g. all attB sites), excluding the overlap sequence, was used to generate more permissive consensus sequences. Those consensus sequences are in the files mentioned above, with merged_ prefixed to the site name.
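
To make the ambiguity notation concrete, the sketch below turns a sequence written with IUPAC codes, such as the attX1 overlap twtGTACAAAaaa, into a regular expression. It is an illustration only and not part of the described pipeline; the function name is hypothetical, and letter case is ignored here even though it may carry meaning in the original notation.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus_seq):
    """Compile an IUPAC consensus into a regex (assumption: case is not significant)."""
    return re.compile("".join(IUPAC_REGEX[base] for base in consensus_seq.upper()))


overlap_1 = consensus_to_regex("twtGTACAAAaaa")
print(bool(overlap_1.search("CCTATGTACAAAAAAGG")))  # w matches A
print(bool(overlap_1.search("TGTGTACAAAAAA")))      # G is not allowed at the w position
```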

Running the analysis locally 💻

Dependencies 📦

The analysis is run with Python, using Poetry to manage dependencies.

# Install dependencies
poetry install

# Activate virtual environment
poetry shell

clustalo is used for alignment, and can be downloaded as a binary from here. Once downloaded, rename it to clustalo and place it in the root folder of the repository. If you want to provide an alternative path, you can do so with script arguments (see script docs).
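
For orientation, calling the binary from Python could look like the sketch below. It assumes the binary sits at ./clustalo as described above and uses the standard Clustal Omega options -i, -o, --outfmt and --force; the function names are hypothetical, and the repository's make_alignments.py may invoke the tool differently.

```python
import subprocess


def build_clustalo_command(fasta_in, aln_out, binary="./clustalo"):
    """Build a Clustal Omega command line (assumed binary location: ./clustalo)."""
    return [
        binary,
        "-i", fasta_in,    # input sequences in FASTA format
        "-o", aln_out,     # output alignment file
        "--outfmt=clu",    # Clustal output format
        "--force",         # overwrite the output file if it exists
    ]


def run_clustalo(fasta_in, aln_out, binary="./clustalo"):
    """Run the alignment and raise CalledProcessError if clustalo fails."""
    subprocess.run(build_clustalo_command(fasta_in, aln_out, binary), check=True)


print(" ".join(build_clustalo_command("sites.fasta", "sites.aln")))
```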

Pipeline ⚙️

The pipeline is described in the following diagram:

---
config:
  layout: elk
---
flowchart TD

      subgraph DataMining["Data Mining"]
      snapgene_application{{SnapGene Folder}} ==> snapgene_script
      addgene{{AddGene}} ==> get_addgene_kits_info
      addgene ==> get_all_gateway_plasmids
      addgene ==> get_addgene_article_refs
      addgene ==> get_other_plasmids
      snapgene_script[get_snapgene_files.py] ==> snapgene_plasmids([data/snapgene_plasmids/*.dna])
      get_addgene_kits_info[get_addgene_kits_info.py] ==> addgene_kits([data/addgene_kit_plasmids.json])
      get_all_gateway_plasmids[get_all_gateway_plasmids.py] ==> all_gateway_plasmids([data/all_gateway_plasmids.tsv])
      addgene_kits ==> get_addgene_kit_plasmids[get_addgene_kit_plasmids.py]
      get_addgene_kit_plasmids ==> addgene_plasmids([data/addgene_plasmids/*.dna])
      all_gateway_plasmids ==> get_other_plasmids[get_other_plasmids.py]
      get_other_plasmids ==> addgene_plasmids
      get_addgene_article_refs[get_addgene_article_refs.py] ==> addgene_article_refs([data/addgene_article_refs.tsv])
      addgene_plasmids ==> plasmid_collection[(plasmid_collection)]
      snapgene_plasmids ==> plasmid_collection
      end

      subgraph Formatting
      all_gateway_plasmids ==> make_plasmid_summary[make_plasmid_summary.py]
      plasmid_collection ==> make_plasmid_summary
      addgene_article_refs ==> make_plasmid_summary
      addgene_kits ==> make_plasmid_summary
      make_plasmid_summary ==> plasmid_summary([results/plasmid_summary.json])
      plasmid_collection ==> make_plasmid_site_dict
      make_plasmid_site_dict[make_plasmid_site_dict.py] ==> plasmid_site_dict([results/plasmid_site_dict.json])
      plasmid_site_dict ==> make_feature_dict
      plasmid_summary ==> make_feature_dict
      make_feature_dict[make_feature_dict.py] ==> feature_dict[(results/feature_dict.json)]
      plasmid_site_dict ==> make_unique_sites
      make_unique_sites[make_unique_sites.py] ==> att_sites([results/att_sites.json])
      att_sites ==> make_combinatorial_att_sites
      make_combinatorial_att_sites[make_combinatorial_att_sites.py] ==> att_sites_combinatorial([results/att_sites_combinatorial.json])
      make_combinatorial_att_sites ==> combinatorial_att_sites_only([results/att_sites_combinatorial_only.json])
      end

      subgraph AlignmentAndConsensus["Alignment and Consensus"]
      att_sites ==> make_alignments
      att_sites_combinatorial ==> make_alignments
      make_alignments[make_alignments.py] ==> alignments([results/alignment/*])
      make_alignments ==> alignments_combinatorial([results/alignment_combinatorial/*])
      alignments ==> make_consensus_sites
      alignments_combinatorial ==> make_consensus_sites
      make_consensus_sites[make_consensus_sites.py] ==> consensus_sites[(results/consensus_sites.tsv)]
      make_consensus_sites ==> consensus_sites_combinatorial[(results/consensus_sites_combinatorial.tsv)]
      alignments ==> make_logos
      alignments_combinatorial ==> make_logos
      make_logos[make_logos.py] ==> logos([results/alignment/*.svg])
      make_logos ==> logos_combinatorial([results/alignment_combinatorial/*.svg])

      end

      feature_dict ==> GateWayMine{{GatewayMine}}

      get_addgene_kits_info:::Sky
      get_addgene_kit_plasmids:::Sky
      get_addgene_article_refs:::Sky
      get_all_gateway_plasmids:::Sky
      get_other_plasmids:::Sky
      snapgene_script:::Sky
      make_plasmid_summary:::Sky
      make_feature_dict:::Sky
      make_unique_sites:::Sky
      make_combinatorial_att_sites:::Sky
      make_alignments:::Sky
      make_consensus_sites:::Sky
      make_plasmid_site_dict:::Sky
      make_logos:::Sky

      att_sites:::Lavender
      att_sites_combinatorial:::Lavender
      combinatorial_att_sites_only:::Lavender
      plasmid_site_dict:::Lavender
      feature_dict:::Pine
      consensus_sites:::Pine
      consensus_sites_combinatorial:::Pine
      plasmid_summary:::Lavender
      alignments:::Lavender
      alignments_combinatorial:::Lavender
      addgene_article_refs:::Lavender
      addgene_kits:::Lavender
      all_gateway_plasmids:::Lavender
      logos:::Lavender
      logos_combinatorial:::Lavender
      GateWayMine:::Peacock

      classDef Sky stroke-width:1px, stroke-dasharray:none, stroke:#374D7C, fill:#E2EBFF, color:#374D7C
      classDef Pine stroke-width:1px, stroke-dasharray:none, stroke:#254336, fill:#27654A, color:#FFFFFF
      classDef Lavender stroke-width:1px, stroke-dasharray:none, stroke:#8E6C9E, fill:#8E6C9E, color:#FFFFFF
      classDef Peacock stroke-width:1px, stroke-dasharray:none, stroke:#006666, fill:#006666, color:#FFFFFF

      style DataMining color:#000000,fill:#E2EBFF90
      style Formatting color:#000000,fill:#D1FFE290
      style AlignmentAndConsensus color:#000000,fill:#FFE5F790

Data mining ⛏️

To run locally, first download the plasmid collection from the latest release of this repository, and place the folders addgene_plasmids and snapgene_plasmids in the data folder.

If you want to re-download the plasmids (reproduce the data mining), you can do so with the bash script run_data_mining.sh; see the documentation of the scripts it calls.

playwright is used for scraping. If you are using it for the first time, you will be prompted to run playwright install.

Formatting 🔄

See the documentation of scripts called in run_formatting.sh.

Alignment and consensus 🧬

See the documentation of scripts called in run_alignments_and_consensus.sh.

Web application 🌐

The web application is a simple React application built with Vite. It was generated with yarn create vite (see docs), so the yarn package manager is required. The directory structure is standard, and documented in the vite docs.

# Enable yarn 3
corepack enable

# Install dependencies
yarn install

# Run dev server
yarn dev

# Build for production
yarn build

The only extra configuration is copying the plasmid_features.json file to the public folder when building or serving locally, so that it can be requested by the frontend application; see the config at web_app/vite.config.js.

Contributing 🤝

Adding new AddGene plasmids 🔬

NOTE: Make sure the plasmid is not already there!

To add new plasmids from AddGene, add a row to the file all_gateway_plasmids.tsv, with the following columns:

  • plasmid_id: The plasmid ID from AddGene

  • plasmid_name: The name of the plasmid

  • reference: (optional) The reference article ID for the plasmid (the number at the end of the URL of the publication's AddGene page). For example, if the publication links to https://www.addgene.org/browse/article/7274/, the reference is 7274.
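
If you prefer to add the row programmatically, a sketch like the following appends it only when the plasmid ID is not already present. It assumes the TSV has a header row with exactly the three column names listed above and ends with a newline; the helper name is hypothetical and is not part of the repository.

```python
import csv
from pathlib import Path

COLUMNS = ["plasmid_id", "plasmid_name", "reference"]


def add_plasmid(tsv_path, plasmid_id, plasmid_name, reference=""):
    """Append a row unless the plasmid ID already exists; return True if added.

    Assumption: the file starts with a header row naming the columns above.
    """
    path = Path(tsv_path)
    with path.open(newline="") as handle:
        existing = {row["plasmid_id"] for row in csv.DictReader(handle, delimiter="\t")}
    if str(plasmid_id) in existing:
        return False
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(
            {"plasmid_id": plasmid_id, "plasmid_name": plasmid_name, "reference": reference}
        )
    return True
```

The reference argument can be left empty, since that column is optional.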

Once you have done this:

# Get publication links (if not already present)
python get_addgene_article_refs.py

# Download the new plasmid
python get_other_plasmids.py

# Run the formatting pipeline
bash run_formatting.sh

# Run the alignment and consensus pipeline
bash run_alignments_and_consensus.sh

# Re-build the web app

Other sources of plasmids ✨

This could be extended to support other plasmid sources, but for now it only supports the SnapGene and AddGene plasmid collections. Feel free to submit an issue to discuss it!
