diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/404.html b/404.html new file mode 100644 index 00000000..8663f3c3 --- /dev/null +++ b/404.html @@ -0,0 +1 @@ +
AGAT: a GFF/GTF toolkit allowing you to perform almost everything you might want to achieve ^^
AGAT can check, fix, and pad missing information (features/attributes) in any kind of GTF and GFF file to create complete, sorted and standardised GFF3 output. Over the years it has been enriched with many tools that perform just about any task related to GTF/GFF files (sanitizing, conversions, merging, modifying, filtering, FASTA sequence extraction, adding information, etc). Compared to other methods, AGAT is robust to even the most despicable GTF/GFF files.
(_sp_ prefix)
task | tool |
---|---|
check, fix, pad missing information into sorted and standardised gff3 | agat_convert_sp_gxf2gxf.pl |
* add missing parent features (e.g. gene and mRNA if only CDS/exon exists).
+* add missing features (e.g. exon and UTR).
+* add missing mandatory attributes (i.e. ID, Parent).
+* fix identifiers to be unique.
+* fix feature locations.
+* remove duplicated features.
+* group related features (if spread in different places in the file).
+* sort features (tabix optional).
+* merge overlapping loci into one single locus (only if option activated).
+
task | tool |
---|---|
convert any GTF/GFF into BED format | agat_convert_sp_gff2bed.pl |
convert any GTF/GFF into GTF format | agat_convert_sp_gff2gtf.pl |
convert any GTF/GFF into tabulated format | agat_sp_gff2tsv.pl |
convert any BAM from minimap2 into GFF format | agat_convert_sp_minimap2_bam2gff.pl |
convert any GTF/GFF into ZFF format | agat_sp_gff2zff.pl |
convert any GTF/GFF into any GTF/GFF (bioperl) format | agat_convert_sp_gxf2gxf.pl |
convert BED format into GFF3 format | agat_convert_bed2gff.pl |
convert EMBL format into GFF3 format | agat_convert_embl2gff.pl |
convert genscan format into GFF3 format | agat_convert_genscan2gff.pl |
convert mfannot format into GFF3 format | agat_convert_mfannot2gff.pl |
task | tool |
---|---|
make feature statistics | agat_sp_statistics.pl |
make function statistics | agat_sp_functional_statistics.pl |
extract any type of sequence | agat_sp_extract_sequences.pl |
extract attributes | agat_sp_extract_attributes.pl |
complement annotations (non-overlapping loci) | agat_sp_complement_annotations.pl |
merge annotations | agat_sp_merge_annotations.pl |
filter gene models by ORF size | agat_sp_filter_by_ORF_size.pl |
filter to keep only longest isoforms | agat_sp_keep_longest_isoform.pl |
create introns features | agat_sp_add_introns.pl |
fix cds phases | agat_sp_fix_cds_phases.pl |
manage IDs | agat_sp_manage_IDs.pl |
manage UTRs | agat_sp_manage_UTRs.pl |
manage introns | agat_sp_manage_introns.pl |
manage functional annotation | agat_sp_manage_functional_annotation.pl |
specificity sensitivity | agat_sp_sensitivity_specificity.pl |
fusion / split analysis between two annotations | agat_sp_compare_two_annotations.pl |
analyze differences between BUSCO results | agat_sp_compare_two_BUSCOs.pl |
... and much more ... | ... see here ... |
All tools taking GFF/GTF as input can be divided into two groups: _sp_ and _sq_.
_sp_ prefix
_sp_ stands for SLURP. These tools load the whole file into memory in a specific data structure. This has a memory cost but makes life smoother: any feature can be accessed at any time by AGAT, so complicated tasks can be performed in a more time-efficient way. Moreover, it allows AGAT to fix all potential errors, within the limits of what the format itself allows. See the AGAT parser section for more information.
_sq_ prefix
_sq_ stands for SEQUENTIAL. These tools read and process GFF/GTF files from top to bottom, line by line, performing tasks on the fly. This is memory efficient, but the sanity check of the file is minimal. These tools are not intended for complex tasks.
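The difference between the two families can be pictured with a minimal Python sketch. It only illustrates the slurp versus sequential idea and is not AGAT's Perl implementation; the function names `read_slurp` and `read_sequential` and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator, List


def _records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the tab-separated columns of each non-empty, non-comment line."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line.rstrip("\n").split("\t")


def read_slurp(lines: Iterable[str]) -> List[List[str]]:
    """_sp_-like: keep every record in memory so any of them can be revisited."""
    return list(_records(lines))


def read_sequential(lines: Iterable[str]) -> Iterator[List[str]]:
    """_sq_-like: hand out one record at a time and keep nothing afterwards."""
    return _records(lines)


SAMPLE = [
    "##gff-version 3\n",
    "chr12\tHAVANA\tgene\t100\t500\t.\t+\t.\tID=gene1\n",
    "chr12\tHAVANA\texon\t100\t500\t.\t+\t.\tID=exon1;Parent=transcript1\n",
]

all_records = read_slurp(SAMPLE)                 # random access is possible
first_streamed = next(read_sequential(SAMPLE))   # only the current record is held
```

The slurped list costs memory proportional to the file size, which is the trade-off described above; the generator never holds more than one record.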
The first step of AGAT tools with the _sp_ prefix is to fix and standardize the file (e.g. a file containing only exon features will be modified to create mRNA and gene features). To perform this task, AGAT parses and slurps the entire data into a specific data structure. Below you will find more information about the peculiarities of this data structure and the parsing approach used.
The method creates a hash structure containing all the data in memory. We call it OMNISCIENT. The OMNISCIENT structure has three levels:
$omniscient{level1}{tag_l1}{level1_id} = feature <= tag could be gene, match
+$omniscient{level2}{tag_l2}{idY} = @featureListL2 <= tag could be mRNA, rRNA, tRNA, etc. idY is a level1_id (known as the Parent attribute within the level2 feature). @featureListL2 is a list so that isoform cases can be managed.
+$omniscient{level3}{tag_l3}{idZ} = @featureListL3 <= tag could be exon, cds, utr3, utr5, etc. idZ is the ID of a level2 feature (known as the Parent attribute within the level3 feature). @featureListL3 is a list so that all features with the same tag can be kept together.
+
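To make the three levels concrete, here is a small Python analogue of the structure described above. It is a sketch under assumptions: AGAT itself stores this in a Perl hash, and the helper `new_omniscient`, the dictionary-based features and the lowercase tags are illustrative choices, not AGAT's API.

```python
from collections import defaultdict


def new_omniscient():
    """Return an empty three-level container (hypothetical helper)."""
    return {
        # level1: tag -> level1 ID -> single feature
        "level1": defaultdict(dict),
        # level2: tag -> parent level1 ID -> list of features (isoforms)
        "level2": defaultdict(lambda: defaultdict(list)),
        # level3: tag -> parent level2 ID -> list of features (e.g. all exons)
        "level3": defaultdict(lambda: defaultdict(list)),
    }


omniscient = new_omniscient()
omniscient["level1"]["gene"]["gene1"] = {"ID": "gene1", "start": 100, "end": 600}
omniscient["level2"]["mrna"]["gene1"].append({"ID": "transcript1", "Parent": "gene1"})
omniscient["level2"]["mrna"]["gene1"].append({"ID": "transcript2", "Parent": "gene1"})
omniscient["level3"]["exon"]["transcript1"].append({"ID": "exon1", "Parent": "transcript1"})
omniscient["level3"]["cds"]["transcript1"].append({"ID": "cds-1", "Parent": "transcript1"})

# All isoforms of gene1, then all exons of transcript1
isoform_ids = [f["ID"] for f in omniscient["level2"]["mrna"]["gene1"]]
exon_ids = [f["ID"] for f in omniscient["level3"]["exon"]["transcript1"]]
```

Keying level2 and level3 by the parent's ID is what lets every child of a gene or transcript be fetched in a single lookup.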
The AGAT parser uses several approaches to understand the links/relationships between the features. In order of priority: Parent/child or gene_id/transcript_id relationship > common attribute/tag > sequential.
The parser may use only one or a mix of these approaches, depending on the peculiarities of the GTF/GFF file you provide.
1. Parsing approach 1: by Parent/child relationship
Example of Parent/ID relationship used by the GFF format:
chr12 HAVANA gene 100 500 . + . ID=gene1
+chr12 HAVANA transcript 100 500 . + . ID=transcript1;Parent=gene1
+chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1
+chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1
+
Example of gene_id/transcript_id relationship used by the GTF format:
chr12 HAVANA gene 100 500 . + . gene_id "gene1";
+chr12 HAVANA transcript 100 500 . + . gene_id "gene1"; transcript_id "transcript1";
+chr12 HAVANA exon 100 500 . + . gene_id "gene1"; transcript_id "transcript1"; exon_id "exon1";
+chr12 HAVANA CDS 100 500 . + 0 gene_id "gene1"; transcript_id "transcript1"; cds_id "cds-1";
+
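The two attribute styles shown above can be read with a few lines of Python. This is a simplified sketch, not AGAT's parser: the function names are hypothetical, and it ignores details such as URL-escaped characters or multi-valued attributes.

```python
import re


def parse_gff3_attributes(column: str) -> dict:
    """Parse 'key=value;key=value' (GFF3 style) into a dict."""
    attrs = {}
    for part in column.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


# GTF style: key "value"; key "value";
_GTF_PAIR = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_attributes(column: str) -> dict:
    """Parse 'key "value"; key "value";' (GTF style) into a dict."""
    return dict(_GTF_PAIR.findall(column))


gff_exon = parse_gff3_attributes("ID=exon1;Parent=transcript1")
gtf_transcript = parse_gtf_attributes('gene_id "gene1"; transcript_id "transcript1";')

# The parent link is explicit in both formats
gff_parent = gff_exon["Parent"]            # exon -> transcript
gtf_parent = gtf_transcript["gene_id"]     # transcript -> gene
```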
2. ELSE Parsing approach 2: by a common attribute/tag
A common attribute (or common tag) is an attribute whose value is shared by features that must be grouped together. AGAT uses default attributes (gene_id and locus_tag), which are displayed in the log; the user can change them by modifying the AGAT configuration file agat_config.yaml.
You can modify agat_config.yaml in one of two ways: run agat config --expose to access it (it will be copied into the current directory) and then edit it manually; or run agat config --expose --locus_tag attribute_name, which copies agat_config.yaml locally with the locus_tag parameter modified accordingly.
Example of relationship made using a common tag (here locus_tag):
chr12 HAVANA gene 100 500 . + . locus_tag="gene1"
+chr12 HAVANA transcript 100 500 . + . locus_tag="gene1";ID="transcript1"
+chr12 HAVANA exon 100 500 . + . locus_tag="gene1";ID=exon1;
+chr12 HAVANA CDS 100 500 . + 0 locus_tag="gene1";ID=cds-1;
+
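Grouping by a common tag amounts to bucketing features on the value of the first configured attribute they carry. The sketch below assumes features are already parsed into `(type, attributes)` pairs; the function name, the order in which the tags are tried, and the extra `gene2` record are assumptions made for illustration.

```python
def group_by_common_tag(features, tags=("locus_tag", "gene_id")):
    """Bucket features by the value of the first common tag they carry."""
    groups = {}
    for feature_type, attrs in features:
        key = next((attrs[t] for t in tags if t in attrs), None)
        groups.setdefault(key, []).append(feature_type)
    return groups


FEATURES = [
    ("gene", {"locus_tag": "gene1"}),
    ("transcript", {"locus_tag": "gene1", "ID": "transcript1"}),
    ("exon", {"locus_tag": "gene1", "ID": "exon1"}),
    ("CDS", {"locus_tag": "gene1", "ID": "cds-1"}),
    ("exon", {"locus_tag": "gene2", "ID": "exonX"}),  # hypothetical extra record
]

groups = group_by_common_tag(FEATURES)
```

Features without any of the tags land under the `None` key, which mirrors the situation where another approach is needed to place them.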
3. ELSE Parsing approach 3: sequentially
Reading the file from top to bottom, level3 features (e.g. exon, CDS, UTR) are attached to the last level2 feature (e.g. mRNA) met, and level2 features are attached to the last level1 feature (e.g. gene) met. To see the list of features of each level, see the feature_levels.yaml file (in the share folder of the GitHub repo, or by running agat levels --expose).
Example of relationship made sequentially:
chr12 HAVANA gene 100 500 . + . ID="aaa"
+chr12 HAVANA transcript 100 500 . + . ID="bbb"
+chr12 HAVANA exon 100 500 . + . ID="ccc"
+chr12 HAVANA CDS 100 500 . + 0 ID="ddd"
+chr12 HAVANA gene 1000 5000 . + . ID="xxx"
+chr12 HAVANA transcript 1000 5000 . + . ID="yyy"
+chr12 HAVANA exon 1000 5000 . + . ID="zzz"
+chr12 HAVANA CDS 1000 5000 . + 0 ID="www"
+
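The sequential rule can be sketched as a single pass that remembers the last level1 and level2 features met. This is an illustration only: the hard-coded `LEVEL1` and `LEVEL2` sets are assumptions standing in for the lists defined in feature_levels.yaml, and `attach_sequentially` is a hypothetical name.

```python
LEVEL1 = {"gene"}                 # assumed subset of level1 types
LEVEL2 = {"transcript", "mrna"}   # assumed subset of level2 types


def attach_sequentially(records):
    """Map each feature ID to the ID of the parent chosen by file order."""
    parent_of = {}
    last_l1 = None
    last_l2 = None
    for feature_type, feature_id in records:
        kind = feature_type.lower()
        if kind in LEVEL1:
            last_l1, last_l2 = feature_id, None
            parent_of[feature_id] = None
        elif kind in LEVEL2:
            last_l2 = feature_id
            parent_of[feature_id] = last_l1
        else:  # anything else is treated as level3 here
            parent_of[feature_id] = last_l2
    return parent_of


RECORDS = [
    ("gene", "aaa"), ("transcript", "bbb"), ("exon", "ccc"), ("CDS", "ddd"),
    ("gene", "xxx"), ("transcript", "yyy"), ("exon", "zzz"), ("CDS", "www"),
]

links = attach_sequentially(RECORDS)
```

With only level3 records in the input, `last_l2` never changes, so every feature ends up with the same parent; this is the limitation discussed in the warning that follows.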
/!\ In cases with only level3 features (e.g. rast or some prokka files), sequential parsing may not work as expected if the Parent/ID or gene_id/transcript_id attributes are missing. Indeed, all features will become children of a single newly created parent. To create a parent per feature or group of features, a common tag must be used to group them correctly (by default gene_id and locus_tag, but you can set the ones of your choice). See Particular case.
Below you will find more information about peculiar GXF files and how the AGAT parser behaves and uses the different parsing approaches.
If you have isoforms (for Eukaryote organisms) in your files and the common attribute used is not set properly, you can end up with isoforms having independent parent gene features. See below for more details.
Here is an example of three transcripts from two different genes (isoforms exist - testA.gff):
chr12 HAVANA transcript 100 500 . + . ID="bbb";common_tag="gene1";transcript_id="transcript1";gene_info="gene1"
+chr12 HAVANA exon 100 500 . + . ID="ccc";common_tag="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID="ddd";common_tag="gene1"
+chr12 HAVANA transcript 100 600 . + . ID="bbb2";common_tag="gene1";transcript_id="transcript2";gene_info="gene1"
+chr12 HAVANA exon 100 600 . + . ID="ccc2";common_tag="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID="ddd2";common_tag="gene1"
+chr12 HAVANA transcript 1000 5000 . + . ID="yyy";common_tag="gene2";transcript_id="transcript3";gene_info="gene2"
+chr12 HAVANA exon 1000 5000 . + . ID="zzz";common_tag="gene2"
+chr12 HAVANA CDS 1000 5000 . + 0 ID="www";common_tag="gene2"
+
agat_convert_sp_gxf2gxf.pl --gff testA.gff
chr12 HAVANA gene 100 500 . + . ID=nbisL1-transcript-1;common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA transcript 100 500 . + . ID="bbb";Parent=nbisL1-transcript-1;common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA exon 100 500 . + . ID="ccc";Parent="bbb";common_tag="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID="ddd";Parent="bbb";common_tag="gene1"
+chr12 HAVANA gene 100 600 . + . ID=nbisL1-transcript-2;common_tag="gene1";gene_info="gene1";transcript_id="transcript2"
+chr12 HAVANA transcript 100 600 . + . ID="bbb2";Parent=nbisL1-transcript-2;common_tag="gene1";gene_info="gene1";transcript_id="transcript2"
+chr12 HAVANA exon 100 600 . + . ID="ccc2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID="ddd2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA gene 1000 5000 . + . ID=nbisL1-transcript-3;common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA transcript 1000 5000 . + . ID="yyy";Parent=nbisL1-transcript-3;common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA exon 1000 5000 . + . ID="zzz";Parent="yyy";common_tag="gene2"
+chr12 HAVANA CDS 1000 5000 . + 0 ID="www";Parent="yyy";common_tag="gene2"
+
A common attribute is needed to group the features properly: AGAT uses locus_tag and gene_id by default. If you are lucky, those attributes already exist. Here they are absent, so you can use either common_tag, transcript_id, or gene_info. Let's investigate each case:
agat config --expose --locus_tag common_tag # Modify the locus_tag parameter via the AGAT configuration file agat_config.yaml
agat_convert_sp_gxf2gxf.pl --gff testA.gff
This will work well even if transcript isoforms exist. This will use the parsing approach 2 (only using common attribute).
chr12 HAVANA gene 100 600 . + . ID=nbisL1-transcript-1;common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA transcript 100 500 . + . ID="bbb";Parent=nbisL1-transcript-1;common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA exon 100 500 . + . ID="ccc";Parent="bbb";common_tag="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID="ddd";Parent="bbb";common_tag="gene1"
+chr12 HAVANA transcript 100 600 . + . ID="bbb2";Parent=nbisL1-transcript-1;common_tag="gene1";gene_info="gene1";transcript_id="transcript2"
+chr12 HAVANA exon 100 600 . + . ID="ccc2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID="ddd2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA gene 1000 5000 . + . ID=nbisL1-transcript-2;common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA transcript 1000 5000 . + . ID="yyy";Parent=nbisL1-transcript-2;common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA exon 1000 5000 . + . ID="zzz";Parent="yyy";common_tag="gene2"
+chr12 HAVANA CDS 1000 5000 . + 0 ID="www";Parent="yyy";common_tag="gene2"
+
agat config --expose --locus_tag gene_info # Modify the locus_tag parameter via the AGAT configuration file agat_config.yaml
agat_convert_sp_gxf2gxf.pl --gff testA.gff
This will work well even if transcript isoforms exist. This will use parsing approach 2 (common attribute gene_info) for transcript features and approach 3 (sequential) for subfeatures, which do not have the gene_info attribute.
chr12 HAVANA gene 100 600 . + . ID="gene1";common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA transcript 100 500 . + . ID="bbb";Parent="gene1";common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA exon 100 500 . + . ID="ccc";Parent="bbb";common_tag="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID="ddd";Parent="bbb";common_tag="gene1"
+chr12 HAVANA transcript 100 600 . + . ID="bbb2";Parent="gene1";common_tag="gene1";gene_info="gene1";transcript_id="transcript2"
+chr12 HAVANA exon 100 600 . + . ID="ccc2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID="ddd2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA gene 1000 5000 . + . ID="gene2";common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA transcript 1000 5000 . + . ID="yyy";Parent="gene2";common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA exon 1000 5000 . + . ID="zzz";Parent="yyy";common_tag="gene2"
+chr12 HAVANA CDS 1000 5000 . + 0 ID="www";Parent="yyy";common_tag="gene2"
+
agat config --expose --locus_tag transcript_id # Modify the locus_tag parameter via the AGAT configuration file agat_config.yaml
agat_convert_sp_gxf2gxf.pl --gff testA.gff
/!\ In our case, using transcript_id is not a good choice. Indeed, each transcript will have its own gene feature, so isoforms will not be linked to the same gene feature as expected. This will use parsing approach 2 (common attribute transcript_id) for transcript features and approach 3 (sequential) for subfeatures, which do not have the transcript_id attribute.
chr12 HAVANA gene 100 500 . + . ID="transcript1";common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA transcript 100 500 . + . ID="bbb";Parent="transcript1";common_tag="gene1";gene_info="gene1";transcript_id="transcript1"
+chr12 HAVANA exon 100 500 . + . ID="ccc";Parent="bbb";common_tag="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID="ddd";Parent="bbb";common_tag="gene1"
+chr12 HAVANA gene 100 600 . + . ID="transcript2";common_tag="gene1";gene_info="gene1";transcript_id="transcript2"
+chr12 HAVANA transcript 100 600 . + . ID="bbb2";Parent="transcript2";common_tag="gene1";gene_info="gene1";transcript_id="transcript2"
+chr12 HAVANA exon 100 600 . + . ID="ccc2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID="ddd2";Parent="bbb2";common_tag="gene1"
+chr12 HAVANA gene 1000 5000 . + . ID="transcript3";common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA transcript 1000 5000 . + . ID="yyy";Parent="transcript3";common_tag="gene2";gene_info="gene2";transcript_id="transcript3"
+chr12 HAVANA exon 1000 5000 . + . ID="zzz";Parent="yyy";common_tag="gene2"
+chr12 HAVANA CDS 1000 5000 . + 0 ID="www";Parent="yyy";common_tag="gene2"
+
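The effect of the chosen attribute on the testA.gff transcripts can be checked with a short Python sketch. It only mimics the grouping step (one gene per distinct attribute value) and is not AGAT's implementation; `genes_for` is a hypothetical helper.

```python
# Attributes of the three transcript lines of testA.gff
TRANSCRIPTS = [
    {"ID": "bbb", "common_tag": "gene1", "transcript_id": "transcript1", "gene_info": "gene1"},
    {"ID": "bbb2", "common_tag": "gene1", "transcript_id": "transcript2", "gene_info": "gene1"},
    {"ID": "yyy", "common_tag": "gene2", "transcript_id": "transcript3", "gene_info": "gene2"},
]


def genes_for(tag):
    """Group transcript IDs by the value of `tag`: one gene per distinct value."""
    genes = {}
    for transcript in TRANSCRIPTS:
        genes.setdefault(transcript[tag], []).append(transcript["ID"])
    return genes


by_gene_info = genes_for("gene_info")          # isoforms share one gene
by_transcript_id = genes_for("transcript_id")  # one gene per transcript
```

A good common attribute is therefore one whose value is identical across isoforms of a gene and different between genes; `transcript_id` fails the first condition.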
In such cases the sequential approach cannot be used (indeed, no level1 (e.g. gene) and no level2 (e.g. mRNA) feature is present in the file). So the presence of Parent/ID or transcript_id/gene_id relationships and/or a proper common attribute is crucial.
If you have isoforms (for Eukaryote organisms) in your files and the common attribute used is not set properly, you can end up with isoforms having independent parent gene features. See below for more details.
1.1
Input (testB.gff):
chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1;locus_id="gene1"
+ chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1;locus_id="gene1"
+ chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2;locus_id="gene1"
+ chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2;locus_id="gene1"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=transcriptb;locus_id="gene2"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=transcriptb;locus_id="gene2"
+
agat_convert_sp_gxf2gxf.pl --gff testB.gff
chr12 HAVANA gene 100 500 . + . ID=nbis-gene-1;locus_id="gene1"
+ chr12 HAVANA mRNA 100 500 . + . ID=transcript1;Parent=nbis-gene-1;locus_id="gene1"
+ chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1;locus_id="gene1"
+ chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1;locus_id="gene1"
+ chr12 HAVANA gene 100 600 . + . ID=nbis-gene-2;locus_id="gene1"
+ chr12 HAVANA mRNA 100 600 . + . ID=transcript2;Parent=nbis-gene-2;locus_id="gene1"
+ chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2;locus_id="gene1"
+ chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2;locus_id="gene1"
+ chr12 HAVANA gene 700 900 . + . ID=nbis-gene-3;locus_id="gene2"
+ chr12 HAVANA mRNA 700 900 . + . ID=transcriptb;Parent=nbis-gene-3;locus_id="gene2"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=transcriptb;locus_id="gene2"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=transcriptb;locus_id="gene2"
+
A common attribute is needed to group the features properly: AGAT uses locus_tag and gene_id by default. If you are lucky, those attributes already exist. Here they are absent, so you can use locus_id instead.
agat config --expose --locus_tag locus_id # Modify the locus_tag parameter via the AGAT configuration file agat_config.yaml
agat_convert_sp_gxf2gxf.pl --gff testB.gff
chr12 HAVANA gene 100 600 . + . ID="gene1";locus_id="gene1"
+chr12 HAVANA mRNA 100 500 . + . ID=transcript1;Parent="gene1";locus_id="gene1"
+chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1;locus_id="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1;locus_id="gene1"
+chr12 HAVANA mRNA 100 600 . + . ID=transcript2;Parent="gene1";locus_id="gene1"
+chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2;locus_id="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2;locus_id="gene1"
+chr12 HAVANA gene 700 900 . + . ID="gene2";locus_id="gene2"
+chr12 HAVANA mRNA 700 900 . + . ID=transcriptb;Parent="gene2";locus_id="gene2"
+chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=transcriptb;locus_id="gene2"
+chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=transcriptb;locus_id="gene2"
+
1.2
Here we have only level3 features, Parent/ID or transcript_id/gene_id relationships are present, and a default common attribute (locus_tag or gene_id) is set for some features.
Input testF.gff:
chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1;locus_tag="gene1"
+ chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1;locus_tag="gene1"
+ chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2;locus_tag="gene1"
+ chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2;locus_tag="gene1"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=transcriptb;locus_tag="gene2"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=transcriptb;locus_tag="gene2"
+ chr12 HAVANA exon 1000 1110 . + . ID=exon4;Parent=transcript4
+ chr12 HAVANA CDS 1000 1110 . + 0 ID=cds4;Parent=transcript4
+
agat_convert_sp_gxf2gxf.pl --gff testF.gff
chr12 HAVANA gene 100 600 . + . ID="gene1";locus_tag="gene1"
+ chr12 HAVANA mRNA 100 500 . + . ID=transcript1;Parent="gene1";locus_tag="gene1"
+ chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1;locus_tag="gene1"
+ chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1;locus_tag="gene1"
+ chr12 HAVANA mRNA 100 600 . + . ID=transcript2;Parent="gene1";locus_tag="gene1"
+ chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2;locus_tag="gene1"
+ chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2;locus_tag="gene1"
+ chr12 HAVANA gene 700 900 . + . ID="gene2";locus_tag="gene2"
+ chr12 HAVANA mRNA 700 900 . + . ID=transcriptb;Parent="gene2";locus_tag="gene2"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=transcriptb;locus_tag="gene2"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=transcriptb;locus_tag="gene2"
+ chr12 HAVANA gene 1000 1110 . + . ID=nbis-gene-1
+ chr12 HAVANA mRNA 1000 1110 . + . ID=transcript4;Parent=nbis-gene-1
+ chr12 HAVANA exon 1000 1110 . + . ID=exon4;Parent=transcript4
+ chr12 HAVANA CDS 1000 1110 . + 0 ID=cds4;Parent=transcript4
+
The common attribute is used to attach isoforms to a common gene feature. As transcript4 has no common attribute, it will have its own parent features.
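The combination seen with testF.gff (Parent for the transcript level, common attribute for the gene level) can be sketched as follows. This is a simplified illustration, not AGAT's code: `genes_from_level3` is a hypothetical helper, and the `new-gene-N` names are placeholders for the identifiers AGAT generates.

```python
def genes_from_level3(level3_attrs, tag="locus_tag"):
    """Return {gene key: [transcript IDs]} from level3 attribute dicts."""
    # Step 1: the Parent attribute tells which transcript each feature belongs to.
    transcript_tag = {}
    for attrs in level3_attrs:
        parent = attrs["Parent"]
        if transcript_tag.get(parent) is None:
            transcript_tag[parent] = attrs.get(tag)
    # Step 2: transcripts sharing a tag value share a gene; others get their own.
    genes = {}
    created = 0
    for transcript, value in transcript_tag.items():
        if value is None:
            created += 1
            value = f"new-gene-{created}"  # placeholder name
        genes.setdefault(value, []).append(transcript)
    return genes


TESTF = [
    {"ID": "exon1", "Parent": "transcript1", "locus_tag": "gene1"},
    {"ID": "cds-1", "Parent": "transcript1", "locus_tag": "gene1"},
    {"ID": "exon2", "Parent": "transcript2", "locus_tag": "gene1"},
    {"ID": "cds-2", "Parent": "transcript2", "locus_tag": "gene1"},
    {"ID": "exonb", "Parent": "transcriptb", "locus_tag": "gene2"},
    {"ID": "cds-b", "Parent": "transcriptb", "locus_tag": "gene2"},
    {"ID": "exon4", "Parent": "transcript4"},
    {"ID": "cds4", "Parent": "transcript4"},
]

genes = genes_from_level3(TESTF)
```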
common attribute approach to parse the file can be used.
2.1
Here we have only level3 features and no Parent/ID or transcript_id/gene_id relationships, but a default common attribute (locus_tag or gene_id) is present.
Input testE.gff:
chr12 HAVANA exon 100 300 . + . ID=exon1;locus_tag="gene1"
+ chr12 HAVANA CDS 100 300 . + 0 ID=cds-1;locus_tag="gene1"
+ chr12 HAVANA exon 500 600 . + . ID=exon2;locus_tag="gene1"
+ chr12 HAVANA CDS 500 600 . + 0 ID=cds-2;locus_tag="gene1"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;locus_tag="gene2"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;locus_tag="gene2"
+
agat_convert_sp_gxf2gxf.pl --gff testE.gff
chr12 HAVANA gene 100 600 . + . ID=nbis-gene-1;locus_tag="gene1"
+ chr12 HAVANA mRNA 100 600 . + . ID=nbisL2-exon-1;Parent=nbis-gene-1;locus_tag="gene1"
+ chr12 HAVANA exon 100 300 . + . ID=exon1;Parent=nbisL2-exon-1;locus_tag="gene1"
+ chr12 HAVANA exon 500 600 . + . ID=exon2;Parent=nbisL2-exon-1;locus_tag="gene1"
+ chr12 HAVANA CDS 100 300 . + 0 ID=cds-1;Parent=nbisL2-exon-1;locus_tag="gene1"
+ chr12 HAVANA CDS 500 600 . + 0 ID=cds-2;Parent=nbisL2-exon-1;locus_tag="gene1"
+ chr12 HAVANA gene 700 900 . + . ID=nbis-gene-2;locus_tag="gene2"
+ chr12 HAVANA mRNA 700 900 . + . ID=nbisL2-exon-2;Parent=nbis-gene-2;locus_tag="gene2"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=nbisL2-exon-2;locus_tag="gene2"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=nbisL2-exon-2;locus_tag="gene2"
+
/!\ For Eukaryote annotations containing isoforms, this will not work properly: isoforms will end up merged into chimeric transcripts. (You would be really unlucky to end up in such a situation, because even a human cannot resolve it: there is no information about isoform structure.) For Eukaryote cases without isoforms (even with multi-exon CDS), it will work correctly.
2.2
Here is the worst case that can happen: only level3 features, no Parent/ID or transcript_id/gene_id relationships, and the default common attributes (locus_tag and gene_id) are absent. The sequential approach will be used by AGAT, but as there are only level3 features, all of them will be linked to a single parent. See below for more details.
Input testC.gff:
chr12 HAVANA exon 100 500 . + . ID=exon1;locus_id="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;locus_id="gene1"
+chr12 HAVANA exon 510 600 . + . ID=exon2;locus_id="gene1"
+chr12 HAVANA CDS 510 600 . + 0 ID=cds-2;locus_id="gene1"
+chr12 HAVANA exon 700 900 . + . ID=exonb;locus_id="gene2"
+chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;locus_id="gene2"
+
agat_convert_sp_gxf2gxf.pl --gff testC.gff
chr12 HAVANA gene 100 900 . + . ID=nbis-gene-1;locus_id="gene1"
+chr12 HAVANA mRNA 100 900 . + . ID=nbisL2-exon-1;Parent=nbis-gene-1;locus_id="gene1"
+chr12 HAVANA exon 100 600 . + . ID=exon1;Parent=nbisL2-exon-1;locus_id="gene1"
+chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=nbisL2-exon-1;locus_id="gene2"
+chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=nbisL2-exon-1;locus_id="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=nbisL2-exon-1;locus_id="gene1"
+chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=nbisL2-exon-1;locus_id="gene2"
+
/!\ All features are collected under a single gene and mRNA feature, which is wrong.
As the default common attributes (gene_id or locus_tag) are absent, you have to tell AGAT which attribute to use to group features together properly. Here locus_id is a good choice:
agat config --expose --locus_tag locus_id # Modify the locus_tag parameter via the AGAT configuration file agat_config.yaml
agat_convert_sp_gxf2gxf.pl --gff testC.gff
chr12 HAVANA gene 100 600 . + . ID=nbis-gene-1;locus_id="gene1"
+chr12 HAVANA mRNA 100 600 . + . ID=nbisL2-exon-1;Parent=nbis-gene-1;locus_id="gene1"
+chr12 HAVANA exon 100 600 . + . ID=exon1;Parent=nbisL2-exon-1;locus_id="gene1"
+chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=nbisL2-exon-1;locus_id="gene1"
+chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=nbisL2-exon-1;locus_id="gene1"
+chr12 HAVANA gene 700 900 . + . ID=nbis-gene-2;locus_id="gene2"
+chr12 HAVANA mRNA 700 900 . + . ID=nbisL2-exon-2;Parent=nbis-gene-2;locus_id="gene2"
+chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=nbisL2-exon-2;locus_id="gene2"
+chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=nbisL2-exon-2;locus_id="gene2"
+
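The two testC.gff outcomes differ only in how the level3 features are bucketed before the parent coordinates are computed. The Python sketch below reproduces the gene spans of both runs from the input exon coordinates; it is an illustration under assumptions, and `parent_spans` is a hypothetical helper, not an AGAT function.

```python
def parent_spans(features, tag=None):
    """Return {group key: (min start, max end)}; without a tag, one single group."""
    spans = {}
    for start, end, attrs in features:
        key = attrs.get(tag) if tag else None
        low, high = spans.get(key, (start, end))
        spans[key] = (min(low, start), max(high, end))
    return spans


# Exon coordinates and attributes from the testC.gff input
TESTC_EXONS = [
    (100, 500, {"ID": "exon1", "locus_id": "gene1"}),
    (510, 600, {"ID": "exon2", "locus_id": "gene1"}),
    (700, 900, {"ID": "exonb", "locus_id": "gene2"}),
]

single_parent = parent_spans(TESTC_EXONS)                  # no usable common tag
per_locus = parent_spans(TESTC_EXONS, tag="locus_id")      # locus_id as common tag
```

The single span 100-900 corresponds to the one gene created by the default run, while grouping on `locus_id` yields the 100-600 and 700-900 genes of the corrected run.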
/!\ For Eukaryote annotations containing isoforms, this will not work properly: isoforms will end up merged into chimeric transcripts. (You would be really unlucky to end up in such a situation, because even a human cannot resolve it: there is no information about isoform structure.) For Eukaryote cases without isoforms (even with multi-exon CDS), it will work correctly.
This is the same problem as seen previously. Here is the worst case that can happen: only level3 features, no Parent/ID or transcript_id/gene_id relationships, and the default common attributes (locus_tag and gene_id) are absent. The sequential approach will be used by AGAT, but as there are only level3 features, all of them will be linked to a single parent. See below for more details.
Input (testD.gff):
chr10 Liftoff CDS 100 300 . + 0 ID=cds1
+ chr10 Liftoff CDS 600 900 . + 0 ID=cds2
+ chr10 Liftoff CDS 400 490 . - 0 ID=cds3
+
agat_convert_sp_gxf2gxf.pl --gff testD.gff
chr10 Liftoff gene 100 900 . + . ID=nbis-gene-1
+ chr10 Liftoff mRNA 100 900 . + . ID=nbisL2-cds-1;Parent=nbis-gene-1
+ chr10 Liftoff exon 100 300 . + . ID=nbis-exon-1;Parent=nbisL2-cds-1
+ chr10 Liftoff exon 400 490 . + . ID=nbis-exon-2;Parent=nbisL2-cds-1
+ chr10 Liftoff exon 600 900 . + . ID=nbis-exon-3;Parent=nbisL2-cds-1
+ chr10 Liftoff CDS 100 300 . + 0 ID=cds1;Parent=nbisL2-cds-1
+ chr10 Liftoff CDS 400 490 . - 0 ID=cds3;Parent=nbisL2-cds-1
+ chr10 Liftoff CDS 600 900 . + 0 ID=cds2;Parent=nbisL2-cds-1
+
/!\ All features are collected under a single gene and mRNA feature, which is wrong.
agat config --expose --locus_tag ID # Modify the locus_tag parameter via the AGAT configuration file agat_config.yaml
agat_convert_sp_gxf2gxf.pl --gff testD.gff
chr10 Liftoff gene 100 300 . + 0 ID=nbis-gene-1
+ chr10 Liftoff mRNA 100 300 . + 0 ID=nbisL2-cds-1;Parent=nbis-gene-1
+ chr10 Liftoff exon 100 300 . + . ID=nbis-exon-1;Parent=nbisL2-cds-1
+ chr10 Liftoff CDS 100 300 . + 0 ID=cds1;Parent=nbisL2-cds-1
+ chr10 Liftoff gene 400 490 . - 0 ID=nbis-gene-3
+ chr10 Liftoff mRNA 400 490 . - 0 ID=nbisL2-cds-3;Parent=nbis-gene-3
+ chr10 Liftoff exon 400 490 . - . ID=nbis-exon-3;Parent=nbisL2-cds-3
+ chr10 Liftoff CDS 400 490 . - 0 ID=cds3;Parent=nbisL2-cds-3
+ chr10 Liftoff gene 600 900 . + 0 ID=nbis-gene-2
+ chr10 Liftoff mRNA 600 900 . + 0 ID=nbisL2-cds-2;Parent=nbis-gene-2
+ chr10 Liftoff exon 600 900 . + . ID=nbis-exon-2;Parent=nbisL2-cds-2
+ chr10 Liftoff CDS 600 900 . + 0 ID=cds2;Parent=nbisL2-cds-2
+
This case is fine for Prokaryote annotation.
/!\ For Eukaryotes it might work under some conditions:
A) The annotation should not contain isoforms (indeed, there is no information to determine which isoform a CDS belongs to. If isoforms are present, each one will be linked to its own gene feature).
B) If there are multi-exon CDS, the CDS parts must share the same ID (indeed, multi-exon CDS parts may or may not share the same ID; both ways are allowed by the GFF format. If the CDS parts share the same ID, they will be collected properly. If they do not, AGAT will slice them and create a gene/mRNA feature per CDS part!).
Input (testG.gff):
chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1
+ chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1
+ chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2
+ chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2
+ chr12 HAVANA exon 700 900 . + . ID=exonb;locus_tag="gene1"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;locus_tag="gene1"
+ chr12 HAVANA exon 1000 1110 . + . ID=exon4;locus_tag="gene2"
+ chr12 HAVANA CDS 1000 1110 . + 0 ID=cds4;locus_tag="gene2"
+
agat_convert_sp_gxf2gxf.pl --gff testG.gff
chr12 HAVANA gene 100 500 . + . ID=nbis-gene-3
+ chr12 HAVANA mRNA 100 500 . + . ID=transcript1;Parent=nbis-gene-3
+ chr12 HAVANA exon 100 500 . + . ID=exon1;Parent=transcript1
+ chr12 HAVANA CDS 100 500 . + 0 ID=cds-1;Parent=transcript1
+ chr12 HAVANA gene 100 600 . + . ID=nbis-gene-4
+ chr12 HAVANA mRNA 100 600 . + . ID=transcript2;Parent=nbis-gene-4
+ chr12 HAVANA exon 100 600 . + . ID=exon2;Parent=transcript2
+ chr12 HAVANA CDS 100 600 . + 0 ID=cds-2;Parent=transcript2
+ chr12 HAVANA gene 700 900 . + . ID=nbis-gene-1;locus_tag="gene1"
+ chr12 HAVANA mRNA 700 900 . + . ID=nbisL2-exon-1;Parent=nbis-gene-1;locus_tag="gene1"
+ chr12 HAVANA exon 700 900 . + . ID=exonb;Parent=nbisL2-exon-1;locus_tag="gene1"
+ chr12 HAVANA CDS 700 900 . + 0 ID=cds-b;Parent=nbisL2-exon-1;locus_tag="gene1"
+ chr12 HAVANA gene 1000 1110 . + . ID=nbis-gene-2;locus_tag="gene2"
+ chr12 HAVANA mRNA 1000 1110 . + . ID=nbisL2-exon-2;Parent=nbis-gene-2;locus_tag="gene2"
+ chr12 HAVANA exon 1000 1110 . + . ID=exon4;Parent=nbisL2-exon-2;locus_tag="gene2"
+ chr12 HAVANA CDS 1000 1110 . + 0 ID=cds4;Parent=nbisL2-exon-2;locus_tag="gene2"
+
/!\ For Eukaryote annotation with isoforms, features would need to have the Parent attribute along with a common attribute to help AGAT to properly reconstruct the parental features (a single gene feature for isoforms).