agat_sp_filter_feature_from_kill_list.pl adding large number of unexpected rows to output #460
-
It's great to find a GTF filtering tool that is aware of features' parent/child relationships and can cleanly remove dependents, but I'm confused by the behavior when using it to exclude certain gene_biotypes. My starting GTF has 1,972,950 rows, yet after filtering the output is 163,996 rows longer, even though the log shows that more than 8,600 features were removed. I've attached a condensed version of the log file, which does not appear to report any new features being added; in fact, 4,486 features were removed during L1 orphan checking. For consistency with prior analyses, I need to remove the features on my kill list without altering feature boundaries or, apparently, adding new features. Am I doing something wrong, or is this the expected filtering behavior?
Condensed agat.log attached (just the step headers/footers).
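In case it helps narrow things down, one quick way to see which feature types account for the extra rows is to compare per-type counts (column 3 of the GTF) before and after filtering. A minimal sketch with placeholder file names:

```bash
# Count rows per feature type (GTF/GFF column 3), skipping comment lines.
grep -v '^#' input.gtf    | cut -f3 | sort | uniq -c > types_before.txt
grep -v '^#' filtered.gtf | cut -f3 | sort | uniq -c > types_after.txt

# Side-by-side comparison of per-type counts.
diff -y types_before.txt types_after.txt
```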
Replies: 1 comment 1 reply
-
To track what is going on at a more detailed level, I would suggest first running the "gxf2gxf" script to standardise your file. Standardisation is done by default for all _sp_ scripts and it is not possible to completely deactivate it (a minimal structure is needed to load the annotation in memory). So agat_sp_filter_feature_from_kill_list.pl is doing two things: standardisation, and then the kill-list job. You can follow what the standardisation process is doing by looking at the generated log file, but it may be clearer to split the process into two distinct jobs.
Run the standardisation with "gxf2gxf", then look at the stats (e.g. using the AGAT statistics script).
Then run agat_sp_filter_feature_from_kill_list.pl on the standardised file.

What I guess is happening is the creation of UTRs based on exon and CDS information, or of exons based on CDS and UTRs.
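A minimal sketch of the suggested split into two jobs (script option names from memory and file names are placeholders, so check each script's --help for the exact flags):

```bash
# Step 1: standardise on its own, so any features AGAT adds
# (e.g. exons or UTRs inferred from existing features) appear here,
# not during the filtering step.
agat_convert_sp_gxf2gxf.pl --gff input.gtf -o standardised.gff

# Compare feature statistics before and after standardisation.
agat_sp_statistics.pl --gff input.gtf        -o stats_before.txt
agat_sp_statistics.pl --gff standardised.gff -o stats_after.txt

# Step 2: apply the kill list to the already-standardised file;
# from here the feature count should only decrease.
agat_sp_filter_feature_from_kill_list.pl \
  --gff standardised.gff \
  --kill_list kill_list.txt \
  -o filtered.gff
```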