agat_sp_filter_feature_from_kill_list.pl adding large number of unexpected rows to output #460
-
It's great to find a GTF filtering tool that is aware of features' parent/child relationships and can cleanly remove dependents, but I'm confused by the behavior when using it to exclude certain gene_biotypes. My starting GTF has 1,972,950 rows, yet after filtering the output is 163,996 rows longer, even though the log shows that more than 8,600 features were removed. I've attached a condensed version of the log file, which does not appear to report any new features being added; in fact, 4,486 features were removed during L1 orphan checking. For consistency with prior analyses, I need to remove the features on my kill list without altering feature boundaries or, apparently, adding new features. Am I doing something wrong, or is this the expected filtering behavior?
Condensed agat.log attached (just the step headers/footers).
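In case it helps narrow things down, one quick way to see which feature types account for the extra rows is to compare per-type counts (column 3 of the GTF) before and after filtering. A minimal sketch with placeholder file names:

```bash
# Count rows per feature type (GTF/GFF column 3), skipping comment lines.
grep -v '^#' input.gtf    | cut -f3 | sort | uniq -c > types_before.txt
grep -v '^#' filtered.gtf | cut -f3 | sort | uniq -c > types_after.txt

# Side-by-side comparison of per-type counts.
diff -y types_before.txt types_after.txt
```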
Replies: 1 comment 1 reply
-
To track what is going on at a more detailed level, I would suggest first running the "gxf2gxf" script to standardise your file. Standardisation is done by default for all _sp_ scripts and it is not possible to completely deactivate it (a minimal structure is needed to load the annotation in memory). So agat_sp_filter_feature_from_kill_list.pl is doing two things: standardisation, and then the kill-list job. You can follow what the standardisation process is doing by looking at the generated log file, but it may be clearer to split the process into two distinct jobs.
Run the standardisation with "gxf2gxf", then look at the stats (e.g. using the AGAT statistics script).
Then run agat_sp_filter_feature_from_kill_list.pl on the standardised file.

What I guess is happening is the creation of UTRs based on exon and CDS information, or of exons based on CDS and UTRs.
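A minimal sketch of the suggested split into two jobs (script option names from memory and file names are placeholders, so check each script's --help for the exact flags):

```bash
# Step 1: standardise on its own, so any features AGAT adds
# (e.g. exons or UTRs inferred from existing features) appear here,
# not during the filtering step.
agat_convert_sp_gxf2gxf.pl --gff input.gtf -o standardised.gff

# Compare feature statistics before and after standardisation.
agat_sp_statistics.pl --gff input.gtf        -o stats_before.txt
agat_sp_statistics.pl --gff standardised.gff -o stats_after.txt

# Step 2: apply the kill list to the already-standardised file;
# from here the feature count should only decrease.
agat_sp_filter_feature_from_kill_list.pl \
  --gff standardised.gff \
  --kill_list kill_list.txt \
  -o filtered.gff
```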