-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathmanuscript.rmd
1112 lines (765 loc) · 106 KB
/
manuscript.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
bibliography: manuscript.bib
csl: science.csl
header-includes:
- \usepackage{lineno}
- \linenumbers
output:
html_document: default
pdf_document:
latex_engine: xelatex
word_document: default
github_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
echo = FALSE,
dpi = 300,
cache = FALSE
)
```
```{r load_data}
# Load data and analysis results generated by
# manuscript_kernel.R
load("manuscript.RData")
```
```{r preliminaries}
library( tidyverse )
library( magrittr )
library( viridis )
library( knitr )
options(digits=2)
library( gridExtra )
library( ggpubr )
library( grid )
library( gtable )
library( ggtree )
library( ape )
library(kableExtra)
theme_set(theme_pubr())
source( "functions.R" )
source( "phylobayes.R" )
# A colorblind friendly palette from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette:
cbPalette = c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
result_colors = c( cbPalette[2], cbPalette[4], cbPalette[1] )
names( result_colors ) = c( "Ctenophora-sister", "Porifera-sister", "Unresolved" )
cat_categories = read_tsv("../data_processed/tables/cat_categories.tsv")
```
# Rooting the animal tree of life
Yuanning Li^1,3^, Xing-Xing Shen^2,3^, Benjamin Evans^4^, Casey W. Dunn^1^* and Antonis Rokas^3^*
^1^Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06511, USA
^2^State Key Laboratory of Rice Biology and Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Institute of Insect Sciences, Zhejiang University, Hangzhou 310058, China
^3^Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA
^4^Yale Center for Research Computing, Yale University, New Haven, CT 06511, USA
\* Corresponding authors, [email protected], [email protected]
ORCIDs:
Yuanning Li: 0000-0002-2206-5804
Xing-Xing Shen: 0000-0001-5765-1419
Benjamin Evans: 0000-0003-2069-4938
Casey W. Dunn: 0000-0003-0628-5150
Antonis Rokas: 0000-0002-7248-6551
**Short title:** Rooting the animal tree of life
**One sentence Summary**: We explore the root of animal tree by synthesizing past work, perform new phylogenomic and sensitivity analyses, and found that the primary source of variation between results is the number of categories used in site-heterogeneous models.
## Abstract
There has been considerable debate about the placement of the root in the animal tree of life, which has emerged as one of the most challenging problems in animal phylogenetics. This debate has major implications for our understanding of the earliest events in animal evolution, including the origin of the nervous system. Some phylogenetic analyses support a root that places the first split in the phylogeny of living animals between sponges and all other animals (the Porifera-sister hypothesis), and others find support for a split between comb jellies and all other animals (Ctenophora-sister). These analyses differ in many respects, including in the genes considered, species considered, molecular evolution models, and software. Here we systematically explore the rooting of the animal tree of life under consistent conditions by synthesizing data and results from 15 previous phylogenomic studies and performing a comprehensive set of new standardized analyses. It has previously been suggested that site-heterogeneous models favor Porifera-sister, but we find that this is not the case. Rather, Porifera-sister is only obtained under a narrow set of conditions when the number of site-heterogeneous categories is unconstrained and range into the hundreds. Site-heterogenous models with a fixed number of dozens of categories support Ctenophora-sister, and cross-validation indicates that such models fit the data just as well as the unconstrained models. Our analyses shed light on an important source of variation between phylogenomic studies of the animal root. The datasets and analyses consolidated here will also be a useful test-platform for the development of phylogenomic methods for this and other difficult problems.
## Main Text
Historically, there was little debate about the root of the animal tree of life. Porifera-sister (Fig. 1E), the hypothesis that the animal root marks the divergence of Porifera (sponges) from all other animals (Fig. 1B), was widely accepted though rarely tested. By contrast, there has long been uncertainty about the placement of Ctenophora (comb jellies) (Fig. 1D) in the animal tree of life [@Wallberg:2004ws]. The first phylogenomic study to include ctenophores [@Dunn:2008ky] suggested a new hypothesis, now referred to as Ctenophora-sister, that ctenophores rather than sponges are our most distant living animal relative (Fig. 1A). Since then, many more phylogenomic studies have been published (Fig. 2), with some analyses finding support for Ctenophora-sister, some for Porifera-sister, and some neither [@King:2017ie] (Table 1, Supplementary Information section S1). The extensive technical variation across these studies has been important to advancing our understanding of the sensitivity of these analyses, demonstrating for example that outgroup and model selection can have a large impact on these results (Fig. 3A). But the extensive technical variation has also made it difficult to synthesize these results to understand the underlying causes of this sensitivity. Several factors make resolution of the root of the animal tree a particularly challenging problem. For one, the nodes in question are the deepest in the animal tree of life. Another factor that has been invoked is branch lengths (*e.g.* Figs. S1, S2), which are impacted by both divergence times and shifts in rate of evolution. Some sponges have a longer root to tip length, indicating an accelerated rate of evolution in those lineages. The stem branch of Ctenophora is longer than Porifera stem which, together with a more typical root-to-tip distance, is consistent with a more recent radiation of extant ctenophores [@podar2001] than extant poriferans. The longer ctenophore stem has led some to suggest that Ctenophora-sister could be an artifact of long-branch attraction to outgroups [@pisani2015genomic].
To advance this debate it is critical to examine variation in methods and results in a standardized and systematic framework. Here, we synthesize data and results from 15 previous phylogenomic studies that tested the Ctenophora-sister and Porifera-sister hypotheses (Fig. 2, Table 1). This set includes all phylogenomic studies of amino acid sequence data published before 2019 for which we could obtain data matrices with gene partition annotations. Among these 15 studies, three studies [@pisani2015genomic; @whelan2016let; @feuda2017improved] are based entirely on previously published data, and gene-partition data is not available from one study [@Pick:2010eb].
![](figures/Figure_overview.png)
**Fig. 1.** Two alternative phylogenetic hypotheses for the root of the animal tree. (A) The Ctenophora-sister hypothesis posits that there is a clade (designated by the orange node) that includes all animals except Ctenophora, and that Ctenophora is sister to this clade. (B) The Porifera-sister hypothesis posits that there is a clade (designated by the green node) that includes all animals except Porifera, and that Porifera is sister to this clade. Testing these hypotheses requires evaluating the support for each of these alternative nodes. (C) The animals and their outgroup choice, showing the three progressively more inclusive clades Choanozoa, Holozoa, and Opisthokonta. (D) A ctenophore, commonly known as a comb jelliy. (E) A poriferan, commonly known as a sponge.
## Variation in models and sampling across published analyses
The models of sequence evolution in the studies considered here differ according to two primary components: the exchangeability matrix $R$ and amino acid equilibrium frequencies $\Pi$. The exchangeability matrix $R$ describes the relative rates at which one amino acid changes to another. The studies considered here use exchangeabilities that are the same between all amino acids (Poisson, also referred to as F81), or different. If different, the exchangeabilities can either be fixed based on previously empirically estimated rates (WAG or LG), or independently estimated from the data (GTR). The analyses considered here have site-homogeneous exchangeability models (site-homogeneous model), which means that the same matrix is used for all sites. The equilibrium frequencies describe the expected frequency of each amino acid, which captures the fact that some amino acids are much more common than others. The published analyses differ in whether they take a homogeneous approach and jointly estimate the same frequency across all sites in a partition, or add parameters that allow heterogeneous equilibrium frequencies that differ across sites. Heterogeneous approaches include CAT [@Lartillot:2004dq], which is implemented in the software PhyloBayes and has been widely applied to questions of deep animal phylogeny. The models that are applied in practice are heavily influenced by computational costs, model availability in software, and convention. While studies often discuss CAT and WAG models as if they are mutually exclusive, we note that these particular terms apply to non-exclusive model components -- CAT refers to heterogeneous equilibrium frequencies across sites and WAG to a particular exchangeability matrix. In this literature, CAT is generally shorthand for Poisson+CAT and WAG is shorthand for WAG+homogeneous equilibrium frequency estimation. To avoid confusion on this point, here we always specify the exchangeability matrix first (*e.g.* GTR), followed by modifiers that describe the accommodation of heterogeneity in equilibrium frequencies (*e.g.* CAT). If site homogeneous equilibrium frequencies are used, we refer to the exchangeability matrix alone. Gamma-rate heterogeneity, a scalar that accommodates the total rate of change across sites, is used in almost every analysis conducted here and we generally omit its designation. Some analyses partition the data by genes, and use different models for each gene (Supplementary Information section S2).
High-throughput sequencing allows investigators to readily assemble matrices with hundreds or thousands of protein-coding genes from a broad diversity of animal species (Table 1). Studies of animal phylogeny have used a wide variety of different approaches to identifying and selecting genes and taxa for their matrices. As a result, the particular genes selected for analysis vary widely (Fig. 2, Fig. S3). Gene sampling varies in several ways, including in the fractions of single-copy orthologs (*e.g.* BUSCO genes) and ribosomal protein genes in the matrix (Fig. S4).
```{r figure_taxon_rectangles, fig.height = 7.2, fig.width = 7.2 }
no_subsets = unlist(lapply(sequence_matrices, function(x) !grepl("only", x@matrix_name)))
taxon_rectangles = lapply(
sequence_matrices[no_subsets],
function ( sequence_matrix ){
clade_rects(sequence_matrix)
}
) %>%
bind_rows()
taxon_rectangles$matrix[taxon_rectangles$matrix == "Simion2017_subsampled"] = "Simion2017"
taxon_rectangles$matrix[taxon_rectangles$matrix == "Hejnol2009_subsampled"] = "Hejnol2009"
clade_colors = c("#e6194b", "#f032e6", "#ffe119", "#4363d8", "#E69F00", "#009E73", "#82d1f4", "#e6beff", "#911eb4")
taxon_rectangles =
taxon_rectangles %>%
filter(matrix != "Nosenko2013_nonribo_9187") %>%
filter(matrix != "Whelan2017_strict") %>%
filter(matrix != "Whelan2015_D1") %>%
filter(matrix != "Whelan2015_D10") %>%
filter(matrix != "Nosenko2013_ribo_11057") %>%
filter(matrix != "Borowiec2015_Best108")
taxon_rectangles$matrix =
factor(
taxon_rectangles$matrix,
levels = c("Dunn2008", "Philippe2009", "Hejnol2009", "Nosenko2013_ribo_14615", "Ryan2013_est",
"Moroz2014_3d", "Borowiec2015_Total1080", "Whelan2015_D20", "Chang2015",
"Simion2017", "Whelan2017_full"))
p =taxon_rectangles %>%
ggplot() +
scale_x_continuous(name = "Gene sampling") +
scale_y_continuous(name = "Taxa sampling") +
geom_rect( mapping = aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = clade )) +
scale_fill_manual(values = clade_colors, guide = guide_legend(reverse = TRUE)) +
facet_wrap( ~ matrix ) + theme_classic()
ggsave("figures/Figure2_raw.pdf", width=7.2, height=7.2)
```
![](figures/Figure_2_manuscript.png)
**Fig. 2.** An overview of previous phylogenomic studies on animals. The horizontal axis is a time-series showing the main conclusions of previous phylogenomic studies that indicated their analyses supported Ctenophora-sister (orange nodes) or Porifera-sister (green nodes) based on their main conclusion from each study. Each of the primary matrices considered here is shown above the axis, color coded by taxon sampling. Horizontal size is proportional to the number of genes sampled, vertical size to the number of species sampled. Only 11 matrices were shown here since three studies [@pisani2015genomic; @whelan2016let; @feuda2017improved] are based entirely on previously published data and gene-partition data is not available from one study [@Pick:2010eb].
Ingroup taxon sampling also varies widely between studies (Fig. 2, Fig. S3). Sampling of ingroup taxa (animals) in early studies was biased toward Bilateria. Sampling of non-bilaterian animals, including sponges and ctenophores, has improved over time (Fig. S3). Within each clade, there is often considerable variation in taxon sampling and therefore often little species overlap across studies (Fig. S3). This variation is in part because newer sequencing technologies in more recent studies are usually not applied to the exact same species that were included in earlier studies.
Sampling of outgroup taxa (non-animals, in this case) is critical to phylogenetic rooting questions, since the node where the outgroup subtree attaches to the ingroup subtree is the root of the ingroup. There has therefore been extensive focus on improving outgroup sampling when testing phylogenetic hypotheses about rooting [@graham2002rooting]. Most studies addressing the animal root have removed more distantly related outgroup taxa in some analyses to explore the effect of outgroup selection to ingroup topology [@ryan2013genome; @pisani2015genomic]. Three progressively more inclusive clades have often been investigated: Choanozoa (animals plus most closely related Choanoflagellatea), Holozoa (Choanozoa plus more distantly related Holozoa), and Opisthokonta (Holozoa + Fungi).
## Variation in results across published analyses
```{r figure_support_analyses, fig.width=7.2, fig.height=3.6}
analyses_published_processed =
analyses_published %>%
dplyr::filter( !grepl( "Recoding", model_combined ) ) %>%
dplyr::rename( model = model_combined ) %>%
select( model, clade, final, inference, manuscript, matrix, software ) %>%
dplyr::rename( result = final ) %>%
mutate( status = "published" )
analyses_new_processed =
analyses_new %>%
mutate(model = factor(model)) %>%
select( model, clade, result, inference, matrix ) %>%
mutate( manuscript = NA ) %>%
mutate( status = "new" ) %>%
mutate( software = NA )
analyses_new_processed$software[ analyses_new_processed$inference == "ML" ] = "IQTREE"
analyses_new_processed$software[ analyses_new_processed$inference == "Bayesian" ] = "PhyloBayes"
analyses_processed =
rbind( analyses_published_processed, analyses_new_processed ) %>%
mutate(software=recode( software, `PhyloBayes MPI`="PhyloBayes") ) %>%
mutate(model = recode(model,
`GTR20` = "GTR",
`CAT+F81` = "Poisson+CAT",
`Poisson + CAT` = "Poisson+CAT",
`GTR + CAT` = "GTR+CAT")) %>%
mutate(
model = factor( model,
levels = c("WAG", "LG", "GTR", "data partitioning",
"Poisson+C60", "WAG+C60", "LG+C60", "Poisson+CAT", "GTR+CAT") )) %>%
mutate( status = factor(status, levels = c("published", "new")))
p = analyses_processed %>%
ggplot( aes(x = model, y = clade, col = result, shape = software) ) +
facet_wrap( ~ status ) +
geom_jitter( width = 0.20, height = 0.20, alpha = 0.7, size = 2 ) +
scale_colour_manual(values = result_colors) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Model") +
ylab("Clade")
ggsave("figures/Figure3_raw.pdf", width=7.2, height=3.6)
```
![](figures/Figure3_published.png)
**Fig. 3.** Summary of phylogenomic results from previous studies and reanalysis conducted in this study. Each point represents a phylogenetic analysis with a support of bootstrap values less than `r bootstrap_threshold`% or a posterior probability less than `r posterior_prob_threshold`% are considered as unresolved. The clades are organized with increased outgroup sampling higher on the vertical axis, and the models are organized in general by increasing complexity in terms of numbers of parameters to the right on the horizontal axis. (A) Analyses transcribed from the literature (related to Table S1). (B) New phylogenomic analyses conducted in this study (related to Table S2). Over 1,011,765 CPU hours (~115 CPU years) were used for this study.
We parsed `r analyses_processed %>% filter(status=="published") %>% nrow()` previous phylogenetic analyses from `r analyses_processed %>% filter(status=="published") %>% pull (manuscript) %>% unique() %>% length()` studies (Fig. 3A and Table S1). The conclusions of five studies strongly favor Porifera-sister and ten favor Ctenophora-sister (Table 1). Three studies are based entirely on previously published data, and the remainder add data for one or more species.s
```{r eval=FALSE}
# Code to double check and explore some of the assertions in the text
# Simion2017 is the only Holozoa analysis to support Porifera-sister
analyses_processed %>%
filter(status == "published") %>%
filter(clade == "Holozoa") %>%
filter(result == "Porifera-sister")
# Check out the CAT analyses that supported Ctenophora sister
analyses_processed %>%
filter(status == "published") %>%
filter(result == "Ctenophora-sister") %>%
filter( grepl("CAT", model))
analyses_processed %>%
filter(status == "published") %>%
filter(result == "Ctenophora-sister") %>%
filter( grepl("CAT", model)) %>%
filter( clade != "Opisthokonta")
```
Our summary of previous phylogenetic analyses (Fig. 3A) shows that Ctenophora-sister is supported in analyses that span the full range of outgroup sampling and models used to date, including some analyses with restricted outgroup sampling and models that accommodate site-heterogeneous equilibrium frequencies with CAT. This is consistent with previous assessments of the problem [@whelan2016let], but is drawn from a much more extensive and systematic examination. The only analyses that support Porifera-sister have reduced outgroup sampling (Choanozoa, Holozoa) and site-heterogeneous models with CAT. Model adequacy assessments generally favor GTR+CAT over Poisson+CAT or site-homogeneous models [@pisani2015genomic; @feuda2017improved], but because GTR+CAT is so parameter rich, many analyses that use a model with GTR+CAT do not converge. The fact that Porifera-sister is recovered only for particular models with particular outgroup sampling indicates that model and outgroup interact, and that this interaction is fundamental to understanding the range of results obtained across analyses.
## New standardized analyses of published matrices
One of the challenges of interpreting support for the placement of the animal root across published studies is that different programs, software versions, and settings have been used across analyses. This extensive variation makes it difficult to identify the primary factors that lead to different results. Here we first reanalyze the primary matrices from each study under the same conditions with IQ-TREE [@nguyen2014iq] with multiple evolutionary models. We selected this tool because it has greater model flexibility than most other phylogenetic tools [@zhou2017evaluating] (Fig. 3B; Table S3). Importantly, it has C models (C10-C60 equilibrium frequencies) [@si2008empirical] that, like CAT, allow models to accommodate heterogeneity in equilibrium amino acid frequencies across sites.
```{r eval=FALSE}
# Code to double check some of the assertions in the text
# See results (and other fields) for new Cxx analyses, all are Ctenophora-sister with exception of Moroz2014_3d and Nosenko matrices
analyses_processed %>%
filter(status == "new") %>%
filter( grepl("C\\d\\d", model) ) %>%
print(n = 40)
```
For each of the published studies, we selected the matrix that was the primary focus of the manuscript, or has been reanalyzed extensively in other studies, for further analysis. For each of these matrices, we progressively trimmed taxon sampling to create Opisthokonta, Holozoa, and Choanozoa versions, where permitted by original outgroup sampling. This produced `r analyses_new$matrix%>% unique()%>% length()` data matrices from 11 studies that presented new sequence data and for which partition data were available.
For all but the three largest matrices, we tested the relative fit of a variety of models, both with and without C10-C60 accommodation of site heterogeneity in equilibrium frequencies, using ModelFinder [@kalyaanamoorthy2017modelfinder] in IQ-TREE. In all cases, models with C60 fit these matrices better than the site-homogeneous models. This is consistent with the importance of accommodating site heterogeneity noted by previous investigators [@Lartillot:2004dq; @Philippe:2009hh; @nosenko2013deep; @pisani2015genomic; @simion2017large; @feuda2017improved]. We then inferred support under the best-fit model (Table S3), except for the three largest matrices where we used LG+C60. We then analyzed each matrix under a panel of standard site-heterogeneous and site-homogeneous models, including WAG, GTR and Poisson+C60 (Table S3).
All IQ-TREE analyses, apart from unresolved analyses (for Moroz2014_3d and all Nosenko2013 matrices), supported Ctenophora-sister (Fig. 3B). No IQ-TREE analyses supported Porifera-sister, including those that restrict outgroup sampling to Choanoflagellatea and use models with site-heterogeneous equilibrium frequencies (Fig. 3B lower right; Table S2), the conditions under which published PhyloBayes CAT analyses recover strong support for Porifera-sister (Fig. 3A). To further verify this difference in a controlled manner we reran PhyloBayes analyses with CAT, using both Poisson and GTR substitution matrices, for some matrices that had led to support for Porifera-sister in published analyses. Consistent with published results, some of these supported Porifera-sister.
Our new analyses show that, with restricted outgroup sampling, analyses of the same matrices with two different means of accommodating site heterogeneity in equilibrium frequencies (C60 in IQ-TREE and CAT in PhyloBayes) yield different results. This indicates that the traditional framing of the problem, that accommodating site heterogeneity leads to support for Porifera-sister, is not correct. Instead, there is something about the PhyloBayes CAT analyses specifically that leads to support for Porifera-sister.
### Category number explains differences between site-heterogeneous analyses
There are several factors, including variations in models (C60 vs CAT), software (Phylobayes vs IQ-TREE), and implementation details (*e.g.* number of categories used to accommodate site heterogeneity) that could explain the new variation in results noted here among site-heterogeneous models. In published analyses of the animal root, these factors were confounded since all previous heterogeneous analyses used the CAT model in PhyloBayes. Here we seek to deconfound these factors to gain a finer-grained perspective on why results differ between analyses of the same matrices.
```{r}
cat_categories$n_categories_poisson = cat_categories$`Infered categories from Poisson-CAT`
cat_categories$n_categories_gtr = cat_categories$`Infered categories from GTR-CAT`
# https://www.datanovia.com/en/lessons/t-test-in-r/
# Whelan
Dw = cross_validation %>%
filter(matrix=="Whelan2017_strict")
stat_test_w =
t.test(score ~ model, var.equal = TRUE, data=Dw)
P_value_Whelan2017_strict=stat_test_w$p.value
# Philippe
Dp = cross_validation %>%
filter(matrix=="Philippe2009_Chanimalia")
stat_test_p =
t.test(score ~ model, var.equal = TRUE, data=Dp)
P_value_Philippe2009=stat_test_p$p.value
```
A primary difference between the C (like C60) and CAT site-heterogeneous models is the number of equilibrium frequency categories. The standard CAT model employs a Dirichlet process prior to inferring the number of equilibrium frequency categories, so the number of categories is variable [@Lartillot:2004dq]. IQ-TREE implements C models [@si2008empirical] with a fixed number of categories, from 10 (C10) to 60 (C60). Differences in analysis results could therefore be due to differences in the number of categories. The number of categories inferred by CAT in PhyloBayes can be very high (Table S4), with a mean here of `r cat_categories$n_categories_poisson %>% mean() %>% round(1)` categories for Poisson+CAT analyses in all matrices and `r cat_categories$n_categories_gtr %>% mean(na.rm=TRUE) %>% round(1)` categories for GTR+CAT analyses in several representative matrices. This requires a very large number of additional estiamted parameters.
We examined the specific impact of this large difference in category number on animal root position. It is currently not possible to use more than 60 categories for C models in IQ-TREE, but the number of categories can be set *a priori* in Phylobayes using nCAT. We therefore varied the number of categories in Phylobayes analyses (Fig. 4; Table S5). We found that Poisson+nCAT60 analyses in PhyloBayes, like Poisson+C60 and WAG+C60 IQ-TREE analyses, provide strong support of Ctenophora-sister. This indicates that the difference in results between unconstrained CAT analyses in PhyloBayes and C60 analyses in IQ-TREE is not due to differences in the software or other implementation factors, but due to the large difference, in excess of ten-fold, in the number of site categories. When we increased the number of categories in PhyloBayes nCAT analyses, we observe the transition from support for Ctenophora-sister to an unresolved root to Porifera-sister (Fig. 4). For example, for the Whelan2017_strict matrix this transition occurs between 60 -- 120 categories when using Poisson+nCAT model. Due to computational limitations of GTR models, we only ran GTR+CAT and GTR+nCAT60 models on representative matrices with Choanozoa sampling. We found that for Whelan2017 matrices, support shifted from Porifera-sister with Poisson+CAT model to Ctenophora-sister using GTR+CAT model. Moreover, we also found all results strongly supported Ctenophora-sister with GTR+nCAT60 models.
These results further clarify when analyses support Porifera-sister (*e.g.* Whelan2017: Fig. S1; Philippe2009: Fig. S2): when outgroup sampling is restricted (Choanozoa), when a Poisson (rather than a GTR) exchange matrix is used, and when a very large number of site categories is used (unconstrained CAT, giving hundreds of equilibrium frequency categories). Analyses under other conditions either support Ctenophora-sister or are unresolved. This is consistent across published analyses and our new panels of analyses. The question, then, isn’t why similar analyses give different results, but how we should interpret variation in results when we run analyses that differ in these specific respects. If we fix the first two features to conditions that are necessary for Porifera-sister support (Choanozoa taxon sampling and a Poisson exchange matrix), there are several insights that we can glean from examining how the number of equilibrium frequency categories impacts results that sheds light on interpretation of those results.
The first questions regard model fit. ModelFinder selects site heterogeneous C60 models according to BIC (Table S3), but IQ-TREE often gives a warning under C60 that the model may overfit, with too many categories, because some mixture weights are close to 0 (Table S6). In PhylobBayes, Cross-validation is a reliable and suggested approach to model evaluation [@lartillot2009phylobayes]. We elvaluated Poisson+CAT and Poisson+nCAT60 models with cross-validation in PhyloBayes for the Whelan2017_strict and Philippe2009_Choanozoa matrices. For both matrices, we found nearly identical distributions of cross-validation scores for Poisson+nCAT60 and Poisson+CAT models (Fig. S5). A t-test analysis shows that there is no significant difference between them for either matrix (Whelan2017_strict: $p-value=`r P_value_Whelan2017_strict`$; Philippe2009_Choanozoa: $p-value=`r P_value_Philippe2009`$). Cross-validation therefore does not indicate that the unconstrained CAT models are a better fit than the models with 60 categories. This suggests there is not a reason to favor the narrow analysis conditions (Choanozoa taxon sampling, Poisson exchange matrix, and unconstrained CAT models that use hundreds of categories) under which Porifera-sister are obtained.
In a site-heterogeneous model, each site is allocated to one of the categories in each sample. Different categories can have very different numbers of sites. We examine category allocations in the last chain samples of analyses of the Philippe2009_Choanozoa and Whelan2017_strict matrices (Fig. S 6A-B), which have `r n_categories_phil_cat` and `r n_categories_whel_cat` categories, respectively. The fraction of sites allocated to the 50% of the least frequent categories is `r mid_phil_cat`% in the analysis of the Philippe2009_Choanozoa matrix and `r mid_whel_cat`% in the analysis of the Whelan2017_strict matrix. This long tail of categories that apply to a very small fraction of sites is in stark contrast to nCAT60 analyses, constrained to 60 categories, which have no such long tail of rare categories (Fig. S6A-B). The fact that so many categories apply to such a small fraction of data may help explain why increasing the number of categories more than ten-fold has so little impact on the predictive power of these far more complex models.
``` {r sensitive, fig.width=7.2, fig.height=7.2}
Philippe2009 =
analyses_sensitive %>%
filter(model != "GTR+CAT60") %>%
filter(matrix == "Philippe2009_only_choanozoa" )
Philippe2009$model =
factor(
Philippe2009$model,
levels = c("Poisson+CAT60", "Poisson+CAT90", "Poisson+CAT120", "Poisson+CAT150", "Poisson+CAT180"))
p1 = Philippe2009 %>%
ggplot( aes(model) ) +
geom_line( aes(y = `support_ctenophora_sister`, colour = "Ctenophora-sister", group = 1 ), color = ("#E69F00")) +
geom_point( aes(y = `support_ctenophora_sister` ), color = ("#E69F00")) +
geom_line( aes(y = `support_porifera_sister`, colour = "Porifera-sister", group = 2), color = ("#009E73")) +
geom_point( aes(y = `support_porifera_sister` ), color = ("#009E73")) +
scale_fill_manual(values = c("#E69F00", "#ebd196")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Number of CAT substitutional categories", y = "Support") +
ggtitle("Philippe2009_Choanimalia")
Philippe2009 =
analyses_sensitive %>%
filter(matrix == "Philippe2009_only_Holozoa")
Philippe2009$model =
factor(
Philippe2009$model,
levels = c("Poisson+CAT60", "Poisson+CAT270", "Poisson+CAT360", "Poisson+CAT400", "Poisson+CAT440", "Poisson+CAT480"))
p2 = Philippe2009 %>%
ggplot( aes(model)) +
geom_line(aes(y = `support_ctenophora_sister`, colour = "Ctenophora-sister", group = 1), color = ("#E69F00")) +
geom_point(aes(y = `support_ctenophora_sister` ), color = ("#E69F00")) +
geom_line(aes(y = `support_porifera_sister`, colour = "Porifera-sister", group = 2), color = ("#009E73")) +
geom_point(aes(y = `support_porifera_sister` ), color = ("#009E73")) +
scale_fill_manual(values = c("#E69F00", "#ebd196")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Number of CAT substitutional categories", y = "Support") +
ggtitle("Philippe2009_Holozoa")
Whelan2017_full_only_choanozoa =
analyses_sensitive %>%
filter(model != "GTR+CAT60") %>%
filter(matrix == "Whelan2017_full_only_choanozoa")
Whelan2017_full_only_choanozoa$model =
factor(
Whelan2017_full_only_choanozoa$model,
levels = c("Poisson+CAT60", "Poisson+CAT340", "Poisson+CAT380", "Poisson+CAT420", "Poisson+CAT460"))
p3 = Whelan2017_full_only_choanozoa %>%
ggplot( aes(model)) +
geom_line(aes(y = `support_ctenophora_sister`, colour = "Ctenophora-sister", group = 1), color = ("#E69F00")) +
geom_point(aes(y = `support_ctenophora_sister` ), color = ("#E69F00")) +
geom_line(aes(y = `support_porifera_sister`, colour = "Porifera-sister", group = 1), color = ("#009E73")) +
geom_point(aes(y = `support_porifera_sister` ), color = ("#009E73")) +
scale_fill_manual(values = c("#E69F00", "#ebd196")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Number of CAT substitutional categories", y = "Support") +
ggtitle("Whelan2017_full_Choanimalia")
Whelan2017_strict =
analyses_sensitive %>%
filter(matrix == "Whelan2017_strict" )
Whelan2017_strict$model =
factor(
Whelan2017_strict$model,
levels = c("Poisson+CAT60", "Poisson+CAT70", "Poisson+CAT80", "Poisson+CAT90", "Poisson+CAT100",
"Poisson+CAT110", "Poisson+CAT120", "Poisson+CAT150", "Poisson+CAT180", "Poisson+CAT360"))
p4 = Whelan2017_strict %>%
ggplot( aes(model)) +
geom_line(aes(y = `support_ctenophora_sister`, colour = "Ctenophora-sister", group = 1), color = ("#E69F00")) +
geom_point(aes(y = `support_ctenophora_sister` ), color = ("#E69F00")) +
geom_line(aes(y = `support_porifera_sister`, colour = "Porifera-sister", group = 1), color = ("#009E73")) +
geom_point(aes(y = `support_porifera_sister` ), color = ("#009E73")) +
scale_fill_manual(values = c("#E69F00", "#ebd196")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Number of CAT substitutional categories", y = "Support") +
ggtitle("Whelan2017_strict")
p=ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"), common.legend = TRUE)
ggsave("figures/Figure4_raw.pdf", width=7.2, height=7.2)
```
![](figures/Figure4_manuscript.png)
**Fig. 4.** Sensitivity analyses with representative data matrices were analyzed by different number of equilibrium frequency categories (nCAT) in PhyloBayes. Statistical support values (posterior probabilities) were obtained from three data matrices using the site-heterogeneous Poisson+CAT model with different categories. (A) Phlippe2009_Choanozoa; (B) Philippe2009_Holozoa; (C) Whelan2017_full_Choanozoa; (D) Whelan2017_strict. Statistical support for Ctenophora-sister and Porifera-sister is indicated in orange and green, respectively. Support values from the sensitivity analyses are shown in Table S5.
### Phylogenetic signal
To further explore the phylogenetic signal of different models we discussed above, we quantified the phylogenetic signal for Porifera-sister and Ctenophora-sister topologies across three representative data matrices when varying outgroup sampling and model (Fig. S7). By calculating differences in log-likelihood scores for these topologies for every gene ($\Delta$|lnL|) in each matrix when using site-homogeneous models in IQ-TREE, we found that the Ctenophora-sister had the higher proportions of supporting genes in every analysis. Moreover, outgroup choice has little impact on the distribution of the support for phylogenetic signals in analyses with site-homogeneous models. This finding is largely consistent with the previously observed distribution of support for Ctenophora-sister in other data matrices [@shen2017contentious].
Although a higher proportion of genes support Ctenophora-sister with site-heterogeneous C60 models, the phylogenetic signal decreases in many genes using C60 models compared to site-homogeneous models. In an extreme case, in matrices from Ryan2013_est nearly 30% of genes changed from strong $\Delta$|lnL|>2) to weak Ctenophora-sister signal ($\Delta$|lnL|<2) (Fig. S7; Table S7). In contrast to the C60 models, there is a major increase in phylogenetic signal in Poisson+CAT models in PhyloBayes towards Porifera-sister, and outgroup choice has a major effect of the distribution of phylogenetic signal (Fig. S7). For example, in Whelan2017_full matrix, we found that the number of genes that support Ctenophora-sister in analyses with CAT decreases from 57.5% in matrices with distant outgroups (Holozoa) to 35.4% when outgroups are restricted (Choanozoa, Table S7).
### Other approaches to accommodating site heterogeneity
Other approaches have been taken to addressing base compositional heterogeneity across taxa. For example, Feuda *et al.* [@Feuda:2017ew] recoded the full set of twenty amino acids into six groups of amino acids. These groups tend to have more frequent evolutionary changes within them than between them [@Susko:2007ds]. Recoding could, like CAT and C models, address variation across sites, but it could also accommodate variation across lineages, and it was suggested that this approach favors Porifera-sister [@Feuda:2017ew]. However, our own analyses (Supplementary Information section S3, Figs. S8-S9) comparing the performance of random and empirical recoding indicate that the impact of data recoding is largely due to discarding information, not accommodating variation in amino acid composition. Consistent with a recent simulation study on data recoding [@hernandez2019six], these findings indicate that recoding can be a problematic method for addressing heterogeneity.
## Conclusions
Resolving the placement of the root in the animal tree of life has proved very challenging [@laumer2019revisiting]. By synthesizing past phylogenomic studies and performing new analyses, we find that support of Porifera-sister is only recovered by site-heterogeneous CAT models with restricted outgroup sampling, and then only in some such analyses. Through controlled analyses we are able to identify the specific aspect of the models that is involved in this variation -- the number of categories used to accommodate site heterogeneity in equilibrium frequency (Fig. 4). The 10-fold difference in category number seen in the more complex CAT models that support Porifera-sister does not improve model fit according to cross-validation, though. This suggests that we shouldn't privilege these narrow analysis conditions that recover Porifera-sister over the much broader range of conditions that recover Ctenophora-sister.
Pin-pointing category number as an issue with large effect on analyses of the animal root will help guide future analyses that address this question. We hope that the work we have conducted here to consolidate many datasets and analyses in standard formats will make it easier for other investigators to engage in this particularly interesting and difficult phylogenetic problem, and that this problem can be a test-bed to develop methods and tools that will help with other difficult phylogenetic problems as well. Advances on the question of the animal root will come from progress on other fronts as well. For example, there are many organisms that are highly relevant to this problem, in particular outgroup, ctenophore, and sponge taxa, for which no genome or transcriptome data are available [@King:2017ie]. More broadly, there are very few chromosome-level genome assemblies for animals outside of Bilateria. Future analyses focused on complete genomes rather than transcriptomes and partial genomes will have multiple advantages. Data matrices derived from these more complete sources will have a lower fraction of missing sequences. Complete gene sampling within each species will also greatly improve analyses of gene duplciation and loss, a crititical step in building phylogenomic matrices such as those presented here. Analyses of this new generation of matrices derived from complete genomes will be well served my understandind the sources of analysis variation in the generation of matrices that came before them.
## Acknowledgements
We thank members of the Dunn and Rokas laboratories for discussions and comments. We are grateful to Steve Haddock, Kyle David, Stephen Haddock and Nathan Whelan for feedback on an earlier version of this manuscript. We are grateful to Nicole King, Daniel Richter, Paul Lewis, Warren Francis and Joseph Ryan for feedback on the preprint version of this manuscript. We thank the Yale Center for Research Computing for use of the research computing infrastructure, specifically the Farnam HPC cluster. Yuanning Li was partially supported by a scholarship from the China Scholarship Council (CSC) for studying and living abroad. This work was conducted in part using the resources of the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University. Research in A.R.'s lab is supported by the National Science Foundation (DEB-1442113), the National Institutes of Health / National Institute of Allergy and Infectious Diseases (1R56AI146096-01A1), the Guggenheim Foundation, and the Burroughs Wellcome Fund.
## Author contributions
C.W.D., Y.L. conceived this study. C.W.D. and A.R. supervised the project. Y.L., B.E., X-X.S., and C.W.D. conducted analyses. Y.L., B.E. and C.W.D. wrote the paper. All authors discussed the results and implications and commented on the manuscript at all stages.
## List of Supplementary materials
Materials and methods
Table S1 - S8
Fig. S1 - S9
Supplementary text S1 - S3
## Supplementary Materials
## Materials and Methods
### Data and code availability
The main data and results associated with the main text and supplementary materials are available in the GitHub data repository at https://github.com/dunnlab/animal_tree_root. All tree files, intermediate results and scripts/commands associated with this study are available in the Figshare data repository at https://doi.org/10.6084/m9.figshare.13085081.v1.
### Data selection and wrangling
We retrieved matrices from each publication (Table 1), storing the raw data in this manuscript's version control repository. We manually made some formatting changes to make the batch processing of the matrices work well, *e.g.* standardizing the format of Nexus CHARSET blocks. All changes made are tracked with git.
### Matrix comparison and annotation
#### Taxon name reconciliation
We programmatically queried the NCBI Taxonomy database to standardize names of samples in each matrix. We also used a table where manual entries were needed (manual_taxonomy_map.tsv), *e.g.* authors of the original matrix indicate species name in original manuscript. For a table summarizing all samples and their new or lengthened names, see taxon_table.tsv.
#### Sequence comparisons
Using the original partition files for each matrix, we separated each sequence for each taxon from each partition. Because many of the matrices had been processed by the original authors to remove columns that are poorly sampled or highly variable, these matrix-derived sequences can have deletions relative to the actual gene sequences.
We used DIAMOND v0.9.26 [@Buchfink:2014] to compare each sequence to all others using default diamond Blastp parameters. We further filtered DIAMOND results such that we retained hits for 90% of partitions (pident > 50.0, eValue < 1e-5, no self vs self). We ran BUSCO with default parameters for all sequences against the provided Metazoa gene set. We also ran a BLAST+ v2.8.1 [@Camacho:2009] blastp search against the SwissProt [@boeckmann:2003swiss] database, filtering results such that we retain at least one hit for ~97% of partitions (pident > 50.0, eValue < 1e-15).
#### Partition network
We used the sequence similarity comparisons described above to compare partitions.
We constructed a network with Python and NetworkX v2.2 [@Hagberg:2008] where each node is a partition and each edge represents a DIAMOND sequence-to-sequence match between sequences in the partitions. We extracted each connected component from this network. We further split these components if the most connected node (i.e. most edges) had two times more the standard deviation from the mean number of edges in the component it is a member of and if removing that node splits the component into two or more components. We then decorated every node in the partition network with the most often found SwissProt BLAST+ result and BUSCO results to see which components contain which classes and families of genes. See partition_network_summary in Rdata for a summary tally of each part of the comparison.
### Phylogenetic analyses
#### Phylogenetic analyses in IQ-TREE
To investigate the phylogenetic hypotheses and distribution of phylogenetic signal in studies aiming to find the root position of animal phylogeny, we considered 16 data matrices from all phylogenomic studies that were constructed from EST, transcriptomic, or genomic data (Table 1). Because different choices of substitution models could largely influence phylogenetic inference of the placement of the root position of animal phylogeny (*e.g.* site-heterogeneous vs. site-homogeneous models), we first investigated model-fit from each matrix using ModelFinder in IQ-TREE v1.6.7, including site-heterogenous C10 to C60 profile mixture models (C60 models) as variants of the CAT models in ML framework (C10-C60 model were included for model comparison via -madd option). We included models that are commonly used in previous analyses, including site-homogeneous Poisson, WAG, LG, GTR models plus C10-C60 models in the model testing. For computational efficiency, the GTR+C60 models were not included in model testing since it requires to estimate over 10,000 parameters. For large matrices like those from Hejnol2009, Borrowiec2015, and Simion2017, model testing is also not computational feasible so only LG+C60 models were used since LG/WAG+C60 models were suggested as the best-fit model in all other matrices.
We then reanalyzed each matrix under a panel of evolutionary models, including WAG, GTR, Poisson+C60 and associated best-fit model identified above. Nodal support was assessed with 1000 ultrafast bootstrap replicates for each analysis. Because of the large size of Hejnol2009 and Simion2017, it was not computationally feasible to analyze the whole matrix using the C60 model or CAT site-heterogeneous models. To circumvent his limitation, we reduced the data size from their full matrices to facilitate computational efficiency for site-heterogeneous models. For Hejnol2009 matrix, we instead used the 330-gene matrix constructed by Hejnol *et al*. 2009, since the main conclusion for their study is based on this subsampled matrix; For Simion2017 matrix, we only included the most complete 25% of genes (genes that were present in less than 79 taxa were removed; 428 genes were kept). It should be noted that the main conclusion of Simion *et al*. was also based on selection of 25% of genes for their jackknife approach.
#### Outgroup taxa sampling with C60 and CAT models
Because different choices of outgroups could also affect phylogenetic inference as suggested in previous analyses, we parsed the full data matrices into three different types of outgroups: Choanozoa, Holozoa and Opisthokonta. These datasets include the same set of genes but differ in the composition of outgroup species. Choanozoa only includes Choanoflagellatea outgroup; Holozoa also includes more distantly related Holozoans; Opistokonta also includes Fungi. For each Choanozoa data matrice, both C60 models in IQ-TREE and Poisson+CAT models in PhyloBayes were conducted. The maximum likelihood analysis was performed using the best-fit substitution model identified as above and nodal support was assessed with 1000 ultrafast bootstrap replicates using IQ-TREE. Moreover, Bayesian inference with the site-heterogeneous Poisson+CAT model was done with PhyloBayes-MPI v1.8. To minimize computational burden, GTR+CAT models were only performed in the representative Choanozoa matrices from Philippe2009, Ryan2013_est and Whelan2017_full.
For several Choanozoa matrices indicated strong support for the hypothesis that sponges are the sister group to the remaining Metazoa using the Poisson+CAT model, Bayesian inference with Poisson+CAT model was also conducted to Holozoa and Opisthokonta data matrices with the same settings as above. For all the analyses with Poisson+CAT models in PhyloBayes, two independent chains were sampled every generation. Tracer plots of MCMC runs were visually inspected in Tracer v1.6 to assess stationarity and appropriate burn-in. Chains were considered to have reached convergence when the maxdiff statistic among chains was below 0.3 (as measured by bpcomp) and effective sample size > 50 for each parameter (as measured by tracecomp). A 50% majority-rule consensus tree was computed with bpcomp, and nodal support was estimated by posterior probability. Most Phylobayes runs converged, although several large matrices have not reached convergence after at least a month's computational time. For those matrices that were not converged, PhyloBayes analyses were run for at least two weeks. We also summarized the average number of substitutional categories inferred for each PhyloBayes analysis using Tracer.
#### Phylogenetic distribution of support
To investigate the distribution of phylogenetic signal of the animal-root position in data matrices, we considered three major data matrices from three studies that had different topology between ML and BI using CAT model in our reanalysis, including Philippe2008, Ryan2013_est, and Whelan2017_full data matrices. We examined two hypotheses: Ctenophora-sister (T1) and Porifera-sister (T2) to the rest of metazoans, under a panel of evolutionary models with different outgroup schemes (Choanozoa and the full matrix). For IQ-TREE analysis in each data matrix, site-wise likelihood scores were inferred for both hypotheses using IQ-TREE (option -g) with the LG+G4 model. The two different phylogenetic trees passed to IQ-TREE 1.6.12 (via -z) were the tree where Ctenophora-sister and a tree modified to have Porifera placed as the sister to the rest of animals. The numbers of genes and sites supporting each hypothesis were calculated from IQ-TREE output and Perl scripts from a previous study [@shen2017contentious]. By calculating gene-wise log-likelihood scores between T1 and T2 for every gene, we considered a gene with an absolute value of log-likelihood difference of two as a gene with strong (|ΔlnL| > 2) or weak (|ΔlnL| < 2) phylogenetic signal as done in a previous study [@smith2020phylogenetic].
For Poisson+CAT and LG in PhyloBayes, we only considered the Philippe2009 and Whelan2017 matrices due to computational efficiency. Since the default option in PhyloBayes does not provide the feature to calculate site-wise log likelihood for every generation, we replaced the line “int sitelogl = 0” with “int sitelogl = 1” in the file named “ReadSample.cpp” and installed PhyloBayes 4.1c so that site-wise log likelihood value can be stored to a file that ends with ".sitelogl" (via readpb –sitelogl). For each condition, we first calculated site-wise log likelihoods for each of two hypotheses (T1 and T2) using pb (via –T) and then stored site-wise log likelihood (a total number of samples for each site is 20) every ten until 300th generations, after discarding the first 100 generations using readpb (via -sitelogl -x 100 10 300). Next, we normalized site-wise log likelihood value across 20 samples for each of two hypotheses (T1 and T2) and combined normalized site-wise log likelihood values of T1 and T2 into a single file that was used to calculate gene-wise log-likelihood scores between T1 and T2 with Perl scripts from a previous study [@shen2017contentious].
#### Sensitivity analyses with different number of substitutional categories
To explore how the number of substitutional categories may affect the phylogenetic inference related to the animal phylogeny, we conducted PhyloBayes analyses with a panel of different substitutional categories in the Whelan2017_strict (ncat=60, 70, 80, 90, 110, 120, 150, 180, 360), Whelan2017_full_Choanozoa (nCAT=60, 340, 380, 420, 460), Philippe2009 (nCAT=60, 90, 120, 150, 180) Philippe2009_Holozoa (nCAT=360, 400, 440, 480) and Ryan2013_est (ncat=60). To compare the results between Poisson+CAT and GTR+CAT and minimize computational burden, GTR+CAT and GTR+CAT60 models were only performed in the representative Choanozoa matrices from Philippe2009, Ryan2013_est and Whelan2017_full. All PhyloBayes analyses were carried out using the same settings as above (see Outgroup taxa sampling with C60 and CAT models section), except when a different number of categories was used.
To compare the allocation of frequency categories across sites in the Philippe2009_Choanozoa and Whelan2017_strict matrices for the constrained Poisson+CAT60 model and unconstrained Posson+CAT model, we parsed the information of PhyloBayes chain files by sampling one in every 1000 generations after burnin determined above. The scripts and subsampled chain files are in the `../trees_new/frequency` subdirectory of the git repository.
#### Cross-validation analyses
Bayesian cross-validation implemented in PhyloBayes-MPI was used to compare the fit of Poisson+nCAT60 and Poisson+CAT models coupled with a gamma distribution of site-rate heterogeneity in Whelan2017_strict and Philippe2009_Choanozoa data matrices. Ten replicates were run, each replicate consisting of a random subsample of 10,000 sites for training the model and 2,000 sites for computing the cross-validation likelihood score. For each run, 5,000 generations were run and the first 2,000 generations were discarded as burn-in.
#### Performance of data-recoding
All code used for the analyses presented here is available in a git repository at https://github.com/caseywdunn/feuda_2017. The randomized recoding analyses are in the `recoding/alternative` subdirectory of the git repository.
The original SR-6 recoding scheme is "`APST CW DEGN FHY ILMV KQR`" [@Susko:2007ds], where spaces separate amino acids that are placed into the same group. This recoding is one member of a family of recodings, each with a different number of groups, based on clustering of the JTT matrix. The other recoding used by Feuda *et al.*, KGB-6 and D-6, are based on different matrices and methods [@Feuda:2017ew].
The `alt_recode.py` script was used to generate the randomized recoding schemes and apply the recodings to the data. To create the randomized recoding schemes, the amino acids in the SR-6 recoding were randomly reshuffled. This creates new recodings that have the same sized groups as SR-6. The new recodings were, from `random-00` to `random-03`:
MPKE AY IDGQ TRC WNLF SVH
EIFT WL QVPG NKM SCHA RYD
LCQS GK WPTI VRD YEFN MAH
IWQV TY KDLM ESH APCF GRN
To apply these to the data, each amino acid was replaced with the first amino acid in the group. When applying `random-00`, for example, each instance of `R` and `C` would be replaced by a `T`.
The 20 state matrices are the same across all analyses since they are not recoded. Since all 20 state matrices are the same, variation between 20-state results (as in the left side of each pane of Fig. S9) gives insight into the technical variance of the inference process.
Each new matrix was analyzed with PhyloBayes-MPI. Analyses were run for 1000 generations, and a 200 generation burnin applied. The resulting tree files and posterior predictive scores were parsed for display with the code in `manuscript.rmd`.
The statistics presented in Fig. S8A were parsed from the Feuda *et al.* manuscript into the file `tidy_table_3.tsv` and rendered for display with the code in `manuscript.rmd`.
## Tables
**Table 1.** Overview of data matrices used in this study.
```{r table_study_summary}
table_study_summary = read_tsv("../data_processed/tables/study_summary.tsv")
opts <- options(knitr.kable.NA = "")
knitr::kable(table_study_summary, digits = 3, format.args = list(big.mark = ",",
scientific = FALSE)) %>%
kable_styling(latex_options="scale_down")
```
## Supplementary Figures
```{r figure_Whelan2017_trees,, fig.height = 8, fig.width = 10}
outgroup= taxa_map_whelan %>% filter (clade_assignment=="Choanoflagellida")
tree1_rooted = root(tree1, outgroup$relabelled_name, resolved.root=TRUE)
p1 = ggtree(tree1_rooted) +
geom_cladelabel(node=90, label="Ctenophora", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=114, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=28, label="Placozoa") +
geom_cladelabel(node=109, label="Bilateria") +
geom_cladelabel(node=95, label="Cnidaria") +
geom_hilight(node=90, fill="#E69F00") +
geom_hilight(node=114, fill="#009E73") +
geom_hilight(node=28, fill="#82D1F4") +
geom_hilight(node=109, fill="#E6BEFF") +
geom_hilight(node=95, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 95), hjust=-.3, size=2) + ggtitle(" WAG+C60")
tree2_rooted = root(tree2, outgroup$relabelled_name, resolved.root=TRUE)
p2= ggtree(tree2_rooted) +
geom_cladelabel(node=119, label="Ctenophora", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=100, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=22, label="Placozoa") +
geom_cladelabel(node=93, label="Bilateria") +
geom_cladelabel(node=86, label="Cnidaria") +
geom_hilight(node=119, fill="#E69F00") +
geom_hilight(node=100, fill="#009E73") +
geom_hilight(node=22, fill="#82D1F4") +
geom_hilight(node=93, fill="#E6BEFF") +
geom_hilight(node=86, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 1), hjust=-.3, size=2) + ggtitle(" Poisson+nCAT60")
tree3_rooted = root(tree3, outgroup$relabelled_name, resolved.root=TRUE)
p3=ggtree(tree3_rooted) +
geom_cladelabel(node=100, label="Ctenophora", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=130, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=22, label="Placozoa") +
geom_cladelabel(node=93, label="Bilateria") +
geom_cladelabel(node=86, label="Cnidaria") +
geom_hilight(node=100, fill="#E69F00") +
geom_hilight(node=130, fill="#009E73") +
geom_hilight(node=22, fill="#82D1F4") +
geom_hilight(node=93, fill="#E6BEFF") +
geom_hilight(node=86, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 1), hjust=-.3, size=2) + ggtitle(" Poisson+nCAT90")
p3 = flip(p3, 100, 98)
tree4_rooted = root(tree4, outgroup$relabelled_name, resolved.root=TRUE)
p4=ggtree(tree4_rooted) +
geom_cladelabel(node=100, label="Ctenophora", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=130, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=22, label="Placozoa") +
geom_cladelabel(node=93, label="Bilateria") +
geom_cladelabel(node=86, label="Cnidaria") +
geom_hilight(node=100, fill="#E69F00") +
geom_hilight(node=130, fill="#009E73") +
geom_hilight(node=22, fill="#82D1F4") +
geom_hilight(node=93, fill="#E6BEFF") +
geom_hilight(node=86, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) <1), hjust=-.3, size=2) + ggtitle(" Poisson+CAT")
p4 = flip(p4, 100, 98)
tree5_rooted = root(tree5, outgroup$relabelled_name, resolved.root=TRUE)
p5=ggtree(tree5_rooted) +
geom_cladelabel(node=119, label="Ctenophora", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=100, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=22, label="Placozoa") +
geom_cladelabel(node=93, label="Bilateria") +
geom_cladelabel(node=86, label="Cnidaria") +
geom_hilight(node=119, fill="#E69F00") +
geom_hilight(node=100, fill="#009E73") +
geom_hilight(node=22, fill="#82D1F4") +
geom_hilight(node=93, fill="#E6BEFF") +
geom_hilight(node=86, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 1), hjust=-.3, size=2) + ggtitle(" GTR+CAT")
p=ggarrange(p1, p2, p3, p4, p5, ncol = 3, nrow = 2, labels = c("A", "B", "C", "D" , "E"), common.legend = TRUE)
ggsave("figures/Extended_Data_Fig1_raw.pdf", width=10, height=8)
```
![](figures/Extended_Data_Fig1_manuscript.png)
**Fig. S1.** Comparison of the tree topologies under different substitutional models in Whelan2017_strict data matrix. (A). WAG+C60. (B). Poisson+nCAT60. (C). Poisson+nCAT90. (D). Poisson+CAT. (E). GTR+CAT. All trees were visualized and annotated in R package ggtree [@yu2018two].
```{r figure_Philippe2009_trees,, fig.height = 8, fig.width = 10}
outgroup= taxa_map_philippe %>% filter (clade_assignment=="Choanoflagellida")
tree6_rooted = root(tree6, outgroup=outgroup$relabelled_name, resolved.root=TRUE)
p1 = ggtree(tree6_rooted) +
geom_cladelabel(node=85, label="Ctenophora") +
geom_cladelabel(node=76, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=33, label="Placozoa") +
geom_cladelabel(node=53, label="Bilateria") +
geom_cladelabel(node=51, label="Cnidaria") +
geom_hilight(node=85, fill="#E69F00") +
geom_hilight(node=76, fill="#009E73") +
geom_hilight(node=33, fill="#82D1F4") +
geom_hilight(node=53, fill="#E6BEFF") +
geom_hilight(node=51, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 95), hjust=-.3, size=2) + ggtitle(" WAG+C60")
tree7_rooted = root(tree7, outgroup$relabelled_name, resolved.root=TRUE)
p2= ggtree(tree7_rooted) +
geom_cladelabel(node=83, label="Ctenophora") +
geom_cladelabel(node=85, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=32, label="Placozoa") +
geom_cladelabel(node=58, label="Bilateria") +
geom_cladelabel(node=52, label="Cnidaria") +
geom_hilight(node=83, fill="#E69F00") +
geom_hilight(node=85, fill="#009E73") +
geom_hilight(node=32, fill="#82D1F4") +
geom_hilight(node=58, fill="#E6BEFF") +
geom_hilight(node=52, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 0.95), hjust=-.3, size=2) + ggtitle(" Poisson+nCAT60")
tree8_rooted = root(tree8, outgroup$relabelled_name, resolved.root=TRUE)
p3=ggtree(tree8_rooted) +
geom_cladelabel(node=83, label="Ctenophora") +
geom_cladelabel(node=85, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=32, label="Placozoa") +
geom_cladelabel(node=58, label="Bilateria") +
geom_cladelabel(node=52, label="Cnidaria") +
geom_hilight(node=83, fill="#E69F00") +
geom_hilight(node=85, fill="#009E73") +
geom_hilight(node=32, fill="#82D1F4") +
geom_hilight(node=58, fill="#E6BEFF") +
geom_hilight(node=52, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 0.95), hjust=-.3, size=2) + ggtitle(" Poisson+nCAT150")
tree9_rooted = root(tree9, outgroup$relabelled_name, resolved.root=TRUE)
p4=ggtree(tree9_rooted) +
geom_cladelabel(node=58, label="Ctenophora") +
geom_cladelabel(node=84, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=35, label="Placozoa") +
geom_cladelabel(node=60, label="Bilateria") +
geom_cladelabel(node=52, label="Cnidaria") +
geom_hilight(node=58, fill="#E69F00") +
geom_hilight(node=84, fill="#009E73") +
geom_hilight(node=35, fill="#82D1F4") +
geom_hilight(node=60, fill="#E6BEFF") +
geom_hilight(node=52, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 0.95), hjust=-.3, size=2) + ggtitle(" Poisson+CAT")
tree10_rooted = root(tree10, outgroup$relabelled_name, resolved.root=TRUE)
p5=ggtree(tree10_rooted) +
geom_cladelabel(node=58, label="Ctenophora") +
geom_cladelabel(node=85, label="Porifera", angle=90, hjust='center', offset.text=.01) +
geom_cladelabel(node=38, label="Placozoa") +
geom_cladelabel(node=61, label="Bilateria") +
geom_cladelabel(node=52, label="Cnidaria") +
geom_hilight(node=58, fill="#E69F00") +
geom_hilight(node=85, fill="#009E73") +
geom_hilight(node=38, fill="#82D1F4") +
geom_hilight(node=61, fill="#E6BEFF") +
geom_hilight(node=52, fill="#911EB4") +
geom_text2(aes(label=label, subset = !is.na(as.numeric(label)) & as.numeric(label) < 0.95), hjust=-.3, size=2) + ggtitle(" GTR+CAT")
p5 = flip(p5, 85, 38)
p= ggarrange(p1, p2, p3, p4, p5, ncol = 3, nrow = 2, labels = c("A", "B", "C", "D" , "E"), common.legend = TRUE)
ggsave("figures/Extended_Data_Fig2_raw.pdf", width=10, height=8)
```
![](figures/Extended_Data_Fig2_manuscript.png)
**Fig. S2.** Comparison of the tree topologies under different substitutional models in Philippe2009_Choanozoa data matrix. (A). WAG+C60. (B). Poisson+nCAT60. (C). Poisson+nCAT150. (D). Poisson+CAT. (E). GTR+CAT. All trees were visualized and annotated in R package ggtree [@yu2018two].
```{r figure_matrix_overlap, fig.height = 12, fig.width = 20}
matrix_overlap_rectangles =
lapply(
sequence_matrices[no_subsets],
function(x) lapply(sequence_matrices[no_subsets], function(y) overlap_rects(x, y))) %>%
unlist(recursive = FALSE) %>%
bind_rows()
matrix_overlap_rectangles %>%
ggplot( ) +
scale_x_continuous(name = "Genes") +
scale_y_continuous(name = "Species") +
geom_rect( mapping = aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = MSA ), alpha = 0.5) +
facet_grid( matrix_1 ~ matrix_2) +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.y = element_text(angle = 0),
strip.text.x = element_text(angle = 60))
ggsave("figures/Extended_Data_Fig3_raw.pdf", width=20, height=12)
```
**Fig. S3.** Pairwise overlap between each of the primary matrices considered here (related to Table S8). Horizontal size is proportional to the number of genes sampled, vertical size to the number of taxa sampled. The horizontal intersection shows the proportions of shared genes, the vertical intersection shows the proportions of shared taxa.
```{r figure_gene_composition,, fig.height = 4, fig.width = 10}
ribo$matrix =
factor(
ribo$matrix,
levels = c("Dunn2008", "Philippe2009", "Hejnol2009", "Nosenko2013_nonribo_9187", "Nosenko2013_ribo_11057",
"Nosenko2013_ribo_14615", "Ryan2013_est", "Moroz2014_3d", "Chang2015", "Borowiec2015_Best108",
"Borowiec2015_Total1080", "Whelan2015_D1", "Whelan2015_D10", "Whelan2015_D20", "Simion2017",
"Whelan2017_full", "Whelan2017_strict") )
p1 = ribo %>%
ggplot( aes(x = matrix, y = number_of_genes, fill = gene_category) ) +
geom_bar(stat = "identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Matrix", y = "Number of genes") +
scale_fill_manual(values = c("grey", "black"), labels = c("Non-ribosomal genes", "Ribosomal genes"))
busco$matrix =
factor(
busco$matrix,
levels = c("Dunn2008", "Philippe2009", "Hejnol2009", "Nosenko2013_nonribo_9187", "Nosenko2013_ribo_11057",
"Nosenko2013_ribo_14615", "Ryan2013_est", "Moroz2014_3d", "Chang2015", "Borowiec2015_Best108",
"Borowiec2015_Total1080", "Whelan2015_D1", "Whelan2015_D10", "Whelan2015_D20", "Simion2017",
"Whelan2017_full", "Whelan2017_strict") )
busco$gene_category = factor(busco$gene_category, levels = c("non_busco", "busco"))
p2 = busco %>%
ggplot( aes(x = matrix, y = number_of_genes, fill = gene_category) ) +
geom_bar(stat = "identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Matrix", y = "Number of genes") +
scale_fill_manual( values = c("grey", "black"), labels = c("Non-BUSCO genes", "BUSCO genes") )
ggarrange(p1, p2, labels = c("A", "B"))
ggsave("figures/Extended_Data_Fig4_raw.pdf", width=10, height=4)
```
**Fig. S4.** Annotation and representation of BUSCO and ribosomal genes in each data matrix (related to Table S8). (A). The number of partitions with ribosomal annotations in each matrix, relative to the number of partitions. (C). The number of partitions with annotations of BUSCO genes in each matrix, relative to the number of partitions.
```{r cross_validation, fig.width=6, fig.height=3.5}
p_w = cross_validation %>%
filter(matrix=="Whelan2017_strict") %>%
ggplot(aes(x=model, y=score)) +
geom_boxplot() +
ggtitle("Whelan2017") +
theme_classic()
p_p = cross_validation %>%
filter(matrix=="Philippe2009_Chanimalia") %>%
ggplot(aes(x=model, y=score)) +
geom_boxplot() +
ggtitle("Philippe2009") +
theme_classic()
ggarrange(
p_w,
p_p,
ncol=2, labels = c("A", "B"))
ggsave("figures/Extended_Data_Fig5_raw.pdf", width=6, height=7)
```
**Fig. S5.** Box plots of log likelihood score for 10-fold cross-validation between Poisson+CAT and Poisson+nCAT60 models in Whelan2017_strict and Philippe2009_Choanozoa data matrices.
```{r n_cat_render, fig.width=6, fig.height=3.5}
edf2_cols = c("#74c2e7", "#410058")
names(edf2_cols) = c("Poisson+nCAT60", "Poisson+CAT")
# Generate plots
gg_rank_phil =
pb_summaries %>%
filter( matrix=="Philippe2009" ) %>%
ggplot() +
geom_line(aes(x=rank, y=count, col=model)) +
scale_colour_manual(values = edf2_cols) +
xlab( "category rank" ) +
ylab( "site count" ) +
theme(legend.position="right") + ggtitle("Philippe2009_Choanozoa")
gg_rank_whel =
pb_summaries %>%
filter( matrix=="Whelan2017_strict" ) %>%
ggplot() +
geom_line(aes(x=rank, y=count, col=model)) +
scale_colour_manual(values = edf2_cols) +
xlab( "category rank" ) +
ylab( "site count" ) +
theme(legend.position="right")+ ggtitle("Whelan2017_strict")
gg_mds_phil =
pb_summaries %>%
filter( matrix=="Philippe2009" ) %>%
ggplot() +
geom_point(aes(x=x_global, y=y_global, size=count, col=model), alpha=0.4 ) +
scale_colour_manual(values = edf2_cols) +
xlim( -0.5, 0.5) +
ylim( -0.6, 0.6) +
xlab( "MDS1" ) +
ylab( "MDS2" ) +
theme(legend.position="right")
gg_mds_whel =
pb_summaries %>%
filter( matrix=="Whelan2017_strict" ) %>%
ggplot() +
geom_point(aes(x=x_global, y=y_global, size=count, col=model), alpha=0.4 ) +
scale_colour_manual(values = edf2_cols) +
xlim( -0.5, 0.5) +
ylim( -0.6, 0.6) +
xlab( "MDS1" ) +
ylab( "MDS2" ) +
theme(legend.position="right")
ggarrange(
gg_rank_phil,
gg_rank_whel,
ncol = 2,
labels = c("A", "B"),
common.legend = TRUE
)
ggsave("figures/Extended_Data_Fig6_raw.pdf", width=6, height=3.5)
```
**Fig. S6** The allocation of frequency categories across sites in the Philippe2009_Choanozoa matrix (left column) and Whelan2017_strict matrix (right column) for the constrained Posson+nCAT60 model and unconstrained Posson+CAT models (differentiated with color). These plots are for singe sample isolated from the end of PhyloBayes chain files. (A,B) The count of sites allocated to each equilibrium frequency category, with the categories ranked from the most abundant to least abundant along the x axis. The unconstrained CAT analyses have a long tail of categories that are allocated to very few sites, which adds a considerable number of parameters that pertain to only a small fraction of the data. The nCAT60 analyses, which are constrained to 60 sites, have no such long tail and the rarest categories are allocated to a far greater number of sites than any of the sites in the long tail of the CAT analyses.
```{r figure_phylogenetic_signal, fig.width=6, fig.height=12}
p=au_tests %>%
unite(Matrix, c(manuscript, matrix, model)) %>%
group_by(Matrix, result) %>%
count() %>%
group_by(Matrix) %>%
mutate(percent = prop.table(n)) %>%
ggplot(aes(x = Matrix, y = percent, fill = result)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#E69F00", "#ebd196", "#a2e8d5", "#009E73")) +
coord_flip() +
theme(axis.text.y = element_blank(), axis.title.y = element_blank())
ggsave("figures/Extended_Data_Fig7_raw.pdf", width=6, height=8)
```
![](figures/Extended_Data_Fig7_manuscript.png)
**Fig. S7** The distribution of phylogenetic signal for Ctenophora-sister and Porifera-sister with different models and outgroup choice in Philippe2009, Ryan2013 and Whelan2017_full matrices (with two outgroup sampling: Choanozoa and full matrices). The two alternative topological hypotheses are: Ctenophora-sister; T1 (Orange); Porifera-sister T2 (Green). Proportions of genes supporting each of two alternative hypotheses in the Philippe2009, Ryan2013_est and Whelan2017_full data matrices with different outgroups sampling and substitutional models. The GLS values for each gene in each data matrix are provided in Table S5. We considered a gene with an absolute value of log-likelihood difference of two as a gene with strong (|ΔlnL| $\geq$ 2) or weak ( 0 < |ΔlnL| < 2) phylogenetic signal.
![](figures/Extended_Data_Fig8.png)
**Fig. S8.** Graphical representations of the posterior probabilities for the 32 analyses presented by Feuda *et al.* in their Table 3 and the subset of eight GTR+CAT analyses with posterior predictive (PP) scores that is the focus of Feuda *et al.*'s primary conclusions. (A). Cells are color coded by whether posterior probability is > 95 for Porifera-sister, > 95 for Ctenophora-sister, or neither (unresolved). Posterior predictive (PP) statistics were estimated for the 16 analyses in the top two rows of this figure (the Chang and full Whelan matrices), but not the bottom two (the Whelan matrices with reduced outgroup sampling). (B). These are a subset of the 32 analyses presented in their Table 3 and graphically here in the upper right pane. The eight analyses are for two datasets (Chang and Whelan) and four coding schemes. The coding schemes are the original 20 state amino acid data, and three different six state recodings that group amino acids based on different criteria: KGB6, D6, and SR6. Only one of these analyses, the SR6 coding of the Whelan matrix, has > 95 support for Porifera-sister. The 20-state and 6-state points on the plots in Fig. S8A correspond to the 20 and SR6 Whelan cells shown here. The presented statistics for these cells are posterior probability of Porifera-sister, Maxdiff (with lower scores indicating better convergence of runs), and five posterior predictive statistics (where lower absolute value indicates better model adequacy). The only one of these eight analyses that provides strong support for Porifera-sister is not the most adequate analysis by any of the posterior predictive scores, and showed the poorest convergence according to Maxdiff.
![](figures/Extended_Data_Fig9.png)
**Fig. S9.** Each of the six plots presents one statistic, which include Posterior probability of Porifera-sister and the five Posterior Predictive (PP) statistics considered by Feuda *et al.* Within each plot, there are six lines for six different analyses. These five analyses are the published SR-6 analyses presented by Feuda *et al.* (SR6-published), and four analyses based on randomized recoding matrices obtained by shuffling the SR-6 coding scheme (random-00 - random-03). Each analysis includes results for 20 states (the raw amino acid data, shown by the left point) and for 6 states (the 6-state recoded data, shown by the right point). For each statistic, the results obtained with the random recoding are similar to those of the SR6 recoding. This indicates that the impact of recoding is dominated by discarding data when collapsing from 20 states to 6 states, not accommodating compositional hetezrogeneity across lineages.
\pagebreak
## Supplementary text
### S1 Summaries of published analyses
Narrative summaries of the studies considered here.
#### Dunn *et al.* 2008
Dunn *et al.* [@Dunn:2008ky] added Expressed Sequence Tag (EST) data for 29 animals. It was the first phylogenomic analysis that included ctenophores, and therefore that could test the relationships of both Ctenophora and Porifera to the rest of animals. It was also the first phylogenetic analysis to recover Ctenophora as the sister group to all other animals.
The data matrix was constructed using a semi-automated approach. Genes were translated into proteins, promiscuous domains were masked, all gene sequences from all species were compared to each other with blastp, genes were clustered based on this similarity with TribeMCL [@Enright:2002uq], and these clusters were filtered to remove those with poor taxon sampling and high rates of lineage-specific duplications. Gene trees were then constructed, and in clades of sequences all from the same species all but one sequence were removed (these groups are often due to assembly errors). The remaining gene trees with more than one sequence for any taxon were then manually inspected. If strongly supported deep nodes indicative of paralogy were found, the entire gene was discarded. If the duplications for a small number of taxa were unresolved, all genes from those taxa were excluded. Genes were then realigned and sites were filtered with Gblocks [@Castresana:2000vy], resulting in a 77 taxon matrix. Some taxa in this matrix were quite unstable, which obscured other strongly-supported relationships. Unstable taxa were identified with leaf stability indices [@Thorley:1999kg], as implemented in phyutility [@Smith:2008gb], and removed from the matrix. This resulted in the 64-taxon matrix that is the focus of most of their analyses. Phylogenetic analyses were conducted under the Poisson+CAT model in PhyloBayes, and under the WAG model in MrBayes [@Ronquist:2003hx] and RAxML [@Stamatakis:2006wc].
Regarding the recovery of Ctenophora-sister, the authors concluded:
> The placement of ctenophores (comb jellies) as the sister group to all other sampled metazoans is strongly supported in all our analyses. This result, which has not been postulated before, should be viewed as provisional until more data are considered from placozoans and additional sponges.
Note that there was, in fact, an exception to strong support. An analysis of the 40 ribosomal proteins in the matrix recovered Ctenophora-sister with only 69% support. This study did not include the placozoan ingroup *Trichoplax*.
#### Philippe *et al.* 2009
Philippe *et al.* [@Philippe:2009hh] assembled an EST dataset for 55 species with 128 genes to explore the deepest animal phylogenetic relationships by adding 9 new species. The data matrix was assembled based on the phylogenetic analysis using Poisson+CAT model, which strongly supported Porifera-sister, and ctenophores were recovered as sister to cnidarians forming the "coelenterate" clade. Gene trees were then constructed, and potentially paralogs were removed by a bootstrap threshold of 70. Ambiguously aligned regions were trimmed and only genes sampled for at least two-thirds of species were retained. The phylogenetic analyses were conducted under the Poisson+CAT model in PhyloBayes.
Regarding the recovery of Ctenophora-sister, the authors concluded:
> The resulting phylogeny yields two significant conclusions reviving old views that have been challenged in the molecular era: (1) that the sponges (Porifera) are monophyletic and not paraphyletic as repeatedly proposed, thus undermining the idea that ancestral metazoans had a sponge-like body plan; (2) that the most likely position for the ctenophores is together with the cnidarians in a 'coelenterate' clade.
#### Hejnol *et al.* 2009
Hejnol *et al.* [@hejnol2009assessing] added EST sequences from seven taxa, and a total of 94 taxa were included in the final data matrix to explore animal phylogeny, especially the position of acoelomorph flatworms. The orthology inference was largely similar to Dunn *et al.* 2008, with the exception of orthology genes which were clustered by MCL. The final data matrix included 1497 genes, and then subsampled with 844, 330 and 53 gens by different thresholds of gene occupancy. With the exception of 53 gene matrix, maximum likelihood analyses from all other datasets strongly supported Ctenophora-sister (models were selected by RaxML perl script).
#### Pick *et al.* 2010
Pick *et al.* [@Pick:2010eb] sought to test whether Ctenophora-sister was an artefact of insufficient taxon sampling. They added new and additional published sequence data to the 64-taxon matrix of Dunn *et al.* [@Dunn:2008ky]. The new taxa included 12 sponges, 1 ctenophore, 5 cnidarians, and *Trichoplax*. They further modified the matrix by removing 2,150 sites that were poorly sampled or aligned. They considered two different sets of outgroups: Choanoflagellatea (resulting in Choanozoa) and the same sampling as Dunn *et al.* (resulting in Opisthokonta).
All their analyses were conducted under the F81+CAT+Gamma model in PhyloBayes, in both a Bayesian framework and with bootstrapping. All analyses have the same ingroup sampling and site removal so it isn't possible to independently assess the impact of these factors. Analyses with Choanozoa sampling recovered Porifera-sister with 72% posterior probability (PP) and 91% bootstrap support (BS). With broader Opisthokonta sampling, support for Porifera-sister is 84% PP. This is an interesting case where increased outgroup sampling leads to increased support for Porifera-sister.
The authors argue that previous results supporting Ctenophora-sister "are artifacts stemming from insufficient taxon sampling and long-branch attraction (LBA)" and that "this hypothesis should be rejected". Although the posterior probabilities supporting Porifera-sister are not strong, they conclude:
> Results of our analyses indicate that sponges are the sister group to the remaining Metazoa, and Placozoa are sister group to the Bilateria
They also investigated saturation, and concluded that the Dunn *et al.* [@Dunn:2008ky] data matrix is more saturated than Philippe *et al.* 2009 [Philippe:2009hh]. Note that the Pick *et al.* [@Pick:2010eb] dataset is not reanalyzed here because partition data are not available, and due to site filtering the partition file from Dunn *et al.* [@Dunn:2008ky] cannot be applied to this matrix.
#### Nosenko *et al.* 2013
Nosenko *et al.* [@nosenko2013deep] added Expressed Sequence Tag (EST) data for 9 species of non-bilaterian metazoans (7 sponges). They constructed a novel matrix containing 122 genes and parsed them into two non-overlapping matrices (ribosomal and non-ribosomal genes) and found incongruent results of deep metazoan phylogeny. The other major finding was that ribosomal gene partitions showed significantly lower saturation than the non-ribosomal ones.
Orthologs were constructed using the bioinformatics pipeline OrthoSelect [@schreiber2009orthoselect]. They also evaluated level of saturation, leaf stability of sampled taxa, compositional heterogeneity and model comparison of each matrix. By modifying gene sampling, ingroup and outgroup sampling, three major topologies related to the position of animal-root were constructed (including Porifera+Placozoa sister, Ctenophora-sister and Porifera-sister). Phylogenetic analyses were conducted under the Poisson+CAT, GTR+CAT and GTR models in PhyloBayes.
Regarding the recovery of Ctenophora-sister, the authors concluded:
> we were able to reconstruct a metazoan phylogeny that is consistent with traditional, morphology-based views on the phylogeny of non-bilaterian metazoans, including monophyletic Porifera and ctenophores as a sister-group of cnidarians.
#### Ryan *et al.* 2013
Ryan *et al.* sequenced the first ctenophore genome of *Mnemiopsis leidyi*. With the genome resources of *M. leidyi*, the authors constructed two phylogenomic datasets: a "Genome set" based on 13 animal genomes and a "EST Set" that also included 59 animals. They analyzed both matrices by site-homogeneous GTR+Gamma and site-heterogeneous Poisson+CAT models with three sets of outgroup sampling to evaluate the effect of outgroup selection to the ingroup topology for the Ryan2013_est matrix. The Orthologs were constructed based on the method of Hejnol *et al.* 2009. For the Ryan2013_genome matrix, they performed phylogenetic analyses with both gene content and sequence-based analyses. Overall, their results strongly supported Ctenophora-sister in all datasets they analyzed using a site-homogeneous model. The Poisson+CAT model of the genome dataset strongly supported of a clade of Ctenophora and Porifera as the sister group to all other Metazoa and Bayesian analysis on the EST dataset did not converge after 205 days (but strongly supported Porifera in Choaimalia matrix).
Regarding the recovery of Ctenophora-sister, the authors concluded:
> Our phylogenetic analyses suggest that ctenophores are the sister group to the rest of the extant animals.
#### Moroz *et al.* 2014
Moroz *et al.* [@moroz2014ctenophore] sequenced the second ctenophore genome *Pleurobrachia bachei* to explore the phylogenetic relationship of Metazoa. All phylogenetic analyses strongly supported Ctenophora-sister with different taxon and gene sampling using WAG site-homogeneous model. Two phylogenomic matrices were generated, the first set was represented by two ctenophore species, whereas the other set contained improved ctenophore sampling (10 taxa, Moroz2013_3d). Orthology determination employed in HaMStR [@ebersberger2009hamstr] using 1,032 "model organism" single-copy orthologs. Sequence were then trimmed and aligned. This resulted in a final matrix of 170,871 amino acid positions across 586 genes with 44 taxa for the first matrix, and 114 genes with 60 taxa for the second matrix. All the phylogenetic analyses were analyzed in RAxML under the WAG+CAT+F models (different from CAT models in PhyloBayes) to reduce the computational cost.
Regarding the recovery of Ctenophora-sister, the authors concluded:
>Our integrative analyses place Ctenophora as the earliest lineage within Metazoa. This hypothesis is supported by comparative analysis of multiple gene families, including the apparent absence of HOX genes, canonical microRNA machinery, and reduced immune complement in ctenophores.
It should be noted that only the Moroz_3d matrix has been reanalyzed in other studies, although the support of Ctenophora-sister is quite low.
#### Borowiec *et al.* 2015
Borowiec *et al.* [@borowiec2015extracting] assembled a genome dataset comprising 1080 orthologs derived from 36 publicly available genomes representing major lineages of animals, although only one genome of sponge and ctenophore was included. The orthologs were constructed using OrthologID pipeline [@chiu2006orthologid]. After removal of spurious sequences and genes with more than 40% of mission data, the final matrix included 1080 (Total 1080) genes. The authors further filtered the full dataset to 9 sub-datasets by filtering genes with high long-branch scores; genes with high saturation; gene occupancy; fast evolving genes. The main conclusion of the study was largely based on BorowiecTotal_1080 and Borowiec_Best108 matrices. Phylogenetic analyses were conducted under the GTR+CAT model in PhyloBayes in selected matrices, and under the data-partitioning methods in RAxML for all matrices.
Regarding the recovery of Ctenophora-sister, the authors concluded:
>Our phylogeny
supports the still-controversial position of ctenophores as sister group to all other metazoans. This study also
provides a workflow and computational tools for minimizing systematic bias in genome-based phylogenetic
analyses.
It should be noted that the authors also employed recoding-method in the Borowiec_Best108 matrix and found neither support of Porifera-sister or Ctenophora-sister [@borowiec2015extracting].
#### Whelan et al. 2015
Whelan *et al.* 2015 [@whelan2015error] constructed a new phylogenomic data matrix with eight new transcriptomic data and investigated a range of possible sources of systematic error under multiple analyses (*e.g.* long-branch attraction, compositional bias, fast evolving genes, etc.). Putative orthologs were determined of each species using HaMStR using the model organism core ortholog set (same as Moroz *et al.* 2014) and subsequently removal of genes with too much missing data and potential paralogs. The authors further filtered the full dataset to 24 sub-datasets by filtering genes with high long-branch scores; genes with high RSFV values; genes that are potential paralogs; fast evolving genes and progressive removal of outgroups. All the maximum likelihood analyses with site-homogeneous models and PartitionFinder strongly suggested Ctenophora-sister. GTR+CAT models only used in slow-evolving data matrices 6 and 16 also strongly supported Ctenophora.
Regarding the recovery of Ctenophora-sister, the authors concluded:
> Importantly, biases resulting from elevated compositional heterogeneity or elevated substitution rates are ruled out. Placement of ctenophores as
sister to all other animals, and sponge monophyly, are strongly supported under multiple analyses, herein.
Note that the authors also reanalyzed Philippe2009 matrix (with the removal of ribosomal genes) and recovered Porifera-sister with moderate support (pp=90).
#### Chang et al. 2015