-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathpangene.1
352 lines (348 loc) · 7.37 KB
/
pangene.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
.TH pangene 1 "22 February 2024" "pangene-1.1 (r231)" "Bioinformatics tools"
.SH NAME
.PP
pangene - building pangenome gene graphs
.SH SYNOPSIS
.TP 2
*
Align one set of proteins to multiple genomes with miniprot:
.RS 4
.B miniprot
--outs=0.97 --no-cs -Iut16
.I genome1.fna proteins.faa
>
.I genome1.paf
.br
.B miniprot
--outs=0.97 --no-cs -Iut16
.I genome2.fna proteins.faa
>
.I genome2.paf
.RE
.TP
*
Construct a pangenome gene graph:
.RS 4
.B pangene
.RB [ -e
.IR minIdentity ]
.RB [ -d
.IR delimGene ]
.RB [ -a
.IR minArcCount ]
.I genome1.paf genome2.paf
>
.I graph.gfa
.RE
.TP
*
Extract a subgraph around a gene:
.RS 4
.B gfatools
view
.B -wl
.IR geneName1 , geneName2
.RB [ -r
.IR radius ]
.I graph.gfa
>
.I subgraph.gfa
.RE
.SH DESCRIPTION
.PP
To construct a pangene graph, we need to prepare one set of proteins and
multiple genomes. If a gene has multiple isoforms, it is recommended to name
each isoform as
.IR geneName : proteinID .
Pangene will automatically select one protein as the primary isoform and use
.I geneName
in the GFA output. Pangene also identifies and filters out redundant proteins.
The quality of input proteins will affect the quality of the output graph.
.PP
To prepare the input for pangene, align the same set of proteins to each genome
individually and generate one alignment file per genome. Pangene then takes
these alignment files and construct a gene graph. At present, pangene only
works with
.BR miniprot 's
PAF output. Other formats might be supported later.
.SH OPTIONS
.SS Input preprocessing
.TP 10
.BI -d \ CHAR
Gene-protein delimeter [:]. This option tells pangene how to identify the gene
name in a protein identifier. If a gene has multiple isoforms, pangene will
later choose one primary isoform per gene. If pangene cannot find the delimiter,
it will use the protein name as the gene name.
.TP
.BI -X \ STR
Exclude genes in the
.I STR
list [].
.I STR
can be a comma separated string, or a file if it starts with symbol `@'.
.TP
.BI -I \ STR
Include genes in the
.I STR
list in the output graph [].
.TP
.BI -P \ STR
Prioritize genes in the
.I STR
list in the output graph []. When pangene selects genes as vertices in the
graph, these genes will get higher priority.
.TP
.BI -e \ FLOAT
Filter out an alignment if its identity below
.I FLOAT
[0.5]
.TP
.BI -l \ FLOAT
Filter out an alignment if less than
.I FLOAT
fraction of the protein is aligned [0.5]
.TP
.BI -m \ FLOAT
Score adjustment coefficient [2]. The adjusted score is equal to
.IR S *exp( FLOAT *( d + c ))
where
.I S
is the original alignment score, where
.IR d =1- identity
and
.IR c =1- coveredFraction .
.SS Graph construction
.TP 10
.BI -f \ FLOAT
Two alignments are considered to overlap with each other if their intersection
on the genome is at least
.I FLOAT
fraction of the shorter alignment [0.5]
.TP
.BI -J
Do not attempt to identify pseudogenes jointly across samples.
.TP
.BI -E
ignore genes that are single-exon in all genomes
.TP
.BI -p \ FLOAT
A gene is considered a segment in the graph if it is
.I dominant
in at least
.I FLOAT
fraction of the genomes [0.05]. A gene is
.I dominant
if its best alignment is better than other overlapping genes.
.TP
.BI -c \ INT
Drop a gene if its average occurrence among input genomes is above
.I INT
[10]
.TP
.BI -g \ INT
Drop a gene if its in- or out-degree in the initial graph is higher than
.I INT
[15]
.TP
.BI -r \ INT
Drop a gene if it connects more than
.I INT
loci that are distant from each other [3]
.TP
.BI -b \ FLOAT
Demote a branching arc if it is weaker than the best branching arc by
.I FLOAT
fraction [0.02]. If there exist arcs
.IR A -> B
and
.IR A -> C ,
both arcs are
.IR branching\ arcs .
Pangene will demote
.IR A -> B
if the alignment score of
.I A
is weaker by the alignment score of
.I A
on the
.IR A -> C
arc by
.IR FLOAT .
.TP
.BI -B \ FLOAT
Cut a branching arc if it is weaker than the best branching arc by
.I FLOAT
fraction [0.5]. This option is similar to
.B -b
but it cuts arcs.
.TP
.BI -y \ FLOAT
Cut a distant branching arc if it is weaker by the best branching arc by
.I FLOAT
fraction [0.05]. If there exist arcs
.IR A -> B
and
.IR A -> C
and
.I B
and
.I C
are not local, this option may cut the weaker branch.
.TP
.BI -T \ INT
Apply branch cutting for
.I INT
rounds [15]
.TP
.B -F
Don't consider genes on different contigs to be distant from each other. Useful
for fragmented short-read assemblies.
.TP
.BI -D \ NUM
Two genes are considered local if they are at most
.I NUM
bp apart [2M]. This and option
.B -C
work together with
.BR -y .
Non-local branching edges will be cut.
.TP
.BI -C \ INT
Two genes are considered local if there are no more than
.I INT
unfiltered genes [20]
.TP
.BI -a \ INT
Prune an arc if it is supported by less than
.I INT
number of genomes [1]
.TP
.BI --ori-sc
Use original alignment score during branching arc pruning. By default, pangene
recalculates the score using overlapping proteins that are not selected as
vertices. The default behavior helps cross-species pangene graph construction.
.SS Output options
.TP 10
.B -w
Suppress walk lines (W-lines)
.TP
.BI --bed[= STR ]
Output BED12 format where
.I STR
can be
.BR walk ,
.B raw
or
.B flag
[walk].
This option is mostly debugging:
.B raw
for alignments before graph construction;
.B flag
for alignments after graph construction with filter tags;
.B walk
for alignments used on walk lines only.
.TP
.B --version
Output version number
.SH OUTPUT FORMAT
.SS The GFA Format
Pangene outputs graphs in the GFA format consisting of three line types with
optional tags at the end. The follow table describes the dialect the pangene
uses. Note that this is not a description of generic GFA.
.TS
center box;
cb | cb | cb
c | l | l .
Line Col Description
_
S 1 Gene name
2 `*'
_
L 1 Gene 1
2 Orientation 1
3 Gene 2
4 Orientation 2
5 CIGAR [0M]
_
W 1 Index of input genome
2 `0'
3 Contig name
4 `*'
5 `*'
6 Walk: ([><]gene)+
.TE
.PP
Pangene may output the following GFA tags:
.TS
center box;
cb | cb | cb | cb
c | c | c | l .
Line Tag Type Description
_
S LN i Length of the primary protein
ng i # genomes haboring the gene
nc i Sum of occurrences
c1 i # genomes where the gene is dominant
c2 i # genomes where the gene is not dominant
pp Z protein sequence name of the primary isoform
_
L ng i # genomes having the arc
nc i Sum of occurrences
ad i Average distance on genomes
s1 i Average alignment score of the 1st gene
s2 i Average alignment score of the 2nd gene
.TE
.SS The BED Format
With option
.BR --bed ,
pangene may output the alignments in the 12-column BED format mostly for
debugging purposes.
.TS
center box;
cb | cb | cb
r | l | l .
Col Type Description
_
1 String Contig name
2 Integer Contig start
3 Integer Contig end
4 String Protein sequence name
5 Integer Alignment score (ms in miniprot)
6 Char `+' or `-'
7 Integer Same as col 2
8 Integer Same as col 3
9 String `0'
10 Integer # exons
11 String Exon lengths
12 String Exon starts relative to col 2
.TE
.PP
Pangene may output the following tags following the first 12 columns:
.TS
center box;
cb | cb | cb
c | c | l .
Tag Type Description
_
rk i Rank in the input PAF
re i Representative isoform or not
sd i Shadowed or not
vt i Selected as a GFA segment or not
ps i Pseudogene or not
br i Filtered by branching or not
cm i Position of the middle amino acid
id f Protein identity
.TE
.SH LIMITATIONS
.TP 2
*
Pangene only works with
.BR miniprot 's
PAF output.
.TP
*
In the output graph, arcs on W-lines may be absent from L-lines.
.SH SEE ALSO
.BR miniprot (1),
.BR gfatools (1)