\documentclass[11pt]{article}
\usepackage{acl2010}
\usepackage{times}
\usepackage{url}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{multirow}
\usepackage{rotating}
%\setlength\titlebox{6.5cm}
% You can expand the title box if you really have to
\DeclareMathOperator*{\argmax}{arg\,max}
\newcommand{\mnote}[1]{\marginpar{%
\vskip-\baselineskip
\raggedright\footnotesize
\itshape\hrule\smallskip\footnotesize{#1}\par\smallskip\hrule}}
\title{Learning Structural and Sentential Paraphrases from Parallel Corpora}
\author{Juri Ganitkevitch \and Chris Callison-Burch\\
Center for Language and Speech Processing\\
Johns Hopkins University}
\date{}
\begin{document}
\maketitle
\begin{abstract}
Paraphrase extraction approaches based on richly annotated
monolingual parallel corpora yield linguistically informed paraphrase
patterns, but suffer from the scarcity and noisiness of such
corpora. Pivot-based extraction from bilingual parallel corpora
offers better precision and recall, yet has largely been limited to
lexical paraphrases and unconstrained patterns. We combine the
strengths of both lines of work by extending the pivot approach to a
syntactically annotated translation model, obtaining a fully
syntactic synchronous paraphrase grammar that captures structural
transformations such as passivization and dative shift. We further
show how the grammar's log-linear model can be tuned with minimum
error rate training and a paraphrase-specific metric to adapt the
system to a concrete task, the expansion of reference sets for
statistical machine translation.
\end{abstract}
\section{Introduction} \label{introduction}
Paraphrases are alternative ways to express the same meaning. The
automated generation and detection of paraphrases is crucial to many
tasks in NLP and has been shown to improve performance in others. In
multi-document summarization systems, paraphrase detection can be used
to recognize and collapse redundancies \cite{Barzilay1999}. Used for
query expansion, paraphrasing improves the performance of question
answering \cite{Ravichandran2002,Riezler2007} and information
retrieval systems \cite{Anick1999}. In statistical machine translation (SMT),
paraphrases have been used to extend phrase tables in order to achieve
broader source language coverage \cite{Callison-Burch2006b}, as well
as to automatically generate additional reference translations,
leading to better results in minimum error rate training
\cite{Madnani2007}.
The notion of a paraphrase typically denotes a set of surface text
forms with the same meaning:
\begin{center}
\begin{tabular}{c}
the committee's proposal \\
the proposal of the committee
\end{tabular}
\end{center}
Lexical paraphrases are commonly generalized into paraphrase patterns
by introducing non-terminals, or \emph{slots}, as in:
\begin{center}
\begin{tabular}{c}
the $X_1$'s $X_2$ \\
the $X_2$ of the $X_1$
\end{tabular}
\end{center}
It is evident that, compared to plain phrases, paraphrase patterns
have far greater potential for generalization. However, the
unconstrained nature of the slots also carries the risk of producing
nonsensical or ungrammatical output. Thus, to achieve better-formed
results, the slots are typically augmented with syntactic
constraints, yielding syntactic paraphrases:
\begin{center}
\begin{tabular}{c}
the $\mathit{NP}_1$'s $\mathit{NP}_2$ \\
the $\mathit{NP}_2$ of the $\mathit{NP}_1$
\end{tabular}
\end{center}
Syntactic paraphrases retain the generalization capabilities of
unconstrained patterns, while their application is more controlled
and yields higher-quality output.
\mnote{Sentential paraphrases, and all the transforms we want to
  learn. Unclear whether we can get them from bitexts; let's find out.}
In this paper, we demonstrate the integration of syntactic paraphrases
into an SMT-based paraphrasing system. The use of SMT machinery and
models to extract and apply paraphrases from bilingual parallel
corpora by pivoting over foreign language phrases is a well-known
technique. \newcite{Zhao2008b} have presented results that indicate
that bilingually sourced paraphrase extraction significantly
outperforms monolingual approaches in both recall and
precision. However, while linguistically informed paraphrase patterns
are commonplace in monolingually sourced paraphrase extraction,
SMT-based paraphrase systems have mostly been limited to lexical
paraphrases \cite{Quirk2004,Callison-Burch2005} or unconstrained
patterns \cite{Madnani2007}. In recent efforts to enrich pivot-based
paraphrase extraction with linguistic information,
\newcite{Callison-Burch2008} showed that enforcing syntactic
constraints can drastically improve the quality of a lexical
paraphrase system, and \newcite{Zhao2008} used dependency parses on
the English side of a large bilingual corpus to produce syntactic
paraphrase patterns. Our approach goes beyond both and produces a
fully syntactically informed synchronous paraphrase grammar.
An additional focus of this paper is the closely tied pair of topics
of model parameter estimation and system evaluation. In practice, the
role of paraphrasing as a component of larger systems makes
meaningful standalone evaluation difficult: in many applications of
paraphrasing, preserving meaning and producing grammatical output is
only a baseline requirement. Additional constraints imposed by the
task at hand, be it shortening the input for sentence compression or
producing significantly different output when generating additional
references, are crucial to the system's quality. However, this is
often not reflected when estimating the parameters of a paraphrasing
system. We leverage the rich feature set of our approach and the
flexibility of minimum error rate training to specifically adapt our
paraphrase system to the task of reference expansion for SMT, and
demonstrate how it outperforms the commonly used approach.
This paper is structured as follows: we present related work on
paraphrase acquisition in
Section~\ref{related_work}. Section~\ref{formalism} introduces the
machine translation model used in our work. In
Section~\ref{acquisition} we describe our paraphrase acquisition
approach. Sections~\ref{analysis} and \ref{adaptation} present an
analysis of the extracted grammars and our parameter estimation
setup, along with experimental results. Finally, we conclude in
Section~\ref{conclusion}.
\section{Related Work} \label{related_work}
The paraphrase extraction task has received a fair amount of attention
in the past. In this paper we extend a method for extracting
paraphrases from bilingual parallel corpora that was introduced by
\newcite{Callison-Burch2005}. \newcite{Callison-Burch2005} generate
paraphrases using techniques from {\it phrase-based} statistical
machine translation \cite{Koehn2003}. After extracting a bilingual
phrase table, English paraphrases can be extracted by pivoting through
foreign language phrases. The phrase table contains phrase pairs
$(e, f)$ (where $e$ and $f$ stand for English and foreign phrases,
respectively) as well as bi-directional translation probabilities
$p(f | e)$ and $p(e | f)$, computed from the relative frequencies of
$e$ and $f$. From this table, they compute a paraphrasing phrase table
with entries $(e, e')$. Since many paraphrases can be extracted for a
phrase, \newcite{Callison-Burch2005} define a paraphrase probability
to rank them:
% \begin{equation}
% \hat{e_2} = \argmax_{e_2:e_2 \neq e_1} p(e_2|e_1)
% \end{equation}
% where
\begin{eqnarray}
p(e_2|e_1) &=& \sum_f p(e_2,f|e_1)\\
&=& \sum_f p(e_2|f,e_1) p(f|e_1) \\
&\approx& \sum_f p(e_2|f) p(f|e_1)
\label{paraphrase_prob_eqn}
\end{eqnarray}
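To make the pivot computation concrete, the marginalization in
Equation~\ref{paraphrase_prob_eqn} can be sketched as follows (a toy
illustration over nested probability dictionaries; the function and
variable names are ours, not \newcite{Callison-Burch2005}'s actual
implementation):
\begin{verbatim}
from collections import defaultdict

def paraphrase_probs(e1, p_f_given_e, p_e_given_f):
    """Approximate p(e2|e1) by summing
    p(e2|f) * p(f|e1) over pivot phrases f."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e[e1].items():
        for e2, p_e2 in p_e_given_f[f].items():
            if e2 != e1:
                scores[e2] += p_e2 * p_f
    return dict(scores)
\end{verbatim}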
Additionally, the resulting paraphrases are further re-ranked using
contextual features such as a language model.
\newcite{Callison-Burch2008} improves on this method by requiring the
paraphrases to be governed by the same constituent, resulting in less
noisy and more grammatical paraphrases.
\newcite{Madnani2007} apply the pivot technique to {\it hierarchical}
phrase-based machine translation \cite{Chiang2005}. Hierarchical
phrases contain a variable ``X'' that allows for slot-fillers.
\newcite{Madnani2007}'s paraphrase table contains these slotted
patterns as well. Because hierarchical phrase-based machine
translation is formally a synchronous context free grammar (SCFG),
\newcite{Madnani2007}'s paraphrase table can be thought of as a {\it
paraphrase grammar}. Their paraphrase grammar can paraphrase (or
``decode'') input sentences using an SCFG decoder, like the Hiero MT
system. \newcite{Madnani2007} mirror Hiero's log-linear model and its
feature set. The parameters for the log-linear model are estimated
using minimum error rate training, maximizing the BLEU metric on a set
of parallel English sentences. The authors report significant gains in
translation quality when using additional references generated by
paraphrasing to tune a machine translation system.
\newcite{Zhao2008} further enrich the pivot-based paraphrase approach with
syntactic information by extracting partial subtrees from a
dependency-parsed English side of a bitext and pivoting over the
corresponding Chinese phrases to extract paraphrases. The slots in the
resulting patterns are labeled with part-of-speech tags (but not
larger syntactic constituents). Their system also employs a log-linear
model that combines translation and lexical probabilities and is tuned
to maximize precision over a hand-labeled set of paraphrases.
Several research efforts have leveraged monolingual parallel corpora;
however, all of them suffer from the scarcity and noisiness of such
corpora. \newcite{Dolan2004} work around this issue by extracting
parallel sentences from the vast amount of freely available
comparable English text, applying machine translation techniques to
create a paraphrasing system \cite{Quirk2004}. However, the
word-based translation model and monotone decoder they use result in
a substantial number of identity paraphrases and single-word
substitutions.
Relying on small data sets of semantically equivalent translations,
\newcite{Pang2003} created finite state automata by syntax-aligning
parallel sentences, enabling the generation of additional reference
translations.
Both \newcite{Barzilay2001} and \newcite{Ibrahim2003} sentence-align
existing noisy parallel monolingual corpora such as translations of
the same novels. While \newcite{Ibrahim2003} employ a set of
heuristics that rely on anchor words identified by textual identity
or matching linguistic features such as gender, number or semantic class,
\newcite{Barzilay2001} use a co-training approach that leverages
context similarity to identify viable paraphrases.
Semantic parallelism is well established as a strong basis for
extracting correspondences such as paraphrases. However, there are
notable efforts that choose to forgo it in favor of clustering
approaches based on distributional characteristics. The well-known
DIRT method by \newcite{Lin2001} fully relies on distributional
similarity features for paraphrase extraction. Patterns extracted from
paths in dependency graphs are clustered based on the similarity of
the observed contents of their slots.
Similarly, \newcite{Bhagat2008} argue that vast amounts of text can be
leveraged to make up for the relative weakness of distributional
features compared to parallelism. They also forgo complex annotations
such as syntactic or dependency parses, relying only on part-of-speech
tags to inform their approach. Relations are learned by growing
pattern clusters seeded with already-known patterns. However, unlike
the approach presented here, this method is not capable of producing
syntactically labeled structural paraphrases.
\section{SCFGs in Translation} \label{formalism}
Like \newcite{Madnani2007}, our paraphrasing system relies heavily on
established statistical machine translation machinery built around
synchronous grammars. The model we use in our approach is
\emph{Syntax Augmented Machine Translation}, or \emph{SAMT}
\cite{Zollmann2006}, which is based on the \emph{synchronous
context-free grammar} (SCFG) formalism and allows for an arbitrary
number of non-terminals.
Formally, a probabilistic SCFG (PSCFG) $\mathcal{G}$ is defined by specifying
\[
\mathcal{G} = \langle \mathcal{N}, \mathcal{T}_S, \mathcal{T}_T,
\mathcal{R}, S \rangle ,
\]
where $\mathcal{N}$ is a set of non-terminal symbols, $\mathcal{T}_S$
and $\mathcal{T}_T$ are the source and target language vocabularies,
$\mathcal{R}$ is a set of rules and $S \in \mathcal{N}$ is the root
symbol. The rules in $\mathcal{R}$ take the form
\begin{equation*}
C \rightarrow \langle \gamma, \alpha, \sim, w \rangle ,
\end{equation*}
where the rule's left-hand side $C \in \mathcal{N}$ is a non-terminal,
$\gamma \in (\mathcal{N} \cup \mathcal{T}_S)^*$ and $\alpha \in
(\mathcal{N} \cup \mathcal{T}_T)^*$ are strings of terminal and
non-terminal symbols with an equal number of non-terminals
$c_{\mathit{NT}}(\gamma) = c_{\mathit{NT}}(\alpha)$ and
$$
\sim : \{1 \ldots c_{\mathit{NT}}(\gamma)\} \rightarrow \{1 \ldots
c_{\mathit{NT}}(\alpha)\}
$$
constitutes a one-to-one correspondence between the non-terminals in
$\gamma$ and $\alpha$. Each rule is assigned a weight $w \geq 0$,
reflecting its likelihood.
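As an illustration (a constructed example rather than a rule from an
actual grammar), a French--English rule for the possessive
construction from Section~\ref{introduction} could take the form
\begin{equation*}
\mathit{NP} \rightarrow \langle \textrm{la } \mathit{NN}_1 \textrm{
  de la } \mathit{NNP}_2 , ~ \textrm{the } \mathit{NNP}_2\textrm{'s }
\mathit{NN}_1 , ~ \sim , ~ w \rangle ,
\end{equation*}
with $\sim~ = \{1 \mapsto 2, 2 \mapsto 1\}$ recording that the two
non-terminals swap positions across the language sides.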
\begin{figure*}[t]
\begin{center}
\includegraphics[width=0.99\linewidth]{figures/scfg_grid_revamp.pdf}
\end{center}
\caption{Extraction of SAMT rules from a word-aligned phrase pair
  with a target-side syntactic parse.}
\label{samt_extraction}
\end{figure*}
The rule set for an SAMT grammar is extracted from word-aligned
sentence-parallel corpora where the target language side is annotated
with syntactic parses. Figure~\ref{samt_extraction} shows an example
of rules obtained for one phrase pair. To extract a rule, we first
choose a target span $t$. The left-hand side of the rule is then given
by the syntactic constituent governing $t$. In addition to single
constituent non-terminals, SAMT allows for the concatenation of
constituents as well as for CCG-style non-terminal labels
\cite{Steedman1999}, which increases the grammar's coverage by
allowing for non-syntactic phrases in the grammar. The source side of
the rule, $s$, is obtained by projecting $t$ over the word
alignment. To introduce non-terminals into the rule's source and
target sides, we can apply rules extracted over sub-phrases of $t$,
synchronously substituting the corresponding non-terminal symbol for
the sub-phrases on both sides; $s$ and $t$ thus yield the rule's
$\gamma$ and $\alpha$, respectively. The synchronous substitution
applied to $t$ and $s$ yields the correspondence $\sim$.
Rather than assigning a weight $w$, SAMT stores a number of feature
values $\varphi_i$ for each rule, which are combined in a log-linear
model:
\begin{equation}
w = \sum_{i=1}^{N} \lambda_i \varphi_i .
\end{equation}
For ease of notation, we will refer to $\varphi_1, \ldots ,\varphi_N$
as $\vec{\varphi}$. The feature set of the SAMT model by default
consists of 23 feature functions and covers a diverse set of rule
characteristics. We discuss our task-specific extensions to the SAMT
feature set in Section~\ref{adaptation}.
\section{SCFGs in Paraphrasing} \label{acquisition}
\begin{figure*}[!t]
\begin{center}
\includegraphics[width=0.99\linewidth]{figures/pivot.pdf}
\end{center}
\caption{Creating a paraphrase rule by pivoting two translation rules
  over their shared foreign source side.}
\end{figure*}
To create a paraphrase grammar from a translation grammar, we extend
the pivot approach of \newcite{Callison-Burch2005} to the SAMT
model. For this purpose, we assume a grammar that translates from a
given foreign language to English. For each pair of translation rules where
the left-hand side $C$ and foreign string $\gamma$ match
\begin{eqnarray*}
C \rightarrow \langle \gamma, \alpha_1, \sim_1, \vec{\varphi}_1 \rangle \\
C \rightarrow \langle \gamma, \alpha_2, \sim_2, \vec{\varphi}_2 \rangle
\end{eqnarray*}
we create a paraphrase rule
\begin{equation*}
C \rightarrow \langle \alpha_1, \alpha_2, \sim, \vec{\varphi} \rangle ,
\end{equation*}
where the non-terminal correspondence relation $\sim$ is set to
reflect the combined non-terminal alignment
\begin{equation*}
\sim ~ = ~ \sim_1^{-1} \circ \sim_2 .
\end{equation*}
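For concreteness, the pairing of rules and the composition of their
correspondence relations can be sketched as follows (a minimal sketch
with a hypothetical rule representation, not the actual
implementation; the feature combination is described in the next
subsection):
\begin{verbatim}
from collections import namedtuple
from itertools import permutations

# Hypothetical rule representation; corr maps each
# non-terminal's index in src to its index in tgt.
Rule = namedtuple("Rule", "lhs src tgt corr feats")

def pivot(rules, combine_features):
    """Pair translation rules sharing (lhs, foreign
    side), composing ~ = ~1^{-1} o ~2."""
    for r1, r2 in permutations(rules, 2):
        if r1.lhs == r2.lhs and r1.src == r2.src:
            inv1 = {a: g for g, a in r1.corr.items()}
            corr = {a: r2.corr[g]
                    for a, g in inv1.items()}
            yield Rule(r1.lhs, r1.tgt, r2.tgt, corr,
                       combine_features(r1.feats,
                                        r2.feats))
\end{verbatim}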
\subsection{Feature Computation}
The computation of $\vec{\varphi}$ from $\vec{\varphi}_1$ and
$\vec{\varphi}_2$ proceeds feature by feature. For this purpose, the
SAMT feature set can be roughly partitioned into three groups.
The first group consists of features that depend solely on the rule
itself. This includes features based on the number or nature of terminals
or non-terminals in $\alpha_1$ or $\alpha_2$, features that indicate
non-terminal reordering or deletion of punctuation and pseudo-features
like a glue rule indicator or the application count feature. Since
none of these features require additional information, their
computation for $\vec{\varphi}$ is straightforward.
The features in the second group additionally depend on aggregate
counts over the bitext and thus cannot be computed from the rule
alone. However, the information contained in $\vec{\varphi}_1$ and
$\vec{\varphi}_2$ allows us to apply the pivot approach as shown in
Equation~\ref{paraphrase_prob_eqn} to approximate the true feature value
for $\vec{\varphi}$. In the SAMT feature set, this group only includes
the four lexical translation probability features.
The remainder of the feature set falls into the third group. These
features depend on information that we cannot reconstruct from
$\vec{\varphi}_1$ and $\vec{\varphi}_2$. Features in this group
include the rule's probability conditioned on the left-hand-side, its
source side with and without non-terminal labels as well as rareness
and unbalancedness penalties. For these we fall back to a black-box
approach, treating the value of a given feature as an indication of
the rule's quality in some particular aspect. To combine the feature
values, we multiply the relative frequencies and add the penalty
values.
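Schematically, the per-group combination looks as follows (feature
names and group assignments are illustrative stand-ins, not the
actual SAMT feature labels):
\begin{verbatim}
PIVOTED   = {"lex_prob_sgt", "lex_prob_tgs"}
REL_FREQS = {"prob_given_lhs", "prob_given_src"}
PENALTIES = {"rareness", "unbalancedness"}

def combine_features(f1, f2):
    out = {}
    for name, v1 in f1.items():
        if name in PIVOTED or name in REL_FREQS:
            out[name] = v1 * f2[name]  # multiply
        elif name in PENALTIES:
            out[name] = v1 + f2[name]  # add
        # group 1 features are recomputed directly
        # from the new rule's sides; omitted here
    return out
\end{verbatim}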
In addition to combining $\vec{\varphi}_1$ and $\vec{\varphi}_2$, we
extend $\vec{\varphi}$ with a paraphrasing-specific boolean feature
$\varphi_{\mathit{ident}}$, which is $1$ when $\alpha_1 = \alpha_2$ and
$0$ otherwise. This allows us to tune our system to prefer or
disprefer identity paraphrases.
\subsection{Grammar Pruning}
\label{pruning}
Due to the diverse set of non-terminals allowed in our model, the
grammars we extract tend to become extremely large. Combined with the
multiplicative effect of pivoting, this quickly makes the resulting
paraphrase grammars too large to handle.
To keep the grammars at a manageable size while retaining good
paraphrases, we implement the following pruning approach. We only
admit translation rules to pivot recombination that have been seen
more than a given number of times and that exceed a given translation
probability threshold. In addition, we rank the generated paraphrases
for each pattern by a combination of translation and lexical
probability and only retain the top $n$ alternatives.
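In sketch form (the thresholds and attribute names below are
hypothetical placeholders, not our actual settings):
\begin{verbatim}
MIN_COUNT, MIN_PROB, TOP_N = 3, 0.001, 25

def admit(rule):
    """Stage 1: only pivot reliable rules."""
    return (rule.count > MIN_COUNT and
            rule.trans_prob > MIN_PROB)

def top_n(paraphrases):
    """Stage 2: keep the n best per pattern."""
    ranked = sorted(paraphrases, reverse=True,
                    key=lambda p: p.trans_prob
                                  * p.lex_prob)
    return ranked[:TOP_N]
\end{verbatim}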
The result of this process is a paraphrase grammar with syntactic
annotations and a feature set that (approximately) mirrors that of
the initial translation grammar. What remains is to estimate the
weights with which these features enter the log-linear model; we turn
to this in Section~\ref{adaptation}.
\section{Analysis} \label{analysis}
When injecting syntactic information into our paraphrase extraction
process, a key aim is the acquisition of structural paraphrases. We
expect our training procedure to be able to extract generalized,
properly labeled patterns that will reflect well-known paraphrastic
transformations such as passivization, dative shift or the possessive
rule. A paraphrase grammar that, in addition to offering high lexical
coverage, can represent such transforms in a generic fashion can
serve as a powerful basis for a diverse set of text-to-text
generation tasks.
In examining a paraphrase grammar obtained from a French-English
bitext, we find that this is indeed the
case. Table~\ref{example_rules} shows a selection of paraphrase rules
found in our grammar, labeled with the paraphrastic transformation
they effect.
\begin{table*}
\begin{center}
\begin{tabular}{|c|rrcl|}
\hline
Possessive rule & $\mathit{NP}$ $\rightarrow$ & the $\mathit{NNP}_1$'s
$\mathit{NN}_2$ & $\mid$ & the $\mathit{NN}_2$ of the $\mathit{NNP}_1$
\\
Dative shift & $\mathit{VP}$ $\rightarrow$ & give $\mathit{NN}$ to
$\mathit{NP}$ & $\mid$ & give $\mathit{NP}$ the $\mathit{NN}$ \\
Passivization &
$\mathit{SBAR}$ $\rightarrow$ & that $\mathit{NP}$ had
$\mathit{VBN}$ & $\mid$ & which was $\mathit{VBN}$ by $\mathit{NP}$ \\
Adverbial phrase move &
$\mathit{VP}$ $\rightarrow$ & $\mathit{VBN}$ $\mathit{ADVP}$ & $\mid$ & is
$\mathit{ADVP}$ being $\mathit{VBN}$ \\
Reduced relative clause & $\mathit{SBAR/S}$ $\rightarrow$ & in order
to $\mathit{VB}$ that & $\mid$ & in order to $\mathit{VB}$ \\
\hline
\end{tabular}
\end{center}
\caption{Examples of structural paraphrasing rules extracted by our
system}
\label{example_rules}
\end{table*}
\subsection{Comparison of Paraphrase
Patterns} \label{pattern_comparison}
To analyze the paraphrase patterns produced by our system, we reduce
the grammar to only patterns that apply to an example sentence and
take a closer look at the most likely pattern pairs. We contrast our
extracted paraphrases with a Hiero-style baseline that mirrors the
approach of \newcite{Madnani2007}.
\begin{table*}[t]
\begin{center}
\begin{tabular}{|c|c|rrcl|}
\hline
Input phrase & \multicolumn{5}{c|}{Best extracted paraphrases} \\
\hline
\multirow{2}{*}{the dog's tail} &
H &
$\mathit{X}$ $\rightarrow$ & the $\mathit{X}_1$'s
$\mathit{X}_2$ & $\mid$ & the $\mathit{X}_2$ of the $\mathit{X}_1$ \\
\cline{2-6}
& S &
$\mathit{NP}$ $\rightarrow$ & the $\mathit{NNP}_1$'s
$\mathit{NN}_2$ & $\mid$ & the $\mathit{NN}_2$ of the $\mathit{NNP}_1$ \\
\hline
\end{tabular}
\end{center}
\caption{The top paraphrase rules matching an input phrase, as ranked
  by our system (S) and the Hiero-style baseline (H).}
\end{table*}
\subsection{Sentential Paraphrasing} \label{sentential_paraphrasing}
We are interested in sentential paraphrases. Why are we interested?
More powerful than locally constrained, gives us large-scale changes
to sentential structure, which can be cruicial to applications such as
detecting entailment or automatically creating significantly differing
references. \emph{However, phrasal decoder-based approaches should be
able to achieve similar re-ordering effects (if not the generality,
which we only implicitely achieve, really). Maybe we should add a
phrase-based baseline in addition to Hiero? Did Nitin talk about
this?}
While the definition of a phrasal paraphrase is intuitively clear,
sentential paraphrases are much harder to define. When paraphrasing a
sentence $s$ into a new sentence $t$, the term suggests that we expect
the changes to $s$ to be above a certain threshold for $t$ to be
considered a sentential paraphrase.
\section{Parameter Estimation}
\label{adaptation}
When tuning the parameters of a paraphrasing system, two separate
tasks can be identified: ranking candidate paraphrases and the
paraphrasing process itself. Due to the structure of our system, our
model integrates both tasks into one.\footnote{In the grammar pruning
  step described in Section~\ref{pruning} we do, in fact, implement a
  coarse separate paraphrase ranking step that is distinct from the
  otherwise integrated parameter estimation. This is done to keep the
  size of the paraphrase grammar manageable.}
We estimate the parameters of our log-linear model using minimum
error rate training \cite{Och2003}. Following \newcite{Madnani2007},
we tune on reference-to-reference translations. However, as we
discuss below, the BLEU metric is a problematic objective for
paraphrasing, as it favors identity paraphrases.
\subsection{Paraphrase BLEU} \label{pp_bleu}
The familiar BLEU metric \cite{Papineni2002} quantifies the quality
of a translation by measuring the n-gram overlap between the
generated translation and a set of reference translations.
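Concretely, BLEU combines modified $n$-gram precisions $p_n$ with a
brevity penalty $\mathit{BP}$:
\begin{equation*}
\mathit{BLEU} = \mathit{BP} \cdot \exp \left( \sum_{n=1}^{N} w_n \log
  p_n \right) ,
\end{equation*}
where the $w_n$ are uniform weights and, typically, $N = 4$.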
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
& BLEU & ppBLEU \\
\hline
Identity & 52.7 & bad \\
Oracle & good & good \\
\hline
\end{tabular}
\end{center}
\caption{Oracle experiment with references: each reference is
  ``translated'' onto itself (Identity) and onto a held-out reference
  (Oracle), and scored with BLEU and ppBLEU.}
\end{table}
For paraphrasing, this is a problem: the reference translations
typically have a rather high n-gram overlap with one another
\mnote{get some numbers to back that up}, so the decoder first has to
beat the score obtained by simply reproducing the input.
\section{Reference Expansion}
We present an exemplary application of our paraphrasing system: using
paraphrases for reference expansion \cite{Madnani2007}. The idea
behind reference expansion is to create additional ``reference''
translations with which to tune a machine translation
system. However, tuning our paraphrase system towards BLEU results in
a paraphraser that generates only minimal changes. \mnote{If so, how
  did Madnani deal with that?} We therefore introduce a
paraphrase-specific metric, ppBLEU, that aims to remedy this issue.
\subsection{Data} \label{data}
We use Europarl v.~5, word-aligned with the Berkeley aligner and
parsed with The Parser. We then use the SAMT pipeline to extract the
translation grammar. However, the pivoting approach makes it
impossible to apply the common pre-emptive grammar filtering, so the
grammar grows very large.
\subsection{Evaluation} \label{evaluation}
A straightforward way to evaluate a paraphrasing system is by using
it to improve an SMT system's performance
\cite{Callison-Burch2006b,Madnani2007}. We compare our system against
a Hiero-style baseline.
\section{Conclusion} \label{conclusion}
We have presented an approach that unifies syntactically informed
paraphrase patterns with pivot-based extraction from bilingual
parallel corpora, yielding a synchronous paraphrase grammar that
captures structural transformations such as passivization and dative
shift. We have further shown how the grammar's log-linear model can
be adapted to a concrete task, reference expansion for SMT, via
minimum error rate training and a paraphrase-specific metric.
\bibliographystyle{acl}
\bibliography{paraphrasing}
\end{document}