---
title: "Discussion on manually annotated and parsed Komi dependencies"
author: "Niko Partanen"
date: "7/5/2017"
output:
  html_document:
    includes:
      after_body: after_body.html
      in_header: header.html
    section_divs: FALSE
    lib_dir: lib
    self_contained: no
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
##
- The manually annotated utterance **is on top**
- The one below comes from our BIST parser
- **Both** the human and the machine generated versions may contain mistakes ;)

Suggestions, improvements and comments can be sent to me by email at [email protected].
```{r, echo=FALSE, results='asis', warning=FALSE, message=FALSE}
library(conllr)
library(tidyverse)
library(glue)

plot_tree <- function(id, type = 'test'){
  # Choose between the parser output and the manually annotated test file
  if (type == 'parser'){
    filepath <- 'dev_epoch_1.conllu'
  } else {
    filepath <- 'kpv-ud-dev_20.conllu'
  }
  # Build the full sentence id before the pipe, so that `id` inside glue
  # refers to the function argument and not to the data frame column
  target_id <- glue('belyx-011.{id}')
  kpv <- read_conll(filepath)
  kpv %>%
    mutate(id = stringr::str_trim(id)) %>%
    filter(id == target_id) %>%
    as_conllu_div()
}
```
So what is going on here is that we have been training a multilingual dependency parser for Komi and Northern Saami, with the idea that we can use multilingual word embeddings and UD corpora from different languages as our training data. By controlling the training data it is possible to adjust the model: for example, the results drift quite a lot depending on whether the ratio of Finnish to Russian in a model is 20/80, 50/50, and so on. Of course Komi is in no way a mixture of Finnish and Russian, but it shares similarities with both: Finnish as a related language, and Russian through the long language contact that has made the two more similar.
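To make the ratio idea concrete, here is a minimal sketch of how such a mixed training file could be put together. The helper function, the file names and the 20/80 default are illustrative assumptions, not our actual training pipeline:

```{r, eval=FALSE}
# Hypothetical helper: sample whole sentences from a Finnish and a Russian
# CoNLL-U file so that they end up in a chosen ratio in the output.
mix_treebanks <- function(fi_file, ru_file, out_file,
                          fi_share = 0.2, n_sentences = 1000) {
  read_sentences <- function(path) {
    text <- paste(readLines(path, encoding = 'UTF-8'), collapse = '\n')
    # CoNLL-U sentences are separated by blank lines
    strsplit(text, '\n\n+')[[1]]
  }
  fi <- read_sentences(fi_file)
  ru <- read_sentences(ru_file)
  n_fi <- round(n_sentences * fi_share)
  # Sampling is without replacement, so n_sentences has to fit in the data
  mixed <- sample(c(sample(fi, n_fi), sample(ru, n_sentences - n_fi)))
  writeLines(paste(mixed, collapse = '\n\n'), out_file)
}

# e.g. a 20/80 Finnish/Russian mix (the file names are made up):
# mix_treebanks('fi-ud-train.conllu', 'ru-ud-train.conllu',
#               'mixed_20_80.conllu', fi_share = 0.2)
```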
One problem is that the test data may well contain mistakes itself, as I'm not any kind of expert in Universal Dependencies style annotation and had never worked with it before. So reporting errors would actually be most appreciated.
The good news is that because we have all the output files saved on our server, we can very easily recalculate the scores if the evaluation data changes. I have already noticed several mistakes, and I think there are several cases where I have misunderstood the UD annotation model.
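As an illustration, recomputing the usual attachment scores against a corrected gold file could look roughly like the sketch below. I'm assuming here that `read_conll()` returns token-aligned `head` and `deprel` columns, which may not match its actual output:

```{r, eval=FALSE}
# Rough sketch: unlabeled (UAS) and labeled (LAS) attachment scores,
# assuming both files contain the same sentences and tokens in the same order.
attachment_scores <- function(gold_file, pred_file) {
  gold <- read_conll(gold_file)
  pred <- read_conll(pred_file)
  correct_head <- gold$head == pred$head
  c(UAS = mean(correct_head),
    LAS = mean(correct_head & gold$deprel == pred$deprel))
}

# attachment_scores('kpv-ud-dev_20.conllu', 'dev_epoch_1.conllu')
```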
## Analysis of sentences
So, as explained above, the first tree is the manually annotated ground truth and the second one is generated by the parser trained on Russian treebanks:
`r plot_tree('013')`
`r plot_tree('013', 'parser')`
<br/>
I don't know why it thinks `А` is of category [mark](http://universaldependencies.org/u/dep/mark.html). Now that I read the description, it doesn't even look that wrong, actually…
In the next example it is obvious that the subject is already encoded incorrectly in the test data.
`r plot_tree('015')`
`r plot_tree('015', 'parser')`
<br/>
Here is another one which is analysed correctly. The sentence is very simple: *[the water] [calmed down]*.
`r plot_tree('016')`
`r plot_tree('016', 'parser')`
<br/>
Here is another one where the analyses are the same, the sentence being a bit more complex: *[the fish] [entirely] [got quiet]*:
`r plot_tree('017')`
`r plot_tree('017', 'parser')`
<br/>
Here the result is also good. The sentence is **[even] [the tail] [(it) didn't] [show]**.
`r plot_tree('018')`
`r plot_tree('018', 'parser')`
<br/>
In this case it is absolutely clear that I didn't know very well how to annotate the postpositions, so maybe the Russian model understood those better after all? **Note: compare these kinds of situations to some Finnish examples**.
`r plot_tree('020')`
`r plot_tree('020', 'parser')`
<br/>
Here, too, the question is rather whether two consecutive adverbs should be linked to one another or directly to the root, as in the generated version.
`r plot_tree('022')`
`r plot_tree('022', 'parser')`
<br/>
Here the only mistake is that the object is not detected correctly.
`r plot_tree('024')`
`r plot_tree('024', 'parser')`
<br/>
This repeats the same question about adverbs.
`r plot_tree('026')`
`r plot_tree('026', 'parser')`
<br/>
We have already seen this kind of mistake with objects and obliques many times.
`r plot_tree('030')`
`r plot_tree('030', 'parser')`
<br/>
Here is another correctly analysed one:
`r plot_tree('031')`
`r plot_tree('031', 'parser')`
<br/>
This is also correct.
`r plot_tree('032')`
`r plot_tree('032', 'parser')`
<br/>
This one is not correct, but I think the word order is the reason here: the subject is at the very end. `А` is again annotated as `mark`; maybe this is the correct annotation then?
`r plot_tree('033')`
`r plot_tree('033', 'parser')`
<br/>
Here is a similar situation as before, with advmods one after another. The analysis suggests a subject at the beginning, but here the subject is not marked. Are subjects more compulsory in Russian than in Komi? Similarly, I wasn't sure whether we should annotate the verb there as a copula or something else. Is `aux` or `xcomp` better?
`r plot_tree('035')`
`r plot_tree('035', 'parser')`
<br/>
Here is a gerund, **while it was raining**. I wasn't sure how to annotate it.
`r plot_tree('037')`
`r plot_tree('037', 'parser')`
<br/>
Here the negation verb is somehow analysed as punctuation? Where can this come from?
`r plot_tree('038')`
`r plot_tree('038', 'parser')`
<br/>
There is a typo in the test data: рудӧдны should be `xcomp` of кутіс. But it doesn't affect the fact that the subject is analysed incorrectly in the machine-generated output. Is the clause around the subject somehow different from Russian constructions?
`r plot_tree('041')`
`r plot_tree('041', 'parser')`
<br/>
Here are my mistakes with the adverbs, but the subject is also not found correctly.
`r plot_tree('042')`
`r plot_tree('042', 'parser')`
<br/>
This one is also a case where the subject is at the very end. Or can it be analysed so?
`r plot_tree('043')`
`r plot_tree('043', 'parser')`
<br/>
I have to think about this one…
`r plot_tree('045')`
`r plot_tree('045', 'parser')`
<br/>
Here the analysis is also different. Maybe one factor is that Russian would never have just one noun alone in a position like this?
`r plot_tree('046')`
`r plot_tree('046', 'parser')`