-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.txt
131 lines (90 loc) · 6.64 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
DEVELOPERS
~~~~~~~~~~~~~~~~~~~~
* Vanessa Wei Feng
Department of Computer Science
University of Toronto
Canada
mailto:[email protected]
REFERENCES
~~~~~~~~~~~~~~~~~~~~~~
* Vanessa Wei Feng and Graeme Hirst, 2014. Two-pass Discourse Segmentation with Pairing and Global Features. arXiv:1407.8215v1. http://arxiv.org/abs/1407.8215
* Vanessa Wei Feng and Graeme Hirst, 2014. A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing. In Proceedings of the 52th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-2014), Baltimore, USA. http://aclweb.org/anthology/P14-1048
GENERAL INFOMRATION
~~~~~~~~~~~~~~~~~~~~~~
* This RST-style discourse parser produces discourse tree structure on full-text level, given a raw text. No prior sentence splitting or any sort of preprocessing is expected. The program runs on Linux systems.
* The overall software work flow is similar to the one described in our paper (ACL 2014). However, to guarantee further efficiency, we remove the post-editing component from the workflow, and remove the set of entity-based transaction features from our feature set. Moreover, both structure and relation classification models are now implemented using CRFSuite.
DEPENDENCIES
~~~~~~~~~~~~~~~~~~~~~~
You need to install the following third-party software first:
In order to run the software, you need to have the following components:
1. Python 2.7.
2. NLTK 2.0b9 (newer versions will not work because of the different interfaces in the tree modules).
3. Java
4. gcc
5. Perl
* The Stanford syntax parser is already included in this package. However, please do not try to replace the provided package with newer versions, since compatibility is not guaranteed in this case.
SETUP CRFSUITE
~~~~~~~~~~~~~~~~~~~~~~
The $gCRF_ROOT$ symbol in the commands below stands for the root directory of gCRF.
1. Test if the binary file of CRFsuite-stdin is ready for use, by executing the following two commands:
cd $gCRF_ROOT$/tools/crfsuite/
crfsuite-stdin tag -pi -m ../../model/tree_build_set_CRF/label/intra.crfsuite test.txt
If the following lines are printed, then the binary is usable on your platform, and thus you can skip the next step and go to the USAGE section:
@probability 0.003834
LEAF:0.063409
Elaboration[N][S]:0.958434
LEAF:0.060318
2. If the test in the Step 1 fails, you need to build the binary from source codes. To do so, you need to do the following three things:
a. Build LibLBFGS
cd $gCRF_ROOT$/tools/crfsuite/liblbfgs-1.10
./configure --prefix=$HOME/local
make
make install
b. Build CRFsuite-stdin
cd $gCRF_ROOT$/tools/crfsuite/crfsuite-0.12
./configure --prefix=$HOME/local --with-liblbfgs=$HOME/local
make
make install
c. Copy the crfsuite binary under $HOME/local/bin to tools/crfsuite and rename it as crfsuite-stdin
cp $HOME/local/bin/crfsuite $gCRF_ROOT$/tools/crfsuite/crfsuite-stdin
chmod +x ../crfsuite-stdin
Now repeat the test in Step 1 and see if it works now.
SOFTWARE USAGE
~~~~~~~~~~~~~~~~~~~~~~
Usage: parse.py [options] input_file/filelist
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-v, --verbose verbose mode
-s, --skip_parsing Skip parsing, i.e., conduct segmentation only.
-D, --filelist parse all files specified in the filelist file, one
file per line.
-t, --output_dir Specify a directory for output files.
-g, --global_features
Perform a second pass of EDU segmentation using global
features.
-l, --logging Perform logging while parsing.
-e, --save Save preprocessed document into serialized file for
future use.
1) To produce the discourse parse tree for a single file, do the following:
python parse.py -t test ../texts/wsj_0607.out
If you omit the -t option, the result will be written to "$gCRF_ROOT$/texts/results", otherwise, it will create a sub-folder under "$gCRF_ROOT$/texts/results". So for the command above, the result "wsj_0607.out.tree" will appear in "$gCRF_ROOT$/texts/results/test/".
2) To produce discourse parse trees for a batch of files, do the following, where filelist is the name of the file where you store the absolute path of each file that you wish to process (one file per line):
python parse.py -tD test_batch filelist
And the results will appear in "$gCRF_ROOT$/texts/results/test_batch/".
3) To enable two-pass segmentation (see our arXiv paper), use the -g option, in conjunction with other options (if any). However, enabling this option would largely slow the segmentation, but with higher segmentation accuracy.
4) To enable logging while processing texts, use the -l option. In this way, the time spent on preprocessing, segmentation, and parsing will be recorded individually.
For example, for the sample text, its log will look like the following:
***** Parsing ../texts/wsj_0607.out...
Finished preprocessing in 2.99 seconds.
Finished segmentation in 1.07 seconds. Segmented into 15 EDUs.
Finished tree building in 0.52 seconds.
===================================================
5) You may save a (partially) processed document to as a binary file, by using the -e option. This is particularly useful for processing large documents, as you might not want to repeat the preprocessing part (i.e., syntactic parsing, etc.), but wish to explore some different parsing configurations. There are three places where the serialized data will be saved: after preprocessing, after segmentation, and after discourse parsing.
TROUBLESHOOTING
~~~~~~~~~~~~~~~~~~~~~~
* It seems that newer version of NLTK has a different interface for its tree objects. Our implementation was based on NLTK 2.0b9. So if you're experiencing with errors reported such as "'function' object has no attribute 'parent'", please either replace your NLTK version with 2.0b9, or go through the source codes under src/ folder, adding brackets to all *.parent, *.left_sibling, *.right_sibling, *.root, *.treepos, *.treepositions. (e.g., *.parent becomes *.parent())
* We have included a script, santiy_check.py, to test a few key dependent components (CCGSsplitter, Stanford parser, and CRFSuite) of the tool. You may run the script for a quick diagnosis.
BUGS AND COMMENTS
~~~~~~~~~~~~~~~~~~~~~~
If you encounter and bugs using the program, please report the exception thrown by the program and the specific text file(s) you used for parsing, to [email protected]. General comments about the program and the results are also welcome!