ENH+FIX+WIP fragmented EDUs, dep2con, move Parseval, one file per doc #67

Open
wants to merge 55 commits into
base: master
55 commits
e10ba13
FIX minor issues in report, score, learning.local
moreymat Jul 1, 2016
d33d65f
WIP special processing for same_unit
moreymat Jul 1, 2016
f070c51
ENH first implementation of CDUs, for frag EDUs
moreymat Jul 28, 2016
b35779a
ENH require joblib >= 0.9.4
moreymat Jul 29, 2016
e09c034
WIP update DataPack to include cdu{s,_pairings,_data,_target}
moreymat Jul 29, 2016
e982d23
WIP same-unit as preproc
moreymat Aug 24, 2016
fb462dd
WIP nonfixed_pairs are handled by classifier wrappers
moreymat Aug 25, 2016
dd3f80a
FIX revert to binary same-unit
moreymat Aug 29, 2016
651fbd2
WIP one file per doc
moreymat Aug 30, 2016
cc77239
WIP use regular AttachClassifiers for Same-Unit, delete now useless code
moreymat Aug 31, 2016
7e204bf
WIP one file per doc, same-unit
moreymat Sep 3, 2016
c56558a
FIX selection of cdus in Datapack.selected
moreymat Sep 4, 2016
e002607
MAINT carve out barebones_rst_deptree from get_oracle_ctrees
moreymat Sep 5, 2016
71412b1
DOC+COSMIT minor fixes in harness.graph and io
moreymat Sep 6, 2016
dffdf19
MAINT cleanups in io
moreymat Sep 6, 2016
e34c780
DOC fix type and description of an argument in docstring
moreymat Sep 7, 2016
384896d
MAINT renaming of rank strategy in educe dep2con
moreymat Sep 14, 2016
6b242ce
WIP print nb of spans in pred_ctree for debug
moreymat Sep 14, 2016
cc842f9
MAINT minor fix in metrics.util
moreymat Sep 15, 2016
45e813d
WIP add support_pred in eval structured for ctrees
moreymat Sep 16, 2016
6785970
WIP metrics.constituency: span_sel, fix sup_pred
moreymat Sep 21, 2016
12c9846
MAINT refactor parseval_report
moreymat Sep 24, 2016
cde05df
ENH variant of parseval scores, per doc then averaged
moreymat Sep 26, 2016
c19577e
MAINT get_spans is now a method from {Simple}RSTTree
moreymat Sep 30, 2016
639deee
WIP detailed metric: 'undirected spans' on deptree
moreymat Sep 30, 2016
7f63ef4
FIX limit parseval to one ctree per doc
moreymat Oct 3, 2016
abf439b
FIX disable extra verbosity in ctree eval
moreymat Oct 17, 2016
23b85da
MAINT constituency metrics moved to educe
moreymat Nov 10, 2016
a042559
ENH add scores LAS+N, LAS+O, LAS+N+O
moreymat Dec 6, 2016
05b8f02
MAINT contain CDU-related code
moreymat Feb 4, 2017
bf8006a
FIX load_multipack has file_split
moreymat Feb 6, 2017
26ba73d
FIX add verbose param to _load_multipack_cdus
moreymat Feb 6, 2017
13be763
FIX tuple takes a unique argument
moreymat Feb 7, 2017
2f21c3c
FIX repair file_split=corpus
moreymat Feb 7, 2017
48b1222
FIX DataPack.selected() with no CDUs
moreymat Feb 7, 2017
e93c74e
MAINT factor out dump_frag_edus (unfinished)
moreymat Feb 8, 2017
532abf9
FIX parser.same_unit exclude pairs (ROOT, _)
moreymat Feb 8, 2017
7c6d064
FIX attelo.parser.intra skip fake root to compute intra_tgts
moreymat Feb 8, 2017
bfea509
FIX parser.intra _for_intra_cdu when no CDU pairing
moreymat Feb 8, 2017
e37585e
FIX parser.same_unit first real EDU at posn 0
moreymat Feb 8, 2017
73331b7
FIX disable forests for ctree eval
moreymat Feb 9, 2017
a269f51
FIX dpack.edus[0] is the fake root EDU if file_split='corpus'
moreymat Feb 9, 2017
5602c43
FIX workaround inconsistent API in conv from dtree to simplersttree
moreymat Feb 10, 2017
bc1d5ec
FIX optional display of predicted same-unit
moreymat Feb 14, 2017
ddcf3a1
FIX use Torpor
moreymat Feb 14, 2017
795fea0
FIX merge ctargets in DataPack.vstack
moreymat Feb 15, 2017
f17f57c
FIX assume pos_label=1 for unpickled SklearnAttachClassifier
moreymat Feb 17, 2017
ecee6fb
Merge remote-tracking branch 'upstream/master' into enh-preproc-same-…
moreymat Apr 11, 2017
dd58064
ENH compute_uas_las() has param metrics
moreymat May 16, 2017
41456e9
ENH optionally pass doc_names to compute_uas_las()
moreymat May 17, 2017
974ef79
ENH metrics.deptree: dep_compact_report()
moreymat May 18, 2017
8aedc13
ENH pairwise dep similarity
moreymat May 21, 2017
2964e82
FIX pairwise sim report: no underscore
moreymat May 21, 2017
50d7da4
ENH add dep metric R+O
moreymat Nov 29, 2017
ada8822
FIX out_format latex
moreymat Dec 1, 2017
2 changes: 2 additions & 0 deletions attelo/args.py
@@ -24,6 +24,8 @@ def add_common_args(psr):
                      help="EDU pair features (libsvm)")
     psr.add_argument("vocab", metavar="FILE",
                      help="feature vocabulary")
+    psr.add_argument("labels", metavar="FILE",
+                     help="labels")
     psr.add_argument("--quiet", action="store_true",
                      help="Supress all feedback")
 
13 changes: 13 additions & 0 deletions attelo/cdu.py
@@ -0,0 +1,13 @@
+"""Explicit representation of a CDU.
+
+As of 2016-07-28, this is WIP.
+"""
+
+from collections import namedtuple
+
+
+class CDU(namedtuple("CDU", "id members")):
+    """A class representing the CDU (id, [members])"""
+    pass
+
+
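For readers unfamiliar with the pattern, here is a minimal sketch of how the new `CDU` class could be used. Only the class definition comes from the diff; the identifiers in the example values are made up for illustration.

```python
from collections import namedtuple


class CDU(namedtuple("CDU", "id members")):
    """A class representing the CDU (id, [members])"""
    pass


# Hypothetical ids, for illustration only.
cdu = CDU(id="doc1_cdu1", members=["doc1_2", "doc1_4"])

# A namedtuple subclass still unpacks and compares like a plain tuple.
cdu_id, members = cdu
```

Because `CDU` adds nothing to the namedtuple, it stays hashable only as long as `members` is itself hashable; a tuple of EDU ids would be needed to use CDUs as dict keys.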


4 changes: 2 additions & 2 deletions attelo/cmd/graph.py
@@ -29,9 +29,9 @@ def config_argparser(psr):
 
     input_grp = psr.add_mutually_exclusive_group(required=True)
     input_grp.add_argument("--gold", metavar="FILE",
-                           nargs=2,
+                           nargs=3,
                            help="gold predictions [pairings, "
-                           "features (targets only)]")
+                           "features (targets only), labels]")
     input_grp.add_argument("--predictions", metavar="FILE",
                            help="single predictions")
 
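With `nargs=3`, `--gold` now expects three file arguments instead of two. A minimal, self-contained sketch of how the parsed value unpacks, assuming made-up file names and a bare parser rather than attelo's real command-line setup:

```python
import argparse

psr = argparse.ArgumentParser()
input_grp = psr.add_mutually_exclusive_group(required=True)
input_grp.add_argument("--gold", metavar="FILE", nargs=3,
                       help="gold predictions [pairings, "
                            "features (targets only), labels]")
input_grp.add_argument("--predictions", metavar="FILE",
                       help="single predictions")

# Hypothetical file names, for illustration only.
args = psr.parse_args(["--gold", "x.pairings", "x.features", "x.labels"])

# args.gold is a list of exactly three strings.
pairings_file, features_file, labels_file = args.gold
```

Callers that still pass two files to `--gold` will get an argparse usage error, so any script invoking this command needs to be updated along with this change.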
6 changes: 3 additions & 3 deletions attelo/cmd/util.py
@@ -15,10 +15,10 @@ def load_args_multipack(args):
     '''
     Load multipack specified via command line arguments
     '''
-    return load_multipack(args.edus,
-                          args.pairings,
+    return load_multipack(args.edus, args.pairings,
                           args.features,
-                          args.vocab,
+                          args.vocab, args.labels,
+                          file_split='corpus', # WIP
                           verbose=not args.quiet)
 
 
4 changes: 4 additions & 0 deletions attelo/decoding/eisner.py
@@ -3,6 +3,7 @@
 
 import numpy as np
 
+from ..edu import FAKE_ROOT
 from .interface import Decoder
 # temporary? imports
 from ..table import _edu_positions
@@ -46,6 +47,9 @@ def decode(self, dpack, nonfixed_pairs=None):
         """
         # whether the output tree should contain a unique real root
         unique_real_root = self._unique_real_root
+        # check that the first EDU is the fake root ; this is an
+        # important assumption for the following code
+        assert dpack.edus[0] == FAKE_ROOT
 
         # get number of EDUs and possible labels
         nb_edus = len(dpack.edus)
9 changes: 9 additions & 0 deletions attelo/edu.py
@@ -53,3 +53,12 @@ def span(self):
     all groupings
     """
     # pylint: enable=pointless-string-statement
+
+
+# small helper for parsers
+def edu_id2num(edu_id):
+    """Get the number of an EDU"""
+    edu_num = (int(edu_id.rsplit('_', 1)[1])
+               if edu_id != FAKE_ROOT_ID
+               else 0)
+    return edu_num
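A quick sketch of what `edu_id2num` returns, runnable outside attelo. The value of `FAKE_ROOT_ID` and the EDU ids below are assumptions made for the example; the function body is copied from the diff.

```python
# Assumption: stand-in for attelo.edu.FAKE_ROOT_ID.
FAKE_ROOT_ID = "ROOT"


def edu_id2num(edu_id):
    """Get the number of an EDU"""
    edu_num = (int(edu_id.rsplit('_', 1)[1])
               if edu_id != FAKE_ROOT_ID
               else 0)
    return edu_num


# Hypothetical ids of the form <doc name>_<EDU number>.
nums = [edu_id2num(x) for x in ["ROOT", "wsj_0601_1", "wsj_0601_12"]]
```

The helper relies on the id ending in `_<integer>`: an id with no underscore raises `IndexError`, and a non-numeric suffix raises `ValueError`.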
15 changes: 10 additions & 5 deletions attelo/fold.py
@@ -1,21 +1,26 @@
-'''
+"""
 Group-aware n-fold evaluation.
 
 Attelo uses a variant of n-fold evaluation, where we (still)
-andomly partition the dataset into a set of folds of roughly even
+randomly partition the dataset into a set of folds of roughly even
 size, but respecting the additional constraint that any two data
-entries belonging in the same "group" (determined a single
+entries belonging in the same "group" (determined by a single
 distiguished feature, eg. the document id, the dialogue id, etc)
 are always in the same fold. Note that this makes it a bit harder
-to have perfectly evenly sized folds
+to have perfectly evenly sized folds.
 
 
 Created on Jun 20, 2012
 
 @author: stergos
 
 contribs: phil
-'''
+
+TODO
+----
+* [ ] refactor after `sklearn.model_selection._split`: encapsulate
+      into a class similar to GroupKFold.
+"""
 
 import random
 
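The constraint described in the docstring can be illustrated with a small sketch. This is not attelo's implementation: the function name `make_group_folds` and the toy data are hypothetical, and the assignment is a plain round-robin over a shuffled list of groups.

```python
import random
from collections import defaultdict


def make_group_folds(groups, n_folds, rng):
    """Map each distinct group to a fold number (hypothetical helper).

    Groups are shuffled, then dealt round-robin, so fold sizes are
    balanced in number of groups, not in number of entries.
    """
    shuffled = sorted(set(groups))
    rng.shuffle(shuffled)
    return {grp: i % n_folds for i, grp in enumerate(shuffled)}


# Toy data: one entry per EDU pair, tagged with its document id.
entries = [("d1", "pair_a"), ("d1", "pair_b"), ("d2", "pair_c"),
           ("d3", "pair_d"), ("d3", "pair_e"), ("d3", "pair_f")]
fold_of = make_group_folds([doc for doc, _ in entries], 2, random.Random(0))

# All entries of a document land in that document's fold.
folds = defaultdict(list)
for doc, pair in entries:
    folds[fold_of[doc]].append(pair)
```

Since documents differ in size, the folds here can hold unequal numbers of pairs, which is the trade-off the docstring mentions.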
64 changes: 52 additions & 12 deletions attelo/harness/evaluate.py
@@ -78,6 +78,18 @@ def _link_data_files(data_dir, eval_dir):
         eval_file = fp.join(eval_dir, fname)
         if fp.isfile(data_file) and not fp.exists(eval_file):
             os.link(data_file, eval_file)
+        elif fp.isdir(data_file) and not fp.exists(eval_file):
+            # 2016-09-01 add support for one file per doc:
+            # create hard links to data/Y/Z as eval-xxx/Y/Z ;
+            # folders cannot be hard linked so we create copies
+            os.makedirs(eval_file)
+            # dirty recursive calls, limited to the immediate members
+            # of data/
+            for fname_sub in os.listdir(data_file):
+                data_file_sub = fp.join(data_file, fname_sub)
+                eval_file_sub = fp.join(eval_file, fname_sub)
+                if fp.isfile(data_file_sub) and not fp.exists(eval_file_sub):
+                    os.link(data_file_sub, eval_file_sub)
 
 
 def _link_model_files(old_dir, new_dir):
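The new branch in `_link_data_files` can be tried in isolation with the sketch below. The function name `link_one_level` and the temporary directory layout are assumptions made for the example; the linking logic follows the diff (files are hard-linked, subfolders are recreated and their immediate files hard-linked).

```python
import os
import os.path as fp
import tempfile


def link_one_level(data_dir, eval_dir):
    """Hard-link files of data_dir into eval_dir, one subfolder deep."""
    for fname in os.listdir(data_dir):
        data_file = fp.join(data_dir, fname)
        eval_file = fp.join(eval_dir, fname)
        if fp.isfile(data_file) and not fp.exists(eval_file):
            os.link(data_file, eval_file)
        elif fp.isdir(data_file) and not fp.exists(eval_file):
            # folders cannot be hard linked, so recreate them
            os.makedirs(eval_file)
            for fname_sub in os.listdir(data_file):
                data_sub = fp.join(data_file, fname_sub)
                eval_sub = fp.join(eval_file, fname_sub)
                if fp.isfile(data_sub) and not fp.exists(eval_sub):
                    os.link(data_sub, eval_sub)


# Hypothetical layout: data/corpus.vocab and data/doc1/doc1.edus
root = tempfile.mkdtemp()
data_dir = fp.join(root, "data")
eval_dir = fp.join(root, "eval")
os.makedirs(fp.join(data_dir, "doc1"))
os.makedirs(eval_dir)
for rel in ("corpus.vocab", fp.join("doc1", "doc1.edus")):
    with open(fp.join(data_dir, rel), "w") as f:
        f.write("x")

link_one_level(data_dir, eval_dir)
```

Two limits are worth keeping in mind: hard links only work when both directories live on the same filesystem, and anything nested deeper than one subfolder is silently skipped.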
@@ -131,12 +143,26 @@ def _create_tstamped_dir(prefix, suffix):
     return True
 
 
-def prepare_dirs(runcfg, data_dir):
-    """
-    Return eval and scratch directory paths
+def prepare_dirs(runcfg, base_dir):
+    """Get eval and scratch directory paths.
+
+    Parameters
+    ----------
+    runcfg: attelo.harness.config.RuntimeConfig
+        Current runtime config
+    base_dir: filepath
+        Base directory for the experiment.
+
+    Returns
+    -------
+    eval_dir: filepath
+        Evaluation folder ; subfolder of base_dir.
+    scratch_dir: filepath
+        Scratch folder ; subfolder of base_dir.
     """
-    eval_prefix = fp.join(data_dir, "eval")
-    scratch_prefix = fp.join(data_dir, "scratch")
+    data_dir = os.path.join(base_dir, 'data')
+    eval_prefix = fp.join(base_dir, "eval")
+    scratch_prefix = fp.join(base_dir, "scratch")
 
     eval_current = eval_prefix + '-current'
     scratch_current = scratch_prefix + '-current'
@@ -230,19 +256,33 @@ def _load_harness_multipack(hconf, test_data=False):
         paths = stripped_paths
     else:
         paths = hconf.mpack_paths(test_data, stripped=False)
-    mpack = load_multipack(paths['edu_input'],
-                           paths['pairings'],
+    mpack = load_multipack(paths['edu_input'], paths['pairings'],
                            paths['features'],
-                           paths['vocab'],
-                           corpus_path=paths.get('corpus', None), # WIP
+                           paths['vocab'], paths['labels'],
+                           # WIP additional files, used only for rst-dt
+                           # as of 2016-07-28
+                           cdu_file=paths.get('cdu_input', None),
+                           cdu_pairings_file=paths.get('cdu_pairings', None),
+                           cdu_feature_file=paths.get('cdu_features', None),
+                           corpus_path=paths.get('corpus', None),
+                           # end WIP
+                           file_split='corpus', # WIP
                            verbose=True)
     return mpack
 
 
 def _init_corpus(hconf):
     """Start evaluation; generate folds if needed
 
-    :rtype: DataConfig or None
+    Parameters
+    ----------
+    hconf: ??
+        TODO
+
+    Returns
+    -------
+    dconf: DataConfig or None
+        Data configuration
     """
     can_skip_folds = fp.exists(hconf.fold_file)
     msg_skip_folds = ('Skipping generation of fold files '
@@ -281,8 +321,8 @@ def evaluate_corpus(hconf):
 
     dconf = _init_corpus(hconf)
     if hconf.runcfg.stage in [None, ClusterStage.main]:
-        foldset = hconf.runcfg.folds if hconf.runcfg.folds is not None\
-            else frozenset(dconf.folds.values())
+        foldset = (hconf.runcfg.folds if hconf.runcfg.folds is not None
+                   else frozenset(dconf.folds.values()))
         for fold in foldset:
             do_fold(hconf, dconf, fold)
 
19 changes: 9 additions & 10 deletions attelo/harness/example.py
@@ -59,16 +59,16 @@ class TinyHarness(Harness):
                                      parser=_parser2)]
 
     def __init__(self):
-        self._datadir = mkdtemp()
+        self._basedir = mkdtemp()
         for cpath in glob.glob('doc/example-corpus/*'):
-            shutil.copy(cpath, self._datadir)
+            shutil.copy(cpath, self._basedir)
         super(TinyHarness, self).__init__('tiny', None)
 
     def run(self):
         """Run the evaluation
         """
         runcfg = RuntimeConfig.empty()
-        eval_dir, scratch_dir = prepare_dirs(runcfg, self._datadir)
+        eval_dir, scratch_dir = prepare_dirs(runcfg, self._basedir)
         self.load(runcfg, eval_dir, scratch_dir)
         evaluate_corpus(self)
 
@@ -89,13 +89,12 @@ def mpack_paths(self, _, stripped=False):
         The 2nd argument denoted by '_' is test_data, which is unused in
         this example.
         """
-        core_path = fp.join(self._datadir, 'tiny')
-        return {
-            'edu_input': core_path + '.edus',
-            'pairings': core_path + '.pairings',
-            'features': core_path + '.features.sparse',
-            'vocab': core_path + '.features.sparse.vocab'
-        }
+        core_path = fp.join(self._basedir, 'data', 'tiny')
+        return {'edu_input': core_path + '.edus',
+                'pairings': core_path + '.pairings',
+                'features': core_path + '.features.sparse',
+                'vocab': core_path + '.features.sparse.vocab',
+                'labels': core_path + '.labels'}
 
     def _model_basename(self, rconf, mtype, ext):
         "Basic filename for a model"
26 changes: 12 additions & 14 deletions attelo/harness/graph.py
@@ -50,20 +50,19 @@ def _mk_econf_graphs(hconf, edus, gold, econf, fold):
         raise Exception('Unknown diff mode {}'.format(diffmode))
 
     want_test = fold is None
-    suffix = 'test' if want_test\
-        else fp.basename(hconf.fold_dir_path(fold))
+    suffix = ('test' if want_test
+              else fp.basename(hconf.fold_dir_path(fold)))
     output_dir = fp.join(hconf.report_dir_path(want_test, None),
                          output_bn_prefix + suffix,
                          econf.key)
 
     # settings
     to_hide = 'inter' if diffmode == GraphDiffMode.diff_intra else None
-    settings =\
-        GraphSettings(hide=to_hide,
-                      select=hconf.graph_docs,
-                      unrelated=False,
-                      timeout=15,
-                      quiet=False)
+    settings = GraphSettings(hide=to_hide,
+                             select=hconf.graph_docs,
+                             unrelated=False,
+                             timeout=15,
+                             quiet=False)
 
     if diffmode == GraphDiffMode.solo:
         yield delayed(graph_all)(edus,
@@ -84,12 +83,11 @@ def _mk_gold_graphs(hconf, dconf):
     output_dir = fp.join(hconf.report_dir_path(None),
                          'graphs-gold')
 
-    settings =\
-        GraphSettings(hide=None,
-                      select=hconf.graph_docs,
-                      unrelated=False,
-                      timeout=15,
-                      quiet=True)
+    settings = GraphSettings(hide=None,
+                             select=hconf.graph_docs,
+                             unrelated=False,
+                             timeout=15,
+                             quiet=True)
 
     predictions = to_predictions(dconf.pack)
     edus = concat_l(dpack.edus for dpack in dconf.pack.values())
4 changes: 2 additions & 2 deletions attelo/harness/interface.py
@@ -201,8 +201,8 @@ def mpack_paths(self, test_data, stripped=False):
         Usual keys are:
         * edu_input
         * pairings
-        * features
-        * vocab
+        * vocabulary
+        * labels
 
         Parameters
         ----------
6 changes: 5 additions & 1 deletion attelo/harness/report.py
@@ -309,7 +309,11 @@ def full_report(mpack, fold_dict, slices, metrics,
                 edge_count[key].append(score_edges(fpack, predictions))
             # * on constituency tree spans
             if 'cspans' in metrics:
-                sc_cspans = score_cspans(dpacks, dpredictions)
+                try:
+                    sc_cspans = score_cspans(dpacks, dpredictions)
+                except Exception:
+                    print('Error in slice configuration', key)
+                    raise
                 cspan_count[key].append(sc_cspans)
             # * on EDUs
             if 'edus' in metrics:
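The `try`/`except`/`raise` added around `score_cspans` is a debugging aid: it names the failing configuration, then lets the original exception propagate with its traceback intact. A minimal sketch of the same pattern, with hypothetical names (`score_all`, `mean`) and toy data standing in for attelo's report code:

```python
def mean(values):
    """Toy scoring function; fails on an empty list."""
    return sum(values) / len(values)


def score_all(configs, score_fn):
    """Score each configuration, naming the failing one before re-raising."""
    scores = {}
    for key, data in configs.items():
        try:
            scores[key] = score_fn(data)
        except Exception:
            print('Error in slice configuration', key)
            raise  # bare raise keeps the original exception and traceback
    return scores


failure = None
try:
    score_all({"ok": [1, 3], "empty": []}, mean)
except ZeroDivisionError as err:
    failure = err
```

A bare `raise` is used rather than `raise Exception(...)` so that the caller still sees the original exception type.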