-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample_new parameter to gendb::incorporate #215
Changes from 4 commits
85a19dc
10e88b9
82bc4cd
9e12a66
4e536ef
73fb3d5
d6ba9ad
8f10e0f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,7 +40,8 @@ double GenDB::logp_score() const { | |
|
||
void GenDB::incorporate( | ||
std::mt19937* prng, | ||
const std::pair<int, std::map<std::string, ObservationVariant>>& row) { | ||
const std::pair<int, std::map<std::string, ObservationVariant>>& row, | ||
bool sample_new) { | ||
int id = row.first; | ||
|
||
// Maps a query relation name to an observed value. | ||
|
@@ -53,7 +54,8 @@ void GenDB::incorporate( | |
schema.query.fields.at(query_rel).class_path; | ||
T_items items = | ||
sample_entities_relation(prng, schema.query.record_class, | ||
class_path.cbegin(), class_path.cend(), id); | ||
class_path.cbegin(), class_path.cend(), id, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A lot of the machinery in sample_class_ancestors etc is for the purpose of walking the DAG of references to make sure the reference values are coherent (i.e. if Physician 5 went to School 3, that's correct in all samples). Am I correct that when sample_new is false, we never see the same Physician e.g. more than once, so we never actually have to walk the DAG? If so, then I think we should implement this by simply iterating over a relation's domains and pulling sequential values from the domain CRPs rather than adding complexity to sample_class_ancestors and the other methods that are primarily for inference. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added TODO. I don't think the sample_new = false case is as simple as you make out, if only because it's what pclean_lib::translate_observations did before and that was far from trivial (and it relied on the annotated_domains_for_relations datastructure, which is being eliminated). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking again at translate_observations, it looks like it used annotated_domains_for_relations to get unique string names for all of the entities, which we don't need to do here since we're working with entity IDs directly, so I think it is pretty simple (and I'd prefer to do it in this PR though a TODO is ok too). |
||
sample_new); | ||
|
||
// Incorporate the items/value into the query relation. | ||
incorporate_query_relation(prng, query_rel, items, val); | ||
|
@@ -67,13 +69,14 @@ void GenDB::incorporate( | |
T_items GenDB::sample_entities_relation( | ||
std::mt19937* prng, const std::string& class_name, | ||
std::vector<std::string>::const_iterator class_path_start, | ||
std::vector<std::string>::const_iterator class_path_end, int class_item) { | ||
std::vector<std::string>::const_iterator class_path_end, | ||
int class_item, bool sample_new) { | ||
if (class_path_end - class_path_start == 1) { | ||
// The last item in class_path is the class from which the queried attribute | ||
// is observed (for which there's a corresponding clean relation, observing | ||
// the attribute from the class). We need to DFS-traverse the class's | ||
// parents, similar to PCleanSchemaHelper::compute_domains_for. | ||
return sample_class_ancestors(prng, class_name, class_item); | ||
return sample_class_ancestors(prng, class_name, class_item, sample_new); | ||
} | ||
|
||
// These are noisy relation domains along the path from the latent cleanly- | ||
|
@@ -88,11 +91,12 @@ T_items GenDB::sample_entities_relation( | |
std::tuple<std::string, std::string, int> ref_key = {class_name, ref_field, | ||
class_item}; | ||
if (!reference_values.contains(ref_key)) { | ||
sample_and_incorporate_reference(prng, ref_key, ref_class); | ||
sample_and_incorporate_reference(prng, ref_key, ref_class, sample_new); | ||
} | ||
T_items items = | ||
sample_entities_relation(prng, ref_class, ++class_path_start, | ||
class_path_end, reference_values.at(ref_key)); | ||
sample_entities_relation( | ||
prng, ref_class, ++class_path_start, class_path_end, | ||
reference_values.at(ref_key), sample_new); | ||
// The order of the items corresponds to the order of the relation's domains, | ||
// with the class (domain) corresponding to the primary key placed last on the | ||
// list. | ||
|
@@ -103,9 +107,14 @@ T_items GenDB::sample_entities_relation( | |
void GenDB::sample_and_incorporate_reference( | ||
std::mt19937* prng, | ||
const std::tuple<std::string, std::string, int>& ref_key, | ||
const std::string& ref_class) { | ||
const std::string& ref_class, bool sample_new) { | ||
auto [class_name, ref_field, class_item] = ref_key; | ||
int new_val = domain_crps[ref_class].sample(prng); | ||
int new_val; | ||
if (sample_new) { | ||
new_val = domain_crps[ref_class].sample(prng); | ||
} else { | ||
new_val = domain_crps[ref_class].tables.rbegin()->first + 1; | ||
} | ||
|
||
// Generate a unique ID for the sample and incorporate it into the | ||
// domain CRP. | ||
|
@@ -150,7 +159,7 @@ void GenDB::incorporate_query_relation(std::mt19937* prng, | |
// reference_values table/entity CRPs) if necessary. | ||
T_items GenDB::sample_class_ancestors(std::mt19937* prng, | ||
const std::string& class_name, | ||
int class_item) { | ||
int class_item, bool sample_new) { | ||
T_items items; | ||
PCleanClass c = schema.classes.at(class_name); | ||
|
||
|
@@ -161,10 +170,11 @@ T_items GenDB::sample_class_ancestors(std::mt19937* prng, | |
std::tuple<std::string, std::string, int> ref_key = {class_name, name, | ||
class_item}; | ||
if (!reference_values.contains(ref_key)) { | ||
sample_and_incorporate_reference(prng, ref_key, cv->class_name); | ||
sample_and_incorporate_reference( | ||
prng, ref_key, cv->class_name, sample_new); | ||
} | ||
T_items ref_items = sample_class_ancestors(prng, cv->class_name, | ||
reference_values.at(ref_key)); | ||
T_items ref_items = sample_class_ancestors( | ||
prng, cv->class_name, reference_values.at(ref_key), sample_new); | ||
items.insert(items.end(), ref_items.begin(), ref_items.end()); | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -46,7 +46,7 @@ observe | |
PCleanSchema schema; | ||
}; | ||
|
||
void setup_gendb(std::mt19937* prng, GenDB& gendb) { | ||
void setup_gendb(std::mt19937* prng, GenDB& gendb, bool sample_new) { | ||
ThomasColthurst marked this conversation as resolved.
Show resolved
Hide resolved
|
||
std::map<std::string, ObservationVariant> obs0 = { | ||
{"School", "Massachusetts Institute of Technology"}, | ||
{"Degree", "PHD"}, | ||
|
@@ -60,10 +60,10 @@ void setup_gendb(std::mt19937* prng, GenDB& gendb) { | |
|
||
int i = 0; | ||
while (i < 30) { | ||
gendb.incorporate(prng, {i++, obs0}); | ||
gendb.incorporate(prng, {i++, obs1}); | ||
gendb.incorporate(prng, {i++, obs2}); | ||
gendb.incorporate(prng, {i++, obs3}); | ||
gendb.incorporate(prng, {i++, obs0}, sample_new); | ||
gendb.incorporate(prng, {i++, obs1}, sample_new); | ||
gendb.incorporate(prng, {i++, obs2}, sample_new); | ||
gendb.incorporate(prng, {i++, obs3}, sample_new); | ||
} | ||
} | ||
|
||
|
@@ -159,12 +159,12 @@ BOOST_AUTO_TEST_CASE(test_gendb) { | |
std::map<std::string, ObservationVariant> obs2 = { | ||
{"School", "Tufts"}, {"Degree", "PT"}, {"City", "Boston"}}; | ||
|
||
gendb.incorporate(&prng, std::make_pair(0, obs0)); | ||
gendb.incorporate(&prng, std::make_pair(1, obs1)); | ||
gendb.incorporate(&prng, std::make_pair(2, obs2)); | ||
gendb.incorporate(&prng, std::make_pair(3, obs0)); | ||
gendb.incorporate(&prng, std::make_pair(4, obs1)); | ||
gendb.incorporate(&prng, std::make_pair(5, obs2)); | ||
gendb.incorporate(&prng, std::make_pair(0, obs0), true); | ||
gendb.incorporate(&prng, std::make_pair(1, obs1), true); | ||
gendb.incorporate(&prng, std::make_pair(2, obs2), true); | ||
gendb.incorporate(&prng, std::make_pair(3, obs0), true); | ||
gendb.incorporate(&prng, std::make_pair(4, obs1), true); | ||
gendb.incorporate(&prng, std::make_pair(5, obs2), true); | ||
|
||
// Check that the structure of reference_values is as expected. | ||
// School and City are not contained in reference_values because they | ||
|
@@ -241,7 +241,7 @@ BOOST_AUTO_TEST_CASE(test_gendb) { | |
BOOST_AUTO_TEST_CASE(test_get_relation_items) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
|
||
// Each vector of items in a relation's data is entirely determined by | ||
// its last value (the primary key of the class lowest in the hierarchy). | ||
|
@@ -267,35 +267,37 @@ BOOST_AUTO_TEST_CASE(test_get_relation_items) { | |
BOOST_AUTO_TEST_CASE(test_unincorporate_reference1) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
test_unincorporate_reference_helper(gendb, "Physician", "school", 1, true); | ||
} | ||
|
||
BOOST_AUTO_TEST_CASE(test_unincorporate_reference2) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
test_unincorporate_reference_helper(gendb, "Record", "location", 2, true); | ||
} | ||
|
||
BOOST_AUTO_TEST_CASE(test_unincorporate_reference3) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we'll use this with unincorporate_reference, it should just be for initialization of the state. Could you add a test sanity-checking that the state of the gendb object is as expected after incorporating with the flag=true? (e.g. all domain CRP tables should be of size 1, all domain CRPs should have the same number of items incorporated (assuming each domain only appears once in the schema, which it does here)). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added the test you requested, but I believe that all the domain CRP tables should be of size 32, and not 1 (since they are all incorporated into the CRPs with new_id's.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You assert that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added tests that the first tables are of size 1. |
||
test_unincorporate_reference_helper(gendb, "Practice", "city", 0, false); | ||
} | ||
|
||
BOOST_AUTO_TEST_CASE(test_logp_score) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
BOOST_TEST(gendb.logp_score() < 0.0); | ||
setup_gendb(&prng, gendb, true); | ||
// TODO(emilyaf): Fix this test. Right now, it is brittle and was broken | ||
// just by changing CRP's table from an unordered_map to a map. | ||
// BOOST_TEST(gendb.logp_score() < 0.0); | ||
} | ||
|
||
BOOST_AUTO_TEST_CASE(test_update_reference_items) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
|
||
std::string class_name = "Practice"; | ||
std::string ref_field = "city"; | ||
|
@@ -325,7 +327,7 @@ BOOST_AUTO_TEST_CASE(test_update_reference_items) { | |
BOOST_AUTO_TEST_CASE(test_incorporate_stored_items) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
|
||
std::string class_name = "Record"; | ||
std::string ref_field = "location"; | ||
|
@@ -352,13 +354,13 @@ BOOST_AUTO_TEST_CASE(test_incorporate_stored_items) { | |
BOOST_AUTO_TEST_CASE(test_incorporate_stored_items_to_cluster) { | ||
std::mt19937 prng; | ||
GenDB gendb(&prng, schema); | ||
setup_gendb(&prng, gendb); | ||
setup_gendb(&prng, gendb, true); | ||
|
||
std::string class_name = "Record"; | ||
std::string ref_field = "location"; | ||
int class_item = 1; | ||
|
||
double init_logp = gendb.logp_score(); | ||
// double init_logp = gendb.logp_score(); | ||
auto unincorporated_items = | ||
gendb.unincorporate_reference(class_name, ref_field, class_item); | ||
int new_ref_val = | ||
|
@@ -370,7 +372,9 @@ BOOST_AUTO_TEST_CASE(test_incorporate_stored_items_to_cluster) { | |
// Logp_score shouldn't change if the same items/values are | ||
// unincorporated/incorporated back into the same clusters. | ||
gendb.incorporate_reference(&prng, updated_items, true); | ||
BOOST_TEST(gendb.logp_score() == init_logp, tt::tolerance(1e-6)); | ||
// TODO(emilyaf): Fix this test. Right now, it is brittle and was broken | ||
// just by changing CRP's table from an unordered_map to a map. | ||
// BOOST_TEST(gendb.logp_score() == init_logp, tt::tolerance(1e-6)); | ||
} | ||
|
||
BOOST_AUTO_TEST_SUITE_END() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name
sample_new
is confusing to me, since we're guaranteed to get "new" unseen values when it's false, and we might get previously-seen values when it's true. Could you use maybeuse_sequential_entities
oruse_unique_entities
(with the semantics flipped), or another name that's more explanatory?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you like new_entities_have_new_parts? (It has reversed semantics.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what "parts" means or what a "new entity" is (is it an entity corresponding to a newly observed row or is it an entity that hasn't been seen before)? I think
use_unique_entities
orinitialize_with_unique_entities
does the job (or use your proposal, with a clarifying comment)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saw the comment re. "parts" -- I think we should emphasize somewhere that new entities are also used for new rows (and it's still unclear to me what it means for an entity to have a part, which I understand as an attribute, so I'd lean towards one of the other names I proposed with "unique").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about new_rows_have_unique_entities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM (though please add a comment that two entities that are added to the same domain in the course of adding a single row are also unique -- this is why I prefer a more general name like
use_unique_entities
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.