Add sample_new parameter to gendb::incorporate #215

ThomasColthurst · 2024-09-30T14:20:06Z

This is required for GenDB to replicate existing pclean behavior (via sample_new=False) and unblock #212.

emilyfertig · 2024-09-30T15:41:08Z

cxx/gendb.cc

@@ -53,7 +54,8 @@ void GenDB::incorporate(
        schema.query.fields.at(query_rel).class_path;
    T_items items =
        sample_entities_relation(prng, schema.query.record_class,
-                                 class_path.cbegin(), class_path.cend(), id);
+                                 class_path.cbegin(), class_path.cend(), id,


A lot of the machinery in sample_class_ancestors etc. exists to walk the DAG of references and make sure the reference values are coherent (i.e. if Physician 5 went to School 3, that holds in all samples). Am I correct that when sample_new is false, we never see the same entity (e.g. the same Physician) more than once, so we never actually have to walk the DAG? If so, I think we should implement this by simply iterating over a relation's domains and pulling sequential values from the domain CRPs, rather than adding complexity to sample_class_ancestors and the other methods that are primarily for inference.

Added TODO. I don't think the sample_new = false case is as simple as you make out, if only because it's what pclean_lib::translate_observations did before, and that was far from trivial (it also relied on the annotated_domains_for_relations data structure, which is being eliminated).

Looking again at translate_observations, it looks like it used annotated_domains_for_relations to get unique string names for all of the entities, which we don't need to do here since we're working with entity IDs directly, so I think it is pretty simple (and I'd prefer to do it in this PR though a TODO is ok too).
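
As a rough illustration of the "iterate over a relation's domains and pull sequential values" idea discussed above, here is a minimal sketch. ToyCRP and unique_entities_for_row are hypothetical names invented for this example; they are not the repository's CRP class or any GenDB method, and the real code may track entities differently.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a domain CRP (not the repository's CRP class).
// It only records how many items sit at each table.
struct ToyCRP {
  std::map<int, int> table_sizes;  // table id -> number of items
  int next_table = 0;              // next unused table id

  // Seats one item at a brand-new table and returns that table's id.
  int incorporate_at_new_table() {
    int table = next_table++;
    table_sizes[table] = 1;
    return table;
  }
};

// Assumed helper: for one row, hand every domain of the relation a fresh
// entity id by taking the next sequential value from that domain's CRP.
// No walk over the reference DAG is needed, because no entity is reused.
std::vector<int> unique_entities_for_row(
    const std::vector<std::string>& domains,
    std::map<std::string, ToyCRP>& domain_crps) {
  std::vector<int> items;
  items.reserve(domains.size());
  for (const std::string& domain : domains) {
    items.push_back(domain_crps[domain].incorporate_at_new_table());
  }
  return items;
}
```

In this sketch, a domain that appears twice in the same relation receives two distinct ids within a single row, which is the behavior requested later in the thread.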

emilyfertig · 2024-09-30T15:44:59Z

cxx/gendb.cc

@@ -40,7 +40,8 @@ double GenDB::logp_score() const {

 void GenDB::incorporate(
    std::mt19937* prng,
-    const std::pair<int, std::map<std::string, ObservationVariant>>& row) {
+    const std::pair<int, std::map<std::string, ObservationVariant>>& row,
+    bool sample_new) {


The name sample_new is confusing to me, since we're guaranteed to get "new" unseen values when it's false, and we might get previously-seen values when it's true. Could you use maybe use_sequential_entities or use_unique_entities (with the semantics flipped), or another name that's more explanatory?

How do you like new_entities_have_new_parts? (It has reversed semantics.)

I don't understand what "parts" means or what a "new entity" is: is it an entity corresponding to a newly observed row, or an entity that hasn't been seen before? I think use_unique_entities or initialize_with_unique_entities does the job (or use your proposal, with a clarifying comment).

Just saw the comment re. "parts" -- I think we should emphasize somewhere that new entities are also used for new rows (and it's still unclear to me what it means for an entity to have a part, which I understand as an attribute, so I'd lean towards one of the other names I proposed with "unique").

How about new_rows_have_unique_entities?

SGTM (though please add a comment that two entities that are added to the same domain in the course of adding a single row are also unique -- this is why I prefer a more general name like use_unique_entities).

emilyfertig · 2024-10-01T16:45:37Z

cxx/gendb_test.cc

+BOOST_AUTO_TEST_CASE(test_unincorporate_reference_new_entities_have_new_parts) {
+  std::mt19937 prng;
+  GenDB gendb(&prng, schema);
+  setup_gendb(&prng, gendb, true);


I don't think we'll use this with unincorporate_reference; it should only be needed for initializing the state. Could you add a test sanity-checking that the state of the gendb object is as expected after incorporating with the flag set to true? For example, all domain CRP tables should be of size 1, and all domain CRPs should have the same number of items incorporated (assuming each domain only appears once in the schema, which it does here).

Added the test you requested, but I believe that all the domain CRP tables should be of size 32, and not 1 (since they are all incorporated into the CRPs with new_id's.)

You assert that tables.size() == 32, which means that there are 32 tables, each of size 1 (for completeness you could test that each of the tables is indeed size 1, though testing that there are 32 tables and N=32, as you do, is adequate IMO).

Added tests that the first tables are of size 1.
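
The check discussed above (32 tables, N = 32, every table of size 1) can be sketched in isolation as follows. MiniCRP, crp_with_unique_entities, and all_tables_are_singletons are hypothetical names made up for this example; the field names N and tables mirror the ones mentioned in the thread, but this is not the repository's CRP class or the actual test.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

// Hypothetical, simplified CRP used only for this illustration.
struct MiniCRP {
  int N = 0;  // total number of incorporated items
  std::unordered_map<int, std::unordered_set<int>> tables;  // table -> items

  void incorporate(int item, int table) {
    tables[table].insert(item);
    ++N;
  }
};

// Assumed setup: each of num_rows rows contributes one entity, and every
// entity is seated at its own new table.
MiniCRP crp_with_unique_entities(int num_rows) {
  MiniCRP crp;
  for (int row = 0; row < num_rows; ++row) {
    crp.incorporate(/*item=*/row, /*table=*/row);
  }
  return crp;
}

// Returns true when there are exactly N tables and each holds one item.
bool all_tables_are_singletons(const MiniCRP& crp) {
  if (crp.tables.size() != static_cast<std::size_t>(crp.N)) return false;
  for (const auto& entry : crp.tables) {
    if (entry.second.size() != 1) return false;
  }
  return true;
}
```

Checking that the table count equals N already implies every table has size 1 when no table is empty, which is why the reviewer considered the per-table assertion optional.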

emilyfertig · 2024-10-01T18:19:06Z

If you want, I can push another commit to this branch that avoids walking the DAG and plumbing the bool arg through.

If it's ok with you I'd like to merge #217 before this (and #212) -- I'm happy to help with merge conflicts, just I'm scared of breaking the Gibbs sampler and that seems like an easier order to do it in.

ThomasColthurst · 2024-10-01T19:40:19Z

> If you want, I can push another commit to this branch that avoids walking the DAG and plumbing the bool arg through.

> If it's ok with you I'd like to merge #217 before this (and #212) -- I'm happy to help with merge conflicts, just I'm scared of breaking the Gibbs sampler and that seems like an easier order to do it in.

My preference would be for you to replace the bool arg plumbing in a separate pull request.

#217 seems like it is still rather far from being checked-in. My preference would be to check this and #212 in first.

emilyfertig · 2024-10-01T22:14:49Z

The comments on #217 were all pretty minor and are now addressed, so I think it should be close. I'll resolve the merge conflicts by cloning this branch and rebasing the commits onto #217. Then, if you want, you can update your PR by overwriting your branch with the cloned branch (I'll send you the commands). That ordering will be easier for me than merging this and trying to rebase #217 on top of it.

emilyfertig · 2024-10-01T22:35:35Z

Actually the merge conflicts weren't as bad as I thought so we can check in either first.

ThomasColthurst added 4 commits September 27, 2024 21:41

Add sample_new option to incorporate methods

85a19dc

Debug printfs

10e88b9

Comment out brittle tests

82bc4cd

Fix build warning

9e12a66

ThomasColthurst requested a review from emilyfertig September 30, 2024 14:20

emilyfertig reviewed Sep 30, 2024

ThomasColthurst added 2 commits October 1, 2024 16:03

Add test and fix test

4e536ef

Rename sample_new

73fb3d5

emilyfertig reviewed Oct 1, 2024

Add basic test of new_entities_have_new_parts

d6ba9ad

emilyfertig approved these changes Oct 1, 2024

Respond to reviewer comments

8f10e0f

ThomasColthurst merged commit c87569a into master Oct 2, 2024
2 checks passed

ThomasColthurst deleted the 092724-thomaswc-sample_new branch October 2, 2024 14:08

ThomasColthurst mentioned this pull request Oct 2, 2024

Merge GenDB and SchemaHelper; use GenDB in pclean binary #212

Merged
