Hi,

The following is most likely a misunderstanding on my part, but I notice that there are many duplicates and pseudo-duplicates in the jsonl files.
For instance, this line in `lama/TREx/P17.jsonl`:

```json
{
  "uuid": "df10f035-6269-4cdf-88df-26395e0dc3b4",
  "obj_uri": "Q16",
  "obj_label": "Canada",
  "sub_uri": "Q7517499",
  "sub_label": "Simcoe Composite School",
  "predicate_id": "P17",
  "evidences": [{
    "sub_surface": "Simcoe Composite School",
    "obj_surface": "Canada",
    "masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
  }, {
    "sub_surface": "Simcoe Composite School",
    "obj_surface": "Canada",
    "masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
  }]
}
```
has two entries in `evidences`, both of which are identical. This is not always the case; in many other records the evidences are different sentences.
Further, in the ConceptNet corpus, apparently every UUID appears twice. As an example, here are two instances with the same UUID:
```json
{
  "sub": "alive",
  "obj": "think",
  "pred": "HasSubevent",
  "masked_sentences": ["One of the things you do when you are alive is [MASK]."],
  "obj_label": "think",
  "uuid": "d4f11631dde8a43beda613ec845ff7d1"
}
```
and
```json
{
  "pred": "HasSubevent",
  "masked_sentences": ["One of the things you do when you are alive is [MASK]."],
  "obj_label": "think",
  "uuid": "d4f11631dde8a43beda613ec845ff7d1",
  "sub_label": "alive"
}
```
Here, the second instance lacks the `sub` and `obj` fields but otherwise seems unchanged.
So, based on this, my question is:
Are the duplicates intentional? For instance, when computing metrics for my model over the probe, should I treat the task as-is and, if need be, make predictions twice over the same instance?
Alternatively, I could easily remove the duplicates when processing the files. Should I do that instead? Have others done that?
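In case deduplication turns out to be the right approach, here is a minimal sketch of what I had in mind (the function name and the exact logic are my own, not part of the LAMA tooling): drop repeated records by `uuid` and collapse identical entries inside `evidences`.

```python
import json

def dedupe_jsonl(in_path, out_path):
    """Copy a LAMA-style jsonl file, keeping only the first record per uuid
    and removing identical evidences within each record."""
    seen_uuids = set()
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            uid = rec.get("uuid")
            if uid in seen_uuids:
                continue  # e.g. the ConceptNet records that appear twice
            seen_uuids.add(uid)
            # Collapse identical evidences within a single record (TREx case).
            if "evidences" in rec:
                seen_ev, kept = set(), []
                for ev in rec["evidences"]:
                    key = json.dumps(ev, sort_keys=True)
                    if key not in seen_ev:
                        seen_ev.add(key)
                        kept.append(ev)
                rec["evidences"] = kept
            fout.write(json.dumps(rec) + "\n")
```

Keying on the serialized evidence (`sort_keys=True`) treats two evidences as duplicates only if every field matches, which seems safer than comparing `masked_sentence` alone.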
I know for a fact that LAMA on HuggingFace datasets (https://huggingface.co/datasets/lama) contains these duplicates.