-
-
Notifications
You must be signed in to change notification settings - Fork 255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample naming #248
Add sample naming #248
Conversation
I don't think having named samples is a good idea, because there's almost always way more samples than features, and dedicating one string per sample can get out of hand very quickly. Also, sample names don't play well with methods that reorder or extract samples from the dataset (this isn't done for features). I do agree about In #70 it mentions target names, so you could take a crack at that instead. |
I understand your doubts, but I think that we can resolve them. Regarding the memory requirements, the sample names are optional. If for some use cases these might be memory-consuming, they can just be ignored. We can also set a string capacity for sample names if we anticipate performance issues in large datasets. Regarding the reordering of the samples, I agree that it should be done carefully when sample names are present. But we already do it for the targets. Anyway, as far as I know, the reordering of samples is only done in the cross-validation, and in this case, we can ignore the sample names, since the intermediate databases are not returned by the method. Thanks for your suggestions for |
The memory usage would be too large for most real-world datasets. Capping the string capacity doesn't solve this because the problem is the number of strings, which must match the number of samples, leading to an unacceptable number of allocations. The usefulness of sample names is also limited IMO, since you can easily track down an outlier by finding it in the original dataset. Also, most datasets for ML don't come with sample names that you can just import, so users will need to go through the trouble of creating their own generic names for every sample, making it hard to use. Sample names also cause trouble with methods that extracts subsets of samples from a dataset, such as Finally, in the MR is Gitlab-speak for PR. I mixed it up since I'm using it at my day job lol. |
Sample naming won't be implemented. Target naming is implemented by #373. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #248 +/- ##
==========================================
+ Coverage 39.24% 39.52% +0.28%
==========================================
Files 92 92
Lines 6085 6150 +65
==========================================
+ Hits 2388 2431 +43
- Misses 3697 3719 +22 ☔ View full report in Codecov by Sentry. |
It might be useful to have named samples for debugging and exploring the data, e.g. identifying outliers.
Comments:
sample_names
(andfeature_names
) returning a non-empty vector even if the names for the samples are not set? As it is, whensample_names
is empty, it returnsvec!["sample-1", "sample-2",..,"sample-{nsamples}"]
. The same happens forfeature_names
. However, in some methods where new dataset are created transforming the input dataset, the output datasets have a non emptysample_names
field even if the input dataset has an emptysample_names
field. For instance, it happens for the preprocessing methods. An alternative choice could be to make public thefeature_names
and thesample_names
fields (as for theweights
field), and to clone them if needed.test_retain_sample_names
for the preprocessing methods. Is it ok?with_feature_names
does not check if the given vector has length nfeatures. Do you want to implement it in this PR, as I have done forwith_sample_names
?