05_exercise

siposl edited this page May 30, 2022 · 3 revisions

05 | Exercise - Appraising popular datasets

24.05.22

In this exercise session, we will appraise popular datasets used in research. Will we be able to understand the contained data and comprehend the steps of its creation?

Agenda

  • 10:15 - 10:20: Welcome and arrival
  • 10:20 - 10:30: In-exercise task introduction
  • 10:30 - 11:10: Task work
  • 11:10 - 11:40: Group presentations
  • 11:40 - 11:45: Goodbye and outlook

Session notes

Slides for the fifth exercise.

In-exercise Task: Datasets

Last week, we noticed some problems with the German Credit Data [1], which is hosted on the UCI Machine Learning Repository and included as an example dataset in multiple ML packages. Is this an isolated case, or can we find similar problems in other commonly used datasets?

Steps

  1. Decide on a dataset you want to analyze. For example, think of a paper you recently read and the dataset its authors used for their research.
  2. Analyze the dataset with all the methods you know of. Below are some points you may consider during your analysis.
  3. Prepare a short presentation about your findings (~ 5 minutes).
  4. Upload the presentation to: ../hcds-summer-2022/tree/main/exercise/tasks/ex05_datasets.
  5. [Optional] If you are done with your dataset (or you think it's perfect) start looking into another one.

What you may consider:

  • Does it contain metadata (e.g. a README file)?
  • Is everything clearly labeled and documented?
  • Where is the data coming from? Who collected it? How was it collected?
  • Do you understand the context of the data collection?
  • How is the data sampled? Does the sample make sense?
  • What does its distribution look like?
  • How was it curated?
  • Was it tested, and if so, how?
  • Have any assumptions been made?
  • Are you able to understand every single attribute?
  • Have any changes been made to the data?
  • Are there any known errors or inconsistencies? Is it complete?
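
Some of the points above (missing values, duplicates, distribution, cryptic attribute codes) can be checked mechanically before you read any documentation. The following is a minimal sketch using only the Python standard library. The function name `profile_csv`, the `MISSING_TOKENS` set, and the inline `SAMPLE` data are hypothetical and made up for illustration; they are not taken from any real dataset, and the tokens a dataset uses to mark missing values should be confirmed against its own documentation.

```python
import csv
import io
from collections import Counter

# Tokens assumed to encode "missing" (an assumption for this sketch;
# check the dataset's README or codebook for its real convention).
MISSING_TOKENS = {"", "?", "NA", "N/A", "NaN", "null"}


def profile_csv(text):
    """Return simple quality indicators for CSV text that has a header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    # Strip whitespace so that " car" and "car" are counted as the same value.
    rows = [tuple(cell.strip() for cell in row) for row in reader if row]

    report = {
        "n_rows": len(rows),
        # Exact duplicates only; near-duplicates need a closer look by hand.
        "n_duplicate_rows": len(rows) - len(set(rows)),
        "columns": {},
    }
    for i, name in enumerate(header):
        # Rows shorter than the header count as missing in that column.
        values = [row[i] if i < len(row) else "" for row in rows]
        present = [v for v in values if v not in MISSING_TOKENS]
        report["columns"][name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(3),
        }
    return report


# Hypothetical sample data, invented purely to exercise the function.
SAMPLE = """age,purpose,credit_amount
35,car,1200
?,furniture,800
35,car,1200
52,A43,
"""

report = profile_csv(SAMPLE)
print("rows:", report["n_rows"], "duplicates:", report["n_duplicate_rows"])
for name, stats in report["columns"].items():
    print(name, stats)
```

A summary like this only points at candidates for closer inspection, such as an undocumented code like `A43` in an otherwise readable column. Questions about provenance, sampling, and collection context still have to be answered from the dataset's documentation and the papers that introduced it.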

Resources

Here is an incomplete list of places that host or link to popular ML datasets and academic papers (where datasets may be introduced or used).

Dataset Repositories:

List of datasets used for research: https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

Finding papers:


References

[1] Professor Dr. Hans Hofmann (1994). Statlog (German Credit Data) Data Set. UCI Machine Learning Repository.