Refactor DataCollectorV0 and HDF5 dependencies isolation #133
Conversation
Force-pushed from 5e95c97 to 8560018.
# Check that we get max(max_episode_steps) when there is no max_episode_steps=None
test_datasets.pop()
# testing without creating a copy
What is the reason for removing these checks?
It is no longer possible to have copy=False, as it was causing problems (the episode ids were not consistent).
I see, so this test removal goes together with removing the optional copy argument? That seems reasonable, assuming we must remove that feature to isolate the h5py dependence to MinariStorage.
It is not specific to the h5py isolation; it is a bug in the current code: when you combine two datasets with copy=False, the episode ids of the second dataset are modified, i.e. they no longer start from 0.
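A minimal sketch of that pitfall (illustrative names only, not the actual Minari internals):

# Two datasets whose episodes are both indexed from 0.
episodes_a = [{"id": 0}, {"id": 1}]
episodes_b = [{"id": 0}, {"id": 1}]

# Combining requires re-indexing the second dataset's ids to follow the
# first (2, 3, ...). Without a copy, the re-indexing mutates the second
# dataset in place, so its own ids no longer start from 0 afterwards.
for offset, episode in enumerate(episodes_b):
    episode["id"] = len(episodes_a) + offset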
I noticed there is still an h5py import in utils.py, and it is used in get_normalized_score.
Is this by design, and if so, what is blocking removing the h5py dependency there? Other than this and the other question I had, this looks good overall. It adds new tests and passes the existing tests.
Here are profiling results.
Oh, thanks for spotting this, it wasn't by design; I fixed it.
Force-pushed from 9661a84 to 47c4a68.
Just had one more question; otherwise, once this is passing pre-commit, I think it's ready to merge :^)
@@ -547,7 +549,7 @@ def check_load_and_delete_dataset(dataset_id: str):

 def create_dummy_dataset_with_collecter_env_helper(
-    dataset_id: str, env: DataCollectorV0, num_episodes: int = 10
+    dataset_id: str, env: DataCollectorV0, num_episodes: int = 10, **kwargs
What is the purpose of the kwargs?
It is needed for this new test; it is a general set of kwargs that can be passed through to the create_dataset_from_collector_env function:
Minari/tests/utils/test_get_normalized_score.py, lines 20 to 27 in c5c88e8:

ref_min_score, ref_max_score = -1, 100
dataset = create_dummy_dataset_with_collecter_env_helper(
    "cartpole-test-v0",
    env,
    num_episodes=num_episodes,
    ref_min_score=ref_min_score,
    ref_max_score=ref_max_score,
)
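So the helper presumably just forwards the extra kwargs. A minimal sketch of what that could look like; only the def line comes from the diff above, while the rollout loop and the create_dataset_from_collector_env call are assumptions:

import minari
from minari import DataCollectorV0

def create_dummy_dataset_with_collecter_env_helper(
    dataset_id: str, env: DataCollectorV0, num_episodes: int = 10, **kwargs
):
    env.reset(seed=42)
    for _ in range(num_episodes):
        done = False
        while not done:
            # random policy; the collector wrapper records each transition
            _, _, terminated, truncated, _ = env.step(env.action_space.sample())
            done = terminated or truncated
        env.reset()
    # extra kwargs such as ref_min_score/ref_max_score become dataset metadata
    return minari.create_dataset_from_collector_env(
        dataset_id=dataset_id, collector_env=env, **kwargs
    )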
Ready to merge :^), but one last question: what if two datasets that are themselves each the result of combining two datasets are combined? Will this situation be handled correctly?
Thanks! Yes, this should not create any issue, as the newly generated dataset is a perfectly normal dataset (everything is copied), just with different metadata.
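For reference, a usage sketch with minari.combine_datasets; the dataset ids here are hypothetical:

import minari

ab = minari.combine_datasets(
    [minari.load_dataset("env-a-v0"), minari.load_dataset("env-b-v0")],
    new_dataset_id="env-ab-v0",
)
cd = minari.combine_datasets(
    [minari.load_dataset("env-c-v0"), minari.load_dataset("env-d-v0")],
    new_dataset_id="env-cd-v0",
)
# Each combine produces a fully copied, ordinary dataset, so a
# second-level combine behaves like any other combine.
abcd = minari.combine_datasets([ab, cd], new_dataset_id="env-abcd-v0")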
Motivation
In the future, we aim to consider storage methods other than HDF5 files for our datasets (see #98), as other backends can be faster.
Currently, several files depend on h5py directly; this PR isolates that dependency inside MinariStorage.
If we add support for another storage backend, we then simply need to create a new MinariStorage that implements the same interface.
Changes
To achieve the h5py isolation, we make DataCollectorV0, the functions in utils, and MinariDataset interface with MinariStorage instead of with h5py directly. Thus, MinariStorage should offer an API covering the most used operations.
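A rough sketch of the kind of interface this implies; the method names and signatures are assumptions for illustration, not the exact Minari API:

from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional

class AbstractStorage(ABC):
    """Backend-agnostic contract that MinariStorage-style classes fulfill."""

    @abstractmethod
    def get_episodes(self, episode_indices: List[int]) -> List[dict]:
        """Load the requested episodes from the backend."""

    @abstractmethod
    def apply(
        self,
        function: Callable[[dict], object],
        episode_indices: Optional[List[int]] = None,
    ) -> List[object]:
        """Run a function over episodes without exposing backend handles."""

    @abstractmethod
    def update_metadata(self, metadata: Dict) -> None:
        """Persist dataset-level metadata."""

An HDF5 implementation would wrap its h5py calls behind methods like these; an alternative backend would implement the same contract.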
This refactor makes the natural dependencies DataCollectorV0 -> MinariStorage and MinariDataset -> MinariStorage. To avoid a MinariDataset -> DataCollectorV0 dependency, we change the current API for adding the buffer of a DataCollector to a MinariDataset; see the sketch below.
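As a hedged sketch of the direction of the change (the method names are assumed from the discussion, not verified against the final code):

# Assumed names: dataset is a MinariDataset, collector_env a DataCollectorV0.

# Before: the dataset pulled the buffer out of the collector, forcing
# MinariDataset to import DataCollectorV0.
dataset.update_dataset_from_collector_env(collector_env)

# After: the collector writes its buffer through the dataset's storage,
# so both classes depend only on MinariStorage.
collector_env.add_to_dataset(dataset)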
Other notable user-facing changes: MinariDataset now exposes a new .storage attribute.
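Assuming the attribute works like this (a hypothetical usage sketch):

import minari

dataset = minari.load_dataset("cartpole-test-v0")
storage = dataset.storage  # the MinariStorage backing this dataset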