First Implementation of export cache functionality #29
Conversation
Add documentation build to CI
Improve installation instructions.
…team#25, also adjusted test to include this
…e_25 allowed for expressions in ingnore file lists, implements issue aiidateam#25, …
…this, for teting on a different system and that philipp can join debugging
…on for the builder
…into implement_caching_wc
…since localhost and local code is always different
update readme
… monkeypatch how computers and codes are included in the hash of aiida-core
ok the calcjob hash on the CI is different...
start documenting mock_code
    Requires an absolute path to the export file to load or export to.
    """
@contextmanager
def _with_export_cache(data_dir_abspath, calculation_class=None):
If we have a default value for the data dir in run_with_cache (line 299), should we have the same here?
I'd actually prefer not having a default in either of the two places.
I don't have strong feelings either way. It should just be the same for both.
if len(union_pk) != 0:
    process_node_pk = min(union_pk)
    # export data to reuse it later
    export_cache(node=load_node(process_node_pk), savepath=full_import_path)
This logic seems problematic. Suppose we prepare the inputs to a workflow with a calcfunction, which then has a lower pk. Then the export would be done for the calcfunction and not the workflow. Why don't we just use export_cache(node=resnode, ...)? Then we are sure to export the workflow.
This was implemented because I tried not to use run_get_node within the fixture. If we say it is OK that the workchain is run within the fixture (a fixture should only prepare and clean up), this part becomes obsolete. This fixture is therefore like a monkeypatch for run_get_node.
OK, but this did not work for this test, where I prepare an input to the workflow with a calcfunction, which therefore has a lower pk. Then only the calcfunction is exported, but not the workflow node itself. But maybe this can be fixed with an option for the export.
Also, we might want to provide this as an option. It could be helpful for debugging/developing workflows (see https://groups.google.com/forum/?hl=en#!topic/aiidausers/y-ekFBlLgsg from the mailing list).
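To illustrate the pitfall with the min-pk selection, here is a minimal sketch with hypothetical pk values (assuming pks are assigned in creation order): the calcfunction that prepares the inputs is created before the workchain, so it gets the lower pk and min() picks the wrong node.

```python
# Hypothetical pk values; in AiiDA, pks are assigned in creation order.
calcfunction_pk = 101  # calcfunction that prepared the workchain inputs
workchain_pk = 102     # the workchain we actually want to export

union_pk = {calcfunction_pk, workchain_pk}

# min() selects the input-preparation calcfunction, not the workchain:
assert min(union_pk) == calcfunction_pk
assert min(union_pk) != workchain_pk
```

Exporting the returned node (resnode) directly avoids relying on pk ordering altogether.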
@pytest.fixture(scope='function')
def export_cache(hash_code_by_entrypoint):
I'm not sure if this should be a fixture (same for load_cache), or just an internal helper function. Do we want users to use the bare export_cache?
I made this an extra fixture because, if we change the import/export functionality to be different from what aiida-core does, as a user I might want to use it in a test (or see what it does for debugging). In the aiida-fleur tests I use a fixture to export and import things to create test environments, because we also have some input nodes for workchains where uuids are provided. So there might be a need for something bare which is adapted for tests (like with specific hashing, or stripped/slim exports etc.). For now this defines the interface. Overall I'm also not sure if this should be just a helper, as my current purposes do not need to take it further.
""" | ||
|
||
cache_exists = False | ||
if data_dir == 'data_dir': |
What is this special case for? As far as I can tell, this doesn't actually change the behavior: below, we use pathlib.Path(full_import_path), which - if for example full_import_path is data_dir/something.tar.gz - is relative to cwd anyway, no?
You are right
print(name)

# check existence
full_import_path = str(data_dir) + '/' + name + '.tar.gz'
The hard-coded slash could be problematic: To be platform-agnostic, this should be
cache_path = pathlib.Path(data_dir) / (name + '.tar.gz')
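For illustration, pathlib's `/` operator inserts the separator appropriate for the platform, while the string concatenation always produces a forward slash (the names here are placeholders):

```python
import os
import pathlib

data_dir = "data_dir"   # placeholder directory
name = "my_test"        # placeholder cache name

# String concatenation hard-codes '/':
hardcoded = str(data_dir) + "/" + name + ".tar.gz"

# pathlib uses the platform's separator (os.sep) instead:
cache_path = pathlib.Path(data_dir) / (name + ".tar.gz")

assert str(cache_path) == f"data_dir{os.sep}my_test.tar.gz"
```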
cache_exists = False
if data_dir == 'data_dir':
    cwd = pathlib.Path(os.getcwd())  # Might be not the best idea.
Yeah, using os.getcwd() is definitely not great -- the issue is that one might run pytest from different directories (e.g., from the top-level module directory or inside the tests directory). In my opinion, both should always work. There are two approaches to this that I've seen:
- In the (top-level) conftest.py, make use of __file__ to define a fixture that gives you a constant "anchor directory". See e.g. https://github.com/FrescolinoGroup/pyhdf5io/blob/master/tests/conftest.py#L17
- Use the location of the test that you are currently running, and define paths relative to that. This is what pytest-datadir does.

Both are valid, depending on the use case (do you want to share data between tests, or have it only for a specific test / test directory). I think we should leave this choice up to the user.
I will change this. I was not familiar with pathlib, or why to use it at all instead of os. The same goes for the tests.
We use the package file path for our plugins. I also did not like cwd. Since relative paths are relative to cwd, the default was ensured to work, which was the only point of doing it that way.
load_cache(path_to_cache=full_import_path)

# now run process/workchain whatever
with enable_caching():  # should enable caching globally in this python interpreter
Shouldn't it be possible to re-use with_export_cache here? That exports the calculations, while here we explicitly export the parts connected to the node, but is this distinction really needed?
We could maybe change the with_export_cache logic to only export calculations that were created within the context manager (and their direct inputs / outputs). So, for example

run(SomeCalculation, **<inputs>)
with with_export_cache(<...>):
    run(OtherCalculation, **<inputs>)

would export the OtherCalculation, but not SomeCalculation. That would also get around the need to always clean the DB before running the test.
Currently, run_with_cache (which is kind of a patch for run_get_node) exists only because I do not know how to do this explicitly, and I want automatic cache naming that changes if the test changes; otherwise debugging becomes harder. How do you know what is run in the with environment? One idea is to query for creation time. On the other hand, if you use the default aiida export as we currently do, you get the full provenance, which means that if your SomeCalculation and OtherCalculation are connected (which within a test they usually are), they will both get exported anyway.
Yeah, I think it makes sense to just export what is created during the with statement (and its provenance). That can be either by creation time, or maybe by checking what exists already in the DB when entering the with statement - whatever works more reliably. I'm not sure if we might get into clock synchronization issues (between the DB and the python interpreter, maybe?) when using ctime.
Comment: 'Splitting' the cache currently works for both fixtures, since the export happens at the end of the with:

with with_export_cache(<...>):
    run(OtherCalculation, **<inputs>)
with with_export_cache(<...>):
    run(ThisWeCacheSeperateCalculation, **<inputs>)

So, if you know how to extract what ran in the context manager, let me know.
I will have to do some testing to see if there is a good way to extract that...
use fastentrypoints module [1] to load console scripts. not only is it faster, the alternative route via pkg_resources also throws exceptions whenever it detects dependency clashes in the python environment, causing aiida-mock-code to fail. [1] https://github.com/ninjaaron/fast-entry_points
faster entry point loading for console script
…unctions and fixture in export_cache
…tionality for with_export_cache and run_with_cache
…ting into implement_caching_wc
…ting into implement_caching_wc
Done
I think all my complaints are addressed now. I mainly tested the run_with_cache fixture, which now works fine for me.
@@ -345,6 +345,8 @@ def _run_with_cache(  # type: ignore

    # check existence
    full_import_path = pathlib.Path(data_dir) / (name + '.tar.gz')
    # make sure the path is absolute (this is needed by export_cache)
    full_import_path = full_import_path.absolute()
If this is not an absolute path, export_cache fails, because it is checked there.
I think we should check that it is an absolute path instead of just making it absolute. Otherwise the tests will depend on which directory they are run from. As much as possible, e.g. running pytest from the root directory should do the same as e.g. cd tests; pytest.
Oh yes, that slipped my mind. So how about we make this a check only and get rid of the default value? Then users would be required to set the paths to executables, the output directory etc. themselves, and we wouldn't be hiding this in the fixture.
I guess this would be cleaner.
* add cli options for config file management:
  --testing-config-action: Read config file if present ('read'), require config file ('require') or generate new config file ('generate').
  --regenerate-test-data: Regenerate test data
* add tests for new cli options
* add documentation of new cli options
* fix: check that the data directory path is absolute and exists
* fix: add quotation marks to exported environment variables (paths may contain spaces etc.)
@broeder-j @PhilippRue I think we can do the following: We merge this PR now - I will go over the functionality and try to implement the "node selection by what ran in the context manager" discussed above. After that I will create a new PR (possibly with some interface changes...), and have the two of you review it / test it on your plugins. Does that work for you?
sounds good 👍
Since it seemed to be incompatible with the changes in #30, I merged it into a separate branch for now (to keep the tests running on
Hi,
I put this here for review, @PhilippRue and @greschd since my time to work on this is sparse.
The code might not be totally fine yet; I left both fixture versions (run_with_cache, with_export_cache) side by side. Maybe we should do a first merge, so that we can continue to work on smaller things at a time.
After the slack discussion:
The fixture run_with_cache was split, so that one can use export_cache and load_cache alone. There are simple tests for these.
In order for the run_with_cache fixture test to work, I monkeypatched code.get_object_to_hash and calcjob.get_object_to_hash (since it used the computer uuid). There is a fixture for this. Thanks greschd for this suggestion.
PhilippRue has solved this issue differently, by introducing the aiida-core and mock code fixtures with session scope instead of function scope, and importing computers and codes. In aiida-kkr he has running examples with run_with_cache. This may not be a solution for several developers.
Further, I have also implemented a second fixture to run with exports, where you have to specify the export to use; feature suggestions for this came from greschd:
Known issue:
Sometimes the caching does not work when running (which causes the tests to fail), which I do not understand, since the hash of the imported and the running calcjob is the same. @greschd maybe you can spot the reason for this.
I left both fixture versions in, since I would like the export name to be chosen automatically by default. Fixes #20