
First Implementation of export cache functionality #29

Merged
49 commits merged into aiidateam:export_cache on Apr 3, 2020

Conversation

@broeder-j (Member) commented Mar 12, 2020

Hi,
I put this here for review, @PhilippRue and @greschd, since my time to work on this is limited.
The code might not be completely polished yet; I left both fixture versions (run_with_cache, with_export_cache) side by side. Maybe we should do a first merge, so that we can continue to work on smaller things one at a time.

After the slack discussion:
The run_with_cache fixture was split so that export_cache and load_cache can also be used on their own. There are simple tests for these.
In order for the run_with_cache fixture test to work, I monkey-patched code.get_object_to_hash and calcjob.get_object_to_hash (since the latter used the computer uuid). There is a fixture for this. Thanks @greschd for this suggestion.
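A minimal sketch of what such a monkeypatch fixture could look like (the patched attribute name and the replacement hash contents here are assumptions for illustration, not the actual implementation in this PR):

import pytest

@pytest.fixture
def hash_code_by_entrypoint(monkeypatch):
    """Sketch: make Code/CalcJob hashes independent of the computer uuid."""
    from aiida.orm import CalcJobNode, Code

    def mock_objects_to_hash(self):
        # hash only by node class (and, for codes, the input plugin label),
        # so that the computer uuid does not enter the hash
        return [str(self.__class__), self.get_attribute('input_plugin', None)]

    # '_get_objects_to_hash' is assumed here; the real aiida-core attribute may differ
    monkeypatch.setattr(Code, '_get_objects_to_hash', mock_objects_to_hash, raising=False)
    monkeypatch.setattr(CalcJobNode, '_get_objects_to_hash', mock_objects_to_hash, raising=False)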

@PhilippRue has solved this issue differently, by giving the aiida-core and mock-code fixtures session scope instead of function scope and by importing computers and codes. In aiida-kkr he has running examples using run_with_cache. This may not be a solution that works for several developers, though.

  • The builder hash now supports nested namespaces.

Further, I have also implemented a second fixture for running with exports, where you have to specify the export to use (see the usage sketch after this list). Feature suggestions for this by @greschd:

  • allow for multiple calculation classes (currently all or one; maybe with_caching in aiida-core has to support it first)
  • making sure computer/code hashes are the same on different machines (should work with the monkeypatch fixture; there is a test for this)
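A usage sketch for this second fixture, based on the _with_export_cache signature quoted further down in this review; the calculation class, inputs and path are placeholders:

from aiida.engine import run

def test_some_calculation(with_export_cache):
    # the export archive location has to be specified explicitly for this fixture
    with with_export_cache(data_dir_abspath='/abs/path/to/data_dir',
                           calculation_class=SomeCalculation):
        run(SomeCalculation, **inputs)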

Known issue:
Sometimes the caching does not work when running (which causes the tests to fail), and I do not understand why, since the hash of the imported calcjob and the one being run is the same. @greschd, maybe you can spot the reason for this.

  • Then the run_with_cache fixture should also work with yield.
  • Maybe the diff workchain for testing should not be in the test file?

I left both fixture versions in, since I would like the export name to be chosen automatically by default. Fixes #20

Dominik Gresch and others added 26 commits February 27, 2020 13:51
Improve installation instructions.
…e_25

allowed for expressions in ignore file lists, implements issue aiidateam#25, …
…this, for testing on a different system and so that Philipp can join debugging
…since localhost and local code is always different
… monkeypatch how computers and codes are included in the hash of aiida-core
@broeder-j broeder-j requested review from PhilippRue and greschd March 12, 2020 12:20
@broeder-j (Member Author):

ok the calcjob hash on the CI is different...

Requires providing an absolute path to the export file to load from or export to.
"""
@contextmanager
def _with_export_cache(data_dir_abspath, calculation_class=None):
Contributor:

if we have a default value for the data dir in run_with_cache (line 299) should we have the same here?

Member:

I'd actually prefer not having a default in either of the two places.

Contributor:

I don't have strong feelings towards one or the other; it should just be the same for both ways.

Comment on lines 364 to 367
if len(union_pk) != 0:
    process_node_pk = min(union_pk)
    # export data to reuse it later
    export_cache(node=load_node(process_node_pk), savepath=full_import_path)
Contributor:

This logic seems problematic. Suppose we prepare the inputs to a workflow with a calcfunction, which then has a lower pk. Then the export would be done for the calcfunction and not for the workflow.

Why don't we just use export_cache(node=resnode, ...)? Then we are sure to export the workflow.
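A sketch of the suggested alternative, assuming the fixture runs the process via run_get_node (process_class, inputs and full_import_path are placeholders):

from aiida.engine import run_get_node

# run the process inside the fixture and export the returned workflow node,
# instead of picking the node with the lowest pk
res, resnode = run_get_node(process_class, **inputs)
export_cache(node=resnode, savepath=full_import_path)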

Member Author:

This was implemented because I tried not to use run_get_node within the fixture. If we say it is ok that the workchain is run within the fixture (a fixture should only prepare and clean up), this part becomes obsolete. This fixture is therefore like a monkeypatch for run_get_node.

Contributor:

OK, but this did not work for this test, where I prepare an input to the workflow with a calcfunction, which therefore has a lower pk. Then only the calcfunction is exported, but not the workflow node itself.

But maybe this can be fixed with an option for the export.

@PhilippRue (Contributor):

Also we might want to provide the option for the with_export_cache to overwrite the output file. Now the export file is only written if no previous file is found.

This could be helpful for debugging/developing workflows (see https://groups.google.com/forum/?hl=en#!topic/aiidausers/y-ekFBlLgsg from mailing list).
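A rough sketch of how such an overwrite option could look; the overwrite flag and the _select_node_to_export helper are hypothetical, while load_cache, export_cache and enable_caching are used as elsewhere in this PR:

import pathlib
from contextlib import contextmanager

from aiida.manage.caching import enable_caching

@contextmanager
def _with_export_cache(data_dir_abspath, calculation_class=None, overwrite=False):
    """Sketch: rewrite the export archive when overwrite=True or when no archive exists."""
    full_import_path = pathlib.Path(data_dir_abspath)
    cache_exists = full_import_path.is_file()
    if cache_exists and not overwrite:
        # reuse the existing archive
        load_cache(path_to_cache=str(full_import_path))
    with enable_caching():
        yield
    if overwrite or not cache_exists:
        # node selection stays as in the current implementation (hypothetical helper)
        export_cache(node=_select_node_to_export(), savepath=str(full_import_path))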



@pytest.fixture(scope='function')
def export_cache(hash_code_by_entrypoint):
Member:

I'm not sure if this should be a fixture (same for load_cache), or just an internal helper function. Do we want users to use the bare export_cache?

Member Author:

I made this an extra fixture because, if we change the import/export functionality to be different from what aiida-core does, as a user I might want to use it in a test (or see what it does for debugging). In the aiida-fleur tests I use a fixture to export and import things to create test environments, because we also have some input nodes for workchains where uuids are provided. So there might be a need to have something bare which is adapted for tests (like specific hashing, or stripped/slim exports, etc.). For now this defines the interface. Overall, I'm also not sure if this should be just a helper, since for my current purposes I do not need to take it further.
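For illustration, a bare usage in a test could look roughly like this (the test body and the Dict node are made up; the export_cache/load_cache call signatures follow the ones used elsewhere in this PR):

from aiida.orm import Dict

def test_prepare_environment(export_cache, load_cache, tmp_path):
    # store an input node that the test environment needs
    node = Dict(dict={'parameter': 1}).store()
    archive = str(tmp_path / 'environment.tar.gz')
    export_cache(node=node, savepath=archive)
    # later (or in another test) the archive can be re-imported
    load_cache(path_to_cache=archive)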

"""

cache_exists = False
if data_dir == 'data_dir':
Member:

What is this special case for? As far as I can tell, this doesn't actually change the behavior: Below, we use pathlib.Path(full_import_path), which - if for example full_import_path is data_dir/something.tar.gz - is relative to cwd anyway, no?

Member Author:

You are right

print(name)

# check existence
full_import_path = str(data_dir) + '/' + name + '.tar.gz'
Member:

The hard-coded slash could be problematic: To be platform-agnostic, this should be

cache_path = pathlib.Path(data_dir) / (name + '.tar.gz')


cache_exists = False
if data_dir == 'data_dir':
    cwd = pathlib.Path(os.getcwd())  # Might be not the best idea.
Member:

Yeah, using os.getcwd() is definitely not great -- the issue is that one might run pytest from different directories (e.g., from the top-level module directory or inside the tests directory).

In my opinion, both should always work. There are two approaches to this that I've seen:

Both are valid, depending on the use case (do you want to share data between tests, or have it only for a specific test / test directory). I think we should leave this choice up to the user.
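One common way to make the path independent of where pytest is invoked is to resolve it relative to the test file via pytest's request fixture; a sketch, not what this PR implements:

import pathlib

import pytest

@pytest.fixture
def data_dir(request):
    # resolve 'data_dir' next to the test module instead of relative to os.getcwd()
    return pathlib.Path(str(request.fspath)).parent / 'data_dir'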

Member Author:

I will change this. I was not familiar with pathlib and why to use it at all instead of os. The same goes for the tests.

Member Author:

We use the package file path for our plugins. I also did not like cwd. Since relative paths are relative to cwd, the default is ensured to work, which was the only point of doing it this way.

load_cache(path_to_cache=full_import_path)

# now run process/workchain whatever
with enable_caching(): # should enable caching globally in this python interpreter
Member:

Shouldn't it be possible to re-use with_export_cache here? That exports the calculations while here we explicitly export the parts connected to the node, but is this distinction really needed?

We could maybe change the with_enable_caching logic to only export calculations that were created within the context manager (and their direct inputs / outputs). So, for example

run(SomeCalculation, **<inputs>)
with with_export_cache(<...>):
    run(OtherCalculation, **<inputs>)

would export the OtherCalculation, but not SomeCalculation. That would also get around the need to always clean the DB before running the test.

Member Author:

Currently, run_with_cache (which is kind of a patch for run_get_node) exists only because I do not know how to do this explicitly, and I want automatic cache naming which changes if the test changes; otherwise debugging becomes harder. How do you know what is run inside the with block? One idea is to query by creation time. On the other hand, if you use the default aiida export as we currently do, you get the full provenance, which means that if your SomeCalculation and OtherCalculation are connected (which within a test they usually are), they will both get exported anyway.

Member:

Yeah, I think it makes sense to just export what is created during the with statement (and its provenance). That can be either by creation time, or maybe by checking what exists already in the DB when entering the with statement - whatever works more reliably.
I'm not sure if we might get into clock synchronization issues (between the DB and the python interpreter, maybe?) when using ctime.
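A possible sketch of the "check what exists when entering" approach (not part of this PR; the QueryBuilder usage is only indicative):

from contextlib import contextmanager

from aiida.orm import ProcessNode, QueryBuilder

def _process_node_pks():
    """Return the set of pks of all process nodes currently in the DB."""
    return {pk for pk, in QueryBuilder().append(ProcessNode, project='id').iterall()}

@contextmanager
def _track_new_processes(collected):
    """Collect into `collected` the pks of process nodes created inside the block."""
    pks_before = _process_node_pks()
    yield
    collected.extend(sorted(_process_node_pks() - pks_before))

with_export_cache could then export only the collected nodes (plus their provenance), independent of what was already in the DB before the block.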

Member Author:

Comment: 'splitting' the cache currently works for both fixtures, since the export happens at the end of the with block.

with with_export_cache(<...>):
    run(OtherCalculation, **<inputs>)
with with_export_cache(<...>):
    run(ThisWeCacheSeperateCalculation, **<inputs>)

So, if you know how to extract what ran in the context manager, let me know.

Member:

I will have to do some testing to see if there is a good way to extract that...

ltalirz and others added 7 commits March 31, 2020 13:15
use fastentrypoints module [1] to load console scripts.
not only is it faster, the alternative route via pkg_resources also
throws exceptions whenever it detects dependency clashes in the python
environment, causing aiida-mock-code to fail.

[1] https://github.com/ninjaaron/fast-entry_points
faster entry point loading for console script
…tionality for with_export_cache and run_with_cache
@broeder-j (Member Author):

Also we might want to provide the option for the with_export_cache to overwrite the output file. Now the export file is only written if no previous file is found.

This could be helpful for debugging/developing workflows (see https://groups.google.com/forum/?hl=en#!topic/aiidausers/y-ekFBlLgsg from mailing list).

Done

@PhilippRue (Contributor) previously approved these changes Apr 1, 2020 and left a comment:

I think all my complaints are addressed now. I mainly tested the run_with_cache fixture, which now works fine for me.

@@ -345,6 +345,8 @@ def _run_with_cache( # type: ignore

# check existence
full_import_path = pathlib.Path(data_dir) / (name + '.tar.gz')
# make sure the path is absolute (this is needed by export_cache)
full_import_path = full_import_path.absolute()
Contributor:

If this is not an absolute path, export_cache fails, because this is checked there.

Member:

I think we should check that it is an absolute path instead of just making it absolute. Otherwise the tests will depend on which directory they are run from.

As much as possible, running pytest from the root directory should do the same as, e.g., cd tests; pytest.

Contributor:

Oh yes, that slipped my mind. So how about we make this a check only and get rid of the default value? Then users would be required to set the paths to executables, the output directory, etc. themselves, and we wouldn't be hiding this in the fixture.
I guess this would be cleaner.
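A sketch of the check-only variant being discussed (the helper name is made up; data_dir and name as in the diff above):

import pathlib

def _resolve_import_path(data_dir, name):
    """Sketch: build the archive path and require the data directory to be absolute."""
    full_import_path = pathlib.Path(data_dir) / (name + '.tar.gz')
    if not full_import_path.is_absolute():
        raise ValueError('data_dir must be an absolute path, got: {}'.format(data_dir))
    return full_import_path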

* add cli options for config file management

Add cli options:

  --testing-config-action: Read config file if present ('read'), require config file ('require') or generate new config file ('generate').
  --regenerate-test-data: Regenerate test data

* add tests for new cli options

* add documentation of new cli options

* fix: check that the data directory path is absolute and exists

* fix: add quotation marks to exported environment variables

Paths may contain spaces etc.
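A minimal sketch of how such options are typically registered in a pytest plugin; the option names come from the commit message above, while the defaults and help texts are assumptions:

def pytest_addoption(parser):
    """Register the config-file management options (sketch)."""
    parser.addoption(
        '--testing-config-action',
        choices=['read', 'require', 'generate'],
        default='read',
        help="Read the config file if present ('read'), require it ('require'), "
        "or generate a new one ('generate').",
    )
    parser.addoption(
        '--regenerate-test-data',
        action='store_true',
        default=False,
        help='Regenerate test data.',
    )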
@greschd (Member) commented Apr 2, 2020

@broeder-j @PhilippRue I think we can do the following: We merge this PR now - I will go over the functionality and try to implement the "node selection by what ran in the context manager" discussed above. After that I will create a new PR (possibly with some interface changes...), and have the two of you review it / test it on your plugins.

Does that work for you?

@PhilippRue (Contributor):

@broeder-j @PhilippRue I think we can do the following: We merge this PR now - I will go over the functionality and try to implement the "node selection by what ran in the context manager" discussed above. After that I will create a new PR (possibly with some interface changes...), and have the two of you review it / test it on your plugins.

Does that work for you?

sounds good 👍

@greschd greschd changed the base branch from develop to export_cache April 3, 2020 09:59
@greschd greschd merged commit f9a6f79 into aiidateam:export_cache Apr 3, 2020
@greschd (Member) commented Apr 3, 2020

Since it seemed to be incompatible with the changes in #30, I merged it into a separate branch for now (to keep the tests running on develop).

Successfully merging this pull request may close these issues:

  • Add 'export_cache' implementation