Prov data input output #1989

Draft
wants to merge 42 commits into
base: main
Changes from 38 commits
e3eabfe
Init for the no listing workflow
jjkoehorst Jun 5, 2023
ab0738f
improved test case with checking if the files are there and if they a…
jjkoehorst Jun 12, 2023
c559952
make clean formatting
jjkoehorst Jun 15, 2023
c62979a
Improved naming for the no_listing test and a creation step of 10.000…
jjkoehorst Jun 15, 2023
bc85a1f
Identifying what it is processing for provenance
jjkoehorst Jun 16, 2023
9a6f4ce
For the provenance a --no-input option to not copy input files into t…
jjkoehorst Jun 16, 2023
456b26c
A work in progress to ensure the test works for no-listing and no-input
jjkoehorst Jun 16, 2023
b278f58
Working on specifying input/output for provenance
jjkoehorst Jun 18, 2023
fd14f0d
working on no-input
jjkoehorst Jun 22, 2023
95cff2b
move back from data/input data/output to a data folder to see if RO-C…
jjkoehorst Jun 22, 2023
cdbf106
Likely missed one
jjkoehorst Jun 22, 2023
4d8ae62
Moved back to provenance_constants.DATA with comments for DATAX to kn…
jjkoehorst Jun 22, 2023
29eee5f
Merge branch 'main' into prov_data_input_output for updates
Mar 14, 2024
5fc090b
fix merge err
Mar 14, 2024
18a9bd6
fix loadListing of dir for prov
Mar 22, 2024
fc0e331
misc change of comments
Mar 22, 2024
627026c
fix text, remove vscode space files
Mar 26, 2024
96d1e20
remove redundant texts
Mar 26, 2024
c7963b3
Merge pull request #1986 from ElderMedic/prov_data_input_output
ElderMedic Mar 27, 2024
88b113b
minor changes of logging
Apr 11, 2024
8f52df9
Merge branch 'main' into prov_data_input_output
ElderMedic Apr 11, 2024
55f56b1
Merge branch 'prov_data_input_output' of https://github.com/common-wo…
Apr 11, 2024
c979e29
Result of make cleanup
jjkoehorst Apr 17, 2024
0f7df3c
Merge pull request #2004 from common-workflow-language/main
ElderMedic May 23, 2024
97ad0c2
changed contents in provenance
Jun 25, 2024
7d6f35a
MADE PROVENANCE_CONSTANTS CONSTANT AGAIN, ADDED SOME COMMENTS
Jun 27, 2024
a9b5941
Merge pull request #2011 from common-workflow-language/main
ElderMedic Jun 27, 2024
d052895
Fix hasher param type for checksum_copy, improve linting: flake8, pyd…
Jun 27, 2024
61024ff
fix current_source None issue, improve linting
Jun 28, 2024
b3a85c6
docstring fix and lint fix
Jun 28, 2024
7b748de
remove load_listing param for used_artefacts
Jun 28, 2024
a47e73c
fix lint
Jun 28, 2024
2acc4c0
fix conformity issue
Jun 28, 2024
d6bdc7f
run black for uncompromising python code format
Jun 29, 2024
0a17526
fix no_input
Jun 30, 2024
fe04191
black format fix
Jun 30, 2024
6b06075
typo fix
Jul 1, 2024
4d41ca4
fix typing hint in test_provenance.py
Jul 3, 2024
f1445e0
improve cwlview viz for large workflow
Jul 9, 2024
b3d9e19
Update tests/test_provenance.py
ElderMedic Jul 22, 2024
52b256b
Merge branch 'main' into prov_data_input_output
ElderMedic Jul 22, 2024
743ac15
Merge pull request #2045 from common-workflow-language/main
ElderMedic Oct 2, 2024
2 changes: 1 addition & 1 deletion cwlref-runner/README
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
This an optional companion package to "cwltool" which provides provides an
This an optional companion package to "cwltool" which provides an
additional entry point under the alias "cwl-runner", which is the
implementation-agnostic name for the default CWL interpreter installed on a
host.
31 changes: 14 additions & 17 deletions cwlref-runner/setup.py
Expand Up @@ -4,22 +4,19 @@
from setuptools import setup, find_packages

SETUP_DIR = os.path.dirname(__file__)
README = os.path.join(SETUP_DIR, 'README')
README = os.path.join(SETUP_DIR, "README")

setup(name='cwlref-runner',
version='1.0',
description='Common workflow language reference implementation',
long_description=open(README).read(),
author='Common workflow language working group',
author_email='[email protected]',
url="http://www.commonwl.org",
download_url="https://github.com/common-workflow-language/common-workflow-language",
license='Apache 2.0',
install_requires=[
'cwltool'
],
entry_points={
'console_scripts': [ "cwl-runner=cwltool.main:main" ]
},
zip_safe=True
setup(
name="cwlref-runner",
version="1.0",
description="Common workflow language reference implementation",
long_description=open(README).read(),
author="Common workflow language working group",
author_email="[email protected]",
url="http://www.commonwl.org",
download_url="https://github.com/common-workflow-language/common-workflow-language",
license="Apache 2.0",
install_requires=["cwltool"],
entry_points={"console_scripts": ["cwl-runner=cwltool.main:main"]},
zip_safe=True,
)
18 changes: 18 additions & 0 deletions cwltool/argparser.py
Expand Up @@ -287,6 +287,24 @@ def arg_parser() -> argparse.ArgumentParser:
type=str,
)

# TO DO: Not yet implemented
provgroup.add_argument(
"--no-data", # Maybe change to no-input and no-intermediate to ignore those kind of files?...
default=False,
action="store_true",
help="Disables the storage of input and output data files in the provenance folder",
dest="no_data",
)

# TO DO: Not yet implemented
provgroup.add_argument(
"--no-input", # Maybe change to no-input and no-intermediate to ignore those kind of files?...
default=False,
action="store_true",
help="Disables the storage of input data files in the provenance folder",
dest="no_input",
)

printgroup = parser.add_mutually_exclusive_group()
printgroup.add_argument(
"--print-rdf",
Expand Down
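Review note: the two new flags are plain boolean switches. The sketch below is a minimal, standalone illustration of how `store_true` options with an explicit `dest` behave. The parser name `demo-runner` and the helper `build_parser` are hypothetical and are not part of cwltool; only the flag names, defaults, and `dest` values are taken from the hunk above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with two boolean provenance flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="demo-runner")  # hypothetical program name
    provgroup = parser.add_argument_group("provenance options")
    provgroup.add_argument(
        "--no-data",
        default=False,
        action="store_true",  # the flag takes no value; its presence sets True
        dest="no_data",
        help="Disables the storage of input and output data files",
    )
    provgroup.add_argument(
        "--no-input",
        default=False,
        action="store_true",
        dest="no_input",
        help="Disables the storage of input data files",
    )
    return parser


# Only --no-input is given, so no_data keeps its default.
args = build_parser().parse_args(["--no-input"])
print(args.no_data, args.no_input)
```

Both options default to `False`, so existing invocations that pass neither flag keep their current behaviour.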
4 changes: 4 additions & 0 deletions cwltool/builder.py
Expand Up @@ -573,6 +573,10 @@ def addsf(
datum = cast(CWLObjectType, datum)
ll = schema.get("loadListing") or self.loadListing
if ll and ll != "no_listing":
# Debug show
for k in datum:
_logger.debug("Datum: %s: %s", k, datum[k])
_logger.debug("----------------------------------------")
get_listing(
self.fs_access,
datum,
Expand Down
41 changes: 36 additions & 5 deletions cwltool/cwlprov/__init__.py
Expand Up @@ -6,7 +6,11 @@
import re
import uuid
from getpass import getuser
from typing import IO, Any, Callable, Dict, List, Optional, Tuple, TypedDict, Union
from typing import IO, Any, Dict, List, Optional, Tuple, TypedDict, Union

from cwltool.cwlprov.provenance_constants import Hasher

from ..loghandler import _logger


def _whoami() -> Tuple[str, str]:
Expand Down Expand Up @@ -135,17 +139,16 @@
def checksum_copy(
src_file: IO[Any],
dst_file: Optional[IO[Any]] = None,
hasher: Optional[Callable[[], "hashlib._Hash"]] = None,
hasher: Optional[str] = Hasher,
buffersize: int = 1024 * 1024,
) -> str:
"""Compute checksums while copying a file."""
# TODO: Use hashlib.new(Hasher_str) instead?
if hasher:
checksum = hasher()
checksum = hashlib.new(hasher)
else:
from .provenance_constants import Hasher

checksum = Hasher()
checksum = hashlib.new(Hasher)

contents = src_file.read(buffersize)
if dst_file and hasattr(dst_file, "name") and hasattr(src_file, "name"):
temp_location = os.path.join(os.path.dirname(dst_file.name), str(uuid.uuid4()))
Expand All @@ -158,6 +161,34 @@
pass
if os.path.exists(temp_location):
os.rename(temp_location, dst_file.name) # type: ignore

return content_processor(contents, src_file, dst_file, checksum, buffersize)


def checksum_only(
src_file: IO[Any],
dst_file: Optional[IO[Any]] = None,
hasher: str = Hasher,
buffersize: int = 1024 * 1024,
) -> str:
"""Calculate the checksum only, does not copy the data files."""
if dst_file is not None:
_logger.error(

"[Debug Checksum Only] Destination file should be None but it is %s", dst_file
)
checksum = hashlib.new(hasher)
contents = src_file.read(buffersize)
return content_processor(contents, src_file, dst_file, checksum, buffersize)


def content_processor(
contents: Any,
src_file: IO[Any],
dst_file: Optional[IO[Any]],
checksum: "hashlib._Hash",
buffersize: int,
) -> str:
"""Calculate the checksum based on the content."""
while contents != b"":
if dst_file is not None:
dst_file.write(contents)
Expand Down
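Review note: the hunk above splits hashing into a copy variant and a checksum-only variant that share one read loop. The following is a simplified, standalone sketch of that idea, assuming the hash is selected by name through `hashlib.new`. The function `checksum_stream` is a hypothetical stand-in, not the cwltool implementation, and it omits the temporary-file and hard-link handling of `checksum_copy`.

```python
import hashlib
import io
from typing import IO, Optional


def checksum_stream(
    src: IO[bytes],
    dst: Optional[IO[bytes]] = None,
    hasher: str = "sha1",
    buffersize: int = 1024 * 1024,
) -> str:
    """Hash src in chunks; also copy each chunk to dst when dst is given."""
    checksum = hashlib.new(hasher)  # look the algorithm up by its name
    while True:
        chunk = src.read(buffersize)
        if not chunk:  # an empty read means end of stream
            break
        checksum.update(chunk)
        if dst is not None:
            dst.write(chunk)
    if dst is not None:
        dst.flush()
    return checksum.hexdigest().lower()


# Copy and hash in one pass.
copied = io.BytesIO()
digest_with_copy = checksum_stream(io.BytesIO(b"hello provenance"), copied)

# Hash only: passing no destination means nothing is written anywhere.
digest_only = checksum_stream(io.BytesIO(b"hello provenance"))
print(digest_with_copy == digest_only)
```

Reading in fixed-size chunks keeps memory use bounded for large data files, which matters when the inputs are too big to copy into the research object.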
9 changes: 7 additions & 2 deletions cwltool/cwlprov/provenance_constants.py
@@ -1,4 +1,3 @@
import hashlib
import os
import uuid

Expand All @@ -18,7 +17,12 @@

# Research Object folders
METADATA = "metadata"
# sub-folders for data
DATA = "data"
INPUT_DATA = "data/input"
INTM_DATA = "data/intermediate"
OUTPUT_DATA = "data/output"

WORKFLOW = "workflow"
SNAPSHOT = "snapshot"
# sub-folders
Expand All @@ -43,10 +47,11 @@
# sha1, compatible with the File type's "checksum" field
# e.g. "checksum" = "sha1$47a013e660d408619d894b20806b1d5086aab03b"
# See ./cwltool/schemas/v1.0/Process.yml
Hasher = hashlib.sha1
SHA1 = "sha1"
SHA256 = "sha256"
SHA512 = "sha512"
# set the default hash function as SHA1 for hashlib.new
Hasher = SHA1

# TODO: Better identifiers for user, at least
# these should be preserved in ~/.config/cwl for every execution
Expand Down
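Review note: with this change `Hasher` becomes an algorithm name (a string) rather than a constructor. The sketch below shows, under that assumption, how a name-based default can produce and parse the `method$digest` form described in the comment above. The helpers `cwl_checksum` and `split_checksum` are hypothetical names used only for illustration.

```python
import hashlib
from typing import Tuple

SHA1 = "sha1"
SHA256 = "sha256"
Hasher = SHA1  # default algorithm name, resolved later by hashlib.new


def cwl_checksum(data: bytes, hasher: str = Hasher) -> str:
    """Return a checksum string of the form '<method>$<hex digest>'."""
    return f"{hasher}${hashlib.new(hasher, data).hexdigest()}"


def split_checksum(value: str) -> Tuple[str, str]:
    """Split '<method>$<hex digest>' into its two parts."""
    method, digest = value.split("$", 1)
    return method, digest


value = cwl_checksum(b"example content")
method, digest = split_checksum(value)
print(method, len(digest))
```

Because the constant is a string, the same value can be reused in places that need the name itself, such as the `urn:hash::{Hasher}:` namespace in `provenance_profile.py`.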
59 changes: 50 additions & 9 deletions cwltool/cwlprov/provenance_profile.py
Expand Up @@ -31,6 +31,8 @@
from ..stdfsaccess import StdFsAccess
from ..utils import CWLObjectType, JobsType, get_listing, posix_path, versionstring
from ..workflow_job import WorkflowJob

# from . import provenance_constants
from .provenance_constants import (
ACCOUNT_UUID,
CWLPROV,
Expand All @@ -43,11 +45,14 @@
SCHEMA,
SHA1,
SHA256,
Hasher,
TEXT_PLAIN,
UUID,
WF4EVER,
WFDESC,
WFPROV,
INPUT_DATA,
OUTPUT_DATA,
)
from .writablebagfile import create_job, write_bag_file # change this later

Expand Down Expand Up @@ -111,14 +116,24 @@
_logger.debug("[provenance] Creator Full name: %s", self.full_name)
self.workflow_run_uuid = run_uuid or uuid.uuid4()
self.workflow_run_uri = self.workflow_run_uuid.urn
# default to input data, now only INPUT_DATA and OUTPUT_DATA are possible values
self.current_data_source = INPUT_DATA
self.generate_prov_doc()

def __str__(self) -> str:
"""Represent this Provenance profile as a string."""
return f"ProvenanceProfile <{self.workflow_run_uri}> in <{self.research_object}>"

def generate_prov_doc(self) -> Tuple[str, ProvDocument]:
"""Add basic namespaces."""
"""Generate a provenance document.

This method adds basic namespaces to the provenance document and records host provenance.
It also adds information about the cwltool version, namespaces for various entities,
and creates agents, activities, and associations to represent the workflow execution.

Returns:
A tuple containing the workflow run URI and the generated ProvDocument.
"""

def host_provenance(document: ProvDocument) -> None:
"""Record host provenance."""
Expand Down Expand Up @@ -152,7 +167,7 @@
# https://tools.ietf.org/html/draft-thiemann-hash-urn-01
# TODO: Change to nih:sha-256; hashes
# https://tools.ietf.org/html/rfc6920#section-7
self.document.add_namespace("data", "urn:hash::sha1:")
self.document.add_namespace("data", f"urn:hash::{Hasher}:")
# Also needed for docker images
self.document.add_namespace(SHA256, "nih:sha-256;")

Expand Down Expand Up @@ -287,6 +302,7 @@
process_run_id: str,
outputs: Union[CWLObjectType, MutableSequence[CWLObjectType], None],
when: datetime.datetime,
# load_listing: None,
) -> None:
self.generate_output_prov(outputs, process_run_id, process_name)
self.document.wasEndedBy(process_run_id, None, self.workflow_run_uri, when)
Expand All @@ -300,14 +316,19 @@
if "checksum" in value:
csum = cast(str, value["checksum"])
(method, checksum) = csum.split("$", 1)
if method == SHA1 and self.research_object.has_data_file(checksum):
# TODO intermediate file?...
if method == SHA1 and self.research_object.has_data_file(
self.current_data_source, checksum
):
entity = self.document.entity("data:" + checksum)

if not entity and "location" in value:
location = str(value["location"])
# If we made it here, we'll have to add it to the RO
with self.fsaccess.open(location, "rb") as fhandle:
relative_path = self.research_object.add_data_file(fhandle)
relative_path = self.research_object.add_data_file(

fhandle, current_source=self.current_data_source
)
# FIXME: This naively relies on add_data_file setting hash as filename
checksum = PurePath(relative_path).name
entity = self.document.entity("data:" + checksum, {PROV_TYPE: WFPROV["Artifact"]})
Expand Down Expand Up @@ -408,8 +429,10 @@
# a later call to this method will sort that
is_empty = True

if "listing" not in value:
get_listing(self.fsaccess, value)
# get loadlisting, and populate the listing of value if not no_listing, recursively if deep_listing
ll = value.get("loadListing")

if ll and ll != "no_listing":
get_listing(self.fsaccess, value, (ll == "deep_listing"))

for entry in cast(MutableSequence[CWLObjectType], value.get("listing", [])):
is_empty = False
# Declare child-artifacts
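Review note: the hunk above only populates a directory listing when `loadListing` is set and is not `no_listing`, and it recurses only for `deep_listing`. The sketch below illustrates those three modes on a local directory. The function `list_directory` is a hypothetical simplification and is not cwltool's `get_listing`; the dictionary keys merely mimic CWL `File` and `Directory` objects.

```python
import os
import tempfile
from typing import Any, Dict, List


def list_directory(path: str, load_listing: str = "no_listing") -> List[Dict[str, Any]]:
    """Return directory entries according to a loadListing-style mode."""
    if load_listing == "no_listing":
        return []  # the directory contents are never enumerated
    entries: List[Dict[str, Any]] = []
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if entry.is_dir():
            item: Dict[str, Any] = {"class": "Directory", "basename": entry.name}
            if load_listing == "deep_listing":
                # Recurse only in deep mode; shallow mode stops one level down.
                item["listing"] = list_directory(entry.path, load_listing)
            entries.append(item)
        else:
            entries.append({"class": "File", "basename": entry.name})
    return entries


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    open(os.path.join(root, "a.txt"), "w").close()
    open(os.path.join(root, "sub", "b.txt"), "w").close()
    print(list_directory(root, "no_listing"))
    print(list_directory(root, "shallow_listing"))
    print(list_directory(root, "deep_listing"))
```

Skipping enumeration for `no_listing` is what makes the test case with a very large number of files practical, since no per-file provenance entries are declared for the directory contents.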
Expand Down Expand Up @@ -472,7 +495,9 @@
def declare_string(self, value: str) -> Tuple[ProvEntity, str]:
"""Save as string in UTF-8."""
byte_s = BytesIO(str(value).encode(ENCODING))
data_file = self.research_object.add_data_file(byte_s, content_type=TEXT_PLAIN)
data_file = self.research_object.add_data_file(
byte_s, current_source=self.current_data_source, content_type=TEXT_PLAIN
)
checksum = PurePosixPath(data_file).name
# FIXME: Don't naively assume add_data_file uses hash in filename!
data_id = f"data:{PurePosixPath(data_file).stem}"
Expand Down Expand Up @@ -505,7 +530,9 @@
if isinstance(value, bytes):
# If we got here then we must be in Python 3
byte_s = BytesIO(value)
data_file = self.research_object.add_data_file(byte_s)
data_file = self.research_object.add_data_file(

byte_s, current_source=self.current_data_source
)
# FIXME: Don't naively assume add_data_file uses hash in filename!
data_id = f"data:{PurePosixPath(data_file).stem}"
return self.document.entity(
Expand Down Expand Up @@ -604,6 +631,7 @@
job_order: Union[CWLObjectType, List[CWLObjectType]],
process_run_id: str,
name: Optional[str] = None,
# load_listing=None,
) -> None:
"""Add used() for each data artefact."""
if isinstance(job_order, list):
Expand Down Expand Up @@ -634,7 +662,17 @@
process_run_id: Optional[str],
name: Optional[str],
) -> None:
"""Call wasGeneratedBy() for each output,copy the files into the RO."""
"""
Call wasGeneratedBy() for each output, copy the files into the RO.

The attribute current_data_source tracks whether the data being
saved is input or output (possibly intermediate in the future).
It is later passed to the add_data_file() method in ro.py so that the
data is saved in the correct folder, which avoids changing the
provenance_constants DATA value.
"""
self.current_data_source = OUTPUT_DATA

if isinstance(final_output, MutableSequence):
for entry in final_output:
self.generate_output_prov(entry, process_run_id, name)
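Review note: the docstring above describes routing files to an input or output folder through a `current_data_source` attribute. The sketch below is a standalone illustration of that pattern under stated assumptions. The class `DataStore` is hypothetical and is not the `ResearchObject` class from `ro.py`, and the two-character prefix subfolder is an assumption made for this example.

```python
import hashlib
import os
import tempfile
from typing import Optional

INPUT_DATA = "data/input"
OUTPUT_DATA = "data/output"


class DataStore:
    """Store content under a folder chosen by the current data source."""

    def __init__(self, root: str) -> None:
        self.root = root
        # Default to input data; switched to OUTPUT_DATA before outputs are saved.
        self.current_data_source = INPUT_DATA

    def add_data_file(self, content: bytes, current_source: Optional[str] = None) -> str:
        """Write content to <source>/<prefix>/<sha1> and return the relative path."""
        source = current_source or self.current_data_source
        digest = hashlib.sha1(content).hexdigest()
        rel_path = f"{source}/{digest[:2]}/{digest}"  # prefix folder is an assumption
        full_path = os.path.join(self.root, *rel_path.split("/"))
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb") as handle:
            handle.write(content)
        return rel_path


with tempfile.TemporaryDirectory() as root:
    store = DataStore(root)
    print(store.add_data_file(b"an input value"))
    store.current_data_source = OUTPUT_DATA  # mirrors the switch made for outputs
    print(store.add_data_file(b"an output value"))
```

Passing the source explicitly keeps the folder constants immutable. The trade-off is that the attribute is mutable state, so the order in which inputs and outputs are recorded determines where a file lands.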
Expand All @@ -660,6 +698,7 @@
self.document.wasGeneratedBy(
entity, process_run_id, timestamp, None, {"prov:role": role}
)
# return current_data_source

def prospective_prov(self, job: JobsType) -> None:
"""Create prospective prov recording as wfdesc prov:Plan."""
Expand Down Expand Up @@ -733,6 +772,8 @@
# TODO: Also support other profiles than CWLProv, e.g. ProvOne

# list of prov identifiers of provenance files
# NOTE: prov_ids are file names prepared for provenance/RO files in
# metadata/provenance for each sub-workflow of main workflow
prov_ids = []

# https://www.w3.org/TR/prov-xml/
Expand Down