Feature/proper error messages #19

Merged
merged 22 commits into from
May 15, 2024
11 changes: 9 additions & 2 deletions .github/workflows/build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,13 @@ jobs:

- run: pnpm install

- run: pnpm run build
- name: Setup Rclone
uses: animmouse/setup-rclone@v1

- run: pnpm run package
- run: ./test-go-direct.sh
working-directory: packages/aws-copy-out-sharer/docker/rclone-batch-docker-image

- run: ./test-docker-direct.sh
working-directory: packages/aws-copy-out-sharer/docker/rclone-batch-docker-image
# - run: pnpm run build
# - run: pnpm run package
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,3 +5,6 @@ node_modules/
cdk.context.json

cdk.out/


.DS_Store
52 changes: 51 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,63 @@ A service that can be installed into an Elsa Data environment
and which enables parallel file copying out into a
destination bucket in the same region.

NOTE: this is a general-purpose "S3 file copy" tool, so it might
be useful outside of Elsa Data. It can certainly be invoked
directly as a Steps function independent of Elsa Data (all
Elsa Data does is set up the input CSVs and then invoke
the Steps function itself).

## Development

On check-out (once only) (note that `pre-commit` is presumed installed externally)

```shell
pre-commit install
```

For package installation (note that `pnpm` is presumed installed externally)

```shell
pnpm install
```

Edit the packages and deploy to dev

```shell
# (in the dev folder)
pnpm run deploy
```

## Testing

Because this service is very dependent on the behaviour of AWS Steps
(using distributed maps), it was too complex to set up a "local" test
that would actually exercise many of the pieces likely to fail.

Instead, development is done and the CDK project is deployed to a "dev" account (noting
that this sets the minimum dev cadence for trying changes
to minutes rather than seconds).

There is then a test script that creates sample objects and launches
test invocations.

## Input

```json
{
"sourceFilesCsvBucket": "bucket-with-csv",
"sourceFilesCsvKey": "key-of-source-files.csv",
"destinationBucket": "a-target-bucket-in-same-region",
"destinationBucket": "a-target-bucket-in-same-region-but-not-same-account",
"maxItemsPerBatch": 10
}
```

The copy will fan out wide (to a sensible width of around 100), but there is a small AWS Config
cost to the startup/shutdown
of each Fargate task. `maxItemsPerBatch` therefore controls how many individual files are attempted per
Fargate task, noting that we request SPOT tasks.

So there is a balance between the likelihood of SPOT interruptions and the re-use of Fargate tasks. If
tasks are SPOT interrupted, the next invocation will skip already transferred files (assuming
at least one is copied), so it is probably safe and cheapest to leave the items per batch at 10
and be prepared to perhaps re-execute the copy.
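
To make the trade-off concrete, here is a minimal sketch of the batching arithmetic only. It is not the service's implementation (the README says the fan-out is handled by AWS Steps distributed maps); the `chunkIntoBatches` helper and the object names are hypothetical and exist purely for illustration.

```typescript
// Hypothetical helper (not part of this repository): split a list of items
// into consecutive batches holding at most maxItemsPerBatch items each.
function chunkIntoBatches<T>(
  items: readonly T[],
  maxItemsPerBatch: number,
): T[][] {
  if (!Number.isInteger(maxItemsPerBatch) || maxItemsPerBatch < 1) {
    throw new RangeError("maxItemsPerBatch must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxItemsPerBatch) {
    batches.push(items.slice(i, i + maxItemsPerBatch));
  }
  return batches;
}

// 13 made-up object keys with the default batch size of 10
const exampleObjectKeys = Array.from({ length: 13 }, (_, i) => `object-${i}`);
const exampleBatches = chunkIntoBatches(exampleObjectKeys, 10);

// Each batch would correspond to one Fargate task in this simplified picture
console.log(exampleBatches.map((batch) => batch.length));
```

A smaller batch size means more tasks (more startup/shutdown overhead) but less work lost if any one SPOT task is interrupted.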
36 changes: 36 additions & 0 deletions dev/EXAMPLE-COPY-README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
How to do a full scale invoke test.

Go to "elsa-data-tmp" bucket in dev.
It probably will be empty as objects auto-expire.
Make a folder "copy-out-test-working".
Copy "example-copy-manifest.csv" to that folder.

THE FOLDER MUST BE EXACTLY AS SPECIFIED AS THAT PERMISSION IS BAKED INTO
THE DEV DEPLOYMENT (IN ORDER TO TEST PERMISSIONS!)

Invoke the dev Steps with the input below (feel free to change the "0\_" prefix if you
want to run multiple experiments without overwriting the results)

```json
{
"sourceFilesCsvBucket": "elsa-data-tmp",
"sourceFilesCsvKey": "example-copy-manifest.csv",
"destinationBucket": "elsa-data-copy-target-sydney",
"maxItemsPerBatch": 2,
"destinationStartCopyRelativeKey": "0_STARTED_COPY.txt",
"destinationEndCopyRelativeKey": "0_ENDED_COPY.csv"
}
```

For a test of AG (in the AG account - with public/made up data files)

```json
{
"sourceFilesCsvBucket": "elsa-data-copy-working",
"sourceFilesCsvKey": "example-copy-manifest-ag.csv",
"destinationBucket": "elsa-data-copy-target-sydney",
"maxItemsPerBatch": 1,
"destinationStartCopyRelativeKey": "AG_STARTED_COPY.txt",
"destinationEndCopyRelativeKey": "AG_ENDED_COPY.csv"
}
```
22 changes: 22 additions & 0 deletions dev/constants.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
/**
* The only configurable item needed for the test cases - set this to a bucket you have
* full access to. Ideally the bucket should have a lifecycle that auto expires objects after 1 day.
* In order to keep minimal AWS permissions this is also specified
* in the CDK deployment.
*/
export const TEST_BUCKET = "elsa-data-tmp";

/**
* A designated area in our test bucket that is where we can find the list
* of objects to copy - and other working files
* NOTE: this is not where the source or destination files are located.
*/
export const TEST_BUCKET_WORKING_PREFIX = "copy-out-test-working/";

/**
* We have a clear permissions split between the objects that we are copying - and
* the working objects we create in doing the copy. By making sure they are in
* different test folders - we can confirm that our permissions aren't accidentally
* overlapping (which might mean we pass tests that later fail for real)
*/
export const TEST_BUCKET_OBJECT_PREFIX = "copy-out-test-objects/";
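
The comment above relies on the two prefixes never containing one another. A minimal sketch of how that property could be checked follows; `prefixesOverlap` is a hypothetical helper, not something defined in this PR, and the two constants are simply repeated from `dev/constants.ts` so the snippet stands alone.

```typescript
// Values repeated from dev/constants.ts so that this sketch is self-contained
const WORKING_PREFIX = "copy-out-test-working/";
const OBJECT_PREFIX = "copy-out-test-objects/";

// Hypothetical helper: two S3 key prefixes overlap when one contains the other,
// which would let a permission granted on one prefix also cover the other.
function prefixesOverlap(a: string, b: string): boolean {
  return a.startsWith(b) || b.startsWith(a);
}

// The test folders are siblings, so a policy scoped to one cannot reach the other
console.log(prefixesOverlap(WORKING_PREFIX, OBJECT_PREFIX));

// A nested layout is the kind of accidental overlap the comment warns about
console.log(prefixesOverlap("copy-out-test/", "copy-out-test/objects/"));
```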
97 changes: 90 additions & 7 deletions dev/dev.ts
Original file line number Diff line number Diff line change
@@ -1,15 +1,65 @@
import { CopyOutStack } from "aws-copy-out-sharer";
import { SubnetType } from "aws-cdk-lib/aws-ec2";
import { App } from "aws-cdk-lib";
import { CopyOutStateMachineConstruct } from "aws-copy-out-sharer";
import { SubnetType, Vpc } from "aws-cdk-lib/aws-ec2";
import { App, Stack, StackProps } from "aws-cdk-lib";
import { InfrastructureClient } from "@elsa-data/aws-infrastructure";
import { Service } from "aws-cdk-lib/aws-servicediscovery";
import { Construct } from "constructs";
import { TEST_BUCKET, TEST_BUCKET_WORKING_PREFIX } from "./constants";

const app = new App();

const description =
"Bulk copy-out service for Elsa Data - an application for controlled genomic data sharing";

const devId = "ElsaDataDevCopyOutStack";
const agId = "ElsaDataAgCopyOutStack";

new CopyOutStack(app, devId, {
/**
* Wraps the copy out construct for development purposes. We don't do this Stack definition in the
* construct itself - as unlike some other Elsa Data constructs - there may be general
* utility to the copy out service (i.e. we want to let people just install the Copy Out
* state machine without elsa infrastructure). But for dev purposes we are developing
* this in conjunction with Elsa Data - so we register it into the namespace and use a common
* VPC etc.
*/
class ElsaDataCopyOutStack extends Stack {
constructor(scope?: Construct, id?: string, props?: StackProps) {
super(scope, id, props);

// our client unlocks the ability to fetch/create CDK objects that match our
// installed infrastructure stack (by infrastructure stack name)
const infraClient = new InfrastructureClient(
"ElsaDataDevInfrastructureStack",
);

const vpc = infraClient.getVpcFromLookup(this);

const namespace = infraClient.getNamespaceFromLookup(this);

const service = new Service(this, "Service", {
namespace: namespace,
name: "CopyOut",
description: "Parallel file copying service",
});

const copyOut = new CopyOutStateMachineConstruct(this, "CopyOut", {
vpc: vpc,
vpcSubnetSelection: SubnetType.PRIVATE_WITH_EGRESS,
workingBucket: TEST_BUCKET,
workingBucketPrefixKey: TEST_BUCKET_WORKING_PREFIX,
aggressiveTimes: true,
allowWriteToInstalledAccount: true,
});

service.registerNonIpInstance("StateMachine", {
customAttributes: {
stateMachineArn: copyOut.stateMachine.stateMachineArn,
},
});
}
}

new ElsaDataCopyOutStack(app, devId, {
// the stack can only be deployed to 'dev'
env: {
account: "843407916570",
Expand All @@ -20,7 +70,40 @@ new CopyOutStack(app, devId, {
"umccr-org:Stack": devId,
},
description: description,
isDevelopment: true,
infrastructureStackName: "ElsaDataDevInfrastructureStack",
infrastructureSubnetSelection: SubnetType.PRIVATE_WITH_EGRESS,
});

/**
* Wraps an even simpler deployment directly for AG. We need to do AG copies
* outside of Elsa. This is also a good test of the copy-out mechanics. So this
* allows us to directly deploy/destroy.
*/
class ElsaDataSimpleCopyOutStack extends Stack {
constructor(scope?: Construct, id?: string, props?: StackProps) {
super(scope, id, props);

const vpc = Vpc.fromLookup(this, "Vpc", { vpcName: "main-vpc" });

const copyOut = new CopyOutStateMachineConstruct(this, "CopyOut", {
vpc: vpc,
vpcSubnetSelection: SubnetType.PRIVATE_WITH_EGRESS,
workingBucket: "elsa-data-copy-working",
aggressiveTimes: false,
allowWriteToInstalledAccount: true,
});

//stateMachineArn: copyOut.stateMachine.stateMachineArn,
}
}

new ElsaDataSimpleCopyOutStack(app, agId, {
// the stack can only be deployed to 'dev'
env: {
account: "602836945884",
region: "ap-southeast-2",
},
tags: {
"umccr-org:Product": "ElsaData",
"umccr-org:Stack": agId,
},
description: description,
});
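
The dev stack registers the state machine ARN as a custom attribute (`stateMachineArn`) on a Cloud Map instance. A caller could presumably recover it from the result of a Cloud Map `DiscoverInstances` call. The sketch below is an assumption about how that lookup might be written, not code from this PR. It works on a locally declared type that mirrors only the `Instances[].Attributes` part of that response, so it runs without AWS access. The instance id and ARN in the example are made up.

```typescript
// Local stand-in for the part of a DiscoverInstances result that we need.
// This type is declared here for illustration; it is not imported from the AWS SDK.
interface DiscoveredInstance {
  InstanceId?: string;
  Attributes?: Record<string, string>;
}

// Hypothetical helper: return the first non-empty stateMachineArn attribute found
function findStateMachineArn(
  instances: readonly DiscoveredInstance[],
): string | undefined {
  for (const instance of instances) {
    const arn = instance.Attributes?.["stateMachineArn"];
    if (arn) {
      return arn;
    }
  }
  return undefined;
}

// Made-up data in the same shape as a discovery response
const exampleInstances: DiscoveredInstance[] = [
  { InstanceId: "unrelated-instance", Attributes: {} },
  {
    InstanceId: "example-instance",
    Attributes: {
      stateMachineArn:
        "arn:aws:states:ap-southeast-2:000000000000:stateMachine:Example",
    },
  },
];

console.log(findStateMachineArn(exampleInstances));
```

Returning `undefined` rather than throwing leaves it to the caller to decide whether a missing registration is an error.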
2 changes: 2 additions & 0 deletions dev/example-copy-manifest-ag.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
elsa-data-demo-agha-gdr-store,FLAGSHIP_A/2020-02-01/ERR251112_R1.fastq.gz
elsa-data-demo-agha-gdr-store,FLAGSHIP_A/2020-02-01/ERR251112_R2.fastq.gz
13 changes: 13 additions & 0 deletions dev/example-copy-manifest.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.bcf"
umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.bcf.csi"
umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.vcf"
umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.vcf.gz"
umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.vcf.gz.csi"
umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.vcf.gz.tbi"
umccr-10f-data-dev,AFILETHATDOESNOTEXIST.txt
umccr-10f-data-dev,ASHKENAZIM/HG002.bam
umccr-10f-data-dev,ASHKENAZIM/HG002.bam.bai
umccr-10f-data-dev,ASHKENAZIM/HG003.bam
umccr-10f-data-dev,ASHKENAZIM/HG003.bam.bai
umccr-10f-data-dev,ASHKENAZIM/HG004.bam
umccr-10f-data-dev,ASHKENAZIM/HG004.bam.bai
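
Both example manifests use a headerless two-column layout of `bucket,key`, with the key optionally wrapped in double quotes. As a rough sketch of reading that layout (an assumption for illustration, not the parser the service uses, and not a full CSV implementation), the hypothetical `parseManifest` below splits each line at the first comma and strips one pair of surrounding quotes from the key.

```typescript
interface ManifestEntry {
  bucket: string;
  key: string;
}

// Hypothetical helper: parse one "bucket,key" line. Bucket names cannot contain
// commas, so everything after the first comma is treated as the key.
function parseManifestLine(line: string): ManifestEntry {
  const comma = line.indexOf(",");
  if (comma < 1) {
    throw new Error(`Expected "bucket,key" but got: ${line}`);
  }
  const bucket = line.slice(0, comma);
  let key = line.slice(comma + 1);
  // Remove one pair of surrounding quotes and un-double any embedded quotes
  if (key.length >= 2 && key.startsWith('"') && key.endsWith('"')) {
    key = key.slice(1, -1).replace(/""/g, '"');
  }
  if (key.length === 0) {
    throw new Error(`Missing key in: ${line}`);
  }
  return { bucket, key };
}

// Hypothetical helper: parse a whole manifest, ignoring blank lines
function parseManifest(text: string): ManifestEntry[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map(parseManifestLine);
}

// Two rows taken from dev/example-copy-manifest.csv
const exampleManifest = [
  'umccr-10f-data-dev,"ASHKENAZIM/HG002-HG003-HG004.joint.filter.bcf"',
  "umccr-10f-data-dev,ASHKENAZIM/HG002.bam",
].join("\n");

const exampleEntries = parseManifest(exampleManifest);
console.log(exampleEntries);
```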
20 changes: 17 additions & 3 deletions dev/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,24 @@
"private": true,
"version": "0.0.0",
"description": "Manual CDK deployment for development",
"scripts": {
"deploy": "pnpm -w run build && cdk deploy ElsaDataDevCopyOutStack",
"destroy": "pnpm -w run build && cdk destroy ElsaDataDevCopyOutStack",
"agdeploy": "pnpm -w run build && cdk deploy ElsaDataAgCopyOutStack",
"agdestroy": "pnpm -w run build && cdk destroy ElsaDataAgCopyOutStack",
"test": "ts-node --prefer-ts-exts test.ts",
"test-quick": "ts-node --prefer-ts-exts test.ts"
},
"dependencies": {
"aws-cdk": "2.93.0",
"aws-cdk-lib": "2.93.0",
"aws-copy-out-sharer": "link:../packages/aws-copy-out-sharer"
"@aws-sdk/client-s3": "3.576.0",
"@aws-sdk/client-servicediscovery": "3.576.0",
"@aws-sdk/client-sfn": "3.576.0",
"@aws-sdk/client-sso-oidc": "3.574.0",
"@elsa-data/aws-infrastructure": "1.5.1",
"aws-cdk": "2.141.0",
"aws-cdk-lib": "2.141.0",
"aws-copy-out-sharer": "link:../packages/aws-copy-out-sharer",
"constructs": "10.3.0"
},
"devDependencies": {}
}