AWS identity verification (#2121)
## Context

Pull #2086 added the ability to mirror project content on Amazon S3.
This is now working and we are in the process of uploading open-access
projects from PhysioNet.

The changes here will be needed once we start uploading
*restricted/credentialed* projects, so that we can securely grant access
to authorized users. (Identity verification aside, there are also some
more significant changes that are needed for handling
restricted/credentialed projects; see issue #2094.)

In brief: currently (in the old system Felipe set up), people are asked
to *self-report* their AWS account number, and *any person or service
within that account* would be allowed to access restricted data.

With these changes, in contrast, people will be asked to *verify* their
personal AWS identity; subsequently, we'll be able to grant access *only
to verified identities* (the latter part is yet to be implemented).

## Why

DUAs for MIMIC and other databases require that data be shared only with
authorized *individuals* (each person must register on PhysioNet and be
credentialed). We want to enable cloud access for better performance,
but complying with these DUAs requires knowing who is being granted
permission to use these cloud services.

Moreover, although each user is ultimately responsible for data
security, we want to encourage good practices. People may be using AWS
for all sorts of reasons unrelated to PhysioNet. Giving themselves
permission to access MIMIC through their personal account should not
*also* grant permission to all of those unrelated and
possibly-less-trusted services.

Some people may be using organizational AWS accounts rather than
personal ones. Maybe we want to discourage this, or maybe not, but we
can't prevent it. One member of an organization having access shouldn't
grant access to everyone in the organization.

There is a lot about AWS authentication that is still a bit mysterious
to me, but my gut feeling is that the "IAM user" level is the right
level of authentication for PhysioNet and MIMIC.

It has been suggested that we could ask people to self-report their AWS
username (or ARN?) in addition to their account number. And yes, that
would be an improvement; but it has the disadvantage that usernames are
variable-length, and may not be long-term stable. Better would be to ask
people to self-report their *AWS userid*, but that's not easy for people
to find and more likely to cause mistakes.

Finally, I can imagine that in the future there may be other reasons for
wanting to associate a PhysioNet account with an AWS account, and having
a strong verification process could enable more interesting forms of
integration.

## How identity verification works

The concept is that we would have a special-purpose S3 bucket which
allows access *only if the path matches the requester's AWS account and
userid.* To prove your identity, you generate a signature for a URL that
can only be accessed by you, and paste that signed URL into a form on
the site.

The process would be:

1. You go to your cloud settings page on PhysioNet.

2. We tell you to run the command `aws sts get-caller-identity`.

3. You copy the output into the form.

4. We then tell you to run a command like `aws s3 presign
s3://asdfghjk/physionet.org-verification/[email protected]/userid=AIDAABCDEFGHIJKL/account=112233445566/username=barackobama/`.

5. You copy the output into the form.

6. We verify the format of the URL and submit it to AWS to verify the
signature.
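
As a rough illustration of steps 3 to 6, here is a minimal sketch of the
format checks the site could perform before asking AWS to validate the
signature. The function names and the simplified path layout (userid,
account, and username only) are assumptions for illustration, not the
code in `user/awsverification.py`. The final step, submitting the URL to
AWS, is not shown because it needs network access.

```python
import json
from urllib.parse import parse_qs, unquote, urlsplit

# Hypothetical prefix and layout; the real path format is defined by the site.
VERIFICATION_PREFIX = "physionet.org-verification"


def parse_caller_identity(cli_output):
    """Extract (userid, account, username) from `aws sts get-caller-identity` JSON."""
    info = json.loads(cli_output)
    arn = info["Arn"]
    # An IAM user ARN looks like arn:aws:iam::<account>:user/<optional path/><name>
    resource = arn.split(":", 5)[5]
    if not resource.startswith("user/"):
        raise ValueError("not an IAM user ARN: " + arn)
    username = resource.rsplit("/", 1)[1]
    return info["UserId"], info["Account"], username


def expected_key(userid, account, username):
    """Build the object key that only this identity should be able to sign for."""
    return (f"{VERIFICATION_PREFIX}/userid={userid}"
            f"/account={account}/username={username}/")


def check_signed_url(url, bucket, key):
    """Check that a pasted URL targets the expected bucket and key and is signed."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme != "https":
        return False
    # Assumes virtual-hosted-style URLs (<bucket>.s3[.<region>].amazonaws.com).
    if not (host.startswith(bucket + ".s3.") and host.endswith(".amazonaws.com")):
        return False
    # The path may be percent-encoded (for example "=" as "%3D").
    if unquote(parts.path) != "/" + key:
        return False
    query = parse_qs(parts.query)
    return all(name in query for name in ("X-Amz-Credential", "X-Amz-Signature"))
```

These checks only establish that the URL has the expected shape; the
proof of identity comes from AWS accepting the signature in step 6.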

## Wait a minute, what's this "userid" thing you keep talking about?


https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-unique-ids
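
In short, per the AWS page linked above, every IAM entity has a unique ID
in addition to its friendly name, and the prefix of that ID indicates the
entity type (`AIDA` for an IAM user, `AROA` for a role). A minimal sketch
of a check that a reported userid looks like an IAM user's ID might be
the following; the function name is hypothetical, and whether the site
accepts only IAM users is an assumption here.

```python
import re

# IAM user unique IDs start with "AIDA"; role IDs start with "AROA".
# For an assumed role, get-caller-identity reports "AROA...:session-name".
_IAM_USER_ID = re.compile(r"AIDA[A-Z0-9]+")


def is_iam_user_id(userid):
    """Return True if userid has the form of an IAM user's unique ID."""
    return _IAM_USER_ID.fullmatch(userid) is not None
```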

## Setup and testing

Using this feature requires creating a special-purpose S3 bucket (a
bucket which probably should not be used for anything else).

*For the time being*, you can test this by setting
`AWS_VERIFICATION_BUCKET_NAME` to `bm-uverify-test1`. I will delete that
bucket once we've set up a permanent replacement under the PhysioNet AWS
account.

If you want to see exactly how the verification bucket is created, and
test it yourself, see the instructions in `deploy/README.md`.

## Background

Although this implementation is guided by the needs of PhysioNet, my
goal has been to design a general-purpose authentication protocol that
could be used by any site that needs to verify cross-account AWS
identities.

This is inspired in part by the technique used by HashiCorp Vault and
discussed here:
* https://ahermosilla.com/cloud/2020/11/17/leveraging-aws-signed-requests.html
* https://www.hashicorp.com/resources/deep-dive-vault-aws-auth-backend

and similarly: https://stackoverflow.com/a/76099155

We could use the same method, but it would require the person to
download and run a small program (and that program involves some pretty
hairy digging into the AWS API).

The method proposed here, in contrast, only requires the person to
install the official AWS CLI and run a couple of commands. I think that
this is easier to understand and therefore, paradoxically, more secure
(see if you can spot the security flaw in the Stack Overflow answer).

For information about *why* this works, see AWS documentation on policy
variables:

https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html
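
To illustrate the idea, here is a small local simulation of policy
variable substitution. This is not how AWS implements policy evaluation,
and the resource pattern is a hypothetical one, not necessarily the
pattern used by the actual verification bucket. It only shows why a path
naming someone else's userid cannot match once `${aws:userid}` has been
replaced with the requester's own value.

```python
import re
from fnmatch import fnmatchcase


def expand_policy_variables(pattern, context):
    """Replace each ${name} in pattern with the requester's value for name."""
    def lookup(match):
        return context[match.group(1)]
    return re.sub(r"\$\{([^}]+)\}", lookup, pattern)


def resource_matches(pattern, context, resource):
    """Simulate matching a requested resource against an expanded pattern."""
    # fnmatchcase gives "*" wildcard semantics close enough for this sketch.
    return fnmatchcase(resource, expand_policy_variables(pattern, context))
```

For example, with a pattern ending in
`userid=${aws:userid}/account=${aws:PrincipalAccount}/*`, a request made
as userid `AIDAABCDEFGHIJKL` matches only paths that contain
`userid=AIDAABCDEFGHIJKL`.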

Also see the AWS CLI documentation:

https://docs.aws.amazon.com/cli/latest/reference/sts/get-caller-identity.html
https://docs.aws.amazon.com/cli/latest/reference/s3/presign.html
tompollard authored Jul 9, 2024
2 parents 309435a + e36f7a4 commit c0a644d
Showing 13 changed files with 1,051 additions and 13 deletions.
3 changes: 3 additions & 0 deletions .env.example
@@ -54,6 +54,9 @@ PAUSE_CREDENTIALING_MESSAGE='PhysioNet will not be taking new applications for c
# GOOGLE_APPLICATION_CREDENTIALS=json
GCP_DELEGATION_EMAIL=email

# AWS user authentication bucket (see deploy/README.md)
#AWS_VERIFICATION_BUCKET_NAME=example-bucket

# AWS
# Used to provide MIMIC through AWS, this will include S3, Redshift, Spark
# Key and key2 are predefined by AWS, can be changed but IT WILL BREAK ALL
28 changes: 28 additions & 0 deletions deploy/README.md
@@ -242,6 +242,34 @@ AWS_CLOUD_FORMATION=URL
This functionality will send the AWS ID to a Lambda function in the AWS Cloud Formation.
That ID will be then added to the storage bucket and databases.

### User authentication for AWS

Before accessing restricted data via AWS, users will need to add their AWS account on the "Cloud" page of their user profile.

In order for this option to appear on the site, the site operator must create a *verification bucket* and configure the `AWS_VERIFICATION_BUCKET_NAME` setting in `.env`. A "verification bucket" is a special S3 bucket that doesn't contain any files.

For demo/testing purposes, you can use the same verification bucket that PhysioNet uses (the bucket name isn't secret.) For production use, each site should have a verification bucket that is owned and controlled by the site's own AWS account. To do that:

- Log in to the AWS console, and create an IAM user with full privileges for S3 administration. (This can be the same user that will be used for managing S3 project buckets.)
- Generate an access key for this user, and configure the AWS CLI (`aws configure`).
- Open a Python shell (`manage.py shell`) and run:
```
import user.awsverification
user.awsverification.configure_aws_verification_bucket(BUCKET)
```
where BUCKET is the bucket name you want to use (`AWS_VERIFICATION_BUCKET_NAME`).
- Delete the user / access key if you're not going to use them again.

To test that a verification bucket is functioning correctly:

- Log in to the AWS console, and create an IAM user with no added privileges.
- Generate an access key for this user, and configure the AWS CLI (`aws configure`).
- Open a Python shell (`manage.py shell`) and run:
```
import user.awsverification
user.awsverification.test_aws_verification_bucket(BUCKET)
```

## ORCID account integration

Obtaining a client_id / client_secret for interacting with the ORCID API:
3 changes: 3 additions & 0 deletions physionet-django/physionet/settings/base.py
@@ -269,6 +269,9 @@
AWS_HEADER_VALUE2 = config('AWS_VALUE2', default=False)
AWS_CLOUD_FORMATION = config('AWS_CLOUD_FORMATION', default=False)

# User verification bucket (see user/awsverification.py)
AWS_VERIFICATION_BUCKET_NAME = config('AWS_VERIFICATION_BUCKET_NAME', default=None)

# Tags for the DataCite API used for DOI
DATACITE_API_URL = config('DATACITE_API_URL', default='https://api.test.datacite.org/dois')
DATACITE_PREFIX = config('DATACITE_PREFIX', default='')