-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS identity verification #2121
Conversation
ee7a7b4
to
859822d
Compare
Just using the GetObject permission doesn't quite work - if the resource doesn't exist then we get a 403 even if we have permission to access it. https://stackoverflow.com/a/19038017 explains why this is (which makes sense), though it's not 100% clear how the backend is interpreting the policy, and I'd like to find some official AWS docs clarifying the matter. Adding ListBucket permission seems to make it work (and raises the question of whether the GetObject permission itself is needed.) |
Unfortunately, patching in X-Amz-Expected-Bucket-Owner doesn't work. Don't know if there's a way to include that header in the |
46ad8f2
to
078df56
Compare
Now with UI and should be working! |
15b6210
to
b177957
Compare
Nits:
|
Hmm. Some sources say that Amazon presigned URLs have a maximum expiration time of one week. Maybe that's true for the v4 signature? It's definitely not true for the v2 signature, I just generated one that was valid for a year and it was accepted. |
65c007c
to
cab730e
Compare
github-advanced-security, you're adorable. It's a fair point that I should rework that regexp to reduce backtrackability (though I'm not sure it's possible to completely fix the issue; python As for the rest of it, get outta here. ;) |
b0b9e44
to
ab2ab47
Compare
I believe this approach should work if we use S3 Access Points to grant access. If we use S3 Access Grants then it is less clear - can we still grant access based on userid or do we have to use ARN? So we may have to decide which approach to take before deciding how to validate identities. :( |
I'm still not clear about the access-grant thing and how it would work. But just to note: this code could be changed to verify the requester's So although I think it's questionable, security-wise, to rely on cross-account IAM ARNs, it's not impossible and also wouldn't preclude other approaches. |
Note that using presigned URLs seems to be broken with bucket regions other than This affects older versions of awscli. Newer versions don't use boto3 and are possibly unaffected. For the time being, PhysioNet should use us-east-1 for user verification, and other sites should do the same. Unrelated to user verification, this might also cause problems if we want to store data in regions other than us-east-1. |
ab2ab47
to
cc9b991
Compare
# If the signature is correct, and the identity is correct as | ||
# determined by the bucket policy, then S3 should return a 404 | ||
# response (because the resource doesn't, in fact, exist.) | ||
response = session.get(signed_url) |
Check failure
Code scanning / CodeQL
Full server-side request forgery
|
||
# As a sanity check, verify that S3 returns a 403 response if | ||
# the AWS signature is missing. | ||
response = session.get(unsigned_url) |
Check failure
Code scanning / CodeQL
Full server-side request forgery
This module contains functions for authenticating a person's AWS identity (account number, username, and user ID) by means of signed URLs. Amazon S3 authenticates clients using a per-request "signature" that incorporates the request path and headers together with a secret key held by the client. This means that the client can pre-compute this signature and send it to someone else, allowing the recipient to perform that request on that client's behalf, without revealing the secret key itself. We can arrange to create an S3 URL that can only be accessed by a particular AWS identity, and then ask someone to pre-compute the signature that they would use to access that resource (which they can do using the AWS CLI or other S3-compatible tools and libraries.) If we then submit that signature to S3 and it succeeds, we know that the requester holds the secret key for that identity. In fact, the resource in question doesn't need to actually exist, as long as we can tell the difference between an unauthorized request (HTTP 403) and an authorized request for something that doesn't exist (HTTP 404). AWS supports many types of identities. This module only supports "IAM user" identities (arn:aws:iam::*:user/*) and not any other types, both as a matter of policy (data access permissions are granted to individual people, not to groups, organizations, or computers) and because IAM users have a stable, fixed-length user ID.
We want to verify a person's AWS identity in order to permit them to access restricted resources via direct cloud APIs; and possibly for other purposes in the future. An AWS account ID is not an identity. An account may contain many identities (known as "userids" or "unique IDs"), which might or might not belong to the same person. (Even if they do all belong to the same person, it doesn't mean that person wants or should want to give all of their identities the ability to access sensitive data.) Here, we add fields to store the userid alongside the account ID and ARN (which may be of interest in the future), and the date and time that these credentials were verified.
In order for a person to verify their AWS identity, they need to provide a digital signature, in the form of a signed URL that includes their AWS account information in the path. We further require the URL to include the domain name of the site, and the user's primary email address, to prevent misuse. This signed URL can be generated using the AWS CLI. However, the URL must be exactly correct; if it is wrong, it is difficult to tell why. In order to hopefully avoid confusion, we first ask the person to run 'aws sts get-caller-identity'; based on that, we tell them the exact 'aws s3 presign' command they need to run.
I'm banning this bot. I don't accept contributions from bots unless they are free and open source. (Seriously. I am not interested in providing free labor to Microsoft to help them develop better tools to lock people into GitHub.) |
cc9b991
to
e36f7a4
Compare
regarding boto3 being silly in test suite, see #2150 (comment) |
In testing the new AWS user verification feature (see #2121), we found that the bucket policy written in the code didn't match the policy I had actually set for the `bm-uverify-test1` demo bucket. (Sorry, I guess I never tested the final version of that code.) We need to allow *any* principal, not only principals matching a pattern. This new version of the policy has been tested on my personal AWS account, so hopefully it should work when we do the same on the PhysioNet account.
Context
Pull #2086 added the ability to mirror project content on Amazon S3. This is now working and we are in the process of uploading open-access projects from PhysioNet.
The changes here will be needed once we start uploading restricted/credentialed projects, so that we can securely grant access to authorized users. (Identity verification aside, there are also some more significant changes that are needed for handling restricted/credentialed projects; see issue #2094.)
In brief: currently (in the old system Felipe set up), people are asked to self-report their AWS account number, and any person or service within that account would be allowed to access restricted data.
With these changes, in contrast, people will be asked to verify their personal AWS identity; subsequently, we'll be able to grant access only to verified identities (the latter part is yet to be implemented.)
Why
DUAs for MIMIC and other databases require that data is only shared with authorized individuals (each person must register on PhysioNet and be credentialed.) We want to enable cloud access for better performance, but complying with these DUAs requires knowing who is being granted permission to use these cloud services.
Moreover, although each user is ultimately responsible for data security, we want to encourage good practices. People may be using AWS for all sorts of reasons unrelated to PhysioNet. Giving themselves permission to access MIMIC through their personal account should not also grant permission to all of those unrelated and possibly-less-trusted services.
Some people may be using organizational AWS accounts rather than personal ones. Maybe we want to discourage this, or maybe not, but we can't prevent it. One member of an organization having access shouldn't grant access to everyone in the organization.
There is a lot about AWS authentication that is still a bit mysterious to me, but my gut feeling is that the "IAM user" level is the right level of authentication for PhysioNet and MIMIC.
It has been suggested that we could ask people to self-report their AWS username (or ARN?) in addition to their account number. And yes, that would be an improvement; but it has the disadvantage that usernames are variable-length, and may not be long-term stable. Better would be to ask people to self-report their AWS userid, but that's not easy for people to find and more likely to cause mistakes.
Finally, I can imagine that in the future there may be other reasons for wanting to associate a PhysioNet account with an AWS account, and having a strong verification process could enable more interesting forms of integration.
How identity verification works
The concept is that we would have a special-purpose S3 bucket which allows access only if the path matches the requester's AWS account and userid. To prove your identity, you generate a signature for a URL that can only be accessed by you, and paste that signed URL into a form on the site.
The process would be:
You go to your cloud settings page on PhysioNet.
We tell you to run the command
aws sts get-caller-identity
.You copy the output into the form.
We then tell you to run a command like
aws s3 presign s3://asdfghjk/physionet.org-verification/[email protected]/userid=AIDAABCDEFGHIJKL/account=112233445566/username=barackobama/
.You copy the output into the form.
We verify the format of the URL and submit it to AWS to verify the signature.
Wait a minute, what's this "userid" thing you keep talking about?
https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-unique-ids
Setup and testing
Using this feature requires creating a special-purpose S3 bucket (a bucket which probably should not be used for anything else.)
For the time being, you can test this by setting
AWS_VERIFICATION_BUCKET_NAME
tobm-uverify-test1
. I will delete that bucket once we've set up a permanent replacement under the PhysioNet AWS account.If you want to see exactly how the verification bucket is created, and test it yourself, see the instructions in
deploy/README.md
.Background
Although this implementation is guided by the needs of PhysioNet, my goal has been to design a general-purpose authentication protocol that could be used by any site that needs to verify cross-account AWS identities.
This is inspired in part by the technique used by Hashicorp Vault and discussed here:
and similarly: https://stackoverflow.com/a/76099155
We could use the same method, but it would require the person to download and run a small program (and that program involves some pretty hairy digging into the AWS API.)
The method proposed here, in contrast, only requires the person to install the official AWS CLI and run a couple of commands. I think that this is easier to understand and therefore paradoxically more secure (see if you can spot the security flaw in the StackOverflow answer.)
For information about why this works, see AWS documentation on policy variables:
https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html
Also see the AWS CLI documentation:
https://docs.aws.amazon.com/cli/latest/reference/sts/get-caller-identity.html
https://docs.aws.amazon.com/cli/latest/reference/s3/presign.html