Blank Manifest files generated from Cart #42

Open
skirk-mpr opened this issue May 7, 2020 · 4 comments

Comments

@skirk-mpr

We have several data packages within our Data Lake -- a mix of data packages created with manifest files that point to individual files in S3:

{
    "dataStore": [
        {
            "includePath": "s3://my-bucket/test/test.csv"
        },
        {
            "includePath": "s3://my-bucket/test/test2.csv"
        },
        {
            "includePath": "s3://my-bucket/test/test3.csv"
        }
    ]
}

as well as manifests that only have include paths to a "subfolder" that contains files:

{
    "dataStore": [
        {
            "includePath": "s3://my-bucket/test/"
        }
    ]
}

In both cases, after the Glue Crawler runs successfully, we see tables listed in the 'Integrations' tab for the Data Package. For packages created with a manifest that lists each file individually, every file appears as its own table. For packages created with a manifest that points only to a "subfolder" containing multiple files, a single table appears. Exploring this table via the Glue link or the Athena query view suggests that it consolidates the records from the three files into one table, even if the files share only some fields in their schema and are not completely identical. Is this expected?

However, our real question is about adding these two Data Packages to our cart and generating S3 Signed URL manifests: what we get are essentially blank manifests, with only the following content:

{"entries":[]}

@beomseoklee
Member

@skirk-mpr
For your first question, that is expected. The Data Lake solution uses AWS Glue, and since you provided a folder, it will crawl every file in that folder.

For your second question, that looks like a bug: the data-lake-datasets table contains an s3_key starting with /. I haven't figured out what causes that yet, but it's not normal.

I'm sorry for the inconvenience. We plan to fix this issue in the next release.

@skirk-mpr
Author

Thank you for your response and for the clarification -- @beomseoklee!

Just to confirm: when providing a folder within the S3 bucket as part of the Data Package manifest, the Glue Crawler will always produce a single table, even if the schemas of the individual files are inconsistent -- e.g. they do not share any common header, etc?

Thank you for confirming that this is in fact a bug! Just out of curiosity, is there a regular release schedule planned for this product? It has been great to experiment with!

@beomseoklee
Member

@skirk-mpr
You are right about AWS Glue.

The root cause of the blank manifests is in this part:
https://github.com/awslabs/aws-data-lake-solution/blob/e1adf3064644db007154fe717ea7716f8163c3a7/source/api/services/manifest/lib/manifest.js#L496-L501

URL parsing includes the leading / in the path, so the data-lake-datasets DynamoDB table ends up with a / at the start of s3_key.
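To make the leading slash concrete, here is a minimal sketch using Node's built-in WHATWG `URL` class. Which parser manifest.js actually calls is an assumption on my part; the point is only that the path component of an `s3://` URL starts with `/`, while S3 object keys do not.

```javascript
// Assumption: the global WHATWG URL class stands in for the parser used in
// manifest.js. The include path is the one from the manifest earlier in the thread.
const parsed = new URL('s3://my-bucket/test/test.csv');

const bucket = parsed.hostname; // the bucket name sits in the host position
const rawKey = parsed.pathname; // the path component keeps its leading '/'

// Same normalization as the workaround: drop one leading slash if present.
const s3Key = rawKey.startsWith('/') ? rawKey.slice(1) : rawKey;

console.log({ bucket, rawKey, s3Key });
```

If `rawKey` is stored as-is, a later existence check looks for an object whose key begins with `/`, which would not match the uploaded object.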

So I think the simplest workaround is either to remove the / in the processManifestEntry function, or to change the code below https://github.com/awslabs/aws-data-lake-solution/blob/e1adf3064644db007154fe717ea7716f8163c3a7/source/api/services/manifest/lib/manifest.js#L269-L288

to something like

if (items[index].type === 'dataset' || items[index].content_type === 'include-path') {
    // Strip the leading '/' that URL parsing left on the stored key.
    let s3Key = items[index].s3_key;
    if (s3Key.startsWith('/')) {
        s3Key = s3Key.slice(1);
    }
    checkObjectExists(items[index].s3_bucket, s3Key, function(err, data) {
        if (data) {
            if (format === 'signed-url') {
                let params = {
                    Bucket: items[index].s3_bucket,
                    Key: s3Key,
                    Expires: expiration
                };
                var _url = s3.getSignedUrl('getObject', params);
                _content.entries.push({
                    url: _url
                });
            } else if (format === 'bucket-key') {
                _content.entries.push({
                    bucket: items[index].s3_bucket,
                    // Use the normalized key; items[index].s3Key does not exist.
                    key: s3Key
                });
            }
        }
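The other workaround, removing the / at parse time, could look roughly like the sketch below. `splitIncludePath` is a hypothetical helper name, not a function from the solution, and I'm assuming the WHATWG `URL` class rather than whatever processManifestEntry uses today. It does not decode percent-encoded characters in the path.

```javascript
// Hypothetical helper (not part of aws-data-lake-solution): split an
// s3:// include path into a bucket and a key without a leading slash.
function splitIncludePath(includePath) {
    const parsed = new URL(includePath);
    if (parsed.protocol !== 's3:') {
        throw new Error(`Expected an s3:// path, got: ${includePath}`);
    }
    // pathname always begins with '/', which is not part of the S3 key.
    const key = parsed.pathname.startsWith('/')
        ? parsed.pathname.slice(1)
        : parsed.pathname;
    return { bucket: parsed.hostname, key: key };
}

console.log(splitIncludePath('s3://my-bucket/test/test.csv'));
console.log(splitIncludePath('s3://my-bucket/test/'));
```

Normalizing once when the manifest entry is processed would keep the DynamoDB table clean, whereas the change above only compensates when the cart manifest is generated.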

This bug has been added to our backlog, but we haven't scheduled the next release of the solution yet.

@plorent

plorent commented Nov 7, 2022

Actually, it looks like pointing to an existing location on S3 only works when you point to a file. When merely pointing to a folder on S3 that contains multiple files, you ultimately end up with an empty array in the generated manifest file.
Update: just tested with a manifest.json that points to individual files on S3 and now I get download links for those files when I generate a manifest in my cart.
