Blank Manifest files generated from Cart #42
Comments
@skirk-mpr For your second question, that seems like a bug. I'm sorry for the inconvenience; we will fix this issue in the next release.
Thank you for your response and for the clarification, @beomseoklee! Just to confirm: when providing a folder within the S3 bucket as part of the Data Package manifest, the Glue Crawlers will always produce a single table, even if the schemas of the individual files are inconsistent (e.g. they do not share any common header)? Thank you for confirming that this is in fact a bug! Just out of curiosity, is there a regular release schedule planned for this product? It has been great to experiment with!
@skirk-mpr The root cause of this one is due to this part. URL parsing would include the leading '/' in the S3 key, so I think the simplest workaround would be either to remove it or to change the code to something like:

    if (items[index].type === 'dataset' || items[index].content_type === 'include-path') {
        // Strip the leading slash that URL parsing leaves on the key.
        let s3Key = items[index].s3_key;
        if (s3Key.startsWith('/')) {
            s3Key = s3Key.slice(1);
        }
        checkObjectExists(items[index].s3_bucket, s3Key, function(err, data) {
            if (data) {
                if (format === 'signed-url') {
                    let params = {
                        Bucket: items[index].s3_bucket,
                        Key: s3Key,
                        Expires: expiration
                    };
                    var _url = s3.getSignedUrl('getObject', params);
                    _content.entries.push({
                        url: _url
                    });
                } else if (format === 'bucket-key') {
                    _content.entries.push({
                        bucket: items[index].s3_bucket,
                        // Use the normalized local key; items[index] has no s3Key property.
                        key: s3Key
                    });
                }
            }
            // ... rest of the handler unchanged

This bug is currently in our backlog, but we haven't yet scheduled the next release that would include the fix.
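The leading-slash handling discussed in this comment can be isolated into a small helper, which makes it easier to check on its own. This is only a minimal sketch: `normalizeS3Key` and `toBucketKeyEntry` are hypothetical names that do not exist in the project, and the item shape (`s3_bucket`, `s3_key`) is assumed from the snippet in the comment.

```javascript
// Hypothetical helper (not part of the project): drop a single leading
// slash so the key matches how the object is actually stored in S3.
function normalizeS3Key(rawKey) {
  return rawKey.startsWith('/') ? rawKey.slice(1) : rawKey;
}

// Hypothetical helper: build a 'bucket-key' manifest entry from a cart item.
// The item fields s3_bucket and s3_key are assumed from the thread's snippet.
function toBucketKeyEntry(item) {
  return {
    bucket: item.s3_bucket,
    key: normalizeS3Key(item.s3_key)
  };
}

// Example usage with a made-up cart item.
const entry = toBucketKeyEntry({ s3_bucket: 'example-bucket', s3_key: '/folder/file.csv' });
console.log(JSON.stringify(entry));
```

Keys that do not start with a slash pass through unchanged, so the helper is safe to apply to every item.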
Actually, it looks like pointing to an existing location on S3 only works when you point to a file. When merely pointing to a folder on S3 that contains multiple files, you ultimately end up with an empty array in the generated manifest file.
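One way to reason about the folder case is that a folder include path would need to be expanded into the object keys stored beneath it before manifest entries can be produced. The sketch below shows only that expansion step as a pure function. It is an assumption about how such a case could be handled, not the maintainers' fix; `keysUnderPrefix` is a hypothetical name, and `allKeys` stands in for the result of an S3 listing call.

```javascript
// Hypothetical helper (not part of the project): given keys returned by
// some S3 listing call, keep those that live under a "folder" prefix.
function keysUnderPrefix(allKeys, prefix) {
  // Normalize: no leading slashes, exactly one trailing slash.
  const normalized = prefix.replace(/^\/+/, '').replace(/\/*$/, '/');
  // Skip the zero-byte "folder" placeholder object itself, if present.
  return allKeys.filter(key => key.startsWith(normalized) && key !== normalized);
}

// Example usage with made-up keys.
const keys = ['reports/', 'reports/a.csv', 'reports/b.csv', 'other/c.csv'];
console.log(keysUnderPrefix(keys, '/reports'));
```

Each returned key could then go through the same per-file path that already works for manifests listing individual files.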
We have several data packages within our Data Lake -- a mix of data packages created with manifest files that point to individual files in S3:
as well as manifests that just have include paths to a "subfolder" which contains files.
In both cases, after the Glue Crawlers run successfully, we see tables in the 'Integrations' tab for the Data Package. For packages created with manifests that list each individual file, the individual files are listed as tables. For data packages created with manifest files that point to just a "subfolder" within the bucket that contains multiple files, a single table appears in the Integrations tab. Exploring this table via the Glue link or the Athena query view suggests it consolidates the records across the three files into a single table, even if the files share only some common fields in their schemas and are not completely identical. Is this expected?
However, our real question/issue is this: when we add these two Data Packages to our cart and generate S3 signed-URL manifests, what we get are essentially blank manifests, with only the following content:
{"entries":[]}