Support for multipart upload #4

Open
wants to merge 3 commits into
base: master

@t3t5u t3t5u commented Dec 18, 2024

PUT object (EntityTooLarge)

config.yml
exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: file
  path_prefix: test.10485760.jsonl.gz # 10 GiB of output (10,737,418,240 bytes per the multipart logs)
  decoders:
  - {type: gzip}
  parser:
    type: jsonl
    columns:
    - {name: string, type: string}
out:
  type: s3
  auth_method: basic
  access_key_id: ***
  secret_access_key: ***
  region: ap-northeast-1
  bucket: trocco-sandbox
  path_prefix: test_by_sano/multipart_upload
  file_ext: .jsonl
  sequence_format: .%03d.%02d
  formatter:
    type: jsonl
embulk run
$ embulk run config.yml
2024-12-18 11:36:13.610 +0000: Embulk v0.9.26
2024-12-18 11:36:14.241 +0000 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2024-12-18 11:36:15.536 +0000 [INFO] (main): Gem's home and path are set by default: "/home/ubuntu/.embulk/lib/gems"
2024-12-18 11:36:16.135 +0000 [INFO] (main): Started Embulk v0.9.26
2024-12-18 11:36:16.245 +0000 [INFO] (0001:transaction): Loaded plugin embulk-output-s3 (1.7.1.snapshot)
2024-12-18 11:36:16.277 +0000 [INFO] (0001:transaction): Loaded plugin embulk-parser-jsonl (0.2.1)
2024-12-18 11:36:16.286 +0000 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'test.10485760.jsonl.gz'
2024-12-18 11:36:16.287 +0000 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2024-12-18 11:36:16.290 +0000 [INFO] (0001:transaction): Loading files [test.10485760.jsonl.gz]
2024-12-18 11:36:16.307 +0000 [INFO] (0001:transaction): Using local thread executor with max_threads=1 / tasks=1
2024-12-18 11:36:16.336 +0000 [INFO] (0001:transaction): Loaded plugin embulk-formatter-jsonl (0.1.4)
2024-12-18 11:36:16.369 +0000 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2024-12-18 11:36:16.692 +0000 [INFO] (0015:task-0000): Writing S3 file 'test_by_sano/multipart_upload.000.00.jsonl'
2024-12-18 11:39:38.730 +0000 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
org.embulk.exec.PartialExecutionException: com.amazonaws.services.s3.model.AmazonS3Exception: Your proposed upload exceeds the maximum allowed size (Service: Amazon S3; Status Code: 400; Error Code: EntityTooLarge; Request ID: ****************; S3 Extended Request ID: ****************************************************************************; Proxy: null), S3 Extended Request ID: ****************************************************************************
	at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
	at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:566)
	at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:35)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
	at org.embulk.spi.Exec.doWith(Exec.java:22)
	at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
	at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:242)
	at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:290)
	at org.embulk.EmbulkRunner.run(EmbulkRunner.java:155)
	at org.embulk.cli.EmbulkRun.runSubcommand(EmbulkRun.java:431)
	at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:90)
	at org.embulk.cli.Main.main(Main.java:64)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Your proposed upload exceeds the maximum allowed size (Service: Amazon S3; Status Code: 400; Error Code: EntityTooLarge; Request ID: ****************; S3 Extended Request ID: ****************************************************************************; Proxy: null), S3 Extended Request ID: ****************************************************************************
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(com/amazonaws/http/AmazonHttpClient.java:1819)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(com/amazonaws/http/AmazonHttpClient.java:1403)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(com/amazonaws/http/AmazonHttpClient.java:1372)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(com/amazonaws/http/AmazonHttpClient.java:1145)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(com/amazonaws/http/AmazonHttpClient.java:802)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(com/amazonaws/http/AmazonHttpClient.java:770)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(com/amazonaws/http/AmazonHttpClient.java:744)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(com/amazonaws/http/AmazonHttpClient.java:704)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(com/amazonaws/http/AmazonHttpClient.java:686)
	at com.amazonaws.http.AmazonHttpClient.execute(com/amazonaws/http/AmazonHttpClient.java:550)
	at com.amazonaws.http.AmazonHttpClient.execute(com/amazonaws/http/AmazonHttpClient.java:530)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(com/amazonaws/services/s3/AmazonS3Client.java:5437)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(com/amazonaws/services/s3/AmazonS3Client.java:5384)
	at com.amazonaws.services.s3.AmazonS3Client.access$300(com/amazonaws/services/s3/AmazonS3Client.java:421)
	at com.amazonaws.services.s3.AmazonS3Client$PutObjectStrategy.invokeServiceCall(com/amazonaws/services/s3/AmazonS3Client.java:6508)
	at com.amazonaws.services.s3.AmazonS3Client.uploadObject(com/amazonaws/services/s3/AmazonS3Client.java:1856)
	at com.amazonaws.services.s3.AmazonS3Client.putObject(com/amazonaws/services/s3/AmazonS3Client.java:1816)
	at org.embulk.output.s3.S3FileOutputPlugin$S3FileOutput.putFile(org/embulk/output/s3/S3FileOutputPlugin.java:534)
	at org.embulk.output.s3.S3FileOutputPlugin$S3FileOutput.multipartUploadOrPutFile(org/embulk/output/s3/S3FileOutputPlugin.java:341)
	at org.embulk.output.s3.S3FileOutputPlugin$S3FileOutput.closeCurrent(org/embulk/output/s3/S3FileOutputPlugin.java:544)
	at org.embulk.output.s3.S3FileOutputPlugin$S3FileOutput.finish(org/embulk/output/s3/S3FileOutputPlugin.java:601)
	at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)
	at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:438)
	at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:302)
	at RUBY.finish(uri:classloader:/gems/embulk-0.9.26-java/lib/embulk/file_output.rb:44)
	at RUBY.finish(/home/ubuntu/.embulk/lib/gems/gems/embulk-formatter-jsonl-0.1.4/lib/embulk/formatter/jsonl.rb:79)
	at RUBY.finish(uri:classloader:/gems/embulk-0.9.26-java/lib/embulk/formatter_plugin.rb:80)
	at Embulk$$FormatterPlugin$$JavaAdapter$$OutputAdapter_1623102721.finish(Embulk$$FormatterPlugin$$JavaAdapter$$OutputAdapter_1623102721.gen:13)
	at org.embulk.spi.FileOutputRunner$DelegateTransactionalPageOutput.finish(org/embulk/spi/FileOutputRunner.java:154)
	at org.embulk.spi.PageBuilder.finish(org/embulk/spi/PageBuilder.java:227)
	at org.embulk.parser.jsonl.JsonlParserPlugin.run(org/embulk/parser/jsonl/JsonlParserPlugin.java:183)
	at org.embulk.spi.FileInputRunner.run(org/embulk/spi/FileInputRunner.java:140)
	at org.embulk.spi.util.Executors.process(org/embulk/spi/util/Executors.java:62)
	at org.embulk.spi.util.Executors.process(org/embulk/spi/util/Executors.java:38)
	at org.embulk.exec.LocalExecutorPlugin$DirectExecutor$1.call(org/embulk/exec/LocalExecutorPlugin.java:170)
	at org.embulk.exec.LocalExecutorPlugin$DirectExecutor$1.call(org/embulk/exec/LocalExecutorPlugin.java:167)
	at java.util.concurrent.FutureTask.run(java/util/concurrent/FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(java/lang/Thread.java:750)

Error: com.amazonaws.services.s3.model.AmazonS3Exception: Your proposed upload exceeds the maximum allowed size (Service: Amazon S3; Status Code: 400; Error Code: EntityTooLarge; Request ID: ****************; S3 Extended Request ID: ****************************************************************************; Proxy: null), S3 Extended Request ID: ****************************************************************************
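The stack trace shows the output going through a single `putObject` call (via `putFile`). S3 rejects a single PUT larger than 5 GiB, and the multipart runs report the file as 10,737,418,240 bytes. A minimal sketch of that size check, assuming the documented 5 GiB single-PUT limit; the class and method names are hypothetical and are not taken from the plugin:

```java
// Hypothetical helper for illustration; not part of embulk-output-s3.
final class SinglePutLimit {
    // S3 documents 5 GiB as the largest object a single PUT request can upload.
    static final long MAX_SINGLE_PUT_BYTES = 5L * 1024 * 1024 * 1024;

    // True when an object of this size cannot be sent with one PUT request.
    static boolean exceedsSinglePutLimit(long objectSizeBytes) {
        return objectSizeBytes > MAX_SINGLE_PUT_BYTES;
    }

    public static void main(String[] args) {
        long outputSize = 10_737_418_240L; // file size reported in the multipart logs
        System.out.println(exceedsSinglePutLimit(outputSize)); // prints "true"
    }
}
```

Anything above this threshold has to be split into parts, which is what the `multipart_upload` option in the following runs enables.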

Multipart upload (part_size: 1g)

config.yml
out:
  ...
  multipart_upload:
    part_size: 1g
  ...
embulk run
$ embulk run config.yml
2024-12-18 11:41:17.271 +0000: Embulk v0.9.26
2024-12-18 11:41:17.886 +0000 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2024-12-18 11:41:19.217 +0000 [INFO] (main): Gem's home and path are set by default: "/home/ubuntu/.embulk/lib/gems"
2024-12-18 11:41:19.860 +0000 [INFO] (main): Started Embulk v0.9.26
2024-12-18 11:41:19.933 +0000 [INFO] (0001:transaction): Loaded plugin embulk-output-s3 (1.7.1.snapshot)
2024-12-18 11:41:19.977 +0000 [INFO] (0001:transaction): Loaded plugin embulk-parser-jsonl (0.2.1)
2024-12-18 11:41:19.989 +0000 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'test.10485760.jsonl.gz'
2024-12-18 11:41:19.990 +0000 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2024-12-18 11:41:19.992 +0000 [INFO] (0001:transaction): Loading files [test.10485760.jsonl.gz]
2024-12-18 11:41:20.011 +0000 [INFO] (0001:transaction): Using local thread executor with max_threads=1 / tasks=1
2024-12-18 11:41:20.052 +0000 [INFO] (0001:transaction): Loaded plugin embulk-formatter-jsonl (0.1.4)
2024-12-18 11:41:20.092 +0000 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2024-12-18 11:41:20.372 +0000 [INFO] (0015:task-0000): Writing S3 file 'test_by_sano/multipart_upload.000.00.jsonl'
2024-12-18 11:44:32.534 +0000 [INFO] (pool-2-thread-1): Uploading a part 1 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:32.567 +0000 [INFO] (pool-2-thread-3): Uploading a part 3 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:32.639 +0000 [INFO] (pool-2-thread-2): Uploading a part 2 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:32.651 +0000 [INFO] (pool-2-thread-4): Uploading a part 4 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:39.219 +0000 [INFO] (pool-2-thread-4): Uploaded 4,294,967,296 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:39.369 +0000 [INFO] (pool-2-thread-1): Uploaded 1,073,741,824 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:39.403 +0000 [INFO] (pool-2-thread-2): Uploaded 2,147,483,648 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:39.730 +0000 [INFO] (pool-2-thread-3): Uploaded 3,221,225,472 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:41.961 +0000 [INFO] (pool-2-thread-4): Uploading a part 5 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:42.125 +0000 [INFO] (pool-2-thread-1): Uploading a part 6 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:42.162 +0000 [INFO] (pool-2-thread-2): Uploading a part 7 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:42.504 +0000 [INFO] (pool-2-thread-3): Uploading a part 8 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:48.731 +0000 [INFO] (pool-2-thread-4): Uploaded 5,368,709,120 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:48.749 +0000 [INFO] (pool-2-thread-2): Uploaded 7,516,192,768 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:49.154 +0000 [INFO] (pool-2-thread-3): Uploaded 8,589,934,592 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:49.426 +0000 [INFO] (pool-2-thread-1): Uploaded 6,442,450,944 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:51.488 +0000 [INFO] (pool-2-thread-4): Uploading a part 9 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:51.509 +0000 [INFO] (pool-2-thread-2): Uploading a part 10 / 10. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:44:58.181 +0000 [INFO] (pool-2-thread-4): Uploaded 9,663,676,416 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:58.247 +0000 [INFO] (pool-2-thread-2): Uploaded 10,737,418,240 / 10,737,418,240 bytes of the file. entity tag '37799b99cb0bd8573b64c278b987a86e'
2024-12-18 11:44:59.342 +0000 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2024-12-18 11:44:59.346 +0000 [INFO] (main): Committed.
2024-12-18 11:44:59.346 +0000 [INFO] (main): Next config diff: {"in":{"last_path":"test.10485760.jsonl.gz"},"out":{}}
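With `part_size: 1g`, the log reports the first part ending at 1,073,741,824 bytes, so `1g` is evidently read as 2^30 bytes. A minimal sketch of that interpretation, assuming binary `k`, `m`, and `g` suffixes; this parser is hypothetical and may differ from how the plugin actually reads the option:

```java
import java.util.Locale;

// Hypothetical size parser for illustration; not the plugin's implementation.
final class PartSizeParser {
    // Converts strings such as "1g" or "512m" to a byte count using binary units.
    static long parseBytes(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier;
        switch (s.charAt(s.length() - 1)) {
            case 'k': multiplier = 1L << 10; break;
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            default: return Long.parseLong(s); // no suffix: plain byte count
        }
        long value = Long.parseLong(s.substring(0, s.length() - 1));
        return Math.multiplyExact(value, multiplier); // throws on overflow
    }

    public static void main(String[] args) {
        System.out.println(parseBytes("1g")); // prints 1073741824
        System.out.println(parseBytes("5g")); // prints 5368709120
    }
}
```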

Multipart upload (part_size: 5g)

config.yml
out:
  ...
  multipart_upload:
    part_size: 5g
  ...
embulk run
$ embulk run config.yml
2024-12-18 11:45:47.745 +0000: Embulk v0.9.26
2024-12-18 11:45:48.345 +0000 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2024-12-18 11:45:49.624 +0000 [INFO] (main): Gem's home and path are set by default: "/home/ubuntu/.embulk/lib/gems"
2024-12-18 11:45:50.205 +0000 [INFO] (main): Started Embulk v0.9.26
2024-12-18 11:45:50.286 +0000 [INFO] (0001:transaction): Loaded plugin embulk-output-s3 (1.7.1.snapshot)
2024-12-18 11:45:50.319 +0000 [INFO] (0001:transaction): Loaded plugin embulk-parser-jsonl (0.2.1)
2024-12-18 11:45:50.329 +0000 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'test.10485760.jsonl.gz'
2024-12-18 11:45:50.329 +0000 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2024-12-18 11:45:50.332 +0000 [INFO] (0001:transaction): Loading files [test.10485760.jsonl.gz]
2024-12-18 11:45:50.350 +0000 [INFO] (0001:transaction): Using local thread executor with max_threads=1 / tasks=1
2024-12-18 11:45:50.380 +0000 [INFO] (0001:transaction): Loaded plugin embulk-formatter-jsonl (0.1.4)
2024-12-18 11:45:50.416 +0000 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2024-12-18 11:45:50.748 +0000 [INFO] (0015:task-0000): Writing S3 file 'test_by_sano/multipart_upload.000.00.jsonl'
2024-12-18 11:49:11.284 +0000 [INFO] (pool-2-thread-1): Uploading a part 1 / 2. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:49:11.303 +0000 [INFO] (pool-2-thread-2): Uploading a part 2 / 2. bucket 'trocco-sandbox', key 'test_by_sano/multipart_upload.000.00.jsonl', upload id '********************************************************************************************************************************************************'
2024-12-18 11:49:43.895 +0000 [INFO] (pool-2-thread-1): Uploaded 5,368,709,120 / 10,737,418,240 bytes of the file. entity tag '9c15fe05ab201b4d665f9652b611040c'
2024-12-18 11:49:43.911 +0000 [INFO] (pool-2-thread-2): Uploaded 10,737,418,240 / 10,737,418,240 bytes of the file. entity tag '9c15fe05ab201b4d665f9652b611040c'
2024-12-18 11:49:45.008 +0000 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2024-12-18 11:49:45.012 +0000 [INFO] (main): Committed.
2024-12-18 11:49:45.013 +0000 [INFO] (main): Next config diff: {"in":{"last_path":"test.10485760.jsonl.gz"},"out":{}}
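Both multipart runs split the same 10,737,418,240-byte file: ten parts at `1g` and two at `5g`, and each "Uploaded" line reports the end offset of the part that just finished. The arithmetic can be sketched as follows. The names are hypothetical, and the 10,000-part cap is S3's documented limit rather than something shown in these logs:

```java
// Hypothetical part arithmetic for illustration; not the plugin's implementation.
final class PartMath {
    static final long MAX_PARTS = 10_000L; // S3 limit on parts per multipart upload

    // Number of parts needed: ceil(fileSize / partSize).
    static long partCount(long fileSize, long partSize) {
        if (fileSize <= 0 || partSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long count = fileSize / partSize + (fileSize % partSize == 0 ? 0 : 1);
        if (count > MAX_PARTS) {
            throw new IllegalArgumentException("part_size too small: " + count + " parts");
        }
        return count;
    }

    // Exclusive end offset of the 1-based part; the last part may be shorter.
    static long partEnd(long fileSize, long partSize, long partNumber) {
        return Math.min(partNumber * partSize, fileSize);
    }

    public static void main(String[] args) {
        long fileSize = 10_737_418_240L;
        long oneGiB = 1L << 30;
        System.out.println(partCount(fileSize, oneGiB));     // prints 10
        System.out.println(partCount(fileSize, 5 * oneGiB)); // prints 2
        System.out.println(partEnd(fileSize, oneGiB, 4));    // prints 4294967296
    }
}
```

The third value matches the "Uploaded 4,294,967,296 / 10,737,418,240 bytes" line logged for part 4 in the `1g` run.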

@t3t5u t3t5u requested review from d-hrs and NamedPython December 18, 2024 08:08