
Uploading exported data to s3 without storing at the local disk #20

Open
Cole-Greer opened this issue Mar 17, 2023 · 8 comments

@Cole-Greer (Collaborator)

Copying #309 by @SergeVil from the old amazon-neptune-tools repository.

We are running the Neptune export on AWS Batch with Fargate, which has a 20 GB local disk limit, while the memory is quite large at 128 GB. Such a small disk volume prevents us from exporting a large database (we get an OS error). We are looking for an option to upload the exported CSV files directly from JVM memory to the S3 bucket. Thank you!

@Cole-Greer (Collaborator, Author)

@SergeVil I have a few questions to make sure we build the right solution. The biggest one is: how big is the exported data in your case?

As currently implemented, the export tool needs to buffer all of the data locally while it builds the schema, and then reformats the CSVs according to that schema.

Are you looking for a solution in which all results are streamed to S3 with minimal local buffering, or would your use case allow the full export to be buffered locally, as long as it is buffered in memory instead of on disk?
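
(For illustration: a minimal sketch of the streaming idea, uploading a CSV assembled in JVM memory directly to S3 with the AWS SDK for Java v2, without touching local disk. The bucket and key names are hypothetical, and this is not the tool's actual code.)

import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class DirectS3Upload {
    public static void main(String[] args) {
        // CSV content assembled entirely in memory; nothing is written to local disk.
        byte[] csv = "~id,~label\n1,person\n".getBytes(StandardCharsets.UTF_8);
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(
                PutObjectRequest.builder()
                    .bucket("my-export-bucket")       // hypothetical bucket
                    .key("neptune-export/nodes.csv")  // hypothetical key
                    .build(),
                RequestBody.fromBytes(csv));
        }
    }
}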

@SergeVil

@Cole-Greer Our current total data size is 14 GB. We may expect more, but this is the order of magnitude. The export folder on S3 currently contains 423 objects.

We would be fine with either of the approaches you propose. I think the first one is more generic, as it can support bigger databases, but the second one would also work for our use case.

@SergeVil

Also, you could store the files in a compressed format on the local disk. That can save up to 60% for plain text.
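
(For illustration: a minimal sketch of the compression suggestion, writing CSV rows through java.util.zip.GZIPOutputStream so they land on disk already compressed. The file name and rows are hypothetical.)

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class CompressedCsvWriter {
    public static void main(String[] args) throws IOException {
        Path out = Path.of("nodes.csv.gz"); // hypothetical output file
        // Wrap the file stream in GZIP so rows are compressed as they are written.
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(out)),
                StandardCharsets.UTF_8)) {
            w.write("~id,~label\n");
            w.write("1,person\n"); // repetitive plain-text CSV typically compresses well
        }
    }
}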

@Cole-Greer (Collaborator, Author)

@SergeVil It's good to hear that both approaches would work for you. We are going to start by adding an option to buffer the results in memory, as that is likely to be the quickest solution for your issue. We are also having early discussions about changing the way we resolve the graph schema so that this buffering is no longer needed, but those changes would be extensive and won't be ready for some time.
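
(A minimal sketch of what such in-memory buffering could look like, assuming a RAM-backed java.nio FileSystem such as Google's Jimfs, com.google.jimfs:jimfs. This illustrates the idea, not the tool's actual design.)

import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;

import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;

public class InMemoryBufferSketch {
    public static void main(String[] args) throws IOException {
        // An in-memory file system: files written here never touch disk.
        try (FileSystem fs = Jimfs.newFileSystem(Configuration.unix())) {
            Path root = fs.getPath("/neptune-export");
            Files.createDirectories(root);
            // Intermediate CSVs could be buffered here instead of on local disk.
            Path csv = root.resolve("nodes.csv");
            Files.writeString(csv, "~id,~label\n1,person\n");
            System.out.println("Buffered " + Files.size(csv) + " bytes in memory");
        }
    }
}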

@SergeVil

Hi @Cole-Greer. Thank you for taking the quickest path! When are you planning a release we can try?

@Cole-Greer (Collaborator, Author)

@SergeVil The next Neptune Export release is intended to come out later today, although unfortunately this fix has not yet been completed. It is being prioritized for the following release, which is scheduled for the end of April.

@SergeVil commented Apr 2, 2023

@Cole-Greer Thank you for the update. Please keep me posted if anything changes; our production release depends on this.

@Cole-Greer (Collaborator, Author)

@SergeVil I spent some time working on the in-memory buffering option mentioned above, and unfortunately it requires additional capabilities to implement in a performant manner.

I believe there is a viable short-term workaround which you can use immediately and which covers your use case. Docker containers have a /dev/shm/ directory, which is essentially a RAM-backed filesystem intended for inter-process communication. In my tests, the size of this filesystem appears to be bounded only by the memory available to the container. This lets us shift all of the buffering into memory simply by changing the --root-path option when invoking the export service.
For example:

java -jar neptune-export.jar nesvc \
    --root-path /dev/shm/neptune-export \
    --json '{
        "command": "export-rdf",
        "outputS3Path": S3_URL,
        "params": {
            "endpoint": NEPTUNE_ENDPOINT,
            "format": "turtle"
        }
    }'
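
(Note: under plain Docker, outside Fargate, /dev/shm defaults to 64 MB and may need to be enlarged with docker run's --shm-size flag, for example --shm-size=16g; on Fargate the size appears to track the container's memory, as noted above.)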

I hope this provides a short-term solution to your issue.

The long-term solution will require extensive reworking of how the tool infers graph schemas and processes query results. There are plans to incorporate the new Neptune Summary API to derive a graph schema before any querying of the graph; however, we can't confirm timelines for these enhancements.

Please let me know whether the suggested workaround resolves your issue and is feasible for you to apply in the short term.
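
(For reference: a minimal sketch of calling the Neptune graph summary endpoint with Java 11's HttpClient. It assumes the documented /propertygraph/statistics/summary path on the default port 8182 and that IAM database authentication is disabled; with IAM auth enabled, requests would need SigV4 signing.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GraphSummarySketch {
    public static void main(String[] args) throws Exception {
        // args[0]: the Neptune cluster endpoint host name.
        String url = "https://" + args[0] + ":8182/propertygraph/statistics/summary";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // The JSON response lists node/edge labels and property keys,
        // the kind of information a schema could be derived from up front.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}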
