
Uploading exported data to s3 without storing at the local disk #20

Open
Cole-Greer opened this issue Mar 17, 2023 · 8 comments

@Cole-Greer (Collaborator)

Copying #309 by @SergeVil from the old amazon-neptune-tools repository.

We are running the Neptune export on AWS Batch with Fargate, which has a 20 GB local disk limit, while the memory is quite large at 128 GB. Such a small disk volume prevents us from exporting a large database (we get an OS error). We are looking for an option to upload the exported CSV files directly from JVM memory to the S3 bucket. Thank you!

@Cole-Greer (Collaborator, Author)

@SergeVil I have a few questions to make sure we build the right solution. The biggest one is: how big is the exported data in your case?

As currently implemented, the export tool needs to buffer all of the data locally while it builds the schema, and then reformats the CSVs according to that schema.

Are you looking for a solution in which all results are streamed to S3 with minimal local buffering, or would your use case allow the full export to be buffered locally, as long as it is buffered in memory instead of on disk?
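
(For illustration: a minimal sketch of the streaming idea, uploading a CSV assembled in JVM memory directly to S3 with the AWS SDK for Java v2, without touching local disk. The bucket and key names are hypothetical, and this is not the tool's actual code.)

import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class DirectS3Upload {
    public static void main(String[] args) {
        // CSV content assembled entirely in memory; nothing is written to local disk.
        byte[] csv = "~id,~label\n1,person\n".getBytes(StandardCharsets.UTF_8);
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(
                PutObjectRequest.builder()
                    .bucket("my-export-bucket")       // hypothetical bucket
                    .key("neptune-export/nodes.csv")  // hypothetical key
                    .build(),
                RequestBody.fromBytes(csv));
        }
    }
}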

@SergeVil

@Cole-Greer Our current total data size is 14 GB. We may expect more, but this is the order of magnitude. The export folder on S3 currently contains 423 objects.

We would be fine with either of the approaches you propose. I think the first one is more generic, as it can support bigger databases, but the second one would also work for our use case.

@SergeVil

Also, you could store the files in a compressed format on the local disk. That can save up to 60% for plain text.
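
(For illustration: a minimal sketch of the compression suggestion, writing CSV rows through java.util.zip.GZIPOutputStream so they land on disk already compressed. The file name and rows are hypothetical.)

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class CompressedCsvWriter {
    public static void main(String[] args) throws IOException {
        Path out = Path.of("nodes.csv.gz"); // hypothetical output file
        // Wrap the file stream in GZIP so rows are compressed as they are written.
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(out)),
                StandardCharsets.UTF_8)) {
            w.write("~id,~label\n");
            w.write("1,person\n"); // repetitive plain-text CSV typically compresses well
        }
    }
}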

@Cole-Greer (Collaborator, Author)

@SergeVil It's good to hear that both approaches would work for you. We are going to start by adding an option to buffer the results in memory, as that is likely to be the quickest solution for your issue. We are also having early discussions about changing the way we resolve the graph schema so that this buffering is no longer needed, but those changes would be extensive and won't be ready for some time.
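
(A minimal sketch of what such in-memory buffering could look like, assuming a RAM-backed java.nio FileSystem such as Google's Jimfs, com.google.jimfs:jimfs. This illustrates the idea, not the tool's actual design.)

import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;

import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;

public class InMemoryBufferSketch {
    public static void main(String[] args) throws IOException {
        // An in-memory file system: files written here never touch disk.
        try (FileSystem fs = Jimfs.newFileSystem(Configuration.unix())) {
            Path root = fs.getPath("/neptune-export");
            Files.createDirectories(root);
            // Intermediate CSVs could be buffered here instead of on local disk.
            Path csv = root.resolve("nodes.csv");
            Files.writeString(csv, "~id,~label\n1,person\n");
            System.out.println("Buffered " + Files.size(csv) + " bytes in memory");
        }
    }
}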

@SergeVil

Hi @Cole-Greer. Thank you for taking the quickest path! When are you planning a release we can try?

@Cole-Greer (Collaborator, Author)

@SergeVil The next Neptune Export release is intended to come out later today, although unfortunately this fix has not yet been completed. It is being prioritized for the following release, which is scheduled for the end of April.

@SergeVil commented Apr 2, 2023

@Cole-Greer Thank you for the update. Please keep me posted if anything changes; our production release depends on this.

@Cole-Greer (Collaborator, Author)

@SergeVil I spent some time working on the in-memory buffering option mentioned above, and unfortunately it requires additional capabilities to implement in a performant manner.

I believe there is a viable short-term workaround which you can use immediately and which covers your use case. Docker containers have a /dev/shm/ directory, which is essentially a RAM-backed filesystem intended for inter-process communication. In my tests, the size of this filesystem appears to be bounded only by the memory available to the container. This lets us shift all of the buffering into memory simply by changing the --root-path option when invoking the export service.
For example:

java -jar neptune-export.jar nesvc \
    --root-path /dev/shm/neptune-export \
    --json '{
        "command": "export-rdf",
        "outputS3Path": S3_URL,
        "params": {
            "endpoint": NEPTUNE_ENDPOINT,
            "format": "turtle"
        }
    }'
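
(Note: under plain Docker, outside Fargate, /dev/shm defaults to 64 MB and may need to be enlarged with docker run's --shm-size flag, for example --shm-size=16g; on Fargate the size appears to track the container's memory, as noted above.)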

I hope this provides a short-term solution to your issue.

The long-term solution will require extensive reworking of how the tool infers graph schemas and processes query results. There are plans to incorporate the new Neptune Summary API to derive a graph schema before any querying of the graph; however, we can't confirm timelines for these enhancements.

Please let me know whether the suggested workaround resolves your issue and is feasible for you to apply in the short term.
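
(For reference: a minimal sketch of calling the Neptune graph summary endpoint with Java 11's HttpClient. It assumes the documented /propertygraph/statistics/summary path on the default port 8182 and that IAM database authentication is disabled; with IAM auth enabled, requests would need SigV4 signing.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GraphSummarySketch {
    public static void main(String[] args) throws Exception {
        // args[0]: the Neptune cluster endpoint host name.
        String url = "https://" + args[0] + ":8182/propertygraph/statistics/summary";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // The JSON response lists node/edge labels and property keys,
        // the kind of information a schema could be derived from up front.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}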
