Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in Ruby gem with repeated recursive messages #19498

Open
nirvdrum opened this issue Dec 3, 2024 · 9 comments
Open

Memory leak in Ruby gem with repeated recursive messages #19498

nirvdrum opened this issue Dec 3, 2024 · 9 comments
Assignees
Labels

Comments

@nirvdrum
Copy link

nirvdrum commented Dec 3, 2024

What version of protobuf and what language are you using?
Version: 4.29.0
Language: Ruby

What operating system (Linux, Windows, ...) and version?

  • macOS 15.1.1
  • Ubuntu 24.04

What runtime / compiler are you using (e.g., python version or gcc version)

ruby -v
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]

protoc --version
libprotoc 29.0

What did you do?
Steps to reproduce the behavior:

  1. Create a message with a repeated recursive field
  2. In code, create a long-held reference of this type (e.g., in an instance variable)
  3. Create new instances that link to the long-lived object
  4. Observe memory grows unbounded

I've pulled together a small reproduction. To see the issue:

  1. git clone https://github.com/nirvdrum/grpc-protobuf-experiments
  2. cd grpc-protobuf-experiments
  3. bundle install
  4. bundle exec rake
  5. bundle exec ruby leak-simple.rb

At the heart of the problem is a simple message definition:

syntax = "proto3";

package proto.leak;

message Recursive {
    repeated Recursive data = 1;
}

The repository includes a Gemfile to lock down the versions I used to reproduce the problem and a Rakefile to ease the protoc instantiation. You must be using an older protoc (e.g., brew install protobuf@21) because there appears to be a Ruby code generation bug in newer releases. This older version of protoc has no bearing on the runtime, where we do use a modern version of the google-protobuf gem. Ubuntu uses an older release of protoc (3.12.21), not a newer -- apologies for the confusion.

The leak-simple.rb script in the linked repo creates local instances of Proto::Leak::Recursive in a loop with each object linking back to a global instance of Proto::Leak::Recursive via the repeated data field. After an inner loop finishes executing we trigger GC and then print out the RSS of the process along with Ruby's view of how much memory is allocated. Increasing the inner loop iterations will print out larger RSS values.

In simplified form, the loop looks like:

require_relative "gen/protobuf/simple_pb"
datum = Proto::Leak::Recursive.new

100_000.times { Proto::Leak::Recursive.new(data: [datum]) }

The repository version expands on that simple loop to print out the total memory growth (RSS) over the duration of the script execution. Running with VERBOSE=true will also print out periodic memory usage numbers. The repo also has leak-bigtable.rb and leak-bigtable-extracted.rb scripts. These can also be run with bundle exec ruby leak-bigtable.rb and bundle exec ruby leak-bigtable-extracted.rb. All three scripts are functionally similar, but vary on the source of the proto definitions.

The leak-bigtable.rb file uses the google-cloud-bigtable gem to create instances of Google::Cloud::Bigtable::V2::RowFilter::Chain. This script reproduces the memory leak we have that's impacting production workloads.

The leak-bigtable-extracted.rb file uses a subset of the proto definitions from the google-cloud-bigtable gem. The goal here was to use the source proto definitions but strip them down to the bare minimum.

The leak-simple.rb gets at the heart of the problem and reproduces without the large google-cloud-bigtable dependency.

What did you expect to see

I'd expect objects created in a loop that don't escape will not lead to a growth in memory consumption.

What did you see instead?

Memory continues to grow unbounded. However, ObjectSpace does not reflect this growth, suggesting that the memory is growing off-heap in the the google-protobuf extension. After debugging, however, we believe the problem has to do with the fuse operation for Google::Protobuf::Internal::Arena. It looks like there is an arena that is held for a long time and constantly appended to. However, that would be memory on the Ruby heap and ObjectSpace is telling us that the heap isn't growing. We believe that's an artifact of a second bug where the memsize hook (i.e., Arena_memsize or a related function) is not properly reporting the post-fuse size. In particular, dividing the size by the fused count looks like the wrong operation:

memsize /= fused_count;

The total memsize is divided by the total number of arenas fused into it, presumably because the other arenas would report their size, but those arenas are no longer alive and thus have no size to report.

@nirvdrum nirvdrum added the untriaged auto added to all issues by default when created. label Dec 3, 2024
@deannagarcia
Copy link
Member

Thanks for this report and the detailed reproduction instructions! We are looking into the issue now. Were you able to see if this issue existed before (i.e. in 28.0) or on main?

I see you said you couldn't use a newer version of protoc "because there appears to be a Ruby code generation bug in newer releases". Would you be able to open a new bug for that issue so we can look into it as well?

@deannagarcia deannagarcia added ruby and removed untriaged auto added to all issues by default when created. labels Dec 4, 2024
@deannagarcia deannagarcia self-assigned this Dec 4, 2024
@JasonLunn
Copy link
Contributor

The memory growth issue sounds like another expression of #10106 which leads me to believe this is working as intended. Applying the workaround suggested there to this issue would mean using deep_copy on datum before assigning to the short lived object e.g.:

require_relative "gen/protobuf/simple_pb"
datum = Proto::Leak::Recursive.new

100_000.times { Proto::Leak::Recursive.new(data: [Google::Protobuf.deep_copy(datum)]) }

CC: @ahayworth

@nirvdrum
Copy link
Author

nirvdrum commented Dec 4, 2024

Thanks. I hadn't come across that issue, so thank you for linking them.

Applying the workaround suggested there to this issue would mean using deep_copy on datum before assigning to the short lived object...

I tried to strip this down to the bare minimum to illustrate the issue. But, the code in question is mostly in google-cloud-bigtable, having to do with chaining filters. The Bigtable client can hold long-lived references. I can file a different issue for that project, but I think users of Bigtable are going to be limited in what they can do to workaround this. Calling deep_copy on the root of the object graph on each query may be prohibitively expensive.

@nirvdrum
Copy link
Author

nirvdrum commented Dec 4, 2024

Is the issue with the misreported memory size a known problem? If not, I'll file a separate issue / open a PR to track that. I think that has a fairly straightforward solution. Having that in place would help application authors narrow down where all their memory is going.

@JasonLunn
Copy link
Contributor

Is the issue with the misreported memory size a known problem? If not, I'll file a separate issue / open a PR to track that. I think that has a fairly straightforward solution. Having that in place would help application authors narrow down where all their memory is going.

Please do open an issue for the memory reporting.

@nirvdrum
Copy link
Author

nirvdrum commented Dec 4, 2024

I see you said you couldn't use a newer version of protoc "because there appears to be a Ruby code generation bug in newer releases". Would you be able to open a new bug for that issue so we can look into it as well?

@deannagarcia I'm sorry, that was a note from an earlier draft I forgot to delete. It stems from Ubuntu 24.04 shipping protoc 3.21.12 and the code it generates not working with either 4.28.3 or 4.29.0 of the google-protobuf gem. I mixed things up earlier in my notes.

@deannagarcia
Copy link
Member

Thanks for the clarification! So did you download a newer version of protoc to use? I would expect you to see errors if working with a protoc version that doesn't match the runtime.

@nirvdrum
Copy link
Author

nirvdrum commented Dec 6, 2024

Yes. I ended up using Linuxbrew (Homebrew for Linux) to unify the setup on my MacBook and my Linux machine. But, I've also used the available static builds successfully. I checked both Ubuntu 24.10 and the 25.04 pre-release and they're both shipping protoc 3.21.12 for whatever reason.

@dazuma
Copy link
Contributor

dazuma commented Dec 12, 2024

@nirvdrum Regarding the google-cloud-bigtable client specifically (which is my area), I think we can make some simple changes to remove the long-lived objects that are holding protobuf data in a long-lived arena. There are, for example, some common constants (i.e. that last the lifetime of the Ruby process) that we can either clone or create on the fly. I've opened googleapis/google-cloud-ruby#28100 to track this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants