
Memory leak when spawning isolates that use HTTP #59937

Open
davidmartos96 opened this issue Jan 18, 2025 · 5 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends.

Comments

@davidmartos96
Contributor

Dart version: 3.6.0 (stable) (Thu Dec 5 07:46:24 2024 -0800) on "linux_x64"

On our production shelf server, deployed on Cloud Run, we've discovered a memory leak. The leak happens when we use Isolate.run and, inside the isolate callback, fetch data from an external endpoint using the http package.

Here is a reproduction repository: https://github.com/davidmartos96/dart-isolate-http-leak

RSS memory grows rapidly with the cli.dart and stress_server.dart scripts provided. The scripts also offer a way to disable the isolate to see the difference.

Cloud Run visualization:

[image]

@davidmartos96 davidmartos96 added the area-vm label Jan 18, 2025
@mraleph
Member

mraleph commented Jan 18, 2025

Are you sure it is a memory leak and not just higher peak memory usage? Each time you invoke Isolate.run you create a fresh isolate - and a fresh isolate means fresh static state, etc. If you run many of these concurrently, it takes time for the GC to reclaim all that memory. It's only a leak if RSS does not eventually drop back.
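
To make the fresh-static-state point concrete, here is a minimal illustrative sketch (my own example, not from the issue): top-level state mutated inside Isolate.run is a per-isolate copy, and the parent isolate never observes the change.

```dart
import 'dart:isolate';

// Top-level ("static") state: every isolate gets its own lazily
// initialized copy, so nothing here is shared between isolates.
int counter = 0;

Future<(int, int)> demo() async {
  counter = 41; // mutate the parent isolate's copy
  // The child isolate starts from a fresh copy of `counter` (0),
  // so `++counter` there yields 1 and is invisible to the parent.
  final inChild = await Isolate.run(() => ++counter);
  return (inChild, counter);
}

void main() async {
  final (inChild, parent) = await demo();
  print('child saw: $inChild, parent still has: $parent'); // child saw: 1, parent still has: 41
}
```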

@davidmartos96
Contributor Author

davidmartos96 commented Jan 18, 2025

@mraleph Thank you for the response.

We've been trying to debug a leak on our Dart server for the last few weeks, where memory grows slowly but steadily. With a 1 GB RAM configuration, it never stopped growing until it reached the maximum and Cloud Run started a new instance because of OOM.

We will still need to wait a couple of days to see whether a recent deploy without Isolate.run fully stops the leak; otherwise it could be something else. The server handles around 1.5 requests/second, so it's not a busy server where we would expect 1 GB of memory usage.

One thing we've been wondering while debugging these past weeks is the implication of setting the VM flag --old_gen_heap_size=500 on the 1 GB Cloud Run machine. Assuming a non-leaking, long-running program like a server, would that make the garbage collector try to stay below that limit, possibly doing more aggressive collections if needed?

@davidmartos96
Contributor Author

The endpoint on our server where we use Isolate.run is only called occasionally; the reproduction stress script is not representative of how our server processes these requests, but it shows the memory growth without garbage collection. On my local machine it kept growing past 3 GB, even when constraining the old-gen heap size flag.

@davidmartos96
Contributor Author

As a follow-up: since we removed the Isolate.run from that particular endpoint route, memory usage has stabilized at around 20% of the 1 GB machine, occasionally even dropping to 10%. This API route is called roughly once every 10 seconds, which is why we suspect some kind of memory leak.
Here is a comparison visualization with the Cloud Run charts. This is regular usage on the production server, not the stress tools from the reproduction repository.

[image]

@mraleph
Member

mraleph commented Jan 22, 2025

I can confirm that there is indeed a leak inside our HTTPS stack. The repro is like this:

```dart
import 'dart:io';
import 'dart:isolate';

import 'package:http/http.dart' as http;

Future<void> fetchResponse() async {
  return await Isolate.run(() async {
    try {
      final apiURL = "https://cache-test.skilldevs.com/sample-json.json";
      final externalApiRes = await http.get(Uri.parse(apiURL));
      if (externalApiRes.statusCode != 200) {
        throw Exception("Invalid status code ${externalApiRes.statusCode}");
      }
    } catch (e) {
      throw Exception("Failed to fetch data: $e");
    }
  });
}

void main() async {
  while (true) {
    await fetchResponse();
    final rssMb = ((ProcessInfo.currentRss) / (1024 * 1024)).toStringAsFixed(2);
    print('RSS: ${rssMb} Mb');
    await Future.delayed(const Duration(milliseconds: 100));
  }
}
```

The native side of the SSL filter implementation holds a number of things strongly via persistent handles, including _RawSecureSocket itself via handshake_complete_ - which is questionable design to begin with, because it means _RawSecureSocket is not collectable by the GC unless this state is cleared.

This state is supposed to be cleared via SSLFilter::Destroy, which is called from SecureSocket_Destroy¹, which is normally invoked when the socket is closed.

However, if the filter is active (e.g. we have sent a request to the filter via IOService and are waiting for the response - this happens in _pushAllFilterStages), then we delay destroying the SSLFilter state until after the response is received.

That's where Isolate.run causes the issue: we exit the temporary isolate while _pushAllFilterStages is waiting for the response, which means the response never actually arrives - and the SSLFilter, all associated state, and the socket object itself simply leak.

There are at least two problems with this code: the first is the ownership cycle between SSLFilter and _RawSecureSocket, and the second is the lifetime management of SSLFilter, which delays disposal until the IOService response is delivered (it should probably use some reference counting instead).
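
Given the analysis above, one workaround sketch on the application side (my own suggestion, not a verified fix) is to keep the HTTPS request in the long-lived main isolate and hand only the response body to the short-lived isolate for CPU-bound work, so no secure socket is ever opened - or abandoned - inside the temporary isolate:

```dart
import 'dart:convert';
import 'dart:isolate';

import 'package:http/http.dart' as http;

// Only CPU-bound parsing runs in the fresh isolate; the plain String
// body is copied across the isolate boundary.
Future<Object?> parseInIsolate(String body) =>
    Isolate.run(() => jsonDecode(body));

// Workaround sketch (assumed, not a verified fix): the HTTPS request
// stays in the main isolate, so its secure socket is managed by a
// long-lived isolate and is not leaked on isolate exit.
Future<Object?> fetchAndParse(Uri url) async {
  final res = await http.get(url);
  if (res.statusCode != 200) {
    throw Exception('Invalid status code ${res.statusCode}');
  }
  return parseInIsolate(res.body);
}
```

This trades some copying overhead (the body string is sent to the child isolate) for never touching the SSL filter from a temporary isolate.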

cc @brianquinlan @mkustermann

Footnotes

  1. There is a comment in SecureSocket_Destroy indicating that SSLFilter::Destroy can also be called via a finalizer attached to the socket object. But as described above, this cannot happen, because SSLFilter holds _RawSecureSocket alive via a persistent handle.
