JVM crash on GC with current maven snapshot #18

Open
ochafik opened this issue Mar 18, 2015 · 14 comments

ochafik commented Mar 18, 2015

From @twitwi on July 1, 2013 22:16

Hi @ochafik

I have a (now) very simple program that crashes with the latest maven snapshots but works with javacl-1.0.0-RC3.jar.

public static void main(String[] args) {
    CLContext context = JavaCL.createBestContext();
    CLDevice[] devices = context.getDevices();
    for (int i = 0; i < devices.length; i++) {
        System.err.println(i+": "+devices[i]);
    }
    System.err.println("Now GC'ing");
    System.gc(); // crash here
    System.err.println("GC'ed");
}

I run using optirun under Linux Mint 15 (64-bit), with an "NVS 5400M (NVIDIA CUDA)" device. The rest of the program works fine (complicated OpenCL kernel work), but the GC crashes the VM with:

0: NVS 5400M (NVIDIA CUDA)
Now GC'ing
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8e46073bcc, pid=15591, tid=140248881133312
#
# JRE version: 7.0_21-b02
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x75cbcc]  PSRootsClosure<false>::do_oop(oopDesc**)+0xc

It might be a temporary issue (or specific to my device) but I prefer to report it.

Rémi

Copied from original issue: nativelibs4java/nativelibs4java#420

ochafik commented Mar 18, 2015

From @twitwi on July 2, 2013 7:16

I was able to try on another machine (64-bit Linux, Xubuntu 12.10), and it does not crash on GC there.
So the problem seems to depend on the environment (driver? optirun? …)

0: GeForce GTX 560 Ti (NVIDIA CUDA)
Now GC'ing
GC'ed

The GPU used is a secondary card (not used for any display).

ochafik commented Mar 18, 2015

Hi @twitwi ,

Thank you so much for investigating and providing such a narrowed down test case!
By any chance, have you tried installing AMD Stream? (CPU-only OpenCL implementation)

Also, have you made sure the exact same version of Java is being used on both setups? (and have you tried turning compressed oops on/off, just in case?)

Could you also try calling CLAbstractEntity.release() on each context prior to GC'ing?
And could you put your test in a loop (as done in BridJ's MemoryTest) to give the Xubuntu machine more chances to fail as well?
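
To make the loop idea concrete, here is a minimal sketch of such a harness. The class name GcLoop and the placeholder body are made up for illustration (this is not BridJ's MemoryTest); the body would be replaced by the JavaCL calls from the test case, optionally followed by release().

```java
import java.util.function.IntConsumer;

// Hypothetical harness (name and structure are assumptions, not part of JavaCL or BridJ):
// repeats a test body and requests a GC after every pass.
class GcLoop {
    /** Runs body `iterations` times, requesting a GC after each pass; returns the passes completed. */
    static int run(int iterations, IntConsumer body) {
        int completed = 0;
        for (int i = 0; i < iterations; i++) {
            body.accept(i);
            // Only a request: the JVM may ignore it, and finalizers run
            // asynchronously on their own thread afterwards.
            System.gc();
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        int done = run(100, i -> {
            // Placeholder work: swap in the JavaCL calls from the test case here.
            byte[] garbage = new byte[1 << 16];
            garbage[0] = (byte) i;
        });
        System.err.println("Completed " + done + " iterations without crashing");
    }
}
```

If the crash is timing-dependent, more iterations (or a short sleep after each GC request) may be needed before it shows up.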

Finally, a fuller native stack trace might be useful, so please don't be afraid of spamming this issue with a larger log :-)

Cheers

ochafik commented Mar 18, 2015

From @twitwi on July 3, 2013 23:20

I tried (on the failing machine) with two versions of Java (6 and 7), varying UseCompressedOops and bridj.protected.
I also tried with AMD APP, since I already had it installed.
The test program is quite simple:

public static void main(String[] args) {
    CLContext context = JavaCL.createBestContext();
    CLDevice[] devices = context.getDevices();
    for (int i = 0; i < devices.length; i++) {
        System.err.println(i+": "+devices[i]);
    }
    System.err.println("Releasing context");
    context.release();
    System.err.println("Now GC'ing");
    System.gc();
    System.err.println("GC'ed");
}

Results incoming…

ochafik commented Mar 18, 2015

From @twitwi on July 3, 2013 23:36

With optirun, the only variable that has any impact is the javacl-core version: RC3 works, SNAPSHOT does not.

With amdapp (CPU, no ATI card), bridj.protected=true causes the CL platform not to be found; otherwise RC3 works and SNAPSHOT fails with the same error.

The java version, UseCompressedOops and releasing the context seem to have no impact.

Details:
http://dl.heeere.com/withoptirun.zip
http://dl.heeere.com/withamdapp.zip

Script that produced it:

pre= #optirun
for dep in dependency dependency-RC3 ; do
    for java in java /usr/lib/jvm/java-6-openjdk-amd64/bin/java ; do
        echo "JAVA: $java"
        echo
        $java -version
        echo
        for opt in {,-Dbridj.protected=true}" "{,-XX:+UseCompressedOops,-XX:-UseCompressedOops} ; do
            echo "RUNNING: $pre $java $opt -cp target/DPGMMJavaCL-1.0-SNAPSHOT.jar:target/$dep/* com.heeere.dpgmm.javacl.TestGC"
            $pre $java $opt -cp target/DPGMMJavaCL-1.0-SNAPSHOT.jar:target/$dep/* com.heeere.dpgmm.javacl.TestGC
            echo;echo;echo
        done
    done
done
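
For readers less familiar with bash brace expansion: the inner loop above iterates over 2 x 3 = 6 option strings per JVM. A minimal sketch that enumerates the same matrix (the class name OptionMatrix is made up for illustration; the option strings are the ones from the script):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the JVM option combinations that the
// brace expansion in the script above produces.
class OptionMatrix {
    static List<String> combinations() {
        String[] bridj = {"", "-Dbridj.protected=true"};
        String[] oops = {"", "-XX:+UseCompressedOops", "-XX:-UseCompressedOops"};
        List<String> result = new ArrayList<>();
        for (String b : bridj) {
            for (String o : oops) {
                // The script joins both parts with a single space; trim the
                // result so that empty parts leave no stray whitespace.
                result.add((b + " " + o).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String opts : combinations()) {
            System.out.println("[" + opts + "]");
        }
    }
}
```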

Maybe I should bisect it if it is not reproducible elsewhere.

ochafik commented Mar 18, 2015

From @twitwi on July 4, 2013 11:0

I just read part of the JavaCL code and I have a note to add. My AMDAPP is not installed in /opt/AMDAPP/lib (custom install)… in case it matters

ochafik commented Mar 18, 2015

Hi @twitwi ,

Thanks for taking the time to investigate, much appreciated!
Bisecting might help but might be non-trivial, since the issue might come from BridJ as well (you'd have to recompile both libraries/BridJ and libraries/OpenCL at each step).
Could you please add one last check with BRIDJ_DIRECT=0 if you have time? (Direct mode is also disabled with BRIDJ_PROTECTED=1, but that mysteriously made the platform disappear, which is an issue in its own right, maybe even a related one.)
As for the lib path, I doubt it could cause the issue, although it might be good to see which library BridJ picks, which should be somewhere in the verbose or debug logs.
(if it's not the right lib, providing the full path with -Dbridj.OpenCL.library=/some/path/amdocl64.so could help)
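
To keep track of which of these settings are actually in effect for a given run, something like the following could be printed at the start of the test. This is only a sketch: the class name BridjSettings is made up, and it simply echoes the property and environment variable names mentioned in this thread without assuming anything about how BridJ interprets them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic: echoes the settings discussed in this thread.
// It does not call into BridJ; a null value means "not set".
class BridjSettings {
    static Map<String, String> collect() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("property bridj.protected", System.getProperty("bridj.protected"));
        settings.put("property bridj.OpenCL.library", System.getProperty("bridj.OpenCL.library"));
        settings.put("env BRIDJ_PROTECTED", System.getenv("BRIDJ_PROTECTED"));
        settings.put("env BRIDJ_DIRECT", System.getenv("BRIDJ_DIRECT"));
        return settings;
    }

    public static void main(String[] args) {
        // Printed to stderr so it lands next to the test program's own output.
        collect().forEach((name, value) ->
                System.err.println(name + " = " + (value == null ? "(not set)" : value)));
    }
}
```

Including this output in each log makes it easier to match a crash to the exact combination that produced it.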

I'm now trying to install Linux Mint :-)

Cheers

ochafik commented Mar 18, 2015

Just to check: are you using the Ubuntu-based Linux Mint (the default one), or Linux Mint Debian Edition?

ochafik commented Mar 18, 2015

From @twitwi on July 4, 2013 12:37

default one

ochafik commented Mar 18, 2015

From @twitwi on July 4, 2013 12:38

(Ubuntu-based, not Kubuntu either)

ochafik commented Mar 18, 2015

Hi @twitwi ,

Could you please try again with the latest 1.0-SNAPSHOT? There's a magic one-line fix that might help...

Cheers

ochafik commented Mar 18, 2015

From @twitwi on July 30, 2013 1:32

Hi and thanks for the patch,
I'm away for a few weeks. I'll try when I come back.

ochafik commented Mar 18, 2015

Hi Rémi,

Friendly ping :-)

Cheers

ochafik commented Mar 18, 2015

From @ochrons on March 6, 2014 12:54

I seem to be seeing the same error with RC3:

When garbage collector runs, there is an Access Violation error due to CLDevice cleanup (log attached with stack trace etc.)

Using JavaCL 1.0.0-RC3

Environment: Windows 8.1, two OpenCL platforms
Number of devices in platform NVIDIA CUDA: 1
Number of devices in platform Intel(R) OpenCL: 1
--- Info for device Quadro 2000M: ---
CL_DEVICE_NAME: Quadro 2000M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 331.65
--- Info for device Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz: ---
CL_DEVICE_NAME: Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz
CL_DEVICE_VENDOR: Intel(R) Corporation
CL_DRIVER_VERSION: 3.0.1.15216

To replicate (simple Scala app):

object Start extends App {
  override def main(args: Array[String]) = {
    val oclPlatforms: Array[CLPlatform] = JavaCL.listPlatforms()
    // list the NVIDIA devices
    oclPlatforms(0).listAllDevices(true)
    System.gc()
    Thread.sleep(1000)
  }
}

Same problem when calling JavaCL.getBestDevice()

Getting the CPU devices, on the other hand, works OK:

object Start extends App {
  override def main(args: Array[String]) = {
    val oclPlatforms: Array[CLPlatform] = JavaCL.listPlatforms()
    // list the CPU devices
    oclPlatforms(1).listAllDevices(true)
    System.gc()
    Thread.sleep(1000)
  }
}

Relevant part of the log dump:

Stack: [0x000000000af20000,0x000000000b020000],  sp=0x000000000b01e918,  free space=1018k
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.nativelibs4java.opencl.library.OpenCLLibrary.clReleaseDevice(J)I+0
j  com.nativelibs4java.opencl.CLDevice.clear()V+7
j  com.nativelibs4java.opencl.CLAbstractEntity.doRelease()V+10
j  com.nativelibs4java.opencl.CLAbstractEntity.finalize()V+1
v  ~StubRoutines::call_stub
j  java.lang.ref.Finalizer.invokeFinalizeMethod(Ljava/lang/Object;)V+0
j  java.lang.ref.Finalizer.runFinalizer()V+45
j  java.lang.ref.Finalizer.access$100(Ljava/lang/ref/Finalizer;)V+1
j  java.lang.ref.Finalizer$FinalizerThread.run()V+24
v  ~StubRoutines::call_stub
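
The trace shows the finalizer thread reaching clReleaseDevice through finalize() → doRelease() → clear(). For reference, here is a sketch of the general pattern in which an explicit release and a finalizer share one guarded path, so the native release happens at most once. All names are stand-ins that only mirror the method names in the trace; this is not JavaCL's actual implementation and not a diagnosis of the crash.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a natively backed entity. The counter replaces
// the real native call so the behaviour can be observed without OpenCL.
class GuardedEntity {
    private final AtomicBoolean released = new AtomicBoolean(false);
    private final AtomicInteger clearCalls = new AtomicInteger();

    /** Explicit release requested by user code. */
    void release() {
        doRelease();
    }

    /** Shared path: a finalizer or cleaner would call this too. */
    void doRelease() {
        // compareAndSet lets exactly one caller through, even if the
        // finalizer thread and user code race each other.
        if (released.compareAndSet(false, true)) {
            clear();
        }
    }

    /** Stand-in for the native release (clReleaseDevice in the trace above). */
    private void clear() {
        clearCalls.incrementAndGet();
    }

    int clearCalls() {
        return clearCalls.get();
    }

    public static void main(String[] args) {
        GuardedEntity entity = new GuardedEntity();
        entity.release();
        entity.doRelease(); // simulates the later finalizer call
        System.err.println("native releases: " + entity.clearCalls());
    }
}
```

With a guard like this, an explicit release() followed by a finalizer run would not release the native handle twice. Whether a double release is what happens in the crash above is not established by the trace.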

ochafik commented Mar 18, 2015

Need to try with 1.0.0-RC4
