Tesla P4 - CUDA error cudaErrorIllegalAddress #278

Closed
altendky opened this issue Feb 12, 2023 · 10 comments

Comments

@altendky
Contributor

I have previously run bladebit CUDA with my Tesla P4, but after noticing a few other people reporting issues with the card I tried again and was able to consistently reproduce the crash. For this first failure I was using the Ubuntu binary from https://github.com/Chia-Network/bladebit/actions/runs/4129720923/jobs/7135639600#step:3:5.

https://gist.github.com/altendky/3ad52845cbb71c106dbe276f3d95bba1

Completed table 1 in 29.27 seconds with 3429027681 / 4294803672 entries ( 79.84% ).
Compressing tables 2 and 3...
 Step 1 completed step in 4.59 seconds.
CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

*** Panic!!! *** Fatal Error:  
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
./bladebit_cuda(+0xcf8cb)[0x564cf43288cb]
./bladebit_cuda(+0xcf0af)[0x564cf43280af]
./bladebit_cuda(+0x5217a)[0x564cf42ab17a]
./bladebit_cuda(+0x52443)[0x564cf42ab443]
./bladebit_cuda(+0x36e6d)[0x564cf428fe6d]
./bladebit_cuda(+0x2e7f0)[0x564cf42877f0]
./bladebit_cuda(+0x1c98b)[0x564cf427598b]
./bladebit_cuda(+0x18245)[0x564cf4271245]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f30b9f79083]
./bladebit_cuda(+0x1974e)[0x564cf427274e]
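
For anyone trying to read the trace above: each `./bladebit_cuda(+0xOFFSET)[0xADDRESS]` frame pairs an offset inside the binary with the absolute runtime address, so subtracting one from the other gives the address the binary was loaded at. Below is a minimal C++ sketch of that arithmetic. `Frame`, `ParseFrame` and `LoadBase` are hypothetical helper names made up for illustration; they are not part of bladebit.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// One backtrace frame of the form "./binary(+0xOFFSET)[0xADDRESS]".
struct Frame {
    std::uint64_t offset;   // offset inside the binary, taken from "(+0x...)"
    std::uint64_t address;  // absolute runtime address, taken from "[0x...]"
};

// Hypothetical helper: returns std::nullopt when the line has no
// "(+0x...)[0x...]" part, e.g. the libc "(__libc_start_main+0xf3)" frame.
std::optional<Frame> ParseFrame(const std::string& line) {
    const std::size_t open = line.find("(+0x");
    if (open == std::string::npos) return std::nullopt;
    const std::size_t mid = line.find(")[0x", open);
    if (mid == std::string::npos) return std::nullopt;
    const std::size_t end = line.find(']', mid);
    if (end == std::string::npos) return std::nullopt;
    try {
        Frame frame{};
        // Both markers are 4 characters long, so the hex digits start at +4.
        frame.offset = std::stoull(line.substr(open + 4, mid - (open + 4)), nullptr, 16);
        frame.address = std::stoull(line.substr(mid + 4, end - (mid + 4)), nullptr, 16);
        return frame;
    } catch (const std::exception&) {
        return std::nullopt;  // digits missing or out of range
    }
}

// Load base of the binary: absolute address minus the in-binary offset.
std::uint64_t LoadBase(const Frame& frame) {
    return frame.address - frame.offset;
}
```

For the first frame, 0x564cf43288cb - 0xcf8cb = 0x564cf4259000, and the other `bladebit_cuda` frames in this trace give the same base. The offsets (not the absolute addresses) are what one would typically pass to a symbolizer, for example `addr2line -e ./bladebit_cuda 0xcf8cb`, assuming a matching binary with debug info is available.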

After Harold requested debug info, I made #271 to get debug builds. The following results are from https://github.com/Chia-Network/bladebit/actions/runs/4149269955

https://gist.github.com/altendky/25ef339f5cfd28345dd641bdd9a1e4bb

Completed table 1 in 505.43 seconds with 3429368445 / 4294952657 entries ( 79.85% ).
Compressing tables 2 and 3...
 Step 1 completed step in 40.28 seconds.
Assertion Failed @ /home/runner/work/bladebit/bladebit/cuda/GpuStreams.cpp:571 UploadArray().
fish: “./bladebit_cuda -f b0a374845f4f…” terminated by signal SIGTRAP (Trace or breakpoint trap)

ASSERT( self->outgoingSequence - self->lockSequence < 2 );

void GpuUploadBuffer::UploadArray( const void* hostBuffer, uint32 length, uint32 elementSize, uint32 srcStride, 
                                   uint32 countStride, const uint32* counts, cudaStream_t workStream )
{
    ASSERT( hostBuffer );
    ASSERT( self->outgoingSequence - self->lockSequence < 2 );
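
As a side note on what the failing assertion checks numerically: assuming `uint32` is an unsigned 32-bit type (the snippet does not show its definition), `outgoingSequence - lockSequence < 2` only holds when the two counters are equal or `outgoingSequence` is exactly one ahead. If it falls behind, the subtraction wraps around to a huge value and the check fails. A minimal sketch of just that arithmetic follows. `WithinWindow` is a hypothetical name, and this says nothing about what the two counters mean inside bladebit.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the arithmetic of the quoted ASSERT.
// Assumption: the counters are unsigned 32-bit, so subtraction wraps mod 2^32.
constexpr bool WithinWindow(std::uint32_t outgoingSequence, std::uint32_t lockSequence) {
    // True only if outgoingSequence == lockSequence or lockSequence + 1.
    return static_cast<std::uint32_t>(outgoingSequence - lockSequence) < 2u;
}

// Equal counters and "one ahead" pass.
static_assert(WithinWindow(5, 5), "equal counters pass");
static_assert(WithinWindow(6, 5), "one ahead passes");
// Two ahead fails.
static_assert(!WithinWindow(7, 5), "two ahead fails");
// One behind wraps to 0xFFFFFFFF and fails as well.
static_assert(!WithinWindow(4, 5), "one behind fails");
// The check keeps working when the counters wrap past 2^32.
static_assert(WithinWindow(0, 0xFFFFFFFFu), "wrap-around still counts as one ahead");
```

So a failure here means the two counters drifted apart by more than one step in either direction, which is consistent with some upload bookkeeping getting out of sync, though the snippet alone does not say which side is at fault.
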
@CharlieTemplar

CharlieTemplar commented Feb 16, 2023

Just hit the same with a new P4. It crashed and the process hung; I couldn't stop or kill the process and had to reboot.

Compressing tables 2 and 3...
 Step 1 completed step in 4.46 seconds.
CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

*** Panic!!! *** Fatal Error:
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
./bladebit_cuda(+0xcf8cb)[0x5571133c18cb]
./bladebit_cuda(+0xcf0af)[0x5571133c10af]
./bladebit_cuda(+0x5217a)[0x55711334417a]
./bladebit_cuda(+0x52443)[0x557113344443]
./bladebit_cuda(+0x36e6d)[0x557113328e6d]
./bladebit_cuda(+0x2e7f0)[0x5571133207f0]
./bladebit_cuda(+0x1c98b)[0x55711330e98b]
./bladebit_cuda(+0x18245)[0x55711330a245]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f539c4efd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f539c4efe40]
./bladebit_cuda(+0x1974e)[0x55711330b74e]
kernel: [29475.902539] process '/root/bladebit_cuda' started with executable stack
kernel: [29691.711200] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
kernel: [29691.711207] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: [29691.711209] {1}[Hardware Error]: event severity: corrected
kernel: [29691.711212] {1}[Hardware Error]:  Error 0, type: corrected
kernel: [29691.711215] {1}[Hardware Error]:  fru_text: B3
kernel: [29691.711217] {1}[Hardware Error]:   section_type: memory error
kernel: [29691.711219] {1}[Hardware Error]:   error_status: 0x0000000000000400
kernel: [29691.711222] {1}[Hardware Error]:   physical_address: 0x0000006f60527e00
kernel: [29691.711228] {1}[Hardware Error]:   node: 1 card: 2 module: 0 rank: 0 bank: 1 row: 32260 column: 1016
kernel: [29691.711230] {1}[Hardware Error]:   error_type: 2, single-bit ECC
kernel: [29691.711256] mce: [Hardware Error]: Machine check events logged
kernel: [29955.269849] NVRM: GPU at PCI:0000:03:00: GPU-c2351e08-89f3-2cf2-9804-d1206ce6f89b
kernel: [29955.269861] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
kernel: [29955.269915] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504e48=0x14000e 0x504e50=0x0 0x504e44=0xd3eff2 0x504e4c=0x17f
kernel: [29955.270015] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
kernel: [29955.270066] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x505e48=0x16000e 0x505e50=0x20 0x505e44=0xd3eff2 0x505e4c=0x17f
kernel: [29955.270156] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 4): Out Of Range Address
kernel: [29955.270205] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x506648=0x7000e 0x506650=0x20 0x506644=0xd3eff2 0x50664c=0x17f
kernel: [29955.270300] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
kernel: [29955.270352] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50c648=0x15000e 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
kernel: [29955.270451] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50ce48=0x0 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
kernel: [29955.270544] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
kernel: [29955.270597] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50d648=0x19000e 0x50d650=0x20 0x50d644=0xd3eff2 0x50d64c=0x17f
kernel: [29955.270684] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 3): Out Of Range Address
kernel: [29955.270735] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50de48=0x2b000e 0x50de50=0x20 0x50de44=0xd3eff2 0x50de4c=0x17f
kernel: [29955.270827] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 4): Out Of Range Address
kernel: [29955.270877] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50e648=0x19000e 0x50e650=0x20 0x50e644=0xd3eff2 0x50e64c=0x17f
kernel: [29955.270980] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x514648=0x0 0x514650=0x20 0x514644=0xd3eff2 0x51464c=0x17f
kernel: [29955.271071] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x514e48=0x0 0x514e50=0x20 0x514e44=0xd3eff2 0x514e4c=0x17f
kernel: [29955.271164] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x515648=0x0 0x515650=0x20 0x515644=0xd3eff2 0x51564c=0x17f
kernel: [29955.271245] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 2, TPC 3): Out Of Range Address
kernel: [29955.271297] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x515e48=0x8000e 0x515e50=0x20 0x515e44=0xd3eff2 0x515e4c=0x17f
kernel: [29955.271388] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x516648=0x0 0x516650=0x20 0x516644=0xd3eff2 0x51664c=0x17f
kernel: [29955.271476] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 0): Out Of Range Address
kernel: [29955.271528] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51c648=0x33000e 0x51c650=0x20 0x51c644=0xd3eff2 0x51c64c=0x17f
kernel: [29955.271610] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 1): Out Of Range Address
kernel: [29955.271664] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51ce48=0x5000e 0x51ce50=0x20 0x51ce44=0xd3eff2 0x51ce4c=0x17f
kernel: [29955.271745] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 2): Out Of Range Address
kernel: [29955.271821] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51d648=0x8000e 0x51d650=0x20 0x51d644=0xd3eff2 0x51d64c=0x17f
kernel: [29955.271910] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 3): Out Of Range Address
kernel: [29955.271965] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51de48=0x11000e 0x51de50=0x20 0x51de44=0xd3eff2 0x51de4c=0x17f
kernel: [29955.272049] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 4): Out Of Range Address
kernel: [29955.272105] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51e648=0x2e000e 0x51e650=0x20 0x51e644=0xd3eff2 0x51e64c=0x17f
kernel: [29955.274816] NVRM: Xid (PCI:0000:03:00): 43, pid=18224, name=bladebit_cuda, Ch 00000008
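
When comparing kernel logs like the one above between machines, it can help to pull out just the Xid numbers and count them. The following is a minimal C++ sketch under the assumption that the driver lines look exactly like the ones pasted here (`NVRM: Xid (<PCI id>): <number>, ...`). `ParseXid` and `CountXids` are hypothetical helper names, not part of any NVIDIA tool.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: extracts the Xid number from one kernel log line,
// or std::nullopt if the line is not an "NVRM: Xid (...): N, ..." line.
std::optional<int> ParseXid(const std::string& line) {
    const std::size_t tag = line.find("NVRM: Xid (");
    if (tag == std::string::npos) return std::nullopt;
    const std::size_t close = line.find("): ", tag);
    if (close == std::string::npos) return std::nullopt;
    const std::size_t start = close + 3;  // first character after "): "
    std::size_t end = start;
    while (end < line.size() && std::isdigit(static_cast<unsigned char>(line[end]))) {
        ++end;
    }
    // Require 1 to 9 digits so std::stoi cannot overflow.
    if (end == start || end - start > 9) return std::nullopt;
    return std::stoi(line.substr(start, end - start));
}

// Counts how often each Xid number appears in a set of log lines.
std::map<int, std::size_t> CountXids(const std::vector<std::string>& lines) {
    std::map<int, std::size_t> counts;
    for (const std::string& line : lines) {
        if (const std::optional<int> xid = ParseXid(line)) {
            ++counts[*xid];
        }
    }
    return counts;
}
```

In this log the result would be a run of Xid 13 entries (the "Out Of Range Address" warp exceptions) followed by a single Xid 43 attributed to `bladebit_cuda`, which lines up with the illegal memory access the plotter reported.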

@vb216

vb216 commented Feb 22, 2023

I don't know if this adds anything, but I have been trying to debug a bit more too. Here's a gist with output from compute-sanitizer during the crash (ignore the 'About to uploadArray' lines, that was just me trying to see where it might be having issues).

https://gist.github.com/vb216/852194bc9e7307e46cc01f1880134f67

There is a lot of repetition of CudaConvertRMapToLinePoints hitting out-of-bounds addresses at different locations.

That's with ECC turned off on the GPU in case it made any difference (it doesn't seem to have).

I'm now running the debug build to see if anything more useful comes out of that.

@harold-b
Contributor

The UploadArray issue has been fixed (but not published yet). It is still to be determined whether it is related to the illegal memory access issue.

@vb216

vb216 commented Feb 27, 2023

Great news, keen to test.

@CharlieTemplar

> The UploadArray issue has been fixed (but not published yet). It is still to be determined whether it is related to the illegal memory access issue.

Do you have an estimate when this might be released or at least available to build/test?
Thanks

@happycouak

Same issue here, also with a P4.

@harold-b
Contributor

harold-b commented Mar 1, 2023

> Do you have an estimate when this might be released or at least available to build/test?

This should likely be landing before the weekend

@fisherwei

> > Do you have an estimate when this might be released or at least available to build/test?
>
> This should likely be landing before the weekend

It seems fixed?

Bladebit Chia Plotter
Version      : 3.0.0-alpha1-dev
Git Commit   : b40fce737fe4d72b7882e9b0cd03f1bf8230a90a
Compiled With: gcc 11.3.0

[Global Plotting Config]
 Will create 1 plots.
 Thread count          : 20
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : xxx
 Pool contract address : xxx
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : Tesla P4
 CUDA Compute Capability   : 6.1
 SM count                  : 20
 Max blocks per SM         : 32
 Max threads per SM        : 2048
 Async Engine Count        : 2
 L2 cache size             : 2.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.43 GB
  Free                     : 7.32 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288  bytes ( 86060.07  MiB or 84.04  GiB )
Intermediate RAM required : 2999001088   bytes ( 2860.07   MiB or 2.79   GiB )
Host RAM required         : 168443248640 bytes ( 160640.00 MiB or 156.88 GiB )
Total Host RAM required   : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )
GPU RAM required          : 6139441152   bytes ( 5855.03   MiB or 5.72   GiB )
Allocating buffers

Generating plot 1 / 1: xxx
Plot temporary file: /chia/temp/plot-k32-2023-03-05-13-47-xxx.plot.tmp

Generating F1
Finished F1 in 4.83 seconds.
Table 2 completed in 21.48 seconds with 4294800559 entries.
Table 3 completed in 30.51 seconds with 4294518162 entries.
Table 4 completed in 34.54 seconds with 4293967298 entries.
Table 5 completed in 34.38 seconds with 4293039917 entries.
Table 6 completed in 33.51 seconds with 4291105176 entries.
Table 7 completed in 31.85 seconds with 4287199487 entries.
Finalizing Table 7
Finalized Table 7 in 14.61 seconds.
Completed Phase 1 in 205.72 seconds
Marked Table 6 in 18.12 seconds.
Marked Table 5 in 15.48 seconds.
Marked Table 4 in 14.73 seconds.
Marked Table 3 in 14.46 seconds.
Marked Table 2 in 14.35 seconds.
Completed Phase 2 in 77.14 seconds
Compressing Table 1 and 2...
 Step 1 completed step in 4.61 seconds.
 Step 2 completed step in 24.21 seconds.
Completed table 1 in 28.82 seconds with 3428999797 / 4294800559 entries ( 79.84% ).
Compressing tables 2 and 3...
 Step 1 completed step in 4.63 seconds.
 Step 2 completed step in 15.01 seconds.
 Step 3 completed step in 29.83 seconds.
Completed table 2 in 49.48 seconds with 3439074383 / 4294518162 entries ( 80.08% ).
Compressing tables 3 and 4...
 Step 1 completed step in 9.50 seconds.
 Step 2 completed step in 21.46 seconds.
 Step 3 completed step in 29.52 seconds.
Completed table 3 in 60.48 seconds with 3464607606 / 4293967298 entries ( 80.69% ).
Compressing tables 4 and 5...
 Step 1 completed step in 4.60 seconds.
 Step 2 completed step in 20.93 seconds.
 Step 3 completed step in 29.83 seconds.
Completed table 4 in 55.35 seconds with 3530416064 / 4293039917 entries ( 82.24% ).
Compressing tables 5 and 6...
 Step 1 completed step in 8.15 seconds.
 Step 2 completed step in 18.78 seconds.
 Step 3 completed step in 37.09 seconds.
Completed table 5 in 64.02 seconds with 3709309680 / 4291105176 entries ( 86.44% ).
Compressing tables 6 and 7...
 Step 1 completed step in 5.47 seconds.
 Step 2 completed step in 24.07 seconds.
 Step 3 completed step in 42.06 seconds.
Completed table 6 in 71.61 seconds with 4287199487 / 4287199487 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 19.75 seconds.
Completed Phase 3 in 349.51 seconds
Completed Plot 1 in 632.37 seconds ( 10.54 minutes )

/chia/temp/plot-k32-2023-03-05-13-47-xxx.plot.tmp -> /chia/temp/plot-k32-2023-03-05-13-47-xxx.plot
Completed writing plot in 0.07 seconds
Final plot table pointers: 
 Table 1:       1287879156 ( 0x000000004cc379f4 )
 Table 2:      16125676410 ( 0x00000003c12a4b7a )
 Table 3:      30105316110 ( 0x00000007026aab0e )
 Table 4:      44188743585 ( 0x0000000a49dab7a1 )
 Table 5:      58539678285 ( 0x0000000da13c9a4d )
 Table 6:      73617810060 ( 0x0000001123f6a28c )
 Table 7:      91045032060 ( 0x0000001532b4f07c )
 C 1    :             4096 ( 0x0000000000001000 )
 C 2    :          1718980 ( 0x00000000001a3ac4 )
 C 3    :          1719156 ( 0x00000000001a3b74 )

Final plot table sizes: 
 Table 1: 14150.43 MiB
 Table 2: 13332.02 MiB
 Table 3: 13431.00 MiB
 Table 4: 13686.12 MiB
 Table 5: 14379.63 MiB
 Table 6: 16619.89 MiB
 Table 7: 16865.45 MiB
 C 1    : 1.64 MiB
 C 2    : 0.00 MiB
 C 3    : 1226.58 MiB

@harold-b
Contributor

harold-b commented Mar 5, 2023

Closing as it appears the fix was successful, as confirmed by 2 independent users.
Commit b40fce7

@harold-b harold-b closed this as completed Mar 5, 2023
@h00k66

h00k66 commented Apr 27, 2023

What is your Tesla P4 clock speed when plotting/farming?
