Description:
After starting the Celeborn Worker, the following issues occur:
Worker not getting resources (Slots)
The logs show:
maxSlots: 0, activeSlots: 0
This indicates that the Worker is not being allocated any resources, preventing it from executing any tasks.
Incorrect disk space information
In the heartbeat message sent from the Worker to the Master, the disk space shows:
usableSpace: 8.0 EiB, totalSpace: 0.0 B
This suggests that the Worker is not correctly detecting the disk space, which may indicate a problem with the storage path or mount.
No running Shuffle tasks
The logs show:
committed shuffles: 0, running applications: 0
This indicates that no tasks are being submitted to the Worker, likely due to insufficient resources or failed shuffle allocation.
Memory management is normal, but no Shuffle operations are taking place
Memory usage is normal:
Direct memory usage: 4.0 MiB/1024.0 MiB
Even so, no shuffle operations are occurring due to the lack of available resources.
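As a side note on the usableSpace value above: 8.0 EiB is what a signed 64-bit maximum (2^63 - 1 bytes, Java's Long.MAX_VALUE) looks like when printed in binary units. The sketch below only demonstrates that arithmetic. The `format_bytes` helper is hypothetical and is not Celeborn's own formatter, and whether the Worker actually reports a placeholder value for HDFS storage is an assumption that the logs here do not confirm.

```python
# Hypothetical helper: formats a byte count in binary units, similar in
# spirit to the sizes printed in the Worker log (not Celeborn's code).
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE


def format_bytes(n: int) -> str:
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


print(format_bytes(LONG_MAX))  # same text as the usableSpace field
print(format_bytes(0))         # same text as the totalSpace field
```

If that reading is right, usableSpace and totalSpace would both be unmeasured defaults for the HDFS entry rather than a misdetected local disk, which would be worth confirming with the maintainers.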
Log Summary:
Here are the relevant log entries:
25/01/21 16:57:32,635 DEBUG [worker-disk-checker] LocalDeviceMonitor: Device check start
25/01/21 16:57:34,108 INFO [worker-memory-manager-reporter] MemoryManager: Direct memory usage: 4.0 MiB/1024.0 MiB, disk buffer size: 0.0 B, sort memory size: 0.0 B, read buffer size: 0.0 B, memory file storage size: 0.0 B
25/01/21 16:57:37,897 INFO [worker-forward-message-scheduler] StorageManager: Updated diskInfos:
25/01/21 16:57:37,898 DEBUG [worker-forward-message-scheduler] MasterClient: Send rpc message HeartbeatFromWorker(21.102.91.135,38565,35711,40333,42329, Stream(DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, activeSlots: 0, storageType: HDFS) status: HEALTHY dirs , ?),{},[],{}, false,WorkerStatus{state= Normal, stateStartTime=1737448232150},688489b2-4a4b-4840-bb9f-78bfd46031b8#54)
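For anyone triaging similar reports, here is a minimal sketch that pulls the fields discussed above out of the DiskInfo text in the heartbeat line. The regular expressions are written against this one log excerpt only, so the field layout is an assumption and may differ across Celeborn versions. The names `parse_disk_info` and `suspicious_fields` are hypothetical helpers, not part of any Celeborn tooling.

```python
import re

# DiskInfo portion copied from the HeartbeatFromWorker log line above.
HEARTBEAT = (
    "DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, "
    "shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, "
    "totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, "
    "activeSlots: 0, storageType: HDFS)"
)

# One pattern per field; layout assumed from this single log excerpt.
FIELDS = {
    "maxSlots": r"maxSlots: (\d+)",
    "activeSlots": r"activeSlots: (\d+)",
    "usableSpace": r"usableSpace: ([\d.]+ \w+)",
    "totalSpace": r"totalSpace: ([\d.]+ \w+)",
    "mountPoint": r"mountPoint: (\w+)",
    "storageType": r"storageType: (\w+)",
}


def parse_disk_info(text: str) -> dict:
    """Return the fields that could be found in the DiskInfo text."""
    found = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


def suspicious_fields(info: dict) -> list:
    """List the values that this report calls out as abnormal."""
    notes = []
    if info.get("maxSlots") == "0":
        notes.append("maxSlots is 0")
    if info.get("totalSpace") == "0.0 B":
        notes.append("totalSpace is 0.0 B")
    return notes


info = parse_disk_info(HEARTBEAT)
print(info)
print(suspicious_fields(info))
```

Note that the only DiskInfo entry in the heartbeat has mountPoint and storageType set to HDFS, so no local disk directories appear to be reported by this Worker.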
Steps to Reproduce:
… maxSlots, activeSlots, usableSpace, totalSpace, and committed shuffles fields.