[BUG] Worker not getting resources (Slots) #3075

Open
yangbin09 opened this issue Jan 21, 2025 · 0 comments
Labels: bug (Something isn't working)

Description:

After starting the Celeborn Worker, the following issues occur:

  1. Worker not getting resources (Slots)
    The logs show:

    maxSlots: 0
    activeSlots: 0

    Both values are 0, so the Worker reports no slot capacity and has no slots in use. With no slots, it cannot accept any shuffle work.

  2. Incorrect disk space information
    In the heartbeat message sent from the Worker to the Master, the disk space shows:

    usableSpace: 8.0 EiB
    totalSpace: 0.0 B
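
    A usableSpace of 8.0 EiB against a totalSpace of 0.0 B is not a plausible pair, since usable space cannot exceed total space, so the Worker is not reporting real disk capacity. In the heartbeat shown under Log Summary, this entry has mountPoint: HDFS and storageType: HDFS, which points to the storage path or mount configuration as the first thing to check.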

  3. No running Shuffle tasks
    The logs show:

    committed shuffles: 0
    running applications: 0

    No shuffle has been committed and no application is running on this Worker, most likely because it has no slots to offer (see item 1) or because shuffle allocation failed.

  4. Memory management is normal, but no Shuffle operations are taking place
    Direct memory usage is low:

    Direct memory usage: 4.0 MiB/1024.0 MiB

    Memory pressure therefore does not appear to be the cause. No shuffle operations occur because the Worker has no available slots.
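
For item 2, a quick arithmetic check may help readers interpret the 8.0 EiB figure. The sketch below assumes (this is not confirmed by the logs) that the byte count behind usableSpace is the JVM's Long.MaxValue, 2**63 - 1, used as a placeholder. The `format_bytes` helper is hypothetical and written only for this illustration; it is not Celeborn's own formatting code.

```python
# Assumption: the raw value behind "usableSpace: 8.0 EiB" is Long.MaxValue.
LONG_MAX = 2**63 - 1


def format_bytes(n: int) -> str:
    """Render a byte count with binary units and one decimal place."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


print(format_bytes(LONG_MAX))         # 8.0 EiB
print(format_bytes(0))                # 0.0 B
print(format_bytes(4 * 1024 * 1024))  # 4.0 MiB
```

If that assumption holds, 8.0 EiB would be a sentinel meaning "unbounded" rather than a measured size, and the 0.0 B totalSpace would simply be an unset field.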

Steps to Reproduce:

  1. Start the Celeborn Worker.
  2. Start a Spark job that performs a shuffle operation.
  3. Check the logs of the Celeborn Worker, paying attention to the maxSlots, activeSlots, usableSpace, totalSpace, and committed shuffles fields.
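
Step 3 can be done by eye, but a small script makes it easier to compare heartbeats over time. The following is a minimal sketch that pulls the fields named above out of a DiskInfo string. The sample text is copied from the heartbeat in this report; the `extract_fields` helper and the regular expression are illustrative assumptions, not part of Celeborn.

```python
import re

# DiskInfo portion of the HeartbeatFromWorker message quoted in this report.
SAMPLE = (
    "DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, "
    "shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, "
    "totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, "
    "activeSlots: 0, storageType: HDFS)"
)

# Field names exactly as they appear in the log line.
FIELDS = [
    "maxSlots",
    "activeSlots",
    "usableSpace",
    "totalSpace",
    "committed shuffles",
    "running applications",
]


def extract_fields(line: str) -> dict:
    """Return the value printed after each known field name, if present."""
    found = {}
    for name in FIELDS:
        # The colon is optional: "committed shuffles" and
        # "running applications" are printed without one.
        match = re.search(re.escape(name) + r":?\s*([^,)]+)", line)
        if match:
            found[name] = match.group(1).strip()
    return found


for key, value in extract_fields(SAMPLE).items():
    print(f"{key}: {value}")
```

Running this against each heartbeat line in the Worker log would show whether maxSlots or totalSpace ever becomes non-zero after startup.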

Log Summary:

Here are the relevant log entries:

25/01/21 16:57:32,635 DEBUG [worker-disk-checker] LocalDeviceMonitor: Device check start
25/01/21 16:57:34,108 INFO [worker-memory-manager-reporter] MemoryManager: Direct memory usage: 4.0 MiB/1024.0 MiB, disk buffer size: 0.0 B, sort memory size: 0.0 B, read buffer size: 0.0 B, memory file storage size: 0.0 B
25/01/21 16:57:37,897 INFO [worker-forward-message-scheduler] StorageManager: Updated diskInfos:
25/01/21 16:57:37,898 DEBUG [worker-forward-message-scheduler] MasterClient: Send rpc message HeartbeatFromWorker(21.102.91.135,38565,35711,40333,42329, Stream(DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, activeSlots: 0, storageType: HDFS) status: HEALTHY dirs , ?),{},[],{}, false,WorkerStatus{state= Normal, stateStartTime=1737448232150},688489b2-4a4b-4840-bb9f-78bfd46031b8#54)
yangbin09 added the bug label on Jan 21, 2025