ray_scheduler: workspace + fixed no role logging (pytorch#492)
Summary:

This updates Ray to have proper workspace support.

* `-c working_dir=...` is deprecated in favor of `torchx run --workspace=...`
* `-c requirements=...` is optional; a `requirements.txt` is automatically read from the workspace if present
* `torchx log ray://foo/bar` works without requiring `/ray/0`

Pull Request resolved: pytorch#492

Test Plan:

```
(torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx run -s ray --wait --log dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size
torchx 2022-05-18 16:55:31 INFO Checking for changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray`...
torchx 2022-05-18 16:55:31 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2022-05-18 16:55:31 INFO Built new image `/tmp/torchx_workspacebe6331jv` based on original image `ghcr.io/pytorch/torchx:0.2.0dev0` and changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray` for role[0]=compute_world_size.
torchx 2022-05-18 16:55:31 WARNING The Ray scheduler does not support port mapping.
torchx 2022-05-18 16:55:31 INFO Uploading package gcs://_ray_pkg_63a39f7096dfa0bd.zip.
torchx 2022-05-18 16:55:31 INFO Creating a file package for local directory '/tmp/torchx_workspacebe6331jv'.
ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
torchx 2022-05-18 16:55:31 INFO Launched app: ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
torchx 2022-05-18 16:55:31 INFO AppStatus:
  msg: PENDING
  num_restarts: -1
  roles:
  - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 2
      structured_error_msg: <NONE>
    role: ray
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-05-18 16:55:31 INFO Job URL: None
torchx 2022-05-18 16:55:31 INFO Waiting for the app to finish...
torchx 2022-05-18 16:55:31 INFO Waiting for app to start before logging...
```
```
torchx 2022-05-18 16:55:43 INFO Job finished: SUCCEEDED
(torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
ray/0 Waiting for placement group to start.
ray/0 running ray.wait on [ObjectRef(8f2664c081ffc268e1c4275021ead9801a8d33861a00000001000000), ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
ray/0 (CommandActor pid=494377) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
ray/0 (CommandActor pid=494377)   entrypoint       : scripts.compute_world_size
ray/0 (CommandActor pid=494377)   min_nodes        : 2
ray/0 (CommandActor pid=494377)   max_nodes        : 2
ray/0 (CommandActor pid=494377)   nproc_per_node   : 1
ray/0 (CommandActor pid=494377)   run_id           : compute_world_size-mpr03nzqvvg3td
ray/0 (CommandActor pid=494377)   rdzv_backend     : c10d
ray/0 (CommandActor pid=494377)   rdzv_endpoint    : localhost:29500
ray/0 (CommandActor pid=494377)   rdzv_configs     : {'timeout': 900}
ray/0 (CommandActor pid=494377)   max_restarts     : 0
ray/0 (CommandActor pid=494377)   monitor_interval : 5
ray/0 (CommandActor pid=494377)   log_dir          : None
ray/0 (CommandActor pid=494377)   metrics_cfg      : {}
ray/0 (CommandActor pid=494377)
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
ray/0 (CommandActor pid=494406) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
ray/0 (CommandActor pid=494406)   entrypoint       : scripts.compute_world_size
ray/0 (CommandActor pid=494406)   min_nodes        : 2
ray/0 (CommandActor pid=494406)   max_nodes        : 2
ray/0 (CommandActor pid=494406)   nproc_per_node   : 1
ray/0 (CommandActor pid=494406)   run_id           : compute_world_size-mpr03nzqvvg3td
ray/0 (CommandActor pid=494406)   rdzv_backend     : c10d
ray/0 (CommandActor pid=494406)   rdzv_endpoint    : 172.26.20.254:29500
ray/0 (CommandActor pid=494406)   rdzv_configs     : {'timeout': 900}
ray/0 (CommandActor pid=494406)   max_restarts     : 0
ray/0 (CommandActor pid=494406)   monitor_interval : 5
ray/0 (CommandActor pid=494406)   log_dir          : None
ray/0 (CommandActor pid=494406)   metrics_cfg      : {}
ray/0 (CommandActor pid=494406)
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
ray/0 (CommandActor pid=494377)   restart_count=0
ray/0 (CommandActor pid=494377)   master_addr=tristanr-arch2
ray/0 (CommandActor pid=494377)   master_port=48089
ray/0 (CommandActor pid=494377)   group_rank=1
ray/0 (CommandActor pid=494377)   group_world_size=2
ray/0 (CommandActor pid=494377)   local_ranks=[0]
ray/0 (CommandActor pid=494377)   role_ranks=[1]
ray/0 (CommandActor pid=494377)   global_ranks=[1]
ray/0 (CommandActor pid=494377)   role_world_sizes=[2]
ray/0 (CommandActor pid=494377)   global_world_sizes=[2]
ray/0 (CommandActor pid=494377)
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t/attempt_0/0/error.json
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
ray/0 (CommandActor pid=494406)   restart_count=0
ray/0 (CommandActor pid=494406)   master_addr=tristanr-arch2
ray/0 (CommandActor pid=494406)   master_port=48089
ray/0 (CommandActor pid=494406)   group_rank=0
ray/0 (CommandActor pid=494406)   group_world_size=2
ray/0 (CommandActor pid=494406)   local_ranks=[0]
ray/0 (CommandActor pid=494406)   role_ranks=[0]
ray/0 (CommandActor pid=494406)   global_ranks=[0]
ray/0 (CommandActor pid=494406)   role_world_sizes=[2]
ray/0 (CommandActor pid=494406)   global_world_sizes=[2]
ray/0 (CommandActor pid=494406)
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p/attempt_0/0/error.json
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.000942230224609375 seconds
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0013003349304199219 seconds
ray/0 (CommandActor pid=494377) [0]:initializing `gloo` process group
ray/0 (CommandActor pid=494377) [0]:successfully initialized process group
ray/0 (CommandActor pid=494377) [0]:rank: 1, actual world_size: 2, computed world_size: 2
ray/0 (CommandActor pid=494406) [0]:initializing `gloo` process group
ray/0 (CommandActor pid=494406) [0]:successfully initialized process group
ray/0 (CommandActor pid=494406) [0]:rank: 0, actual world_size: 2, computed world_size: 2
ray/0 running ray.wait on [ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
```

Reviewed By: kiukchung, msaroufim

Differential Revision: D36500237

Pulled By: d4l3k

fbshipit-source-id: 9ecf85b7860a7220262f0146890012cc88630cd2
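For reference, a sketch of the CLI changes described in the summary. The workspace path `./my_project` and the app handle are illustrative; these commands assume a Ray cluster reachable by torchx and cannot run standalone:

```shell
# Old (deprecated): pass the workspace via scheduler config values.
torchx run -s ray -c working_dir=./my_project,requirements=./my_project/requirements.txt \
    dist.ddp -j 2x1 -m scripts.compute_world_size

# New: pass the workspace directly to `torchx run`. If the workspace
# contains a requirements.txt, it is picked up automatically, so
# `-c requirements=...` may be omitted.
torchx run -s ray --workspace=./my_project \
    dist.ddp -j 2x1 -m scripts.compute_world_size

# Logs can now be fetched with just the app handle; appending the
# `/ray/0` role selector is no longer required.
torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
```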