
run par as an entrypoint if there is no patch or jetter patch. #994

Merged
1 commit merged on Dec 18, 2024
run par as an entrypoint if there is no patch or jetter patch. (#994)
Summary:

# Context:
Currently, when running a torchx local job, we use penv_python as the entrypoint. That means the actual .par or .xar file is passed as an argument to penv_python; within penv_python, the par/xar is then executed as a new process.

# Old way to run torchx local job

For example, if the local job is running "jetter --help", torchx runs it like:
  PENV_PAR='/data/users/yikai/fbsource/buck-out/v2/gen/fbcode/a6cb9616985b22b0/jetter/__jetter-bin__/jetter-bin-inplace.par' penv_python -m jetter.main --help
The par file is passed via an environment variable called "PENV_PAR". (There is another way to pass it to penv_python: setting a "PENV_PARNAME" env variable and resolving the par file's path from it. But that is very rare, roughly 0.1% of total usage.)

# New way to run torchx local job
After the migration, we will run it like:
  PAR_MAIN_OVERRIDE=jetter.main /data/users/yikai/fbsource/buck-out/v2/gen/fbcode/a6cb9616985b22b0/jetter/__jetter-bin__/jetter-bin-inplace.par --help


NOTE: This diff only migrates one of the most common use cases, namely where: 1. there is no patch or jetter patch; 2. the binary is a par, not a xar; 3. the par file is passed via the "PENV_PAR" env variable. For all other use cases, we still run penv_python as the entrypoint.
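
To illustrate the idea, the rewrite roughly amounts to swapping the penv_python wrapper for a direct .par invocation with PAR_MAIN_OVERRIDE set. The sketch below is hypothetical: the helper name maybe_rewrite_par_cmd, the assumed "penv_python -m <module>" argument layout, and the omitted patch checks are not part of this diff (that logic lives in internal scheduler code); only the env variables named in this summary are real.

  # Hypothetical sketch of the rewrite described above; illustrative only.
  from typing import Dict, List, Tuple

  def maybe_rewrite_par_cmd(
      args: List[str], env: Dict[str, str]
  ) -> Tuple[List[str], Dict[str, str]]:
      """Turn `penv_python -m <module> ...` into a direct .par invocation
      when the simple case applies; otherwise leave the command unchanged."""
      par = env.get("PENV_PAR", "")
      simple_case = (
          par.endswith(".par")            # a par, not a xar
          and "PENV_PARNAME" not in env   # par passed via PENV_PAR only
          and len(args) >= 3
          and args[0] == "penv_python"
          and args[1] == "-m"
          # patch / jetter-patch checks omitted; they are internal details
      )
      if not simple_case:
          return args, env
      module, rest = args[2], args[3:]
      return [par] + rest, {**env, "PAR_MAIN_OVERRIDE": module}

With args ["penv_python", "-m", "jetter.main", "--help"] and PENV_PAR set to the jetter par path, this yields [<par path>, "--help"] with PAR_MAIN_OVERRIDE=jetter.main, which matches the new invocation shown above.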

Reviewed By: Sanjay-Ganeshan

Differential Revision: D66621649
yikaiMeta authored and facebook-github-bot committed Dec 18, 2024
commit 81b78b8596f62823e5dbd525d3cf5c56fb8f4f6f
20 changes: 18 additions & 2 deletions torchx/schedulers/local_scheduler.py
@@ -696,12 +696,11 @@ def _popen(
         log.debug(f"Running {role_name} (replica {replica_id}):\n {args_pfmt}")
         env = self._get_replica_env(replica_params)
 
-        proc = subprocess.Popen(
+        proc = self.run_local_job(
             args=replica_params.args,
             env=env,
             stdout=stdout_,
             stderr=stderr_,
-            start_new_session=True,
             cwd=replica_params.cwd,
         )
         return _LocalReplica(
@@ -714,6 +713,23 @@ def _popen(
             error_file=env.get("TORCHELASTIC_ERROR_FILE", "<N/A>"),
         )
 
+    def run_local_job(
+        self,
+        args: List[str],
+        env: Dict[str, str],
+        stdout: Optional[io.FileIO],
+        stderr: Optional[io.FileIO],
+        cwd: Optional[str] = None,
+    ) -> "subprocess.Popen[bytes]":
+        return subprocess.Popen(
+            args=args,
+            env=env,
+            stdout=stdout,
+            stderr=stderr,
+            start_new_session=True,
+            cwd=cwd,
+        )
+
     def _get_replica_output_handles(
         self,
         replica_params: ReplicaParam,
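
Since _popen now delegates process creation to run_local_job, a scheduler that needs different launch behavior can override just that hook instead of reimplementing _popen. A minimal sketch, assuming a hypothetical subclass (ParAwareLocalScheduler is not part of this diff; only LocalScheduler and the run_local_job signature come from the change above):

  # Hedged sketch: the subclass name and the comment about what an override
  # might do are illustrative; the hook itself is what the diff adds.
  import io
  import subprocess
  from typing import Dict, List, Optional

  from torchx.schedulers.local_scheduler import LocalScheduler

  class ParAwareLocalScheduler(LocalScheduler):
      def run_local_job(
          self,
          args: List[str],
          env: Dict[str, str],
          stdout: Optional[io.FileIO],
          stderr: Optional[io.FileIO],
          cwd: Optional[str] = None,
      ) -> "subprocess.Popen[bytes]":
          # A subclass could rewrite args/env here (e.g. set PAR_MAIN_OVERRIDE and
          # execute the .par directly) before delegating to the base launcher.
          return super().run_local_job(
              args=args, env=env, stdout=stdout, stderr=stderr, cwd=cwd
          )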