
run par as an entrypoint if there is no patch or jetter patch. #994

Merged
1 commit merged on Dec 18, 2024
run par as an entrypoint if there is no patch or jetter patch. (#994)
Summary:

# Context:
Currently, when running a torchx local job, we use penv_python as the entrypoint. That means the actual .par or .xar file is passed as an argument to penv_python; within penv_python, the par/xar is then executed as a new process.

# Old way to run torchx local job

For example, if the local job is running "jetter --help", torchx runs it like:
  PENV_PAR='/data/users/yikai/fbsource/buck-out/v2/gen/fbcode/a6cb9616985b22b0/jetter/__jetter-bin__/jetter-bin-inplace.par' penv_python -m jetter.main --help
The par file is passed via an environment variable called "PENV_PAR". (There is another way to pass it to penv_python: setting a "PENV_PARNAME" env variable and resolving the par file's path from it. But that is very rare, roughly 0.1% of total usage.)

# New way to run torchx local job
After the migration, we will run it like:
  PAR_MAIN_OVERRIDE=jetter.main /data/users/yikai/fbsource/buck-out/v2/gen/fbcode/a6cb9616985b22b0/jetter/__jetter-bin__/jetter-bin-inplace.par --help


NOTE: This diff only migrates one of the most common use cases, namely where: 1. there is no patch or jetter patch; 2. the binary is a par, not a xar; 3. the par file is passed via the "PENV_PAR" env variable. For all other use cases, we still run penv_python as the entrypoint.
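
To illustrate the idea, the rewrite roughly amounts to swapping the penv_python wrapper for a direct .par invocation with PAR_MAIN_OVERRIDE set. The sketch below is hypothetical: the helper name maybe_rewrite_par_cmd, the assumed "penv_python -m <module>" argument layout, and the omitted patch checks are not part of this diff (that logic lives in internal scheduler code); only the env variables named in this summary are real.

  # Hypothetical sketch of the rewrite described above; illustrative only.
  from typing import Dict, List, Tuple

  def maybe_rewrite_par_cmd(
      args: List[str], env: Dict[str, str]
  ) -> Tuple[List[str], Dict[str, str]]:
      """Turn `penv_python -m <module> ...` into a direct .par invocation
      when the simple case applies; otherwise leave the command unchanged."""
      par = env.get("PENV_PAR", "")
      simple_case = (
          par.endswith(".par")            # a par, not a xar
          and "PENV_PARNAME" not in env   # par passed via PENV_PAR only
          and len(args) >= 3
          and args[0] == "penv_python"
          and args[1] == "-m"
          # patch / jetter-patch checks omitted; they are internal details
      )
      if not simple_case:
          return args, env
      module, rest = args[2], args[3:]
      return [par] + rest, {**env, "PAR_MAIN_OVERRIDE": module}

With args ["penv_python", "-m", "jetter.main", "--help"] and PENV_PAR set to the jetter par path, this yields [<par path>, "--help"] with PAR_MAIN_OVERRIDE=jetter.main, which matches the new invocation shown above.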

Reviewed By: Sanjay-Ganeshan

Differential Revision: D66621649
yikaiMeta authored and facebook-github-bot committed Dec 18, 2024
commit 81b78b8596f62823e5dbd525d3cf5c56fb8f4f6f
20 changes: 18 additions & 2 deletions torchx/schedulers/local_scheduler.py
@@ -696,12 +696,11 @@ def _popen(
         log.debug(f"Running {role_name} (replica {replica_id}):\n {args_pfmt}")
         env = self._get_replica_env(replica_params)
 
-        proc = subprocess.Popen(
+        proc = self.run_local_job(
             args=replica_params.args,
             env=env,
             stdout=stdout_,
             stderr=stderr_,
-            start_new_session=True,
             cwd=replica_params.cwd,
         )
         return _LocalReplica(
@@ -714,6 +713,23 @@ def _popen(
             error_file=env.get("TORCHELASTIC_ERROR_FILE", "<N/A>"),
         )
 
+    def run_local_job(
+        self,
+        args: List[str],
+        env: Dict[str, str],
+        stdout: Optional[io.FileIO],
+        stderr: Optional[io.FileIO],
+        cwd: Optional[str] = None,
+    ) -> "subprocess.Popen[bytes]":
+        return subprocess.Popen(
+            args=args,
+            env=env,
+            stdout=stdout,
+            stderr=stderr,
+            start_new_session=True,
+            cwd=cwd,
+        )
+
     def _get_replica_output_handles(
         self,
         replica_params: ReplicaParam,
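
Since _popen now delegates process creation to run_local_job, a scheduler that needs different launch behavior can override just that hook instead of reimplementing _popen. A minimal sketch, assuming a hypothetical subclass (ParAwareLocalScheduler is not part of this diff; only LocalScheduler and the run_local_job signature come from the change above):

  # Hedged sketch: the subclass name and the comment about what an override
  # might do are illustrative; the hook itself is what the diff adds.
  import io
  import subprocess
  from typing import Dict, List, Optional

  from torchx.schedulers.local_scheduler import LocalScheduler

  class ParAwareLocalScheduler(LocalScheduler):
      def run_local_job(
          self,
          args: List[str],
          env: Dict[str, str],
          stdout: Optional[io.FileIO],
          stderr: Optional[io.FileIO],
          cwd: Optional[str] = None,
      ) -> "subprocess.Popen[bytes]":
          # A subclass could rewrite args/env here (e.g. set PAR_MAIN_OVERRIDE and
          # execute the .par directly) before delegating to the base launcher.
          return super().run_local_job(
              args=args, env=env, stdout=stdout, stderr=stderr, cwd=cwd
          )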