This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

First pass of job lock usage review #233

Open

yuesong wants to merge 1 commit into master from yuesong/joblock-review

Contributor

yuesong commented Apr 18, 2017

This is not a PR in the typical sense. The changes are all comments as the result of my review of all Spawn.jobLock usage. The necessity of many are questionable to me. My main question is the intent of the jobLock in the first place: is it to guard against concurrent modification to SpawnState.jobs map, to individual jobs, or to tasks? Based on current code base, it's being used rather liberally for all of those and more. History may also be a factor: SpawnState.jobs is a ConcurrentHashMap, so some of the locks are obviously unnecessary, but jobs might have been a regular Map at one time such that the locks were required? I want to understand if I'm missing something important. This is my attempt at a collaborative code review to get other more knowledgable developers to comment.

Usages that I'm rather certain to be unnecessary are marked as "Not needed" - I could be wrong. Usages that I'm unsure are marked as "Maybe". All inputs are welcome.


          First pass of job lock usage review

711c5ac

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

@@ @@ -1656,6 +1668,7 @@ public boolean checkStatusForMove(String hostID) { @@
                   }
                   public boolean prepareTaskStatesForRebalance(Job job, JobTask task, boolean isMigration) {
+                      // Why?

ghost Apr 18, 2017

I think it is conceivable that the task could have suddenly kicked in between the call to isInMovableState and the call to job.setTaskState .

If that were the case, the task state would show up as REBALANCE, but the true state would be RUNNING, which would be a bad outcome.

Holding the job lock here prevents the task from kicking until we have decided whether we are rebalancing this task.

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

@@ @@ -1932,6 +1948,7 @@ public void stopJob(String jobUUID) throws Exception { @@
                   public void killJob(String jobUUID) throws Exception {
                       boolean success = false;
                       while (!success && !shuttingDown.get()) {
+                          // Maybe

ghost Apr 18, 2017 •

edited by ghost

Loading

I think this loop is concerning for an unrelated reason -- if the queueLock is locked, we call tryLock over and over again without any sort of delay in between. Maybe we should sleep for a few millis on failure?

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

                       } finally {
                           releaseJobLock();
                       }
+                      // If job is deleted in another thread, stopJob and killJob will throw exception. Should catch that and return DeleteStatus.JOB_MISSING

ghost Apr 18, 2017

I don't understand why we are calling both stopJob and killJob in the first place. If we are hoping to let the job shut down cleanly, 100 ms is certainly not enough when we consider that the job almost certainly has to replicate and backup. I think killJob should be sufficient, since our intent is to delete anyway.

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

                    * @return A string description
                    */
                   public JSONObject fixTaskDir(String jobId, int node, boolean ignoreTaskState, boolean orphansOnly) {
+                      // Maybe

ghost Apr 18, 2017

Given all the task state checks inside the loop, I think this one should be kept.

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

@@ @@ -1388,6 +1397,7 @@ private void updateJobDependencies(String jobId) { @@
                       if ((job == null) || (job.getParameters() == null)) {
                           return dataSources;
                       }
+                      // Maybe

ghost Apr 18, 2017

Given that parameters is a non-threadsafe collection that is exposed directly via a getter, I think that some protection is necessary. There is nothing to stop someone from calling job.getParameters.add(something) at an inopportune moment.

Changing the getter to return a copy rather than the actual list might be a more lightweight solution

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

@@ @@ -1260,6 +1265,8 @@ public RebalanceOutcome rebalanceHost(String hostUUID) { @@
                    * @return True if the task is successfully removed
                    */
                   public boolean deleteTask(String jobUUID, String hostUuid, Integer node, boolean isReplica) {
+                      // Maybe - but probably can be better handled by a lock on the individual job/task?
+                      // Why do spawnMQ.sendControlMessage and queueJobTaskUpdateEvent need to be guarded?

ghost Apr 18, 2017

At the very least, I think the task.setReplicas call is worthy of consideration.

If a task has replicas A, B, and C, and one thread tries to remove A at the same time as another thread tries to remove B, then it is possible that one removal would be swallowed depending on the timing.

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/Spawn.java

                           return false;
                       }
                       Job job;
+                      // Maybe

ghost Apr 18, 2017

Strongly believe that this lock should be kept.

If we somehow tried to swap in two threads at the same time (swaps can be executed via an HTTP call, I believe) then modifications to the replicas and the hostUUID could interleave in a very harmful way -- for example, setting the replica and the host to the same UUID and "losing" a valid replica elsewhere.

ghost reviewed

View reviewed changes

hydra-main/src/main/java/com/addthis/hydra/job/spawn/balancer/SpawnBalancer.java

                   private List<JobTask> findAllTasksAssignedToHost(String failedHostUUID) {
                       List<JobTask> rv = new ArrayList<>();
+                      // Why?

ghost Apr 18, 2017

I think that it is prudent to hold the job lock while failing hosts -- which is the only time this method is called. We don't want to, e.g., kick tasks before all the bad hosts have been replaced.

However, this read-only method does not appear to be useful place to get the lock. I would suggest SpawnBalancer's attemptFixTaskForFailedHost as a method that more clearly needs the lock.

Contributor Author

yuesong commented Apr 20, 2017

@alstaplesmoore Thanks for the detailed comments. I need to spend some time digesting. It's probably easier to look at a block of code guarded by the lock and reason whether it can be executed by two threads concurrently, but it's harder to tell the whether different blocks need to be guarded by the same lock.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet