-
Notifications
You must be signed in to change notification settings - Fork 86
First pass of job lock usage review #233
base: master
Are you sure you want to change the base?
Conversation
@@ -1656,6 +1668,7 @@ public boolean checkStatusForMove(String hostID) { | |||
} | |||
|
|||
public boolean prepareTaskStatesForRebalance(Job job, JobTask task, boolean isMigration) { | |||
// Why? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is conceivable that the task could have suddenly kicked in between the call to isInMovableState and the call to job.setTaskState .
If that were the case, the task state would show up as REBALANCE, but the true state would be RUNNING, which would be a bad outcome.
Holding the job lock here prevents the task from kicking until we have decided whether we are rebalancing this task.
@@ -1932,6 +1948,7 @@ public void stopJob(String jobUUID) throws Exception { | |||
public void killJob(String jobUUID) throws Exception { | |||
boolean success = false; | |||
while (!success && !shuttingDown.get()) { | |||
// Maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this loop is concerning for an unrelated reason -- if the queueLock is locked, we call tryLock over and over again without any sort of delay in between. Maybe we should sleep for a few millis on failure?
@@ -1687,6 +1701,7 @@ public DeleteStatus forceDeleteJob(String jobUUID) throws Exception { | |||
} finally { | |||
releaseJobLock(); | |||
} | |||
// If job is deleted in another thread, stopJob and killJob will throw exception. Should catch that and return DeleteStatus.JOB_MISSING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we are calling both stopJob and killJob in the first place. If we are hoping to let the job shut down cleanly, 100 ms is certainly not enough when we consider that the job almost certainly has to replicate and backup. I think killJob should be sufficient, since our intent is to delete anyway.
@@ -1480,6 +1490,7 @@ public RebalanceOutcome rebalanceJob(String jobUUID, int tasksToMove, String use | |||
* @return A string description | |||
*/ | |||
public JSONObject fixTaskDir(String jobId, int node, boolean ignoreTaskState, boolean orphansOnly) { | |||
// Maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given all the task state checks inside the loop, I think this one should be kept.
@@ -1388,6 +1397,7 @@ private void updateJobDependencies(String jobId) { | |||
if ((job == null) || (job.getParameters() == null)) { | |||
return dataSources; | |||
} | |||
// Maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that parameters
is a non-threadsafe collection that is exposed directly via a getter, I think that some protection is necessary. There is nothing to stop someone from calling job.getParameters.add(something) at an inopportune moment.
Changing the getter to return a copy rather than the actual list might be a more lightweight solution
@@ -1260,6 +1265,8 @@ public RebalanceOutcome rebalanceHost(String hostUUID) { | |||
* @return True if the task is successfully removed | |||
*/ | |||
public boolean deleteTask(String jobUUID, String hostUuid, Integer node, boolean isReplica) { | |||
// Maybe - but probably can be better handled by a lock on the individual job/task? | |||
// Why do spawnMQ.sendControlMessage and queueJobTaskUpdateEvent need to be guarded? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least, I think the task.setReplicas call is worthy of consideration.
If a task has replicas A, B, and C, and one thread tries to remove A at the same time as another thread tries to remove B, then it is possible that one removal would be swallowed depending on the timing.
@@ -1133,6 +1137,7 @@ public boolean swapTask(JobTask task, String replicaHostID, boolean kickOnComple | |||
return false; | |||
} | |||
Job job; | |||
// Maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strongly believe that this lock should be kept.
If we somehow tried to swap in two threads at the same time (swaps can be executed via an HTTP call, I believe) then modifications to the replicas and the hostUUID could interleave in a very harmful way -- for example, setting the replica and the host to the same UUID and "losing" a valid replica elsewhere.
@@ -808,6 +808,7 @@ public void fixTasksForFailedHost(List<HostState> hosts, String failedHost) { | |||
|
|||
private List<JobTask> findAllTasksAssignedToHost(String failedHostUUID) { | |||
List<JobTask> rv = new ArrayList<>(); | |||
// Why? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it is prudent to hold the job lock while failing hosts -- which is the only time this method is called. We don't want to, e.g., kick tasks before all the bad hosts have been replaced.
However, this read-only method does not appear to be useful place to get the lock. I would suggest SpawnBalancer's attemptFixTaskForFailedHost
as a method that more clearly needs the lock.
@alstaplesmoore Thanks for the detailed comments. I need to spend some time digesting. It's probably easier to look at a block of code guarded by the lock and reason whether it can be executed by two threads concurrently, but it's harder to tell the whether different blocks need to be guarded by the same lock. |
This is not a PR in the typical sense. The changes are all comments as the result of my review of all Spawn.jobLock usage. The necessity of many are questionable to me. My main question is the intent of the jobLock in the first place: is it to guard against concurrent modification to
SpawnState.jobs
map, to individual jobs, or to tasks? Based on current code base, it's being used rather liberally for all of those and more. History may also be a factor:SpawnState.jobs
is aConcurrentHashMap
, so some of the locks are obviously unnecessary, butjobs
might have been a regular Map at one time such that the locks were required? I want to understand if I'm missing something important. This is my attempt at a collaborative code review to get other more knowledgable developers to comment.Usages that I'm rather certain to be unnecessary are marked as "Not needed" - I could be wrong. Usages that I'm unsure are marked as "Maybe". All inputs are welcome.