Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

First pass of job lock usage review #233

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yuesong
Copy link
Contributor

@yuesong yuesong commented Apr 18, 2017

This is not a PR in the typical sense. The changes are all comments as the result of my review of all Spawn.jobLock usage. The necessity of many are questionable to me. My main question is the intent of the jobLock in the first place: is it to guard against concurrent modification to SpawnState.jobs map, to individual jobs, or to tasks? Based on current code base, it's being used rather liberally for all of those and more. History may also be a factor: SpawnState.jobs is a ConcurrentHashMap, so some of the locks are obviously unnecessary, but jobs might have been a regular Map at one time such that the locks were required? I want to understand if I'm missing something important. This is my attempt at a collaborative code review to get other more knowledgable developers to comment.

Usages that I'm rather certain to be unnecessary are marked as "Not needed" - I could be wrong. Usages that I'm unsure are marked as "Maybe". All inputs are welcome.

@@ -1656,6 +1668,7 @@ public boolean checkStatusForMove(String hostID) {
}

public boolean prepareTaskStatesForRebalance(Job job, JobTask task, boolean isMigration) {
// Why?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is conceivable that the task could have suddenly kicked in between the call to isInMovableState and the call to job.setTaskState .

If that were the case, the task state would show up as REBALANCE, but the true state would be RUNNING, which would be a bad outcome.

Holding the job lock here prevents the task from kicking until we have decided whether we are rebalancing this task.

@@ -1932,6 +1948,7 @@ public void stopJob(String jobUUID) throws Exception {
public void killJob(String jobUUID) throws Exception {
boolean success = false;
while (!success && !shuttingDown.get()) {
// Maybe
Copy link

@ghost ghost Apr 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this loop is concerning for an unrelated reason -- if the queueLock is locked, we call tryLock over and over again without any sort of delay in between. Maybe we should sleep for a few millis on failure?

@@ -1687,6 +1701,7 @@ public DeleteStatus forceDeleteJob(String jobUUID) throws Exception {
} finally {
releaseJobLock();
}
// If job is deleted in another thread, stopJob and killJob will throw exception. Should catch that and return DeleteStatus.JOB_MISSING
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are calling both stopJob and killJob in the first place. If we are hoping to let the job shut down cleanly, 100 ms is certainly not enough when we consider that the job almost certainly has to replicate and backup. I think killJob should be sufficient, since our intent is to delete anyway.

@@ -1480,6 +1490,7 @@ public RebalanceOutcome rebalanceJob(String jobUUID, int tasksToMove, String use
* @return A string description
*/
public JSONObject fixTaskDir(String jobId, int node, boolean ignoreTaskState, boolean orphansOnly) {
// Maybe
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the task state checks inside the loop, I think this one should be kept.

@@ -1388,6 +1397,7 @@ private void updateJobDependencies(String jobId) {
if ((job == null) || (job.getParameters() == null)) {
return dataSources;
}
// Maybe
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that parameters is a non-threadsafe collection that is exposed directly via a getter, I think that some protection is necessary. There is nothing to stop someone from calling job.getParameters.add(something) at an inopportune moment.

Changing the getter to return a copy rather than the actual list might be a more lightweight solution

@@ -1260,6 +1265,8 @@ public RebalanceOutcome rebalanceHost(String hostUUID) {
* @return True if the task is successfully removed
*/
public boolean deleteTask(String jobUUID, String hostUuid, Integer node, boolean isReplica) {
// Maybe - but probably can be better handled by a lock on the individual job/task?
// Why do spawnMQ.sendControlMessage and queueJobTaskUpdateEvent need to be guarded?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, I think the task.setReplicas call is worthy of consideration.

If a task has replicas A, B, and C, and one thread tries to remove A at the same time as another thread tries to remove B, then it is possible that one removal would be swallowed depending on the timing.

@@ -1133,6 +1137,7 @@ public boolean swapTask(JobTask task, String replicaHostID, boolean kickOnComple
return false;
}
Job job;
// Maybe
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly believe that this lock should be kept.

If we somehow tried to swap in two threads at the same time (swaps can be executed via an HTTP call, I believe) then modifications to the replicas and the hostUUID could interleave in a very harmful way -- for example, setting the replica and the host to the same UUID and "losing" a valid replica elsewhere.

@@ -808,6 +808,7 @@ public void fixTasksForFailedHost(List<HostState> hosts, String failedHost) {

private List<JobTask> findAllTasksAssignedToHost(String failedHostUUID) {
List<JobTask> rv = new ArrayList<>();
// Why?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is prudent to hold the job lock while failing hosts -- which is the only time this method is called. We don't want to, e.g., kick tasks before all the bad hosts have been replaced.

However, this read-only method does not appear to be useful place to get the lock. I would suggest SpawnBalancer's attemptFixTaskForFailedHost as a method that more clearly needs the lock.

@yuesong
Copy link
Contributor Author

yuesong commented Apr 20, 2017

@alstaplesmoore Thanks for the detailed comments. I need to spend some time digesting. It's probably easier to look at a block of code guarded by the lock and reason whether it can be executed by two threads concurrently, but it's harder to tell the whether different blocks need to be guarded by the same lock.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant