Need advice with optimizing the running speed of OR-Tools Routing #4172
Replies: 4 comments 6 replies
-
Hi, I thought I replied via the mailing list interface, but I guess it did not post. If it comes through later, apologies in advance for the double posting. I'm addressing your first issue only here.

Caveat: I'm not exactly certain what your problem is, but I think you mean the delay is just for the call to … Assuming this is true, my guess is that you have turned on caching in your solver, or else the solver internals are (finally) automatically caching all dimensions up front. What the solver is doing is calling your callback to get the result of the function, for every node-to-node pair, for every possible worker. That is a lot of individual calls, and it is necessarily single threaded, because that is how the C++-to-Python interface works in the OR-Tools API. (And if you haven't explicitly turned on caching of dimensions, you should. But I can't tell without seeing how you create the routing object.)

Hmm, as I look at your code, the clue phone is ringing and it is for me. My original thought was to suggest the vector API that has no callback:
But that (and the matching …) won't work here. It seems to me one simple improvement might be to skip the check for the staff number. You already know that on the Python side of the fence, and you've already set it in your dimension creation call. At the very least it saves a zero-op if statement. Here's what I suggest (modifying your input code a little bit). It won't be significantly faster, but it might speed things up a little.

I'm sure you already know this, but just in case you're missing the obvious: with the "vehicle transit" type API calls, each staff has their own evaluator (because you pass a vector of callbacks, one per staff), so your original callback's test of the staff number is obviated. Also, to speed up the call further, I am going to pre-seed the vector of values for each staff based on prior knowledge of which node is their depot node. To do that I am going to swap around your resistance matrix, to have staff first, then node.

```python
reversed_resistance_matrix = {}

# pre-seed the zero values for each staff's depot
for staff_no in staffs_no_list:
    reversed_resistance_matrix[staff_no] = {}
    for from_node in nodes:
        if from_node == depot_nodes[staff_no]:  # fixes the mystery use of depot_no in your call
            reversed_resistance_matrix[staff_no][from_node] = 0
        else:
            reversed_resistance_matrix[staff_no][from_node] = resistance_matrix[from_node][staff_no]

# essentially the same function as yours, just no more check of
# the staff number, no more check of the depot node
def resistance_callback(from_index, staff_no):
    from_no = manager.IndexToNode(from_index)
    return reversed_resistance_matrix[staff_no][from_no]

resistance_callback_indices = []
for staff_no in staffs_no_list:
    # expanding this to clarify things:
    # generate a callback that will ONLY be used for staff_no
    staff_no_callback_fn = partial(resistance_callback, staff_no=staff_no)
    # register that callback and get its index for use by the solver
    staff_no_callback_fn_idx = routing.RegisterUnaryTransitCallback(staff_no_callback_fn)
    # stash the staff-specific callback index for use in the dimension creation
    resistance_callback_indices.append(staff_no_callback_fn_idx)

# create a single dimension, but each vehicle has its own private
# callback function to get the cost of each node
routing.AddDimensionWithVehicleTransits(resistance_callback_indices, 0, 999999999, True, "resistance")
resistance_dimension = routing.GetDimensionOrDie("resistance")
for staff_no in staffs_no_list:
    index = routing.End(staff_no)
    resistance_dimension.SetCumulVarSoftUpperBound(index, 0, 1)
```

Hope that helps. I don't anticipate a lot of speedup there, but at least the function has less work to do.

James
-
having some …
-
I think the difference shows up when the lookup is invoked from within C++.
I'm just reporting my experience when embedding these structures into the OR-Tools callbacks.
I completely agree that numpy is faster within Python itself... I mean, that's the whole point of numpy, right? But my guess is that the speedup I see from dicts comes from the extra language overhead numpy incurs when the call comes out from C++.
If your tests prove otherwise, that would be great.
On the other hand, the fastest of all is to use the matrix and vector dimensions where you just shove all the data at once to OR Tools.
Regards,
James
…On Sat, May 18, 2024 at 05:55:13AM -0700, sschnug wrote:
> Dicts are more primitive, faster than numpy structures.
No way a dict is faster than numpy. Everything, and I mean everything, is performance-improving with dense numpy arrays: index computation (strides vs. hashing), cache-friendliness / compactness, cache-friendliness / locality, prefetching prediction (more linear access).
If you really see speedups by switching from numpy to dicts, I would immediately double-check my experiments. Also... check the internal node-speed.
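The disagreement hinges on the access pattern. A routing callback performs one scalar lookup per call, not a vectorized operation, and scalar indexing into a numpy array creates a numpy scalar object on every access. A quick sketch timing that pattern (sizes here are arbitrary):

```python
import timeit
import numpy as np

# arbitrary problem size
n = 200
rng = np.random.default_rng(0)
mat = rng.integers(0, 100, size=(n, n))
# equivalent dict-of-dicts holding plain Python ints
dmat = {i: {j: int(mat[i, j]) for j in range(n)} for i in range(n)}

pairs = [(i, j) for i in range(n) for j in range(n)]

def lookup_numpy():
    # one scalar index per access, like a per-arc callback would do
    return sum(int(mat[i, j]) for i, j in pairs)

def lookup_dict():
    return sum(dmat[i][j] for i, j in pairs)

# both structures hold the same data
assert lookup_numpy() == lookup_dict()

t_np = timeit.timeit(lookup_numpy, number=5)
t_d = timeit.timeit(lookup_dict, number=5)
print(f"numpy scalar lookups: {t_np:.3f}s, dict lookups: {t_d:.3f}s")
```

For vectorized whole-array operations numpy wins decisively; this sketch only measures the one-element-at-a-time pattern that a registered callback forces.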
--
James E. Marca
Activimetrics LLC
-
I also found that C++ is by far the fastest for model construction. I had one client who used C++, and the speedup was so great that I started to re-learn C++. But most of the people I work with use Python.
Did you try the matrix and vector dimensions? From my understanding, data is passed just once, when the dimensions are created.
James
…On Sun, May 19, 2024 at 07:09:10PM -0700, Rabbids wrote:
The largest performance overhead comes from callback functions. During model construction, data is constantly passed between Python and C++. From my experiments, using dictionaries is about 15% - 20% faster than using NumPy arrays.
To maximize performance, I have rewritten my code in C++. After testing with my own benchmark function, the performance of the C++ version has improved by 50-100 times.
-
Python 3.10, OR-Tools 9.8.3296
I have built a dispatching system for airport operations using Routing, and it is running well now. However, there are currently two efficiency issues bothering me.
In my system, since it involves dispatching tasks and staff, a "task" here essentially corresponds to a "node" in Routing, and a "staff" corresponds to a "vehicle" in Routing. So I will use "task" and "staff" to describe the issues.
The first issue is that I need different priorities for each task with respect to each staff, based on their position, qualifications, or other subjective reasons. I have built many dimensions similar to the following code:
For a given task, the higher a staff's value in the dimension, the higher the penalty, and the more the task resists being assigned to that staff.
But now I found that in a dispatching involving hundreds of tasks and hundreds of staffs, building such dimensions consumes the vast majority of the time. Building such a dimension takes over a dozen seconds, while data processing or other dimension building only takes a few milliseconds, which is a difference of several thousand times.
I further found that the most time-consuming part is

```python
resistance_callback_indices.append(routing.RegisterUnaryTransitCallback(partial(resistance_callback, staff_no=staff_no)))
```

where the callback function will be called C = (T + 2S)² * S times, where T is the number of tasks and S is the number of staff. Later, based on this, I added a benchmark-like feature to my system to evaluate server performance and estimate dispatching runtime, using the number of callbacks per millisecond as a metric, like the following example code:
I want to know whether any part of building this kind of dimension can be optimized.
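For a sense of scale, the call-count formula above grows quickly. A quick sketch with hypothetical sizes (the formula is the poster's; the example sizes are made up):

```python
# C = (T + 2S)^2 * S, where T is the number of tasks
# and S is the number of staff
def callback_count(T, S):
    return (T + 2 * S) ** 2 * S

# a few hundred tasks and staff already means millions of
# single-threaded Python callback invocations
print(callback_count(100, 100))  # (100 + 200)^2 * 100 = 9_000_000
print(callback_count(300, 200))  # (300 + 400)^2 * 200 = 98_000_000
```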
The second issue is that, because I'm not sure which strategies to choose, I'm combining 11 FIRST_SOLUTION_STRATEGIES with 5 LOCAL_SEARCH_STRATEGIES in pairs, resulting in 55 sets of parameters. I then create a multiprocessing pool and pass all 55 sets of data and parameters for computation via pool.starmap_async(), and finally select the solution with the smallest penalty value, solution.ObjectiveValue(), as the final result.

Throughout the computation, each subprocess runs through every step, including data preprocessing, model building, and computation; the only difference between subprocesses is the choice of strategies. As mentioned in the first issue, model building is actually the most time-consuming part. In the current production environment the SEARCH_TIME_LIMIT typically ranges from 1 to 5 seconds to obtain a decent result, yet the overall run usually takes from 30+ seconds to several minutes.
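The 55 parameter sets described above can be generated with itertools.product. A minimal sketch with placeholder strategy names (the real values are enum members of routing_enums_pb2, and `solve` and `data` below are hypothetical):

```python
from itertools import product

# placeholder strategy names standing in for the members of
# routing_enums_pb2.FirstSolutionStrategy and
# routing_enums_pb2.LocalSearchMetaheuristic
first_solution_strategies = [f"FIRST_SOLUTION_{i}" for i in range(11)]
local_search_strategies = [f"LOCAL_SEARCH_{i}" for i in range(5)]

# every pairing, one per subprocess
param_sets = list(product(first_solution_strategies, local_search_strategies))
print(len(param_sets))  # 11 * 5 = 55

# each pair would then be expanded into one worker call, e.g.:
# pool.starmap_async(solve, [(data, fss, lsm) for fss, lsm in param_sets])
```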
I once attempted to merge the data processing and model building steps into a single execution and then distribute the result to subprocesses for computation, but I couldn't accomplish it. I also tried saving a computed solution to a local file and then reading that file back to switch strategy parameters for further computation, but I couldn't achieve that either, because for some strategies routing.WriteAssignment() returns True but routing.ReadAssignment() returns None.

I want to optimize the overall process time. Can you give me some advice on the above situation and issues? Thank you.
Crossposted at https://stackoverflow.com/questions/78296596/need-advice-with-optimizing-the-running-speed-of-or-tools-routing