1. Questions
More optimization solvers opt for machine learning (ML) integration in order to take advantage of pattern recognition in the search for solutions (Bengio, Lodi, and Prouvost 2021; KarimiMamaghan et al. 2022). Online optimization poses several challenges for this integration to happen, such as the change of the problem structure across the optimization periods, plus the duration of the period or the time step, which could be very short to perform learning. In online vehicle routing problems (Jaillet and Wagner 2008), this issue can happen for instance, in case the vehicles’ locations and the customers’ requests are revealed over time. We study the online taxi problem as defined by Bertsimas, Jaillet, and Martin (2019), where the transportation requests issued from customers are revealed a number of minutes before their pickup time and their assignment to taxis has to be acknowledged within minutes before. This is a special case of the online vehicle routing problem (see Appendix A for a formal definition). Their approach manages to solve the online taxi problem for a large demand: about 6006500 customers for each reoptimization which lasts 30 seconds in the area of Manhattan. It offers then a good starting point to test the optimizationML integration in a dynamic context. Our basic idea is to design ML routines within the reoptimization in order to improve the quality of the constructed tours. We show that learning can happen even in the case of tight time steps.
2. Methods
The reoptimization of Bertsimas, Jaillet, and Martin (2019) relies on a great simplification of the working graph of the problem with the aim to make the resolution more tractable for each time step. The formal problem definition and their proposed approach are presented in Appendix A. The first ML routine that we implemented is based on the neural network structure. Because of its poor results, this method is only detailed in Appendix B.
Our second ML routine uses reinforcement learning on the working graph to construct the paths of taxis, in addition to using random paths. This routine replaces steps 2 through 7 of Algorithm 1 (see Appendix A), with random paths and paths constructed with the Qlearning algorithm (Watkins and Dayan 1992) on the KG graph. Both types of paths start from the current positions of the taxis, as shown in the example of Figure 1. Once a total of Emax arcs are chosen from KG, the MIPmaxflow is solved on this constructed graph. This procedure is done iteratively. Learning for each agent taxi is performed across the different iterations of MIPmaxflow within one time step through Qlearning. The main parameter of the routine is the percentage prl∈[0,1] of taxis that construct their paths using learning. These agents are chosen from all taxis in the first iteration in a random way, as follows:

perform random walks starting from all taxis,

solve MIPmaxflow for this selection, and

choose the prl×100 taxis which have the largest rewards,
Figure 1.An example of taxi path building on a 2G graph.
By this way, we can concentrate learning on the seemingly most favorable taxis. Random walks are taken to be nonbacktracking in this routine. The description of the steps of the routine is shown in Algorithm 1. As you may notice, no solving of maxFlow is performed. The state space of Qlearning represents the set of nodes of G: S=V. The current state of each agent taxi is its current node. If a given agent taxi is in a vertex with no outgoing neighbor, it stops exploring, otherwise, it chooses a neighbor according to the Qtable. All the agents explore their neighborhoods in a sequential way. The reward of each agent taxi is given at the end of each iteration in step number 10, by the reward computed for each taxi from the MIPmaxflow resolution. Thus, the reward of each agent taxi depends on the choices of the remaining agents. Setting a state space S=VT, where each agent accounts for all other agents’ current positions is timeconsuming. Thus, we did not account for it, given the limited time allowed for optimization in a time step.
Algorithm 1.Qlearning routine
Input: The graph G, the parameter K, a starting solution s, and the probability of agent taxis prl. 
Output: A solution for the current time step 
1: while time is available do 
2: First iteration: Fix the agent taxis T using prl; 
3: Initialize the backbone graph BG by removing all arcs KG; 
4: Add solution s to BG; 
5: while BG has less than Emax×(1−prl) arcs do 
6: For each taxi t not in T, generate a random walk on KG from t; 
7: Add those arcs to BG; 
8: end while 
9: All agents choose their actions (paths) using Qlearning to complete Emax arcs in BG; 
10: Solve MIPmaxflow on BG; 
11: Update the solution s; 
12: end while 
Concerning the implementation of the Qlearning policy, we use Monte Carlo methods to estimate the stateaction value. For the exploration strategy, we use the epsilongreedy method with a parameter ϵ=0.1.
For comparison matters, we consider the case with only random walks prl=0, denoted RWbased routine.
3. Findings
Results of Table 1 rely on the Bertsimas, Jaillet, and Martin (2019) dataset, which is based on the demand for taxis on the Manhattan network. More details on the construction of the dataset and the implementation are provided in Appendix C. We set the simulation time to 20 minutes, and Emax=2000. The maximum request times Trequestmax, the number of taxis #taxis, and the time step Treop are varied as in the table below. For each customer c, the request time trequestc is uniformly picked within the interval [tminc−Trequestmax,tminc]. The time allowed for reoptimization in each time step is equal to the half of Treop, meaning that if Treop=5min, only 2.5min is allowed to construct the solution, the remaining time is reserved to transfer the new actions to the taxis. We fix prl=10%. In the table, #nser designates the number of not served customers in each simulation.
Table 1.Results of simulation by varying #taxis, Trequestmax , and Treop.
These results show that Qlearning subroutine approach offers higher gaps of profits than the RNN routine, surpassing Bertsimas, Jaillet, and Martin (2019) approach in the following cases: i/ when the reoptimization time step is larger than 1min, but can be still considered small (5min), and ii/ when the supply of transport is large compared to the demand (5000 taxis). The explanation behind the changes of #nser in function of #taxi, Trequestmax and Treop is given in Appendix D. The data of Manhattan taxi requests are characterized by a large traffic demand and a large supply of taxis. Any improvement of only 1% of the profit gains can be translated into thousands of US dollars over the day. Compared to the random walks subroutine, the Qlearning subroutine produces higher profit gaps, suggesting that the effect of reinforcement learning is mostly positive. Figure 2 displays the average number of optimizations in a time step for the four approaches.
Figure 2.Average number of optimizations in one time step (#opt).
ACKNOWLEDGMENTS
We would like to thank unknown referees for their comments. This research was funded by the project FITS “Flexible and Intelligent Transportation Systems” (ANR18CE220014) operated by the French National Research Agency (ANR).
Supplementary Materials
A. Supplementary materials: Model
The taxi problem is a special case of the dialaride problems, where vehicles are allowed to serve only one customer at a time. In the online context, each customer c∈C has a request time trequestc, a confirmation time tconfc>trequestc, when the optimizer confirms to c whether his request is accepted or not, and a pickup time window Ic=[tminc,tmaxc].
Bertsimas, Jaillet, and Martin (2019) model the problem as a directed graph G(V,E) where the nodes represent the customers c∈C and the current positions of the taxis k∈K. Each arc (c′,c) has an associated profit Rc′,c, which is equal to the fare paid by the customer c to satisfy minus the cost of driving from the dropoff point of c′, or from the position of the taxi in case of arc (k,c) and c is the first customer visited by the taxi k. The main assumption of the model is that the working graph G has no cycle. This is made possible by deleting arcs that do not satisfy the time windows of the customers in addition to other tighter time constraints. It is important to notice that the shorter the time windows are, the fewer cycles exist in G. The objective of the optimization is to maximize the total profit of the solution. Without subtours elimination, the routing problem becomes a network maxflow problem, with integer and realvalued variables. The mixed integer programming formulation of the offline problem, i.e. trequestc=0,∀c∈C, is as follows.
maxx,y,pc,tc∑k∈K,c∈CRk,cyk,c+∑c,c′∈CRc′,cxc′,c
s.t.pc=∑k∈Kyk,c ∀c∈C
∑c∈Cxc′,c≤pc′ ∀c′∈C
∑c∈Cyk,c≤1 ∀k∈K
xc′,c∈{0,1} ∀c′,c∈C
yk,c∈{0,1} ∀k∈K,∀c∈C
pc∈{0,1} ∀c∈C
tminc≤tc≤tmaxc ∀c∈C
tc≥t′c+(tminc−tmaxc′)+(Tc′,c−(tminc−tmaxc′))xc′,c ∀c′,c∈C
tc≥tminc+(tinitk+Tk,c−tminc)yk,c ∀k∈K,∀c∈C
Tc′,c designates the travel time to serve c′ and to drive to the pickup location of c. The decision variables yk,c and xc,c′ specify the order of visit of the customers for each taxi, while pc tells if customer c is served or not, and tc provides the pickup time of c. This problem is denoted MIPmaxFlow. When the customers’ pickup times are fixed, the maxflow problem, i.e. (1)(7), becomes efficiently solvable through the simplex algorithm, thanks to some integrality results (see Theorem 1. of Bertsimas, Jaillet, and Martin (2019). We denote this subproblem maxFlow.
The approach of Bertsimas, Jaillet, and Martin (2019) to solve the online version consists of solving MIPmaxFlow in each time step for customers with known request times, or in other terms with trequestc≤kTreopt, such that Treopt is the time step duration, and kTreopt corresponds to the starting time of the current reoptimization. To scale the optimization to the realworld high demands scenario, the authors adopt the routine described in Algorithm 2. At first, the graph G containing the current customers and the current taxi positions is pruned into a graph KG, by selecting the K best incoming and outgoing arcs of each node of G in terms of lost times, which are defined as waiting times plus times of a taxi driving empty. Then, a backbone graph BG is constructed iteratively by solving maxFlow problem on uniformly generated pickup times, before solving MIPmaxflow on the constructed graph. This step is done repeatedly. The parameters Emax and K are chosen such that MIPmaxflow and maxFlow are solvable in reasonable times. More complete details of the model and the algorithm choices are found in Bertsimas, Jaillet, and Martin (2019).
Algorithm 2.Backbonebased routine
Input: The graph G, the parameter K, the limit number Emax~, and a starting solution s (could be empty) 
Output: A solution for the current time step 
1: Compute KG; 
2: while time is available do 
3: Initialize the backbone graph BG by removing all arcs of KG; 
4: while BG has less than Emax arcs do 
5: For each customer c, generate uniformly a pickup time tc from Ic or from Isc (is the time window where the solution s is propagated in Ic ); 
6: Solve maxFlow on KG with the tc pickup times of the customers; 
7: Add the optimal arcs of the solution to the graph BG; 
8: end while 
9: Solve MIPmaxflow on BG; 
10: Update the solution s; 
11: end while 
B. Supplementary materials: RNN routine
This ML routine has the goal to learn the most adequate pickup times of the customers, and then solve maxFlow problem for these times. We choose the Recurrent Neural Network (RNN) for the architecture, which is a neural network well suited to manage vector sequences over time (Alom et al. 2019).
The routine replaces steps 3 to 10 of Algorithm 2. Instead of randomly generating pickup times (step 5) and solving the maxFlow problem for these times (step 6) for the construction of the support backbone graph (stage 7), we learn the pickup times of the customers and solve maxFlow problem for these optimal times. No solving of MIPmaxflow is involved. The description of the steps is shown in Algorithm 3. The obtained solution updates the current solution similarly to Algorithm 2.
Algorithm 3.RNNbased routine
Input: The graph G, the parameter K, and a starting solution s (could be empty) 
Output: A solution for the current time step 
1: Compute KG; 
2: while time is available do 
3: while exploration time is available do 
4: For each customer c, generate uniformly a pickup time tc∈Ic or tc∈Isc; 
5: Solve maxFlow on KG with tc pickup times of customer; 
6: Add the arc profits of the solution to features X and (tc) to y; 
7: end while 
8: Train the RNN model m using X and y; 
9: Generate times by t=m([Re1,…,RE)]′); 
10: Solve maxFlow with times t; 
11: Update the solution s; 
12: end while 
For the training of the RNN, we solve maxFlow with pickup times uniformly chosen at random within customers’ time windows. We also include in the training the times corresponding to the upper bounds of the time windows. The RNN outputs the customers’ pickup times and takes for inputs the solution of the corresponding maxFlow problem, which is in our case a vector over the graph arcs such that if an arc belongs to the solution, its profit value (Rc′,c or Rk,c) is taken, otherwise zero. A scheme of RNN is shown in Figure 3. We obtain the learned pickup times of the current reoptimization by taking as input the profits of all the arcs of the graph KG. Regarding the RNN architecture, we use the sigmoid activation function, plus the cross entropy loss function and the Adam optimizer. According to our tests, tuning across other RNN hyperparameters does not improve the result of the RNN routine. We set the exploration time here (the step 3 of Algorithm 3) to two third of the time allowed for optimization in each time step.
Figure 3.Architecture of the RNN.
C. Supplementary materials: Data
The data consists of several elements. The first element is the demands of the customers, which come from the taxitrip dataset of the New York City Taxi and Limousine Commission on the area of the Manhattan network. It is constructed by filtering requests with pickup and dropoff points located in Manhattan. For each customer, tminc is set to the real pickup time, tmaxc=tminc+5min, and tconfc=tminc−4min. The request time trequestc is a variable of the simulation. Bertsimas, Jaillet, and Martin (2019) use the demands of Yellow Cabs alone on a specific day, which is 04/15/2016. We use the same inputs.
The second element is the travel times in the Manhattan network. Bertsimas, Jaillet, and Martin (2019) use the Yellow Cabs data to estimate those times. Having no access to this data, we set travel times according to OpenStreetMap (OSM) in the following way. For each road segment, we set the speed to the maximum speed given by OSM. In the following step for the simplification of the network, if two or more segments are merged, we set the speed to the minimum of the merged segments’ speeds. This allows us to considerably reduce the speed in the network since taking the maximum speed is not a reasonable assumption. The shortest path computation uses travel times computed using those speeds. For the profit computation, we use the customer’s real fare set with costs taken proportional to the trip travel times: a cost of 5$ per hour of driving and a cost of 1$ per hour of waiting, same as Bertsimas, Jaillet, and Martin (2019).
The approaches that are compared in Table 1 share the same input data for each simulation run : the same travel times and request times trequestc. In order to run Bertsimas, Jaillet, and Martin (2019)'s program, we convert their Julia source code from version 0.5 to a stable version ≥1.0.
D. Supplementary materials: Demand/supply balance
The quality of the solutions, thus the profits and the number of not served customers, changes according to the number of taxis. With 2000 taxis (and (Trequestmax,Treop)=(5,1)), the solution covers 18% more transportation requests than that of 3000 taxis (and (Trequestmax,Treop)=(5,1)), due to a betterperforming arc selection procedure for the backbone graph (steps 48 in Algorithm 2; steps 59 in Algorithm 1). This is because all problems share the same value of the maximum number of arcs Emax=2000. This latter value is taken to make the MIPmaxFlow optimization possible in a short duration. If the number of taxis increases, the selection of arcs is altered. With 5000 taxis, the excess of vehicles makes it possible to correct the lack of coverage with 4000 and 3000 taxis. There is almost always a free taxi able to serve a new customer request. Thus, the solutions’ profits become in the same range of the case of 2000 taxis. Notice that the number of not served customers increases as the time step Treop and/or the maximum request times Trequestmax increase, since more customers are present in MIPmaxFlow optimizations compared to the base case of (Trequestmax,Treop)=(5,1).