A growing number of optimization solvers integrate machine learning (ML) in order to exploit pattern recognition in the search for solutions (Bengio, Lodi, and Prouvost 2021; Karimi-Mamaghan et al. 2022). Online optimization poses several challenges for this integration, such as the change of the problem structure across optimization periods and the duration of each period or time step, which may be too short to perform learning. In online vehicle routing problems (Jaillet and Wagner 2008), this issue arises, for instance, when the vehicles’ locations and the customers’ requests are revealed over time. We study the online taxi problem as defined by Bertsimas, Jaillet, and Martin (2019), where the transportation requests issued by customers are revealed a number of minutes before their pick-up time and their assignment to taxis has to be acknowledged within a few minutes. This is a special case of the online vehicle routing problem (see Appendix A for a formal definition). Their approach manages to solve the online taxi problem for a large demand: about 600-6500 customers for each re-optimization, which lasts 30 seconds, in the area of Manhattan. It thus offers a good starting point to test the optimization-ML integration in a dynamic context. Our basic idea is to design ML routines within the re-optimization in order to improve the quality of the constructed tours. We show that learning can happen even with tight time steps.
The re-optimization of Bertsimas, Jaillet, and Martin (2019) relies on a substantial simplification of the problem’s working graph, with the aim of making the resolution within each time step more tractable. The formal problem definition and their proposed approach are presented in Appendix A. The first ML routine that we implemented is based on a neural network structure. Because of its poor results, this method is only detailed in Appendix B.
Our second ML routine uses reinforcement learning on the working graph to construct the paths of taxis, in addition to using random paths. This routine replaces steps 2 through 7 of Algorithm 1 (see Appendix A) with random paths and paths constructed with the Q-learning algorithm (Watkins and Dayan 1992) on the graph. Both types of paths start from the current positions of the taxis, as shown in the example of Figure 1. Once the required number of arcs has been chosen from the working graph, MIPmaxflow is solved on the constructed graph. This procedure is repeated iteratively. Learning for each agent taxi is performed through Q-learning across the successive iterations of MIPmaxflow within one time step. The main parameter of the routine is the percentage of taxis that construct their paths using learning. These agents are chosen from all taxis at the first iteration, in a randomized way, as follows:
perform random walks starting from all taxis,
solve MIPmaxflow for this selection, and
choose the taxis which have the largest rewards, up to the specified percentage.
In this way, we concentrate learning on the seemingly most favorable taxis. Random walks are taken to be non-backtracking in this routine. The steps of the routine are described in Algorithm 2. Note that no solving of maxFlow is performed. The state space of Q-learning is the set of nodes of the working graph: the current state of each agent taxi is its current node. If a given agent taxi is at a vertex with no outgoing neighbor, it stops exploring; otherwise, it chooses a neighbor according to its Q-table. All the agents explore their neighborhoods sequentially. The reward of each agent taxi is given at the end of each iteration, in step 10, by the reward computed for that taxi from the MIPmaxflow resolution. Thus, the reward of each agent taxi depends on the choices of the remaining agents. A state space in which each agent accounts for all other agents’ current positions would be too time-consuming to handle, so we did not adopt one, given the limited time allowed for optimization within a time step.
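To make the routine concrete, the following minimal sketch illustrates its three building blocks: non-backtracking random walks, Q-table-guided path construction, and the reward-based selection of learning agents. The graph representation, the helper names, and the function solve_mip_maxflow (standing in for the MIPmaxflow resolution) are illustrative assumptions, not the paper’s actual implementation.

```python
import random

# Illustrative sketch only: `graph` maps each node to its outgoing
# neighbors, and `solve_mip_maxflow` stands in for the MIPmaxflow
# resolution, which returns a reward for every taxi.

def non_backtracking_walk(graph, start, length):
    """Random walk that never immediately returns to the previous node."""
    path, prev = [start], None
    for _ in range(length):
        candidates = [v for v in graph.get(path[-1], []) if v != prev]
        if not candidates:              # dead end: stop exploring
            break
        prev = path[-1]
        path.append(random.choice(candidates))
    return path

def q_guided_walk(graph, q_table, start, length):
    """Path construction guided by the agent taxi's Q-table."""
    path = [start]
    for _ in range(length):
        cur = path[-1]
        neighbors = graph.get(cur, [])
        if not neighbors:               # vertex with no outgoing neighbor
            break
        # choose the neighbor with the highest estimated Q(state, action)
        path.append(max(neighbors, key=lambda v: q_table.get((cur, v), 0.0)))
    return path

def select_learning_agents(graph, taxi_positions, pct, length, solve_mip_maxflow):
    """First iteration: random walks for all taxis, then keep the taxis
    with the largest MIPmaxflow rewards as the learning agents."""
    paths = {t: non_backtracking_walk(graph, pos, length)
             for t, pos in taxi_positions.items()}
    rewards = solve_mip_maxflow(paths)  # reward per taxi from the MIP
    k = max(1, int(pct * len(taxi_positions)))
    ranked = sorted(taxi_positions, key=lambda t: rewards[t], reverse=True)
    return ranked[:k]
```

In subsequent iterations of the time step, the selected agents would construct their paths with q_guided_walk while the remaining taxis keep using non_backtracking_walk.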
Concerning the implementation of the Q-learning policy, we use Monte Carlo methods to estimate the state-action values. For the exploration strategy, we use the epsilon-greedy method with an exploration parameter ε.
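A minimal sketch of this policy could look as follows, assuming the reward arrives as a terminal return from MIPmaxflow at the end of each iteration; the value of ε and the incremental-average update are illustrative choices, not the paper’s exact settings.

```python
import random
from collections import defaultdict

EPSILON = 0.1                    # exploration parameter (assumed value)
q_table = defaultdict(float)     # Q(state, action) estimates
visits = defaultdict(int)        # visit counts for the sample average

def epsilon_greedy(state, actions):
    """Explore a random action with probability EPSILON, else exploit."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def monte_carlo_update(episode, reward):
    """At the end of an iteration, MIPmaxflow yields the taxi's reward;
    every visited (state, action) pair is updated toward that return
    with an incremental sample average."""
    for state, action in episode:
        visits[(state, action)] += 1
        n = visits[(state, action)]
        q_table[(state, action)] += (reward - q_table[(state, action)]) / n
```

Here epsilon_greedy would drive each learning agent’s neighbor choices, and monte_carlo_update would be called once per MIPmaxflow resolution with the corresponding taxi’s reward.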
For comparison purposes, we also consider the case with only random walks, denoted the RW-based routine.
The results of Table 1 rely on the Bertsimas, Jaillet, and Martin (2019) dataset, which is based on the demand for taxis on the Manhattan network. More details on the construction of the dataset and the implementation are provided in Appendix C. We set the simulation time to 20 minutes; the maximum request time, the number of taxis #taxis, and the time step are varied as in the table below. For each customer, the request time is uniformly picked within the allowed interval. The time allowed for re-optimization in each time step is equal to half of the time step, meaning that only this half is used to construct the solution; the remaining time is reserved for transferring the new actions to the taxis. The remaining parameters are fixed across all simulations. In the table, #nser designates the number of customers not served in each simulation.
These results show that the Q-learning-based routine offers higher profit gaps than the RNN routine, surpassing the Bertsimas, Jaillet, and Martin (2019) approach in the following cases: i/ when the re-optimization time step is larger than 1 min but can still be considered small (5 min), and ii/ when the supply of transport is large compared to the demand (5000 taxis). The explanation of how #nser changes as a function of #taxis and the other varied parameters is given in Appendix D. The data of Manhattan taxi requests are characterized by a large traffic demand and a large supply of taxis. Even a small improvement in the profit gains can be translated into thousands of US dollars over the day. Compared to the random-walks routine, the Q-learning routine produces higher profit gaps, suggesting that the effect of reinforcement learning is mostly positive. Figure 2 displays the average number of optimizations in a time step for the four approaches.
We would like to thank the anonymous referees for their comments. This research was funded by the project FITS - “Flexible and Intelligent Transportation Systems” (ANR-18-CE22-0014) operated by the French National Research Agency (ANR).