A Reinforcement Learning Approach for the Flexible Job Shop Scheduling Problem Yailen Mart´ınez1,2, Ann Nowe´1, Juliett Sua´rez2, and Rafael Bello2 1 CoMo Lab, Department of Computer Science, Vrije Universiteit Brussel, Belgium {ymartine,ann.nowe}@vub.ac.be 2 Department of Computer Science, Central University of Las Villas, Cuba {yailenm,jsf,rbellop}@uclv.edu.cu Abstract. In this work we present a Reinforcement Learning approach for the Flexible Job Shop Scheduling problem. The proposed approach follows the ideas of the hierarchical approaches and combines learning and optimization in order to achieve better results. Several problem in- stances were used to test the algorithm and to compare the results with those reported by previous approaches. 1 Introduction Scheduling is a scientific domain concerning the allocation of tasks to a limited set of resources over time. The goal of scheduling is to maximize (or minimize) different optimization criteria such as the makespan or the tardiness. The scien- tific community usually classifies the problems according to different characteris- tics, for example, the number of machines (one machine, parallel machines), the shop type (Job Shop, Flow Shop or Open Shop) and so on. These kind of prob- lems have captured the interest of many researchers from a number of different research communities for decades. To find a good schedule (or the best sched- ule) can be a very difficult task depending on the constraints of the problem and the environment. The Job Shop Scheduling Problem (JSSP) is one of the most popular scheduling models existing in practice, and it is also among the hard- est combinatorial optimization problems [1]. The Flexible Job Shop Scheduling Problem (FJSSP) is a generalization of the classical JSSP, where operations are not processed by a fixed machine, but there is a choice between a set of available machines that can execute it. Therefore, the FJSSP has an extra decision step besides the sequencing, the job routing. To determine the job route means to choose, for each operation, which machine will execute it from the set of available ones. Literature on flexible job shop scheduling is not rare, but approaches using learning based methods are. In the literature we find different (meta-)heuristic approaches for this problem, for example, Ant Colony Optimization [2] and Ge- netic Algorithms [3] [4]. In [5] Thomas Gabel and Martin Riedmiller suggested and analyzed the ap- plication of reinforcement learning techniques to solve the task of job shop C.A. Coello Coello (Ed.): LION 5, LNCS 6683, pp. 253–262, 2011. c© Springer-Verlag Berlin Heidelberg 2011 254 Y. Mart´ınez et al. scheduling problems. They demonstrated that interpreting and solving this kind of problems as a multi-agent learning problem is beneficial for obtaining near- optimal solutions and can very well compete with alternative solution approaches. Reinforcement Learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. For example, when training an agent to play a game, the trainer might provide a positive reward when the game is won, negative when it is lost and zero in all other states. The task of the agent is to learn from this indirect, delayed reward, to choose sequences of actions that produce the greatest cumulative reward [6]. 
In this paper we present a Reinforcement Learning approach for the FJSSP. More specifically, we adopt the assign-then-sequence rule proposed by the hierarchical approaches and combine a two-step learning algorithm with a mode optimization procedure in order to achieve better results.

The remainder of this paper is organized as follows. Section 2 introduces the problem formulation and also gives a literature review on the subject. Section 3 gives an overview of reinforcement learning, and in Section 4 the algorithm is presented, detailing what is done in each step. In Section 5 we present a computational study using some classical instances, comparing our results with some previous approaches. Some final conclusions and ideas for future work are given in Section 6.

2 Flexible Job Shop Scheduling Problem

2.1 Problem Formulation

The Flexible Job Shop Scheduling Problem consists of performing a set of n jobs J = {J1, J2, ..., Jn} on a set of m machines M = {M1, M2, ..., Mm}. Each job Ji has an ordered set of oi operations Oi = {Oi,1, Oi,2, ..., Oi,oi}. Each operation Oi,j can be performed on any machine from a subset of available ones (Mi,j ⊆ M). Executing operation Oi,j on machine Mk takes pi,j,k processing time. Operations of the same job have to respect the precedence constraints given by the operation sequence. A machine can only execute one operation at a time. An operation can only be executed on one machine and cannot leave it before the treatment is finished. There are no precedence constraints among the operations of different jobs. The problem is to assign each operation to an appropriate machine (routing problem), and then to sequence the operations on the selected machines (sequencing problem) in order to minimize the makespan, i.e., the time needed to complete all the jobs, which is defined as Cmax = max{Ci | 1 ≤ i ≤ n}, where Ci is the completion time of job Ji.

2.2 Previous Approaches

Different heuristic procedures have been developed in the last years for the FJSSP, for example, tabu search, dispatching rules, simulated annealing and genetic algorithms. According to the literature review, all these methods can be classified into two main categories, hierarchical approaches and integrated approaches, meaning that there are two different ways to deal with the problem.

The hierarchical approaches are based on the idea of decomposing the original problem in order to reduce its complexity. A typical decomposition is "assign then sequence", meaning that the assignment of operations to machines and the sequencing of the operations on the resources are treated separately. Once the assignment is done (each operation has a machine assigned to execute it), the resulting sequencing problem is a classical JSSP. This approach is followed by Brandimarte [7], who was the first to use decomposition for the FJSSP; Kacem [4] and Pezzella [3] also followed this idea in the implementation of Genetic Algorithms. Integrated approaches consider assignment and sequencing at the same time. The methods following this type of approach usually give better results, but they are also more difficult to implement.
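Before moving on to dispatching rules, the following Python sketch makes the formulation of Section 2.1 concrete: it encodes a small FJSSP instance (the two-job instance later used in Section 4.2) and computes the makespan of a given schedule. The dictionary layout, the chosen start times and the names instance, schedule and makespan are our own illustration, not part of the original paper.

# A minimal FJSSP encoding: per job, an ordered list of operations, and per
# operation a dictionary {machine: processing time} of the available alternatives.
instance = {
    "J0": [{"M0": 10, "M1": 15},            # J0O0
           {"M1": 12, "M2": 18}],           # J0O1
    "J1": [{"M0": 20, "M2": 25},            # J1O0
           {"M0": 25, "M1": 18},            # J1O1
           {"M1": 15, "M2": 25}],           # J1O2
}

# A schedule fixes, for every operation, a start time and a machine, respecting
# job precedence and machine capacity (the start times here are illustrative).
schedule = {
    ("J0", 0): (0, "M0"), ("J0", 1): (10, "M1"),
    ("J1", 0): (10, "M0"), ("J1", 1): (30, "M0"), ("J1", 2): (55, "M2"),
}

def makespan(instance, schedule):
    # Cmax = max completion time over all operations (Section 2.1).
    return max(start + instance[job][idx][machine]
               for (job, idx), (start, machine) in schedule.items())

print(makespan(instance, schedule))   # 80 for the schedule above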
2.3 Dispatching Rules

As mentioned above, the complexity of the FJSSP gives rise to the search for heuristic algorithms able to provide good solutions. Dispatching rules are among the most frequently applied heuristics due to their ease of implementation and low time complexity.

A dispatching rule is a sequencing strategy by which a priority is assigned to each job waiting to be executed on a specific machine. Whenever a machine is available, a priority-based dispatching rule inspects the waiting jobs and the one with the highest priority is selected to be processed next [8]. Some of the most used dispatching rules are:

– Shortest Processing Time (SPT): The highest priority is given to the waiting operation with the shortest processing time.
– First In First Out (FIFO): The operation that arrived at the queue first receives the highest priority.
– Most Work Remaining (MWKR): The highest priority is given to the operation belonging to the job with the most total processing time remaining to be done.
– Earliest Due Date (EDD): The job due out first is processed first.

There are also composite dispatching rules (CDR), which combine single dispatching rules, and results have shown that a careful combination can perform better in terms of quality.

3 Reinforcement Learning

Reinforcement Learning (RL) is a technique that allows an agent to learn how to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trial-and-error. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of RL [9].

In the standard RL paradigm, an agent is connected to its environment via perception and action, as depicted in Figure 1. In each step of interaction, the agent senses the current state s of its environment, and then selects an action a which may change this state. The action generates a reinforcement signal r, which is received by the agent. The task of the agent is to learn a policy for choosing actions in each state so that the maximal long-run cumulative reward is received.

Fig. 1. The Reinforcement Learning Paradigm

One of the challenges that arise in RL is the trade-off between exploration and exploitation. To obtain a high reward, an RL agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. Therefore the agent must sample the available actions sufficiently and progressively favor those that appear to be best.

Some previous works showed the effectiveness of the Q-Learning algorithm in the solution of scheduling problems, more specifically the Job Shop Scheduling Problem [10] and the Parallel Machines Job Shop Scheduling Problem [11]; that is why Q-Learning was chosen among the different existing algorithms to solve the Flexible Job Shop Scheduling Problem.

3.1 Q-Learning

A well-known reinforcement learning algorithm is Q-Learning [12], which works by learning an action-value function that expresses the expected utility (i.e. cumulative reward) of taking a given action in a given state.

The core of the algorithm is a simple value iteration update: each state-action pair (s, a) has an associated Q-value. When action a is selected by the agent located in state s, the Q-value for that state-action pair is updated based on the reward received when selecting that action and the best Q-value for the subsequent state s′. The update rule for the state-action pair (s, a) is the following:

Q(s, a) = Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]    (1)

In this expression, α ∈ [0, 1] is the learning rate and r the reward or penalty resulting from taking action a in state s. The learning rate α determines the degree by which the old value is updated. For example, if the learning rate is 0, then nothing is updated at all. If, on the other hand, α = 1, then the old value is replaced by the new estimate. Usually a small value is chosen for the learning rate, for example, α = 0.1. The discount factor γ also ranges from 0 to 1 (γ ∈ [0, 1]). If γ is closer to zero, the agent will tend to consider only immediate reward. If γ is closer to one, the agent will consider future reward with greater weight.
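As an illustration of Eq. (1), the short Python sketch below implements a tabular Q-update together with an epsilon-greedy selection rule (the policy used later in Section 4). It is a minimal sketch under our own state and action encoding, not the authors' implementation; the parameter values simply reuse those reported in Sections 3.1 and 5.2.

import random
from collections import defaultdict

Q = defaultdict(float)                 # Q-values, 0 by default for unseen (state, action) pairs
alpha, gamma, epsilon = 0.1, 0.8, 0.1  # learning rate, discount factor, exploration rate

def q_update(s, a, r, s_next, next_actions):
    # One application of Eq. (1) to the pair (s, a).
    best_next = max((Q[(s_next, b)] for b in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def choose_action(s, actions):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])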
4 The Proposed Approach: Learning / Optimization

The Learning/Optimization method is an offline scheduling approach divided in two steps. First, a two-stage learning method is applied to obtain feasible schedules, which are then used as initial data for the mode optimization procedure [13] applied during the second step.

The learning method decomposes the problem following the assign-then-sequence approach. Therefore, there are two learning phases: during the first phase operations learn which machine is the most suitable, and during the second phase machines learn in which order to execute the operations in order to minimize the makespan. For this, each phase has an associated Q-Learning algorithm, and different dispatching rules are taken into account when giving rewards to the agents. As the process is divided in two, we take into account the goal of each phase in order to decide where to place the agents and which are the possible actions.

In the first phase, where the learning takes care of the routing, we have an agent per operation, responsible for choosing a proper machine to execute the corresponding operation. This machine is selected from the given set of available ones, and the selection is based on the processing time of the operation on the machine and also on the workload of the machine so far. It would also be possible to have an agent per job, which would be responsible for selecting a proper machine for each of its operations. This is not the case for the second phase, where the learning algorithm takes care of the sequencing and each operation already knows where it has to be executed. The main idea here is to decide the order in which the operations will be processed on the machines; that is why in this phase we placed the agents on the different resources, and for these agents an action is to choose an operation from the queue of operations waiting at the corresponding resource.

To start the algorithm, every job releases its first operation at time 0. All these operations go to the machine they have assigned and start to be processed; if two or more operations go to the same machine, then only one of them is selected and the rest remain in the queue until the machine is available again. A sketch of the routing phase is given below.
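The sketch below illustrates the routing phase just described: one agent per operation chooses a machine epsilon-greedily from its Q-values. The paper does not spell out the exact reward used in this phase, so the penalty applied here (processing time plus the machine's accumulated workload) and the stateless form of the update are assumptions made purely for illustration; route_operations and its arguments are hypothetical names.

import random

def route_operations(instance, Q, alpha=0.1, epsilon=0.1):
    # Phase 1 (routing): each operation agent picks a machine from its set of
    # available ones, balancing exploration and exploitation via epsilon-greedy.
    workload = {}        # accumulated processing time assigned to each machine so far
    assignment = {}      # chosen machine per operation
    for job, operations in instance.items():
        for idx, alternatives in enumerate(operations):
            op, machines = (job, idx), list(alternatives)
            if random.random() < epsilon:                       # explore
                m = random.choice(machines)
            else:                                               # exploit current estimates
                m = max(machines, key=lambda k: Q.get((op, k), 0.0))
            # Assumed reward: penalise long processing times and heavily loaded machines.
            reward = -(alternatives[m] + workload.get(m, 0))
            Q[(op, m)] = Q.get((op, m), 0.0) + alpha * (reward - Q.get((op, m), 0.0))
            assignment[op] = m
            workload[m] = workload.get(m, 0) + alternatives[m]
    return assignment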
To choose the next action, a machine agent takes into account the Q-values associated with the operations that can be executed at that time step on the corresponding machine. According to the epsilon-greedy policy, in order to balance exploration and exploitation, the agent has a small probability of selecting an action at random, and a higher probability of selecting the best action, in this case the operation with the highest associated Q-value. In this step, the dispatching rule taken into account to give reward to the agent is the Shortest Processing Time (SPT).

Once a feasible schedule is obtained, the mode optimization procedure is executed, which we refer to as the second step. This is a forward-backward procedure which tries to shift the schedule to the left in order to minimize the makespan. It has the following steps:

– Order the operations according to their end times (the times at which they ended in the schedule received as input).
– Taking into account the previous ordering, for each operation, choose the machine that will finish it first (shortest end time, not shortest processing time). The result is a backward schedule.
– Repeat steps 1 and 2 to obtain a forward schedule.

Once the mode optimization is executed, the quality of the solution is taken into account to give feedback to the agents of the learning phases.

4.1 Pseudo-code of the Algorithm

Step 1 - Learning
  Phase 1 - Routing
    For each operation
      Choose a machine
  Phase 2 - Sequencing
    While there are operations to execute
      For each machine with operations in the queue
        Choose operation to execute
      Update Queues of the System
Step 2 - Execute the Mode Optimization Procedure

4.2 Example

Assume that we have a small instance with 2 jobs and 3 machines, where the first job (J0) has 2 operations and the second job (J1) has 3 operations. These operations can be executed by the following sets of machines, where each pair represents a possible machine and the corresponding processing time:

J0O0: (M0, 10), (M1, 15)
J0O1: (M1, 12), (M2, 18)
J1O0: (M0, 20), (M2, 25)
J1O1: (M0, 25), (M1, 18)
J1O2: (M1, 15), (M2, 25)

As mentioned in the description of the algorithm, the first learning phase takes care of the routing, meaning that the first step is to choose an appropriate machine for each operation. Let us say that after executing the first phase the resulting assignment is the following: J0O0−M0, J0O1−M1, J1O0−M0, J1O1−M0 and J1O2−M2. A possible schedule for this operation-machine assignment is shown in Figure 2.

Fig. 2. Schedule

Applying the Mode Optimization Procedure to this schedule, the first step is to order the operations according to their end time, which gives the following ordering: J1O2, J1O1, J1O0, J0O1 and J0O0. Taking this ordering into account, each operation chooses a machine, basing the decision on the possible end time. For example, J1O2 can choose between going to M1 for 15 time steps or to M2 for 25 time steps; obviously the best choice is M1, meaning that M1 will be busy between time 0 and 15.

Then the next operation on the ordered list makes a choice. In this case J1O1 can choose between M0 for 25 time steps and M1 for 18; as this is the second operation of J1, it cannot start until the previous one is finished, so the starting time will be 15. The possible end times are 40 and 33, the best choice being M1, which will be occupied from 15 to 33.

When an operation from another job has to choose a machine, it has to respect these busy times, but it can search for an available slot of the size of the time it requires. A simplified sketch of one pass of the mode optimization procedure is given below.
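The sketch below reproduces the machine re-selection step of the walk-through above: operations are visited in the given order and each one picks the machine on which it can finish earliest, tracking only when each machine becomes free and when the job's previously placed operation ends. Searching for idle slots, as mentioned in the last sentence, is deliberately left out, so this is a simplified illustration rather than the full procedure; mode_pass is a hypothetical name.

instance = {                       # the two-job instance from the example above
    "J0": [{"M0": 10, "M1": 15}, {"M1": 12, "M2": 18}],
    "J1": [{"M0": 20, "M2": 25}, {"M0": 25, "M1": 18}, {"M1": 15, "M2": 25}],
}

def mode_pass(instance, order):
    # One pass over the operations in the given order: each operation chooses the
    # machine with the shortest possible end time (not the shortest processing time).
    machine_free = {}              # earliest time each machine is available again
    job_ready = {}                 # end time of the previously placed operation of the job
    placement = {}
    for job, idx in order:
        alternatives = instance[job][idx]               # {machine: processing time}
        best = None
        for m, p in alternatives.items():
            start = max(job_ready.get(job, 0), machine_free.get(m, 0))
            if best is None or start + p < best[2]:
                best = (m, start, start + p)
        m, start, end = best
        placement[(job, idx)] = (m, start, end)
        machine_free[m], job_ready[job] = end, end
    return placement

# Ordering by end time as in the example: J1O2, J1O1, J1O0, J0O1, J0O0.
order = [("J1", 2), ("J1", 1), ("J1", 0), ("J0", 1), ("J0", 0)]
print(mode_pass(instance, order))  # J1O2 lands on M1 (0-15), J1O1 on M1 (15-33), ...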
5 Experimental Results

5.1 Instances

The approach proposed in this paper was tested on a set of instances from the literature [7]. The results shown below are those obtained for the set of Brandimarte instances. This dataset consists of 10 problems (Mk01-Mk10) with a number of jobs ranging from 10 to 20 and a number of machines ranging from 4 to 15 (Table 1).

Table 1. Brandimarte Instances

Instance  Jobs  Machines  Lower Bound
Mk01      10    6         36
Mk02      10    6         24
Mk03      15    8         204
Mk04      15    8         48
Mk05      15    4         168
Mk06      10    15        33
Mk07      20    5         133
Mk08      20    10        523
Mk09      20    10        299
Mk10      20    15        165

5.2 Parameters

Different parameter settings were studied before deciding which combination to use for the final experiments. The parameters involved in this study were the discount factor (λ) and epsilon (ε). The different combinations involved the following sets of values: λ = {0.8, 0.85, 0.9} and ε = {0.01, 0.1, 0.15, 0.2}. After analyzing all the possibilities, the best setting was picked, which turned out to be λ = 0.8 and ε = 0.1, together with a learning rate α = 0.1. The algorithm was executed for 1000 iterations.

5.3 Comparative Study

Table 2 shows a comparative study between the proposed approach and some results already reported. LB is the lower bound for each instance, taken from the original Brandimarte data. The algorithms used to compare our method (QL) against are:

– GA: Genetic Algorithm [3], an algorithm integrating different strategies for generating the initial population, selecting the individuals for reproduction and reproducing new individuals.
– ACO: Ant Colony Optimization [2], which provides an effective integration between the Ant Colony Optimization model and a knowledge model.
– GEN: Abbreviation of GENACE, an architecture proposed in [14] where an effective integration between evolution and learning within a random search process is proposed.
– Brand: Tabu Search [7], a hierarchical algorithm for flexible job shop scheduling based on the tabu search metaheuristic.

Table 2. Experimental Results

Inst.  LB   GA   ACO  GEN  Brand  QL
Mk01   36   40   39   40   42     40
Mk02   24   26   29   29   32     26
Mk03   204  204  204  204  211    204

Table 3 shows the mean relative errors in % (MRE) of the different approaches used to compare our method with respect to the best-known lower bound. The relative error (RE) is defined as RE = [(MK − LB)/LB × 100]%, where MK is the makespan obtained by the corresponding approach.
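As a quick check of this definition, the small helper below (an illustration only; relative_error is our own name) evaluates the relative error of the QL column for the Table 2 rows listed above.

def relative_error(mk, lb):
    # RE = (MK - LB) / LB * 100, in percent.
    return (mk - lb) / lb * 100

# Lower bounds and QL makespans taken from Table 2:
for inst, lb, ql in [("Mk01", 36, 40), ("Mk02", 24, 26), ("Mk03", 204, 204)]:
    print(inst, round(relative_error(ql, lb), 2))   # 11.11, 8.33, 0.0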