A Reinforcement Learning Approach for the
Flexible Job Shop Scheduling Problem
Yailen Martínez1,2, Ann Nowé1, Juliett Suárez2, and Rafael Bello2
1 CoMo Lab, Department of Computer Science, Vrije Universiteit Brussel, Belgium
{ymartine,ann.nowe}@vub.ac.be
2 Department of Computer Science, Central University of Las Villas, Cuba
{yailenm,jsf,rbellop}@uclv.edu.cu
Abstract. In this work we present a Reinforcement Learning approach
for the Flexible Job Shop Scheduling problem. The proposed approach
follows the ideas of the hierarchical approaches and combines learning
and optimization in order to achieve better results. Several problem in-
stances were used to test the algorithm and to compare the results with
those reported by previous approaches.
1 Introduction
Scheduling is a scientific domain concerning the allocation of tasks to a limited
set of resources over time. The goal of scheduling is to optimize one or more
criteria, such as minimizing the makespan or the tardiness. The scien-
tific community usually classifies the problems according to different characteris-
tics, for example, the number of machines (one machine, parallel machines), the
shop type (Job Shop, Flow Shop or Open Shop) and so on. These kinds of problems
have captured the interest of many researchers from a number of different
research communities for decades. Finding a good schedule (or the best one)
can be a very difficult task depending on the constraints of the problem and
the environment. The Job Shop Scheduling Problem (JSSP) is one of the most
popular scheduling models existing in practice, and it is also among the hard-
est combinatorial optimization problems [1]. The Flexible Job Shop Scheduling
Problem (FJSSP) is a generalization of the classical JSSP in which each operation
is not bound to a fixed machine but can be processed by any machine from a given
set of available ones. Therefore, the FJSSP has an extra decision step
besides the sequencing: the job routing. Determining the job route means
choosing, for each operation, which of the available machines will execute it.
Literature on flexible job shop scheduling is not rare, but approaches using
learning-based methods are. In the literature we find different (meta-)heuristic
approaches for this problem, for example, Ant Colony Optimization [2] and
Genetic Algorithms [3,4].
In [5], Thomas Gabel and Martin Riedmiller suggested and analyzed the ap-
plication of reinforcement learning techniques to job shop
scheduling problems. They demonstrated that interpreting and solving this kind
of problem as a multi-agent learning problem is beneficial for obtaining near-
optimal solutions and can compete very well with alternative solution approaches.
Reinforcement Learning is the problem faced by an agent that must learn
behavior through trial-and-error interactions with a dynamic environment. Each
time the agent performs an action in its environment, a trainer may provide a
reward or penalty to indicate the desirability of the resulting state. For example,
when training an agent to play a game, the trainer might provide a positive
reward when the game is won, negative when it is lost and zero in all other
states. The task of the agent is to learn from this indirect, delayed reward, to
choose sequences of actions that produce the greatest cumulative reward [6].
In this paper we present a Reinforcement Learning approach for the FJSSP.
More specifically, we adopt the assign-then-sequence rule proposed by the hi-
erarchical approaches and combine a two-step learning algorithm with a mode
optimization procedure in order to achieve better results.
The remainder of this paper is organized as follows. Section 2 introduces
the problem formulation and a literature review on the subject is also given.
Section 3 gives an overview of reinforcement learning and in Section 4 the
algorithm is presented detailing what is done in each step. In Section 5 we present
a computational study using some classical instances, comparing our results with
some previous approaches. Some final conclusions and ideas for future work are
given in Section 6.
2 Flexible Job Shop Scheduling Problem
2.1 Problem Formulation
The Flexible Job Shop Scheduling Problem consists of performing a set of n
jobs J = {J1, J2, . . . , Jn} on a set of m machines M = {M1,M2, . . . ,Mm}. Each
job Ji has an ordered set of oi operations Oi = {Oi,1, Oi,2, . . . , Oi,oi}. Each
operation Oi,j can be performed on any machine of a given subset of available
machines (Mi,j ⊆ M). Executing operation Oi,j on machine Mk takes pi,j,k processing
time. Operations of the same job have to respect the precedence constraints given
by the operation sequence. A machine can only execute one operation at a time.
An operation can only be executed on one machine and cannot leave it before its
processing is finished. There are no precedence constraints among the operations
of different jobs. The problem is to assign each operation to an appropriate
machine (routing problem), and then to sequence the operations in the selected
machines (sequencing problem) in order to minimize the makespan, i.e., the time
needed to complete all the jobs, which is defined as Cmax = max{Ci | 1 ≤ i ≤ n},
where Ci is the completion time of job Ji.
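To make the formulation concrete, the following sketch (Python, purely illustrative and not part of the original implementation) encodes the small two-job instance used later in Section 4.2 and computes the makespan of a schedule under the precedence and machine-capacity constraints described above. It assumes that the per-machine sequences passed in are consistent with the job precedences.

from collections import defaultdict

# Illustrative FJSSP instance (the 2-job / 3-machine example of Section 4.2):
# proc[(i, j)][k] = p_{i,j,k}, processing time of operation O_{i,j} on machine M_k.
proc = {
    (0, 0): {0: 10, 1: 15},
    (0, 1): {1: 12, 2: 18},
    (1, 0): {0: 20, 2: 25},
    (1, 1): {0: 25, 1: 18},
    (1, 2): {1: 15, 2: 25},
}

def makespan(sequences, proc):
    """Cmax of a schedule given, for every machine, the order in which it
    processes the operations routed to it (the routing is implicit in which
    machine's sequence each operation appears in)."""
    machine_free = defaultdict(int)   # time at which each machine becomes idle
    end = {}                          # completion time of every scheduled operation
    pending = {m: list(ops) for m, ops in sequences.items()}
    while any(pending.values()):
        progressed = False
        for m, ops in pending.items():
            if not ops:
                continue
            i, j = ops[0]
            if j > 0 and (i, j - 1) not in end:
                continue              # job predecessor not finished yet
            start = max(machine_free[m], end.get((i, j - 1), 0))
            end[(i, j)] = start + proc[(i, j)][m]
            machine_free[m] = end[(i, j)]
            ops.pop(0)
            progressed = True
        if not progressed:
            raise ValueError("sequences violate the precedence constraints")
    return max(end.values())

# Routing J0O0-M0, J0O1-M1, J1O0-M0, J1O1-M0, J1O2-M2 (as in Section 4.2),
# each machine processing its queue in the listed order:
sequences = {0: [(0, 0), (1, 0), (1, 1)], 1: [(0, 1)], 2: [(1, 2)]}
print(makespan(sequences, proc))      # prints 80 for this routing and ordering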
2.2 Previous Approaches
Different heuristic procedures have been developed in recent years for the
FJSSP, for example, tabu search, dispatching rules, simulated annealing and
genetic algorithms. According to the literature review, all these methods can
be classified into two main categories: hierarchical approaches and integrated
approaches, meaning that we have two different ways to deal with the problem.
The hierarchical approaches are based on the idea of decomposing the original
problem in order to reduce its complexity. A typical decomposition is “assign
then sequence”, meaning that the assignment of operations to machines and the
sequencing of the operations on the resources are treated separately. Once the
assignment is done (each operation has a machine assigned to execute it), the
resulting sequencing problem is a classical JSSP. This approach is followed by
Brandimarte [7], who was the first to use decomposition for the FJSSP; Kacem
[4] and Pezzella [3] also followed this idea in their Genetic Algorithm
implementations.
Integrated approaches consider assignment and sequencing at the same time.
The methods following this type of approach usually give better results but they
are also more difficult to implement.
2.3 Dispatching Rules
As mentioned above, the complexity of the FJSSP gives rise to the search for
heuristic algorithms able to provide good solutions. Dispatching rules are among
the most frequently applied heuristics due to their ease of implementation and
low time complexity.
A dispatching rule is a sequencing strategy by which a priority is assigned to
each job waiting to be executed on a specific machine. Whenever a machine is
available, a priority-based dispatching rule inspects the waiting jobs and the one
with the highest priority is selected to be processed next [8]. Some of the most
used dispatching rules are:
– Shortest Processing Time (SPT): The highest priority is given to the waiting
operation with the shortest processing time.
– First In First Out (FIFO): The operation that arrived at the queue first
receives the highest priority.
– Most Work Remaining (MWKR): Highest priority is given to the operation
belonging to the job with the most total processing time remaining to be
done.
– Earliest Due Date (EDD): The job with the earliest due date receives the
highest priority.
There are also composite dispatching rules (CDRs), which combine single
dispatching rules; results have shown that a careful combination can perform
better in terms of solution quality.
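As an illustration of how such rules can be implemented (a minimal Python sketch, not taken from the paper; it reuses the proc[(i, j)][k] encoding of processing times sketched in Section 2.1):

def spt_dispatch(waiting_ops, machine, proc):
    """Shortest Processing Time: among the operations waiting at `machine`,
    return the one with the smallest processing time on that machine."""
    return min(waiting_ops, key=lambda op: proc[op][machine])

def fifo_dispatch(waiting_ops, machine, proc):
    """First In First Out: queues are kept in arrival order, so the operation
    that arrived first is simply the one at the front of the queue."""
    return waiting_ops[0]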
3 Reinforcement Learning
Reinforcement Learning (RL) is a technique that allows an agent to learn how
to maximize a numerical reward signal. The learner is not told which actions
to take, as in most forms of machine learning, but instead must discover which
actions yield the most reward by trial-and-error. In the most interesting and
challenging cases, actions may affect not only the immediate reward but also
the next situation and, through that, all subsequent rewards. These two charac-
teristics, trial-and-error search and delayed reward, are the two most important
distinguishing features of RL [9].
In the standard RL paradigm, an agent is connected to its environment via
perception and action, as depicted in Figure 1. In each step of interaction, the
agent senses the current state s of its environment, and then selects an action
a which may change this state. The action generates a reinforcement signal r,
which is received by the agent. The task of the agent is to learn a policy for
choosing actions in each state so that the maximal long-run cumulative reward
is received.
Fig. 1. The Reinforcement Learning Paradigm
One of the challenges that arise in RL is the trade-off between exploration
and exploitation. To obtain a high reward, an RL agent must prefer actions that
it has tried in the past and found to be effective in producing reward. But to
discover such actions, it has to try actions that it has not selected before. The
agent has to exploit what it already knows in order to obtain reward, but it
also has to explore in order to make better action selections in the future. The
dilemma is that neither exploration nor exploitation can be pursued exclusively
without failing at the task. Therefore the agent must sample the available actions
sufficiently and progressively favor those that appear to be best.
Previous works showed the effectiveness of the Q-Learning algorithm for
scheduling problems, more specifically the Job Shop Scheduling Problem [10]
and the Parallel Machines Job Shop Scheduling Problem [11]; that is why
Q-Learning was chosen among the existing algorithms to solve the Flexible Job
Shop Scheduling Problem.
3.1 Q-Learning
A well-known reinforcement learning algorithm is Q-Learning [12], which works
by learning an action-value function that expresses the expected utility (i.e.
cumulative reward) of taking a given action in a given state.
The core of the algorithm is a simple value iteration update; each state-action
pair (s, a) has an associated Q-value. When action a is selected by the agent
located in state s, the Q-value for that state-action pair is updated based on the
reward received when selecting that action and the best Q-value for the subse-
quent state s′. The update rule for the state-action pair (s, a) is the following:

Q(s, a) = Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)] (1)
In this expression, α ∈ [0, 1] is the learning rate and r the reward or penalty
resulting from taking action a in state s. The learning rate α determines the
degree by which the old value is updated. For example, if the learning rate is 0,
then nothing is updated at all. If, on the other hand, α = 1, then the old value
is replaced by the new estimate. Usually a small value is chosen for the learning
rate, for example, α = 0.1. The discount factor γ also takes values between 0
and 1 (γ ∈ [0, 1]). If γ is closer to zero, the agent will tend to consider only
immediate reward. If γ is closer to one, the agent will consider future reward
with greater weight.
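A minimal sketch of this update, assuming a dictionary-based Q-table; the default values α = 0.1 and γ = 0.8 follow the parameter setting reported in Section 5.2, and the function is illustrative rather than the authors' code:

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.8):
    """Apply the Q-Learning update of Equation (1) to the pair (s, a).
    Q maps (state, action) pairs to values; unseen pairs default to 0."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))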
4 The Proposed Approach: Learning / Optimization
The Learning/Optimization method is an offline scheduling approach divided into
two steps. First, a two-stage learning method is applied to obtain feasible sched-
ules, which are then used as initial data for the mode optimization procedure
[13] executed in the second step.
The learning method implemented decomposes the problem following the
assign-then-sequence approach. Therefore, there are two learning phases: during
the first phase the operations learn which machine is the most suitable, and
during the second phase the machines learn in which order to execute the
operations so as to minimize the makespan. Each phase has an associated
Q-Learning algorithm, and different dispatching rules are taken into account
when giving rewards to the agents.

Since the process is divided in two, the goal of each phase determines where to
place the agents and what their possible actions are. In the first phase, where
the learning takes care of the routing, there is one agent per operation,
responsible for choosing a proper machine to execute the corresponding
operation. This machine is selected from the given set of available ones, and
the selection is based on the processing time of the operation on the machine
and on the workload of the machine so far. It would also be possible to have an
agent per job, responsible for selecting a proper machine for each of its
operations. The second phase is different: there the learning algorithm takes
care of the sequencing, and each operation already knows where it has to be
executed, so the main idea is to decide the order in which the operations will
be processed on the machines. That is why in this phase the agents are placed
on the different resources, and for these agents an action consists of choosing
an operation from the queue of operations waiting at the corresponding resource.
To start the algorithm, every job releases its first operation at time 0. All
these operations go to the machine they have been assigned to and start being
processed; if two or more operations go to the same machine, only one of them is
selected and the rest remain in the queue until the machine is available again.
To choose the next action, the agent takes into account the Q-values associated
with the operations that can be executed at that time step on the corresponding
machine. Following an epsilon-greedy policy, which balances exploration and
exploitation, the agent selects an action at random with a small probability
and, with a higher probability, selects the best action, in this case the
operation with the highest associated Q-value. In this step the dispatching rule
taken into account to give the reward to the agent is the Shortest Processing
Time (SPT) rule.
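A sketch of this epsilon-greedy selection for a sequencing agent follows. It is illustrative only: indexing the Q-values by (machine state, operation) pairs is an assumption, since the paper does not detail the state encoding; the default ε = 0.1 is the value used in Section 5.2.

import random

def epsilon_greedy_operation(Q, state, waiting_ops, epsilon=0.1):
    """With probability epsilon pick a waiting operation at random (exploration),
    otherwise pick the operation with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(waiting_ops)
    return max(waiting_ops, key=lambda op: Q.get((state, op), 0.0))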
Once a feasible schedule is obtained, the mode optimization procedure is exe-
cuted, which we refer to as the second step. This is a forward-backward procedure
that tries to shift the schedule to the left in order to minimize the makespan;
it has the following steps:
– Order the operations according to their end times (the times at which they
finish in the schedule received as input).
– Taking into account the previous ordering, for each operation, choose the
machine that will finish it first (shortest end time, not shortest processing
time). The result is a backward schedule.
– Repeat steps 1 and 2 to obtain a forward schedule.
Once the mode optimization is executed, the quality of the solution is taken
into account to give feedback to the agents of the learning phases.
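The following sketch illustrates one pass of such a forward-backward procedure under simplifying assumptions: operations are always appended after the last operation already placed on a machine (idle slots are not reused, unlike the slot search mentioned in Section 4.2), and pred_of gives, for every operation, the operation whose completion it must wait for (its job successor in the backward pass, its job predecessor in the forward pass). It illustrates the idea and is not the exact procedure of [13].

def rebuild_pass(order, proc, pred_of):
    """Reassign the operations in the given order; each one picks the machine
    on which it would finish earliest, never starting before the operation it
    depends on (pred_of[op]) has ended. Returns the new completion times."""
    machine_free = {}   # time at which each machine becomes idle
    end = {}            # new completion time of every operation placed so far
    for op in order:
        ready = end.get(pred_of.get(op), 0)
        # completion time if op were placed on machine m
        finish = lambda m: max(machine_free.get(m, 0), ready) + proc[op][m]
        best = min(proc[op], key=finish)
        end[op] = finish(best)
        machine_free[best] = end[op]
    return end

In the backward pass, order contains the operations sorted by decreasing end time of the input schedule, as in the worked example of Section 4.2; the forward pass then repeats this with the ordering and dependencies reversed again.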
4.1 Pseudo-code of the Algorithm
Step 1 - Learning
Phase 1 - Routing
For each operation - Choose a machine
Phase 2 - Sequencing
While there are operations to execute
For each machine with operations in the queue
Choose operation to execute
Update Queues of the System
Step 2 - Execute the Mode Optimization Procedure
4.2 Example
Assume we have a small instance with 2 jobs and 3 machines, where job J0 has 2
operations and job J1 has 3 operations. These operations can be executed by the
following sets of machines, where each pair denotes a possible machine and the
corresponding processing time:
J0O0: {(M0, 10), (M1, 15)}
J0O1: {(M1, 12), (M2, 18)}
J1O0: {(M0, 20), (M2, 25)}
J1O1: {(M0, 25), (M1, 18)}
J1O2: {(M1, 15), (M2, 25)}
As mentioned in the description of the algorithm, the first learning phase
takes care of the routing, meaning that the first step is to choose an appropriate
machine for each operation. Let’s say that after executing the first phase the
resulting assignment is the following: J0O0−M0, J0O1−M1, J1O0−M0, J1O1−
M0 and J1O2−M2. A possible schedule for this operation-machine assignment is
shown in Figure 2. Applying the Mode Optimization Procedure to this schedule,
the first step is to order the operations according to their end times, which
gives the following ordering: J1O2, J1O1, J1O0, J0O1 and J0O0. Following this
ordering, each operation chooses a machine to execute it, basing the decision
on the possible end time.
For example, J1O2 can choose between going to M1 for 15 time steps or to M2 for
25 time steps; obviously the best choice is M1, meaning that M1 will be busy
between times 0 and 15.
Fig. 2. Schedule
Then the next operation on the ordered list makes a choice. In this case, J1O1
can choose between M0 for 25 time steps and M1 for 18; as it belongs to the same
job as J1O2, which has already been placed, it cannot start until that operation
is finished, so its starting time will be 15. The possible end times are then 40
and 33, making M1 the best choice, and M1 will be occupied from 15 to 33. When an
operation from another job has to choose a machine, it has to respect these busy
times, but it can search for an available slot of the length it requires.
5 Experimental Results
5.1 Instances
The approach proposed in this paper was tested on a set of instances from the
literature [7]. The results shown below are those obtained for the set of Brandimarte
Table 1. Brandimarte Instances
Instance Jobs Machines Lower Bound
Mk01 10 6 36
Mk02 10 6 24
Mk03 15 8 204
Mk04 15 8 48
Mk05 15 4 168
Mk06 10 15 33
Mk07 20 5 133
Mk08 20 10 523
Mk09 20 10 299
Mk10 20 15 165
instances. This dataset consists of 10 problems (Mk01-Mk10), with the number of
jobs ranging from 10 to 20 and the number of machines ranging from 4 to 15 (Table 1).
5.2 Parameters
Different parameter settings were studied before deciding which combination
to use for the final experiments. The parameters involved in this study were
the discount factor (γ) and epsilon (ε). The combinations involved the
following sets of values: γ = {0.8, 0.85, 0.9} and ε = {0.01, 0.1, 0.15, 0.2}.
After analyzing all the possibilities, the best setting was picked, which turned
out to be γ = 0.8 and ε = 0.1, together with a learning rate α = 0.1. The algorithm
was executed for 1000 iterations.
5.3 Comparative Study
Table 2 shows a comparative study between the proposed approach and some
results already reported. LB is the Lower Bound for each instance, taken from
the original Brandimarte data. The algorithms used to compare our method
(QL) are:
– GA: Genetic Algorithm [3], an algorithm integrating different strategies for
generating the initial population, selecting individuals for reproduction and
reproducing new individuals.
– ACO: Ant Colony Optimization [2], which provides an effective integration be-
tween the Ant Colony Optimization model and a knowledge model.
– GEN: Abbreviation of GENACE, an architecture proposed in [14] that integrates
evolution and learning within a random search process.
– Brand: Tabu Search [7], a hierarchical algorithm for flexible job shop
scheduling based on the tabu search metaheuristic.
Table 3 shows the mean relative error in % (MRE) of the different approaches
used to compare our method, with respect to the best-known lower bound. The
relative error (RE) is defined as RE = [(MK − LB)/LB] × 100%, where MK is the
makespan obtained by the corresponding algorithm.
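For instance, for Mk01 the QL approach obtains a makespan of 40 against a lower
bound of 36 (Table 2), which gives RE = [(40 − 36)/36] × 100 ≈ 11.1%.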
Table 2. Experimental Results
Inst. LB GA ACO GEN Brand QL
Mk01 36 40 39 40 42 40
Mk02 24 26 29 29 32 26
Mk03 204 204 204 204 211 204