AIAA JOURNAL
Vol. 40, No. 5, May 2002
Parallelization of Rotorcraft Aerodynamics Navier–Stokes Codes
Kivanc Ekici¤ and Anastasios S. Lyrintzis†
Purdue University, West Lafayette, Indiana 47907-1282
The modi cation of unsteady three-dimensional Navier–Stokes codes for application on massively parallel and
distributed computing environments is investigated. Previously, the Eulermode of the Navier–Stokes code TURNS
has been parallelized. For the ef cient implementationof theNavier–Stokesmode of TURNS onparallel computing
systems, several algorithmic changes should be made. The main modi cation is done on the implicit operator,
lower–upper symmetric Gauss–Seidel. Two new implicit operators are used because of convergence problems of
traditional operators with high cell aspect ratio grids needed for viscous calculations. Results for Navier–Stokes
cases are presented for various operators. The message passing interface protocol is used because of its portability
to various parallel architectures.
Nomenclature
a = speed of sound
e = total energy
p = pressure
Pr = Prandtl number
Re = Reynolds number
u; v; w = Cartesian velocities
x; y; z = Cartesian coordinates
° = speci c heat ratio
¹ = dynamic viscosity
»; ´; ³ = generalized coordinates
½ = density
9 = rotor azimuth angle
Introduction
R OTORCRAFT aerodynamicsare very complex becausethree-dimensional unsteady transonic viscous aerodynamic phe-
nomena are involved. The ability to predict the ow around heli-
copter rotors is vital for the control of high-speed losses, vibration,
and noise. The advent of tilt-rotor aircraft in the past few years
further complicates rotor aerodynamics.
Computational uiddynamics(CFD)hasprovento be aneffective
tool for predictionof rotorcraftaerodynamicsand noise. Traditional
CFD methods used for this purpose have included transonic small
disturbancepotentialmodels and full-potentialmodels.1;2 Whereas
potentialmethodscan generatea good solutionat relativelylowcost,
theyareunableto resolvesome owfeaturesthat canhavea large im-
pact on the aerodynamic and aeroacoustic properties. For instance,
the vorticalwake producedby the rotor blades plays a dominant role
in unsteady load uctuation, affecting aerodynamic performance.
Also, shocks produced by transonic ow near the blade tip, often
encountered in forward ight, have a large effect on the high-speed
impulsive noise generated by the blades and consequentlymust be
determined accurately.For these reasons,many more recent design
analyses3 use Euler/Navier–Stokes methods.
However,Navier–Stokesmethods tend to be computationallyde-
manding,whichhasdelayedtheirwidespreadindustrialuse.Parallel
Received 28 February 2001; revision received 24 September 2001; ac-
cepted for publication 8 October 2001. Copyright c° 2001 by Kivanc Ekici
and Anastasios S. Lyrintzis. Published by the American Institute of Aero-
nautics and Astronautics, Inc., with permission. Copies of this paper may
be made for personal or internal use, on condition that the copier pay the
$10.00 per-copy fee to the CopyrightClearance Center, Inc., 222Rosewood
Drive, Danvers, MA 01923; include the code 0001-1452/02 $10.00 in cor-
respondence with the CCC.
¤Research Assistant, School of Aeronautics and Astronautics; currently
Research Associate, Department of Mechanical Engineering and Materi-
als Science, Duke University, Durham, NC 27708-0300. Student Member
AIAA.
†Associate Professor, School of Aeronautics and Astronautics. Associate
Fellow AIAA.
computation offers the potential for cheaper and faster computa-
tions. Several vendors have released machines that utilize com-
modity reduced instruction set computer (RISC) processors that
are the same as those used in high-end workstations, for example,
IBM SP2 and SGI Origin. These machines generally attain bet-
ter price/performance than vector processors and have peak execu-
tion rates that are several times faster than vector parallelmachines
such as the Cray C-90. It is also possible to assemble a collec-
tion of off-the-shelf hardware components and software packages
and achieve high performance at very low prices.4;5 Several com-
panies are beginning to utilize parallel processing in the form of
networked workstations, which sit idle during off hours, to attain
supercomputerperformance.6 Unfortunately,many traditionalCFD
algorithmswere developedwith serial or vectorizedcomputationin
mind, and consequentlyrequiresigni cantmodi cation for ef cient
parallel execution.Also, a certain amount of code rewriting, for ex-
ample, adding message passing calls, is necessary, which can be
a signi cant time investment for long and well-established codes.
These factors have been deterrents to more widespread industrial
adoption of parallel processing.
The current study focuses on efforts to improve the ef ciency of
the TURNS code developedby Srinivasanet al.7 and Srinivasanand
Baeder.8 TURNS is chosenbecauseit providesa well-known indus-
try standardfor rotorcraftaerodynamics.The codeuses a third-order
accurate,upwindbiasedscheme thatprovidesshockcapturing.Flux
limiters are appliedso that the scheme is total variationdiminishing.
In TURNS themain bottleneckfor the paralleloptimizationis the
lower–upper symmetric Gauss–Seidel (LU-SGS) implicit operator,
currently used for both steady and unsteady calculations with the
code. Candler et al.9 developed an ef cient algorithm called data
parallel lower–upper relaxation (DP-LUR) to parallelize LU-SGS
for a data parallel environment.Wissink et al.10¡12 developeda new
hybrid domain decomposition implementation of the LU-SGS and
DP-LUR operators and applied it to the Euler mode of TURNS
successfully. This method can be extended to the Navier–Stokes
equations. However, the LU-SGS algorithm exhibits poor conver-
gence for the high-cell-aspect-ratio (CAR) grids needed to resolve
theboundarylayer of highReynolds ows, as shown, for example,in
Refs. 13 and 14. Wright et al.15 proposed a full matrix modi cation
of the LU relaxation method that works well for the high-aspect-
ratio grids. In addition,Wright et al.16 also developed another type
of fullmatrixmethod, data parallel line relaxation(DPLR), which is
even better for the high-aspect-ratiogrids.Neither of thesemethods
have been tested for three-dimensionalunsteady ows.
The focus of this study is to develop a new hybridmethod similar
to that of Wissink using the full matrix approach and to apply this
new algorithm togetherwith DPLR to TURNS for the ef cient par-
allelizationof rotorcraftaerodynamicscodes.Note that the LU-SGS
algorithmhas been used in a number of well-knownCFD codes, for
example, INS3D17 and OVERFLOW,18 primarily for its stability
properties with larger time steps. Therefore, our proposed paral-
lelizationmethodologywill be applicable to several Navier–Stokes
887
旋翼机空气动力学:www.tech-domain.com
888 EKICI AND LYRINTZIS
codes. Message passing interface (MPI) will be used because of its
portability to different parallel architectures.
Governing Equations
The nondimensional conservation law form of the three-
dimensional, thin-layer Navier–Stokes equations in curvilinear
space »; ´; ³ , and ¿ can be written as19;20
@¿q C @» E C @´F C @³G D Re¡1@³ S (1)
where
q D 1
J
266664
½
½u
½v
½w
e
377775 ; E D 1J
266664
½U
½uU C »x p
½vU C »y p
½wU C »z p
U .eC p/¡ »t p
377775
F D 1
J
266664
½V
½uV C ´x p
½vV C ´y p
½wV C ´z p
V .eC p/¡ ´t p
377775 ; G D 1J
266664
½W
½uW C ³x p
½vW C ³y p
½wW C ³z p
W .eC p/¡ ³t p
377775 (2)
The contravariant velocities in the preceding equations, namely,
U;V ; and W , are de ned as
U D »t C »xu C »yv C »zw
V D ´t C ´xu C ´yv C ´zw
W D ³t C ³xu C ³yv C ³zw (3)
The quantitiessuch as »x ; »y; and »z are themetrics of the coordinate
transformation, and J is the Jacobian of the transformation. The
viscous ux vector for the thin layer approximation S is given by
S D ¹
J
266666664
0¡
³
2
x C ³ 2y C ³ 2z
¢
u³ C 1=3.³xu³ C ³yv³ C ³zw³ /³x¡
³
2
x C ³ 2y C ³ 2z
¢
v³ C 1=3.³xu³ C ³yv³ C ³zw³ /³y¡
³
2
x C ³ 2y C ³ 2z
¢
w³ C 1=3.³xu³ C ³yv³ C ³zw³ /³z¡
³
2
x C ³ 2y C ³ 2z
¢
m C 1=3.³xu³ C ³yv³ C ³zw³ /n
377777775
(4)
where
m D 0:5.u2 C v2 Cw2/³ C [1=Pr.° ¡ 1/].a2/³
n D .³xu C ³yv C ³zw/
For the turbulent viscous ows, the viscosity coef cient ¹ is com-
puted as the sum of the laminar viscosity coef cient ¹l estimated
using Sutherland’s law and the turbulent viscosity¹t estimated us-
ing the Baldwin–Lomax algebraic turbulencemodel. The pressure
can be obtained from the equation of state given as
p D .° ¡ 1/fe ¡ .½=2/.u2 C v2 C w2/g (5)
Numerical Method
The rst-order implicitEuler integrationin time gives the follow-
ing form of the equations:£
I C1¿ ¡±» An C ±´Bn C ±³Cn ¡ Re¡1±2³ Mn¢¤1qn D ¡1¿ f .qn/
(6)
where
f .q
n
/ D ¡±» En C ±´Fn C ±³Gn ¡ Re¡1±³ Sn¢ (7)
is the ux evaluated using the upwind-biased scheme of Roe21 at
time leveln; A, B, andC are the inviscidJacobiansgivenby Pulliam
and Steger20; and M is the viscous Jacobian given by Tysinger and
Caughey.22 The monotone upstream-centered scheme of Anderson
et al.23 for the conservativelaws (MUSCL) is used to get high-order
accuracy with ux limiters. First, a ux vector splitting is applied
to the ux Jacobians A, B, and C , of the inviscid ux vectors E ,
F , and G . In the » direction, A is split into its positive and negative
eigenvalueparts, AD AC C A¡, and the positivematrix is backward
differenced,whereasthe negativematrix is forwarddifferenced.The
same technique is applied to the other inviscid Jacobians in the cor-
responding directions. The viscous JacobianM , on the other hand,
is simply central differenced in ³ direction. When the backward
differencedC terms in L and the forward differenced¡ terms in U
are collected, the systemcan be divided into lower and upper block-
tridiagonal factors with a diagonal factor given by
D D I C1¿.AC ¡ A¡ C BC ¡ B¡ C CC ¡ C¡ C 2M/ j;k;l
L D D ¡1¿ ¡ACj ¡ 1 C BCk¡ 1 C CCl¡ 1 C Ml¡ 1¢
U D D C 1¿¡A¡j C 1 C B¡k C 1 C C¡l C 1 ¡ Ml C 1¢ (8)
Second, to avoid costly computation times for the inversion of the
5£ 5 D matrix, an ef cient spectral radius approximationproposed
by Yoon and Kwak14 is applied to the ux Jacobians:
A§ D 12 .A § ½A I / (9)
where ½A is the spectral radius of A in the » direction.This approx-
imation results in a very simple form of D:
D D I C1¿.½A I C ½B I C ½C I C 2½M I / j;k;l (10)
for which the inversion is just a scalar operation.
TheLU-SGS schemeproposedbyYoon andJameson24 is used for
the time integration of the implicit equations. The resulting system
of equations is solved by applying a two-step symmetric Gauss–
Seidel algorithm as follows:
L1q
¤ D ¡1¿ f .qn/; U1qn D D1q¤ (11)
InEq. (11),1q¤ is the intermediatesolutionand1qn is the updateto
the vector of dependent variables at time step n, qn C 1 D qn C1qn .
Note that the approximationsare made only to the left-hand side of
the system to improve the convergence to the steady-state solution;
hence, they will not effect the accuracy of the scheme.
Time Accuracy
The implicitalgorithmimplementedinTURNS is usedfor steady-
state as well as time-accurate solutions. For steady-state problems,
a rst-order scheme in time is used, and the convergence to the
steady state is acceleratedusing large time steps.On the other hand,
for time-accurate problems, the equations are integrated through
time using a second-orderscheme starting from a meaningful initial
condition, usually the steady-state solution. In this case, the time
step is chosen according to the timescale of the problem.
LU-SGS has some factorizationerror associatedwith the approx-
imations made on the left-hand side of the algorithm. This destroys
the good convergencecharacteristicsin large time steps for steady-
state solutions and puts a limit to the size of the time steps used.
In addition, for time-accurate solutions, the factorization error still
exists, but it can be reduced by using inner relaxation steps.
For steady-state problems one can vary 1¿ in space, which can
be interpreted as an attempt to use a uniform Courant–Friedrichs–
Lewy number throughoutthe ow eld. The space-varyingtime step
approach is especially effective for grids that have very ne to very
coarse spacings in the domain. The Navier–Stokes grids used in this
study are good examples of such wide variety length scale grids. In
TURNS, a purely geometric variation of the time-step approach is
used25;26 in the following form:
1¿ D 1¿ jref 1C "
p
J
1CpJ (12)
to accelerate the steady-stateconvergence.In Eq. (12),1¿ jref is the
xed reference time step, " is a small constant, for example, 0.002,
旋翼机空气动力学:www.tech-domain.com
EKICI AND LYRINTZIS 889
and J is the Jacobian of the transformation.On the other hand, for
unsteady solutions, the time step used in the calculations is a xed
value.
Boundary Conditions
The boundaryconditionsare appliedexplicitly,whichmeans that
the values of the conserved variables are updated at the end of each
iteration. The Cartesian velocities u; v, and w are calculated from
the extrapolationof the contravariantvelocities.On the surface, the
tangency condition is imposed for the Euler calculations, whereas
a no-slip condition is imposed for the Navier–Stokes calculations.
These two conditionsrequire the contravariantvelocitiesW D 0 and
U D V DW D 0, respectively. The density and the pressure on the
surface are calculated by zeroth-order extrapolations, whereas the
total energy is determined from the equation of state.
Parallelization
Previously, the parallelizationof the Euler mode of TURNS was
done by Wissink11 in a domain decomposition fashion using MPI
routines.The parallelizationof the code in multiple instructionmul-
tiple data (MIMD) fashion was chosen for two main reasons. First,
code rewriting from FORTRAN 77 to FORTRAN 90, which is re-
quired by single instruction multiple data, is avoided. Second, the
MIMDapproachmakes theprogrammoreportableto differentkinds
of parallel architectures.
A simple domain decomposition strategy, which uses the values
from the previous iteration of the current time step on processor
boundaries, is chosen. This strategy was used before by Wissink
et al.,27 and no signi cant degradationof performancewas observed
for up to 456 processors. The alternative, of course, is to do a for-
mal domain decomposition as explained in Ref. 28. However, this
approach is time consuming to implement in a legacy code, and
because the rst, simple strategy has worked well previously, it was
used in this study.
The computational grid is decomposed to equally spaced parti-
tions to prevent load imbalance, and a copy of the code is run on
each processor.On the boundariesof each partition, an overlapping
strategy is used to calculate the conserved variables. For example,
the conserved variables on the left boundaries of each partition is
calculated as the last interior points near the right boundary of the
previous partition. Likewise, the conserved variables on the right
boundaries of each partition is calculated as the rst interior points
near the left boundary of the next partition. This single overlap
strategy and a typical two-dimensional grid partitioning are given
in Figs. 1 and 2, respectively.
After all of the processors are done with their solutions, the val-
ues on the boundaries are sent to and received from the neighboring
processors using MPI SEND and MPI RECEIVE subroutines, and
Fig. 1 Single overlap strategy.
Fig. 2 Partitions of a typical two-dimensional grid.
the calculationsfor the next iterationcan be started. For small num-
ber of points per processor, we expect to see some performance
degradation because the boundary values are less accurate.
The partitioningof the grid is allowed only in the streamwise and
radial directions,as shown in Fig. 3, to eliminate the load imbalance
when applying the boundary conditions.
In addition, the grid collapses to a singular plane at outboard the
tip of the blade and across the wake. At these locations, to satisfy
the continuityof the owquantities,valuesare interpolatedbetween
the processorson each side of the singularplane. Partitioning in the
normal directionwould make some of the processors sit idle while
the otherswould perform the interpolation,which would cause load
imbalance. Therefore, partitioning of the computational domain is
done in the streamwise and radial directions only.
Diagonal Operators
In this section, two parallel implicit operators that use the spec-
tral approximation of LU-SGS scheme previously implemented in
TURNS are described.The useof spectralapproximationretains the
ease of computation, as well as the advantage of using large time
steps for the steady-state solution.
LU-SGS Operator
The LU-SGS algorithm is the basis of the parallel implicit oper-
ators implemented in TURNS. In the two-step symmetric Gauss–
Seidel method (see Ref. 11), there are off-diagonal terms coming
from L and U on the left-hand side of both equations. If we move
these off-diagonal terms to the right-handside, the resulting system
becomes diagonal, and the updates of the conserved quantities can
easily be obtained.The algorithmfor the LU-SGS implicit operator
can be written as follows.
Algorithm 1(LU-SGS):
Do j; k; l D 1; : : : ; Jmax; Kmax; Lmax
1q
¤
j;k;l D D¡11¿
£¡ f .qn/C ACj ¡ 11q¤j ¡ 1 C BCk ¡ 11q¤k¡ 1
CCCl¡ 11q¤l ¡ 1 C Ml¡ 11q¤l ¡ 1
¤
EndDo
Do j; k; l D Jmax; Kmax; Lmax; : : : ; 1
1q
n
j;k;l D 1q¤j;k;l ¡ D¡11¿
£
A¡j C 11q
n
j C 1 C B¡k C 11qnk C 1
CC¡lC 11qnl C 1 ¡ MlC 11qnl C 1
¤
EndDo
旋翼机空气动力学:www.tech-domain.com
890 EKICI AND LYRINTZIS
Fig. 3 Partitioning strategy of the computational grid.
Fig. 4 Schematic diagram of LU-
SGS sweeps.
In the rst step of the solution, the intermediatesolutionvector1q¤
is obtained by sweeps that start form the lower left corner and pro-
ceed to the upper right corner of the computational grid. The nal
solution is then obtained by sweeps in the reverse direction. The
direction of sweeps in the computationaldomain is shown in Fig. 4.
The LU-SGS method uses hyperplanes of different sizes through-
out the grid. Although this approach can easily be vectorized, it
is dif cult to implement the method on the distributed-memory
computers because of the load balancing problems caused by the
variable-size hyperplanes and the large amount of communication
due to the recursion between planes. For example, if we look at
the forward sweep, the point . j; k; l/ can begin the computation
only when points . j ¡ 1; k; l/, . j; k¡ 1; l/, and . j; k; l ¡ 1/ have
completed the step. Therefore, this data dependency causes load
balancingproblem, and not all of the processors can be kept busy at
the same time. Overall, the LU-SGS algorithmin its original form is
not very suitable for distributed-memoryparallelization,and hence,
some algorithmic changes should be made.
DP-LUR Operator
A very ef cient point-relaxationimplementation of the LU-SGS
method for the data-parallel environment, DP-LUR, has been in-
troduced by Candler et al.9 for solving hypersonic ow problems.
Basically, this method replaces the two symmetric Gauss–Seidel
sweeps with Jacobi sweeps and moves all of the off-diagonal terms
to the right-handside. The procedure for this method can be written
as follows.
Algorithm 2 (DP-LUR):
1q
.0/
j;k;l D ¡D¡1 ¢1¿ f .qn/
Do i D 1; : : : ; isweep
Do j; k; l D 1; : : : ; Jmaxlocal ; Kmaxlocal ; Lmaxlocal
1q
.i/
j;k;l D D¡11¿
£¡ f .qn/C ACj ¡ 11q .i ¡ 1/j ¡ 1 C BCk ¡ 11q.i ¡ 1/k¡ 1
CCCl¡ 11q .i ¡ 1/l ¡ 1 C Ml ¡ 11q.i ¡ 1/l ¡ 1 ¡ A¡j C 11q.i ¡ 1/j C 1
¡ B¡kC 11q .i ¡ 1/k C 1 ¡ C¡l C 11q.i ¡ 1/l C 1 C Ml C 11q.i ¡ 1/l C 1
¤
EndDo
EndDo
1q
n
j;k;l D 1q.isweep/
DP-LUR uses only the nearest neighbor data due to the point-
relaxation strategy and, therefore, allows simultaneous computa-
tions on multiple processors with each processor holding the local
data. The data that lie on the borders are communicated after each