SHELXD and SHELXE
The structure solution program SHELXD (called XM in the Bruker SHELXTL system, but
identical to SHELXD except in the logo) is able to solve larger ab initio problems than
SHELXS-97, and is also useful for locating the heavy atoms or anomalous scatterers from
SIR, SAD, SIRAS or MAD data. From January 2002 SHELXD is available as source and
precompiled binaries for common operating systems as part of the SHELX-97 system. XM is
available from Bruker Nonius as part of the SHELXTL system, which includes the whole of
SHELX plus programs not in the public domain such as the interactive molecular graphics
program XP and reflection data manipulation program XPREP. In this documentation both
XM and SHELXD will be referred to as SHELXD.
For the MAD, SAD, SIR etc. applications of SHELXD the location of the heavy atom sites is
only one step in the structure solution. The new program SHELXE (XE in the Bruker
SHELXTL) can read the .res file containing the heavy atom sites written by SHELXD and
estimate the native phases and the corresponding weights (figures of merit). SHELXE outputs
the phases in an XtalView format .phs file so that a map can be viewed using interactive
graphics or the phases can be improved by density modification using programs such as DM,
SOLOMON, RESOLVE etc. SHELXE is robust, fast and simple to use, but it must be
emphasized that the resulting phases may be inferior to those produced by much more
sophisticated maximum likelihood programs such as SHARP, SOLVE or MLPHARE.
However in favorable cases it may even prove possible to autotrace the maps from SHELXE
directly, e.g. using wARP.
For SIR and SAD problems SHELXE starts with the centroid phases from the Harker
construction (Harker, 1956); for MAD and SIRAS an unambiguous phase can be assigned.
Sigma-A weights (Read, 1986) are used throughout. In the case of SAD and SIR a single
density truncation cycle retaining only about 7% of the density is applied to resolve the
twofold ambiguity for appropriate reflections; this is similar to the low density elimination
used by Woolfson et al. ( ) and to a density modification procedure proposed by Giacovazzo
& Siliqi (1997).
The crude density modification performed by SHELXE may be termed the sphere of
influence method. A sphere of radius 2.42Å (a typical 1,3-distance in virtually all organic and
macromolecular structures) is constructed around each pixel of the map, and the variance of
the electron density around a given pixel is calculated using 92 (or 272) optimally distributed
pixels that lie close to this sphere. The variances are sorted but instead of using them to
define a sharp solvent boundary, a fuzzy boundary is generated so that pixels with very high
sphere variances will be entirely in the 'protein' region and those with very low variances will
be entirely in the 'solvent' region and the ones in between are assigned probabilities between
0 and 100% that they are in the solvent region. In the protein region the negative density is
reset to zero and in the solvent region it is 'flipped' (Abrahams & Leslie, 1996). A pixel that
has been assigned a 60% probability of being in the solvent region is assigned a 60:40
weighted average of the densities resulting from the solvent and protein treatments.
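The weighted averaging described above can be sketched for a single map pixel as follows. This is only an illustrative Python sketch, not the SHELXE implementation: the function name blend_density is hypothetical, and 'flipping' is assumed here to be a plain sign inversion about zero, whereas the actual procedure may flip relative to a solvent level or apply a scale factor.

```python
def blend_density(rho, p_solvent):
    """Blend the protein and solvent treatments of one map pixel.

    rho        density at the pixel (arbitrary units)
    p_solvent  probability (0..1) that the pixel lies in the solvent region
    """
    if not 0.0 <= p_solvent <= 1.0:
        raise ValueError("p_solvent must lie between 0 and 1")
    # Protein treatment: negative density is reset to zero.
    protein_value = max(rho, 0.0)
    # Solvent treatment: density is 'flipped' (ASSUMPTION: simple sign inversion).
    solvent_value = -rho
    # A pixel with p_solvent = 0.6 receives a 60:40 weighted average.
    return p_solvent * solvent_value + (1.0 - p_solvent) * protein_value


if __name__ == "__main__":
    for rho in (-2.0, 0.5, 3.0):
        print(rho, blend_density(rho, 0.6))
```

With p_solvent equal to 0 or 1 the function reduces to the pure protein or pure solvent treatment, so the fuzzy boundary is a smooth interpolation between the two.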
It was anticipated that by using a little chemical knowledge (the 1,3-distance) it would be
possible to improve maps given very high resolution data, but in practice the method still
works well with 3Å data provided that the solvent content is relatively high. For very high
resolution data (better than 1.5Å) or very high solvent content (>60%) the SHELXE phases
can have rather high map correlation coefficients (>0.9) with the phases from the final
refinement. In less favorable cases it may well be possible to improve the phases further
using other more sophisticated density modification programs, especially if non-
crystallographic symmetry (NCS) can be exploited. An attempt is made to estimate realistic
weights (foms) in SHELXE so that further phase refinement using other programs is
facilitated.
SHELXE is currently a beta-test that is being made available in precompiled form without
extra license fees etc. but with an expiry date (1/1/03) to registered SHELX and SHELXTL
users. If it proves successful it will be incorporated in future SHELX and SHELXTL releases
that will have to be licensed separately.
Introduction to SHELXD
SHELXD is a stand-alone executable and does not require any other program, initialization
files or environment variables etc. The input to SHELXD consists of two files, name.ins and
name.hkl, both of which can conveniently be created using the Bruker Nonius XPREP
program. The .hkl file has the standard SHELX format and, with the exception of two or three
instructions, the .ins file is very similar to the input for SHELXS. SHELXD expects ONE
and only one source of starting atoms. This can take one of the following forms:
A: Input atoms in normal SHELX format for expansion using PLOP
B: PATS for Patterson seeding of the dual-space direct methods
C: GROP and a PDB-format model for fragment seeding
D: Random atoms (used if none of the above apply)
For substructure solution using MAD data etc. option B (PATS + FIND but no PLOP) is
recommended. In each case the action is specified in the .ins file that also contains crystal
data in the usual SHELX form. The reflection data consist of an .hkl file containing F²
(HKLF 4) or F-values (HKLF 3). These may correspond to either native data for ab initio
structure solution or structure expansion, or MAD, SAD, SIR or SIRAS FA or ΔF values for
heavy or anomalous atom location.
Dual-space recycling (Miller et al., 1993; Miller et al., 1994; Sheldrick et al., 2001), using the
largest E-values (FIND) is followed by peaklist optimization (PLOP; Sheldrick & Gould,
1995); one or both of these commands must be present. In the case of structure expansion
only PLOP can be used and the program then stops. When the starting atoms are generated
randomly or by PATS or GROP, the calculations are repeated with new sets of starting atoms
each time. The total number of such tries may be specified with NTRY, otherwise the program
runs for ever (unless interrupted by a name.fin file).
When the final correlation coefficient CC (after PLOP) for an atomic resolution ab initio run
of SHELXD is 65% or greater, the structure is almost certainly solved. SHELXD writes the
best solution so far to a SHELX format file name.res and a PDB format file name.pdb. The
former can be examined with the interactive graphics program XP that is part of the Bruker
SHELXTL system. If XP is not available the PDB file may be displayed with RASMOL (use
the ball and stick display mode). Note that this may be done before stopping SHELXD. If the
structure is clearly solved, SHELXD may be terminated cleanly by creating a file name.fin in
the working directory.
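Creating the stop file can be scripted on any platform. The following Python sketch assumes only what is stated above, namely that an (empty) file called name.fin in the working directory makes SHELXD stop cleanly; the helper name request_stop is hypothetical.

```python
from pathlib import Path


def request_stop(job_name, workdir="."):
    """Create an empty <job_name>.fin file in workdir and return its path."""
    fin_file = Path(workdir) / f"{job_name}.fin"
    fin_file.touch()  # an empty file is sufficient; its content is not used here
    return fin_file


if __name__ == "__main__":
    print(request_stop("pn1a"))
```

The job name must match the first part of the .ins filename used to start SHELXD (pn1a for the test example pn1a.ins).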
Examples of ab initio structure solution with SHELXD
To illustrate full structure solution by ab initio methods, a test example is provided (in the egs
subdirectory on the SHELX ftp site) in the form of the files pn1a.ins and pn1a.hkl. Four
different ways of solving the structure are included in the .ins file; in order to run the various
tests it will be necessary to comment out some lines (by putting a space character at the
beginning of the line). The file is read only as far as the first HKLF instruction. This test
structure was kindly provided by Jenny Martin, University of Queensland, Australia. It
consists of (GCCSLPPCAANNPDYC), a linear polypeptide with two disulfide bridges,
giving 110 non-hydrogen peptide atoms plus 12 solvent atoms. The space group is P21 and
the resolution of the data 1.1Å. For further details see Hu et al. (1996). In the following
examples, TITL...UNIT in the normal SHELX format is assumed at the start of the .ins
file and HKLF 4 (or HKLF 3) followed by END at the end of the file. The cell contents
defined by SFAC and UNIT are only used by PLOP; in the FIND stage the atoms are
assumed to be of the same type but with occupancies proportional to the square root of the
peak height, unless occupancy refinement is used (TANG with a negative first parameter).
FIND 80
PLOP 120 140 160
NTRY 50
This will search (FIND) for 80 atoms in the dual-space stage; it is usually more efficient to
search for ca. 25% less than the total number of non-solvent atoms, especially when - as here
- some heavier atoms such as sulfur are present. In the PLOP stage on the other hand one
should specify more than the expected number of atoms because this procedure involves the
elimination of the 'wrong' atoms. One can leave NTRY out in which case the job will run
forever (unless aborted or stopped more gently by creating a name.fin file in the same
directory).
An alternative approach is to use Patterson seeding instead of random starting atoms. One can
then look for say 80 atoms as above with FIND, or alternatively first optimize the sulfur
substructure (in this case four atoms) with FIND and expand to the full structure with PLOP.
The Patterson seeding may be performed for example with a randomly oriented fixed length
vector (for a disulfide bond). Everything after a '!' sign in a SHELX .ins file is treated as a
comment.
PATS -2.06 ! S-S distance
PSMF -4 ! supersharp Patterson
FIND 4 5
MIND -1.8 ! S-S > 1.8A, calc. PATFOM
TEST 10 5
PLOP 50 80 120 160 160
NTRY 20
Alternatively the Patterson seeding may use the highest Patterson peaks as translation search
vectors:
PATS
PSMF -4
FIND 4 5
MIND -1.8
TEST 10 5
PLOP 50 80 120 160 160
NTRY 20
Patterson or fragment seeding does not have to go through the FIND stage to optimize the
atomic positions, though this is strongly recommended and has the advantage that all four
sulfurs can be used. It is also possible to go into structure expansion with PLOP directly, and
this facility can be tested using the two-atom disulfide fragment as follows. It should be noted
that two sulfur atoms are quite adequate for PLOP to expand to the full structure, but the CC
threshold (the first TEST parameter) for entering the PLOP stage needs to be reduced a little
(in the above tests, it had the default of 45% for FIND 80 and was set to 10 for FIND 4).
GROP
TEST 8 5
PLOP 30 50 80 120 160 160
NTRY 20
ATOM 1 S CYS 1 0.000 0.000 0.000 1.000 10.00
ATOM 2 S CYS 1 0.000 0.000 2.060 1.000 10.00
The two sulfur atoms are given in fixed PDB format. As a further example (not
provided as test files) of seeding based on an initial fragment search, for a cyclodextrin
structure with four beta-cyclodextrins in the asymmetric unit and with data barely to atomic
resolution, the following could be tried:
GROP
FIND 240
PLOP 320 400
ATOM 1 C41 MOL 1 -3.859 4.863 7.904 1.000 10.00
ATOM 2 C31 MOL 1 -5.081 4.209 8.524 1.000 10.00
ATOM 3 C21 MOL 1 -5.211 2.740 8.155 1.000 10.00
... diglucose fragment in PDB format ... .
ATOM 21 C52 MOL 1 -0.292 4.714 7.025 1.000 10.00
ATOM 22 O52 MOL 1 -0.642 5.837 6.253 1.000 10.00
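Because the GROP fragment must be supplied in fixed-column PDB format, it can be convenient to generate the ATOM records with a small script. The Python sketch below assumes the standard PDB column layout (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, residue number in 23-26, x, y, z in 31-54, occupancy in 55-60 and B in 61-66); the function name pdb_atom_line is hypothetical, and which of these fields SHELXD actually interprets is not specified here.

```python
def pdb_atom_line(serial, name, resname, resseq, x, y, z, occ=1.0, b=10.0):
    """Return one ATOM record laid out in the standard PDB fixed columns."""
    # Atom names shorter than four characters conventionally start in column 14.
    name_field = name if len(name) == 4 else f" {name:<3s}"
    return (
        f"ATOM  {serial:5d} {name_field} {resname:>3s}  {resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
        # Occupancy is written with three decimals to match the values in the
        # examples above; it still fits the six-character field.
        f"{occ:6.3f}{b:6.2f}"
    )


if __name__ == "__main__":
    # The two-atom disulfide fragment used in the GROP test above.
    print(pdb_atom_line(1, "S", "CYS", 1, 0.0, 0.0, 0.0))
    print(pdb_atom_line(2, "S", "CYS", 1, 0.0, 0.0, 2.06))
```

Each record produced this way is 66 characters long, so the coordinates always land in the same columns regardless of the atom name or serial number.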
A major new facility in SHELXD for small molecules is the ability to solve merohedrally
twinned structures by ab initio methods; all that is required is to input the SHELXL
TWIN instruction and an estimated BASF parameter (which is held at a fixed value throughout).
XPREP can be used to find the TWIN matrix and estimate the BASF parameter value. TWIN
and BASF are only applied at the PLOP stage, and are ignored by PATS, GROP and FIND.
Macromolecular phasing using SHELXD and SHELXE
SHELXE is intended to be run immediately after SHELXD. It picks up the .res file
containing the best substructure solution (so far) from SHELXD. Since very few parameters
are required for SHELXE they are all given on the command line. When the correlation
coefficients indicate that SHELXD has 'solved' the substructure, it can be terminated (by
writing a dummy name.fin file into the working directory - under UNIX the touch
instruction can be used for this) and TWO SHELXE jobs started. Two jobs are almost
always necessary because the heavy atom substructure and where appropriate the space group
may have to be inverted; there is a 50% chance that the heavy atom enantiomorph will be
wrong! The command lines for these two jobs are identical except that one contains the -i
switch. These two jobs may be run simultaneously because the files do not clash; the -i job
adds '_i' to the end of the first part of the filename for the output files. Often it will become
clear from the console output which heavy atom enantiomorph is correct (see examples
below) and the other job can then be killed.
Before phasing with SHELXD and SHELXE it is necessary to prepare three input files:
name-df.ins, name-df.hkl and name.hkl. The first two are read by SHELXD, the last two by
SHELXE, which also reads the file name-df.res written by SHELXD. Up to the period, the
filename can be freely chosen but must be the same for the first two files; see the examples
below. All three files can conveniently be set up using the Bruker XPREP program, but the
information below should enable other sources to be used. Note that Bruker Nonius are often
willing to provide a free demo version of XPREP (fully featured but with an expiry date);
anyone interested should contact sbyram@bruker-axs.com, trixie.wagner@bruker-axs.de or
anita.coetzee@nonius.nl.
The name-df.ins file contains (at least) the following instructions in the order given:
TITL (followed by any title on the same line)
CELL λ a b c α β γ (in Å and deg.: λ is ignored but is standard for SHELX)
LATT and SYMM (to define the space group, see examples and the SHELX manual)
SFAC Se (or any other single element, even if there are several heavy atom types)
UNIT M (approximate number of heavy atoms per cell multiplied by 4)
SHEL 999 d (where d is the resolution at which to truncate the data)
PATS (Patterson seeding)
FIND N (number of sites to search for, should be within 20% for best results)
MIND -3.5 (minimum allowed distance between sites)
HKLF 3 (to read F rather than F²)
END
The critical parameters are d, the resolution at which to truncate the data, and N, the number
of atoms to be searched for; it may be worth trying different values of these two parameters
in difficult cases.
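When several values of d and N are to be tried, the name-df.ins file can be generated by a script. The following Python sketch writes the instructions in the order listed above; the function name substructure_ins and all numerical values in the usage example (cell, wavelength, number of sites) are hypothetical, and the SYMM operators shown are the usual ones for P212121 given only for illustration.

```python
def substructure_ins(title, wavelength, cell, latt, symm, element,
                     sites_per_cell, d_min, n_find, mind=-3.5, ntry=None):
    """Return the text of a minimal SHELXD substructure .ins file."""
    lines = [
        f"TITL {title}",
        f"CELL {wavelength:.5f} " + " ".join(f"{v:.3f}" for v in cell),
        f"LATT {latt}",
    ]
    lines += [f"SYMM {op}" for op in symm]
    lines += [
        f"SFAC {element}",
        f"UNIT {4 * sites_per_cell}",   # approximate sites per cell times 4
        f"SHEL 999 {d_min:g}",          # truncate the data at resolution d
        "PATS",
        f"FIND {n_find}",
        f"MIND {mind:g}",
    ]
    if ntry is not None:
        lines.append(f"NTRY {ntry}")    # optional: limit the number of tries
    lines += ["HKLF 3", "END"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = substructure_ins(
        "hypothetical SeMet MAD example", 0.98,
        (50.0, 60.0, 70.0, 90.0, 90.0, 90.0), -1,
        ["0.5-X, -Y, 0.5+Z", "-X, 0.5+Y, 0.5-Z", "0.5+X, 0.5-Y, -Z"],
        "Se", sites_per_cell=32, d_min=3.0, n_find=8, ntry=100)
    print(text, end="")
```

Looping over a few values of d_min and n_find and writing each result under a different job name gives a simple way to explore the two critical parameters.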
The optimal value of d may be estimated using XPREP, either from the mean ratio of ΔF to
its esd (assuming that the data have been processed so that the esds are on an absolute scale,
i.e. χ² is close to one), or from the correlation coefficient between the signed anomalous
differences for two datasets (different MAD wavelengths or in the case of SAD different
crystals). It should be noted that there is almost always an optimal value of d and it should be
larger than the resolution limit of the diffraction pattern. Often 3Å to 3.5Å gives good results
for MAD phasing. If XPREP is not available then a good rule of thumb is to set d about 0.5Å
larger than the diffraction limit, i.e. to truncate the data at 0.5Å lower resolution.
At the end of the dual-space direct methods SHELXD refines the site occupancies assuming
that all atoms are of the same type. This provides an adequate approximation in the case
where different anomalous scatterers are present (e.g. Ca2+ and S in the trypsin example
discussed below). It also shows when the actual number of sites is different from the value
input on the FIND instruction; for a selenomethionine MAD experiment there should be a
clear drop in occupancy after the last site. For halide soaks on the other hand there is often a
continuous descent to the noise level reflecting the variable occupancies of the sites. The
occupancy refinement is switched on by a negative first TANG parameter; this is the default if
there is no PLOP instruction.
The cell contents (SFAC/UNIT) should be specified correctly when SHELXD is used for full
ab initio structure solution, but for substructures a single element type should be specified
and the number of sites expected per cell multiplied by about four so that the probabilities are
calculated correctly for the minimal function and Ralpha figures of merit. Since these are
only printed as information - the correlation coefficient alone is used to decide which solution
is 'best' - the SFAC/UNIT parameters are not important for substructure solution.
For large selenomethionine substructures (which behave more like equal atom ab initio
structure solution of small molecules) it may be worth increasing the number of Patterson
peaks used for the Patterson seeding (e.g. PATS 200; the default is 100) and adding the
instructions WEED 0.3 (random omit maps) and SKIP 0.5 (uranium atom removal). The
latter two are the defaults when PLOP is present but are switched off by default if PLOP is
absent. When PATS is used, WEED produces a much smaller additional improvement in the
hit ratio than when PATS is absent. For small substructures (<10 sites), WEED and SKIP can
do more harm than good by eliminating too many correct sites at once.
The minus sign for the first MIND parameter specifies that the PATFOM figure of merit and
crossword table should be calculated. For phasing using the anomalous scattering of sulfur, a
distance of about 1.7Å is required if the resolution of the ΔF data (as truncated using SHEL)
permits the sulfur atoms in disulfide bridges to be resolved from each other (see trypsin
example below). The default option in the FIND stage of SHELXD is to ignore all sites on
special positions; to include possible sites on special positions, set the second MIND
parameter to -0.1. This can happen for halide soaks etc. but is not required for the two
examples below (selenomethionine cannot lie on a special position, and there are no special
positions in P212121).
It may also be worth adding NTRY 100 or NTRY 1000, otherwise the SHELXD job will
never finish. Alternatively NTRY can be left out and the job terminated by creating a
name-df.fin file.
The file name-df.hkl consists of one line per reflection, terminated by the end of the file or by
a line with all numbers zero. It is read using the FORTRAN format 3I4,2F8.2,I4; as normal
when reading floating point numbers with FORTRAN, the number of figures after the
decimal point may be varied but the numbers must be contained within the 8 character fields
and the decimal point must be present in the number. Each line consists of h, k, l, [DF or FA],
[s(DF) or s(FA)] and a, where a is the estimated phase shift in degrees that has to be added
to the heavy atom phase to give the native protein phase. DF or FA are always given as
positive numbers. In the SIR case, a is zero if the derivative F is greater than the native F and
180 if the opposite is true; for SAD, a is 90 if F+ > F- and 270 if F+ < F- For MAD or SIRAS
data, a may be anywhere in the range 0 to 360. a is only read by SHELXE, not by SHELXD.
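For users preparing this file without XPREP, the fixed-width layout and the SAD and SIR rules for the phase shift (the last field) can be sketched in Python as follows. The function names are hypothetical, and the plain absolute difference used for the difference magnitude is an assumption made for illustration only; proper scaling of the two datasets is not shown. For equal amplitudes the difference is zero, so the phase shift returned in that case is arbitrary.

```python
def sad_entry(f_plus, f_minus):
    """Return (difference, phase shift) for SAD: 90 if F+ > F-, else 270."""
    return abs(f_plus - f_minus), (90 if f_plus > f_minus else 270)


def sir_entry(f_derivative, f_native):
    """Return (difference, phase shift) for SIR: 0 if F(deriv) > F(nat), else 180."""
    return abs(f_derivative - f_native), (0 if f_derivative > f_native else 180)


def format_reflection(h, k, l, value, sigma, shift):
    """Write one reflection in the FORTRAN format 3I4,2F8.2,I4."""
    line = f"{h:4d}{k:4d}{l:4d}{value:8.2f}{sigma:8.2f}{shift:4d}"
    if len(line) != 32:
        raise ValueError("a value does not fit into its fixed-width field")
    return line


def parse_reflection(line):
    """Read back one line written in the format 3I4,2F8.2,I4."""
    h, k, l = (int(line[i:i + 4]) for i in (0, 4, 8))
    return h, k, l, float(line[12:20]), float(line[20:28]), int(line[28:32])


if __name__ == "__main__":
    diff, shift = sad_entry(105.0, 100.0)
    print(format_reflection(1, -2, 3, diff, 0.8, shift))
```

The length check guards against numbers that overflow their fields, which would otherwise shift all following columns and corrupt the record.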
The file name.hkl contains h, k, l, F² and σ(F²) in format 3I4,2F8.2 for the native data and is
terminated by the end of the file or by a line with all numbers zero. In a selenomethionine
MAD experiment it could either be a remote wavelength or (as in the example below) it could
be the data from the native (methionine) crystal if that diffracted to higher resolution. Usually
the same data will be used for the final refinement of the structure.
After starting SHELXD (with the command line shelxd name-df) the program first
prints a summary of all parameters used, then calculates and stores the Patterson and the
phase relations for the tangent formula. The solution with the best CC (correlation
coefficient) so far is written to the name-df.res file. One should wait until there are one or
more solutions with CC and CC(weak) at least 30 and 15 resp. and well separated from the
rest, but in practice it is worth waiting a few minutes longer in case there is an even better
solution. When it appears (from the CC v