CASTEP Error abort handling
An abnormal or premature exit from a CASTEP run can have three causes.
1. CASTEP has detected an error of some kind and chosen to perform a controlled abort of
the run. This may occur if
   a) there is a syntax or other error in your input files;
   b) some condition has occurred during the run which prevents it from continuing
   (this might be a check on the validity of the physics assumptions or a
   computational constraint);
   c) CASTEP has requested an action of the operating system (via the Fortran
   run-time library) which has returned a failure status to CASTEP.
2. The operating system has chosen to terminate the CASTEP run and killed it. In a batch
system this may be because it exceeded some system resource or queue CPU time limit.
3. There is a bug in CASTEP and the process, or one of the parallel processes, has terminated
with a "segmentation violation" or "bus error" signal (UNIX and Linux) or "access
violation" (Windows).
When trying to understand the cause of the error it is important to work out which of the above
three cases has occurred. In case (1) CASTEP always writes a (hopefully) explanatory error
message into one of its stderr files. They have names of the form seedname.nnnn.err, where
seedname is the root name of your CASTEP run and nnnn is a 4-digit integer showing which
parallel process issued the error message (always 0001 for a serial run). They are deleted on a
normal end-of-run exit. If any of these files contains an informational message, that proves that
CASTEP chose a controlled abort. If, on the other hand, all of the seedname.nnnn.err files are
empty, that proves that the running CASTEP processes were killed externally, either because of an
operating system action (case 2) or a bug (case 3).
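The check described above can be sketched in a few lines of Python. This is only an illustration, not part of CASTEP: the function names are hypothetical, and it assumes the error files sit in the run directory and follow the 4-digit naming pattern described above. It is only meaningful after an abnormal exit, since the files are deleted on a normal one.

```python
from pathlib import Path


def nonempty_err_files(seedname, directory="."):
    """Return the seedname.nnnn.err files in `directory` that contain text."""
    # Assumed naming: root name, a 4-digit process number, then ".err".
    pattern = f"{seedname}.[0-9][0-9][0-9][0-9].err"
    found = []
    for path in sorted(Path(directory).glob(pattern)):
        # A file holding only whitespace is treated as empty.
        if path.read_text(errors="replace").strip():
            found.append(path)
    return found


def classify_abort(seedname, directory="."):
    """Distinguish a controlled abort (case 1) from an external kill (case 2 or 3)."""
    if nonempty_err_files(seedname, directory):
        return "controlled abort (case 1)"
    return "killed externally (case 2 or 3)"
```

For example, classify_abort("myrun") inspects myrun.0001.err, myrun.0002.err and so on in the current directory, where "myrun" stands for whatever root name you used.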
Further diagnosis: Cases (2) and (3)
To understand these cases you should look at the logfiles written by the batch job manager (if you
are using one) which should contain some information on the reason for aborting the run. These
can sometimes be verbose and cryptic; it is usually best to study the output logs of a successful run
and to look for differences. You may well have to ask your systems staff to interpret these for you.
A further indication of an external abort is the presence of "core" files, which are dumped on a
signal. These can sometimes be useful to a guru in further diagnosis of a bug.
Running out of memory
This is such a common error with plane-wave calculations that it merits a section of its own.
HEAP Memory exceeded
If any of the seedname.nnnn.err files contains one of the messages
* Error in allocating /variable/ in /function/ (CASTEP versions <= 4.0.1)
* Out of RAM for /variable/ in /function/ (CASTEP versions >= 4.1)
this means that CASTEP requested some memory from the operating system (using Fortran's
ALLOCATE statement) and the request was denied, usually because available memory has been
exhausted. After checking that your input settings do not contain an error, your options are
1. to use some of CASTEP's memory-saving options, e.g. set the parameter
OPT_STRATEGY=MEMORY (or OPT_STRATEGY_BIAS to 0 or -3) and
PAGE_WVFNS=-1 or PAGE_WVFNS=/max-size/;
2. to find a computer with more memory to run on (or go to your local computer shop, buy
and install some additional memory);
3. if on a parallel system, to increase the number of processors for the job. This way the total
memory needed will be distributed over a larger number of processes, and the
requirement per processor will be smaller.
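Before trying any of these options it helps to confirm that a heap-memory message is really present. A minimal sketch, assuming the error files are in the run directory and that the two message prefixes quoted above appear verbatim; the function name is hypothetical and not part of CASTEP:

```python
from pathlib import Path

# Message prefixes quoted in this section for the two CASTEP version ranges.
MEMORY_MESSAGES = ("Error in allocating", "Out of RAM for")


def find_memory_errors(seedname, directory="."):
    """Return (file name, line) pairs for heap-memory messages in the .err files."""
    hits = []
    pattern = f"{seedname}.[0-9][0-9][0-9][0-9].err"
    for path in sorted(Path(directory).glob(pattern)):
        for line in path.read_text(errors="replace").splitlines():
            if any(msg in line for msg in MEMORY_MESSAGES):
                hits.append((path.name, line.strip()))
    return hits
```

The reported lines name the variable and function involved, which may be worth quoting when asking for help.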
STACK Memory exceeded
Due to a design limitation of Linux and most Unix and Microsoft operating systems, there is
another "memory exceeded" condition which cannot be trapped by CASTEP. This occurs when
the stack memory is exhausted, and the result is that the process is killed with a "segmentation fault"
on Unix/Linux. This is harder to diagnose, but be aware that there are O/S-enforced stack limits
which might be much smaller than the physical memory in the system. Google for "process stack
limits stacksize" for more information. The shell command ulimit -s unlimited can be used to
increase the stack size (bash shells).
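To see which stack limit is currently in force, the following sketch uses Python's standard resource module, which is available on Unix-like systems only. It reports the limits of the Python process itself; a CASTEP job started from the same shell would normally inherit the same values, but that is an assumption about your environment, and batch systems may impose their own limits. The helper names are hypothetical.

```python
import resource


def format_limit(value):
    """Render an rlimit value; RLIM_INFINITY means no limit is enforced."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # The limit is given in bytes; ulimit -s reports units of 1024 bytes.
    return f"{value // 1024} kB"


def describe_stack_limit():
    """Report the soft and hard stack limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return f"soft={format_limit(soft)}, hard={format_limit(hard)}"


if __name__ == "__main__":
    print(describe_stack_limit())
```

A soft limit of a few thousand kB alongside an unlimited hard limit suggests that ulimit -s unlimited will be allowed to take effect.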
CASTEP error messages explained
It is intended that the error messages CASTEP writes to the seedname.nnnn.err files are as far
as possible self-explanatory. Unfortunately it is not always possible to give useful "end-user"
explanations. Here are some commonly encountered abort messages with some explanation.
* ERROR: cell_read - failure to open freeform cell file /filename/
* Error model_continuation: Failed to open file /filename/
CASTEP was unable to open the input files for the run specified on the command line, probably
because there is no file of that name. Check your command lines and input files.
* Error in allocating /variable/ in /function/ (CASTEP versions <= 4.0.1)
* Out of RAM for /variable/ in /function/ (CASTEP versions >= 4.1)
This common error means that CASTEP ran out of memory. See section "Running out of
memory" for more information.
* Error reading wavefunction coefficients from file in wave_read_all_ser/par
This message, or a similar one, means that CASTEP was attempting to read a continuation file but the
read failed. This is commonly because the .check file is truncated or corrupt. The wavefunction
coefficients are fairly far down the file, after the parameters and cell data, and if the read got that
far before failing, it is likely that the file was truncated. This can happen if the previous CASTEP
run crashed or was killed while writing the .check file. Check to see if the file size is consistent
with any similar .check files you may have.
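The size comparison can be sketched as below. The function name and the 0.9 threshold are assumptions chosen for illustration; .check files from calculations with different cells, cutoffs or k-point sets legitimately differ in size, so only compare against files from genuinely similar runs.

```python
from pathlib import Path


def suspiciously_small(check_file, reference_files, ratio=0.9):
    """Return True if check_file is smaller than `ratio` times the smallest reference."""
    reference_sizes = [Path(f).stat().st_size for f in reference_files]
    if not reference_sizes:
        raise ValueError("need at least one reference .check file")
    size = Path(check_file).stat().st_size
    # A truncated file is expected to be clearly smaller than any complete one.
    return size < ratio * min(reference_sizes)
```

A result of True is only a hint that the file may be truncated, not proof of it.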
* Trapped SIGINT or SIGTERM. Exiting... (CASTEP versions <= 4.0.1)
This message is generated by an otherwise useless signal handler in earlier versions of CASTEP. It
means that CASTEP was killed by an external signal. Diagnosis should proceed as for major case
(3)
* Error check_elec_ground_state : electronic_minimisation of initial cell failed.
* Error calculate_finite_basis : Convergence failed when doing finite basis set
correction.
* Error in /subroutine/ - electronic_minimisation of current_cell failed
Any of these messages means that the SCF convergence loop did not converge in the maximum
allowed number of iterations. If you read the end of the .castep file it ought to be obvious whether
the run only just failed to converge. In that case specifying a larger value of MAX_SCF_CYCLES
in the .param file ought to work. But sometimes it is apparent that the energy is unlikely ever to
converge, for example it may oscillate, or be decreasing linearly and slowly. This may indicate
that the system is in a poorly-bonded or co-ordinated state, and it's best to ask advice if you don't
know how to proceed.
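The distinction between "only just failed" and "unlikely ever to converge" can be made roughly quantitative. The sketch below is a heuristic, not CASTEP's own test: it takes a list of per-iteration total energies that you have copied by hand from the end of the .castep file, plus the convergence tolerance in the same units. The function name, the factor of ten and the sign-flip criterion are all assumptions.

```python
def describe_energy_trend(energies, tolerance):
    """Roughly classify the tail of a sequence of SCF total energies."""
    if len(energies) < 3:
        raise ValueError("need at least three energies")
    # Change in energy between consecutive SCF iterations.
    steps = [b - a for a, b in zip(energies, energies[1:])]
    # Close to convergence: the last change is within a factor of ten of the tolerance.
    if abs(steps[-1]) < 10 * tolerance:
        return "nearly converged"
    # Oscillation: consecutive energy changes keep flipping sign.
    flips = sum(1 for s, t in zip(steps, steps[1:]) if s * t < 0)
    if flips >= len(steps) // 2:
        return "oscillating"
    return "still drifting"
```

A result of "nearly converged" suggests that a larger MAX_SCF_CYCLES may be enough; either of the other two results corresponds to the cases above where it is better to ask for advice.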
* Error in parameters_restore: missing END_GENERAL
This can occur on a continuation run where the .check file used for restart is incompatible with the
version of CASTEP you are using. We aim for nearly full compatibility, but there are always
exceptions.