Writing this in July 2012, I have not
run WRF in about a year. Bugs get fixed. Things change. What I have
on this page may be or may become out of date. Regardless, this page does not
offer a “solution”; it only contains things to try when WRF won't
work. In my experience, WRF is more fragile than it should be. Even
after I get something to work, sometimes it later fails for no
apparent reason. It has been frustrating to me. I feel your pain. I
can't make it go away. Sorry. I wish I knew more so I could help.
This page is basically just a way to give you some more ideas to try
when you feel like quitting.
If you find some tricks that help, let
me know so I can include them on this page. If you want, I'll also
include your name and/or email and/or affiliation with the
information.
CFL Errors
Looking at the code, CFL errors are
generally caused by vertical winds that are too fast for WRF to
solve. For me, they generally happen over high mountain peaks.
Even though vertical winds are slow
compared to horizontal winds, grid cells have very short vertical
dimensions compared to their horizontal dimensions, so the vertical
direction usually runs into the stability limit first. So the first thing to try is
reducing the time step. With a shorter time step, the wind won't
travel all the way through a grid cell in one step. (That
oversimplifies how WRF actually handles such things, but the idea is
approximately correct.)
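For reference, the time step is the time_step entry in the &domains section
of namelist.input. The usual rule of thumb is roughly 6 seconds per kilometer
of grid spacing on the outermost domain, and dropping well below that is the
easiest first response to CFL errors. Something like this for a hypothetical
12 km outer grid (the comments are only notes for this page; strip them out if
your setup complains):

    &domains
     time_step = 72,    ! roughly 6 s per km for a 12 km outer domain
                        ! if CFL errors show up, try 48 or even 36
    /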
Another simple fix that is relatively
unknown is changing the epssm value in the &dynamics section of WRF's
namelist.input file. Each WRF time step is divided into several
smaller acoustic sub-steps, which is what lets WRF use a relatively long
main time step. Those sub-steps are not treated exactly symmetrically in
time; if they were, repeating numerical patterns can build up that look like
waves of values at the acoustic frequency. A slight off-centering of the
sub-steps helps prevent this, and epssm controls that off-centering.
Sometimes the default value is not enough to prevent the patterns
or waves, and the extreme values in such a wave cause the CFL errors. So
try a different value for epssm. The default is 0.1, so try 0.3
instead, or a couple of other values. I forget the allowed range.
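In the namelist that change looks something like this; the exact default and
allowed range may differ by WRF version, so check the documentation for yours:

    &dynamics
     epssm = 0.3,    ! default is about 0.1; try a somewhat larger value if CFL errors persist
    /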
Obviously, for very long runs you can't use a very short time step for the
whole run, or it takes too long to complete.
I haven't tried it personally, but for long climate-downscaling runs,
some folks use the longer time step plus frequent “restart”
intervals. When CFL errors occur and WRF stops, the run is
restarted from the last good saved restart file but with a short time step. After a while,
and after one or more good restart files have been written at the short time step, the model is
stopped again, the time step is put back to its normal value, and the run is continued.
Basically, you reduce the time step only for the relatively few periods
that cause errors. It takes some close monitoring, but you can decide
for yourself whether the extra person-time is worth the shorter overall runtime.
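The pieces involved live in &time_control and &domains; something like this is
the idea, with the numbers purely illustrative:

    &time_control
     restart          = .false.,   ! set to .true. when resuming from a wrfrst_* file
     restart_interval = 360,       ! write a restart file every 6 hours of model time
    /
    &domains
     time_step = 72,               ! lower this for the stretch that triggers the errors
    /

When the run dies with CFL errors, flip restart to .true., lower time_step,
rerun from the last good wrfrst_* file, and after a few clean restart dumps
raise time_step back up and keep going.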
For me, CFL errors were more common at
the beginning of a run. Some people recommend that you not use the
first 8 or 12 hours of a run because WRF is “spinning up.” It
takes a while for the low-resolution weather data used to initialize
WRF to smooth out, and it takes time for clouds to develop within the model and become a
weather-affecting factor. During that time, waves of
changes cross the grid several times, causing less-than-realistic phenomena.
If you get errors during the first part of a run, try starting the
run at a slightly earlier time; a different start time might not have the
error-causing conditions, and things might smooth out enough by the
model time you actually care about.
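Shifting the start time means changing the start_* entries in &time_control
(one value per domain) and making sure real.exe has input files covering the
earlier time, which may mean re-running WPS with an earlier start_date. A
made-up example, moving a 06Z start back to 00Z on a two-domain run:

    &time_control
     start_year  = 2010, 2010,
     start_month = 07,   07,
     start_day   = 15,   15,
     start_hour  = 00,   00,    ! was 06; the extra hours become spin-up
    /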
If you run the same grids lots of
times, here are a few things that may reduce the chances of
getting CFL errors during some of those runs. First, eliminate high peaks near the edges of
your grids, both inner and outer. The steepness of a peak causes stronger vertical winds
within the model. And because of the change in
resolution at a grid boundary, there are sometimes “reflections” of meteorological
values off the edges of a grid. This is mostly a numerical phenomenon, but it
causes slightly increased or reduced values near the grid
borders as the waves reflect back onto themselves. A high peak there can trigger additional extremes and
thus CFL errors. And since a corner has two edges, corners are especially bad places for peaks.
Second, increase the height of your grid cells. It
takes more time for vertical winds to cross a tall grid cell, so a tall cell
is less likely to cause a CFL error. Third, increase the vertical
damping. WRF has a couple of ways to do this depending on the other
options chosen; read the WRF Users Guide and figure out whether and how to use them.
Damping slows down the vertical winds, which you may not
want, but it helps with CFL errors. Fourth, smooth the peaks.
The WPS geogrid step has a smoothing option and a number of smoothing passes for the
terrain, and WRF also has some related namelist options. Figure out what is
tolerable and use those.
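For the second and third suggestions, the knobs I have in mind look roughly
like this; names and defaults vary some between WRF versions, so check the
Users Guide before copying anything:

    &dynamics
     w_damping = 1,    ! damp unrealistically large vertical velocities
    /
    &domains
     e_vert = 35,      ! fewer levels under the same model top means taller grid cells
    /

For the fourth, the terrain smoothing I remember is set per field in geogrid's
GEOGRID.TBL (the smooth_option and smooth_passes entries for HGT_M) rather
than in namelist.wps.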
SIGSEGV Segmentation Faults and Stopping or Hanging
Sorry, I don't know what causes WRF to hang or stop producing output even when the run has not errored out and ended.
Sometimes WRF just stops producing output; the processors it is
running on sometimes show that they are busy, sometimes not. Sometimes
the program stops with a “segmentation fault” (SIGSEGV) message.
A segmentation fault happens when the program tries to access a memory
location it does not control; the operating system then sends a
“SIGSEGV” signal, which kills the program. Using some of the
tricks that fix the CFL errors has sometimes fixed these too.
Here are some other things that sometimes
work for me. None of them work all the time, and some of them work
for some runs but not others, even with the same grid and
initialization data. If you are having these problems, be prepared
for frustration and missed deadlines.
First, try NOT using the multithreaded
compile option. This is the smpar choice during the configure step, just before compiling.
Instead, if you have multiple cores
on a node, use the dmpar option and run additional MPI copies of WRF: the -np
value you give mpirun or mpiexec can be larger than the number of nodes you have, and
the extra tasks will be placed so that more than one WRF process runs on each node. For me, WRF is less
efficient if I use all the cores on a node. Yes, leaving cores idle is a waste of
resources, but it is better than nothing. Second, change the number
of nodes used. I don't know why it matters, but it has made the
difference for me between a run working and not working. Third,
just start changing options: make some big changes until something
works, then use that to figure out what smaller changes might work.
And let me say again that some of the things that fix the CFL errors
also sometimes help with the segmentation faults and other program
stops. Changing the time step, the start time, or the grid size/location are the most
likely to help.
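As a concrete example of running more MPI tasks than nodes with a dmpar build:
on, say, 4 nodes with 8 cores each, leaving a couple of cores per node idle
might look like this (the hostfile name is a placeholder, and the exact flags
depend on your MPI implementation and batch system):

    # dmpar (MPI-only) wrf.exe on 4 nodes, 6 tasks per node instead of all 8 cores
    mpirun -np 24 -machinefile ./hosts ./wrf.exe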
I haven't tried it, but if you use the
multithreading option in the compile (shared memory/smpar), setting
the environment variable OMP_STACKSIZE
to 4G might help. I read that recently in an email distributed to WRF users.
Values other than 4G might be appropriate, depending on how much memory each node has.
You probably have to set it in the job script, because I believe it controls
behavior at run time rather than compile time.
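In a bash job script that would look something like this (adjust the value and
the launch line for your own system):

    # for an smpar (OpenMP) build, set the per-thread stack size before starting WRF
    export OMP_STACKSIZE=4G
    ./wrf.exe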