WRF Errors - CFL Errors, SIGSEGV Segmentation Errors, and Hanging or Stopping

I am writing this in July 2012 and have not run WRF in about a year. Bugs get fixed. Things change. What I have on this page may be, or may become, out of date. Regardless, this page does not have a “solution”; it only contains things to try when WRF won't work. In my experience, WRF is more fragile than it should be. Even after I get something to work, it sometimes stops working for no apparent reason. It has been frustrating to me. I feel your pain. I can't make it go away. Sorry. I wish I knew more so I could help. This page is basically just a way to give you some more ideas to try when you feel like quitting.

If you find some tricks that help, let me know so I can include them on this page. If you want, I'll also include your name and/or email and/or affiliation with the information.

CFL Errors

Looking at the code, CFL errors are generally caused by vertical winds that are too fast for WRF to solve. For me, they generally happen over high mountain peaks.

Even though vertical winds are slow compared to horizontal winds, grid cells have very short vertical dimensions compared to their horizontal dimensions. So first try to reduce the time step. Shorter time steps mean that the wind won't travel all the way through the grid cell in one time step. (That oversimplifies the real way WRF handles such things, but the idea is approximately correct.)
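The idea above can be put into numbers with a rough sketch. It assumes the simplified picture from the paragraph (distance traveled in one time step compared to the height of the cell), not the actual check inside WRF. The function name and the sample values are made up for illustration.

```python
def vertical_courant(w_ms, dt_s, dz_m):
    """Fraction of a grid cell's height that air moving at w_ms crosses in one time step."""
    return abs(w_ms) * dt_s / dz_m


# Hypothetical numbers: a 5 m/s vertical wind in a 50 m thick layer.
for dt in (60.0, 30.0, 10.0):
    c = vertical_courant(5.0, dt, 50.0)
    status = "crosses the whole cell" if c >= 1.0 else "stays within the cell"
    print(f"dt = {dt:4.0f} s  ratio = {c:.1f}  ({status})")
```

Halving the time step halves the ratio, which is why a shorter time step is the first thing to try.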

Another simple fix that is relatively unknown is changing the epssm value in the dynamics section of WRF's namelist.input file. Each time step in WRF is divided up into three smaller time steps. This allows solving the equations using longer time steps. The three sub-time-steps are not exactly equal. If they were, repeating numerical patterns would result that look like waves of values at the acoustic frequency. A slight offset in the length of the smaller time steps helps prevent this. The epssm value controls the offset. But sometimes the default value is not enough to prevent the patterns or waves, and the extreme values in a wave cause the CFL errors. So try a different value for epssm. The default is 0.1, so try 0.3 instead, or a couple of other values. I forget the allowed range.
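If you want to try several epssm values without hand-editing the file each time, here is a minimal sketch of a helper that changes one variable in one section of namelist text. The helper name and the sample namelist are made up for illustration, and it assumes each variable sits on its own line. Some WRF versions may expect one epssm value per domain, so check the namelist documentation that ships with your version before relying on the single value shown here.

```python
import re


def set_namelist_value(text, section, name, value):
    """Return namelist text with `name = value,` set inside the `&section ... /` block."""
    # Find the block from its "&section" header to the "/" line that closes it.
    block_re = re.compile(
        rf"^[ \t]*&{re.escape(section)}\b.*?^[ \t]*/",
        re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    block = block_re.search(text)
    if block is None:
        raise ValueError(f"section &{section} not found")
    body = block.group(0)
    entry = f" {name} = {value},"
    line_re = re.compile(
        rf"^[ \t]*{re.escape(name)}[ \t]*=.*$", re.IGNORECASE | re.MULTILINE
    )
    if line_re.search(body):
        # Variable already present: replace its whole line.
        new_body = line_re.sub(lambda match: entry, body, count=1)
    else:
        # Variable absent: add it just before the closing "/".
        head = body.rsplit("/", 1)[0].rstrip(" \t")
        new_body = head + entry + "\n/"
    return text[: block.start()] + new_body + text[block.end():]


# Made-up namelist fragment, for illustration only.
sample = "&domains\n time_step = 60,\n/\n&dynamics\n w_damping = 0,\n epssm = 0.1,\n/\n"
print(set_namelist_value(sample, "dynamics", "epssm", 0.3))
```

The function only returns new text. Writing it back to namelist.input is left to you, so a typo cannot wreck a working file.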

Obviously for very long runs, you can't use very short time steps, otherwise it takes too long to complete. I haven't tried it personally, but for long climate downscaling runs, some folks use the longer time step with frequent “restart” intervals. When the CFL errors occur, WRF is stopped and the run is restarted from the last good saved restart file, but with a short time step. After a while, and after one or more good restart saves at the short time step, the model is stopped again, the time step is increased back to its normal value, and the run is continued. Basically, reduce the time step only for the relatively few periods that have errors. It takes some close monitoring, but you can decide for yourself whether the extra people time is worth it to get an overall shorter runtime.
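The close monitoring can be partly automated. The sketch below assumes that WRF writes its per-process logs to files named rsl.error.* in the run directory and that CFL complaints contain the text "cfl". Check both against your own logs. The function names are made up for illustration.

```python
from pathlib import Path


def find_cfl_lines(lines):
    """Return the lines that mention CFL, with trailing whitespace removed."""
    return [line.rstrip() for line in lines if "cfl" in line.lower()]


def scan_run_directory(run_dir="."):
    """Map each rsl.error.* file in run_dir to the CFL lines it contains."""
    hits = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        with open(path, errors="replace") as handle:
            found = find_cfl_lines(handle)
        if found:
            hits[path.name] = found
    return hits


if __name__ == "__main__":
    for name, found in scan_run_directory().items():
        print(f"{name}: {len(found)} CFL message(s), first: {found[0]}")
```

Run it from the WRF run directory every so often. If it prints anything, that is the cue to stop the run and restart from the last good restart file with a shorter time step.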

For me, CFL errors were more common at the beginning of a run. Some people recommend that you do not use the first 8 or 12 hours of a run because WRF is “spinning up.” It takes a while for the low resolution weather data that is used to initialize WRF to smooth out. It also takes time for clouds to develop within the model and become a weather-affecting factor. During that time, there are waves of changes that cross the grid several times, causing less than realistic phenomena. If you get errors during the first part of a run, try to start the run at a slightly earlier time; a different time might not have the error-causing conditions, and things might smooth out enough by the model time that is interesting to you.

If you run the same grids lots of times, here are a few things that may reduce the chances of getting CFL errors during some of those runs.

First, eliminate high peaks near the edges of your grids, both inner and outer. The steepness of a peak causes more vertical winds within the model. And because of the change in resolution, there are sometimes “reflections” of meteorological values off the edges of a grid. This is mostly a numerical phenomenon, but it causes slightly increased or reduced values closer to the grid borders as the waves reflect back onto themselves. Having a high peak there can trigger additional extremes and thus CFL errors. And since a corner has two edges, corners are bad places to have peaks.

Second, increase the height of your grid cells. It takes more time for vertical winds to cross a tall grid cell, so it is less likely to cause a CFL error.

Third, increase the vertical damping. WRF has a couple of ways to do this depending on the other options chosen. Read the WRF Users Guide and figure out whether and how to use them. This slows down the vertical winds; maybe you don't want that, but it helps with CFL errors.

Fourth, smooth the peaks. The WPS process has an option to smooth the terrain, along with a number of passes to apply. WRF also has some namelist options. Figure out what is tolerable and use those.
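To get a feel for what the number of smoothing passes does to a peak, here is a small one-dimensional sketch. It assumes a simple 1-2-1 weighted average, which is one common smoothing filter. It is not the WPS code, which works on the two-dimensional terrain field. The function name and the terrain values are made up for illustration.

```python
def smooth_121(heights, passes=1):
    """Apply `passes` rounds of a 1-2-1 weighted average; the end points are left unchanged."""
    values = list(heights)
    for _ in range(passes):
        previous = values[:]  # read from a copy so each pass uses only the old values
        for i in range(1, len(values) - 1):
            values[i] = 0.25 * previous[i - 1] + 0.5 * previous[i] + 0.25 * previous[i + 1]
    return values


# Made-up terrain: a single 2000 m spike on a flat plain.
terrain = [0.0, 0.0, 0.0, 2000.0, 0.0, 0.0, 0.0]
for passes in (0, 1, 3):
    print(passes, "passes, highest point:", max(smooth_121(terrain, passes)))
```

Each pass lowers the peak and spreads it out over the neighboring cells, so more passes mean gentler slopes but less realistic terrain.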

SIGSEGV Segmentation Faults and Stopping or Hanging

Sorry, I don't know what causes WRF to hang or stop producing output even though the run has not errored and ended. Sometimes WRF just stops producing output. The processors on which it is running sometimes show that they are busy; sometimes not. Sometimes the program stops with a “segmentation fault” (SIGSEGV) message. A segmentation fault happens when the program tries to access a memory location that it does not control. The operating system sends a “SIGSEGV” signal, which kills the program. Using some of the tricks that fix the CFL errors has sometimes fixed these too.

Here are some other things that sometimes work for me. None of them work all the time. And some of them work for some runs but not others, even if I have the same grid and initialization data. If you are having these problems, be prepared for frustration and missed deadlines.

First, try NOT using the multithreaded compile options. This is the smpar option during the configure part, just before compiling. Instead, if you have multiple cores on a node, use the dmpar option and run additional copies of WRF. Your mpirun -np or mpiexec -np command can ask for more processes than the number of nodes you have, and it will start more than one WRF process on each node. For me, WRF is less efficient if I use all the cores on a node, so I leave some idle. Yes, it is a waste of resources, but it is better than nothing.

Second, change the number of nodes used. I don't know why it matters, but it has made the difference for me between something running and not running.

Third, just start changing options. Make some big changes until something works. Then use that to figure out what smaller changes might work.

And let me say again that some of the things that fix the CFL errors also sometimes help with the segmentation faults and other program stops. Changing the time step, start time, or grid size/location is most likely to help.

I haven't tried it, but if you use the multithreading option in the compile (shared memory/smpar), setting the environment variable OMP_STACKSIZE to 4G might help. I read that recently in an email distributed to WRF users. Values other than 4G might also work, depending on how much memory each node has. You probably have to put it in the job script, because I think it takes effect at run time rather than at compile time.