WRF Benchmarks and Profiling

Here Are My Benchmark Results

I ran some runtime tests of WRF on a single four-core processor to see how different compile options help, or don't help, with run times. The short version of the conclusion is that WRF can now effectively use shared memory on multiple-core machines; previous versions did not seem to do so well. BUT BE WARNED: THE WRF USER PAGE SAYS SOME PHYSICS OPTIONS CAN NOT BE USED WITH SHARED MEMORY. For my tests, shared memory helped a lot. If you are not familiar with the terms "shared memory" and "distributed memory," I have a simplified explanation below.

Every processor, memory, and motherboard combination will have different characteristics. I understand hardware with an HPC programmer's understanding (meaning I not only know what L2 cache is, but I know what cache lines are and why they should be managed), but I do not keep up with current hardware, so don't treat this as a buying recommendation. It's just my experience with my hardware.

I used two runs for each combination tested. They are for the same set of domains and options in the same location, southern Idaho, but for two different times of the same day: the first at 9:00 GMT, the second at 19:00 GMT. I was surprised that there was so much difference between the two, and I expected the daytime run, with its radiation calculations, to be longer, but the opposite was true. The run times below are given as minutes:seconds. Farther down I provide more details about my namelist and my hardware, but let's get to the timed tests:

Serial compile: 23:13 and 16:33

Distributed memory (dmpar):

 mpiexec -n ?       Times
 no MPI, wrf.exe    23:22 and 16:40
 1                  23:13 and 16:32
 2                  18:54 and 14:42
 3                  failed to run
 4                  16:52 and 14:11


Shared Memory (smpar):

 OMP_NUM_THREADS    Times
 1                  23:13 and 16:24
 2                  14:12 and 10:08
 3                  11:12 and 7:40
 4                  10:04 and 7:01


Distributed and Shared Memory (dm+sm):

 OMP_NUM_THREADS    mpiexec -n ?    Times
 1                  2               18:54 and 14:50
 1                  4               16:59 and 13:57
 2                  2               14:36 and 11:18
 2                  4               failed to run
 4                  1               10:06 and 6:50
 4                  2               12:18 and 9:10
 4                  4               12:3 and 9:29


If you didn't already know, you can control the number of threads the shared memory code can use by setting an environment variable, e.g. export OMP_NUM_THREADS=4. My processor has four cores, so setting it to 4 is redundant, but I did a couple of tests with OMP_NUM_THREADS not set and got very similar results to setting it to 4. If you are wondering why I sometimes limited the tests to using less than the maximum processing power, there are two reasons. First, it is a way to help check that the other results make sense. Second, I often use the same computer to run other interactive programs while I have a long (several hours) WRF run going, and I would rather penalize the WRF run time so the other programs don't have to compete with it while I'm sitting and waiting for them to do something.
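
To make the tables above concrete, here is roughly how each mode gets launched from the run directory; treat this as a sketch, since your MPI launcher and shell setup may differ:

 # shared memory (smpar) build: the thread count comes from the environment
 export OMP_NUM_THREADS=4
 ./wrf.exe

 # distributed memory (dmpar) build: the process count comes from mpiexec
 mpiexec -n 4 ./wrf.exe

 # combined (dm+sm) build: both at once, e.g. 2 processes with 2 threads each
 export OMP_NUM_THREADS=2
 mpiexec -n 2 ./wrf.exe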


What is Shared and Distributed Memory?

As you dig deeper into the details of how computers do things, they get more complicated. This is my attempt at a simple and practical explanation of how the shared and distributed memory compile options are implemented in WRF. Computers that are not PCs can have different, exotic configurations. First, a PC can have more than one processor in it, and each processor can have one or more cores. For practical purposes, think of each core as a separate processor. But they all share the same memory, mainly because they are in the same box. If all of them are working on the same WRF domain, they split the work among themselves. For instance, your domain has rows, columns, and vertical levels; each core might take a different piece and handle the calculations within it. But some parts of the simulation have to be done step by step by one processor, so doubling the number of cores will not halve the runtime. The software used to divide up the work (and handle the intricate bookkeeping to keep things straight) is called OpenMP. During compilation you may see references to it as OMP, and it may look like most things are compiled without OMP, but don't worry; if something compiled with OMP calls something compiled without OMP, which happens a lot, the called portion is still effectively split up among cores (programmers: please don't complain about the semantics of that last statement).
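
If it helps to picture what OpenMP does, here is a minimal sketch in C (not WRF code; the array names, sizes, and the smoothing step are all made up) of a loop over grid rows being split among threads that all work on the same arrays in the same memory:

 /* toy_omp.c -- a minimal sketch, not WRF code: OpenMP splits the rows of
    one grid among the cores of a single machine; every thread reads and
    writes the same shared arrays */
 #include <stdio.h>
 #include <omp.h>

 #define NROWS 300
 #define NCOLS 300

 static double t_now[NROWS][NCOLS], t_next[NROWS][NCOLS];

 int main(void)
 {
     /* OMP_NUM_THREADS (or the core count) decides how many threads share
        this loop; each thread gets its own block of row indices */
     #pragma omp parallel for
     for (int i = 1; i < NROWS - 1; i++)
         for (int j = 1; j < NCOLS - 1; j++)
             /* a toy smoothing step standing in for the real physics */
             t_next[i][j] = 0.25 * (t_now[i-1][j] + t_now[i+1][j]
                                  + t_now[i][j-1] + t_now[i][j+1]);

     printf("used up to %d threads\n", omp_get_max_threads());
     return 0;
 }

Built with something like gcc -fopenmp toy_omp.c, the same executable will use however many threads OMP_NUM_THREADS allows, which is exactly the knob in the shared memory table above.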


Distributed memory is somewhat similar, but the differences are important. With distributed memory, your WRF domain is divided into pieces among more than one box. The processors in each box have their own memory; it is distributed, not shared. The catch is that while WRF in one box is working on its piece of your domain, it also has to share the values that are changing along the dividing lines with the other computers. Weather moves. To share that information, the values have to be gathered, bundled, and sent to the computer(s) that need them after every time step. The time spent sending prevents the run time from being halved when you double the number of computers. This is where an expensive network helps. Most business networks use Ethernet. It can send a lot of information fast but has high latency, meaning that even very short messages are relatively slow. Expensive networks like Myrinet and InfiniBand have much lower latency. If you're going to connect a lot of computers together to run WRF, you might be better off buying fewer computers and more expensive networking so the computers can be used more effectively. Anyone who doesn't understand how much oversimplifying I am doing here should definitely get HPC cluster professionals to help make such decisions.
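
The message passing itself is handled by MPI, which is what mpiexec launches. Here is a minimal sketch in C (again not WRF code; the slab size and names are invented) of the boundary exchange described above, where each process trades its edge rows with its neighbors after a time step:

 /* toy_mpi.c -- a minimal sketch, not WRF code: each MPI rank owns a slab
    of rows plus one "halo" row above and below for its neighbors' data */
 #include <stdio.h>
 #include <mpi.h>

 #define NCOLS  300
 #define MYROWS 100   /* rows owned by each rank */

 int main(int argc, char **argv)
 {
     int rank, nranks;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nranks);

     static double slab[MYROWS + 2][NCOLS];   /* rows 1..MYROWS are mine */
     int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
     int down = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

     /* ... advance one time step on rows 1..MYROWS here ... */

     /* swap edge rows with the neighbors so the next step has fresh
        boundary values: my first row goes up, my last row goes down */
     MPI_Sendrecv(slab[1],          NCOLS, MPI_DOUBLE, up,   0,
                  slab[MYROWS + 1], NCOLS, MPI_DOUBLE, down, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     MPI_Sendrecv(slab[MYROWS],     NCOLS, MPI_DOUBLE, down, 1,
                  slab[0],          NCOLS, MPI_DOUBLE, up,   1,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);

     if (rank == 0) printf("halo exchange done among %d ranks\n", nranks);
     MPI_Finalize();
     return 0;
 }

On a cluster, those MPI_Sendrecv calls are what cross the network after every time step, which is where latency matters.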


A computer with multiple cores can be set up and run in distributed memory mode. It will go through some of the motions of sending information to the other cores, but the data will not actually go through the hardware and network. It stays in memory. In some instances this could be helpful because of L2 cache management issues. If you have an operational grid you typically use, distributed memory within a single box is worth trying once to see if it helps performance. In some instances, sharing a run between two networked computers could more than halve the run time. This would happen if the domains are too big to fit in one computer's memory. Being too big causes "virtual memory" to be used, and that is much slower than sending part of the data across the network.


Profiling WRF to Determine How Run Time is Spent

If you are not aware of it, instrumentation can be added to a program so that when it runs, you can tell where it spends its time. That is called profiling. I used gfortran and gcc to compile WRF, and I added the -pg option in the configure.wrf file to build in that instrumentation. Then I ran WRF and looked at the results. My motivation was that I wanted to do some high resolution runs for wind forecasting, and I wondered whether I could turn off or modify some physics and dynamics options to improve run times. I need accurate wind forecasts, and someone once told me to turn off microphysics to help runtimes. But microphysics only accounted for around 5% of processor use. Someone else said to reduce the radiation calls with "radt" in the namelist, but that seemed to have no real effect when I changed it from 10 minutes to 20 minutes.

The big hogs were one or two advection routines and a subroutine called sintb. Changing some of the advection options made small changes to the times and to which advection routines were called, but they generally accounted for close to 40% of the processor time. The sintb routine took about 16-18%. I think it has something to do with in-column movement, so it may be a type of advection; it was hard to tell from the uncommented code. I might be able to save a few percent here and there with my physics and dynamics option choices, but not enough to matter much. I was using the same namelist as in the shared and distributed memory tests (but run serially), and that is listed farther down below. Here is a typical sample of the output from the profiler:


Sample #1:

  %   cumulative   self              self     total          
 time   seconds   seconds    calls   s/call   s/call  name   
 21.36    185.04   185.04    10416     0.02     0.02  __module_advect_em_MOD_advect_scalar_mono
 19.01    349.68   164.64    26040     0.01     0.01  __module_advect_em_MOD_advect_scalar
 17.79    503.81   154.13  1018872     0.00     0.00  sintb_
  4.77    545.12    41.31  1740512     0.00     0.00  __module_mp_lin_MOD_clphy1d
  3.49    575.39    30.27    15624     0.00     0.00  __module_big_step_utilities_em_MOD_horizontal_diffusion
  3.36    604.50    29.11     5208     0.01     0.01  __module_big_step_utilities_em_MOD_calc_cq
  2.17    623.29    18.79    12152     0.00     0.00  __module_small_step_em_MOD_advance_uv
  2.08    641.32    18.03    31248     0.00     0.00  __module_em_MOD_rk_update_scalar
  1.97    658.41    17.09    12152     0.00     0.00  __module_small_step_em_MOD_advance_mu_t
  1.81    674.07    15.66    12152     0.00     0.00  __module_small_step_em_MOD_advance_w
  1.23    684.76    10.69    57440     0.00     0.00  __module_bl_ysu_MOD_ysu2d
  1.00    693.40     8.64    20028     0.00     0.00  __module_bc_MOD_relax_bdytend

Sample #2:

  %   cumulative   self              self     total          
 time   seconds   seconds    calls   s/call   s/call  name   
 37.76    328.08   328.08    36456     0.01     0.01  __module_advect_em_MOD_advect_scalar
 18.81    491.50   163.42  1018872     0.00     0.00  sintb_
  5.18    536.52    45.02  1740512     0.00     0.00  __module_mp_lin_MOD_clphy1d
  4.35    574.29    37.77     5208     0.01     0.01  __module_big_step_utilities_em_MOD_calc_cq
  4.00    609.06    34.77    15624     0.00     0.00  __module_big_step_utilities_em_MOD_horizontal_diffusion
  2.46    630.43    21.37    31248     0.00     0.00  __module_em_MOD_rk_update_scalar
  2.19    649.42    18.99    12152     0.00     0.00  __module_small_step_em_MOD_advance_uv
  2.04    667.16    17.74    12152     0.00     0.00  __module_small_step_em_MOD_advance_mu_t
  1.74    682.32    15.16    12152     0.00     0.00  __module_small_step_em_MOD_advance_w
  1.26    693.26    10.94    57440     0.00     0.00  __module_bl_ysu_MOD_ysu2d
  1.18    703.55    10.29    20028     0.00     0.00  __module_bc_MOD_relax_bdytend
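
In case anyone wants to reproduce this kind of output: the workflow behind it is the standard gprof one, and a rough sketch of the steps follows (the exact flag lines to edit in configure.wrf vary between WRF versions, so treat the details as an example, not a recipe):

 # add -pg to the Fortran and C compile flags (and the link flags) in
 # configure.wrf, then rebuild WRF as usual
 ./compile em_real

 # run the case; the instrumented wrf.exe writes gmon.out in the run directory
 ./wrf.exe

 # turn gmon.out into the flat profile shown above
 gprof ./wrf.exe gmon.out > profile.txt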


My Computer and My namelist.input File

My computer has a four-core Intel processor, a Q9550 running at 2.83 GHz with 12 MB of L2 cache, and 4 GB of memory running at 1333 MHz.

During the serial/shared/distributed tests, I noticed that when WRF was not running, 675 MB of memory was used by the operating system. During the serial and shared memory runs, used memory went up to 750 MB. During the distributed runs (all on the one machine), used memory went up to 1200 MB for 4 MPI instances, and a little less for fewer instances.


And here is the namelist.input file with some of the uninteresting parts removed:

 &time_control
 run_hours                           =2,
 start_year                          =2009, 2009, 2009,
 start_month                         =01, 01, 01,
 start_day                           =04, 04, 04,
 start_hour                          =19, 19, 19,
 end_year                            =2009, 2009, 2009,
 end_month                           =01, 01, 01,
 end_day                             =04, 04, 04,
 end_hour                            =21, 21, 21,
 interval_seconds                    =3600,
 /

 &domains
 time_step                           = 54,
 max_dom                             =3,
 s_we                                = 1,1,1,
 e_we                                =53,37,31,
 s_sn                                = 1,1,1,
 e_sn                                =53,37,31,
 s_vert                              = 1,     1,     1,
 e_vert                              = 44,    44,    44,
 num_metgrid_levels                  = 38
 p_top_requested                     = 10000.00
 dx                                  = 9000,3000,1000
 dy                                  = 9000,3000,1000
 grid_id                             = 1,     2,     3,
 parent_id                           =1,1,2,
 i_parent_start                      =1,21,14,
 j_parent_start                      =1,21,14,
 parent_grid_ratio                   =1,3,3,
 parent_time_step_ratio              = 1,     3,     3,
 feedback                            = 1,
 smooth_option                       = 0
 eta_levels   = 1.0000, 0.9960, 0.9920, 0.9880, 0.9840,
                0.9797, 0.9742, 0.9675, 0.9596, 0.9505,
                0.9399, 0.9269, 0.9115, 0.8937, 0.8735,
                0.8460, 0.8300, 0.8120, 0.7920, 0.7680,
                0.7360, 0.7020, 0.6660, 0.6290, 0.5915,
                0.5536, 0.5153, 0.4773, 0.4400, 0.4040,
                0.3695, 0.3375, 0.3085, 0.2645, 0.2305,
                0.2035, 0.1792, 0.1539, 0.1272, 0.0995,
                0.0713, 0.0429, 0.0145, 0.0000,
 /

 &physics
 mp_physics                          = 2,     2,     2,
 ra_lw_physics                       = 1,     1,     1,
 ra_sw_physics                       = 1,     1,     1,
 radt                                = 20,    20,    20,
 sf_sfclay_physics                   = 1,     1,     1,
 sf_surface_physics                  = 1,     1,     1,
 bl_pbl_physics                      = 1,     1,     1,
 bldt                                = 0,     0,     0,
 cu_physics                          = 1,     1,     0,
 cudt                                = 5,     5,     5,
 isfflx                              = 1,
 ifsnow                              = 0,
 icloud                              = 1,
 surface_input_source                = 1,
 num_soil_layers                     = 5,
 sf_urban_physics                    = 0,
 mp_zero_out                         = 0,
 maxiens                             = 1,
 maxens                              = 3,
 maxens2                             = 3,
 maxens3                             = 16,
 ensdim                              = 144,
 /

 &fdda
 /

 &dynamics
 w_damping                           = 0,
 diff_opt                            = 1,
 km_opt                              = 4,
 base_temp                           = 290.
 damp_opt                            = 0,
 zdamp                               = 5000.,  5000.,  5000.,
 dampcoef                            = 0.01,   0.01,   0.01
 khdif                               = 0,      0,      0,
 kvdif                               = 0,      0,      0,
 smdiv                               = 0.1,    0.1,    0.1,
 emdiv                               = 0.01,   0.01,   0.01,
 epssm                               = 0.1,    0.1,    0.1
 time_step_sound                     = 4,      4,      4,
 h_mom_adv_order                     = 5,      5,      5,
 v_mom_adv_order                     = 3,      3,      3,
 h_sca_adv_order                     = 5,      5,      5,
 v_sca_adv_order                     = 3,      3,      3,
 non_hydrostatic                     = .true., .true., .true.,
 moist_adv_opt                       = 1, 1, 1,
 scalar_adv_opt                      = 1, 1, 1,
 chem_adv_opt                        = 0, 0, 0,
 tke_adv_opt                         = 1, 1, 1,
 /