Comparing WRF Physics Options


It is fairly common to use three main measures of forecast accuracy: bias, mean absolute error (MAE), and root mean squared error (RMSE). For a reference on calculating these, you can try this link: A Protocol for Standardizing the Performance Evaluation of Short-Term Wind Power Prediction Models. Please note that "bias" is defined here as the mean error, and error is defined as the measured value minus the predicted value. This means that a negative bias indicates that the predictions were generally too high, which is counterintuitive. To help avoid this confusion, I use the term "mean error" so that the intuitive interpretation of "bias" is avoided.
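The three measures, using the sign convention above (error = measured minus predicted), can be sketched as follows. The wind-speed numbers are hypothetical, chosen only to illustrate a negative bias from forecasts that run slightly high:

```python
import numpy as np

def forecast_errors(measured, predicted):
    """Return (bias, MAE, RMSE) with error = measured - predicted."""
    err = np.asarray(measured, dtype=float) - np.asarray(predicted, dtype=float)
    bias = err.mean()                  # mean error; negative => forecasts too high
    mae = np.abs(err).mean()           # mean absolute error
    rmse = np.sqrt((err ** 2).mean())  # root mean squared error
    return bias, mae, rmse

# Hypothetical wind speeds (m/s): the predictions here run slightly high,
# so the mean error (bias) comes out negative.
measured = [5.0, 6.2, 4.8, 7.1]
predicted = [5.5, 6.0, 5.3, 7.4]
bias, mae, rmse = forecast_errors(measured, predicted)
```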

A technique is used here that I have not seen mentioned elsewhere, though it seems likely that someone has used it before. I coined the terms "Mean Absolute Bias-Corrected Error (MABCE)" and "Root Mean Squared Bias-Corrected Error (RMSBCE)." These statistics indicate that before the MAE or RMSE is calculated, any bias that has been identified is removed from the individual forecasted values; the errors are "bias corrected." The reasoning is that when using forecasted values for predicting wind or generated electrical power, any bias that had been previously identified would be removed. For practical purposes, the details of what comes out of the model are less important than the quality of the forecast that would actually be passed on to the utility companies. That passed-on forecast would have any known biases removed, so the statistics that compare forecasts should also have any known biases removed.

PLEASE NOTE: For most of the data shown, I do something that should raise a red flag. I calculate the bias from the same set of data from which I later remove the bias. This is simply because I have not accumulated measurements and forecasts over a long enough period of time to calculate the bias separately. I would if I could, and I will as time and data progress. And perhaps my shortcut is irrelevant: the bias-corrected values are not much different from the uncorrected values.
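A minimal sketch of the bias-corrected statistics, mirroring the shortcut just described (the bias is estimated from the same sample it is removed from). The numbers are the same hypothetical wind speeds used for illustration, not measured data:

```python
import numpy as np

def bias_corrected_errors(measured, predicted):
    """Return (MABCE, RMSBCE): MAE and RMSE after removing the mean error.

    Caveat (matching the shortcut in the text): the bias is estimated from
    the same sample it is subtracted from. With a longer data record, the
    bias would instead come from an independent, earlier period.
    """
    err = np.asarray(measured, dtype=float) - np.asarray(predicted, dtype=float)
    corrected = err - err.mean()  # remove the (in-sample) mean error
    mabce = np.abs(corrected).mean()
    rmsbce = np.sqrt((corrected ** 2).mean())
    return mabce, rmsbce

measured = [5.0, 6.2, 4.8, 7.1]   # hypothetical sodar readings (m/s)
predicted = [5.5, 6.0, 5.3, 7.4]  # hypothetical forecasts (m/s)
mabce, rmsbce = bias_corrected_errors(measured, predicted)
```

Because removing the in-sample mean minimizes the sum of squared deviations, RMSBCE computed this way can never exceed the plain RMSE for the same sample.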

Testing Boundary Layer and Surface Layer Schemes

The main WRF physics options that were tested were the planetary boundary layer schemes and the surface layer schemes. These are controlled by the WRF namelist values bl_pbl_physics and sf_sfclay_physics, respectively. Some of the work was done using WRF version 3.0, which was released in 2008. The work done on that release indicated that the MYJ boundary layer scheme (bl_pbl_physics = 2) and the ETA surface layer scheme (sf_sfclay_physics = 2) were the best for the initialization data with which I was testing. I used archived RUC data, which uses fixed pressure levels every 25 mb. Most of these tests used 198 forecasts, which seemed reasonable, though not all could be matched up to the sodar data that is being used for the measured values. When graphed, there was enough variability between errors from one forecast hour to another to indicate that more forecasts/measurements should be used. For testing the new options in WRF 3.1, the number of forecasts was doubled to 396. This has also been done for some of the WRF version 3.0 options so that like comparisons could be put on the same graph. For 396 forecasts, approximately 370 could be matched up to sodar readings per forecast hour. Occasionally, the sodar either does not record wind speeds or the measurements are of poor quality and I discard them.
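For reference, the best-performing combination from that round of testing would be selected in the &physics section of namelist.input roughly as below. Only the options discussed here are shown; a real namelist carries many more settings, and multi-domain runs repeat each value per domain:

```fortran
&physics
 bl_pbl_physics    = 2,    ! MYJ planetary boundary layer scheme
 sf_sfclay_physics = 2,    ! ETA surface layer scheme
 ra_lw_physics     = 1,    ! RRTM longwave radiation
 ra_sw_physics     = 1,    ! Dudhia shortwave radiation
/
```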


This is the graph of mean error. It seems that the RUC data has a positive error, which means its wind speeds are too low. Over time the model not only corrects this error, but seems to create a negative error, i.e., wind speeds too high. All the schemes shown seem to have approximately similar values within the nine-hour forecast frame allowed by the RUC data. One thing that does stand out is the combination that uses different options for radiation, which is the lime green line. It uses two radiation options, called RRTMG, that are new to WRF version 3.1: ra_lw_physics = 4 and ra_sw_physics = 4. All other runs use the Rapid Radiative Transfer Model (RRTM) longwave option (ra_lw_physics = 1) and the Dudhia scheme for shortwave (ra_sw_physics = 1). These schemes get to the same range as the others, but faster.

This graph shows the mean absolute error for various options. There seem to be two groups. The new MYNN2.5 and MYNN3 PBL options (bl_pbl_physics = 5 and bl_pbl_physics = 6) seem to stray off with larger errors than the others. This is true both when using the new MYNN surface layer option (sf_sfclay_physics = 5) and when the older ETA surface layer physics is used (sf_sfclay_physics = 2). According to the WRF Users Guide, page 5-40, MYNN3 should only be used with the MYNN surface layer. MYNN2.5 can use the MYNN, ETA, and MM5 surface layer options. I have no explanation for why the older MYJ/ETA combination joins this group when the new RRTMG radiation options were used. The other group, which shows better performance, is the older MYJ/ETA combination and the new QNSE PBL scheme with the new QNSE surface layer scheme. Using the MYJ PBL scheme with the new QNSE surface layer scheme gave very similar results to the MYJ/ETA combination. Please notice that the wind speed scale starts at 2.0 meters per second. The errors shown fall within a fairly narrow range. The first part of the forecast seems to be heavily influenced by the initialization data and its bias. As that bias transitions to the model's own bias, the error levels off with only a slight increase over time. The model seems to have a decent amount of skill; otherwise the increase over time would be greater.

For completeness, here are the MABCE and RMSBCE graphs for the same data. The pattern is basically the same.

I hope to add a few more lines for some other options from WRF version 3.0 that are being rerun for completeness' sake. The previous tests were all done in version 3.0, and the results may be different if bug fixes changed them. Also, I have accumulated more sodar data. The change in season may affect some results.

As mentioned, these comparisons were done with RUC data for initialization and lateral boundary conditions. Similar tests have been started with NAM data. NAM data comes out less frequently but seems to be of better quality. As part of an initial check of WRF results, please see this page: WRF vs Initialization Data.

There have also been some tests varying the grid dimensions; however, the best results occurred with minimal grid sizes, so I used those to reduce run times. There was not much error difference among the sizes tested, but it would be wise to verify that for any new domain.