September 1, 2006
Recent Innovations in Simulation Input Modeling
1. Introduction
Despite the ready availability of software packages for automating the input modeling process, creating accurate input models can still be difficult for some simulations. For example, adequate input data might initially be lacking. A single stream of input data might originate from multiple sources, necessitating a difficult mixture input model. The modeling software might provide an incomplete picture of the input model uncertainties and the resulting output variability. The software might generate inconclusive results. Or, the software might simply not include an appropriate model for the input data.
The purpose of this research paper is to find out how these input modeling challenges can be overcome by using recently developed approaches to input modeling. The approaches that are investigated consist of Bayesian probability methods, quantile statistical techniques, and an interactive Maple-based probability language.
2. Using Bayesian Approaches to Enhance Input Modeling
Since about 1997, the simulation research community has taken a significant interest in Bayesian methods for input modeling. The foundation of Bayesian statistics is Bayes' Law. In the general form of this law,
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
Pr(A) is the prior unconditional probability of A, and Pr(A|B), the result of Bayes' Law, is the posterior probability of A, that is, the probability of A given B. The two factors used to convert the prior probability of A to the posterior probability of A are Pr(B|A), the conditional probability of B given A (also known as the likelihood), and Pr(B), the unconditional probability of B.
Bayes' Law can be interpreted as expressing the relationship between a prior estimated probability, Pr(A), and a posterior refined probability, Pr(A|B), that has been made more accurate in the light of subsequently applied data (B). A Bayesian approach can thus be used to create or refine simulation input models or to analyze input model uncertainty. The general technique is to use Bayes' Law to convert an initial probability distribution associated with the input model (the prior distribution, Pr(A)) to a more accurate probability distribution (the posterior distribution, Pr(A|B)), based upon input data that is sampled from the real-world system being modeled (B).
(Henderson, 2003, section 5.2)
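As a small numeric illustration of the general form of Bayes' Law, the following Python sketch computes a posterior probability for a two-hypothesis case. The scenario and all numbers are hypothetical and are not taken from Henderson (2003); Pr(B) is obtained from the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return Pr(A|B) from Pr(A), Pr(B|A), and Pr(B|not A)."""
    # Pr(B) by the law of total probability over A and not-A.
    prob_b = (likelihood_b_given_a * prior_a
              + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / prob_b


# Hypothetical numbers: a prior belief of 0.5 that candidate input model A is
# correct, and observed data B that is four times as likely under A as under
# the alternative.
p = posterior(prior_a=0.5, likelihood_b_given_a=0.8, likelihood_b_given_not_a=0.2)
print(round(p, 3))
```

In this example the data raise the probability of A from the prior value of 0.5 to a posterior value of 0.8.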
2.1 Bayesian Refinement of Input Models for Long-Term Projects
Accurate simulation is important for large-scale civil engineering projects. For example, simulation may be required to estimate schedules or budgets or to explore various scenarios by changing inputs or other simulation assumptions. However, it can be difficult to create an accurate input model for a construction project that has not yet begun, because reliable data are typically lacking at that stage.
For a long-term project that is repetitive in nature, such as the construction of a tunnel, using a Bayesian approach to repeatedly refine the input model can help solve this problem. This approach begins by constructing an initial input model based on educated assumptions, input from industry experts, and approximations (this is the subjective data). Because of the unique nature of each project, these input assumptions will tend to be only rough estimates and will therefore lead to relatively inaccurate simulation results. However, once the project begins, data on the actual progress of the project becomes available (this is the observed data).
Bayesian updating combines the subjective data with the observed data to successively refine the input model and improve the accuracy of the simulation. More specifically, for an input model the term data refers to the input random-variable distributions and their parameters, and Bayes' Law is used to refine each distribution. In this context, Chung, Mohamed, and AbouRizk (2004) express Bayes' Law in the following more specialized form,
f''(θ) = k L(θ) f'(θ)
where f'(θ) is the prior probability density function (PDF) of the distribution (that is, the distribution that is perhaps only roughly estimated) and f''(θ) is the posterior PDF (that is, the distribution that has been updated in light of the observed data). L(θ) is the likelihood of the observed data (equivalent to Pr(B|A) in the general form of Bayes' Law) and k is the normalizing constant (equivalent to 1/Pr(B) in the general form). θ represents the random variable.
Chung et al. (2004) applied this Bayesian approach to the simulation of an actual tunnel project (in the city of Edmonton, Alberta). They used Bayes' Law to update the distribution of a particular random input variable (the tunnel penetration rate). The results showed that incorporating updated data acquired even during the early stages of the actual project significantly improved the predictive accuracy of the simulation. However, they concluded that the earliest data that should be used for updating the input model is that acquired when the project is 9% completed and more than 50 samples have been gathered.
(Chung et al., 2004)
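The idea of repeatedly refining a prior as observed data arrive can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure of Chung et al. (2004): it assumes a normal prior on the mean penetration rate and normally distributed observations with a known variance, so that the posterior f''(θ) = k L(θ) f'(θ) has a closed form. The function name, the units, and all numbers are hypothetical.

```python
def update_normal_mean(prior_mean, prior_var, observations, obs_var):
    """Conjugate update of a normal prior on an unknown mean (obs_var known)."""
    n = len(observations)
    sample_mean = sum(observations) / n
    # Precisions (inverse variances) add; the posterior mean is the
    # precision-weighted average of the prior mean and the sample mean.
    post_precision = 1.0 / prior_var + n / obs_var
    post_mean = (prior_mean / prior_var + n * sample_mean / obs_var) / post_precision
    return post_mean, 1.0 / post_precision


# Hypothetical subjective prior: about 1.0 m/h, with large uncertainty.
mean, var = 1.0, 0.25
# Hypothetical batches of observed penetration rates as the project proceeds.
batches = [[1.4, 1.6, 1.5], [1.5, 1.7, 1.3, 1.6]]
for batch in batches:
    # The posterior from one batch becomes the prior for the next.
    mean, var = update_normal_mean(mean, var, batch, obs_var=0.04)
    print(f"posterior mean = {mean:.3f}, posterior variance = {var:.5f}")
```

Under these assumptions, updating batch by batch gives the same posterior as a single update with all of the data, and the posterior variance shrinks with each batch.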
2.2 Bayesian Techniques for Creating Mixture Models
When a stream of input data originates from more than one source (that is, the random variable comes from a finite mixture distribution), we must create a mixture input model. An example of a stream of input data originating from multiple sources would be the processing times for a series of vehicles arriving at a toll gate, where the vehicles consist of a mixture of private cars, light vans, and heavy goods vehicles.
Fitting a mixture data model to the original input data is a difficult task (it is classified as a non-regular/non-standard statistical problem). It is necessary to determine how many components the input stream includes and to identify each of them. Also, a mixture model has such a high degree of flexibility that there is a danger of over-fitting the model to the input data (that is, fitting the model to minor fluctuations in the input data that are due more to random variability than to the underlying distribution).
Cheng and Currie (2003) propose a computer-intensive Bayesian approach for fitting a mixture model to the sampled multiple-source input data. As in the Bayesian technique discussed in the previous section, Bayes' Law is used to convert a less accurate prior distribution to a more accurate posterior distribution based on a sample of actual input data. In this case, however, Bayes' Law is used just once to convert an unfitted prior distribution to a fitted posterior distribution on the basis of data sampled from mixed input (as opposed to using Bayes' Law repeatedly to capture changing data while the simulated process takes place in the real world, as with Chung et al., 2004). In conjunction with Bayes' Law, Cheng and Currie use importance sampling, which is a variance-reduction technique that samples more frequently those input values that have a greater impact on the results of the simulation.
(Cheng & Currie, 2003)
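To make the combination of Bayes' Law and importance sampling more concrete, the following Python sketch estimates the posterior mean of a single mixing weight. It is a deliberately simplified illustration and not the scheme of Cheng and Currie (2003): it assumes a two-component normal mixture whose component means and standard deviations are known, a Uniform(0,1) prior on the weight, and the prior itself as the importance-sampling proposal, so that each draw is weighted by its likelihood. All names and numbers are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(w, data):
    """Log-likelihood of mixing weight w; components N(0,1) and N(5,1) assumed known."""
    return sum(
        math.log(w * normal_pdf(x, 0.0, 1.0) + (1.0 - w) * normal_pdf(x, 5.0, 1.0))
        for x in data
    )


def posterior_mean_weight(data, n_draws, rng):
    """Estimate E[w | data] by importance sampling with the prior as proposal."""
    draws = [rng.random() for _ in range(n_draws)]
    log_w = [log_likelihood(w, data) for w in draws]
    # Subtract the largest log-weight before exponentiating to avoid underflow.
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    # Self-normalized importance-sampling estimate of the posterior mean.
    return sum(w * iw for w, iw in zip(draws, weights)) / sum(weights)


rng = random.Random(2003)  # fixed seed so the sketch is repeatable
true_weight = 0.3          # hypothetical share of the first source
data = [rng.gauss(0.0, 1.0) if rng.random() < true_weight else rng.gauss(5.0, 1.0)
        for _ in range(150)]
estimate = posterior_mean_weight(data, 2000, rng)
print(f"posterior mean of the mixing weight: {estimate:.3f}")
```

A full mixture analysis would also have to infer the number of components and their parameters, which is what makes the problem computer-intensive.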
2.3 Bayesian Estimation of Input Uncertainties and the Resulting Output Variability
Zouaoui and Wilson (2001 1, 2001 2) use a Bayesian approach not simply to fit a distribution to input data, as in the methods discussed in the previous two sections, but rather to determine the input model uncertainties and to estimate the effects of these uncertainties on the variability of the simulation output.
To estimate input model uncertainties, they begin with the prior plausibility of each candidate input model and with prior probability distributions for each of the input model parameters. They then use Bayes' Law to combine these prior probabilities with sampled input data to generate posterior input model probabilities and posterior input model parameter distributions. These more accurate posterior probabilities yield valid information on the input model uncertainties.
Zouaoui and Wilson (2001 1, 2001 2) then use what they term the Bayesian Simulation Replication Algorithm to estimate the mean output value of a simulation (known as the posterior mean response) and to accurately assess its variability. This algorithm thus provides not only point (that is, mean value) estimates, but also confidence-interval estimates to help the modeler interpret the point estimates.
In estimating the simulation output variability, the algorithm takes into account all three types of input uncertainty, which are the sources of the output variability:
- Input model parameter uncertainty, which arises because the actual values of the parameters for the potential input models are unknown. (This uncertainty component is estimated by doing runs that use different input parameters, which are drawn from the posterior parameter distributions mentioned above.)
- Stochastic uncertainty, which arises from the dependence of the input values on random numbers. (This uncertainty component is estimated by performing multiple replications of simulations using the same parameters.)
- Input model uncertainty, which arises because the exact distribution types may be unknown. (This uncertainty component is estimated by combining the information from the first two components with the posterior model probabilities that were mentioned above.)
To minimize variance in the results, the algorithm uses the maximum number of runs allowed by computing-budget and time constraints.
(Zouaoui and Wilson, 2001 1; Zouaoui and Wilson, 2001 2)
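The way parameter uncertainty and stochastic uncertainty can be separated by nested runs is sketched below in Python. This is only a toy illustration of the first two components listed above, not the Bayesian Simulation Replication Algorithm itself: it assumes a single input model (exponential service times with a conjugate gamma prior on the rate), so input model uncertainty is omitted, and the "simulation" is just an average of sampled service times. All names and numbers are hypothetical.

```python
import random
import statistics


def simulate_response(rate, rng, n_customers=50):
    """Toy simulation run: average of n_customers exponential service times."""
    return statistics.fmean([rng.expovariate(rate) for _ in range(n_customers)])


def replicate(observed, prior_shape, prior_rate, n_params, n_reps, rng):
    # Conjugate gamma posterior for the rate of exponential observations.
    post_shape = prior_shape + len(observed)
    post_rate = prior_rate + sum(observed)
    group_means, group_vars = [], []
    for _ in range(n_params):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        rate = rng.gammavariate(post_shape, 1.0 / post_rate)
        # Several replications with the same parameter expose stochastic noise.
        runs = [simulate_response(rate, rng) for _ in range(n_reps)]
        group_means.append(statistics.fmean(runs))
        group_vars.append(statistics.variance(runs))
    mean_response = statistics.fmean(group_means)
    stochastic_var = statistics.fmean(group_vars)
    # The variance of the group means still contains stochastic_var / n_reps
    # of replication noise; remove it to isolate the parameter component.
    parameter_var = max(
        0.0, statistics.variance(group_means) - stochastic_var / n_reps
    )
    return mean_response, parameter_var, stochastic_var


rng = random.Random(2001)  # fixed seed so the sketch is repeatable
observed = [1.2, 0.4, 3.1, 2.2, 0.9, 5.0, 1.7, 0.6, 2.8, 2.1]  # hypothetical data
mean_response, parameter_var, stochastic_var = replicate(
    observed, prior_shape=1.0, prior_rate=1.0, n_params=400, n_reps=5, rng=rng
)
print(f"posterior mean response  = {mean_response:.3f}")
print(f"parameter uncertainty    = {parameter_var:.3f}")
print(f"stochastic uncertainty   = {stochastic_var:.3f}")
```

Increasing n_params and n_reps reduces the variance of these estimates, which is why the number of runs is limited in practice only by the available computing budget.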
3. Using Quantile Statistical Methods to Improve Distribution Fittings
Commonly used distribution fitting software often produces inconclusive results, generating several distributions that differ only slightly in the test statistic value. These software packages usually determine the distribution that has the best fit by using standard, objective, single-number goodness-of-fit tests such as chi-squared (for discrete models) or Kolmogorov-Smirnov (for continuous models). The problem, however, is that the numeric results used to rank various distributions may differ only slightly among the distributions, and the differences may be due more to random variability in the sample data than to significant differences in the fits of the models.
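To show what a single-number goodness-of-fit ranking looks like, the following Python sketch computes the Kolmogorov-Smirnov statistic, the largest vertical distance between the empirical CDF and a fitted CDF, for two candidate distributions. The sample and both candidates are hypothetical, and the Weibull parameters are illustrative guesses rather than fitted values; how close the two statistics come out depends on the data.

```python
import math


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic D = sup |F_n(x) - F(x)| for a continuous cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical service-time sample.
sample = [0.3, 0.7, 1.1, 1.6, 2.4, 3.0, 4.2, 5.5, 7.1, 9.8]
mean = sum(sample) / len(sample)


def exponential_cdf(x):
    return 1.0 - math.exp(-x / mean)


def weibull_cdf(x):
    # Shape 1.1 and scale equal to the sample mean are illustrative guesses.
    return 1.0 - math.exp(-((x / mean) ** 1.1))


d_exp = ks_statistic(sample, exponential_cdf)
d_wei = ks_statistic(sample, weibull_cdf)
print(f"D (exponential) = {d_exp:.4f}, D (Weibull) = {d_wei:.4f}")
```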
To make the most accurate choice of input model for a given sample of input data, Gupta and Parzen (2004) recommend a two-part procedure: First, use distribution fitting software to obtain a preliminary list of candidate models. Then, use the visual, subjective, and yet more precise quantile statistical methods to make a final choice from among the models that have been assigned similar rankings by the fitting software.
There are several types of quantile analysis, some of which are quite sophisticated and complex. The basics, however, are fairly straightforward: If a random variable X has a cumulative distribution function F(x), then the q-quantile of F is the value x for which F(x) = q. For example, if F(x_1) = 0.75 (that is, the probability that X <= x_1 is 0.75), then the 0.75-quantile of F is x_1.
If we want to visually test the goodness-of-fit of a particular fitted distribution to the sampled, empirical data, we can take each value of X in the empirical data and find its F(x). For example, assume that x = 130 and F(x) = 0.45. The 0.45-quantile of the empirical data would therefore be 130. We now find the 0.45-quantile of the fitted distribution. For a perfect fit, it would also be 130. With a typical less-than-perfect fit, however, it will differ; for example, it might be 145. Next, we plot the two quantile values. In the example, this would be the point (130, 145). Because the two quantiles differ, this point will not lie on the 45° (y = x) line. We repeat the plot for multiple values of X in the empirical data. The greater the divergence of these points from the 45° line, the worse the fit. This type of plot is known as a Q-Q plot, which is one of several different types of quantile-based plots.
An advantage of the Q-Q plot over simply comparing the cumulative distribution function (CDF) plots of the empirical data vs. the fitted distribution is that a Q-Q plot tends to magnify the divergence between the two CDF plots in the critical low and high ends of the range of X values. In these areas, the CDF plots tend to be more horizontal than vertical, and the horizontal distance between the two CDF plots tends to be much larger than the vertical distance. Because the Q-Q plot clearly shows the differences in x-values corresponding to a specific F(x) value, it magnifies differences between the two F(x) curves at the low and high ends.
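The construction of the Q-Q points described above can be sketched in Python as follows. The sketch assumes a fitted exponential distribution and uses the common plotting position (i + 0.5)/n (with i counted from zero) as the empirical F(x) of the i-th ordered observation; both choices, along with the sample and the fitted mean, are hypothetical.

```python
import math


def exponential_quantile(q, mean):
    """Inverse CDF of an exponential distribution with the given mean."""
    return -mean * math.log(1.0 - q)


def qq_points(sample, quantile_fn):
    """Pair each ordered observation with the fitted quantile at the same probability."""
    xs = sorted(sample)
    n = len(xs)
    # The plotting position (i + 0.5) / n stays strictly inside (0, 1).
    return [(x, quantile_fn((i + 0.5) / n)) for i, x in enumerate(xs)]


# Hypothetical empirical data and a hypothetical fitted mean of 150.
sample = [12, 35, 60, 88, 130, 170, 225, 300, 410, 640]
points = qq_points(sample, lambda q: exponential_quantile(q, 150.0))
for empirical, fitted in points:
    print(f"({empirical:7.1f}, {fitted:7.1f})")
# Largest departure from the y = x line, in the units of X.
print(max(abs(e - f) for e, f in points))
```

Each pair is one point of the Q-Q plot; a pair such as (130, 145) in the text would appear here as one element of the returned list.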
Another common quantile-related approach is to take each X value in the empirical data and plot the F(x) for the empirical data vs. the F(x) for the fitted distribution. Again, the greater the divergence of the plotted points from the 45° line, the worse the fit. This type of plot is known as a P-P plot, and it tends to magnify the divergence between the two CDF plots in the central area of the range of X values (the reasons are analogous to those given for the magnification of the divergence at the ends of the range by a Q-Q plot).
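A P-P plot can be sketched in the same way, pairing probabilities instead of quantiles. As before, this Python sketch assumes the plotting position (i + 0.5)/n for the empirical F(x) and a hypothetical exponential fit; the sample and the fitted mean are illustrative only.

```python
import math


def pp_points(sample, cdf):
    """Pair the empirical F(x) of each ordered observation with the fitted F(x)."""
    xs = sorted(sample)
    n = len(xs)
    return [((i + 0.5) / n, cdf(x)) for i, x in enumerate(xs)]


# Hypothetical empirical data and a hypothetical fitted exponential mean of 150.
sample = [12, 35, 60, 88, 130, 170, 225, 300, 410, 640]
points = pp_points(sample, lambda x: 1.0 - math.exp(-x / 150.0))
for empirical_p, fitted_p in points:
    print(f"({empirical_p:.3f}, {fitted_p:.3f})")
```

Because both coordinates are probabilities, every point lies in the unit square, whereas the coordinates of a Q-Q plot are in the units of X.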
The modeler can choose an even more sophisticated quantile-related plot, such as a plot of the quantile/quartile (QIQ) function used by Gupta and Parzen (2004). This is a fairly complex function that is normalized to show skewness, tail-behavior, and other aspects of the distributions.
Overall, we can choose from a variety of quantile methods that focus on different aspects of the comparison between the empirical data distribution and a potential fitted distribution. Although using quantile analysis for the entire distribution selection process would be overly complex and time-consuming, the ability of these techniques to zoom in on specific aspects of the distribution comparison makes them ideal tools for making a final choice from among the highest ranked distributions suggested by a distribution-fitting software package, especially when the rankings are close.
(Gupta and Parzen, 2004; Leemis, 2003)
4. A Probability Language for Solving Input Modeling Problems
Evans and Leemis (2000) present a Maple-based probability tool named APPL (A Probability Programming Language), which can be used to help find an appropriate standard or non-standard distribution to fit sample input data. Like the quantile statistical methods discussed in the previous section, APPL serves primarily to overcome the limitations of the input modeling software package that is used—in this case, to help the modeler find an input model for difficult-to-fit sample data when the modeling software does not include an appropriate model.
Unlike a typical input modeling package, APPL works in interactive rather than batch mode and calculates exact probabilities. APPL includes types for more than 45 different discrete and continuous random variables and has procedures for estimating parameters, plotting cumulative distribution functions (CDFs), carrying out goodness-of-fit tests, and performing other probability tasks.
Evans and Leemis (2000) describe 18 specific probability problems that can be solved using APPL. As a simple example that demonstrates the basic workings of the language (taken from Evans and Leemis), suppose that X is a Uniform(0,1) random variable. We can calculate the probability that the sum of 8 independent, identically distributed copies of X will lie between 7/2 and 11/2 by using the following APPL code:
n := 8;
X := UniformRV(0, 1);
Y := ConvolutionIID(X, n);
CDF(Y, 11 / 2) - CDF(Y, 7 / 2);
The first line sets the number of terms in the sum, and the second line creates the random variable X. The third line creates another random variable Y for the sum of 8 independent, identically distributed copies of X. And the final line uses the cumulative distribution function of Y to calculate the probability that Y will lie between 7/2 and 11/2. Evans and Leemis point out that the resulting answer (3580151 / 5160960) is exact, while the two standard methods for solving this problem (Monte Carlo simulation and the central limit theorem) would provide only approximations.
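The exact value can be checked independently of APPL. The following Python sketch does not use APPL or Maple; it evaluates the known closed-form CDF of the sum of n independent Uniform(0,1) variables (the Irwin-Hall distribution) with exact rational arithmetic. The function name is our own.

```python
from fractions import Fraction
from math import comb, factorial, floor


def irwin_hall_cdf(x, n):
    """Exact CDF of the sum of n independent Uniform(0,1) variables at rational x."""
    x = Fraction(x)
    if x <= 0:
        return Fraction(0)
    if x >= n:
        return Fraction(1)
    # F(x) = (1/n!) * sum_{k=0}^{floor(x)} (-1)^k * C(n, k) * (x - k)^n
    total = sum((-1) ** k * comb(n, k) * (x - k) ** n for k in range(floor(x) + 1))
    return total / factorial(n)


prob = irwin_hall_cdf(Fraction(11, 2), 8) - irwin_hall_cdf(Fraction(7, 2), 8)
print(prob)         # 3580151/5160960
print(float(prob))
```

The result agrees with the fraction reported by Evans and Leemis, which is approximately 0.6937.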
Probability tasks that APPL can perform that are useful for input modeling include the following:
- Performing statistical tests on the randomness of values generated by a random number generator
- Generating a plot of the coefficient of variation vs. skewness, for the sample data and for several candidate distributions, to help find a distribution that fits the data
- Calculating maximum likelihood estimators (MLEs) to select the best parameters for a particular distribution
- Using the method of moments to estimate the best parameter values for a particular distribution
Thus APPL, although it requires hand coding, is a highly flexible tool that provides exact solutions to distribution fitting problems, and can be useful for finding an appropriate distribution and set of distribution parameters for sample data that a more automated input modeling package fails to fit.
(Evans & Leemis, 2000)
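The last two tasks in the list above, maximum likelihood estimation and the method of moments, can be illustrated numerically with a short Python sketch. Unlike APPL, which works symbolically and returns exact results, this sketch uses floating-point arithmetic; it relies on the standard closed-form estimators for the exponential rate (MLE) and for the gamma shape and scale (method of moments). The function names and the sample are hypothetical.

```python
import statistics


def exponential_mle_rate(sample):
    """MLE of the exponential rate: the reciprocal of the sample mean."""
    return 1.0 / statistics.fmean(sample)


def gamma_method_of_moments(sample):
    """Match the gamma mean (shape * scale) and variance (shape * scale**2)."""
    m = statistics.fmean(sample)
    v = statistics.pvariance(sample)  # second central moment of the sample
    shape = m * m / v
    scale = v / m
    return shape, scale


sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
rate = exponential_mle_rate(sample)
shape, scale = gamma_method_of_moments(sample)
print(f"exponential rate (MLE) = {rate:.4f}")
print(f"gamma shape = {shape:.4f}, gamma scale = {scale:.4f}")
```

For this sample the mean is 3 and the second central moment is 2, so the fitted gamma distribution reproduces both moments by construction.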
5. Conclusion
How then can recent innovations in input modeling techniques be used to solve the input modeling problems that were listed in the Introduction?
The problem of inadequate input data can be alleviated by using repeated applications of Bayes' Law to gradually refine the input model, provided that additional empirical input data becomes available periodically. When a single stream of input data originates from more than one source, it might be possible to create an adequate mixture input model by using a computer-intensive Bayesian approach to fit the sampled input data. Bayesian probability can also be used to evaluate input model uncertainties and to estimate the effects of these uncertainties on the output data variability, helping the modeler realistically interpret the results of the simulation.
When the results of input modeling software are inconclusive, one or more quantile statistical methods can be used like a microscope to zoom in on various aspects of the model fit. These methods allow the modeler to make an informed final choice from among the input models suggested by the modeling software.
Finally, the APPL probability language provides a powerful tool for overcoming almost any gap in the input modeling software. It can be used to solve individual input modeling problems and to find an adequate input model when the input modeling software does not include one.
6. Reference List
Cheng, R.C.H. & Currie, C.S.M. (2003). Prior and candidate models in the Bayesian analysis of finite mixtures. In S. Chick, P.J. Sánchez, D. Ferrin, and D.J. Morrice (Eds.), Proceedings of the 2003 Winter Simulation Conference. (pp. 392-398). Winter Simulation Conference.
Chung, T.H., Mohamed, Y., & AbouRizk, S. (2004). Simulation input updating using Bayesian techniques. In R.G. Ingalls, M.D. Rossetti, J.S. Smith, and B.A. Peters (Eds.), Proceedings of the 2004 Winter Simulation Conference. (pp. 180-185). Winter Simulation Conference.
Evans, D.L. & Leemis, L. (2000). Input modeling using a computer algebra system. In J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick (Eds.), Proceedings of the 2000 Winter Simulation Conference. (pp. 577-586). San Diego: Society for Computer Simulation International.
Gupta, A. & Parzen, E. (2004). Input modeling using quantile statistical methods. In R.G. Ingalls, M.D. Rossetti, J.S. Smith, and B.A. Peters (Eds.), Proceedings of the 2004 Winter Simulation Conference. (pp. 716-724). Winter Simulation Conference.
Henderson, S.G. (2003). Input model uncertainty: why do we care and what should we do about it? In S. Chick, P.J. Sánchez, D. Ferrin, and D.J. Morrice (Eds.), Proceedings of the 2003 Winter Simulation Conference. (pp. 90-100). Winter Simulation Conference.
Leemis, L. (2003). Input modeling. In S. Chick, P.J. Sánchez, D. Ferrin, and D.J. Morrice (Eds.), Proceedings of the 2003 Winter Simulation Conference. (pp. 14-24). Winter Simulation Conference.
Zouaoui, F. & Wilson, J.R. (2001 1). Accounting for parameter uncertainty in simulation input modeling. In B.A. Peters, J.S. Smith, D.J. Medeiros, and M.W. Rohrer (Eds.), Proceedings of the 2001 Winter Simulation Conference. (pp. 354-363). Washington, DC: IEEE Computer Society.
Zouaoui, F. & Wilson, J.R. (2001 2). Accounting for input model and parameter uncertainty in simulation. In B.A. Peters, J.S. Smith, D.J. Medeiros, and M.W. Rohrer (Eds.), Proceedings of the 2001 Winter Simulation Conference. (pp. 290-299). Washington, DC: IEEE Computer Society.