
[Cover: HANDBOOK OF ECONOMIC FORECASTING — Edited by Graham Elliott, Clive W.J. Granger and Allan Timmermann — NORTH-HOLLAND]

HANDBOOK OF ECONOMIC FORECASTING
VOLUME 1

North-Holland is an imprint of Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

First edition 2006
Copyright © 2006 Elsevier B.V. All rights reserved

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher. Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice: No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

Library of Congress Cataloging-in-Publication Data: A catalog record for this book is available from the Library of Congress.
British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library.

ISBN-13: 978-0-444-51395-3
ISBN-10: 0-444-51395-7
ISSN: 0169-7218 (Handbooks in Economics series)
ISSN: 1574-0706 (Handbook of Economic Forecasting series)

For information on all North-Holland publications visit our website at books.elsevier.com

Printed and bound in The Netherlands

INTRODUCTION TO THE SERIES

The aim of the Handbooks in Economics series is to produce Handbooks for various branches of economics, each of which is a definitive source, reference, and teaching supplement for use by professional researchers and advanced graduate students. Each Handbook provides self-contained surveys of the current state of a branch of economics in the form of chapters prepared by leading specialists on various aspects of this branch of economics. These surveys summarize not only received results but also newer developments, from recent journal articles and discussion papers. Some original material is also included, but the main goal is to provide comprehensive and accessible surveys. The Handbooks are intended to provide not only useful reference volumes for professional collections but also possible supplementary readings for advanced courses for graduate students in economics.

KENNETH J. ARROW and MICHAEL D. INTRILIGATOR

PUBLISHER’S NOTE

For a complete overview of the Handbooks in Economics Series, please refer to the listing at the end of this volume.

CONTENTS OF THE HANDBOOK

Author Index I-1
Subject Index I-19

CONTENTS OF VOLUME 1

Introduction to the Series v
Contents of the Handbook vii

PART 1: FORECASTING METHODOLOGY

Chapter 1
Bayesian Forecasting
JOHN GEWEKE AND CHARLES WHITEMAN 3
Abstract 4
Keywords 4
1. Introduction 6
2.
Bayesian inference and forecasting: A primer 7 2.1. Models for observables 7 2.2. Model completion with prior distributions 10 2.3. Model combination and evaluation 14 2.4. Forecasting 19 3. Posterior simulation methods 25 3.1. Simulation methods before 1990 25 3.2. Markov chain Monte Carlo 30 3.3. The full Monte 36 4. ’Twas not always so easy: A historical perspective 41 4.1. In the beginning, there was diffuseness, conjugacy, and analytic work 41 4.2. The dynamic linear model 43 4.3. The Minnesota revolution 44 4.4. After Minnesota: Subsequent developments 49 5. Some Bayesian forecasting models 53 5.1. Autoregressive leading indicator models 54 5.2. Stationary linear models 56 5.3. Fractional integration 59 5.4. Cointegration and error correction 61 5.5. Stochastic volatility 64 6. Practical experience with Bayesian forecasts 68 6.1. National BVAR forecasts: The Federal Reserve Bank of Minneapolis 69 6.2. Regional BVAR forecasts: Economic conditions in Iowa 70 References 73 xi xiv Contents of Volume 1 2. Specification testing and model evaluation in-sample 207 2.1. Diebold, Gunther and Tay approach – probability integral transform 208 2.2. Bai approach – martingalization 208 2.3. Hong and Li approach – a nonparametric test 210 2.4. Corradi and Swanson approach 212 2.5. Bootstrap critical values for the V1T and V2T tests 216 2.6. Other related work 220 3. Specification testing and model selection out-of-sample 220 3.1. Estimation and parameter estimation error in recursive and rolling estimation schemes – West as well as West and McCracken results 221 3.2. Out-of-sample implementation of Bai as well as Hong and Li tests 223 3.3. Out-of-sample implementation of Corradi and Swanson tests 225 3.4. Bootstrap critical for the V1P,J and V2P,J tests under recursive estimation 228 3.5. Bootstrap critical for the V1P,J and V2P,J tests under rolling estimation 233 Part III: Evaluation of (Multiple) Misspecified Predictive Models 234 4. Pointwise comparison of (multiple) misspecified predictive models 234 4.1. Comparison of two nonnested models: Diebold and Mariano test 235 4.2. Comparison of two nested models 238 4.3. Comparison of multiple models: The reality check 242 4.4. A predictive accuracy test that is consistent against generic alternatives 249 5. Comparison of (multiple) misspecified predictive density models 253 5.1. The Kullback–Leibler information criterion approach 253 5.2. A predictive density accuracy test for comparing multiple misspecified models 254 Acknowledgements 271 Part IV: Appendices and References 271 Appendix A: Assumptions 271 Appendix B: Proofs 275 References 280 PART 2: FORECASTING MODELS Chapter 6 Forecasting with VARMA Models HELMUT LÜTKEPOHL 287 Abstract 288 Keywords 288 1. Introduction and overview 289 1.1. Historical notes 290 1.2. Notation, terminology, abbreviations 291 2. VARMA processes 292 2.1. Stationary processes 292 2.2. Cointegrated I (1) processes 294 2.3. Linear transformations of VARMA processes 294 Contents of Volume 1 xv 2.4. Forecasting 296 2.5. Extensions 305 3. Specifying and estimating VARMA models 306 3.1. The echelon form 306 3.2. Estimation of VARMA models for given lag orders and cointegrating rank 311 3.3. Testing for the cointegrating rank 313 3.4. Specifying the lag orders and Kronecker indices 314 3.5. Diagnostic checking 316 4. Forecasting with estimated processes 316 4.1. General results 316 4.2. Aggregated processes 318 5. 
Conclusions 319 Acknowledgements 321 References 321 Chapter 7 Forecasting with Unobserved Components Time Series Models ANDREW HARVEY 327 Abstract 330 Keywords 330 1. Introduction 331 1.1. Historical background 331 1.2. Forecasting performance 333 1.3. State space and beyond 334 2. Structural time series models 335 2.1. Exponential smoothing 336 2.2. Local level model 337 2.3. Trends 339 2.4. Nowcasting 340 2.5. Surveys and measurement error 343 2.6. Cycles 343 2.7. Forecasting components 344 2.8. Convergence models 347 3. ARIMA and autoregressive models 348 3.1. ARIMA models and the reduced form 348 3.2. Autoregressive models 350 3.3. Model selection in ARIMA, autoregressive and structural time series models 350 3.4. Correlated components 351 4. Explanatory variables and interventions 352 4.1. Interventions 354 4.2. Time-varying parameters 355 5. Seasonality 355 5.1. Trigonometric seasonal 356 xvi Contents of Volume 1 5.2. Reduced form 357 5.3. Nowcasting 358 5.4. Holt–Winters 358 5.5. Seasonal ARIMA models 358 5.6. Extensions 360 6. State space form 361 6.1. Kalman filter 361 6.2. Prediction 363 6.3. Innovations 364 6.4. Time-invariant models 364 6.5. Maximum likelihood estimation and the prediction error decomposition 368 6.6. Missing observations, temporal aggregation and mixed frequency 369 6.7. Bayesian methods 369 7. Multivariate models 370 7.1. Seemingly unrelated times series equation models 370 7.2. Reduced form and multivariate ARIMA models 371 7.3. Dynamic common factors 372 7.4. Convergence 376 7.5. Forecasting and nowcasting with auxiliary series 379 8. Continuous time 383 8.1. Transition equations 383 8.2. Stock variables 385 8.3. Flow variables 387 9. Nonlinear and non-Gaussian models 391 9.1. General state space model 392 9.2. Conditionally Gaussian models 394 9.3. Count data and qualitative observations 394 9.4. Heavy-tailed distributions and robustness 399 9.5. Switching regimes 401 10. Stochastic volatility 403 10.1. Basic specification and properties 404 10.2. Estimation 405 10.3. Comparison with GARCH 405 10.4. Multivariate models 406 11. Conclusions 406 Acknowledgements 407 References 408 Chapter 8 Forecasting Economic Variables with Nonlinear Models TIMO TERÄSVIRTA 413 Abstract 414 Keywords 415 Contents of Volume 1 xix 5. Bayesian model averaging 535 5.1. Fundamentals of Bayesian model averaging 536 5.2. Survey of the empirical literature 541 6. Empirical Bayes methods 542 6.1. Empirical Bayes methods for large-n linear forecasting 543 7. Empirical illustration 545 7.1. Forecasting methods 545 7.2. Data and comparison methodology 547 7.3. Empirical results 547 8. Discussion 549 References 550 Chapter 11 Forecasting with Trending Data GRAHAM ELLIOTT 555 Abstract 556 Keywords 556 1. Introduction 557 2. Model specification and estimation 559 3. Univariate models 563 3.1. Short horizons 565 3.2. Long run forecasts 575 4. Cointegration and short run forecasts 581 5. Near cointegrating models 586 6. Predicting noisy variables with trending regressors 591 7. Forecast evaluation with unit or near unit roots 596 7.1. Evaluating and comparing expected losses 596 7.2. Orthogonality and unbiasedness regressions 598 7.3. Cointegration of forecasts and outcomes 599 8. Conclusion 600 References 601 Chapter 12 Forecasting with Breaks MICHAEL P. CLEMENTS AND DAVID F. HENDRY 605 Abstract 606 Keywords 606 1. Introduction 607 2. Forecast-error taxonomies 609 2.1. General (model-free) forecast-error taxonomy 609 2.2. VAR model forecast-error taxonomy 613 3. Breaks in variance 614 3.1. 
Conditional variance processes 614 xx Contents of Volume 1 3.2. GARCH model forecast-error taxonomy 616 4. Forecasting when there are breaks 617 4.1. Cointegrated vector autoregressions 617 4.2. VECM forecast errors 618 4.3. DVAR forecast errors 620 4.4. Forecast biases under location shifts 620 4.5. Forecast biases when there are changes in the autoregressive parameters 621 4.6. Univariate models 622 5. Detection of breaks 622 5.1. Tests for structural change 622 5.2. Testing for level shifts in ARMA models 625 6. Model estimation and specification 627 6.1. Determination of estimation sample for a fixed specification 627 6.2. Updating 630 7. Ad hoc forecasting devices 631 7.1. Exponential smoothing 631 7.2. Intercept corrections 633 7.3. Differencing 634 7.4. Pooling 635 8. Non-linear models 635 8.1. Testing for non-linearity and structural change 636 8.2. Non-linear model forecasts 637 8.3. Empirical evidence 639 9. Forecasting UK unemployment after three crises 640 9.1. Forecasting 1992–2001 643 9.2. Forecasting 1919–1938 645 9.3. Forecasting 1948–1967 645 9.4. Forecasting 1975–1994 647 9.5. Overview 647 10. Concluding remarks 648 Appendix A: Taxonomy derivations for Equation (10) 648 Appendix B: Derivations for Section 4.3 650 References 651 Chapter 13 Forecasting Seasonal Time Series ERIC GHYSELS, DENISE R. OSBORN AND PAULO M.M. RODRIGUES 659 Abstract 660 Keywords 661 1. Introduction 662 2. Linear models 664 2.1. SARIMA model 664 2.2. Seasonally integrated model 666 Contents of Volume 1 xxi 2.3. Deterministic seasonality model 669 2.4. Forecasting with misspecified seasonal models 672 2.5. Seasonal cointegration 677 2.6. Merging short- and long-run forecasts 681 3. Periodic models 683 3.1. Overview of PAR models 683 3.2. Modelling procedure 685 3.3. Forecasting with univariate PAR models 686 3.4. Forecasting with misspecified models 688 3.5. Periodic cointegration 688 3.6. Empirical forecast comparisons 690 4. Other specifications 691 4.1. Nonlinear models 691 4.2. Seasonality in variance 696 5. Forecasting, seasonal adjustment and feedback 701 5.1. Seasonal adjustment and forecasting 702 5.2. Forecasting and seasonal adjustment 703 5.3. Seasonal adjustment and feedback 704 6. Conclusion 705 References 706 PART 4: APPLICATIONS OF FORECASTING METHODS Chapter 14 Survey Expectations M. HASHEM PESARAN AND MARTIN WEALE 715 Abstract 716 Keywords 716 1. Introduction 717 2. Concepts and models of expectations formation 720 2.1. The rational expectations hypothesis 721 2.2. Extrapolative models of expectations formation 724 2.3. Testable implications of expectations formation models 727 2.4. Testing the optimality of survey forecasts under asymmetric losses 730 3. Measurement of expectations: History and developments 733 3.1. Quantification and analysis of qualitative survey data 739 3.2. Measurement of expectations uncertainty 744 3.3. Analysis of individual responses 745 4. Uses of survey data in forecasting 748 4.1. Forecast combination 749 4.2. Indicating uncertainty 749 4.3. Aggregated data from qualitative surveys 751 5. Uses of survey data in testing theories: Evidence on rationality of expectations 754 xxiv Contents of Volume 1 9.2. Examples 937 10. Review of the recent literature on the performance of leading indicators 945 10.1. The performance of the new models with real time data 946 10.2. Financial variables as leading indicators 947 10.3. The 1990–1991 and 2001 US recessions 949 11. What have we learned? 
951 References 952 Chapter 17 Forecasting with Real-Time Macroeconomic Data DEAN CROUSHORE 961 Abstract 962 Keywords 962 1. An illustrative example: The index of leading indicators 963 2. The real-time data set for macroeconomists 964 How big are data revisions? 967 3. Why are forecasts affected by data revisions? 969 Experiment 1: Repeated observation forecasting 971 Experiment 2: Forecasting with real-time versus latest-available data samples 972 Experiment 3: Information criteria and forecasts 974 4. The literature on how data revisions affect forecasts 974 How forecasts differ when using first-available data compared with latest-available data 974 Levels versus growth rates 976 Model selection and specification 977 Evidence on the predictive content of variables 978 5. Optimal forecasting when data are subject to revision 978 6. Summary and suggestions for further research 980 References 981 Chapter 18 Forecasting in Marketing PHILIP HANS FRANSES 983 Abstract 984 Keywords 984 1. Introduction 985 2. Performance measures 986 2.1. What do typical marketing data sets look like? 986 2.2. What does one want to forecast? 991 3. Models typical to marketing 992 3.1. Dynamic effects of advertising 993 3.2. The attraction model for market shares 997 3.3. The Bass model for adoptions of new products 999 3.4. Multi-level models for panels of time series 1001 Contents of Volume 1 xxv 4. Deriving forecasts 1003 4.1. Attraction model forecasts 1004 4.2. Forecasting market shares from models for sales 1005 4.3. Bass model forecasts 1006 4.4. Forecasting duration data 1008 5. Conclusion 1009 References 1010 Author Index I-1 Subject Index I-19 This page intentionally left blank Chapter 1 BAYESIAN FORECASTING JOHN GEWEKE and CHARLES WHITEMAN Department of Economics, University of Iowa, Iowa City, IA 52242-1000 Contents Abstract 4 Keywords 4 1. Introduction 6 2. Bayesian inference and forecasting: A primer 7 2.1. Models for observables 7 2.1.1. An example: Vector autoregressions 8 2.1.2. An example: Stochastic volatility 9 2.1.3. The forecasting vector of interest 9 2.2. Model completion with prior distributions 10 2.2.1. The role of the prior 10 2.2.2. Prior predictive distributions 11 2.2.3. Hierarchical priors and shrinkage 12 2.2.4. Latent variables 13 2.3. Model combination and evaluation 14 2.3.1. Models and probability 15 2.3.2. A model is as good as its predictions 15 2.3.3. Posterior predictive distributions 17 2.4. Forecasting 19 2.4.1. Loss functions and the subjective decision maker 20 2.4.2. Probability forecasting and remote clients 22 2.4.3. Forecasts from a combination of models 23 2.4.4. Conditional forecasting 24 3. Posterior simulation methods 25 3.1. Simulation methods before 1990 25 3.1.1. Direct sampling 26 3.1.2. Acceptance sampling 27 3.1.3. Importance sampling 29 3.2. Markov chain Monte Carlo 30 3.2.1. The Gibbs sampler 31 Handbook of Economic Forecasting, Volume 1 Edited by Graham Elliott, Clive W.J. Granger and Allan Timmermann © 2006 Elsevier B.V. All rights reserved DOI: 10.1016/S1574-0706(05)01001-3 4 J. Geweke and C. Whiteman 3.2.2. The Metropolis–Hastings algorithm 33 3.2.3. Metropolis within Gibbs 34 3.3. The full Monte 36 3.3.1. Predictive distributions and point forecasts 37 3.3.2. Model combination and the revision of assumptions 39 4. ’Twas not always so easy: A historical perspective 41 4.1. In the beginning, there was diffuseness, conjugacy, and analytic work 41 4.2. The dynamic linear model 43 4.3. The Minnesota revolution 44 4.4. 
After Minnesota: Subsequent developments 49 5. Some Bayesian forecasting models 53 5.1. Autoregressive leading indicator models 54 5.2. Stationary linear models 56 5.2.1. The stationary AR(p) model 56 5.2.2. The stationary ARMA(p, q) model 57 5.3. Fractional integration 59 5.4. Cointegration and error correction 61 5.5. Stochastic volatility 64 6. Practical experience with Bayesian forecasts 68 6.1. National BVAR forecasts: The Federal Reserve Bank of Minneapolis 69 6.2. Regional BVAR forecasts: economic conditions in Iowa 70 References 73 Abstract Bayesian forecasting is a natural product of a Bayesian approach to inference. The Bayesian approach in general requires explicit formulation of a model, and condition- ing on known quantities, in order to draw inferences about unknown ones. In Bayesian forecasting, one simply takes a subset of the unknown quantities to be future values of some variables of interest. This chapter presents the principles of Bayesian forecasting, and describes recent advances in computational capabilities for applying them that have dramatically expanded the scope of applicability of the Bayesian approach. It describes historical developments and the analytic compromises that were necessary prior to re- cent developments, the application of the new procedures in a variety of examples, and reports on two long-term Bayesian forecasting exercises. Keywords Markov chain Monte Carlo, predictive distribution, probability forecasting, simulation, vector autoregression Ch. 1: Bayesian Forecasting 5 JEL classification: C530, C110, C150 8 J. Geweke and C. Whiteman time units t = 1, 2, . . . . The history of the sequence at time t is given by Yt = {ys}ts=1. The sample space for yt is ψt , that for Yt is t , and ψ0 = 0 = {∅}. A model, A, specifies a corresponding sequence of probability density functions (1)p(yt | Yt−1, θA,A) in which θA is a kA × 1 vector of unobservables, and θA ∈ A ⊆ Rk . The vector θA includes not only parameters as usually conceived, but also latent variables conve- nient in model formulation. This extension immediately accommodates non-standard distributions, time varying parameters, and heterogeneity across observations; Albert and Chib (1993), Carter and Kohn (1994), Fruhwirth-Schnatter (1994) and DeJong and Shephard (1995) provide examples of this flexibility in the context of Bayesian time series modeling. The notation p(·) indicates a generic probability density function (p.d.f.) with re- spect to Lebesgue measure, and P(·) the corresponding cumulative distribution function (c.d.f.). We use continuous distributions to simplify the notation; extension to discrete and mixed continuous–discrete distributions is straightforward using a generic mea- sure ν. The probability density function (p.d.f.) for YT , conditional on the model and unobservables vector θA, is (2)p(YT | θA,A) = T∏ t=1 p(yt | Yt−1, θA,A). When used alone, expressions like yt and YT denote random vectors. In Equations (1) and (2) yt and YT are arguments of functions. These uses are distinct from the observed values themselves. To preserve this distinction explicitly, denote observed yt by yot and observed YT by YoT . In general, the superscript o will denote the observed value of a random vector. For example, the likelihood function is L(θA; YoT , A) ∝ p(YoT | θA,A). 2.1.1. 
An example: Vector autoregressions Following Sims (1980) and Litterman (1979) (which are discussed below), vector au- toregressive models have been utilized extensively in forecasting macroeconomic and other time series owing to the ease with which they can be used for this purpose and their apparent great success in implementation. Adapting the notation of Litterman (1979), the VAR specification for p(yt | Yt−1, θA,A) is given by (3)yt = BDDt + B1yt−1 + B2yt−2 + · · · + Bmyt−m + εt where A now signifies the autoregressive structure, Dt is a deterministic component of dimension d , and εt iid ∼ N(0,). In this case, θA = (BD,B1, . . . ,Bm,). Ch. 1: Bayesian Forecasting 9 2.1.2. An example: Stochastic volatility Models with time-varying volatility have long been standard tools in portfolio allocation problems. Jacquier, Polson and Rossi (1994) developed the first fully Bayesian approach to such a model. They utilized a time series of latent volatilities h = (h1, . . . , hT )′: (4)h1 | ( σ 2η , φ,A ) ∼ N [ 0, σ 2η /( 1 − φ2)], (5)ht = φht−1 + σηηt (t = 2, . . . , T ). An observable sequence of asset returns y = (y1, . . . , yT )′ is then conditionally inde- pendent, (6)yt = β exp(ht/2)εt ; (εt , ηt ) ′ | A iid∼ N(0, I2). The (T + 3) × 1 vector of unobservables is (7)θA = ( β, σ 2η , φ, h1, . . . , hT )′ . It is conventional to speak of (β, σ 2η , φ) as a parameter vector and h as a vector of latent variables, but in Bayesian inference this distinction is a matter only of language, not substance. The unobservables h can be any real numbers, whereas β > 0, ση > 0, and φ ∈ (−1, 1). If φ > 0 then the observable sequence {y2t } exhibits the positive serial correlation characteristic of many sequences of asset returns. 2.1.3. The forecasting vector of interest Models are means, not ends. A useful link between models and the purposes for which they are formulated is a vector of interest, which we denote ω ∈ ⊆ Rq . The vector of interest may be unobservable, for example the monetary equivalent of a change in welfare, or the change in an equilibrium price vector, following a hypothetical policy change. In order to be relevant, the model must not only specify (1), but also (8)p(ω | YT , θA,A). In a forecasting problem, by definition, {y′T+1, . . . , y′T+F } ∈ ω′ for some F > 0. In some cases ω′ = (y′T+1, . . . , y′T+F ) and it is possible to express p(ω | YT , θA) ∝ p(YT+F | θA,A) in closed form, but in general this is not so. Suppose, for example, that a stochastic volatility model of the form (5)–(6) is a means to the solution of a financial decision making problem with a 20-day horizon so that ω = (yT+1, . . . , yT+20)′. Then there is no analytical expression for p(ω | YT , θA,A) with θA defined as it is in (7). If ω is extended to include (hT+1, . . . , hT+20)′ as well as (yT+1, . . . , yT+20)′, then the expression is simple. Continuing with an analytical approach then confronts the original problem of integrating over (hT+1, . . . , hT+20)′ to obtain p(ω | YT , θA,A). But it also highlights the fact that it is easy to simulate from this extended definition of ω in a way that is, today, obvious: ht | ( ht−1, σ 2η , φ,A ) ∼ N ( φht−1, σ 2η ) , 10 J. Geweke and C. Whiteman yt | (ht , β,A) ∼ N [ 0, β2 exp(ht ) ] (t = T + 1, . . . , T + 20). Since this produces a simulation from the joint distribution of (hT+1, . . . , hT+20)′ and (yT+1, . . . , yT+20)′, the “marginalization” problem simply amounts to discarding the simulated (hT+1, . . . , hT+20)′. 
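A minimal numerical sketch of this forward simulation may help fix ideas. It assumes that a posterior simulator has already produced joint draws of (β, σ_η, φ) and of the final latent volatility h_T; the arrays `posterior_draws` and `h_T` below are hypothetical placeholders for that output. Each posterior draw is extended forward through (5)–(6) and the simulated volatilities are then discarded:

```python
import numpy as np

def simulate_sv_forecast(posterior_draws, h_T, horizon=20, rng=None):
    """Simulate y_{T+1},...,y_{T+horizon} from the stochastic volatility model.

    posterior_draws: array of shape (M, 3) with columns (beta, sigma_eta, phi),
        one row per posterior draw of the parameters.
    h_T: array of length M with the matching posterior draws of the last
        latent volatility.
    Returns an (M, horizon) array of simulated future observables; the simulated
    future volatilities are simply discarded (the "marginalization" step).
    """
    rng = np.random.default_rng() if rng is None else rng
    M = posterior_draws.shape[0]
    y_future = np.empty((M, horizon))
    for m in range(M):
        beta, sigma_eta, phi = posterior_draws[m]
        h = h_T[m]
        for s in range(horizon):
            h = phi * h + sigma_eta * rng.standard_normal()                 # h_t | h_{t-1}
            y_future[m, s] = beta * np.exp(h / 2) * rng.standard_normal()   # y_t | h_t
    return y_future
```

Averages of functions of the rows of `y_future` then approximate the corresponding moments of the predictive distribution of ω.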
A quarter-century ago, this idea was far from obvious. Wecker (1979), in a paper on predicting turning points in macroeconomic time series, appears to have been the first to have used simulation to access the distribution of a problematic vector of inter- est ω or functions of ω. His contribution was the first illustration of several principles that have emerged since and will appear repeatedly in this survey. One is that while producing marginal from joint distributions analytically is demanding and often impos- sible, in simulation it simply amounts to discarding what is irrelevant. (In Wecker’s case the future yT+s were irrelevant in the vector that also included indicator variables for turning points.) A second is that formal decision problems of many kinds, from point forecasts to portfolio allocations to the assessment of event probabilities can be solved using simulations of ω. Yet another insight is that it may be much simpler to introduce intermediate conditional distributions, thereby enlarging θA, ω, or both, retaining from the simulation only that which is relevant to the problem at hand. The latter idea was fully developed in the contribution of Tanner and Wong (1987). 2.2. Model completion with prior distributions The generic model for observables (2) is expressed conditional on a vector of unob- servables, θA, that includes unknown parameters. The same is true of the model for the vector of interest ω in (8), and this remains true whether one simulates from this dis- tribution or provides a full analytical treatment. Any workable solution of a forecasting problem must, in one way or another, address the fact that θA is unobserved. A similar issue arises if there are alternative models A – different functional forms in (2) and (8) – and we return to this matter in Section 2.3. 2.2.1. The role of the prior The Bayesian strategy is dictated by the first principle, which demands that we work with p(ω | YT , A). Given that p(YT | θA,A) has been specified in (2) and p(ω | YT , θA) in (8), we meet the requirements of the first principle by specifying (9)p(θA | A), because then p(ω | YT , A) ∝ ∫ A p(θA | A)p(YT | θA,A)p(ω | YT , θA,A) dθA. The density p(θA | A) defines the prior distribution of the unobservables. For many practical purposes it proves useful to work with an intermediate distribution, the poste- Ch. 1: Bayesian Forecasting 13 Equivalently, this idea could be expressed by introducing the hyperparameter θ∗, then taking (12)θ∗ | A ∼ N(0, ρh−1) followed by (13)θi | ( θ∗, A ) ∼ N [ θ∗, (1 − ρ)h−1], (14)yt | (θ1, . . . , θr , A) ∼ p(yt | θ1, . . . , θr ) (t = 1, . . . , T ). This idea could then easily be merged with the strategy for handling the Student-t dis- tribution, allowing some outliers among θi (a Student-t distribution conditional on θ∗), thicker tails in the distribution of θ∗, or both. The application of hierarchical priors in (12)–(13) is an example of shrinkage. The concept is familiar in non-Bayesian treatments as well (for example, ridge regression) where its formal motivation originated with James and Stein (1961). In the Bayesian setting shrinkage is toward a common unknown mean θ∗, for which a posterior distrib- ution will be determined by the data, given the prior. This idea has proven to be vital in forecasting problems in which there are many parameters. Section 4 reviews its application in vector autoregressions and its critical role in turning mediocre into superior forecasts in that model. 
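The mechanics of the hierarchical prior (12)–(13) can be checked with a short simulation. The sketch below, with illustrative values of ρ and h, draws the hyperparameter θ* and then the coefficients θ_1, ..., θ_r, and confirms that marginally each θ_i is N(0, h⁻¹) with pairwise prior correlation ρ — the sense in which the coefficients are shrunk toward the common unknown mean θ*:

```python
import numpy as np

def draw_hierarchical_prior(r, rho, h, M=10000, rng=None):
    """Draw theta_1,...,theta_r from the hierarchical prior (12)-(13):
        theta_star ~ N(0, rho / h),
        theta_i | theta_star ~ N(theta_star, (1 - rho) / h).
    Marginally each theta_i ~ N(0, 1/h) and corr(theta_i, theta_j) = rho (i != j).
    """
    rng = np.random.default_rng() if rng is None else rng
    theta_star = rng.normal(0.0, np.sqrt(rho / h), size=(M, 1))              # hyperparameter draws
    theta = theta_star + rng.normal(0.0, np.sqrt((1 - rho) / h), size=(M, r))
    return theta

# Quick check of the implied prior correlation across coefficients:
theta = draw_hierarchical_prior(r=5, rho=0.7, h=1.0)
print(np.corrcoef(theta, rowvar=False).round(2))   # off-diagonal entries near 0.7
```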
Zellner and Hong (1989) used this strategy in forecasting growth rates of output for 18 different countries, and it proved to minimize mean square forecast error among eight competing treatments of the same model. More recently Tobias (2001) applied the same strategy in developing predictive intervals in the same model. Zellner and Chen (2001) approached the problem of forecasting US real GDP growth by disaggregating across sectors and employing a prior that shrinks sector parameters toward a common but unknown mean, with a payoff similar to that in Zellner and Hong (1989). In forecasting long-run returns to over 1,000 initial public offerings Brav (2000) found a prior with shrinkage toward an unknown mean essential in producing superior results. 2.2.4. Latent variables Latent variables, like the volatilities ht in the stochastic volatility model of Section 2.1.2, are common in econometric modelling. Their treatment in Bayesian inference is no dif- ferent from the treatment of other unobservables, like parameters. In fact latent variables are, formally, no different from hyperparameters. For the stochastic volatility model Equations (4)–(5) provide the distribution of the latent variables (hyperparameters) con- ditional on the parameters, just as (12) provides the hyperparameter distribution in the illustration of shrinkage. Conditional on the latent variables {ht }, (6) indicates the ob- servables distribution, just as (14) indicates the distribution of observables conditional on the parameters. In the formal generalization of this idea the complete model provides a conventional prior distribution p(θA | A), and then the distribution of a vector of latent variables z 14 J. Geweke and C. Whiteman conditional on θA, p(z | θA,A). The observables distribution typically involves both z and θA: p(YT | z, θA,A). Clearly one could also have a hierarchical prior distribution for θA in this context as well. Latent variables are convenient, but not essential, devices for describing the dis- tribution of observables, just as hyperparameters are convenient but not essential in constructing prior distributions. The convenience stems from the fact that the likeli- hood function is otherwise awkward to express, as the reader can readily verify for the stochastic volatility model. In these situations Bayesian inference then has to con- front the problem that it is impractical, if not impossible, to evaluate the likelihood function or even to provide an adequate numerical approximation. Tanner and Wong (1987) provided a systematic method for avoiding analytical integration in evaluating the likelihood function, through a simulation method they described as data augmenta- tion. Section 5.2.2 provides an example. This ability to use latent variables in a routine and practical way in conjunction with Bayesian inference has spawned a generation of Bayesian time series models useful in prediction. These include state space mixture models [see Carter and Kohn (1994, 1996) and Gerlach, Carter and Kohn (2000)], discrete state models [see Albert and Chib (1993) and Chib (1996)], component models [see West (1995) and Huerta and West (1999)] and factor models [see Geweke and Zhou (1996) and Aguilar and West (2000)]. The last paper provides a full application to the applied forecasting problem of foreign exchange portfolio allocation. 2.3. Model combination and evaluation In applied forecasting and decision problems one typically has under consideration not a single model A, but several alternative models A1, . . . 
, AJ . Each model is comprised of a conditional observables density (1), a conditional density of a vector of interest ω (8) and a prior density (9). For a finite number of models, each fully articulated in this way, treatment is dictated by the principle of explicit formulation: extend the formal probability treatment to include all J models. This extension requires only attaching prior probabilities p(Aj ) to the models, and then conducting inference and addressing decision problems conditional on the universal model specification{ p(Aj ), p(θAj | Aj), p(YT | θAj , Aj ), p(ω | YT , θAj , Aj ) } (15)(j = 1, . . . , J ). The J models are related by their prior predictions for a common set of observables YT and a common vector of interest ω. The models may be quite similar: some, or all, of them might have the same vector of unobservables θA and the same functional form for p(YT | θA,A), and differ only in their specification of the prior density p(θA | Aj). At the other extreme some of the models in the universe might be simple or have a few unobservables, while others could be very complex with the number of unobservables, which include any latent variables, substantially exceeding the number of observables. There is no nesting requirement. Ch. 1: Bayesian Forecasting 15 2.3.1. Models and probability The penultimate objective in Bayesian forecasting is the distribution of the vector of interest ω, conditional on the data YoT and the universal model specification A ={A1, . . . , AJ }. Given (15) the formal solution is (16)p ( ω | YoT , A ) = J∑ j=1 p ( ω | YoT , Aj ) p ( Aj | YoT ) , known as model averaging. In expression (16), (17)p ( Aj | YoT , A ) = p(YoT | Aj )p(Aj | A)/p(YoT | A) (18)∝ p ( YoT | Aj ) p(Aj | A). Expression (17) is the posterior probability of model Aj . Since these probabilities sum to 1, the values in (18) are sufficient. Of the two components in (18) the second is the prior probability of model Aj . The first is the marginal likelihood (19)p ( YoT | Aj ) = ∫ Aj p ( YoT | θAj , Aj ) p(θAj | Aj) dθAj . Comparing (19) with (10), note that (19) is simply the prior predictive density, evaluated at the realized outcome YoT – the data. The ratio of posterior probabilities of the models Aj and Ak is (20) P(Aj | YoT ) P (Ak | YoT ) = P(Aj ) P (Ak) · p(Y o T | Aj) p(YoT | Ak) , known as the posterior odds ratio in favor of model Aj versus model Ak . It is the prod- uct of the prior odds ratio P(Aj | A)/P (Ak | A), and the ratio of marginal likelihoods p(YoT | Aj)/p(YoT | Ak), known as the Bayes factor. The Bayes factor, which may be interpreted as updating the prior odds ratio to the posterior odds ratio, is independent of the other models in the universe A = {A1, . . . , AJ }. This quantity is central in sum- marizing the evidence in favor of one model, or theory, as opposed to another one, an idea due to Jeffreys (1939). The significance of this fact in the statistics literature was recognized by Roberts (1965), and in econometrics by Leamer (1978). The Bayes factor is now a practical tool in applied statistics; see the reviews of Draper (1995), Chatfield (1995), Kass and Raftery (1996) and Hoeting et al. (1999). 2.3.2. A model is as good as its predictions It is through the marginal likelihoods p(YoT | Aj) (j = 1, . . . , J ) that the observed outcome (data) determines the relative contribution of competing models to the poste- rior distribution of the vector of interest ω. 
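In computational terms, (16)–(18) amount to a reweighting of model-specific predictive simulations. The following sketch uses hypothetical inputs — a vector of log marginal likelihoods, prior model probabilities, and per-model arrays of predictive draws of a scalar ω — to form the posterior model probabilities on the log scale and then sample from the model-averaged predictive density:

```python
import numpy as np

def posterior_model_probs(log_marglik, prior_probs):
    """Posterior model probabilities (17)-(18) from log marginal likelihoods
    log p(Y_T^o | A_j) and prior probabilities p(A_j | A), computed on the log
    scale to avoid numerical underflow."""
    log_post = np.log(np.asarray(prior_probs, dtype=float)) + np.asarray(log_marglik, dtype=float)
    log_post -= log_post.max()                 # normalization constant cancels
    probs = np.exp(log_post)
    return probs / probs.sum()

def draw_model_average(predictive_draws, post_probs, M=5000, rng=None):
    """Draw from the model-averaged predictive density (16): first draw a model
    index A_j with its posterior probability, then draw omega from that model's
    stored predictive simulations (predictive_draws is a list of 1-D arrays)."""
    rng = np.random.default_rng() if rng is None else rng
    model_idx = rng.choice(len(post_probs), size=M, p=post_probs)
    return np.array([rng.choice(predictive_draws[j]) for j in model_idx])
```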
There is a close and formal link between a model’s marginal likelihood and the adequacy of its out-of-sample predictions. To es- tablish this link consider the specific case of a forecasting horizon of F periods, with 18 J. Geweke and C. Whiteman repeated experiment. Contrasts between ỸT and YoT are the basis of assessing the exter- nal validity of the model, or set of models, upon which inference has been conditioned. If one is able to simulate unobservables θ (m)A from the posterior distribution (more on this in Section 3) then the simulation Ỹ(m)T follows just as the simulation of Y (m) T in (11). The process can be made formal by identifying one or more subsets S of the range T of YT . For any such subset P(ỸT ∈ S | YoT , A) can be evaluated using the simulation approximation M−1 ∑M m=1 IS(Ỹ (m) T ). If P(ỸT ∈ S | YoT , A) = 1 − α, α being a small positive number, and YoT /∈ S, there is evidence of external inconsistency of the model with the data. This idea goes back to the notion of “surprise” discussed by Good (1956): we have observed an event that is very unlikely to occur again, were the time series “experiment” to be repeated, independently, many times. The essentials of this idea were set out by Rubin (1984) in what he termed “model monitoring by posterior predictive checks”. As Rubin emphasized, there is no formal method for choosing the set S (see, however, Section 2.4.1 below). If S is defined with reference to a scalar function g as {ỸT : g1  g(ỸT )  g2} then it is a short step to reporting a “p-value” for g(YoT ). This idea builds on that of the probability integral transform introduced by Rosenblatt (1952), stressed by Dawid (1984) in prequential forecasting, and formalized by Meng (1994); see also the comprehensive survey of Gelman et al. (1995). The purpose of posterior predictive exercises of this kind is not to conduct hypothesis tests that lead to rejection or non-rejection of models; rather, it is to provide a diagnostic that may spur creative thinking about new models that might be created and brought into the universe of models A = {A1, . . . , AJ }. This is the idea originally set forth by Box (1980). Not all practitioners agree: see the discussants in the symposia in Box (1980) and Bayarri and Berger (1998), as well as the articles by Edwards, Lindman and Savage (1963) and Berger and Delampady (1987). The creative process dictates the choice of S, or of g(ỸT ), which can be quite flexible, and can be selected with an eye to the ultimate application of the model, a subject to which we return in the next section. In general the function g(ỸT ) could be a pivotal test statistic (e.g., the difference between the first order statistic and the sample mean, divided by the sample standard deviation, in an i.i.d. Gaussian model) but in the most interesting and general cases it will not (e.g., the point estimate of a long-memory coefficient). In checking external validity, the method has proven useful and flexible; for example see the recent work by Koop (2001) and Geweke and McCausland (2001) and the texts by Lancaster (2004, Section 2.5) and Geweke (2005, Section 5.3.2). Brav (2000) utilizes posterior predictive analysis in examining alternative forecasting models for long-run returns on financial assets. Posterior predictive analysis can also temper the forecasting exercise when it is clear that there are features g(ỸT ) that are poorly described by the combination of mod- els considered. 
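A minimal sketch of such a check for a scalar feature g is given below; the functions `g` and `simulate_replicate` are model-specific placeholders, and the reported quantity is the simulation approximation of a posterior predictive "p-value" of the kind just described:

```python
import numpy as np

def posterior_predictive_pvalue(g, y_obs, simulate_replicate, theta_draws):
    """Posterior predictive check for a scalar feature g(Y_T).

    g: function mapping a data set to a scalar feature.
    y_obs: the observed data Y_T^o.
    simulate_replicate: function(theta) returning a replicated data set
        Y~_T of the same size as y_obs (the model-specific part).
    theta_draws: iterable of posterior draws theta^(m).

    Returns the simulation approximation of P(g(Y~_T) >= g(Y_T^o) | Y_T^o, A);
    values very near 0 or 1 signal "surprise" in the sense discussed above.
    """
    g_obs = g(y_obs)
    g_rep = np.array([g(simulate_replicate(theta)) for theta in theta_draws])
    return np.mean(g_rep >= g_obs)
```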
For example, if model averaging consistently under- or overestimates P(ỸT ∈ S | YoT , A), then this fact can be duly noted if it is important to the client. Since there is no presumption that there exists a true model contained within the set of models considered, this sort of analysis can be important. For more details, see Draper (1995) who also provides applications to forecasting the price of oil. Ch. 1: Bayesian Forecasting 19 2.4. Forecasting To this point we have considered the generic situation of J competing models relating a common vector of interest ω to a set of observables YT . In forecasting problems (y′T+1, . . . , y′T+F ) ∈ ω. Sections 2.1 and 2.2 showed how the principle of explicit formulation leads to a recursive representation of the complete probability structure, which we collect here for ease of reference. For each model Aj , a prior model proba- bility p(Aj | A), a prior density p(θAj | Aj) for the unobservables θAj in that model, a conditional observables density p(YT | θAj , Aj ), and a vector of interest density p(ω | YT , θAj , Aj ) imply p {[ Aj , θAj (j = 1, . . . , J ) ] ,YT ,ω | A } = J∑ j=1 p(Aj | A) · p(θAj | Aj) · p(YT | θAj , Aj ) · p(ω | YT , θAj , Aj ). The entire theory of Bayesian forecasting derives from the application of the principle of relevant conditioning to this probability structure. This leads, in order, to the posterior distribution of the unobservables in each model (24)p ( θAj | YoT , Aj ) ∝ p(θAj | Aj)p ( YoT | θAj ,Aj ) (j = 1, . . . , J ), the predictive density for the vector of interest in each model (25)p ( ω | YoT , Aj ) = ∫ Aj p ( θAj | YoT , Aj ) p ( ω | YoT , θAj ) dθAj , posterior model probabilities p ( Aj | YoT , A ) ∝ p(Aj | A) · ∫ Aj p ( YoT | θAj , Aj ) p(θAj | Aj) dθAj (26)(j = 1 . . . , J ), and, finally, the predictive density for the vector of interest, (27)p ( ω | YoT , A ) = J∑ j=1 p ( ω | YoT , Aj ) p ( Aj | YoT , A ) . The density (25) involves one of the elements of the recursive formulation of the model and consequently, as observed in Section 2.2.2, simulation from the correspond- ing distribution is generally straightforward. Expression (27) involves not much more than simple addition. Technical hurdles arise in (24) and (26), and we shall return to a general treatment of these problems using posterior simulators in Section 3. Here we emphasize the incorporation of the final product (27) in forecasting – the decision of what to report about the future. In Sections 2.4.1 and 2.4.2 we focus on (24) and (25), suppressing the model subscripting notation. Section 2.4.3 returns to issues associated with forecasting using combinations of models. 20 J. Geweke and C. Whiteman 2.4.1. Loss functions and the subjective decision maker The elements of Bayesian decision theory are isomorphic to those of the classical theory of expected utility in economics. Both Bayesian decision makers and economic agents associate a cardinal measure with all possible combinations of relevant random elements in their environment – both those that they cannot control, and those that they do. The latter are called actions in Bayesian decision theory and choices in economics. The mapping to a cardinal measure is a loss function in the Bayesian decision theory and a utility function in economics, but except for a change in sign they serve the same purpose. 
The decision maker takes the Bayes action that minimizes the expected value of his loss function; the economic agent makes the choice that maximizes the expected value of her utility function. In the context of forecasting the relevant elements are those collected in the vector of interest ω, and for a single model the relevant density is (25). The Bayesian formulation is to find an action a (a vector of real numbers) that minimizes (28)E [ L(a,ω) | YoT , A ] = ∫ ∫ A L(a,ω)p ( ω | YoT , A ) dω. The solution of this problem may be denoted a(YoT , A). For some well-known special cases these solutions take simple forms; see Bernardo and Smith (1994, Section 5.1.5) or Geweke (2005, Section 2.5). If the loss function is quadratic, L(a,ω) = (a −ω)′Q(a − ω), where Q is a positive definite matrix, then a(YoT , A) = E(a | YoT , A); point fore- casts that are expected values assume a quadratic loss function. A zero-one loss function takes the form L(a,ω; ε) = 1−∫ Nε(a) (ω), where Nε(a) is an open ε-neighborhood of a. Under weak regularity conditions, as ε → 0, a → arg maxω p(ω | YoT , A). In practical applications asymmetric loss functions can be critical to effective fore- casting; for one such application see Section 6.2 below. One example is the linear-linear loss function, defined for scalar ω as (29)L(a, ω) = (1 − q) · (a − ω)I(−∞,a)(ω) + q · (ω − a)I(a,∞)(ω), where q ∈ (0, 1); the solution in this case is a = P−1(q | YoT , A), the qth quantile of the predictive distribution of ω. Another is the linear-exponential loss function studied by Zellner (1986): L(a, ω) = exp[r(a − ω)]− r(a − ω) − 1, where r = 0; then (28) is minimized by a = −r−1 log{E[exp(−rω)] | YoT , A}; if the density (25) is Gaussian, this becomes a = E(ω | YoT , A)− (r/2)var(ω | YoT , A). The extension of both the quantile and linear-exponential loss functions to the case of a vector function of interest ω is straightforward. Ch. 1: Bayesian Forecasting 23 In contrast with the predictive density, the minimization problem (31) requires a loss function, and different loss functions will lead to different solutions, other things the same, as emphasized by Weiss (1996). The problem (31) is a special case of the frequentist formulation of the forecasting problem described at the end of Section 2.4.1. As such, it inherits the internal inconsis- tencies of this approach, often appearing as challenging problems. In their recent survey of density forecasting using this approach Tay and Wallis (2000, p. 248) pinpointed the challenge, if not its source: “While a density forecast can be seen as an acknowledge- ment of the uncertainty in a point forecast, it is itself uncertain, and this second level of uncertainty is of more than casual interest if the density forecast is the direct object of at- tention . . . . How this might be described and reported is beginning to receive attention.” 2.4.3. Forecasts from a combination of models The question of how to forecast given alternative models available for the purpose is a long and well-established one. It dates at least to the 1963 work of Barnard (1963) in a paper that studied airline data. This was followed by a series of influential papers by Granger and coauthors [Bates and Granger (1969), Granger and Ramanathan (1984), Granger (1989)]; Clemen (1989) provides a review of work before 1990. The papers in this and the subsequent forecast combination literature all addressed the question of how to produce a superior forecast given competing alternatives. 
The answer turns in large part on what is available. Producing a superior forecast, given only competing point forecasts, is distinct from the problem of aggregating the information that produced the competing alternatives [see Granger and Ramanathan (1984, p. 198) and Granger (1989, pp. 168–169)]. A related, but distinct, problem is that of combining probability distributions from different and possibly dependent sources, taken up in a seminal paper by Winkler (1981). In the context of Section 2.3, forecasting from a combination of models is straight- forward. The vector of interest ω includes the relevant future observables (yT+1, . . . , yT+F ), and the relevant forecasting density is (16). Since the minimand E[L(a,ω) | YoT , A] in (28) is defined with respect to this distribution, there is no substantive change. Thus the combination of models leads to a single predictive density, which is a weighted average of the predictive densities of the individual models, the weights being propor- tional to the posterior probabilities of those models. This predictive density conveys all uncertainty about ω, conditional on the collection of models and the data, and point forecasts and other actions derive from the use of a loss function in conjunction with it. The literature acting on this paradigm has emerged rather slowly, for two reasons. One has to do with computational demands, now largely resolved and discussed in the next section; Draper (1995) provides an interesting summary and perspective on this aspect of prediction using combinations of models, along with some applications. The other is that the principle of explicit formulation demands not just point forecasts of competing models, but rather (1) their entire predictive densities p(ω | YoT , Aj ) and (2) their marginal likelihoods. Interestingly, given the results in Section 2.3.2, the latter 24 J. Geweke and C. Whiteman requirement is equivalent to a record of the one-step-ahead predictive likelihoods p(yot | Yot−1, Aj ) (t = 1, . . . , T ) for each model. It is therefore not surprising that most of the prediction work based on model combination has been undertaken using models also designed by the combiners. The feasibility of this approach was demonstrated by Zellner and coauthors [Palm and Zellner (1992), Min and Zellner (1993)] using purely analytical methods. Petridis et al. (2001) provide a successful forecasting application utilizing a combination of heterogeneous data and Bayesian model averaging. 2.4.4. Conditional forecasting In some circumstances, selected elements of the vector of future values of y may be known, making the problem one of conditional forecasting. That is, restricting attention to the vector of interest ω = (yT+1, . . . , yT+F )′, one may wish to draw inferences regarding ω treating (S1y′T+1, . . . , SF y′T+F ) ≡ Sω as known for q × p “selection” matrices (S1, . . . , SF ), which could select elements or linear combinations of elements of future values. The simplest such situation arises when one or more of the elements of y become known before the others, perhaps because of staggered data releases. More generally, it may be desirable to make forecasts of some elements of y given views that others follow particular time paths as a way of summarizing features of the joint predictive distribution for (yT+1, . . . , yT+F ). In this case, focusing on a single model, A, (25) becomes (32)p ( ω | Sω,YoT , A ) = ∫ A p ( θA | Sω,YoT , A ) p ( ω | Sω,YoT , θA ) dθA. 
As noted by Waggoner and Zha (1999), this expression makes clear that the conditional predictive density derives from the joint density of θA and ω. Thus it is not sufficient, for example, merely to know the conditional predictive density p(ω | YoT , θA), because the pattern of evolution of (yT+1, . . . , yT+F ) carries information about which θA are likely, and vice versa. Prior to the advent of fast posterior simulators, Doan, Litterman and Sims (1984) produced a type of conditional forecast from a Gaussian vector autoregression (see (3)) by working directly with the mean of p(ω | Sω,YoT , θ̄A), where θ̄A is the posterior mean of p(θA | YoT , A). The former can be obtained as the solution of a simple least squares problem. This procedure of course ignores the uncertainty in θA. More recently, Waggoner and Zha (1999) developed two procedures for calculating conditional forecasts from VARs according to whether the conditions are regarded as “hard” or “soft”. Under “hard” conditioning, Sω is treated as known, and (32) must be evaluated. Waggoner and Zha (1999) develop a Gibbs sampling procedure to do so. Un- der “soft” conditioning, Sω is regarded as lying in a pre-specified interval, which makes it possible to work directly with the unconditional predictive density (25), obtaining a sample of Sω in the appropriate interval by simply discarding those samples Sω which do not. The advantage to this procedure is that (25) is generally straightforward to ob- tain, whereas p(ω | Sω,YoT , θA) may not be. Ch. 1: Bayesian Forecasting 25 Robertson, Tallman and Whiteman (2005) provide an alternative to these condi- tioning procedures by approximating the relevant conditional densities. They spec- ify the conditioning information as a set of moment conditions (e.g., ESω = ω̂S; E(Sω − ω̂S)(Sω − ω̂S)′ = Vω), and work with the density (i) that is closest to the unconditional in an information-theoretic sense and that also (ii) satisfies the speci- fied moment conditions. Given a sample {ω(m)} from the unconditional predictive, the new, minimum-relative-entropy density is straightforward to calculate; the original den- sity serves as an importance sampler for the conditional. Cogley, Morozov and Sargent (2005) have utilized this procedure in producing inflation forecast fan charts from a time-varying parameter VAR. 3. Posterior simulation methods The principle of relevant conditioning in Bayesian inference requires that one be able to access the posterior distribution of the vector of interest ω in one or more models. In all but simple illustrative cases this cannot be done analytically. A posterior simula- tor yields a pseudo-random sequence {ω(1), . . . ,ω(M)} that can be used to approximate posterior moments of the form E[h(ω) | YoT , A] arbitrarily well: the larger is M , the better is the approximation. Taken together, these algorithms are known generically as posterior simulation methods. While the motivating task, here, is to provide a simula- tion representative of p(ω | YoT , A), this section will both generalize and simplify the conditioning, in most cases, and work with the density p(θ | I ), θ ∈  ⊆ Rk , and p(ω | θ , I ), ω ∈  ⊆ Rq , I denoting “information”. Consistent with the motivating problem, we shall assume that there is no difficulty in drawing ω(m) iid ∼ p(ω | θ , I ). The methods described in this section all utilize as building blocks the set of distrib- utions from which it is possible to produce pseudo-i.i.d. sequences of random variables or vectors. 
We shall refer to such distributions as conventional distributions. This set includes, of course, all of those found in standard mathematical applications software. There is a gray area beyond these distributions; examples include the Dirichlet (or mul- tivariate beta) and Wishart distributions. What is most important, in this context, is that posterior distributions in all but the simplest models lead almost immediately to dis- tributions from which it is effectively impossible to produce pseudo-i.i.d. sequences of random vectors. It is to these distributions that the methods discussed in this section are addressed. The treatment in this section closely follows portions of Geweke (2005, Chapter 4). 3.1. Simulation methods before 1990 The applications of simulation methods in statistics and econometrics before 1990, in- cluding Bayesian inference, were limited to sequences of independent and identically distributed random vectors. The state of the art by the mid-1960s is well summa- rized in Hammersly and Handscomb (1964) and the early impact of these methods in 28 J. Geweke and C. Whiteman Figure 1. Acceptance sampling. While Figure 1 is necessarily drawn for scalar θ it should be clear that the principle applies for vector θ of any finite order. In fact this algorithm can be implemented using a kernel k(θ | I ) of the density p(θ | I ) i.e., k(θ | I ) ∝ p(θ | I ), and this can be important in applications where the constant of integration is not known. Similarly we require only a kernel k(θ | S) of p(θ | S), and let ak = supθ∈ k(θ | I )/k(θ | S). Then for each draw m the algorithm works as follows. 1. Draw u uniform on [0, 1]. 2. Draw θ∗ ∼ p(θ | S). 3. If u > k(θ∗ | I )/akk(θ∗ | S) return to step 1. 4. Set θ (m) = θ∗. To see why the algorithm works, let ∗ denote the support of p(θ | S); a < ∞ implies  ⊆ ∗. Let cI = k(θ | I )/p(θ | I ) and cS = k(θ | S)/p(θ | S). The unconditional probability of proceeding from step 3 to step 4 is (33) ∫ ∗ { k(θ | I )/[akk(θ | S)]}p(θ | S) dθ = cI /akcS. Let A be any subset of . The unconditional probability of proceeding from step 3 to step 4 with θ ∈ A is (34) ∫ A { k(θ | I )/[akk(θ | S)]}p(θ | S) dθ = ∫ A k(θ | I ) dθ/akcS. The probability that θ ∈ A, conditional on proceeding from step 3 to step 4, is the ratio of (34) to (33), which is ∫ A k(θ | I ) dθ/cI = ∫ A p(θ | I ) dθ . Ch. 1: Bayesian Forecasting 29 Regardless of the choices of kernels the unconditional probability in (33) is cI /akcS = infθ∈ p(θ | S)/p(θ | I ). If one wishes to generate M draws of θ using ac- ceptance sampling, the expected number of times one will have to draw u, draw θ∗, and compute k(θ∗ | I )/[akk(θ∗ | S)] is M · supθ∈ p(θ | I )/p(θ | S). The computational efficiency of the algorithm is driven by those θ for which p(θ | S) has the greatest rel- ative undersampling. In most applications the time consuming part of the algorithm is the evaluation of the kernels k(θ | S) and k(θ | I ), especially the latter. (If p(θ | I ) is a posterior density, then evaluation of k(θ | I ) entails computing the likelihood function.) In such cases this is indeed the relevant measure of efficiency. Since θ (m) iid ∼ p(θ | I ), ω(m) iid∼ p(ω | I ) = ∫  p(θ | I )p(ω | θ , I ) dθ . Acceptance sampling is limited by the difficulty in finding an approximation p(θ | S) that is effi- cient, in the sense just described, and by the need to find ak = supθ∈ k(θ | I )/k(θ | S). While it is difficult to generalize, these tasks are typically more difficult the greater the number of elements of θ . 3.1.3. 
Importance sampling Rather than accept only a fraction of the draws from the source density, it is possible to retain all of them, and consistently approximate the posterior moment by appropri- ately weighting the draws. The probability density function of the source distribution is then called the importance sampling density, a term due to Hammersly and Hand- scomb (1964), who were among the first to propose the method. It appears to have been introduced to the econometrics literature by Kloek and van Dijk (1978). To describe the method, denote the source density by p(θ | S) with support ∗, and an arbitrary kernel of the source density by k(θ | S) = cS · p(θ | S) for any cS = 0. Denote an arbitrary kernel of the target density by k(θ | I ) = cI · p(θ | I ) for any cI = 0, the i.i.d. sequence θ (m) ∼ p(θ | S), and the sequence ω(m) drawn independently from p(ω | θ (m), I ). Define the weighting function w(θ) = k(θ | I )/k(θ | S). Then the approximation of h = E[h(ω) | I ] is (35)h(M) = ∑M m=1 w(θ (m))h(ω(m))∑M m=1 w(θ (m)) . Geweke (1989a) showed that if E[h(ω) | I ] exists and is finite, and ∗ ⊇ , then h(M) a.s.→ h. Moreover, if var[h(ω) | I ] exists and is finite, and if w(θ) is bounded above on , then the accuracy of the approximation can be assessed using the Lindeberg–Levy central limit theorem with an appropriately approximated variance [see Geweke (1989a, Theorem 2) or Geweke (2005, Theorem 4.2.2)]. In applications of importance sampling, this accuracy can be summarized in terms of the numerical standard error of h(M), its sampling standard deviation in independent runs of length M of the importance sam- pling simulation, and in terms of the relative numerical efficiency of h(M), the ratio of simulation size in a hypothetical direct simulator to that required using importance sam- pling to achieve the same numerical standard error. These summaries of accuracy can be 30 J. Geweke and C. Whiteman used with other simulation methods as well, including the Markov chain Monte Carlo algorithms described in Section 3.2. To see why importance sampling produces a simulation-consistent approximation of E[h(ω) | I ], notice that E [ w(θ) | S] = ∫  k(θ | I ) k(θ | S)p(θ | S) dθ = cI cS ≡ w. Since {ω(m)} is i.i.d. the strong law of large numbers implies (36)M−1 M∑ m=1 w ( θ (m) ) a.s.→ w. The sequence {w(θ (m)), h(ω(m))} is also i.i.d., and E [ w(θ)h(ω) | I ] = ∫  w(θ) [∫ h(ω)p(ω | θ , I ) dω ] p(θ | S) dθ = (cI /cS) ∫  ∫ h(ω)p(ω | θ , I )p(θ | I ) dω dθ = (cI /cS)E [ h(ω) | I ] = w · h. By the strong law of large numbers, (37)M−1 M∑ m=1 w ( θ (m) ) h ( ω(m) ) a.s.→ w · h. The fraction in (35) is the ratio of the left-hand side of (37) to the left-hand side of (36). One of the attractive features of importance sampling is that it requires only that p(θ | I )/p(θ | S) be bounded, whereas acceptance sampling requires that the supre- mum of this ratio (or that for kernels of the densities) be known. Moreover, the known supremum is required in order to implement acceptance sampling, whereas the bound- edness of p(θ | I )/p(θ | S) is utilized in importance sampling only to exploit a central limit theorem to assess numerical accuracy. An important application of importance sampling is in providing remote clients with a simple way to revise prior distributions, as discussed below in Section 3.3.2. 3.2. Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) methods are generalizations of direct sampling. 
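Before developing the Markov chain construction, a minimal sketch of the importance-sampling approximation (35) may be useful; the kernels, the source sampler, and the conditional simulator for ω below are hypothetical placeholders for the model-specific ingredients just described:

```python
import numpy as np

def importance_sampling_estimate(h, log_k_target, log_k_source, draw_source,
                                 draw_omega, M=10000, rng=None):
    """Simulation-consistent approximation (35) of E[h(omega) | I].

    log_k_target: log of a kernel k(theta | I) of the target (posterior) density.
    log_k_source: log of a kernel k(theta | S) of the importance (source) density.
    draw_source:  function(rng) returning one draw theta ~ p(theta | S).
    draw_omega:   function(theta, rng) returning one draw omega ~ p(omega | theta, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.empty(M)
    h_vals = np.empty(M)
    for m in range(M):
        theta = draw_source(rng)
        log_w[m] = log_k_target(theta) - log_k_source(theta)   # log weight w(theta)
        h_vals[m] = h(draw_omega(theta, rng))
    w = np.exp(log_w - log_w.max())        # stabilized weights; constants cancel in the ratio
    return np.sum(w * h_vals) / np.sum(w)  # ratio estimator (35)
```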
The idea is to construct a Markov chain {θ (m)} with continuous state space  and unique invariant probability density p(θ | I ). Following an initial transient or burn-in phase, the distribution of θ (m) is approximately that of the density p(θ | I ). The exact sense in which this approximation holds is important. We shall touch on this only briefly; for full detail and references see Geweke (2005, Section 3.5). We continue to assume that Ch. 1: Bayesian Forecasting 33 stronger, but still widely applicable, conditions are easier to state. For example, if for any Lebesgue measurable A with ∫ A p(θ | I ) dθ > 0 it is the case that in the Markov chain (38) P(θ (m+1) ∈ A | θ (m),G) > 0 for any θ (m) ∈ , then the Markov chain is ergodic. (Clearly neither example in Figure 2 satisfies this condition.) For this and other simple conditions see Geweke (2005, Section 4.5). 3.2.2. The Metropolis–Hastings algorithm The Metropolis–Hastings algorithm is defined by a probability density function p(θ∗ | θ ,H) indexed by θ ∈  and with density argument θ∗. The random vector θ∗ generated from p(θ∗ | θ (m−1), H) is a candidate value for θ (m). The algorithm sets θ (m) = θ∗ with probability (39)α ( θ∗ | θ (m−1), H ) = min{ p(θ∗ | I )/p(θ∗ | θ (m−1), H) p(θ (m−1) | I )/p(θ (m−1) | θ∗,H) , 1 } ; otherwise, θ (m) = θ (m−1). Conditional on θ = θ (m−1) the distribution of θ∗ is a mixture of a continuous distribution with density given by u(θ∗ | θ ,H) = p(θ∗ | θ ,H)α(θ∗ | θ ,H), corresponding to the accepted candidates, and a discrete distribution with proba- bility mass r(θ | H) = 1 − ∫  u(θ∗ | θ ,H) dθ∗ at the point θ , which is the probability of drawing a θ∗ that will be rejected. The entire transition density can be expressed using the Dirac delta function as (40)p ( θ (m) | θ (m−1), H ) = u(θ (m) | θ (m−1), H )+ r(θ (m−1) | H )δθ (m−1)(θ (m)). The intuition behind this procedure is evident on the right-hand side of (39), and is in many respects similar to that in acceptance and importance sampling. If the transition density p(θ∗ | θ ,H) makes a move from θ (m−1) to θ∗ quite likely, relative to the target density p(θ | I ) at θ∗, and a move back from θ∗ to θ (m−1) quite unlikely, relative to the target density at θ (m−1), then the algorithm will place a low probability on actually making the transition and a high probability on staying at θ (m−1). In the same situation, a prospective move from θ∗ to θ (m−1) will always be made because draws of θ (m−1) are made infrequently relative to the target density p(θ | I ). This is the most general form of the Metropolis–Hastings algorithm, due to Hastings (1970). The Metropolis et al. (1953) form takes p(θ∗ | θ,H) = p(θ | θ∗,H), which in turn leads to a simplification of the acceptance probability: α(θ∗ | θ (m−1), H) = min[p(θ∗ | I )/p(θ (m−1) | I ), 1]. A leading example of this form is the Metropolis ran- dom walk, in which p(θ∗ | θ ,H) = p(θ∗ − θ | H) and the latter density is symmetric about 0, for example that of the multivariate normal distribution with mean 0. Another special case is the Metropolis independence chain [see Tierney (1994)] in which p(θ∗ | θ ,H) = p(θ∗ | H). This leads to α(θ∗ | θ (m−1), H) = min[w(θ∗)/w(θ (m−1)), 1], where w(θ) = p(θ | I )/p(θ | H). The independence chain is closely related to ac- ceptance sampling and importance sampling. 
But rather than place a low probability of acceptance or a low weight on a draw that is too likely relative to the target distribution, the independence chain assigns a low probability of transition to that candidate. 34 J. Geweke and C. Whiteman There is a simple two-step argument that motivates the convergence of the se- quence {θ (m)}, generated by the Metropolis–Hastings algorithm, to the distribution of interest. [This approach is due to Chib and Greenberg (1995).] First, note that if a transi- tion probability density function p(θ (m) | θ (m−1), T ) satisfies the reversibility condition p ( θ (m−1) | I)p(θ (m) | θ (m−1), T ) = p(θ (m) | I)p(θ (m−1) | θ (m), T ) with respect to p(θ | I ), then∫  p ( θ (m−1) | I)p(θ (m) | θ (m−1), T ) dθ (m−1) = ∫  p ( θ (m) | I)p(θ (m−1) | θ (m), T ) dθ (m−1) (41)= p(θ (m) | I) ∫  p ( θ (m−1) | θ (m), T ) dθ (m−1) = p(θ (m) | I). Expression (41) indicates that if θ (m−1) ∼ p(θ | I ), then the same is true of θ (m). The density p(θ | I ) is an invariant density of the Markov chain with transition density p(θ (m) | θ (m−1), T ). The second step in this argument is to consider the implications of the requirement that the Metropolis–Hastings transition density p(θ (m) | θ (m−1), H) be reversible with respect to p(θ | I ), p ( θ (m−1) | I)p(θ (m) | θ (m−1), H ) = p(θ (m) | I)p(θ (m−1) | θ (m),H ). For θ (m−1) = θ (m) the requirement holds trivially. For θ (m−1) = θ (m) it implies that p ( θ (m−1) | I)p(θ∗ | θ (m−1), H )α(θ∗ | θ (m−1), H ) (42)= p(θ∗ | I)p(θ (m−1) | θ∗,H )α(θ (m−1) | θ∗,H ). Suppose without loss of generality that p ( θ (m−1) | I)p(θ∗ | θ (m−1), H ) > p(θ∗ | I)p(θ (m−1) | θ∗,H ). If α(θ (m−1) | θ∗,H) = 1 and α ( θ∗ | θ (m−1), H ) = p(θ∗ | I )p(θ (m−1) | θ∗,H) p(θ (m−1) | I )p(θ∗ | θ (m−1), H) , then (42) is satisfied. 3.2.3. Metropolis within Gibbs Different MCMC methods can be combined in a variety of rich and interesting ways that have been important in solving many practical problems in Bayesian inference. One of the most important in econometric modelling has been the Metropolis within Gibbs algorithm. Suppose that in attempting to implement a Gibbs sampling algorithm, Ch. 1: Bayesian Forecasting 35 a conditional density p[θ (b) | θ (a) (a = b)] is intractable. The density is not of any known form, and efficient acceptance sampling algorithms are not at hand. This occurs in the stochastic volatility example, for the volatilities h1, . . . , hT . This problem can be addressed by applying the Metropolis–Hastings algorithm in block b of the Gibbs sampler while treating the other blocks in the usual way. Specif- ically, let p(θ∗(b) | θ ,Hb) be the density (indexed by θ ) from which candidate θ∗(b) is drawn. At iteration m, block b, of the Gibbs sampler draw θ∗(b) ∼ p(θ ∗ (b) | θ (m)a (a < b), θ (m−1)a (a  b),Hb), and set θ (m) (b) = θ∗(b) with probability α [ θ∗(b) | θ (m)a (a < b), θ (m−1)a (a  b),Hb ] = min { p[θ (m)a (a < b), θ∗b, θ (m−1)a (a > b) | I ] p[θ∗(b) | θ (m)a (a < b), θ (m−1)a (a  b),Hb] / p[θ (m)a (a < b), θ (m−1)a (a  b) | I ] p[θ (m−1)b | θ (m)a (a < b), θ∗b, θ (m−1)a (a > b),Hb] , 1 } . If θ (m)(b) is not set to θ ∗ (b), then θ (m) (b) = θ (m−1)(b) . The procedure for θ (b) is exactly the same as for a standard Metropolis step, except that θa (a = b) also enters the density p(θ | I ) and transition density p(θ | H). It is usually called a Metropolis within Gibbs step. 
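A minimal sketch in Python may help fix ideas. Everything in it is hypothetical (a bivariate Gaussian target kernel, a Gaussian random-walk proposal, a particular step size) and it is not the stochastic volatility application mentioned above: the first block has a tractable conditional and is drawn directly, while the second block is treated as intractable and updated by a random-walk Metropolis step, so that the acceptance probability reduces to a ratio of target kernels.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_kernel(theta):
    # Hypothetical log target kernel log k(theta | I): a correlated bivariate
    # normal stands in for a posterior whose second conditional is "intractable".
    prec = np.array([[2.0, -0.8], [-0.8, 1.0]])
    return -0.5 * theta @ prec @ theta

def metropolis_within_gibbs(n_draws, step=0.7):
    theta = np.zeros(2)
    draws = np.empty((n_draws, 2))
    for m in range(n_draws):
        # Block 1: p(theta_1 | theta_2, I) is Gaussian here, so it is drawn
        # directly (an ordinary Gibbs step); for the precision matrix above it
        # has mean 0.4 * theta_2 and variance 0.5.
        theta[0] = rng.normal(0.4 * theta[1], np.sqrt(0.5))
        # Block 2: update theta_2 with a Gaussian random-walk Metropolis step;
        # the proposal is symmetric, so the acceptance probability is the ratio
        # of target kernels at the candidate and the current value.
        cand = theta.copy()
        cand[1] = theta[1] + step * rng.normal()
        if np.log(rng.uniform()) < log_kernel(cand) - log_kernel(theta):
            theta = cand
        draws[m] = theta
    return draws

draws = metropolis_within_gibbs(5000)
print(draws.mean(axis=0))
print(np.cov(draws.T))
```

The two-block structure of this sketch is exactly the case taken up next in establishing invariance.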
To see that p(θ | I ) is an invariant density of this Markov chain, consider the simple case of two blocks with a Metropolis within Gibbs step in the second block. Adapting the notation of (40), describe the Metropolis step for the second block by p ( θ∗(2) | θ (1), θ (2), H2 ) = u(θ∗(2) | θ (1), θ (2), H2)+ r(θ (2) | θ (1), H2)δθ (2)(θ∗(2)) where u ( θ∗(2) | θ (1), θ (2), H2 ) = α(θ∗(2) | θ (1), θ (2), H2)p(θ∗(2) | θ (1), θ (2), H2) and (43)r(θ (2) | θ (1), H2) = 1 − ∫ 2 u ( θ∗(2) | θ (1), θ (2), H2 ) dθ∗(2). The one-step transition density for the entire chain is p ( θ∗ | θ ,G) = p(θ∗(1) | θ (2), I)p(θ∗(2) | θ (1), θ (2), H2). Then p(θ | I ) is an invariant density of p(θ∗ | θ,G) if (44) ∫  p(θ | I )p(θ∗ | θ ,G) dθ = p(θ∗ | I). To establish (44), begin by expanding the left-hand side,∫  p(θ | I )p(θ∗ | θ ,G) dθ = ∫ 2 ∫ 1 p(θ (1), θ (2) | I ) dθ (1)p ( θ∗(1) | θ (2), I ) 38 J. Geweke and C. Whiteman The first step, posterior simulation, has become practicable for most models by virtue of the innovations in MCMC methods summarized in Section 3.2. The second simulation is relatively simple, because it is part of the recursive formulation. The simulations θ (m)A from the posterior simulator will not necessarily be i.i.d. (in the case of MCMC) and they may require weighting (in the case of importance sampling) but the simulations are ergodic: i.e., so long as E[h(θA,ω) | YoT , A] exists and is finite, (52) ∑M m=1 w(m)h(θ (m) A ,ω (m))∑M m=1 w(m) a.s.→ E[h(θA,ω) | YoT , A]. The weights w(m) in (52) come into play for importance sampling. There is another important use for weighted posterior simulation, to which we return in Section 3.3.2. This full integration of sources of uncertainty by means of simulation appears to have been applied for the first time in the unpublished thesis of Litterman (1979) as discussed in Section 4. The first published full applications of simulation methods in this way in published papers appear to have been Monahan (1983) and Thompson and Miller (1986), which built on Thompson (1984). This study applied an autoregressive model of order 2 with a conventional improper diffuse prior [see Zellner (1971, p. 195)] to quarterly US unemployment rate data from 1968 through 1979, forecasting for the period 1980 through 1982. Section 4 of their paper outlines the specifics of (51) in this case. They computed posterior means of each of the 12 predictive densities, correspond- ing to a joint quadratic loss function; predictive variances; and centered 90% predictive intervals. They compared these results with conventional non-Bayesian procedures [see Box and Jenkins (1976)] that equate unknown parameters with their estimates, thus ig- noring uncertainty about these parameters. There were several interesting findings and comparisons. 1. The posterior means of the parameters and the non-Bayesian point estimates are similar: yt = 0.441 + 1.596yt−1 − 0.669yt−2 for the former and yt = 0.342 + 1.658yt−1 − 0.719yt−2 for the latter. 2. The point forecasts from the predictive density and the conventional non-Bayesian procedure depart substantially over the 12 periods, from unemployment rates of 5.925% and 5.904%, respectively, one-step-ahead, to 6.143% and 5.693%, re- spectively, 12 steps ahead. 
This is due to the fact that an F -step-ahead mean, conditional on parameter values, is a polynomial of order F in the parameter val- ues: predicting farther into the future involves an increasingly non-linear function of parameters, and so the discrepancy between the mean of the nonlinear function and the non-linear function of the mean also increases. 3. The Bayesian 90% predictive intervals are generally wider than the corresponding non-Bayesian intervals; the difference is greatest 12 steps ahead, where the width is 5.53% in the former and 5.09% in the latter. At 12 steps ahead the 90% intervals are (3.40%, 8.93%) and (3.15%, 8.24%). 4. The predictive density is platykurtic; thus a normal approximation of the pre- dictive density (today a curiosity, in view of the accessible representation (51)) Ch. 1: Bayesian Forecasting 39 produces a 90% predictive density that is too wide, and the discrepancy increases for predictive densities farther into the future: 5.82% rather than 5.53%, 12 steps ahead. Thompson and Miller did not repeat their exercise for other forecasting periods, and therefore had no evidence on forecasting reliability. Nor did they employ the shrinkage priors that were, contemporaneously, proving so important in the successful application of Bayesian vector autoregressions at the Federal Reserve Bank of Minneapolis. We return to that project in Section 6.1. 3.3.2. Model combination and the revision of assumptions Incorporation of uncertainty about the model itself is rarely discussed, and less fre- quently acted upon; Greene (2003) does not even mention it. This lacuna is rational in non-Bayesian approaches: since uncertainty cannot be integrated in the context of one model, it is premature, from this perspective, even to contemplate this task. Since model- specific uncertainty has been resolved, both as a theoretical and as a practical matter, in Bayesian forecasting, the problem of model uncertainty is front and center. Two vari- ants on this problem are integrating uncertainty over a well-defined set of models, and bringing additional, but similar, models into such a group in an efficient manner. Extending the expression of uncertainty to a set of J specified models is straightfor- ward in principle, as detailed in Section 2.3. From (24)–(27) it is clear that the additional technical task is the evaluation of the marginal likelihoods p ( YoT | Aj ) = ∫ Aj p ( YoT | θAj , Aj ) p(θAj | Aj) dθAj (j = 1, . . . , J ). With few exceptions simulation approximation of the marginal likelihood is not a spe- cial case of approximating a posterior moment in the model Aj . One such exception of practical importance involves models Aj and Ak with a common vector of unobserv- ables θA and likelihood p(YoT | θA,Aj ) = p(YoT | θA,Ak) but different prior densities p(θA | Aj) and p(θA | Ak). (For example, one model might incorporate a set of in- equality restrictions while the other does not.) If p(θA | Ak)/p(θA | Aj) is bounded above on the support of p(θA | Aj), and if θ (m)A ∼ p(θA | YoT , Aj ) is ergodic then (53)M−1 M∑ m=1 p ( θ (m) A | Ak )/ p ( θ (m) A | Aj ) a.s.→ p(YoT | Ak)/p(YoT | Aj ); see Geweke (2005, Section 5.2.1). 
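A minimal sketch of (53) in Python may be useful. The setting is deliberately artificial and entirely hypothetical: a Gaussian location model in which model Aj has a diffuse Gaussian prior and Ak a tighter Gaussian prior, so that the prior-density ratio is bounded and the simulation approximation can be checked against the closed-form marginal likelihoods available in this conjugate case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setting: y_t ~ N(theta, 1); model A_j has prior theta ~ N(0, 10),
# model A_k the tighter prior theta ~ N(0, 1). The likelihood is common to both,
# so (53) applies: average the prior-density ratio over posterior draws from A_j.
y = rng.normal(0.3, 1.0, size=50)
n, ybar = y.size, y.mean()

# The posterior under A_j is Gaussian (conjugate), so it can be simulated
# directly; these draws stand in for any ergodic posterior simulator output.
v_j = 1.0 / (n + 1.0 / 10.0)
theta = rng.normal(v_j * n * ybar, np.sqrt(v_j), size=100_000)

ratio_hat = np.mean(stats.norm.pdf(theta, 0.0, 1.0)
                    / stats.norm.pdf(theta, 0.0, np.sqrt(10.0)))

# Closed-form check of p(y | A_k) / p(y | A_j) for this conjugate example.
def log_marglik(tau2):
    return (np.sum(stats.norm.logpdf(y, 0.0, 1.0))
            - 0.5 * np.log(1.0 + n * tau2)
            + 0.5 * (n * ybar) ** 2 * tau2 / (1.0 + n * tau2))

print(ratio_hat, np.exp(log_marglik(1.0) - log_marglik(10.0)))
```

The same average of prior-density ratios can be formed from the output of any ergodic posterior simulator for Aj, provided p(θA | Ak)/p(θA | Aj) is bounded on the support of p(θA | Aj).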
For certain types of posterior simulators, simulation-consistent approximation of the marginal likelihood is also straightforward: see Geweke (1989b, Section 5) or Geweke (2005, Section 5.2.2) for importance sampling, Chib (1995) for Gibbs sampling, Chib and Jeliazkov (2001) for the Metropolis–Hastings algorithm, and Meng and Wong (1996) for a general theoretical perspective. An approach that is more general, but of- ten computationally less efficient in these specific cases, is the density ratio method of 40 J. Geweke and C. Whiteman Gelfand and Dey (1994), also described in Geweke (2005, Section 5.2.4). These ap- proaches, and virtually any conceivable approach, require that it be possible to evaluate or approximate with substantial accuracy the likelihood function. This condition is not necessary in MCMC posterior simulators, and this fact has been central to the success of these simulations in many applications, especially those with latent variables. This, more or less, defines the rapidly advancing front of attack on this important technical issue at the time of this writing. Some important and practical modifications can be made to the set of models over which uncertainty is integrated, without repeating the exercise of posterior simulation. These modifications all exploit reweighting of the posterior simulator output. One im- portant application is updating posterior distributions with new data. In a real-time forecasting situation, for example, one might wish to update predictive distributions minute-by-minute, whereas as a full posterior simulation adequate for the purposes at hand might take more than a minute (but less than a night). Suppose the posterior sim- ulation utilizes data through time T , but the predictive distribution is being formed at time T ∗ > T . Then p ( ω | YoT ∗ , A ) = ∫ A p ( θA | YoT ∗ , A ) p ( ω | θA,YoT ∗ , A ) dθA = ∫ A p ( θA | YoT , A )p(θA | YoT ∗ , A) p(θA | YoT , A) p ( ω | θA,YoT ∗ , A ) dθA ∝ ∫ A p ( θA | YoT , A ) p ( yoT+1, . . . , y o T ∗ | θA,A ) × p(ω | θA,YoT ∗ , A) dθA. This suggests that one might use the simulator output θ (m) ∼ p(θA | YoT , A), tak- ing ω(m) ∼ p(ω | θ (m)A ,YoT ∗ , A) but reweighting the simulator output to approximate E[h(ω) | YoT ∗ , A] by (54) M∑ m=1 p ( yoT+1, . . . , y o T ∗ | θ (m)A ,A ) h ( ω(m) )/ M∑ m=1 p ( yoT+1, . . . , y o T ∗ | θ (m)A ,A ) . This turns out to be correct; for details see Geweke (2000). One can show that (54) is a simulation-consistent approximation of E[h(ω) | YoT ∗ , A] and in many cases the updat- ing requires only spreadsheet arithmetic. There are central limit theorems on which to base assessments of the accuracy of the approximations; these require more advanced, but publicly available, software; see Geweke (1999) and Geweke (2005, Sections 4.1 and 5.4). The method of reweighting can also be used to bring into the fold models with the same likelihood function but different priors, or to explore the effect of modi- fying the prior, as (53) suggests. In that context Ak denotes the new model, with a prior distribution that is more informative in the sense that p(θA | Ak)/p(θA | Aj) is bounded above on the support of Aj . Reweighting the posterior simulator output Ch. 1: Bayesian Forecasting 43 this means a normal-gamma prior (conditional normal for the regression coefficients, in- verted gamma for the residual standard deviation) and a normal likelihood. 
As Section 2 makes clear, there is no longer need for conjugacy and simple likelihoods, as develop- ments of the past 15 years have made it possible to replace “integration by Arnold Zellner” with “integration by Monte Carlo”, in some cases using MC methods devel- oped by Zellner himself [e.g., Zellner and Min (1995); Zellner and Chen (2001)]. 4.2. The dynamic linear model In 1976, P.J. Harrison and C.F. Stevens [Harrison and Stevens (1976)] read a paper with a title that anticipates ours before the Royal Statistical Society in which they remarked that “[c]ompared with current forecasting fashions our views may well appear radical”. Their approach involved the dynamic linear model (see also Chapter 7 in this volume), which is a version of a state-space observer system: yt = x′tβ t + ut , β t = Gβ t−1 + wt with ut iid ∼ N(0,Ut ) and wt iid ∼ N(0,Wt ). Thus the slope parameters are treated as latent variables, as in Section 2.2.4. As Harrison and Stevens note, this generalizes the stan- dard linear Gaussian model (one of Zellner’s examples) by permitting time variation in β and the residual covariance matrix. Starting from a prior distribution for β0 Harri- son and Stevens calculate posterior distributions for β t for t = 1, 2, . . . via the (now) well-known Kalman filter recursions. They also discuss prediction formulae for yT+k at time T under the assumption (i) that xT+k is known at T , and (ii) xT+k is unknown at T . They note that their predictions are “distributional in nature, and derived from the current parameter uncertainty” and that “[w]hile it is natural to think of the expectations of the future variate values as “forecasts” there is no need to single out the expectation for this purpose . . . if the consequences of an error in one direction are more serious that an error of the same magnitude in the opposite direction, then the forecast can be biased to take this into account” (cf. Section 2.4.1). Harrison and Stevens take up several examples, beginning with the standard regres- sion model, the “static case”. They note that in this context, their Bayesian–Kalman filter approach amounts to a computationally neat and economical method of revising regression coefficient estimates as fresh data become available, without effectively re-doing the whole calculation all over again and without any matrix inversion. This has been previ- ously pointed out by Plackett (1950) and others but its practical importance seems to have been almost completely missed. (p. 215) Other examples they treat include the linear growth model, additive seasonal model, periodic function model, autoregressive models, and moving average models. They also consider treatment of multiple possible models, and integrating across them to obtain predictions, as in Section 2.3. 44 J. Geweke and C. Whiteman Note that the Harrison–Stevens approach generalized what was possible using Zell- ner’s (1971) book, but priors were still conjugate, and the underlying structure was still Gaussian. The structures that could be handled were more general, but the statistical as- sumptions and nature of prior beliefs accommodated were quite conventional. Indeed, in his discussion of Harrison–Stevens, Chatfield (1976) remarks that . . . you do not need to be Bayesian to adopt the method. If, as the authors suggest, the general purpose default priors work pretty well for most time series, then one does not need to supply prior information. 
So, despite the use of Bayes’ theorem inherent in Kalman filtering, I wonder if Adaptive Forecasting would be a better description of the method. (p. 231) The fact remains, though, that latent-variable structure of the forecasting model does put uncertainty about the parameterization on a par with the uncertainty associated with the stochastic structure of the observables themselves. 4.3. The Minnesota revolution During the mid- to late-1970’s, Christopher Sims was writing what would become “Macroeconomics and reality”, the lead article in the January 1980 issue of Economet- rica. In that paper, Sims argued that identification conditions in conventional large-scale econometric models that were routinely used in (non Bayesian) forecasting and policy exercises, were “incredible” – either they were normalizations with no basis in theory, or “based” in theory that was empirically falsified or internally inconsistent. He pro- posed, as an alternative, an approach to macroeconomic time series analysis with little theoretical foundation other than statistical stationarity. Building on the Wold decom- position theorem, Sims argued that, exceptional circumstances aside, vectors of time series could be represented by an autoregression, and further, that such representations could be useful for assessing features of the data even though they reproduce only the first and second moments of the time series and not the entire probabilistic structure or “data generation process”. With this as motivation, Robert Litterman (1979) took up the challenge of devising procedures for forecasting with such models that were intended to compete directly with large-scale macroeconomic models then in use in forecasting. Betraying a frequentist background, much of Litterman’s effort was devoted to dealing with “multicollinearity problems and large sampling errors in estimation”. These “problems” arise because in (3), each of the equations for the p variables involves m lags of each of p variables, resulting in mp2 coefficients in B1, . . . ,Bm. To these are added the parameters BD associated with the deterministic components, as well as the p(p+1) distinct parameters in . Litterman (1979) treats these problems in a distinctly classical way, introducing “re- strictions in the form of priors” in a subsection on “Biased estimation”. While he notes that “each of these methods may be given a Bayesian interpretation”, he discusses re- duction of sampling error in classical estimation of the parameters of the normal linear Ch. 1: Bayesian Forecasting 45 model (56) via the standard ridge regression estimator [Hoerl and Kennard (1970)] βkR = ( X′T XT + Ik )−1X′T YT , the Stein (1974) class βkS = ( X′T XT + X′T XT )−1X′T YT , and, following Maddala (1977), the “generalized ridge” (58)βkS = ( X′T XT + −1 )−1(X′T YT + −1θ). Litterman notes that the latter “corresponds to a prior distribution on β of N(θ, λ2) with  = σ 2/λ2”. (Both parameters σ 2 and λ2 are treated as known.) Yet Litterman’s next statement is frequentist: “The variance of this estimator is given by σ 2(X′T XT + −1)−1”. It is clear from his development that he has the “Bayesian” shrinkage in mind as a way of reducing the sampling variability of otherwise frequentist estimators. Anticipating a formulation to come, Litterman considers two shrinkage priors (which he refers to as “generalized ridge estimators”) designed specifically with lag distribu- tions in mind. 
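Before turning to those two priors, a small numerical check of the Bayesian reading of (58) may be helpful. The sketch below, in Python, uses simulated data, treats σ² as known, and takes the prior covariance to be λ² times the identity purely for illustration (so that the Ω appearing in (58) is λ²/σ² times the identity, an assumption of the example rather than a statement from the original); it computes the generalized ridge estimator and the posterior mean under the corresponding Gaussian prior and confirms that they coincide.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data; sigma2 and the prior moments are treated as known.
T, k = 40, 3
X = rng.normal(size=(T, k))
sigma2 = 1.0
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=np.sqrt(sigma2), size=T)

theta = np.full(k, 0.1)               # prior mean
lam2 = 0.25                           # prior scale lambda^2
Omega = (lam2 / sigma2) * np.eye(k)   # the Omega appearing in (58), under the
                                      # illustrative prior beta ~ N(theta, lam2 * I)

# Generalized ridge estimator (58).
Oinv = np.linalg.inv(Omega)
beta_ridge = np.linalg.solve(X.T @ X + Oinv, X.T @ y + Oinv @ theta)

# Posterior mean of beta under beta ~ N(theta, lam2 * I) with known sigma2:
# the usual precision-weighted combination of prior and least squares information.
V = lam2 * np.eye(k)
post_prec = X.T @ X / sigma2 + np.linalg.inv(V)
post_mean = np.linalg.solve(post_prec, X.T @ y / sigma2 + np.linalg.inv(V) @ theta)

print(np.round(beta_ridge, 6))
print(np.round(post_mean, 6))         # identical up to rounding
```

Setting θ = 0 recovers the ordinary ridge form.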
The canonical distributed lag model for scalar y and x is given by

(59) $y_t = \alpha + \beta_0 x_t + \beta_1 x_{t-1} + \cdots + \beta_m x_{t-m} + u_t$.

The first prior, due to Leamer (1972), shrinks the mean and variance of the lag coefficients at the same geometric rate with the lag, and covariances between the lag coefficients at a different geometric rate according to the distance between them:

$E\beta_i = \upsilon \rho^i$, $\operatorname{cov}(\beta_i, \beta_j) = \lambda^2 \omega^{|i-j|} \rho^{i+j-2}$

with $0 < \rho, \omega < 1$. The hyperparameters ρ and ω control the decay rates, while υ and λ control the scale of the mean and variance. The spirit of this prior lives on in the "Minnesota" prior to be discussed presently. The second prior is Shiller's (1973) "smoothness" prior, embodied by

(60) $R[\beta_1, \ldots, \beta_m]' = w$, $w \sim N(0, \sigma_w^2 I_{m-2})$,

where the $(m-2) \times m$ matrix R incorporates smoothness restrictions by "differencing" adjacent lag coefficients; for example, to embody the notion that second differences between lag coefficients are small (that the lag distribution is quadratic), R is given by

$$R = \begin{bmatrix} 1 & -2 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & 0 & \cdots & 0 \\ & & \ddots & \ddots & \ddots & & \\ 0 & 0 & \cdots & 0 & 1 & -2 & 1 \end{bmatrix}.$$

Having introduced these priors, Litterman dismisses the latter, quoting Sims: ". . . the whole notion that lag distributions in econometrics ought to be smooth is . . . at best
In addition to traditional measures of forecast accuracy, Litterman also devoted sub- stantial effort to producing Fair’s (1980) “estimates of uncertainty”. These are measures of forecast accuracy that embody adjustments for changes in the variances of the fore- casts over time. In producing these measures for his Bayesian VARs, Litterman antici- pated much of the essence of posterior simulation that would be developed over the next fifteen years. The reason is that Fair’s method decomposes forecast uncertainty into sev- eral sources, of which one is the uncertainty due to the need to estimate the coefficients of the model. Fair’s version of the procedure involved simulation from the frequentist sampling distribution of the coefficient estimates, but Litterman explicitly indicated the need to stochastically simulate from the posterior distribution of the VAR parameters as well as the distribution of the error terms. Indeed, he generated 50 (!) random samples from the (equation-by-equation, empirical Bayes’ counterpart to the) predictive den- sity for a six variable, four-lag VAR. Computations required 1024 seconds on the CDC Cyber 172 computer at the University of Minnesota, a computer that was fast by the standards of the time. Ch. 1: Bayesian Forecasting 49 Doan, Litterman and Sims (1984, DLS) built on Litterman, though they retained the equation-by-equation mode of analysis he had adopted. Key innovations included ac- commodation of time variation via a Kalman filter procedure like that used by Harrison and Stevens (1976) for the dynamic linear model discussed above, and the introduc- tion of new features of the prior to reflect views that sums of own lag coefficients in each equation equal unity, further reflecting the random walk prior. [Sims (1992) sub- sequently introduced a related additional feature of the prior reflecting the view that variables in the VAR may be cointegrated.] After searching over prior hyperparameters (overall tightness, degree of time varia- tion, etc.) DLS produced a “prior” involving small time variation and some “bite” from the sum-of-lag coefficients restriction that improved pseudo-real time forecast accuracy modestly over univariate predictions for a large (10 variable) model of macroeconomic time series. They conclude the improvement is “. . . substantial relative to differences in forecast accuracy ordinarily turned up in comparisons across methods, even though it is not large relative to total forecast error.” (pp. 26–27) 4.4. After Minnesota: Subsequent developments Like DLS, Kadiyala and Karlsson (1993) studied a variety of prior distributions for macroeconomic forecasting, and extended the treatment to full system-wide analysis. They began by noting that Litterman’s (1979) equation-by-equation formulation has an interpretation as a multivariate analysis, albeit with a Gaussian prior distribution for the VAR coefficients characterized by a diagonal, known, variance-covariance matrix. (In fact, this “known” covariance matrix is data determined owing to the presence of estimated residual standard deviations in Equation (61).) They argue that diagonality is a more troublesome assumption (being “rarely supported by data”) than the one that the covariance matrix is known, and in any case introduce four alternatives that relax them both. Horizontal concatenation of equations of the form (63) and then vertically stacking (vectorizing) yields the Kadiyala and Karlsson (1993) formulation (64)yT = (Ip ⊗ XT )b + UT where now yT = vec(Y1T ,Y2T , . . . 
,YpT ), b = vec(β1,β2, . . . ,βp), and UT = vec(u1T ,u2T , . . . ,upT ). Here UT ∼ N(0,⊗IT ). The Minnesota prior treats var(uiT ) as fixed (at the unrestricted OLS estimate σ̂i) and  as diagonal, and takes, for autore- gression model A, βi | A ∼ N(β i , i ) where β i and i are the prior mean and covariance hyperparameters. This formulation results in the Gaussian posteriors βi | yT , A ∼ N ( β̄i , ̄i ) 50 J. Geweke and C. Whiteman where (recall (58)) β̄ i = ̄i ( −1i β i + σ̂−1i X′T YiT ) , ̄i = ( −1i + σ̂−1i X′T XT )−1 . Kadiyala and Karlsson’s first alternative is the “normal-Wishart” prior, which takes the VAR parameters to be Gaussian conditional on the innovation covariance matrix, and the covariance matrix not to be known but rather given by an inverted Wishart random matrix: b |  ∼ N(b, ⊗), (65)  ∼ IW(, α) where the inverse Wishart density for  given degrees of freedom parameter α and “shape” is proportional to ||−(α+p+1)/2 exp{−0.5tr−1} [see, e.g., Zellner (1971, p. 395)]. This prior is the natural conjugate prior for b,. The posterior is given by b | , yT, A ∼ N ( b̄, ⊗ ̄),  | yT, A ∼ IW ( ̄, T + α) where the posterior parameters b̄, ̄, and ̄ are simple (though notationally cumber- some) functions of the data and the prior parameters b, , and . Simple functions of interest can be evaluated analytically under this posterior, and for more complicated functions, evaluation by posterior simulation is trivial given the ease of sampling from the inverted Wishart [see, e.g., Geweke (1988)]. But this formulation has a drawback, noted long ago by Rothenberg (1963), that the Kronecker structure of the prior covariance matrix enforces an unfortunate symmetry on ratios of posterior variances of parameters. To take an example, suppress deterministic components (d = 0) and consider a 2-variable, 1-lag system (p = 2, m = 1): y1t = B1,11y1t−1 + B1,12y2t−1 + ε1t , y2t = B1,21y1t−1 + B1,22y2t−1 + ε2t . Let  = [ψij ] and ̄ = [σ̄ij ]. Then the posterior covariance matrix for b = (B1,11 B1,12 B1,21 B1,22) ′ is given by  ⊗ ̄ = ⎡⎢⎢⎣ ψ11σ̄11 ψ11σ̄12 ψ12σ̄11 ψ12σ̄12 ψ11σ̄21 ψ11σ̄22 ψ12σ̄21 ψ12σ̄22 ψ21σ̄11 ψ21σ̄12 ψ22σ̄11 ψ22σ̄12 ψ21σ̄21 ψ21σ̄22 ψ22σ̄21 ψ22σ̄22 ⎤⎥⎥⎦ , so that var(B1,11)/var(B1,21) = ψ11σ̄11/ψ22σ̄11 = var(B1,12)/var(B1,22) = ψ11σ̄22/ψ22σ̄22. Ch. 1: Bayesian Forecasting 53 generator. Subsequently, Sims and Zha (1998) showed how to adopt an informative Gaussian prior for CD,C1, . . . ,Cm|C0 together with a general (diffuse or informative) prior for C0 and concluded with the “hope that this will allow the transparency and re- producibility of Bayesian methods to be more widely available for tasks of forecasting and policy analysis” (p. 967). 5. Some Bayesian forecasting models The vector autoregression (VAR) is the best known and most widely applied Bayesian economic forecasting model. It has been used in many contexts, and its ability to im- prove forecasts and provide a vehicle for communicating uncertainty is by now well established. We return to a specific application of the VAR illustrating these qualities in Section 6. In fact Bayesian inference is now widely undertaken with many models, for a variety of applications including economic forecasting. This section surveys a few of the models most commonly used in economics. Some of these, for example ARMA and fractionally integrated models, have been used in conjunction with methods that are not only non-Bayesian but are also not likelihood-based because of the intractability of the likelihood function. 
The technical issues that arise in numerical maximization of the likelihood function, on the one hand, and the use of simulation methods in comput- ing posterior moments, on the other, are distinct. It turns out, in these cases as well as in many other econometric models, that the Bayesian integration problem is easier to solve than is the non-Bayesian optimization problem. We provide some of the details in Sections 5.2 and 5.3 below. The state of the art in inference and computation is an important determinant of which models have practical application and which do not. The rapid progress in posterior sim- ulators since 1990 is an increasingly important influence in the conception and creation of new models. Some of these models would most likely never have been substantially developed, or even emerged, without these computational tools, reviewed in Section 3. An example is the stochastic volatility model, introduced in Section 2.1.2 and discussed in greater detail in Section 5.5 below. Another example is the state space model, often called the dynamic linear model in the statistics literature, which is described briefly in Section 4.2 and in more detail in Chapter 7 of this volume. The monograph by West and Harrison (1997) provides detailed development of the Bayesian formulation of this model, and that by Pole, West and Harrison (1994) is devoted to the practical aspects of Bayesian forecasting. These models all carry forward the theme so important in vector autoregressions: priors matter, and in particular priors that cope sensibly with an otherwise profligate pa- rameterization are demonstrably effective in improving forecasts. That was true in the earliest applications when computational tools were very limited, as illustrated in Sec- tion 4 for VARs, and here for autoregressive leading indicator models (Section 5.1). This fact has become even more striking as computational tools have become more sophisti- cated. The review of cointegration and error correction models (Section 5.4) constitutes 54 J. Geweke and C. Whiteman a case study in point. More generally models that are preferred, as indicated by Bayes factors, should lead to better decisions, as measured by ex post loss, for the reasons developed in Sections 2.3.2 and 2.4.1. This section closes with such a comparison for time-varying volatility models. 5.1. Autoregressive leading indicator models In a series of papers [Garcia-Ferer et al. (1987), Zellner and Hong (1989), Zellner, Hong and Gulati (1990), Zellner, Hong and Min (1991), Min and Zellner (1993)] Zellner and coauthors investigated the use of leading indicators, pooling, shrinkage, and time- varying parameters in forecasting real output for the major industrialized countries. In every case the variable modeled was the growth rate of real output; there was no pre- sumption that real output is cointegrated across countries. The work was carried out entirely analytically, using little beyond what was available in conventional software at the time, which limited attention almost exclusively to one-step-ahead forecasts. A prin- cipal goal of these investigations was to improve forecasts significantly using relatively simple models and pooling techniques. The observables model in all of these studies is of the form (68)yit = α0 + 3∑ s=1 αsyi,t−s + β ′zi,t−1 + εit , εit iid∼ N ( 0, σ 2 ) , with yit denoting the growth rate in real GNP or real GDP between year t −1 and year t in country i. The vector zi,t−1 comprises the leading indicators. In Garcia-Ferer et al. 
(1987) and Zellner and Hong (1989) zit consisted of real stock returns in country i in years t−1 and t , the growth rate in the real money supply between years t−1 and t , and world stock return defined as the median real stock return in year t over all countries in the sample. Attention was confined to nine OECD countries in Garcia-Ferer et al. (1987). In Zellner and Hong (1989) the list expanded to 18 countries but the original group was reported separately, as well, for purposes of comparison. The earliest study, Garcia-Ferer et al. (1987), considered five different forecasting procedures and several variants on the right-hand-side variables in (68). The period 1954–1973 was used exclusively for estimation, and one-step-ahead forecast errors were recorded for each of the years 1974 through 1981, with estimates being updated before each forecast was made. Results for root mean square forecast error, expressed in units of growth rate percentage, are given in Table 1. The model LI1 includes only the two stock returns in zit ; LI2 adds the world stock return and LI3 adds also the growth rate in the real money supply. The time varying parameter (TVP) model utilizes a conven- tional state-space representation in which the variance in the coefficient drift is σ 2/2. The pooled models constrain the coefficients in (68) to be the same for all countries. In the variant “Shrink 1” each country forecast is an equally-weighted average of the own country forecast and the average forecast for all nine countries; unequally-weighted Ch. 1: Bayesian Forecasting 55 Table 1 Summary of forecast RMSE for 9 countries in Garcia-Ferer et al. (1987) Estimation method (None) OLS TVP Pool Shrink 1 Growth rate = 0 3.09 Random walk growth rate 3.73 AR(3) 3.46 AR(3)-LI1 2.70 2.52 3.08 AR(3)-LI2 2.39 2.62 AR(3)-LI3 2.23 1.82 2.22 1.78 Table 2 Summary of forecast RMSE for 18 countries in Zellner and Hong (1989) Estimation method (None) OLS Pool Shrink 1 Shrink 2 Growth rate = 0 3.07 Random walk growth rate 3.02 Growth rate = Past average 3.09 AR(3) 3.00 AR(3)-LI3 2.62 2.14 2.32 2.13 averages (unreported here) produce somewhat higher root mean square error of fore- cast. The subsequent study by Zellner and Hong (1989) extended this work by adding nine countries, extending the forecasting exercise by three years, and considering an alterna- tive shrinkage procedure. In the alternative, the coefficient estimates are taken to be a weighted average of the least squares estimates for the country under consideration, and the pooled estimates using all the data. The study compared several weighting schemes, and found that a weight of one-sixth on the country estimates and five-sixths on the pooled estimates minimized the out-of-sample forecast root mean square error. These results are reported in the column “Shrink 2” in Table 2. Garcia-Ferer et al. (1987) and Zellner and Hong (1989) demonstrated the returns both to the incorporation of leading indicators and to various forms of pooling and shrinkage. Combined, these two methods produce root mean square errors of forecast somewhat smaller than those of considerably more complicated OECD official fore- casts [see Smyth (1983)], as described in Garcia-Ferer et al. (1987) and Zellner and Hong (1989). A subsequent investigation by Min and Zellner (1993) computed formal posterior odds ratios between the most competitive models. Consistent with the results described here, they found that odds rarely exceeded 2 : 1 and that there was no sys- tematic gain from combining forecasts. 
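To make the mechanics of the "Shrink 2" procedure concrete, the following sketch in Python collapses the country model to an AR(1) in growth rates, a simplified stand-in for (68), with all data simulated; only the weighting of one-sixth on the country least squares estimates and five-sixths on the pooled estimates is taken from the studies just described.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated growth rates for a panel of countries; the country model is an
# AR(1) with intercept, a simplified stand-in for (68).
n_countries, T = 9, 30
y = np.zeros((n_countries, T))
for i in range(n_countries):
    a, b = rng.normal(1.0, 0.3), rng.normal(0.3, 0.1)
    for t in range(1, T):
        y[i, t] = a + b * y[i, t - 1] + rng.normal(scale=1.5)

def ols(X, z):
    return np.linalg.lstsq(X, z, rcond=None)[0]

# Country-by-country and pooled least squares fits.
X_list, z_list, country_coef = [], [], []
for i in range(n_countries):
    X = np.column_stack([np.ones(T - 1), y[i, :-1]])
    X_list.append(X)
    z_list.append(y[i, 1:])
    country_coef.append(ols(X, y[i, 1:]))
pooled_coef = ols(np.vstack(X_list), np.concatenate(z_list))

# "Shrink 2": weight 1/6 on the country estimates, 5/6 on the pooled estimates,
# then form one-step-ahead point forecasts from the shrunken coefficients.
w = 1.0 / 6.0
forecasts = [np.array([1.0, y[i, -1]]) @ (w * country_coef[i] + (1 - w) * pooled_coef)
             for i in range(n_countries)]
print(np.round(forecasts, 2))
```

"Shrink 1" would instead average the forecasts themselves rather than the coefficients.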
58 J. Geweke and C. Whiteman unobservables. In Marriott et al. (1996) the augmented data are ε0 = (ε0, . . . , ε1−p)′ and u0 = (u0, . . . , u1−q)′. Then [see Marriott et al. (1996, pp. 245–246)] (71)p(ε1, . . . , εT | φ, θ , h, ε0,u0) = (2π)−T/2hT/2 exp [ −h T∑ t=1 (εt − μt)2/2 ] with (72)μt = p∑ s=1 φsεt−s − t−1∑ s=1 θs(εt−s − μt−s) − q∑ s=t θsεt−s . (The second summation is omitted if t = 1, and the third is omitted if t > q.) The data augmentation scheme is feasible because the conditional posterior density of u0 and ε0, (73)p(ε0,u0 | φ, θ , h,XT , yT ) is that of a Gaussian distribution and is easily computed [see Newbold (1974)]. The product of (73) with the density corresponding to (71)–(72) yields a Gaussian kernel for the presample ε0 and u0. A draw from this distribution becomes one step in a Gibbs sampling posterior simulation algorithm. The presence of (73) prevents the posterior conditional distribution of φ and θ from being Gaussian. This complication may be handled just as it was in the case of the AR(p) model, using a Metropolis within Gibbs step. There are a number of variants on these approaches. Chib and Greenberg (1994) show that the data augmentation vector can be reduced to max(p, q + 1) elements, with some increase in complexity. As an alternative to enforcing stationarity in the Metropolis within Gibbs step, the transformation of φ to the corresponding vector of partial auto- correlations [see Barndorff-Nielsen and Schou (1973)] may be inverted and the Jacobian computed [see Monahan (1984)], thus transforming Sp to a unit hypercube. A similar treatment can restrict the roots of 1 −∑qs=1 θszs to the exterior of the unit circle [see Marriott et al. (1996)]. There are no new essential complications introduced in extending any of these mod- els or posterior simulators from univariate (ARMA) to multivariate (VARMA) models. On the other hand, VARMA models lead to large numbers of parameters as the number of variables increases, just as in the case of VAR models. The BVAR (Bayesian Vector Autoregression) strategy of using shrinkage prior distributions appears not to have been applied in VARMA models. The approach has been, instead, to utilize exclusion restric- tions for many parameters, the same strategy used in non-Bayesian approaches. In a Bayesian set-up, however, uncertainty about exclusion restrictions can be incorporated in posterior and predictive distributions. Ravishanker and Ray (1997a) do exactly this, in extending the model and methodology of Marriott et al. (1996) to VARMA models. Corresponding to each autoregressive coefficient φijs there is a multiplicative Bernoulli random variable γijs , indicating whether that coefficient is excluded, and similarly for Ch. 1: Bayesian Forecasting 59 each moving average coefficient θijs there is a Bernoulli random variable δijs : yit = n∑ j=1 p∑ s=1 γijsφijsyj,t−s + n∑ j=1 q∑ s=1 θijsδijsεj,t−s + εit (i = 1, . . . , n). Prior probabilities on these random variables may be used to impose parsimony, both globally and also differentially at different lags and for different variables; independent Bernoulli prior distributions for the parameters γijs and δijs , embedded in a hierarchical prior with beta prior distributions for the probabilities, are the obvious alternatives to ad hoc non-Bayesian exclusion decisions, and are quite tractable. The conditional posterior distributions of the γijs and δijs are individually conditionally Bernoulli. 
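A single-coefficient sketch in Python shows what one such conditional Bernoulli draw looks like; the regression, the data, and all parameter values are hypothetical and far simpler than the VARMA specification, but the calculation (prior inclusion odds times the likelihood ratio with the coefficient switched in and out) is of the same form.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Illustration: y_t = gamma * phi * x_t + eps_t, eps_t ~ N(0, sigma2), gamma in {0, 1}.
# Conditional on phi, sigma2 and the data, gamma has a Bernoulli posterior.
T = 100
x = rng.normal(size=T)
y = 0.4 * x + rng.normal(size=T)

def draw_inclusion(phi, sigma2, prior_prob):
    sd = np.sqrt(sigma2)
    # Log likelihood with the coefficient switched in (gamma = 1) and out (gamma = 0).
    ll_in = np.sum(stats.norm.logpdf(y, phi * x, sd))
    ll_out = np.sum(stats.norm.logpdf(y, 0.0, sd))
    # Posterior odds of inclusion = prior odds times the likelihood ratio.
    log_odds = np.log(prior_prob) - np.log1p(-prior_prob) + ll_in - ll_out
    p_include = 1.0 / (1.0 + np.exp(-log_odds))
    return rng.uniform() < p_include, p_include

gamma, p_include = draw_inclusion(phi=0.4, sigma2=1.0, prior_prob=0.5)
print(gamma, round(p_include, 3))
```

In the hierarchical version described above, prior_prob would itself be updated from its beta conditional at each sweep of the sampler.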
This strategy is one of a family of similar approaches to exclusion restrictions in regression models [see George and McCulloch (1993) or Geweke (1996b)] and has also been employed in univariate ARMA models [see Barnett, Kohn and Sheather (1996)]. The posterior MCMC sampling algorithm for the parameters φijs and δijs also proceeds one parameter at a time; Ravishanker and Ray (1997a) report that this algorithm is computationally efficient in a three-variable VARMA model with p = 3, q = 1, applied to a data set with 75 quarterly observations. 5.3. Fractional integration Fractional integration, also known as long memory, first drew the attention of econo- mists because of the improved multi-step-ahead forecasts provided by even the simplest variants of these models as reported in Granger and Joyeux (1980) and Porter-Hudak (1982). In a fractionally integrated model (1 − L)dyt = ut , where (1 − L)d = ∞∑ j=0 ( d j ) (−L)j = ∞∑ j=1 (−1)j(d − 1) (j − 1)(d − j − 1)L j and ut is a stationary process whose autocovariance function decays geometrically. The fully parametric version of this model typically specifies (74)φ(L)(1 − L)d(yt − μ) = θ(L)εt , with φ(L) and θ(L) being polynomials of specified finite order and εt being serially uncorrelated; most of the literature takes εt iid ∼ N(0, σ 2). Sowell (1992a, 1992b) first de- rived the likelihood function and implemented a maximum likelihood estimator. Koop et al. (1997) provided the first Bayesian treatment, employing a flat prior distribution for the parameters in φ(L) and θ(L), subject to invertibility restrictions. This study used importance sampling of the posterior distribution, with the prior distribution as the source distribution. The weighting function w(θ) is then just the likelihood function, evaluated using Sowell’s computer code. The application in Koop et al. (1997) used quarterly US real GNP, 1947–1989, a standard data set for fractionally integrated mod- els, and polynomials in φ(L) and θ(L) up to order 3. This study did not provide any 60 J. Geweke and C. Whiteman evaluation of the efficiency of the prior density as the source distribution in the impor- tance sampling algorithm; in typical situations this will be poor if there are a half-dozen or more dimensions of integration. In any event, the computing times reported3 indicate that subsequent more sophisticated algorithms are also much faster. Much of the Bayesian treatment of fractionally integrated models originated with Ravishanker and coauthors, who applied these methods to forecasting. Pai and Ravi- shanker (1996) provided a thorough treatment of the univariate case based on a Metropolis random-walk algorithm. Their evaluation of the likelihood function differs from Sowell’s. From the autocovariance function r(s) corresponding to (74) given in Hosking (1981) the Levinson–Durbin algorithm provides the partial regression coeffi- cients φkj in (75)μt = E(yt | Yt−1) = t−1∑ j=1 φt−1j yt−j . The likelihood function then follows from (76)yt | Yt−1 ∼ N ( μt , ν 2 t ) , ν2t = [ r(0)/σ 2 ] t−1∏ j=1 [ 1 − (φjj )2]. Pai and Ravishanker (1996) computed the maximum likelihood estimate as discussed in Haslett and Raftery (1989). The observed Fisher information matrix is the variance matrix used in the Metropolis random-walk algorithm, after integrating μ and σ 2 ana- lytically from the posterior distribution. 
The study focused primarily on inference for the parameters; note that (75)–(76) provide the basis for sampling from the predictive distribution given the output of the posterior simulator. A multivariate extension of (74), without cointegration, may be expressed (L)D(L)(yt − μ) = (L)εt in which yt is n × 1, D(L) = diag[(1 − L)d1 , . . . , (1 − L)dn ], (L) and (L) are n × n matrix polynomials in L of specified order, and εt iid∼ N(0, ). Ravishanker and Ray (1997b, 2002) provided an exact Bayesian treatment and a forecasting application of this model. Their approach blends elements of Marriott et al. (1996) and Pai and Ravishanker (1996). It incorporates presample values of zt = yt − μ and the pure fractionally integrated process at = D(L)−1εt as latent variables. The autocovariance function Ra(s) of at is obtained recursively from ra(0)ij = σij (1 − di − dj ) (1 − di)(1 − dj ) , r a(s)ij = −1 − di − s s − dj r a(s − 1)ij . 3 Contrast Koop et al. (1997, footnote 12) with Pai and Ravishanker (1996, p. 74). Ch. 1: Bayesian Forecasting 63 Table 3 Comparison of forecast RMSE in Shoesmith (1995) Horizon 1 quarter 8 quarters 20 quarters VAR/I1 1.33 1.00 1.14 ECM 1.28 0.89 0.91 BVAR/I1 0.97 0.96 0.85 BECM 0.89 0.72 0.45 BVAR/I0 0.95 0.87 0.59 BECM/5Z 0.99 1.02 0.88 This experiment incorporated the Minnesota prior utilizing the mixed estimation methods described in Section 4.3, appropriate at the time to the investigation of the relative contributions of error correction and shrinkage in improving forecasts. More recent work has employed modern posterior simulators. A leading example is Villani (2001), which examined the inflation forecasting model of the central bank of Sweden. This model is expressed in error correction form (77)yt = μ+ αβ ′yt−1 + p∑ s=1 syt−s + εt , εt iid∼ N(0, ). It incorporates GDP, consumer prices and the three-month treasury rate, both Swedish and weighted averages of corresponding foreign series, as well as the trade-weighted exchange rate. Villani limits consideration to models in which β is 7 × 3, based on the bank’s experience. He specifies four candidate coefficient vectors: for example, one based on purchasing power parity and another based on a Fisherian interpretation of the nominal interest rate given a stationary real rate. This forms the basis for compet- ing models that utilize various combinations of these vectors in β, as well as unknown cointegrating vectors. In the most restrictive formulations three vectors are specified and in the least restrictive all three are unknown. Villani specifies conventional uninfor- mative priors for α, β and , and conventional Minnesota priors for the parameters s of the short-run dynamics. The posterior distribution is sampled using a Gibbs sampler blocked in μ, α, β, {s} and . The paper utilizes data from 1972:2 through 1993:3 for inference. Of all of the combinations of cointegrating vectors, Villani finds that the one in which all three are unrestricted is most favored. This is true using both likelihood ratio tests and an informal version (necessitated by the improper priors) of posterior odds ratios. This unrestricted specification (“β empirical” in the table below), as well as the most restricted one (“β specified”), are carried forward for the subsequent forecasting exercise. This exercise compares forecasts over the period 1994–1998, reporting forecast root mean square er- rors for the means of the predictive densities for price inflation (“Bayes ECM”). 
It also computes forecasts from the maximum likelihood estimates, treating these estimates as 64 J. Geweke and C. Whiteman Table 4 Comparison of forecast RMSE in Villani (2001) β specified empirical Bayes ECM 0.485 0.488 ML unrestricted ECM 0.773 0.694 ML restricted ECM 0.675 0.532 known coefficients (“ML unrestricted ECM”), and finds the forecast root mean square error. Finally, it constrains many of the coefficients to zero, using conventional step- wise deletion procedures in conjunction with maximum likelihood estimation, and again finds the forecast root mean square error. Taking averages of these root mean square er- rors over forecasting horizons of one to eight quarters ahead yields comparison given in Table 4. The Bayesian ECM produces by far the lowest root mean square error of forecast, and results are about the same whether the restricted or unrestricted version of the cointegrating vectors are used. The forecasts based on restricted maximum likeli- hood estimates benefit from the additional restrictions imposed by stepwise deletion of coefficients, which is a crude from of shrinkage. In comparison with Shoesmith (1995), Villani (2001) has the further advantage of having used a full Monte Carlo simulation of the predictive density, whose mean is the Bayes estimate given a squared-error loss function. These findings are supported by other studies that have made similar comparisons. An earlier literature on regional forecasting, of which the seminal paper is Lesage (1990), contains results that are broadly consistent but not directly comparable because of the differences in variables and data. Amisano and Serati (1999) utilized a three-variable VAR for Italian GDP, consumption and investment. Their approach was closer to mixed estimation than to full Bayesian inference. They employed not only a conventional Min- nesota prior for the short-run dynamics, but also applied a shrinkage prior to the factor loading vector α in (77). This combination produced a smaller root mean square error, for forecasts from one to twenty quarters ahead, than either a traditional VAR with a Minnesota prior, or an ECM that shrinks the short-run dynamics but not α. 5.5. Stochastic volatility In classical linear processes, for example the vector autoregression (3), conditional means are time varying but conditional variances are not. By now it is well established that for many time series, including returns on financial assets, conditional variances in fact often vary greatly. Moreover, in the case of financial assets, conditional vari- ances are fundamental to portfolio allocation. The ARCH family of models provides conditional variances that are functions of past realizations, likelihood functions that are relatively easy to evaluate, and a systematic basis for forecasting and solving the Ch. 1: Bayesian Forecasting 65 allocation problem. Stochastic volatility models provide an alternative approach, first motivated by autocorrelated information flows [see Tauchen and Pitts (1983)] and as discrete approximations to diffusion processes utilized in the continuous time asset pric- ing literature [see Hull and White (1987)]. The canonical univariate model, introduced in Section 2.1.2, is yt = β exp(ht/2)εt , ht = φht−1 + σηηt , (78)h1 ∼ N [ 0, σ 2η / ( 1 − φ2)], (εt , ηt )′ iid∼ N(0, I2). Only the return yt is observable. In the stochastic volatility model there are two shocks per time period, whereas in the ARCH family there is only one. 
As a consequence the stochastic volatility model can more readily generate extreme realizations of yt . Such a realization will have an impact on the variance of future realizations if it arises because of an unusually large value of ηt , but not if it is due to large εt . Because ht is a latent process not driven by past realizations of yt , the likelihood function cannot be evaluated directly. Early applications like Taylor (1986) and Melino and Turnbull (1990) used method of moments rather than likelihood-based approaches. Jacquier, Polson and Rossi (1994) were among the first to point out that the formu- lation of (78) in terms of latent variables is, by contrast, very natural in a Bayesian formulation that exploits a MCMC posterior simulator. The key insight is that condi- tional on the sequence of latent volatilities {ht }, the likelihood function for (78) factors into a component for β and one for σ 2η and φ. Given an inverted gamma prior distribution for β2 the posterior distribution of β2 is also inverted gamma, and given an independent inverted gamma prior distribution for σ 2η and a truncated normal prior distribution for φ, the posterior distribution of (σ 2η , φ) is the one discussed at the start of Section 5.2. Thus, the key step is sampling from the posterior distribution of {ht } conditional on {yot } and the parameters (β, σ 2η , φ). Because {ht } is a first order Markov process, the conditional distribution of a single ht given {hs, s = t}, {yt } and (β, σ 2η , φ) depends only on ht−1, ht+1, yt and (β, σ 2η , φ). The log-kernel of this distribution is (79)− (ht − μt) 2 2σ 2η /(1 + φ2) − ht 2 − y 2 t exp(−ht ) 2β2 with μt = φ(ht+1 + ht−1) 1 + φ2 − σ 2η 2(1 + φ2) . Since the kernel is non-standard, a Metropolis-within-Gibbs step can be used for the draw of each ht . The candidate distribution in Jacquier, Polson and Rossi (1994) is in- verted gamma, with parameters chosen to match the first two moments of the candidate density and the kernel. There are many variants on this Metropolis-within-Gibbs step. Shephard and Pitt (1997) took a second-order Taylor series expansion of (79) about ht = μt , and then 68 J. Geweke and C. Whiteman Table 5 Realized utility for alternative hedging strategies White noise GARCH-t Stoch. vol. RW hedge Marginal likelihood −4305.9 −4043.4 −4028.5∑ Ut (γ = −10) −2.24 −0.01 3.10 3.35∑ Ut (γ = −2) 0.23 7.42 7.69 6.73∑ Ut (γ = 0) 5.66 7.40 9.60 7.56 stochastic volatility model, against the six models other than GARCH(1, 1)-t consid- ered, are all over 100.) Given the output of the posterior simulators, solving the optimal hedging problem is a simple and straightforward calculus problem, as described in Sec- tion 3.3.1. The performance of any sequence of hedging decisions {Ht } over the period T + 1, . . . , T + F can be evaluated by the ex post realized utility T+F∑ t=T+1 Ut = γ−1 T+F∑ t=T+1 [ (1 − Ht)St+1 + HtFt ] /St . The article undertook this exercise for all of the models considered as well as some benchmark ad hoc decision rules. In addition to the GARCH(1, 1)-t and stochastic volatility models, the exercise included a benchmark model in which the exchange re- turn st is Gaussian white noise. The best-performing ad hoc decision rule is the random walk strategy, which sets the hedge ratio to one (zero) if the foreign currency depreci- ated (appreciated) in the previous period. The comparisons are given in Table 5. 
The stochastic volatility model leads to higher realized utility than does the GARCH-t model in all cases, and it outperforms the random walk hedge except for the most risk-averse utility function. Hedging strategies based on the white noise model are always inferior. Model combination would place almost all weight on the stochastic volatility model, given the Bayes factors, and so the decision based on model combination, discussed in Sections 2.4.3 and 3.3.2, leads to the best outcome.

6. Practical experience with Bayesian forecasts

This section describes two long-term experiences with Bayesian forecasting: the Federal Reserve Bank of Minneapolis national forecasting project, and the Iowa Economic Forecast produced by the University of Iowa Institute for Economic Research. This is certainly not an exhaustive treatment of the production usage of Bayesian forecasting methods; we describe these experiences because they are well documented [Litterman (1986), McNees (1986), Whiteman (1996)] and because we have personal knowledge of each.

6.1. National BVAR forecasts: The Federal Reserve Bank of Minneapolis

Litterman's thesis work at the University of Minnesota ("the U") was coincident with his employment as a research assistant in the Research Department at the Federal Reserve Bank of Minneapolis (the "Bank"). In 1978 and 1979, he wrote a computer program, "Predict", to carry out the calculations described in Section 4. At the same time, Thomas Doan, also a graduate student at the U and likewise a research assistant at the Bank, was writing code to carry out regression, ARIMA, and other calculations for staff economists. Thomas Turner, a staff economist at the Bank, had modified a program written by Christopher Sims, "Spectre", to incorporate regression calculations using complex arithmetic to facilitate frequency-domain treatment of serial correlation. By the summer of 1979, Doan had collected his own routines in a flexible shell and incorporated the features of Spectre and Predict (in most cases completely recoding their routines) to produce the program RATS (for "Regression Analysis of Time Series"). Indeed, Litterman (1979) indicates that some of the calculations for his paper were carried out in RATS. The program subsequently became a successful Doan–Litterman commercial venture, and did much to facilitate the adoption of BVAR methods throughout academia and business.

It was in fact Litterman himself who was responsible for the Bank's focus on BVAR forecasts. He had left Minnesota in 1979 to take a position as Assistant Professor of Economics at M.I.T., but was hired back to the Bank two years later. Based on work carried out while a graduate student and subsequently at M.I.T., in 1980 Litterman began issuing monthly forecasts using a six-variable BVAR of the type described in Section 4. The six variables were: real GNP, the GNP price deflator, real business fixed investment, the 3-month Treasury bill rate, the unemployment rate, and the money supply (M1). Upon his return to the Bank, the BVAR for these variables [described in Litterman (1986)] became known as the "Minneapolis Fed model".

In his description of five years of monthly experience forecasting with the BVAR model, Litterman (1986) notes that unlike his competition at the time – large, expensive commercial forecasts produced by the likes of Data Resources Inc.
(DRI), Wharton Econometric Forecasting Associates (WEFA), and Chase – his forecasts were produced mechanically, without judgmental adjustment. The BVAR often produced forecasts very different from the commercial predictions, and Litterman notes that they were sometimes regarded by recipients (Litterman's mailing list of academics, which included both of us) as too "volatile" or "wild". Still, his procedure produced real-time forecasts that were "at least competitive with the best forecasts commercially available" [Litterman (1986, p. 35)]. McNees's (1986) independent assessment, which also involved comparisons with an even broader collection of competitors, was that Litterman's BVAR was "generally the most accurate or among the most accurate" for real GNP, the unemployment rate, and investment. The BVAR price forecasts, on the other hand, were among the least accurate.

Subsequent study by Litterman resulted in the addition of an exchange rate measure and stock prices, which improved, at least experimentally, the performance of the model's price predictions. Other models were developed as well; Litterman (1984) describes a 46-variable monthly national forecasting model, while Amirizadeh and Todd (1984) describe a five-state model of the 9th Federal Reserve District (that of the Minneapolis Fed) involving 3 or 4 equations per state. Moreover, the models were used regularly in Bank discussions, and reports based on them appeared regularly in the Minneapolis Fed Quarterly Review [e.g., Litterman (1984), Litterman (1985)].

In 1986, Litterman left the Bank to go to Goldman–Sachs. This required dissolution of the Doan–Litterman joint venture, and Doan subsequently formed Estima, Inc. to further develop and market RATS. It also meant that forecast production fell to staff economists whose research interests were not necessarily focused on the further development of BVARs [e.g., Roberds and Todd (1987), Runkle (1988), Miller and Runkle (1989), Runkle (1989, 1990, 1991)]. This, together with the pain associated with explaining the inevitable forecast errors, caused enthusiasm for the BVAR effort at the Bank to wane over the ensuing half dozen years, and the last Quarterly Review "outlook" article based on a BVAR forecast appeared in 1992 [Runkle (1992)]. By the spring of 1993, the Bank's BVAR efforts were being overseen by a research assistant (albeit a quite capable one), and the authors of this paper were consulted by the leadership of the Bank's Research Department regarding what steps were required to ensure academic currency and reliability of the forecasting effort. The cost – our advice was to employ a staff economist whose research would be complementary to the production of forecasts – was regarded as too high given the configuration of economists in the department, and development of the forecasting model and procedures at the Bank effectively ceased.

Cutting-edge development of Bayesian forecasting models reappeared relatively soon within the Federal Reserve System. In 1995, Tao Zha, who had written a Minnesota thesis under the direction of Chris Sims, moved from the University of Saskatchewan to the Federal Reserve Bank of Atlanta, and began implementing the developments described in Sims and Zha (1998, 1999) to produce regular forecasts for internal briefing purposes.
These efforts, which utilize the over-identified procedures of Section 4.4, are documented in Robertson and Tallman (1999a, 1999b) and Zha (1998), but there is no continuous public record of forecasts comparable to Litterman's "Five Years of Experience".

6.2. Regional BVAR forecasts: economic conditions in Iowa

In 1990, Whiteman became Director of the Institute for Economic Research at the University of Iowa. Previously, the Institute had published forecasts of general economic conditions and had produced tax revenue forecasts for internal use of the state's Department of Management by judgmentally adjusting the product of a large commercial forecaster. These forecasts had not been especially accurate and were costing the state tens of thousands of dollars each year. As a consequence, an "Iowa Economic Forecast" model was constructed based on BVAR technology, and forecasts using it have been issued continuously each quarter since March 1990.