Wildfire smoke is one of the most significant concerns for human and environmental health because of its substantial impacts on air quality, weather, and climate. However, biomass burning emissions and smoke remain among the largest sources of uncertainty in air quality forecasts. In this study, we evaluate the smoke emissions and plume forecasts from 12 state-of-the-art air quality forecasting systems during the Williams Flats fire in Washington State, US, in August 2019, which was intensively observed during the Fire Influence on Regional to Global Environments and Air Quality (FIREX-AQ) field campaign. Model forecasts with lead times within 1 d are intercompared under a common framework, using observations from multiple platforms, to assess their performance regarding fire emissions, aerosol optical depth (AOD), surface PM2.5, plume injection, and the surface PM2.5 to AOD ratio. The comparison of smoke organic carbon (OC) emissions reveals a large spread in daily totals among the models, which differ by a factor of 20 to 50. Limited representations of the diurnal patterns and day-to-day variations of emissions highlight the need for new methodologies that predict the temporal evolution of smoke emissions and reduce the uncertainty of their estimates. The evaluation of smoke AOD (sAOD) forecasts indicates overall underpredictions of both the sAOD magnitude and the smoke plume area for nearly all models, although the high-resolution models better represent the fine-scale structures of smoke plumes. The models that are driven by fire radiative power (FRP)-based fire emissions or that assimilate satellite AOD data generally outperform the others. Additionally, limitations of the persistence assumption used when predicting smoke emissions are revealed by substantial underpredictions of sAOD on 8 August 2019, mainly over the transported smoke plumes, owing to the underestimated emissions on 7 August.
In contrast, the surface smoke PM2.5 (sPM2.5) forecasts show both positive and negative overall biases for these models, with most members showing more pronounced diurnal variations of sPM2.5. Overpredictions of sPM2.5 are found during nighttime for the models driven by FRP-based emissions, suggesting the need to improve the vertical allocation of emissions within and above the planetary boundary layer (PBL). Smoke injection heights are further evaluated using data from the NASA Langley Research Center's Differential Absorption High Spectral Resolution Lidar (DIAL-HSRL) collected during the flight observations. As the fire became stronger from 3 to 8 August, the plume became deeper, with a day-to-day range of about 2 to 9 km a.g.l. However, narrower ranges are found for all models, which tend to overpredict the plume heights for the shallower injection transects and underpredict them for the days showing deeper injections. The misrepresented plume injection heights lead to inaccurate vertical plume allocations along the transects corresponding to transported smoke that is 1 d old. The evaluation of the surface PM2.5 to AOD ratio further reveals discrepancies in model performance between the two quantities. These discrepancies cannot be compensated for solely by adjusting the smoke emissions and are more attributable to model representations of plume injection, along with other possible factors including the evolution of PBL depths and aerosol optical property assumptions. By consolidating multiple forecast systems, these results provide strategic insight into pathways to improve smoke forecasts.
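The limitation of the persistence assumption noted above can be illustrated with a minimal sketch. It is not the method of any of the evaluated systems: the function names (`persistence_forecast`, `relative_bias`) and the emission values are hypothetical and chosen only to show how carrying the latest observed daily total forward leads to underprediction when a fire is growing.

```python
def persistence_forecast(observed_daily_emissions, lead_days=1):
    """Repeat the most recent observed daily total for each forecast day."""
    if not observed_daily_emissions:
        raise ValueError("need at least one observed day")
    return [observed_daily_emissions[-1]] * lead_days


def relative_bias(forecast, actual):
    """Signed bias of the forecast relative to the actual value."""
    return (forecast - actual) / actual


# Hypothetical daily emission totals (arbitrary units) for a growing fire.
# These numbers are illustrative only and are not taken from the study.
observed = [10.0, 20.0]
actual_next_day = 40.0

forecast = persistence_forecast(observed)[0]
bias = relative_bias(forecast, actual_next_day)
print(f"forecast={forecast}, relative bias={bias:+.0%}")
```

In this made-up case the emissions double from one day to the next, so the persisted value is half of what actually occurs. A negative emission bias of this kind is the sort of error that would carry into the forecast of the transported plume on the following day.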