Snigdha Das

Manuscripts & Publications

Deep generative conditional density regression for inference in complex surveys

Snigdha Das, Dipankar Bandyopadhyay, Debdeep Pati

In preparation, 2026+

Complex survey data arise from unequal probability sampling designs, where naive application of methods that rely on independent and identically distributed assumptions can lead to biased estimation and invalid inference. Moreover, many public health datasets arising from large-scale surveys require modeling the entire conditional distribution of outcomes rather than just conditional means. Modern deep generative models, particularly generative adversarial networks (GANs), provide powerful tools for flexible high-dimensional distributional modeling, but remain largely unexplored in survey applications. In this work, we propose a survey-adjusted GAN-based conditional density regression framework grounded in the generative representation of conditional distributions via noise outsourcing. To account for complex survey designs, we incorporate random weights centered at the observed sampling weights into the GAN objective through a bootstrap-based resampling scheme. The resulting procedure enables efficient conditional sampling while simultaneously addressing sampling bias and model uncertainty. Our approach allows for high-dimensional covariates and multivariate responses, accommodates nonlinear and multimodal conditional structures, and naturally produces Monte Carlo estimates of conditional functionals such as means, quantiles, and prediction intervals. We establish theoretical guarantees showing that, under regularity conditions, the weighted conditional generator converges to the target design-adjusted conditional distribution. Through simulations and an application to motivating periodontal disease data from NHANES, we demonstrate improved distributional recovery and predictive uncertainty compared to existing conditional density estimators.
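The noise-outsourcing representation underlying this framework — writing Y | X = x as Y = G(x, Z) for independent noise Z — can be illustrated with a minimal stdlib Python sketch. The fixed bimodal map G below is a hypothetical stand-in for a trained conditional generator, not the paper's network; the point is only how Monte Carlo estimates of conditional functionals fall out of repeated generator draws.

```python
import random
import statistics

random.seed(1)

# Noise outsourcing: a conditional law Y | X = x is represented as
# Y = G(x, Z) with Z ~ N(0, 1). This toy G (hypothetical, bimodal) stands
# in for a trained conditional generator.
def G(x, z):
    # mixture component chosen by the sign of the noise draw
    return (x + 1.0 + 0.3 * z) if z > 0 else (-x - 1.0 + 0.3 * z)

def conditional_functionals(x, n_mc=5000):
    """Monte Carlo estimates of the conditional mean and a 90% prediction
    interval, obtained by pushing fresh noise through the generator."""
    ys = sorted(G(x, random.gauss(0.0, 1.0)) for _ in range(n_mc))
    mean = statistics.fmean(ys)
    lo, hi = ys[int(0.05 * n_mc)], ys[int(0.95 * n_mc)]
    return mean, (lo, hi)

mean, (lo, hi) = conditional_functionals(x=1.0)
```

No density evaluation is needed: means, quantiles, and prediction intervals all follow from sorting and averaging generator draws, which is what makes the conditional-sampling view attractive for multimodal targets like the one above.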

Scalable efficient inference in complex surveys through targeted resampling of weights

Snigdha Das, Dipankar Bandyopadhyay, Debdeep Pati

Submitted, 2025+

Survey data often arise from complex sampling designs, such as stratified or multistage sampling, with unequal inclusion probabilities. When sampling is informative, traditional inference methods yield biased estimators and poor coverage. Classical pseudo-likelihood based methods provide accurate asymptotic inference but lack finite-sample uncertainty quantification and the ability to integrate prior information. Existing Bayesian approaches, like the Bayesian pseudo-posterior estimator and weighted Bayesian bootstrap, have limitations: the former struggles with uncertainty quantification, while the latter is computationally intensive and sensitive to bootstrap replicates. To address these challenges, we propose the Survey-adjusted Weighted Likelihood Bootstrap (S-WLB), which resamples weights from a carefully chosen distribution centered around the underlying sampling weights. S-WLB is computationally efficient, theoretically consistent, and delivers finite-sample uncertainty intervals that are provably asymptotically valid. We demonstrate its performance through simulations and applications to nationally representative survey datasets like NHANES and NSDUH.
  • 2025 ASA SRMS/GSS/SSS Student Paper Award (Honorable Mention).
  • 2025 SETCASA Student Poster Award (Silver Prize).
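The core resampling idea can be sketched in a few lines of stdlib Python. Everything below is illustrative rather than the paper's specification: toy data, unit-mean Exponential perturbations as the resampling distribution centered at the sampling weights, and a weighted mean standing in for a full weighted pseudo-likelihood maximization.

```python
import random

random.seed(0)

# Hypothetical toy survey sample: outcomes y with known sampling weights w
# (inverse inclusion probabilities).
n = 500
y = [random.gauss(2.0, 1.0) for _ in range(n)]
w = [random.uniform(0.5, 2.0) for _ in range(n)]

def swlb_mean(y, w, n_boot=1000):
    """Sketch of the weighted-likelihood-bootstrap idea: each replicate
    multiplies the sampling weights by unit-mean random draws (here
    Exponential(1), an illustrative choice) and re-solves the weighted
    estimating equation -- for a normal mean, just a weighted average."""
    draws = []
    for _ in range(n_boot):
        u = [random.expovariate(1.0) for _ in range(len(y))]  # mean-1 weights
        wb = [wi * ui for wi, ui in zip(w, u)]                # centered at w
        draws.append(sum(wbi * yi for wbi, yi in zip(wb, y)) / sum(wb))
    return draws

draws = sorted(swlb_mean(y, w))
lo, hi = draws[25], draws[975]   # 95% finite-sample interval from replicates
```

Each replicate is an independent optimization, so the scheme parallelizes trivially; the spread of the replicate estimates supplies the finite-sample uncertainty intervals.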

A monotone single index model for spatially referenced multistate current status data

Snigdha Das, Minwoo Chae, Debdeep Pati, Dipankar Bandyopadhyay

Biometrics, 2025

Assessment of multistate disease progression is commonplace in biomedical research, such as in periodontal disease (PD). However, the presence of multistate current status endpoints, where only a single snapshot of each subject's progression through disease states is available at a random inspection time after a known starting state, complicates the inferential framework. In addition, these endpoints can be clustered and spatially associated: a group of proximally located teeth (within subjects) may experience similar PD status compared to those located distally. Motivated by a clinical study recording PD progression, we propose a Bayesian semiparametric accelerated failure time model with an inverse-Wishart proposal for accommodating (spatial) random effects, and flexible errors that follow a Dirichlet process mixture of Gaussians. For clinical interpretability, the systematic component of the event times is modeled using a monotone single index model, with the (unknown) link function estimated via a novel integrated basis expansion and basis coefficients endowed with constrained Gaussian process priors. In addition to establishing parameter identifiability, we present scalable computing via a combination of elliptical slice sampling, fast circulant embedding techniques, and smoothing of hard constraints, leading to straightforward estimation of parameters, and state occupation and transition probabilities. Using synthetic data, we study the finite sample properties of our Bayesian estimates and their performance under model misspecification. We also illustrate our method via an application to the real clinical PD dataset.

Blocked Gibbs sampler for hierarchical Dirichlet processes

Snigdha Das, Yabo Niu, Yang Ni, Bani K. Mallick, Debdeep Pati

Journal of Computational and Graphical Statistics, 2024

Posterior computation in hierarchical Dirichlet process (HDP) mixture models is an active area of research in nonparametric Bayesian inference for grouped data. Existing literature almost exclusively focuses on the Chinese restaurant franchise (CRF) analogy of the marginal distribution of the parameters, which can mix poorly and has quadratic complexity in the sample size. A recently developed slice sampler allows for efficient blocked updates of the parameters, but we show in this article that it is statistically unstable. We develop a blocked Gibbs sampler that employs a truncated approximation of the underlying random measures to sample from the posterior distribution of the HDP, which produces statistically stable results, is highly scalable with respect to sample size, and is shown to have good mixing. The heart of the construction is to endow the shared concentration parameter with an appropriately chosen gamma prior that allows us to break the dependence of the shared mixing proportions and permits independent updates of certain log-concave random variables in a block. En route, we develop an efficient rejection sampler for these random variables leveraging piecewise tangent-line approximations.
  • 2022 Joe Newton Poster Award (Bronze Prize), Conference on Advances in Data Science, TAMU.
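The truncated approximation at the heart of a blocked Gibbs sampler can be sketched via stick-breaking in stdlib Python. The truncation level and concentration parameter below are illustrative choices, and the sketch shows only the finite weight construction that such a sampler updates in one block, not the full HDP machinery.

```python
import random

random.seed(2)

def stick_breaking(alpha, truncation, rng=random):
    """Truncated stick-breaking construction of Dirichlet process weights:
    break Beta(1, alpha) fractions off the remaining stick, and let the
    final weight absorb whatever stick is left so the weights sum to one."""
    betas = [rng.betavariate(1.0, alpha) for _ in range(truncation - 1)]
    weights, remaining = [], 1.0
    for b in betas:
        weights.append(remaining * b)
        remaining *= (1.0 - b)
    weights.append(remaining)   # remainder closes the truncated measure
    return weights

weights = stick_breaking(alpha=2.0, truncation=25)
```

Working with a finite vector of weights like this, rather than the marginal CRF representation, is what permits blocked conjugate-style updates whose cost stays linear in the sample size.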