Subsampling under distributional constraints

Abstract

Complex models are frequently employed to describe physical and mechanical phenomena. In this setting, we have an input $X$ in a general space and an output $Y = f(X)$, where $f$ is a very complicated function that is very costly to evaluate at every new input. We are given two sets of observations of $X$, $S_1$ and $S_2$, of different sizes, such that only $f(S_1)$ is available. We tackle the problem of selecting a subset $S_3 \subset S_2$ of smaller size on which to run the complex model $f$, such that the empirical distribution of $f(S_3)$ is close to that of $f(S_1)$. We suggest three algorithms to solve this problem and show their efficiency using simulated datasets and the Airfoil self-noise data set.
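To make the problem statement concrete, the following is a minimal illustrative sketch of one naive baseline. It is not one of the three algorithms proposed here. It assumes Euclidean inputs, a 1-nearest-neighbour surrogate for $f$ built from $(S_1, f(S_1))$, and greedy matching of surrogate predictions on $S_2$ to evenly spaced quantiles of $f(S_1)$. The function name `select_subset` and all of its parameters are hypothetical.

```python
import numpy as np


def select_subset(x1, y1, x2, k):
    """Return k distinct row indices of x2 (a candidate S3).

    Illustrative baseline only (an assumption, not the paper's method):
    x1: array (n1, d), inputs S1; y1: array (n1,), known outputs f(S1);
    x2: array (n2, d), inputs S2 with unknown outputs; requires k <= n2.
    """
    # Cheap surrogate for f: copy the output of the nearest point in S1.
    dists = np.linalg.norm(x2[:, None, :] - x1[None, :, :], axis=2)
    y2_hat = y1[np.argmin(dists, axis=1)]

    # Target values: k evenly spaced quantiles of the observed outputs f(S1).
    targets = np.quantile(y1, (np.arange(k) + 0.5) / k)

    # Greedily assign to each target the unused point of S2 whose
    # surrogate prediction is closest to it.
    available = np.ones(len(x2), dtype=bool)
    chosen = []
    for t in targets:
        cost = np.where(available, np.abs(y2_hat - t), np.inf)
        j = int(np.argmin(cost))
        chosen.append(j)
        available[j] = False
    return np.array(chosen)
```

The expensive model $f$ would then be run only on the rows of $S_2$ indexed by the returned array. How close the resulting empirical distribution of $f(S_3)$ is to that of $f(S_1)$ depends entirely on the quality of the surrogate, which this sketch does not address.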