Sampling is now a very important data management tool, to such an extent that an interface for database sampling is included in the latest SQL standard. In this article we reconsider in depth what at first may seem like a very simple problem—computing the error of a sampling-based guess for the answer to a GROUP BY query over a multitable join. The difficulty when sampling for the answer to such a query is that the same sample will be used to guess the result of the query for each group, which induces correlations among the estimates. Thus, from a statistical point-of-view it is very problematic and even dangerous to use traditional methods such as confidence intervals for communicating estimate accuracy to the user. We explore ways to address this problem, and pay particular attention to the computational aspects of computing “safe” confidence intervals.
Friday, August 1, 2008
ACM Transactions on Database Systems (TODS) Volume 33, issue 3, page 16