A judiciously normalized database schema can increase data interpretability, reduce data size, and improve data integrity. However, real-world data sets are often stored or shared in a denormalized state. We examine the problem of automatically creating a good schema for a denormalized table, approaching it as an unsupervised machine learning problem in which an optimal schema must be learned from the data. This differs from past rule-based approaches that focus on normalization into a canonical form. We define a principled schema optimization criterion, based on Occam's razor, that is robust to noise and extensible, allowing users to easily specify desirable properties of the resulting schema. We develop an efficient learning algorithm for this criterion and empirically demonstrate that it is 3 to 100 times faster than previous work and produces higher-quality schemas with one fifth as many errors.
Sunday, June 12, 2022
SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA