A judiciously normalized database schema can increase data interpretability, reduce data size, and improve data integrity. However, real world data sets are often stored or shared in a denormalized state. We examine the problem of automatically creating a good schema for a denormalized table, approaching it as an unsupervised machine learning problem which must learn an optimal schema from the data. This differs from past rule-based approaches that focus on normalization into a canonical form. We define a principled schema optimization criterion, based on Occam's razor, that is robust to noise and extensible---allowing users to easily specify desirable properties of the resulting schema. We develop an efficient learning algorithm for this criterion and empirically demonstrate that it produces higher quality schemas with 1/5th the errors and is 3 to 100 times faster than previous work.
Wednesday, September 15, 2021