Spreadsheets are a critical and widely used data management tool. Converting spreadsheet data into relational tables would benefit a number of fields, including public policy, public health, and economics. Research to date has focused on designing domain-specific languages to describe transformation processes or on automatically converting one specific type of spreadsheet. To handle a larger variety of spreadsheets, we have to identify various spreadsheet properties, which correspond to a series of transformation programs that contribute towards a general framework for converting spreadsheets to relational tables. In this paper, we focus on the problem of spreadsheet property detection. We propose a hybrid approach to building a variety of spreadsheet property detectors that reduces the amount of required human labeling effort. Our approach integrates an active learning framework with crude, easy-to-write, user-provided rules to save human labeling effort by generating additional high-quality labeled data, especially in the initial training stage. Using a bagging-like technique, our approach can also tolerate lower-quality user-provided rules. Our experiments show that, compared to a standard active learning approach, our approach reduces the training data needed to reach the performance plateau by 34–44% when a human provides relatively high-quality rules, and by a comparable amount with low-quality rules. A study on a large-scale web-crawled spreadsheet dataset demonstrates that it is crucial to detect a variety of spreadsheet properties in order to transform a large portion of the spreadsheets into relational form.
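To make the idea of rule-assisted active learning concrete, the following is a minimal sketch of one training round, not the method evaluated in this paper. Everything in it is an assumption made for illustration: the nearest-centroid model, the margin-based query strategy, the convention that a rule returns a label or `None` to abstain, and the use of leave-one-out resamples as a deterministic stand-in for a bagging-like vote. The function names (`train`, `ensemble_vote`, `active_learning_round`) are hypothetical.

```python
from collections import Counter
from statistics import mean


def train(labeled):
    """Nearest-centroid model: one mean feature vector per class."""
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in groups.items()}


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def predict(model, features):
    """Label of the closest class centroid."""
    return min(model, key=lambda label: dist(model[label], features))


def margin(model, features):
    """Gap between the two nearest centroids; a small gap means low confidence."""
    d = sorted(dist(c, features) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def ensemble_vote(labeled, features):
    """Majority label over models trained on leave-one-out resamples.

    Assumption: this stands in for a bagging-like ensemble so the sketch
    stays deterministic; a real bagging setup would use bootstrap samples.
    """
    if len(labeled) < 2:
        return predict(train(labeled), features)
    votes = Counter()
    for i in range(len(labeled)):
        subset = labeled[:i] + labeled[i + 1:]
        votes[predict(train(subset), features)] += 1
    return votes.most_common(1)[0][0]


def active_learning_round(labeled, unlabeled, ask_human, rule):
    """One round: query the least certain item, then add vetted rule labels.

    `labeled` must be non-empty; both lists are modified in place.
    """
    model = train(labeled)
    query = min(unlabeled, key=lambda x: margin(model, x))
    unlabeled.remove(query)
    labeled.append((query, ask_human(query)))

    accepted = []
    for x in list(unlabeled):
        guess = rule(x)  # None means the crude rule abstains on this item
        # Keep a rule-generated label only if the ensemble agrees with it.
        if guess is not None and guess == ensemble_vote(labeled, x):
            accepted.append((x, guess))
            unlabeled.remove(x)
    labeled.extend(accepted)
    return query, accepted
```

In this sketch the ensemble vote acts as a filter: a crude rule may fire on many items, but only the labels that the resampled models agree with are added to the training set, which is one plausible way a low-quality rule could be tolerated.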
Monday, November 6, 2017
CIKM’17, November 6–10, 2017, Singapore, Singapore