E = MC³: Managing Uncertain Enterprise Data in a Cluster-Computing Environment


Modern enterprises must manage uncertain data for purposes of risk assessment and decision making under uncertainty. The Monte Carlo approach embodied in the MCDB system of Jampani et al. is well suited for such a task. MCDB can support industrial-strength business-intelligence queries over uncertain warehouse data. Moreover, MCDB’s extensible approach to specifying uncertainty can also capture complex stochastic prediction models, allowing sophisticated “what-if” analyses within the DBMS. The MCDB computations can be highly CPU-intensive, but they offer the potential for massive parallelization. To realize this potential, we provide a new system, called MC³ (Monte Carlo Computation on a Cluster), that extends the MCDB approach to the map-reduce processing framework. MC³ can exploit the robustness and scalability of map-reduce, and can handle data stored in non-relational formats. We show how MCDB query plans over “tuple bundles” can be translated to sequences of map-reduce operations over nested data, and describe different parallelization schemes. We also provide and analyze several novel distributed algorithms for adding pseudorandom number seeds to tuple bundles. These algorithms ensure statistical correctness of the Monte Carlo computations while minimizing the seed length. Our experiments show that MC³ can scale well for a variety of workloads.
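
The idea of carrying a pseudorandom seed in each tuple bundle, rather than the sampled values themselves, can be illustrated with a small sketch. The code below is not the MC³ implementation and does not reproduce the seeding algorithms analyzed in the paper. The names `TupleBundle`, `attach_seeds`, and `total_per_repetition` are hypothetical, the normal distribution is an arbitrary choice of uncertainty model, and the seeding rule (`base_seed + index`) is a naive placeholder that makes no claim of statistical independence between streams.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleBundle:
    """One record whose uncertain attribute is represented by a seed.

    Hypothetical structure: deterministic fields plus the parameters of
    an assumed normal distribution and a pseudorandom seed.
    """
    key: str
    mean: float
    stddev: float
    seed: int

    def instantiate(self, n: int) -> list[float]:
        # Regenerate the n Monte Carlo realizations from the seed alone,
        # so any worker holding the bundle produces identical values.
        rng = random.Random(self.seed)
        return [rng.gauss(self.mean, self.stddev) for _ in range(n)]


def attach_seeds(records, base_seed: int) -> list[TupleBundle]:
    """Map-style step: give every record its own seed.

    Placeholder rule (an assumption, not the paper's algorithm):
    seed = base_seed + position of the record in the input.
    """
    return [
        TupleBundle(key, mean, stddev, base_seed + i)
        for i, (key, mean, stddev) in enumerate(records)
    ]


def total_per_repetition(bundles, n: int) -> list[float]:
    """Reduce-style step: sum the uncertain attribute across bundles,
    separately for each of the n Monte Carlo repetitions."""
    totals = [0.0] * n
    for bundle in bundles:
        for i, value in enumerate(bundle.instantiate(n)):
            totals[i] += value
    return totals


# Made-up input: (key, mean, stddev) for an uncertain revenue attribute.
records = [("east", 100.0, 10.0), ("west", 80.0, 5.0), ("north", 60.0, 8.0)]
bundles = attach_seeds(records, base_seed=2009)
totals = total_per_repetition(bundles, n=1000)
```

Each entry of `totals` is one sample of the query answer, so the list as a whole approximates its distribution. Because a bundle stores only the seed, it stays small when shipped between map and reduce stages, which is one reason the seed length matters.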

Fei Xu, Kevin Beyer, Vuk Ercegovac, Peter J. Haas, Eugene J. Shekita
Publication Date: 
Monday, June 29, 2009
Publication Information: 
Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (June 29–July 2, 2009, Providence, Rhode Island)