Traditional (Non-Deep Learning) Data Preparation

My goal is to solve real-world data preparation problems. Hence, I have spent much of my time understanding what data errors look like in different real-world scenarios, by working with organizations such as MIT VPF, Merck, TAMR, Informatica, Intel, and many others.

Error Detection
We have collected many real-world data errors (see "Detecting Data Errors: Where are we and what needs to be done?" [PVLDB’16]) and devised methods to detect different types of data errors.

Trusted Data Repairing
In practice, users are hesitant to let their data be repaired automatically unless the repairs are guaranteed to be correct or are explainable. My work on trusted data repairing includes:

A Commodity Data Cleaning System
There was no commodity platform, comparable to a general-purpose DBMS, that could be easily customized and deployed to solve application-specific data quality problems. I led the NADEEF project, which provides a unified programming interface for declaratively specifying what data errors are and (possibly) how to fix them, and a core that holistically handles the detection and repair of data errors.
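The split between a rule interface and a shared core can be illustrated with a minimal sketch. This is not NADEEF's actual API: the names `Rule`, `Violation`, `Fix`, `FunctionalDependency`, and `clean` are hypothetical. The majority-vote repair is a simple stand-in heuristic, and the core below applies rules one after another rather than holistically.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Protocol

Row = Dict[str, str]


@dataclass(frozen=True)
class Violation:
    rule: str        # name of the rule that was violated
    row_index: int   # position of the offending row
    column: str      # column holding the suspicious value


@dataclass(frozen=True)
class Fix:
    row_index: int
    column: str
    new_value: str


class Rule(Protocol):
    """Hypothetical rule interface: what an error is, and how to fix it."""
    name: str

    def detect(self, rows: List[Row]) -> List[Violation]: ...

    def repair(self, rows: List[Row], violations: List[Violation]) -> List[Fix]: ...


class FunctionalDependency:
    """lhs -> rhs: rows that agree on lhs are expected to agree on rhs."""

    def __init__(self, lhs: str, rhs: str) -> None:
        self.lhs, self.rhs = lhs, rhs
        self.name = f"{lhs}->{rhs}"

    def _majority(self, rows: List[Row]) -> Dict[str, str]:
        # For each lhs value, find the most frequent rhs value.
        counts: Dict[str, Counter] = defaultdict(Counter)
        for row in rows:
            counts[row[self.lhs]][row[self.rhs]] += 1
        return {key: c.most_common(1)[0][0] for key, c in counts.items()}

    def detect(self, rows: List[Row]) -> List[Violation]:
        # Flag rows whose rhs value disagrees with the majority for their lhs.
        majority = self._majority(rows)
        return [
            Violation(self.name, i, self.rhs)
            for i, row in enumerate(rows)
            if row[self.rhs] != majority[row[self.lhs]]
        ]

    def repair(self, rows: List[Row], violations: List[Violation]) -> List[Fix]:
        # Assumed heuristic: suggest the majority value as the fix.
        majority = self._majority(rows)
        return [
            Fix(v.row_index, v.column, majority[rows[v.row_index][self.lhs]])
            for v in violations
        ]


def clean(rows: List[Row], rules: List[Rule]) -> List[Row]:
    """Toy core: run each rule's detection, then apply its fixes to a copy."""
    cleaned = [dict(r) for r in rows]
    for rule in rules:
        violations = rule.detect(cleaned)
        for fix in rule.repair(cleaned, violations):
            cleaned[fix.row_index][fix.column] = fix.new_value
    return cleaned


if __name__ == "__main__":
    data = [
        {"zip": "02139", "city": "Cambridge"},
        {"zip": "02139", "city": "Cambridge"},
        {"zip": "02139", "city": "Boston"},
        {"zip": "10001", "city": "New York"},
    ]
    print(clean(data, [FunctionalDependency("zip", "city")]))
```

In this sketch an application only has to write rule classes, while the detection and repair loop stays generic, which is the kind of separation the paragraph above describes.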

Data Preparation as a Service
In collaboration with MIT, we are building Data Civilizer [CIDR'17] [MIT News], a suite of prebuilt tools for solving end-to-end data preparation problems. While working on real-world scenarios, we have also developed the following new components.

Collaborative Data Preparation
In data prep, human-in-the-loop is often not enough; we need crowd-in-the-loop to collaboratively clean and annotate data. In collaboration with the University of Wisconsin-Madison, we are building such systems.