Traditional (Non-Deep Learning) Data Preparation
My goal is to solve real-world data preparation problems. Hence, much of my time has been spent understanding what data errors look like in different real-world scenarios, through interactions with organizations such as MIT VPF, Merck, TAMR, Informatica, Intel, and many others.
Error Detection
We have collected many real-world data errors ("Detecting Data Errors: Where are we and what needs to be done?" [PVLDB'16]) and devised methods to detect different types of data errors:
- Raha: A Configuration-Free Error Detection System [SIGMOD’19] [code]
- Discovering Mis-Categorized Entities [ICDE’18] [a Chrome add-on for cleaning your Google Scholar entries]
- FAHES: A Robust Disguised Missing Values Detector [KDD’18] [code]
- UGuide – User-Guided Discovery of FD-Detectable Errors [SIGMOD’17]
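To make the idea of FD-detectable errors concrete, here is a minimal sketch (not the algorithm from any of the papers above) that flags rows violating a functional dependency such as zip → city: two rows that agree on the left-hand side but disagree on the right-hand side must contain an error, although the check alone cannot say which row is wrong.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return indices of rows that violate the FD lhs -> rhs.

    A violation exists when two rows agree on `lhs` but differ on `rhs`.
    """
    values = defaultdict(set)    # lhs value -> set of rhs values seen
    members = defaultdict(list)  # lhs value -> row indices
    for i, row in enumerate(rows):
        values[row[lhs]].add(row[rhs])
        members[row[lhs]].append(i)
    bad = []
    for key, seen in values.items():
        if len(seen) > 1:  # conflicting rhs values for the same lhs
            bad.extend(members[key])
    return sorted(bad)

rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},    # conflicts with row 0
    {"zip": "10001", "city": "New York"},
]
print(fd_violations(rows, "zip", "city"))  # -> [0, 1]
```

Note that the FD only localizes the conflict; deciding which of the two conflicting rows to trust is exactly the user-guidance problem UGuide addresses.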
Trusted Data Repairing
In practice, users are hesitant to have their data automatically repaired unless the repairs are guaranteed to be correct or explainable. My work on trusted data repairing includes:
- Rule-based methods:
- Using master data:
- Using knowledge bases:
- Interactive and Deterministic Data Cleaning: A Tossed Stone Raises a Thousand Ripples [SIGMOD'16]
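As an illustration of repairing with master data, the sketch below (a simplification, not the method from the works above) overwrites dirty attribute values with trusted values from a master table, and logs every change so each repair remains explainable to the user:

```python
def repair_with_master(rows, master, key, attr):
    """Overwrite `attr` with the trusted value from `master` (keyed on `key`).

    Returns a log of (row index, column, old value, new value) tuples,
    so every repair is traceable back to the master record that caused it.
    """
    trusted = {m[key]: m[attr] for m in master}
    log = []
    for i, row in enumerate(rows):
        want = trusted.get(row[key])
        if want is not None and row[attr] != want:
            log.append((i, attr, row[attr], want))
            row[attr] = want
    return log

rows = [
    {"zip": "02139", "city": "Cambrige"},   # typo to be repaired
    {"zip": "10001", "city": "New York"},
]
master = [{"zip": "02139", "city": "Cambridge"}]
log = repair_with_master(rows, master, "zip", "city")
print(log)  # -> [(0, 'city', 'Cambrige', 'Cambridge')]
```

The explicit repair log is the point: a user can review (or veto) each change rather than trusting a black-box repair.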
A Commodity Data Cleaning System
There was no commodity platform, similar to general-purpose DBMSs, that could be easily customized and deployed to solve application-specific data quality problems. I led the NADEEF project, which provides a unified programming interface for declaratively specifying what data errors are and (possibly) how to fix them, and a core that holistically handles the detection and repair of data errors.
- NADEEF: A Commodity Data Cleaning System [SIGMOD'13] [code]
- BigDansing: A System for Big Data Cleansing [SIGMOD'15]
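To give a flavor of a declarative detect-and-fix interface, here is a minimal sketch of the idea (illustrative only; this is not NADEEF's actual API): each rule declares how to detect a violation and, optionally, how to repair it, and a generic core applies all rules uniformly.

```python
class Rule:
    """A declarative cleaning rule: `detect` flags violating columns in a
    row, `fix` proposes repaired values. (Hypothetical interface.)"""
    def detect(self, row):
        raise NotImplementedError
    def fix(self, row):
        raise NotImplementedError

class NonNegativeSalary(Rule):
    """Example rule: salaries must not be negative."""
    def detect(self, row):
        return ["salary"] if row["salary"] < 0 else []
    def fix(self, row):
        return {"salary": abs(row["salary"])}

def clean(rows, rules):
    """Generic core: apply every rule's fix wherever it detects an error."""
    for row in rows:
        for rule in rules:
            if rule.detect(row):
                row.update(rule.fix(row))
    return rows

data = [{"name": "a", "salary": -100}, {"name": "b", "salary": 50}]
clean(data, [NonNegativeSalary()])
print(data[0]["salary"])  # -> 100
```

The separation matters: rule authors specify *what* is wrong (and optionally how to fix it), while the core owns *how* detection and repair are executed, which is what makes the platform customizable for different applications.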
Data Preparation as a Service
In collaboration with MIT, we are building Data Civilizer [CIDR'17] [MIT News], a suite of prebuilt tools for solving end-to-end data preparation problems. While working with real-world scenarios, we have also developed the following new components.
- Data discovery: we solved the problem of linking datasets in a data lake for data discovery, driven by real-world scenarios from Merck and Scotiabank.
- Seeping Semantics: Linking Datasets using Word Embeddings for Data Discovery [ICDE'18]
- Interpretable entity resolution: we use programming-by-example synthesis to discover rules for entity matching and entity consolidation, driven by TAMR use cases in which customers require their entity resolution solutions to be interpretable.
- Relational table storage and query co-optimization: we develop deductive program synthesis algorithms that co-optimize data storage and query plans, improving the efficiency of predefined workflows with relatively static data and parameterized queries.
- Data debugging: we study the problem of debugging data (not code) in data science pipelines, working with Massachusetts General Hospital (MGH), the Intel MIT lab, and All Chicago (an organization that helps homeless people in Chicago).
- Dagger: A Data (not code) Debugger [CIDR'20]
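The core idea behind embedding-based data discovery can be sketched as follows: columns from different datasets are linked when their name (or value) embeddings are close in vector space. This toy version (not the Seeping Semantics implementation) hard-codes tiny stand-in vectors in place of real word embeddings such as word2vec or fastText, and uses cosine similarity as the linking signal:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy stand-ins for real word embeddings (hypothetical values).
emb = {
    "employee": [0.90, 0.10, 0.00],
    "staff":    [0.85, 0.15, 0.05],
    "price":    [0.00, 0.90, 0.40],
}

def link_columns(cols_a, cols_b, threshold=0.95):
    """Pair columns from two datasets whose embeddings are close."""
    links = []
    for a in cols_a:
        for b in cols_b:
            if a in emb and b in emb and cosine(emb[a], emb[b]) >= threshold:
                links.append((a, b))
    return links

print(link_columns(["employee", "price"], ["staff"]))
# -> [('employee', 'staff')]
```

The payoff over exact string matching is that semantically related column names ("employee" vs. "staff") are linked even when they share no characters.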
Collaborative Data Preparation
In data preparation, a single human-in-the-loop is often not enough: we need a crowd-in-the-loop to collaboratively clean and annotate data. In collaboration with the University of Wisconsin-Madison, we are building such systems.
- CoClean: We built an Overleaf-like platform on top of Python Pandas DataFrames that enables multiple users to collaboratively clean the same dataset; it handles user synchronization and annotation aggregation, and allows customization.
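One simple way to aggregate annotations from multiple users, sketched below under the assumption of a majority-vote policy (CoClean may use a different or configurable strategy), is to resolve each cell to the value proposed most often:

```python
from collections import Counter

def aggregate_annotations(annotations):
    """Resolve each cell by majority vote over users' proposed values.

    `annotations` maps a cell identifier to the list of values that
    different users proposed for that cell.
    """
    resolved = {}
    for cell, values in annotations.items():
        winner, _count = Counter(values).most_common(1)[0]
        resolved[cell] = winner
    return resolved

votes = {
    ("row3", "city"): ["Boston", "Boston", "Bston"],  # one user typo, outvoted
    ("row7", "zip"):  ["02139"],
}
print(aggregate_annotations(votes))
# -> {('row3', 'city'): 'Boston', ('row7', 'zip'): '02139'}
```

In a real collaborative setting the aggregation policy is one of the natural customization points, e.g. weighting votes by each user's track record instead of counting them equally.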