Deep Learning for Data Preparation

In this line of research, I explore the potential of deep learning for data preparation. My hope is that deep learning will revolutionize data preparation, much as it has already revolutionized fields such as NLP and image processing.
RPT: Relational Pre-trained Transformer Is Almost All You Need for Democratizing Data Preparation

My goal is to automate data preparation tasks that are easy for humans but hard for computers, and that currently rely heavily on data scientists, practitioners, and crowd workers. In our PVLDB'21 paper, I propose RPT (Relational Pre-trained Transformer), a GPT-like tool for relational data preparation. RPT is a denoising autoencoder for tuple-to-X models (where "X" can be a tuple, token, label, JSON, and so on) that supports a wide range of data preparation tasks, including data cleaning, auto-completion, schema matching, entity resolution, value normalization, data transformation, data annotation, and information extraction. RPT is pre-trained as a tuple-to-tuple model with fill-in-the-blank style denoising objectives, and can be fine-tuned for multiple data preparation tasks with different output formats (i.e., tuple-to-X).
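To make the pre-training objective concrete, here is a minimal sketch of fill-in-the-blank denoising over a relational tuple. The serialization format ("attr is value ;") and the mask rate are illustrative assumptions, not RPT's actual design:

```python
import random

def serialize(row):
    # Linearize a relational tuple into a token sequence.
    # The "attr is value ;" format is a hypothetical serialization for illustration.
    return [tok for attr, val in row.items() for tok in (attr, "is", str(val), ";")]

def corrupt(tokens, mask_rate=0.3, seed=0):
    # Fill-in-the-blank denoising: randomly replace tokens with [MASK].
    # The encoder sees the corrupted sequence; the decoder's target is the original.
    rng = random.Random(seed)
    masked = [t if rng.random() > mask_rate else "[MASK]" for t in tokens]
    return masked, tokens  # (encoder input, reconstruction target)

row = {"name": "Michael Stonebraker", "affiliation": "MIT"}
src, tgt = corrupt(serialize(row))
# A tuple-to-tuple model is trained to recover `tgt` from `src`.
```

The same corrupted-input/clean-target pairing carries over to fine-tuning, where the decoder target is swapped for a task-specific output (label, JSON, etc.).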
Deep Learning for Entity Resolution

Despite 70+ years of effort on all aspects of entity resolution, there is still high demand for democratizing it: reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions.
- Distributed representations of tuples for entity resolution (the DeepER PVLDB'18 paper and code): Building on recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present DeepER, a novel system that achieves good accuracy and high efficiency as well as ease of use (i.e., much less human effort).
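The core idea can be sketched as follows: represent each tuple by composing the word embeddings of its attribute values, then compare tuples with a vector similarity such as cosine. Averaging is one simple composition for this style of system; the toy 2-d vectors below stand in for real pre-trained embeddings (e.g., GloVe), which this sketch does not load:

```python
import math

# Toy 2-d word embeddings; a real system would use pre-trained vectors (assumption).
EMB = {
    "apple": [1.0, 0.0], "inc": [0.9, 0.1], "apple.": [1.0, 0.05],
    "microsoft": [0.0, 1.0], "corp": [0.1, 0.9],
}
DIM = 2

def tuple_embedding(tokens):
    # Compose a tuple representation by averaging its tokens' word embeddings;
    # out-of-vocabulary tokens fall back to the zero vector.
    vecs = [EMB.get(t.lower(), [0.0] * DIM) for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

t1 = tuple_embedding(["Apple", "Inc"])
t2 = tuple_embedding(["Apple.", "Inc"])   # dirty variant of the same entity
t3 = tuple_embedding(["Microsoft", "Corp"])
# cosine(t1, t2) comes out higher than cosine(t1, t3), so the dirty
# duplicate is recognized as a likely match without hand-crafted features.
```

Because similarity is computed in embedding space rather than on raw strings, small formatting differences ("Apple" vs. "Apple.") barely move the representation, which is what reduces the need for manual feature engineering.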