Research: Towards Data Democratization

Data preparation, the process of turning big data into good data, is a crucial step in data science and machine learning (ML). An oft-cited statistic is that data scientists spend at least 80% of their time on data preparation before any downstream data science or ML task. This work includes discovering data sets in large repositories such as data warehouses and data lakes, integrating the discovered data sets from multiple sources into a single unified data set, enriching a data set with other data sets, cleaning the data by correcting erroneous values or imputing missing values, and transforming the data into a uniform representation.

Good result = Good tools + Good data. On the one hand, machine learning and data science tools are (almost) becoming a commodity. On the other hand, real-life data sets are becoming bigger and messier. Consequently, the main bottleneck for practitioners is obtaining good data from these bigger and messier data sets.

Towards data democratization. My research goal is data democratization: enabling anyone to access the good data required for any data science or machine learning task.

My studies on data preparation can be categorized into four main themes: