An Overview of Data Preparation

When business people boast about the insights they obtain from big data using machine learning, data mining, or data visualization techniques, little attention is paid to the costly and painful process of preparing the data that these analytics rely on. In fact, data scientists routinely report that the majority (at least 80%) of their effort is spent finding, cleaning, integrating, and accessing the data of interest for the task at hand.

Generally speaking, data preparation consists of four (iterative) steps:

  • Data Discovery finds all relevant data sources and decides which ones to use.
  • Data Stitching stitches together (e.g., joins or unions) different pieces of information.
  • Data Integration puts all relevant datasets together and removes duplicated information.
  • Data Cleaning cleans the (integrated) data, for example through string normalization and missing value imputation.
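
To make the four steps concrete, here is a minimal Python sketch that runs them once over a toy in-memory catalog. Everything in it is an assumption made for illustration: the source names (crm, web_signups, orders, server_logs), the records, the helper functions, and the simple choices of "keep the first record per id" for duplicate removal and "fill with the mean" for imputation. Real pipelines are iterative and use far more sophisticated techniques at each step.

```python
from statistics import mean

# Hypothetical toy catalog; every source name and record is invented.
catalog = {
    "crm": [
        {"id": 1, "name": " Alice SMITH ", "age": 34},
        {"id": 2, "name": "bob jones", "age": None},
    ],
    "web_signups": [
        {"id": 2, "name": "Bob Jones", "age": None},
        {"id": 3, "name": "carol  white", "age": 41},
    ],
    "orders": [
        {"id": 1, "total": 20.0},
        {"id": 3, "total": 35.5},
    ],
    "server_logs": [{"ts": "2020-01-01", "msg": "ok"}],
}


def discover(catalog, required_key):
    """Data discovery: keep sources whose rows all carry the required key."""
    return [name for name, rows in catalog.items()
            if rows and all(required_key in row for row in rows)]


def union(*tables):
    """Data stitching (union): append the rows of several tables."""
    return [dict(row) for table in tables for row in table]


def left_join(left, right, key):
    """Data stitching (join): attach matching right-hand fields to each left row."""
    index = {row[key]: row for row in right}
    return [{**row, **index.get(row[key], {})} for row in left]


def deduplicate(rows, key):
    """Data integration: keep the first row seen for each key value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


def normalize_name(text):
    """Data cleaning: collapse whitespace and use title case."""
    return " ".join(text.split()).title()


def impute_mean(rows, field):
    """Data cleaning: replace missing values of a field with the mean of known ones."""
    fill = mean(row[field] for row in rows if row.get(field) is not None)
    return [{**row, field: fill if row.get(field) is None else row[field]}
            for row in rows]


# One pass through the four steps.
sources = discover(catalog, "id")                            # discovery
people = union(catalog["crm"], catalog["web_signups"])       # stitching (union)
stitched = left_join(people, catalog["orders"], "id")        # stitching (join)
integrated = deduplicate(stitched, "id")                     # integration
cleaned = impute_mean(
    [{**row, "name": normalize_name(row["name"])} for row in integrated],
    "age",
)                                                            # cleaning

for row in cleaned:
    print(row)
```

In this sketch the server_logs source is dropped during discovery because its rows have no id field, and the two records for id 2 collapse into one during integration. In practice the steps feed back into each other: for instance, cleaning names first can expose duplicates that an exact key match would miss.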

Please see my publication page for my work on data preparation.