Self-automatic Data Visualization for Interpretable Data Science

My initial goal was to help users understand the tables discovered from a data lake beyond eyeballing them, in particular for the Data Civilizer project. I started exploring (automatic) data visualization in 2018, in collaboration with Tsinghua University. The many interactive data visualization tools available (e.g., Tableau and D3) mostly help experts create good visualizations. Non-experts have few good options among visualization recommendation systems, which aim to let anyone create good visualizations automatically, or as simply as running a Google search.

For a general introduction to data visualization for data preparation, please check out:

Visualization Recommendation
DeepEye is among the first ML-based visualization recommendation systems. It tackles two problems:
  1. Visualization recognition: deciding whether a visualization of a given dataset is interesting, based on an understanding of human perception; and
  2. Visualization ranking: given two visualizations, deciding which one is “better”.
DeepEye tackles (1) by training binary classifiers (decision trees and SVMs) to decide whether a particular visualization is good, and (2) by using a supervised learning-to-rank model to rank the good visualizations.
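The two-stage idea (filter with a binary classifier, then rank the survivors) can be sketched in a few lines of plain Python. This is only an illustration, not DeepEye's implementation: the two features (a "clutter" score and an absolute correlation), the training data, and all function names are made up, a one-level decision tree (a decision stump) stands in for the classifiers, and a pairwise perceptron stands in for the learning-to-rank model.

```python
from itertools import combinations

# Hypothetical features per candidate visualization: (clutter, |correlation|).
# All data below is invented for illustration.
TRAIN_X = [(0.05, 0.9), (0.10, 0.7), (0.08, 0.8),
           (0.90, 0.1), (0.80, 0.2), (0.95, 0.3)]
TRAIN_GOOD = [1, 1, 1, 0, 0, 0]   # 1 = judged a good visualization
GOOD_RELEVANCE = [3, 1, 2]        # graded relevance of the three good examples


def train_stump(X, y):
    """Fit a one-level decision tree: the (feature, threshold, polarity) with fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                preds = [1 if polarity * (row[f] - t) >= 0 else 0 for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    _, f, t, polarity = best
    return lambda row: 1 if polarity * (row[f] - t) >= 0 else 0


def train_pairwise_ranker(X, relevance, epochs=50, lr=0.1):
    """Learn weights w so that w.a > w.b whenever a is more relevant than b."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for i, j in combinations(range(len(X)), 2):
            if relevance[i] == relevance[j]:
                continue
            hi, lo = (i, j) if relevance[i] > relevance[j] else (j, i)
            diff = [a - b for a, b in zip(X[hi], X[lo])]
            if sum(wk * dk for wk, dk in zip(w, diff)) <= 0:  # pair is misordered
                w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return lambda row: sum(wk * xk for wk, xk in zip(w, row))


def recommend(candidates, is_good, score):
    """Stage 1: keep candidates classified as good. Stage 2: sort them by ranker score."""
    kept = [c for c in candidates if is_good(c) == 1]
    return sorted(kept, key=score, reverse=True)


is_good = train_stump(TRAIN_X, TRAIN_GOOD)
score = train_pairwise_ranker(TRAIN_X[:3], GOOD_RELEVANCE)

CANDIDATES = [(0.07, 0.85), (0.92, 0.95), (0.09, 0.6), (0.04, 0.95)]
for features in recommend(CANDIDATES, is_good, score):
    print(features)
```

On this toy data the cluttered candidate is dropped by the classifier and the remaining three are ordered by the learned scoring function. Keeping the two stages separate means the ranker only ever has to compare visualizations that already passed the "good" test.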

The initial DeepEye paper appeared at ICDE'18. You can read more in our SIGMOD'18 demo paper and try an online demo. The code for DeepEye-APIs is available. We have also applied DeepEye to COVID-19 data analysis (here), along with a paper at IEEE Data Bulletin'20.

Natural Language to Visualization
A common concern about visualization recommendation systems is that they may recommend misleading visualizations, which can be worse than recommending nothing, simply because it is hard to guess a user’s query intent. We extended DeepEye at EDBT'18 to support Google-like keyword search, such as “show me the trend of flight delays”, using NLP semantic parsers.
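To make the keyword-search idea concrete, here is a toy sketch of turning such a query into a chart specification. It is not the semantic parser used in the paper: the keyword table, the `parse_query` function, the column names, and the output dictionary format are all hypothetical, and a real parser would handle far more linguistic variation.

```python
# Hypothetical mapping from intent keywords to chart types (illustrative only).
CHART_KEYWORDS = {
    "trend": "line",
    "over time": "line",
    "distribution": "histogram",
    "proportion": "pie",
    "compare": "bar",
    "correlation": "scatter",
}


def parse_query(query, columns, time_column=None):
    """Map a keyword query to a small chart spec (a dict), or None if nothing matches."""
    text = query.lower()
    # First intent keyword found in the query decides the chart type.
    chart = next((c for k, c in CHART_KEYWORDS.items() if k in text), None)
    # A column is "mentioned" if every word of its name occurs in the query.
    mentioned = [col for col in columns
                 if all(word in text for word in col.lower().split("_"))]
    if chart is None or not mentioned:
        return None
    spec = {"chart": chart, "y": mentioned[0]}
    if chart == "line" and time_column is not None:
        spec["x"] = time_column  # trends are plotted against time
    return spec


spec = parse_query("show me the trend of flight delays",
                   ["carrier", "flight_delay", "date"],
                   time_column="date")
print(spec)
```

Returning `None` when no intent keyword or column matches reflects the concern above: showing nothing is safer than guessing a visualization the user did not ask for.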

State-of-the-art natural language techniques are based on deep learning, but a big obstacle to advancing the field of NL2VIS is the lack of benchmarks. We proposed the first NL2VIS benchmark, called nvBench, at SIGMOD'21. Based on this benchmark, we further proposed a Transformer-based sequence-to-sequence model that translates natural language queries into target visualizations (VIS'21).

COVID-19 Dashboards
I have worked on a few COVID-19 dashboards.