Nan Tang

汤南

Qatar Computing Research Institute
Email: ntang@hbku.edu.qa

About Me

Short Bio

I have been a scientist at the Qatar Computing Research Institute since 2011. Before that, I was a research fellow at the University of Edinburgh (2010-2011) and a scientific staff member at CWI, the Dutch National Research Center for Mathematics and Computer Science (2008-2010). I received my PhD from The Chinese University of Hong Kong in 2007.

Research Directions

Data Preparation: From Big Data to Good Data

An oft-cited statistic is that data scientists spend at least 80% of their time on data preparation. This includes discovering data sets in large data repositories such as data warehouses and data lakes, integrating the discovered data sets from multiple sources into a single, unified data set, enriching a data set with other data sets, cleaning the data set by correcting erroneous values or imputing missing values, and transforming the data into a uniform representation.

Data-centric AI: From Absolute Good Data to Relative Good Data

AI = Data + Models. As machine learning and deep learning models have matured, models have become commodities for most practitioners, and their quality depends heavily on the data they learn from. Traditional data preparation tasks mainly aim to prepare absolute good data, that is, data that is good with regard to the "ground truth". However, absolute good data is not enough for AI: for example, the training data and test data may have different distributions, or the training data may be good but insufficient to train a robust model. For AI, what is more desirable is relative good data, that is, data that is good with regard to the test data.

Self-automatic Data Visualization: From Good Data to Good Insights

Despite the overwhelming choice of interactive data visualization tools (e.g., Tableau and D3), only experts can use them to create good visualizations. Non-experts have few good options among visualization recommendation systems. My goal is to allow anyone to easily create good visualizations, so as to get good insights from good data.

News

ACM SIGMOD International Conference on Management of Data (SIGMOD 2023)
(3 research papers, 1 tutorial; acceptance rate: 100%)

Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
(Accept, Accept, Accept = Direct Accept without Revision!!!)

HybridPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation

Learned Data-aware Image Representations of Line Charts for Similarity Search

Demystifying Artificial Intelligence for Data Preparation (Tutorial)

International Conference on Very Large Data Bases (VLDB 2023)
(2 research papers out of 2 submissions; acceptance rate: 100%)

Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks

Projects

Data Preparation: From Big Data to Good Data

Foundation Models

Demystifying foundation models for data preparation

Table Learning via Language Models

Learning table/tuple representations for data preparation

Generic Data Matching

Cross-modality data matching over tuple/text/graph/*

Data-centric AI: From Absolute Good Data to Relative Good Data

Data acquisition

Select good data points from existing data repositories

Data augmentation

Synthesize training data for robust and generalizable ML

Feature augmentation

Enrich training data with more features via table joins

Coreset selection

Train ML models more efficiently by selecting a subset of the training data

Data debugging

Detect and fix erroneous features in the training data, with a human in the loop

Label debugging

Detect and fix erroneous labels in the training data, with a human in the loop

Self-automatic Data Visualization: From Good Data to Good Insights

Recommendation

Automatically compute top-k ranked visualizations

NL2VIS

Translate natural language queries into data visualizations

VIS2Vec

Encode visualizations into vector representations

DBLP Google Scholar

Publications (2019-)

AI for Data Preparation

Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes
CIDR 2023 The Conference on Innovative Data Systems Research
Zui Chen, Zihui Gu, Lei Cao, Ju Fan, Samuel Madden, Nan Tang

Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
SIGMOD 2023 ACM SIGMOD Conference on Management of Data
Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Guoliang Li, Xiaoyong Du, Jia Xiaofeng, Song Gao

Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks [pdf]
VLDB 2023 The 49th International Conference on Very Large Data Bases
Jinfeng Peng, Derong Shen, Nan Tang, Tieying Liu, Yue Kou, Tiezheng Nie, Hang Cui, Ge Yu

PASTA: Table-Operations Aware Fact Verification via Sentence-Table Cloze Pre-training [pdf]
EMNLP 2022 The 2022 Conference on Empirical Methods in Natural Language Processing
Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, and Xiaoyong Du

Domain Adaptation for Deep Entity Resolution [pdf]
SIGMOD 2022 ACM SIGMOD Conference on Management of Data
Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Chengliang Chai, Guoliang Li, Ruixue Fan, Xiaoyong Du

DADER: Hands-Off Entity Resolution with Domain Adaptation [pdf]
VLDB 2022 (demo) The 48th International Conference on Very Large Data Bases
Jianhong Tu, Xiaoyue Han, Ju Fan, Nan Tang, Chengliang Chai, Guoliang Li, Xiaoyong Du

Synthesizing Privacy Preserving Entity Resolution Datasets [pdf]
ICDE 2022 The 38th IEEE International Conference on Data Engineering
Xuedi Qin, Chengliang Chai, Nan Tang, Jian Li, Yuyu Luo, Guoliang Li, Yaoyu Zhu

RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation [pdf]
VLDB 2021 The 47th International Conference on Very Large Data Bases
Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, Mourad Ouzzani

Deep Learning for Blocking in Entity Matching: A Design Space Exploration [pdf]
VLDB 2021 The 47th International Conference on Very Large Data Bases
Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, AnHai Doan

Data Curation with Deep Learning [Vision] [pdf]
EDBT 2020 The 23rd International Conference on Extending Database Technology
Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, AnHai Doan

Raha: A Configuration-Free Error Detection System [pdf]
SIGMOD 2019 ACM SIGMOD Conference on Management of Data
Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

Explaining Entity Resolution Predictions: Where are we and What needs to be done? [pdf]
HILDA 2019 Workshop on Human-In-the-Loop Data Analytics (Co-located with SIGMOD)
Saravanan Thirumuruganathan, Mourad Ouzzani, Nan Tang

Unsupervised String Transformation Learning for Entity Consolidation [pdf]
ICDE 2019 The 35th IEEE International Conference on Data Engineering
Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

Data Preparation Theory and Systems

Mis-Categorized Entities Detection [pdf]
VLDBJ 2021 The VLDB Journal
Shuang Hao, Nan Tang, Guoliang Li, Jianhua Feng, Ning Wang

Pattern Functional Dependencies for Data Cleaning [pdf]
VLDB 2020 The 46th International Conference on Very Large Data Bases
Abdulhakim Qahtan, Nan Tang, Mourad Ouzzani, Yang Cao, Michael Stonebraker

CoClean: Collaborative Data Cleaning [pdf] [code]
SIGMOD 2020 (demo) ACM SIGMOD Conference on Management of Data
Mashaal Musleh, Mourad Ouzzani, Nan Tang, AnHai Doan

Outdated Fact Detection in Knowledge Bases
ICDE 2020 (short paper) The 36th IEEE International Conference on Data Engineering
Shuang Hao, Chengliang Chai, Guoliang Li, Nan Tang, Ning Wang, Xiang Yu

Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics [pdf]
VLDB 2019 (demo) The 45th International Conference on Very Large Data Bases
El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, Ahmed K. Elmagarmid

ANMAT: Automatic Knowledge Discovery and Error Detection through Pattern Functional Dependencies [pdf]
SIGMOD 2019 (demo) ACM SIGMOD Conference on Management of Data
Abdulhakim Qahtan, Nan Tang, Mourad Ouzzani, Yang Cao, Michael Stonebraker

Data Civilizer: End-to-End Support for Data Discovery, Integration, and Cleaning [Book Chapter] [link]
Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker
Mourad Ouzzani, Nan Tang, Raul Castro Fernandez

Data-centric AI

HybridPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation
SIGMOD 2023 ACM SIGMOD Conference on Management of Data
Sibei Chen, Nan Tang, Ju Fan, Xuemi Yan, Chengliang Chai, Guoliang Li, Xiaoyong Du

Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning [pdf]
VLDB 2023 The 49th International Conference on Very Large Data Bases
Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, Guoliang Li

Selective Data Acquisition in the Wild for Model Charging [pdf]
VLDB 2022 The 48th International Conference on Very Large Data Bases
Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, Yuyu Luo

Feature Augmentation with Reinforcement Learning [pdf]
ICDE 2022 (industrial track) The 38th IEEE International Conference on Data Engineering
Jiabin Liu, Chengliang Chai, Yuyu Luo, Jianhua Feng, Lou Yin, Nan Tang

Adaptive Data Augmentation for Supervised Learning over Missing Data [pdf] [code]
VLDB 2021 The 47th International Conference on Very Large Data Bases
Tongyu Liu, Ju Fan, Yinqing Luo, Nan Tang, Guoliang Li, Xiaoyong Du

Automatic Data Acquisition for Deep Learning [pdf]
VLDB 2021 (demo) The 47th International Conference on Very Large Data Bases
Jiabin Liu, Fu Zhu, Chengliang Chai, Yuyu Luo, Nan Tang

Debugging Large-Scale Data Science Pipelines using Dagger [pdf]
VLDB 2020 (demo) The 46th International Conference on Very Large Data Bases
El Kindi Rezig, Ashrita Brahmaroutu, Nesime Tatbul, Mourad Ouzzani, Nan Tang, Timothy Mattson, Samuel Madden, Michael Stonebraker

Dagger: A Data (not code) Debugger [pdf]
CIDR 2020 The Conference on Innovative Data Systems Research
El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Mourad Ouzzani, Nan Tang, Michael Stonebraker

Self-automatic Data Visualization

Learned Data-aware Image Representations of Line Charts for Similarity Search
SIGMOD 2023 ACM SIGMOD Conference on Management of Data
Yuyu Luo, Yihui Zhou, Nan Tang, Guoliang Li, Chengliang Chai, Leixian Shen

Natural Language to Visualization by Neural Machine Translation [pdf]
VIS 2021 The IEEE Visualization Conference
Yuyu Luo, Nan Tang, Guoliang Li, Jiawei Tang, Chengliang Chai, Xuedi Qin

Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks [pdf] [link]
SIGMOD 2021 ACM SIGMOD Conference on Management of Data
Yuyu Luo, Nan Tang, Guoliang Li, Chengliang Chai, Wenbo Li, Xuedi Qin

DEEPEYE: A Data Science System for Monitoring and Exploring COVID-19 Data [pdf] [link]
IEEE Data Engineering Bulletin, 2020. (invited)
Yuyu Luo, Nan Tang, Guoliang Li, Tianyu Zhao, Wenbo Li, Xiang Yu

Real-Time Mobility Analysis Through Google Maps [pdf] [link]
Fifth International Data for Policy Conference, 2020
Ingmar Weber, Nan Tang, Soon-Gyo Jung, Rade Stanojevic, Noora Al-Emadi, Ji Lucas

DeepTrack: Monitoring and Exploring Spatio-Temporal Data (A Case of Tracking COVID-19) [pdf] [link]
VLDB 2020 (demo) The 46th International Conference on Very Large Data Bases
Yuyu Luo, Wenbo Li, Tianyu Zhao, Xiang Yu, Lixi Zhang, Guoliang Li, Nan Tang

VisClean: Interactive Cleaning for Progressive Visualization [Video]
VLDB 2020 (demo) The 46th International Conference on Very Large Data Bases
Yuyu Luo, Chengliang Chai, Xuedi Qin, Nan Tang, Guoliang Li

Interactive Cleaning for Progressive Visualization through Composite Questions [pdf] [Video]
ICDE 2020 The 36th IEEE International Conference on Data Engineering
Yuyu Luo, Chengliang Chai, Xuedi Qin, Nan Tang, Guoliang Li

Making Data Visualization More Efficient and Effective: A Survey [pdf]
VLDBJ 2020 The VLDB Journal
Xuedi Qin, Yuyu Luo, Nan Tang, Guoliang Li

Steerable Self-driving Data Visualization [pdf]
TKDE 2020 IEEE Transactions on Knowledge and Data Engineering
Yuyu Luo, Xuedi Qin, Chengliang Chai, Nan Tang, Guoliang Li, Wenbo Li

Miscellaneous

Road-aware Indexing for Trajectory Range Queries
TKDE 2023 IEEE Transactions on Knowledge and Data Engineering
Yong Wang, Kaiyu Li, Guoliang Li, Nan Tang

Learned Cardinality Estimation: A Design Space Exploration and a Comparative Evaluation [pdf]
VLDB 2022 The 48th International Conference on Very Large Data Bases
Ji Sun, Jintao Zhang, Zhaoyan Sun, Guoliang Li, Nan Tang

Interactively Discovering and Ranking Desired Tuples by Data Exploration [pdf]
VLDBJ 2022 The VLDB Journal
Xuedi Qin, Chengliang Chai, Yuyu Luo, Tianyu Zhao, Nan Tang, Guoliang Li, Jianhua Feng, Xiang Yu, Mourad Ouzzani

Learned Cardinality Estimation for Similarity Queries [pdf]
SIGMOD 2021 ACM SIGMOD Conference on Management of Data
Ji Sun, Guoliang Li, Nan Tang

Ranking Desired Tuples by Database Exploration [pdf]
ICDE 2021 (short paper) The 37th IEEE International Conference on Data Engineering
Xuedi Qin, Chengliang Chai, Yuyu Luo, Tianyu Zhao, Nan Tang, Guoliang Li, Jianhua Feng, Xiang Yu, Mourad Ouzzani

Deductive Optimization of Relational Data Storage [pdf] [code]
OOPSLA 2020 Object-Oriented Programming, Systems, Languages, and Applications
John K. Feser, Samuel Madden, Nan Tang, Armando Solar-Lezama

Interactively Discovering and Ranking Desired Tuples without Writing SQL Queries [pdf] [video]
SIGMOD 2020 (demo) ACM SIGMOD Conference on Management of Data
Xuedi Qin, Chengliang Chai, Yuyu Luo, Nan Tang, Guoliang Li

Reinforcement Learning with Tree-LSTM for Join Order Selection [pdf]
ICDE 2020 The 36th IEEE International Conference on Data Engineering
Xiang Yu, Guoliang Li, Chengliang Chai, Nan Tang

Querying Shortest Paths on Time Dependent Road Networks [pdf]
VLDB 2019 The 45th International Conference on Very Large Data Bases
Yong Wang, Guoliang Li, Nan Tang

Efficient Algorithms for Approximate Single-Source Personalized PageRank Queries [pdf]
TODS 2019 ACM Transactions on Database Systems
Sibo Wang, Renchi Yang, Runhui Wang, Xiaokui Xiao, Zhewei Wei, Wenqing Lin, Yin Yang, Nan Tang

Tutorials

2019

Towards Democratizing Relational Data Visualization [pdf]
ACM SIGMOD Conference on Management of Data (SIGMOD Tutorial), Amsterdam, The Netherlands, 2019
Nan Tang, Eugene Wu, Guoliang Li
Presentations: Introduction, 30 mins [keynote]; Efficient data visualization, 1 hour [ppt]; Smart data visualization, 1 hour [keynote]; Uncertainty, collaborative and immersive data visualization, 30 mins [keynote]

Awards

2021

VLDB 2021 Distinguished Reviewer Award

2020

SIGMOD 2020 Reproducibility Award. Raha: A Configuration-Free Error Detection System

2018

2018 Best of ICDE. Discovering Mis-Categorized Entities

2015

2015 Best of VLDB. Lightning Fast and Space Efficient Inequality Joins

2012

2012 Best of ICDE. Incremental Detection of Inconsistencies in Distributed Data

2010

2010 VLDB Best Paper. Towards Certain Fixes with Editing Rules and Master Data

2009

2009 Best of ICDE. Projective Distribution of Full-Fledged XQuery

Services (2019-)

2024

SIGMOD

2023

SIGMOD (research and demo), ICDE

2022

SIGMOD (research and demo), VLDB (research and demo), KDD, VIS, EuroVis

2021

SIGMOD (Exhibition Chair), VLDB, KDD, CHI, VIS

2020

SIGMOD, VLDB, KDD, DASFAA, ICDCS

2019

SIGMOD (research and demo), VLDB (research and demo), KDD, DASFAA (demo co-chair)

Talks

Data Visualization and Exploration of COVID-19 Data. At the QCRI lectures on the use of AI techniques for COVID-19, Qatar, April 2020. [Gulf Times]

Data Preparation meets Data Visualization. At Northeastern University, 11/10/2019.

Mind Your Analytics, Clean Your Data. At Harvard University, 07/10/2016. [slides]

The Data Civilizer System. Co-presented with Mike Stonebraker at MIT, 06/10/2016.

Graph Stream Summarization. At MIT, 23/06/2016. [slides]

Big Data Cleaning. At APWeb 2014, Distinguished Lecture Series, 06/09/2014. [slides]

Get in Touch

Contact

ntang@hbku.edu.qa

B1-1124, HBKU RC, Qatar

+974 44542850