Main Components of Data Civilizer

Usability

Help users to formulate their query intent in a structured way.

Data Discovery and Stitching

Discover the pieces of data the user is interested in and stitch them together.

Query-driven Data Cleaning

Clean the data the user is interested in, in a clean-as-you-go fashion.

Usability

In large organizations, it is common to have several hundred to several thousand databases. For instance, Merck has over 4000 Oracle databases. These databases may be managed by a Polystore system or, as in earlier practice, by federated database systems. In such cases, a data scientist usually needs to write a structured query to stitch together information from multiple data sources. The problem is how to write a structured query when the user does not know what information is stored in the Polystore system. We are building a user-friendly interface to assist users with this task.

Data Discovery and Stitching

Data discovery finds all relevant data sources and decides which ones to use. Data stitching combines the information from those sources and returns an integrated result to the user.
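
The two steps can be illustrated with a toy sketch. This is not Data Civilizer's actual implementation: the catalog contents, the column-name matching rule, and the function names discover and stitch are all assumptions made for illustration. Discovery here simply keeps sources that expose a wanted column, and stitching is an inner join on a shared key.

```python
# Hypothetical catalog: source name -> list of rows. All names are illustrative.
SOURCES = {
    "hr.employees": [
        {"emp_id": 1, "name": "Ada", "dept_id": 10},
        {"emp_id": 2, "name": "Grace", "dept_id": 20},
    ],
    "finance.departments": [
        {"dept_id": 10, "dept_name": "Research"},
        {"dept_id": 20, "dept_name": "Engineering"},
    ],
    "ops.servers": [
        {"host": "db01", "rack": 7},
    ],
}


def discover(sources, wanted_columns):
    """Return the names of sources that expose at least one wanted column."""
    wanted = set(wanted_columns)
    relevant = []
    for name, rows in sources.items():
        columns = set(rows[0]) if rows else set()
        if columns & wanted:
            relevant.append(name)
    return relevant


def stitch(left_rows, right_rows, key):
    """Inner-join two row lists on a shared key column."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# The user wants employee names together with department names.
relevant = discover(SOURCES, ["name", "dept_name"])
result = stitch(SOURCES[relevant[0]], SOURCES[relevant[1]], key="dept_id")
```

In this sketch the join key is given by hand; in a real Polystore setting, finding which columns can link two sources is itself part of the discovery problem.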

Query-driven Data Cleaning

For most companies, it is almost impossible to clean all the data they have, due to the high cost of the cleaning effort and the dynamic nature of their data. Hence, we study the problem of query-driven data cleaning, which aims to minimize the cost of cleaning the data with respect to user-given queries.
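
A minimal sketch of the clean-as-you-go idea follows. It is an illustration under stated assumptions, not the project's method: the cleaning rule clean_city, the function query_with_cleaning, and the sample rows are hypothetical, and the sketch assumes the column used in the query predicate is already trustworthy. Only rows that the query returns are cleaned, so the cleaning cost scales with the query answer rather than with the whole table.

```python
def clean_city(value):
    """Hypothetical cleaning rule: trim whitespace and normalize capitalization."""
    return value.strip().title()


def query_with_cleaning(rows, predicate, cleaner, column):
    """Answer a selection query, cleaning only the rows it returns.

    Returns the cleaned answer and the number of rows the cleaner was applied to.
    Assumes the predicate reads only columns that are already clean.
    """
    answer = []
    rows_cleaned = 0
    for row in rows:
        if predicate(row):
            fixed = dict(row)  # copy so the stored data is left untouched
            fixed[column] = cleaner(row[column])
            rows_cleaned += 1
            answer.append(fixed)
    return answer, rows_cleaned


# Illustrative dirty table.
ROWS = [
    {"id": 1, "country": "US", "city": " boston "},
    {"id": 2, "country": "US", "city": "NEW YORK"},
    {"id": 3, "country": "FR", "city": "paris  "},
    {"id": 4, "country": "DE", "city": "berlin"},
]

answer, rows_cleaned = query_with_cleaning(
    ROWS, lambda r: r["country"] == "US", clean_city, "city"
)
```

Here the query touches two of the four rows, so the cleaner runs twice and the other rows stay dirty until some later query needs them.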