Suppose we have acquired a table X(id, name, loc) from a relational database management system and we have extracted a table Y(id, cname, address, rev) from a set of news articles.
The table X contains information about the names and locations of companies while the table Y contains, for each company, its address and its quarterly revenue of the company in billions of dollars.
We detect that revenue 351 of GE (the last row of Table Y) is an outlier. Upon closer inspection, we realize that it should have been 35.1 instead of 351, due to an extraction error. So we manually change this value to 35.1.
* Note that there are many other types of cleaning operations in general. We are currently missing a data cleaning component in BigGorilla.
Next, we match the schemas of tables X and Y. We obtain the matches name⬌cname and loc⬌address. Intuitively, this means that the attribute name of table X is the same as cname of table Y, and the attribute loc of table X is the same as address of table Y.
Based on the matchings name⬌cname and loc⬌address between tables X and Y, the data scientist may choose to merge the two schemas X(id, name, loc) and Y(id, cname, address, rev) into a single schema Z(name, loc, rev). Note that the id attribute is omitted in the merge process and this is a conscious decision of the data scientist.
The program that is used to transform data that resides in tables X and Y into table Z is called a schema mapping. Here, the schema mapping is developed based on understanding how tuples from X and Y should be migrated into Z. The program can be an SQL query that populates table Z based on tuples from tables X and Y. It uses table M to determine the matches and it uses the function merge_name(.) to apply the heuristic of selecting the longer string described earlier.
The example above is simplified to illustrate the key steps of data integration and data preparation. In practice, the steps may be carried out in a different order and some of the steps may even be repeated. For example, once table Z is obtained, the data may be cleaned again and matched and merged with some other datasets. Also, instead of discarding information about the last tuple in table Y, the data scientist may choose to keep the last tuple by performing an outer join in the SQL query.
In general, many of the steps such as schema matching/merging and data matching/merging are not as trivial as the example shows. Hence, the schema mapping and the process of arriving at the desired schema mapping are not as trivial as what is illustrated above. To make it easy for a data scientist to wrangle data, it is important to have semi-automatic tools to help the data scientists along each step.
The example above covers the “data wrangling” aspect of the data science pipeline. After data from different sources are integrated into a single database, a data scientist would like to perform analysis on the data through techniques such as classification, clustering, anomaly detection, correlation discovery, and OLAP style exploration.