BigGorilla is an open-source, Python-powered ecosystem for data integration and data preparation that helps data scientists integrate and analyze data. It consolidates and documents the steps that data scientists typically take to bring data from different sources into a single database for analysis. For each of these steps, we document existing technologies and point to technologies that do not yet exist but could be developed.
The components of BigGorilla are freely available for download and use. Data scientists are encouraged to contribute code, datasets, or examples to BigGorilla. Through the development work, documentation, and tools that BigGorilla provides, we hope to promote education and training for aspiring data scientists.
Find the BigGorilla workflows and related repositories on GitHub.
Data Acquisition, Extraction, and Cleaning
Use this component when you wish to acquire data from other sources or extract structured data from text. Most tools in this component also include data cleaning features that, for example, detect or correct inconsistent data.
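As a rough illustration of what extraction and cleaning can involve, the sketch below pulls structured records out of free text with a regular expression, normalizes them, and flags an inconsistent value. It uses only the Python standard library and is not the API of any BigGorilla tool; the sample lines, the pattern, and the names `extract` and `clean` are assumptions made up for this example.

```python
import re

# Hypothetical free-text records; the layout and field names are illustrative.
RAW_LINES = [
    "Title: The Matrix ; Year: 1999",
    "title:  the matrix;year: 1999 ",
    "Title: Inception ; Year: 20l0",  # typo: letter "l" instead of digit "1"
]

PATTERN = re.compile(
    r"title:\s*(?P<title>.*?)\s*;\s*year:\s*(?P<year>\S+)", re.IGNORECASE
)

def extract(line):
    """Extract a {title, year} record from one line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {"title": m.group("title"), "year": m.group("year")}

def clean(record):
    """Normalize the title and flag a year that is not a four-digit number."""
    title = " ".join(record["title"].split()).title()
    year = record["year"].strip()
    valid = re.fullmatch(r"\d{4}", year) is not None
    return {"title": title, "year": int(year) if valid else None, "year_ok": valid}

records = [clean(r) for r in map(extract, RAW_LINES) if r is not None]
for rec in records:
    print(rec)
```

After cleaning, the first two lines produce identical records, while the third is kept but marked with `year_ok` set to `False` so that it can be reviewed or corrected later.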
Entity Matching
Use this component when you wish to determine whether two entities are the same entity or are related in some way.
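To make the idea concrete, here is a minimal sketch of one simple matching technique: comparing names by the Jaccard similarity of their word tokens and keeping pairs above a threshold. It is a standard-library illustration only, not the method used by any particular BigGorilla tool; the company names, the `0.5` threshold, and the function names are assumptions chosen for the example.

```python
def tokens(text):
    """Lowercase a string, drop commas and periods, and split it into a set of words."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings (1.0 means identical sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def match_entities(left, right, threshold=0.5):
    """Return every (left, right) pair whose similarity reaches the threshold."""
    return [(l, r) for l in left for r in right if jaccard(l, r) >= threshold]

# Hypothetical records from two sources.
left = ["Apple Inc.", "Microsoft Corporation"]
right = ["apple inc", "Microsoft Corporation USA", "Alphabet Inc."]

matches = match_entities(left, right)
for pair in matches:
    print(pair)
```

Comparing every pair is quadratic in the number of records, so a sketch like this only suits small inputs; larger datasets usually need a blocking step that narrows the candidate pairs first.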
Schema Matching and Mapping
Use this component when you wish to match attributes across two schemas or when you wish to generate scripts (from schema matchings) that can be executed to transform data from one format into another.
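The two halves of this component, matching and mapping, can be sketched in a few lines. The example below matches attribute names across two schemas by string similarity and then uses the resulting mapping to rewrite a record into the target format. It relies on `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the schemas, the `0.5` threshold, and the helper names are hypothetical and do not reflect a specific BigGorilla tool.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase an attribute name and drop anything that is not a letter or digit."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def match_schemas(source, target, threshold=0.5):
    """Map each source attribute to its most similar target attribute, if similar enough."""
    mapping = {}
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            score = SequenceMatcher(None, normalize(s), normalize(t)).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            mapping[s] = best
    return mapping

def transform(record, mapping):
    """Rename the keys of a record according to the schema mapping."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

# Hypothetical source and target schemas.
source_schema = ["first_name", "last_name", "zip"]
target_schema = ["FirstName", "LastName", "ZipCode"]

mapping = match_schemas(source_schema, target_schema)
converted = transform(
    {"first_name": "Ada", "last_name": "Lovelace", "zip": "12345"}, mapping
)
print(mapping)
print(converted)
```

Name similarity alone is a weak signal, so in practice a matcher would typically also look at data types or sample values, and a person would review the proposed mapping before any transformation script is run.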
Additional Data Preparation Tools
This component contains additional data preparation tools, such as tools for building and automating a workflow of tasks or for converting data from one format into another.
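Both kinds of tools mentioned here can be illustrated together: the sketch below converts CSV text to JSON and chains the conversion steps into a tiny workflow. It is a standard-library example under assumed inputs, not a BigGorilla tool; `run_workflow`, the step functions, and the sample data are hypothetical names introduced for illustration.

```python
import csv
import io
import json

def csv_to_rows(text):
    """Parse CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_to_json(rows):
    """Serialize a list of row dictionaries as a JSON string."""
    return json.dumps(rows, indent=2)

def run_workflow(data, steps):
    """Run each step in order, feeding the output of one step into the next."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical input data.
CSV_TEXT = "name,city\nAda,London\nGrace,New York\n"

result = run_workflow(CSV_TEXT, [csv_to_rows, rows_to_json])
print(result)
```

Expressing the workflow as an ordered list of functions keeps each step independently testable and makes it easy to insert a cleaning or matching step between parsing and serialization.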