BigGorilla: data integration and data preparation ecosystem

BigGorilla is an open-source data integration and data preparation ecosystem (powered by Python) that helps data scientists integrate and analyze data. BigGorilla consolidates and documents the steps that data scientists typically take to bring data from different sources into a single database for analysis. For each of these steps, we document existing technologies and point to desired technologies that could still be developed.

The components of BigGorilla are freely available for download and use. Data scientists are encouraged to contribute code, datasets, or examples to BigGorilla. We hope to promote education and training for aspiring data scientists through the development, documentation, and tools that BigGorilla provides.

Find the BigGorilla workflows and related repositories on GitHub.

Data Acquisition, Extraction, and Cleaning

Use this component when you wish to acquire data from other sources or extract structured data from text. Most tools in this component also include data cleaning features, for example to detect or correct inconsistent data.
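As a rough illustration of extraction followed by cleaning, the sketch below pulls email addresses out of free text and then removes duplicates that differ only in letter case. It uses only the Python standard library and is not one of the BigGorilla tools; the pattern `EMAIL_RE` and the helpers `extract_emails` and `clean_emails` are hypothetical names chosen for this example, and the pattern is deliberately simple rather than a full email validator.

```python
import re

# Assumption: a deliberately simple pattern, not a complete email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Extraction step: return every email-like substring, in order of appearance."""
    return EMAIL_RE.findall(text)


def clean_emails(emails):
    """Cleaning step: lowercase each address and drop repeats, keeping first-seen order."""
    seen = set()
    cleaned = []
    for email in emails:
        normalized = email.strip().lower()
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


text = "Contact Ada at Ada@Example.com or ada@example.com; Bob is at bob@example.org."
raw = extract_emails(text)
print(raw)                # three raw matches, two of them differing only in case
print(clean_emails(raw))  # two distinct addresses after cleaning
```

Real acquisition and cleaning tools handle far more cases (HTML, PDFs, conflicting values across sources); the point here is only the two-stage shape: extract first, then normalize and reconcile.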

Entity Matching

Use this component when you wish to determine whether two records refer to the same real-world entity or are related in some way.
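A minimal sketch of the idea, assuming that entities are represented by name strings and that string similarity is an acceptable proxy for identity. It uses `difflib.SequenceMatcher` from the Python standard library rather than any BigGorilla component; the helper names and the 0.85 threshold are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(name.split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a, b, threshold=0.85):
    """Treat two names as the same entity when similarity reaches the threshold.

    The threshold is an illustrative assumption; in practice it is tuned on labeled pairs.
    """
    return similarity(a, b) >= threshold


print(is_match("Acme Corp.", "ACME corp"))   # True: identical after normalization
print(is_match("Acme Corp.", "Globex Inc"))  # False: the names share almost nothing
```

Dedicated entity matching tools typically add blocking (to avoid comparing every pair), multiple attribute comparisons, and learned matchers; this sketch shows only the pairwise comparison step.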

Schema Matching and Mapping

Use this component when you wish to match attributes across two schemas or when you wish to generate scripts (from schema matchings) that can be executed to transform data from one format into another.
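The two halves of this component can be sketched as follows: first match attributes by name similarity, then use the resulting matching to transform records from the source format into the target format. This is an illustrative, standard-library-only sketch, not a BigGorilla tool; the schemas, the function names `match_schemas` and `apply_mapping`, and the 0.5 threshold are all assumptions. It also matches each source attribute independently, so it does not enforce a one-to-one mapping.

```python
import re
from difflib import SequenceMatcher


def normalize(attr):
    """Lowercase an attribute name and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", attr.lower())


def match_schemas(source, target, threshold=0.5):
    """Map each source attribute to its most similar target attribute, if similar enough."""
    mapping = {}
    if not target:
        return mapping
    for s in source:
        scored = [
            (SequenceMatcher(None, normalize(s), normalize(t)).ratio(), t)
            for t in target
        ]
        best_score, best_target = max(scored)
        if best_score >= threshold:
            mapping[s] = best_target
    return mapping


def apply_mapping(record, mapping):
    """Rename the keys of one record according to the mapping; unmapped keys are dropped."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


# Hypothetical schemas used only for this example.
SOURCE = ["first_name", "last_name", "phone"]
TARGET = ["FirstName", "LastName", "PhoneNumber"]

mapping = match_schemas(SOURCE, TARGET)
print(mapping)
print(apply_mapping({"first_name": "Ada", "last_name": "Lovelace", "phone": "555-0100"}, mapping))
```

Name similarity alone is a weak signal; real schema matchers usually also compare data types and sample values, and a person reviews the proposed matching before any transformation script is generated from it.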

Additional Data Preparation Tools

This component contains additional data preparation tools such as tools for building and automating a workflow of tasks or tools for converting data from one format into another.
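As one small example of converting data between formats, the sketch below turns CSV text into a JSON array of objects using only the Python standard library. The function name `csv_to_json` and the sample data are assumptions made for illustration; this is not a BigGorilla tool.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects.

    All values stay strings, because CSV carries no type information.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader))


sample = "id,name\n1,Ada\n2,Bob\n"
converted = csv_to_json(sample)
print(converted)
```

A workflow tool would chain steps like this one (acquire, clean, match, convert) and rerun them automatically when inputs change.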
