Use this component when you wish to acquire data from other sources or extract structured data from text. Most tools in this component include data cleaning features that, for example, detect and/or correct inconsistent data.
https://github.com/biggorilla-gh/frameit
FrameIt is a system for creating custom frames for text corpora. FrameIt uses Python 3 and spaCy 2.
Some features include:
– Intent detection for individual sentences using a CNN model
– Entity extraction paired with intents using either CNN or heuristic models
– SRL system allows for loading multiple Frames for intent detection simultaneously, allowing for the differentiation of similar domains
– Easy to train and customize using Jupyter notebooks
Scrapy is a framework for extracting data from websites. Scrapy can be used to build a crawler or spider to crawl multiple websites and retrieve selected data.
https://github.com/biggorilla-gh/usagi
Usagi is an open source platform to build data discovery systems. Usagi crawls and extracts metadata about datasets and builds catalogs and indices to make datasets discoverable by search and browsing.
pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.
https://docs.python.org/3/library/json.html
The json library parses JSON strings into dictionaries and lists and vice versa.
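As a minimal sketch of that round trip, the example below parses a JSON string into Python objects and serializes it back. The record and its field names are hypothetical and exist only for illustration.

```python
import json

# Hypothetical JSON text, used only for illustration.
raw = '{"name": "example", "tags": ["nlp", "frames"], "count": 3}'

# JSON text -> Python objects: objects become dicts, arrays become lists.
record = json.loads(raw)
record["tags"].append("python")

# Python objects -> JSON text; sort_keys gives a stable key order.
encoded = json.dumps(record, sort_keys=True)

# Parsing the encoded text again yields an equal structure.
round_tripped = json.loads(encoded)
```

For files rather than strings, `json.load` and `json.dump` take an open file object in place of the string.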
https://docs.python.org/3/library/csv.html
The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats.
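A short sketch of the "format preferred by Excel" idea, assuming an in-memory buffer instead of a real file: rows are written with the built-in `excel` dialect and then read back. The column names are illustrative choices.

```python
import csv
import io

# Illustrative rows; the column names are assumptions for this sketch.
rows = [
    {"tool": "pandas", "language": "Python"},
    {"tool": "Apache Nutch", "language": "Java"},
]

# An in-memory text buffer stands in for a file opened with newline="".
buffer = io.StringIO(newline="")

# Write a header and the rows using the Excel dialect.
writer = csv.DictWriter(buffer, fieldnames=["tool", "language"], dialect="excel")
writer.writeheader()
writer.writerows(rows)

# Rewind and read the same data back as dictionaries keyed by the header.
buffer.seek(0)
read_back = [dict(row) for row in csv.DictReader(buffer, dialect="excel")]
```

With a real file, the same code applies after replacing the buffer with `open(path, "w", newline="")` for writing and `open(path, newline="")` for reading.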
https://pypi.python.org/pypi/xlrd
xlrd is a Python package that parses Excel data. It has accompanying packages for writing and formatting information in Excel format.
https://pypi.python.org/pypi/pdftables
PDFtables parses PDF files and extracts what it believes to be tables.
https://pypi.python.org/pypi/slate
Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.
https://pypi.python.org/pypi/pdfminer/
PDFminer is a Python package for extracting information from PDF files into text.
PDFminer includes a tool that can convert PDF files into HTML in addition to text.
http://nlp.stanford.edu/software/openie.html
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.
https://github.com/biggorilla-gh/koko
Koko is an information extraction tool (developed in Python 3) that allows users to query a text corpus and extract the entities that are of interest to them.
SpaCy is a library for advanced Natural Language Processing in Python and Cython.
https://cloud.google.com/natural-language/
Google Cloud Natural Language API provides developers with access to Google-powered, machine learning-based text analysis components such as sentiment analysis, entity recognition, and syntax analysis.
NLTK is an open-source platform for building Python programs to process human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK also provides wrappers for industrial-strength NLP libraries.
Helps easily read and parse Web pages. Great for initial parsing and scraping.
Apache Nutch is an extensible and scalable open source web crawler written in Java.
https://github.com/DataResponsibly/DataSynthesizer
Data Synthesizer can generate a synthetic dataset from a sensitive one for release to the public.
Tweepy is a Python library for accessing the Twitter API to extract tweets.
https://docs.python.org/2/library/urllib2.html
urllib and urllib2 are part of the Python standard library for making simple HTTP requests to visit web pages and get their content.
https://docs.python.org/2/library/urllib.html
urllib and urllib2 are part of the Python standard library for making simple HTTP requests to visit web pages and get their content.
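The links above point to the Python 2 documentation; in Python 3 the same functionality lives in `urllib.request` and `urllib.parse`. The sketch below assumes Python 3 and a hypothetical endpoint, query, and User-Agent string. It only builds the request, so it runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and query, chosen only for illustration.
base_url = "https://example.com/search"
query = urlencode({"q": "data integration"})

# Build the request object; nothing is sent over the network here.
request = Request(
    f"{base_url}?{query}",
    headers={"User-Agent": "example-crawler/0.1"},
)

# To actually fetch the page, one would pass the request to urlopen:
#     from urllib.request import urlopen
#     with urlopen(request, timeout=10) as response:
#         html = response.read().decode("utf-8")
```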
http://docs.python-requests.org/
Requests is an HTTP library for Python that provides the APIs needed to scrape websites. Requests can make complex requests to visit a page and get content, such as those requiring additional headers, complex POST data, or authentication credentials.
https://github.com/googlemaps/google-maps-services-python
This library brings the Google Maps API Web Services to your Python application.
The Python Client for Google Maps Services is a Python client library for the following Google Maps APIs: Directions API, Distance Matrix API, Elevation API, Geocoding API, Geolocation API, Time Zone API, Roads API, and Places API.