At Megagon Labs, we strive to enable seamless utilization of Japanese natural language processing (NLP) for engineers and data scientists around the world. We’re proud to announce we’ve taken one big step forward in fulfilling this endeavor. By applying the results of our joint research efforts with the National Institute for Japanese Language and Linguistics (NINJAL) on universal dependencies for Japanese, we’ve created GiNZA.
GiNZA is an open-source Japanese NLP library with features such as a one-step installer, high-speed and high-precision analysis, and international capabilities for sentence structure analysis.
Why We Built GiNZA
By combining fields like artificial intelligence (AI) and linguistics, NLP allows us to analyze the written and spoken word of various languages. This technology is widely used in an array of consumer and business applications, including search engines, machine translation, chat agents, and more.
Modern NLP employs sophisticated machine learning technology to handle the diverse vocabularies and grammatical rules of different languages. For example, written Japanese has no spaces separating its words, so analyzing it requires several tools.
Morphological analyzers are usually used to carry out tokenization and part-of-speech tagging. Dependency parsers are often leveraged to analyze and determine the relationship between words. And named-entity recognizers (NER) typically take care of labeling spans and types of named entities such as people, places, and organizations.
In recent years, NLP frameworks that integrate numerous functionalities have grown in popularity, and for good reason: they facilitate easy use of NLP for English, Chinese, and other languages. However, efforts to develop and deploy analytical models for Japanese NLP frameworks operating on commercially usable licenses have lagged behind.
Japanese datasets for NLP research purposes and training models are extremely limited compared to English ones. And the ones that are available are often restricted for commercial use. This lack of standardized knowledge and tools not only poses significant challenges for the evolution of Japanese NLP but also results in exorbitant costs and extra work just to take care of fundamental tasks.
We built GiNZA to address this obstacle.
How Is GiNZA Implemented?
GiNZA utilizes two core technologies:
- spaCy: An NLP framework that incorporates leading machine learning capabilities.
- SudachiPy: An open-source morphological analyzer that takes care of tokenization.
Leveraging the advantages of these base technologies in our pipeline design allows GiNZA to provide sufficient processing speed and analytical accuracy, even in industrial applications.
How to Use GiNZA
GiNZA is available under the MIT License. Anyone can use it for any purpose, and you only need to take one simple step to install it. In an environment with Python 3.6 or newer, run the following command:
$ pip install ginza
Based on Universal Dependencies, GiNZA’s dependency parser analyzes Japanese text quickly and accurately. Run the following command to have it analyze an “input.txt” text file, which should contain only text separated by line breaks. The results are saved in CoNLL-U format under the “output.txt” filename:
$ ginza < input.txt > output.txt
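Once the analysis has been written out, the CoNLL-U file can be read back with a few lines of Python. The sketch below is one possible way to do this using only the standard library. The `parse_conllu` function, the `CONLLU_FIELDS` names, and the `SAMPLE` text are illustrative assumptions and not part of GiNZA. The sample rows are hand-written to show the ten tab-separated columns of the format and are not actual GiNZA output.

```python
from typing import Dict, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # comment line such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            continue  # skip anything that is not a ten-column token row
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample for illustration only (not real GiNZA output).
SAMPLE = "\n".join([
    "# text = 銀座でランチ",
    "\t".join(["1", "銀座", "銀座", "PROPN", "_", "_", "3", "nmod", "_", "_"]),
    "\t".join(["2", "で", "で", "ADP", "_", "_", "1", "case", "_", "_"]),
    "\t".join(["3", "ランチ", "ランチ", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
])

for sentence in parse_conllu(SAMPLE):
    for token in sentence:
        print(token["id"], token["form"], token["upos"],
              token["head"], token["deprel"])
```

To apply the same function to a real result file, pass it the file contents, for example `parse_conllu(open("output.txt", encoding="utf-8").read())`.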
As noted before, GiNZA’s analytical model uses the spaCy API for learning. This means that GiNZA is compatible with analytical models provided via spaCy for English, German, and other languages.
In operating environments with GiNZA installed, you can easily change the specified language and use the Japanese analytical model through the spaCy API with the following Python code:
import spacy
nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
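The resulting `doc` is a regular spaCy document, so its tokens and named entities can be inspected through standard spaCy attributes. The following is a minimal sketch of how that might look. The helper names `token_rows`, `entity_rows`, and `main` are hypothetical and not part of GiNZA or spaCy, and `main` assumes GiNZA and its `ja_ginza` model are installed.

```python
def token_rows(doc):
    """Return (index, text, lemma, POS, dependency label, head index) per token."""
    return [
        (token.i, token.text, token.lemma_, token.pos_, token.dep_, token.head.i)
        for token in doc
    ]


def entity_rows(doc):
    """Return (text, label) for each named entity in the document."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    # Imported here so the helpers above can be used without loading spaCy.
    import spacy

    nlp = spacy.load("ja_ginza")  # assumes GiNZA and its model are installed
    doc = nlp("銀座でランチをご一緒しましょう。")
    for row in token_rows(doc):
        print(*row, sep="\t")
    for text, label in entity_rows(doc):
        print(text, label)
```

Calling `main()` prints one tab-separated row per token followed by any named entities the model finds. Keeping the helpers separate from model loading means the same functions can be reused with a spaCy model for another language.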
The Next Steps for GiNZA
Since it is open-source software, GiNZA already has a substantial number of followers. But contributors cannot make direct improvements because the license of the UD Japanese BCCWJ data set, which is used for GiNZA’s analytical model training, is not public.
To address this, Megagon Labs is participating in NINJAL’s UD Japanese GSD restructuring project. More specifically, we’re responsible for the annotation of named-entity labels. This initiative provides large quantities of data that enable us to proceed with open and efficient improvements to the GiNZA/spaCy analytical model.
Once the restructured UD Japanese GSD data set is complete, we plan to release it under the commercially usable Creative Commons Attribution-ShareAlike (CC BY-SA) license. We also intend to integrate the analytical model trained on this new data set as well as GiNZA’s optimized pipeline into spaCy’s language model repository via pull requests.