To create a better customer experience, we need to know what makes a service or product attractive to customers. Megagon Labs Tokyo has developed the Japanese Realistic Textual Entailment Corpus (JRTE Corpus), a corpus of text from reviews of hotels and other lodging facilities published on the travel information website Jalan Net. The JRTE Corpus is enriched with labels provided by human annotators. This article introduces the JRTE Corpus and walks through a simple example of using it to fine-tune a standard machine learning model (BERT).
Word-of-mouth from online reviews is essential for users who are considering a service or product. Prospective customers can make better choices if they learn about an unfamiliar area beforehand, or if they understand the amenities of a lodging option they have never tried but that others reviewed favorably. Unfortunately, it is still not easy to quickly extract the information you want from the large amount of user chatter online. To address this problem, we decided to create a corpus that helps with the automatic extraction and organization of such knowledge.
Contents of the JRTE Corpus
The JRTE Corpus enriches review text with three kinds of labels: presence/absence of a hotel feature, sentiment polarity, and textual entailment. The corpus is distinctive in that its sentences are not created artificially but come from real reviews. Building a corpus is generally a time-consuming process, but the JRTE Corpus is immediately available for academic purposes, which we think makes it a valuable language resource for academic development.
The hotel-feature presence/absence label and the sentiment polarity label
The corpus assigns a binary label (yes=1, no=0) depending on whether or not a sentence mentions a feature of the hotel. It also assigns a three-level sentiment polarity label (positive=1, neutral=0, negative=-1). In this article, we call the classification tasks for these labels "RHR" and "PN," respectively. Example data is shown below. By using this data to build a classifier, we can extract "mentions of popular lodging features."
An example of hotel features with Y/N label and sentiment polarity label
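To make the label scheme concrete, here is a small sketch in Python. The sentences below are English renderings of the kind of review text in the corpus, not verbatim corpus rows, and the tuple layout is chosen for illustration only:

```python
# Illustrative examples in the style of the JRTE hotel-feature data.
# Each entry: (sentence,
#              RHR label: 1 = mentions a hotel feature, 0 = does not,
#              PN label: 1 = positive, 0 = neutral, -1 = negative)
examples = [
    ("The food was delicious.",        1,  1),   # hotel feature, positive
    ("We stayed as a group of three.", 0,  0),   # no hotel feature, neutral
    ("The room was cramped.",          1, -1),   # hotel feature, negative
]

# Extract "mentions of popular lodging features": sentences that mention
# a hotel feature and are rated positively.
popular_mentions = [s for s, rhr, pn in examples if rhr == 1 and pn == 1]
print(popular_mentions)  # ['The food was delicious.']
```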
The following Japanese corpora with sentiment polarity labels are publicly available:
- University of Tsukuba sentence unit evaluation polarity tagged corpus (External Link)
- Tokyo Metropolitan University Sentiment Treebank (TMUST) (External Link)
Textual Entailment Label
The corpus assigns a binary label (entailment=1, no entailment=0) depending on whether the hypothesis (H) is always true when the premise (P) is true (that is, whether P entails H). In this article, we call this classification task "RTE." Example data is shown below. By using this data to build a classifier, we can find sentences that refer to the same thing with different wording.
“Best view from the room” does not necessarily mean “you can see the ocean from the room,” so that pair is labeled 0. On the other hand, “The room had an ocean view” does mean “you can see the ocean from the room,” so that pair is labeled 1.
An Example of Recognizing Textual Entailment
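The entailment example above can be expressed as data. The premise/hypothesis pairs below are illustrative, in the style of the corpus rather than verbatim rows:

```python
# Illustrative premise/hypothesis pairs in the style of the JRTE RTE data.
# Label: 1 = the premise entails the hypothesis, 0 = it does not.
pairs = [
    ("The room had an ocean view.", "You can see the ocean from the room.", 1),
    ("Best view from the room.",    "You can see the ocean from the room.", 0),
]

# Turn each label into a human-readable verdict.
verdicts = ["entails" if label == 1 else "does not entail"
            for *_, label in pairs]
print(verdicts)  # ['entails', 'does not entail']
```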
The following Japanese corpora with textual entailment labels are publicly available:
- Textual Entailment Evaluation Data (External Link)
- Japanese SNLI(JSNLI) Data Set (External Link)
- NTCIR-10 Recognizing Inference in TExt Task Japanese Wikipedia Corpus and Task Data (External Link)
Examples of using the JRTE corpus for model training
There are many ways to create a classifier from a corpus; here we will fine-tune the publicly available Japanese BERT model from Tohoku University for each task. (For an explanation of BERT itself, the OGIS Research Institute’s article is a helpful reference.)
The first step is to get the JRTE corpus and the sample scripts, and to install the libraries.
$ git clone https://github.com/megagonlabs/jrte-corpus_example
$ cd jrte-corpus_example
$ git clone https://github.com/megagonlabs/jrte-corpus
$ pip3 install poetry
$ poetry install --no-root
The next step is to build the models by fine-tuning with the training script. On a GPU (NVIDIA V100-SXM2), this took about 1.5 minutes each for PN and RHR and about 5 minutes for RTE.
$ poetry run python3 ./train.py -i ./jrte-corpus/data/pn.tsv -o ./model-pn --task pn
$ poetry run python3 ./train.py -i ./jrte-corpus/data/rhr.tsv -o ./model-rhr --task rhr
$ poetry run python3 ./train.py -i './jrte-corpus/data/rte.*.tsv' -o ./model-rte --task rte
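The exact input format is defined in the jrte-corpus repository; as a rough, hypothetical sketch of the kind of tab-separated parsing a training script like this performs (the column layout below is assumed for illustration, not taken from the repository):

```python
import csv
import io

def read_pn_rows(text):
    """Parse tab-separated rows assumed to be (example id, label, sentence).

    This is an illustrative sketch; the real column layout is defined
    in the jrte-corpus repository.
    """
    rows = []
    for cols in csv.reader(io.StringIO(text), delimiter="\t"):
        example_id, label, sentence = cols[0], int(cols[1]), cols[2]
        rows.append((example_id, label, sentence))
    return rows

# A hypothetical two-row sample in that assumed layout.
sample = ("pn-0001\t1\tThe food was delicious.\n"
          "pn-0002\t-1\tThe room was cramped.\n")
print(read_pn_rows(sample))
```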
Verifying that the model works
Now, let's set up a classification server using transformers-cli's serve to test the models.
$ poetry run transformers-cli serve --task sentiment-analysis --model ./model-pn --port 8900
$ curl -X POST -H "Content-Type: application/json" "http://localhost:8900/forward" -d '{"inputs":["ご飯が美味しいです。", "3人で行きました。" , "部屋は狭かったです。"] }'
{"output":[{"label":"pos","score":0.8015708923339844},{"label":"neu","score":0.47732535004615784},{"label":"neg","score":0.42699119448661804}]}
Each of the three sentence inputs ("The food is delicious.", "We went as a group of three.", "The room was cramped.") is given the correct label: positive, neutral, and negative, respectively.
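The server returns one best-scoring label with its score per input; here is a small sketch of reading that response in Python, using the JSON shown above:

```python
import json

# The JSON body returned by the classification server for the three inputs.
response = ('{"output":[{"label":"pos","score":0.8015708923339844},'
            '{"label":"neu","score":0.47732535004615784},'
            '{"label":"neg","score":0.42699119448661804}]}')

# Collect the predicted label for each input sentence.
predictions = [item["label"] for item in json.loads(response)["output"]]
print(predictions)  # ['pos', 'neu', 'neg']
```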
$ poetry run transformers-cli serve --task sentiment-analysis --model ./model-rhr --port 8901
$ curl -X POST -H "Content-Type: application/json" "http://localhost:8901/forward" -d '{"inputs":["ご飯が美味しいです。", "3人で行きました。"] }'
{"output":[{"label":"yes","score":0.9653761386871338},{"label":"no","score":0.8748807907104492}]}
The two sentence inputs ("The food is delicious.", "We went as a group of three.") are correctly labeled yes (mentions a hotel feature) and no, respectively.
$ poetry run transformers-cli serve --task sentiment-analysis --model ./model-rte --port 8902
$ curl -X POST -H "Content-Type: application/json" "http://localhost:8902/forward" -d '{"inputs":[["風呂がきれいです。", "食事が美味しいです" ] , [ "暑いです。", "とても暑かった"]] }'
{"output":[{"label":"NE","score":0.9982748627662659},{"label":"E","score":0.9790723323822021}]}
Each sentence pair ("The bath is clean." / "The food is delicious," and "It is hot." / "It was very hot") is correctly labeled "not entailed" (NE) and "entailed" (E), respectively. Of course, this does not mean the model labels every input correctly; it can still make mistakes.
Model Performance Evaluation
Now, let's measure accuracy using the evaluation data.
$ poetry run python3 ./train.py --evaluate -i ./jrte-corpus/data/pn.tsv --base ./model-pn --task pn -o ./model-pn/evaluate_output.txt
$ awk '{if($1==$2){ok+=1} } END{ print(ok, NR, ok/NR) }' ./model-pn/evaluate_output.txt
464 553 0.83906
$ poetry run python3 ./train.py --evaluate -i ./jrte-corpus/data/rhr.tsv --base ./model-rhr --task rhr -o ./model-rhr/evaluate_output.txt
$ awk '{if($1==$2){ok+=1} } END{ print(ok, NR, ok/NR) } ' ./model-rhr/evaluate_output.txt
490 553 0.886076
$ poetry run python3 ./train.py --evaluate -i './jrte-corpus/data/rte.*.tsv' --base ./model-rte --task rte -o ./model-rte/evaluate_output.txt
$ awk '{if($1==$2){ok+=1} } END{ print(ok, NR, ok/NR) } ' ./model-rte/evaluate_output.txt
4903 5529 0.886779
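The awk one-liners above count lines whose first two fields (gold and predicted label) match and divide by the total line count. The same computation in Python, assuming the same one-example-per-line output format:

```python
def accuracy(lines):
    """Count lines whose first two whitespace-separated fields
    (gold label, predicted label) agree, mirroring the awk one-liner."""
    ok = sum(1 for line in lines if line.split()[0] == line.split()[1])
    return ok, len(lines), ok / len(lines)

# A hypothetical four-line evaluation output: "<gold> <predicted>" per line.
sample_lines = ["pos pos", "neg pos", "neu neu", "pos pos"]
print(accuracy(sample_lines))  # (3, 4, 0.75)
```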
This was a brief example of how to build classification models with the JRTE corpus. Accuracy can be improved further by refining the model's structure and parameters, or by separately collecting and labeling examples the model tends to get wrong and adding them to the training data. Because the JRTE corpus is written in Japanese, error analysis is easy for Japanese speakers, and the corpus can serve as a tutorial for those who are new to text classification.
Conclusion
Megagon Labs will continue to enhance the research capabilities of the broader academic community in Japanese natural language processing by continuously releasing research data that will benefit researchers and students at public research institutions and universities.
Written by: Yuta Hayashibe / Megagon Labs Tokyo
Tag: Sentiment Polarity Analysis/感情極性分析, Recognizing Textual Entailment/含意関係認識, BERT