JGLUE Hands-on: An outline of the procedure for evaluating Transformers models with JGLUE

JGLUE is the Japanese version of GLUE, jointly developed by Yahoo Japan Corporation and the Kawahara Lab at Waseda University. It is a standard evaluation benchmark for Japanese natural language processing and was released on GitHub in June 2022. This article explains how to evaluate huggingface/transformers models using JGLUE, and then uses JGLUE to evaluate the ELECTRA model used in GiNZA*.

*GiNZA is an open-source Japanese NLP library. It features one-step installation, fast and accurate Japanese tokenization, and advanced NLP functionality such as dependency parsing and named entity extraction, built on an internationally developed framework. For more information, visit our GiNZA page.

JGLUE

As of July 2022, JGLUE consists of the following five datasets.

  • Text classification
    • MARC-ja
    • (JCoLA, not released yet)
  • Sentence pair classification
    • JSTS
    • JNLI
  • Question answering
    • JSQuAD
    • JCommonsenseQA
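
Before going through each task, here is a minimal sketch of how a few examples could be read from a local clone of the JGLUE repository. The directory and file names below are assumptions based on the v1.0 layout and may differ in your checkout; the sketch assumes JSON Lines files, which several of the tasks use.

    import json
    from pathlib import Path

    # Assumed path to a local clone of https://github.com/yahoojapan/JGLUE;
    # adjust the directory and version numbers to your environment.
    JGLUE_DIR = Path("JGLUE/datasets")

    def load_jsonl(path):
        """Read one JSON object per line (the format used by JSTS, JNLI, etc.)."""
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    # Hypothetical file name; check the repository for the actual one.
    jsts_train = load_jsonl(JGLUE_DIR / "jsts-v1.0" / "train-v1.0.json")
    print(len(jsts_train), jsts_train[0]["sentence1"])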

MARC-ja

MARC-ja is a task to classify whether an Amazon review text is positive or negative.

JGLUE extracted the Japanese portion of the Multilingual Amazon Reviews Corpus (MARC), converted reviews with ratings of 1 or 2 to negative and those with 4 or 5 to positive, and used crowdsourcing to unify the labels of identical reviews that had been given different ratings.
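
As a minimal illustration of this rating-to-label conversion, a mapping function might look like the sketch below; the function name is ours, not part of JGLUE.

    def rating_to_label(star_rating: int):
        """Map a MARC star rating to a binary MARC-ja style label."""
        if star_rating in (1, 2):
            return "negative"
        if star_rating in (4, 5):
            return "positive"
        return None  # ratings outside these ranges are not mapped here

    print([rating_to_label(r) for r in (1, 2, 4, 5)])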

JSTS

JSTS is a task to estimate the semantic similarity of two sentences.

Sample

    {"sentence_pair_id": "691",
     "yjcaptions_id": "127202-129817-129818",
     "sentence1": "街中の道路を大きなバスが走っています。 (A large bus is running on a road in town.)",
     "sentence2": "道路を大きなバスが走っています。 (A large bus is running on the road.)",
     "label": "4.4"}
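
JSTS is evaluated with Pearson and Spearman correlation between the predicted and gold similarity scores (see the results table later in this article). A minimal scoring sketch with scipy, using made-up predictions, could look like this:

    from scipy.stats import pearsonr, spearmanr

    # Gold similarity scores (0-5) from JSTS and hypothetical model predictions.
    gold = [4.4, 3.0, 1.2, 5.0]
    pred = [4.1, 2.8, 1.5, 4.7]

    pearson = pearsonr(gold, pred)[0]
    spearman = spearmanr(gold, pred)[0]
    print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")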

JNLI

JNLI is a task to infer the relationship between two sentences: entailment, contradiction, or neutral.

Sample

    {"sentence_pair_id": "1157",
     "yjcaptions_id": "127202-129817-129818",
     "sentence1": "街中の道路を大きなバスが走っています。 (A large bus is running on a road in town.)",
     "sentence2": "道路を大きなバスが走っています。 (A large bus is running on the road.)",
     "label": "entailment"}
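
As a rough illustration of how a model fine-tuned on JNLI could be applied to such a pair with transformers, here is a minimal sketch. The model name is a placeholder, and the label order is an assumption that depends on how the model was fine-tuned.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Placeholder for a model fine-tuned on JNLI (3-way classification).
    model_name = "your-finetuned-jnli-model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

    # Encode the sentence pair exactly as during fine-tuning.
    inputs = tokenizer(
        "街中の道路を大きなバスが走っています。",  # sentence1 (premise)
        "道路を大きなバスが走っています。",        # sentence2 (hypothesis)
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Assumed label order; use the id2label mapping of your own model instead.
    labels = ["entailment", "contradiction", "neutral"]
    print(labels[int(logits.argmax(dim=-1))])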

JSQuAD

JSQuAD is a question-answering task in which, given a context and a question, the answer is extracted from the context.

Sample

    {
      "title": "東海道新幹線 (Tokaido Shinkansen)",
      "paragraphs": [
        {
          "qas": [
            {
              "question": "2020年(令和2年)3月現在、東京駅 - 新大阪駅間の最高速度はどのくらいか。 (What is the maximum speed between Tokyo Station and Shin-Osaka Station as of March 2020?)",
              "id": "a1531320p0q0",
              "answers": [
                {
                  "text": "285 km/h",
                  "answer_start": 182
                }
              ],
              "is_impossible": false
            },
            {
              ..
            }
          ],
          "context": "東海道新幹線 [SEP] 1987年(昭和62年)4月1日の国鉄分割民営化により、JR東海が運営を継承した。西日本旅客鉄道(JR西日本)が継承した山陽新幹線とは相互乗り入れが行われており、東海道新幹線区間のみで運転される列車にもJR西日本所有の車両が使用されることがある。2020年(令和2年)3月現在、東京駅 - 新大阪駅間の所要時間は最速2時間21分、最高速度285 km/hで運行されている。"
        }
      ]
    }
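
Below is a minimal sketch of extractive question answering with transformers on a JSQuAD-style example. The model name is a placeholder, and the context is shortened here for brevity.

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # Placeholder for a model fine-tuned on JSQuAD.
    model_name = "your-finetuned-jsquad-model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    question = "東京駅 - 新大阪駅間の最高速度はどのくらいか。"
    context = "2020年(令和2年)3月現在、東京駅 - 新大阪駅間の所要時間は最速2時間21分、最高速度285 km/hで運行されている。"

    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Take the most likely start/end token positions and decode the span.
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))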

JCommonsenseQA

JCommonsenseQA is a question-answering task in which you are given a question and answer choices and must choose the correct option. Since no context is given, common sense is required to answer the questions.

Sample

    {"q_id": 3016,
     "question": "会社の最高責任者を何というか? (What do you call the chief executive officer of a company?)",
     "choice0": "社長 (president)",
     "choice1": "教師 (teacher)",
     "choice2": "部長 (manager)",
     "choice3": "バイト (part-time worker)",
     "choice4": "部下 (subordinate)",
     "label": 0}
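
For JCommonsenseQA, a fine-tuned multiple-choice model scores the question paired with each choice. The following is a minimal sketch with transformers; the model name is a placeholder.

    import torch
    from transformers import AutoTokenizer, AutoModelForMultipleChoice

    # Placeholder for a model fine-tuned on JCommonsenseQA.
    model_name = "your-finetuned-jcommonsenseqa-model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)

    question = "会社の最高責任者を何というか?"
    choices = ["社長", "教師", "部長", "バイト", "部下"]

    # Encode the question paired with each choice, then add a batch dimension
    # so the tensors have shape (1, num_choices, seq_len).
    enc = tokenizer([question] * len(choices), choices, return_tensors="pt", padding=True)
    inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_choices)
    print(choices[int(logits.argmax(dim=-1))])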

Fine-tuning

Patch

To fine-tune and evaluate transformers models on JGLUE, the following patch needs to be applied to the example scripts provided with transformers.

https://github.com/yahoojapan/JGLUE/blob/main/fine-tuning/patch/transformers-4.9.2_jglue-1.0.0.patch

The differences in this patch mainly concern JSQuAD and include the following changes relative to the English SQuAD processing.

  • The F1 score is calculated on a character basis instead of a token basis (a rough sketch is shown below).
  • English-specific processing, such as the handling of articles and punctuation, is removed.
  • If the text is not pre-tokenized (*), get_final_text is not called to map the prediction back to the original text, which eliminates the effect of normalization.

※ Note that if you are experimenting with your own tokenizer other than BertJapaneseTokenizer, you will need to adjust this part.

https://github.com/yahoojapan/JGLUE/blob/53e5ecd9dfa7bbe6d84f818d599bfb4393dd639d/fine-tuning/patch/transformers-4.9.2_jglue-1.0.0.patch#L294-L303
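
Regarding the first point above, a character-based F1 between a predicted answer and a gold answer can be computed roughly as in the following sketch. This is a simplified illustration, not the exact code used in the patch.

    from collections import Counter

    def char_f1(prediction: str, gold: str) -> float:
        """F1 computed over characters instead of whitespace-separated tokens."""
        common = Counter(prediction) & Counter(gold)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(prediction)
        recall = num_same / len(gold)
        return 2 * precision * recall / (precision + recall)

    print(char_f1("285 km/h", "最高速度285 km/h"))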

Evaluation by megagonlabs/electra-base-japanese-discriminator

The scores obtained by fine-tuning the ELECTRA pre-trained model published by Megagon Labs on the Hugging Face Hub for each JGLUE task are shown below.

※ Hyperparameters and additional patches for using the ELECTRA-SudachiPy tokenizer are available in the Megagon Labs GitHub repository. Note that the scripts provided with transformers do not perform early stopping, so in this case we manually inspected the dev scores and used the results from the epoch at which they converged.

Model                                            MARC-ja  JSTS              JNLI   JSQuAD       JCommonsenseQA
(metric)                                         acc.     Pearson/Spearman  acc.   EM/F1        acc.
—                                                0.954    0.923/0.891       0.924  0.884/0.940  0.901
—                                                0.962    0.901/0.865       0.895  0.864/0.927  0.840
megagonlabs/electra-base-japanese-discriminator  0.963    0.913/0.877       0.921  0.813/0.903  0.856

Only the JSQuAD score is a little low; for the other tasks, the scores are among the best of the BASE-size models.

Conclusion

In this article, we briefly explained the datasets included in JGLUE and some points to keep in mind when fine-tuning with transformers. We very much look forward to seeing Japanese NLP development accelerate further with the release of JGLUE.

Written by: Rintaro Terada, Hiroshi Matsuda, and Megagon Labs Tokyo.

Follow us on LinkedIn and Twitter to stay up to date with us.
