Emu: Enhancing Multilingual Sentence Embeddings with Semantic Similarity

Language is perhaps humanity’s greatest invention. It allows us to communicate information, convey ideas, and collaborate on building a better future. But for emerging technologies like AI and machine learning, the difference in semantics between languages can be a tremendous barrier that limits the innovative potential of applications. To solve this problem, we developed Emu, a neural network model that can enhance multilingual sentence embeddings via semantic similarity.

We’re proud to announce that our research paper on Emu was accepted by the Association for the Advancement of Artificial Intelligence (AAAI)! On February 9th, we will be presenting this work in New York City at AAAI-20, one of the most prestigious academic conferences for AI in the world. In this blog post, we’ll delve into the current constraints of multilingual support for machine learning applications and how Emu addresses them.

 

What Are Multilingual Sentence Embeddings, and Why Are They Important?

There are roughly 6,500 spoken languages in the world today. To accommodate them, online services often offer support for multiple languages. Each of these services stands to benefit from machine learning. After all, machine learning radically improved Internet search engines and made technologies such as virtual assistants and automatic translation possible.

But multilingual support remains a persistent bottleneck for machine learning applications. For example, if you trained a chatbot with data written in English, it would not be able to answer queries in other languages such as Spanish or German. To make this work, you would need to collect data for each desired language and train the chatbot on it, which is a monumental, if not impossible, task.

To circumvent this obstacle, AI researchers created multilingual sentence embedding models. These models allow us to embed sentences from different languages as vectors in a common high-dimensional, language-agnostic semantic space. With these vectors, known as multilingual sentence embeddings, we can measure the similarity between two sentences from different languages.

 

How Multilingual Sentence Embeddings Work

Let’s look at an example to clarify this. Picture four vectors in a 3D space (a simplification, since we cannot directly visualize vectors in a high-dimensional space). Three of the vectors represent the sentence, “What time is the check-out?” in English, German, and Japanese. Because these sentences mean the same thing, their three vectors lie close to one another.

The last vector represents the sentence, “How late is the pool open?” Its distance from the rest of the vectors signals that its meaning is quite different. With this embedding space, we can see if a pair of sentences from different languages have similar meanings – without any direct translation needed.
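The comparison described above can be sketched in a few lines of Python. This is only an illustration: the 3D vectors are made-up numbers rather than the output of any real model, and cosine similarity is one common way to compare embeddings, assumed here for the sake of the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3D embeddings, hand-picked for illustration only.
embeddings = {
    "en_checkout": [0.90, 0.10, 0.20],  # "What time is the check-out?" (English)
    "de_checkout": [0.85, 0.15, 0.25],  # same question in German
    "ja_checkout": [0.88, 0.05, 0.22],  # same question in Japanese
    "en_pool":     [0.10, 0.90, 0.30],  # "How late is the pool open?"
}

query = embeddings["en_checkout"]
for name, vector in embeddings.items():
    print(f"{name}: {cosine_similarity(query, vector):.3f}")
```

With these toy numbers, the three check-out vectors score close to 1.0 against each other, while the pool question scores much lower.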

 

 

Several general-purpose multilingual sentence embedding models have been recently developed and made publicly available. The vast majority of them are trained using parallel sentences (sentences with corresponding translations) in several languages. Facebook’s LASER (Language-Agnostic SEntence Representations) is one popular example. This model can process more than 90 different languages and more than 23 different alphabets.

Ideally, multilingual sentence embeddings should place sentences with similar meanings close together, as in our example above. But because these models rely on textual similarity, this is not always the case. To address this issue, we must change our training approach.

 

Meaning Can Still Be Lost in Translation

As previously noted, most multilingual sentence embedding models train on parallel sentences and rely on textual similarity. Let’s look at another example to see how this can be problematic. (This dilemma generally arises with sentence pairs in different languages, but our example is shown in English for simplicity’s sake.)

  • S1: What time is the pool open tonight?
  • S2: What time are the stores on 5th open tonight?
  • S3: When does the pool open this evening?
 

S1 and S2 have high textual similarity because they share many words, such as “What time” and “open tonight.” S1 and S3 share the same intent, but don’t have as many words in common as S1 and S2. Since many current multilingual sentence embedding models rely heavily on textual similarity, S1 and S2 would be placed into the same region of the embedding space. But these two sentences clearly have different meanings. This is a critical issue because every language offers many different ways to express the same intent.
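To make the word-overlap effect concrete, here is a small sketch that scores the three sentences with Jaccard similarity over their word sets. Jaccard overlap is only a stand-in for “textual similarity” chosen for this illustration; it is not what LASER or any other embedding model actually computes.

```python
import re

def tokens(sentence):
    """Lowercase the sentence and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words across both sentences."""
    words_a, words_b = tokens(a), tokens(b)
    return len(words_a & words_b) / len(words_a | words_b)

S1 = "What time is the pool open tonight?"
S2 = "What time are the stores on 5th open tonight?"
S3 = "When does the pool open this evening?"

print(f"S1 vs S2: {jaccard(S1, S2):.2f}")  # 5 shared words out of 11 distinct
print(f"S1 vs S3: {jaccard(S1, S3):.2f}")  # 3 shared words out of 11 distinct
```

By this surface-level measure, S1 looks closer to S2 than to S3, even though S1 and S3 are the pair that ask the same question.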

Fortunately, we can solve this problem by focusing more on semantic similarity instead of just textual similarity. By doing so, multilingual sentence embedding models can correctly identify that S1 and S3 have similar semantic meanings and place their multilingual sentence embeddings into the same region.

 

Emphasizing Semantic Similarity With Emu

Training multilingual sentence embeddings from scratch could theoretically address the issue at hand. But this option requires a large-scale multilingual corpus, which is costly and time-consuming to create. Instead of going this route, we opted to build a solution that improves existing multilingual sentence embeddings.

We developed Emu, a neural network model that “tailors” multilingual sentence embeddings by enhancing semantic similarity. Emu takes a lightweight approach; it only requires training data in a single language (e.g., English) and unlabeled sentences in different languages. These minimal requirements make Emu an elegant and versatile solution for improving existing embedding models.

Let’s explore Emu’s model architecture to see how it operates.

 

How Emu Works

Emu consists of three components: (1) a multilingual encoder, (2) a semantic classifier, and (3) a language discriminator.

Emu is extremely flexible when it comes to the choice of the multilingual encoder and its architecture. Any multilingual sentence embedding model will work as long as it can encode sentences in different languages into vector representations with the same dimension size.
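The single requirement on the encoder can be expressed as a small interface check. The sketch below is hypothetical: `MultilingualEncoder`, `check_shared_dimension`, and `ToyEncoder` are names invented for this illustration and are not part of Emu or LASER.

```python
from typing import Dict, List, Protocol, Sequence

class MultilingualEncoder(Protocol):
    """Hypothetical interface: anything that maps sentences to vectors."""
    def encode(self, sentences: Sequence[str]) -> List[List[float]]: ...

def check_shared_dimension(encoder: MultilingualEncoder,
                           sentences_by_language: Dict[str, Sequence[str]]) -> int:
    """Return the embedding size if every language yields the same one."""
    dims = set()
    for sentences in sentences_by_language.values():
        for vector in encoder.encode(sentences):
            dims.add(len(vector))
    if len(dims) != 1:
        raise ValueError(f"expected one embedding size, found {sorted(dims)}")
    return dims.pop()

class ToyEncoder:
    """Stand-in encoder producing meaningless fixed-size vectors."""
    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def encode(self, sentences: Sequence[str]) -> List[List[float]]:
        return [[float(len(s) % (i + 2)) for i in range(self.dim)]
                for s in sentences]

samples = {
    "en": ["What time is the pool open?"],
    "de": ["Wann ist das Schwimmbad geöffnet?"],
}
print(check_shared_dimension(ToyEncoder(dim=4), samples))
```

Any real encoder that passes a check like this could, in principle, be swapped in without changing the other two components.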

By using ground-truth semantic labels like “pool” or “restaurant,” the semantic classifier categorizes input sentences into groups that share the same intent, such as “seeking pool information” or “seeking restaurant information.” Through this refining step, the encoder learns the semantic similarity between related multilingual sentence embeddings.
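As a rough picture of what a classifier over sentence embeddings does, the sketch below applies a linear layer followed by a softmax to a toy 3D embedding. The weights, biases, and labels are hand-picked assumptions for illustration; in Emu the classifier is learned during training, and its exact architecture and loss are described in the paper rather than here.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embedding, weights, biases, labels):
    """Score each intent with a linear layer, then pick the most likely one."""
    scores = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical parameters: one weight row and one bias per intent label.
LABELS = ["pool", "restaurant"]
WEIGHTS = [[2.0, -1.0, 0.0],
           [-1.0, 2.0, 0.0]]
BIASES = [0.0, 0.0]

label, probs = classify([0.9, 0.1, 0.2], WEIGHTS, BIASES, LABELS)
print(label, [round(p, 3) for p in probs])
```

During training, the error of such a classifier is what pushes the encoder to move sentences with the same intent closer together.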

Since the semantic classifier uses data from only one language, it knows nothing about other languages. As a result, the multilingual model may overfit to that single language and perform worse on the others. The language discriminator plays a key role in preventing this problem.

Essentially, the language discriminator enhances the multilinguality of the embeddings through multilingual adversarial training. It is trained to determine whether the languages of two input embeddings are different. For example, if the language discriminator encounters the sentence “What time is the pool open?”, it learns to predict “English.” If it encounters the sentence “Wann ist das Schwimmbad geöffnet?”, it learns to predict “German.”

While this is occurring, the multilingual encoder tries to “confuse” the discriminator as much as possible. If the trained language discriminator can no longer distinguish the languages, then the encoder is generating sentence embeddings that are truly multilingual. Incorporating this adversarial training allows us to improve Emu’s multilinguality without using parallel sentences.
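The tug-of-war between the encoder and the discriminator can be illustrated with a generic adversarial objective. This is a simplified sketch under our own assumptions (cross-entropy losses and a hypothetical weighting term `adv_weight`), not the exact formulation used in the Emu paper.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

def encoder_objective(intent_probs, intent_label,
                      lang_probs, lang_label, adv_weight=0.1):
    """Loss the encoder minimizes in this generic adversarial setup.

    The encoder wants a low intent-classification loss and a HIGH
    discriminator loss, so the discriminator loss enters with a minus sign.
    """
    semantic_loss = cross_entropy(intent_probs, intent_label)
    discriminator_loss = cross_entropy(lang_probs, lang_label)
    return semantic_loss - adv_weight * discriminator_loss

intent_probs = [0.8, 0.2]  # classifier output for the true intent (index 0)

# Discriminator easily recognizes the language: little reward for the encoder.
confident = encoder_objective(intent_probs, 0, [0.99, 0.01], 0)
# Discriminator is reduced to guessing: the encoder's objective improves.
confused = encoder_objective(intent_probs, 0, [0.5, 0.5], 0)

print(f"confident discriminator: {confident:.4f}")
print(f"confused discriminator:  {confused:.4f}")
```

The encoder's objective is lower when the discriminator is confused, which is exactly the pressure that makes the embeddings less language-specific.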

 

Evaluating Emu's Performance

To evaluate Emu, we conducted experiments with three cross-lingual intent classification tasks using only monolingual labeled data. LASER acted as our base multilingual sentence embedding model. Emu was able to successfully categorize sentences written in six different languages (English, German, Spanish, French, Japanese, and Chinese) into dozens of intent classes.

Furthermore, the results indicate that Emu outperformed the original LASER model and monolingual sentence embeddings with machine translation by up to 47.7% and 86.2%, respectively. These numbers demonstrate that, by enhancing semantic similarity, Emu can help overcome the obstacle of multilingual support in machine learning applications and ensure that innovation is never lost in translation again.

Are you interested in learning more about Emu? Our paper is available online, and we will release the code soon. And feel free to contact us today if you have any questions!

 

Written by Yoshihiko Suhara and Megagon Labs

 

References

Wataru Hirota, Yoshihiko Suhara, Behzad Golshan, Wang-Chiew Tan, “Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization,” AAAI, February 2020
