There is consensus, especially in our current deep-learning era, that more training data almost always helps improve the performance of deep learning models. But collecting labeled data remains a costly and cumbersome task. Naturally, researchers have looked into this problem, which has led to the development of various techniques for reducing labeling cost. Among these is a popular technique called weak supervision, in which a collection of heuristics and rules is used to label the data. The resulting labels are of course noisy, but these weak labels have proven valuable as long as the rules have a reasonable error rate.
How to find labeling rules?
Finding labeling rules is not easy, and several factors contribute to that.
- Coming up with good rules is a tedious and time-consuming task: it involves reviewing samples from the data, looking for patterns, expressing those patterns in the rule language, evaluating the rule's output, adjusting the rule to reduce its mistakes, and repeating the process.
- A labeling rule has to be executable (i.e., it must automatically produce a label given a data point). This requires rules to be specified in a particular language, such as regular expressions or IKE (from AI2). As a result, only those who are familiar with the rule language can contribute to the process of writing rules.
- It is difficult to parallelize the process of writing rules as people often come up with identical or similar labeling rules.
To address these problems, we have created Darwin, an interactive tool that facilitates rule discovery.
Finding rules with Darwin
Imagine that you want to label a corpus of messages between hotel guests and their concierge. The goal is to find any message in which guests are asking for directions (from different places to the hotel, or from the hotel to different attractions). The figure below shows a few positive (highlighted in green) and negative (highlighted in red) examples for this task. In a nutshell, we are interested in a collection of rules that discovers as many positive instances as possible with high precision.
Here is how Darwin helps find accurate labeling rules. Starting from a couple of positively labeled instances (like S1, S2, and S4 shown above), Darwin automatically mines rules that are considered both useful and accurate and suggests them (along with a handful of examples labeled by each rule) to annotators, as shown in the image below.
In this case, the annotator is asked to verify the rule “//way/ADJ^//from/PNOUN”. Annotators who are familiar with the rule language can verify it directly. If not, they can use the provided examples to report whether the rule seems accurate. Moreover, annotators are no longer required to come up with rules; they only need to verify them. Lastly, multiple annotators can contribute to the task, as Darwin assigns different candidate rules to each annotator to verify.
How does Darwin work?
Darwin uses two complementary ideas to find good rules. The first idea, which inspired the name Darwin, is to modify existing rules and evolve them into new ones. To achieve this, Darwin uses the grammar of the rule language to determine which rules are generalizations of an existing rule and which are specializations of it. If annotators mark a candidate rule as too noisy, Darwin tries to make it more specific. Conversely, if a rule is verified as accurate, Darwin generalizes it in the hope of covering more instances. To make this concrete, consider a simple rule language: label the text as positive if it mentions a particular phrase p. Starting with the seed rule p=”best way to get to”, Darwin evolves the rule as shown in the image below to find another rule, p=”shuttle to”. In a nutshell, if annotators verify a rule as correct, Darwin drops part of the phrase to generalize it; when the rule is marked as noisy, Darwin adds specific words to the phrase to make it more specific.
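For the simple phrase-rule language above, the evolve step can be sketched in a few lines of Python. This is an illustrative sketch, not Darwin's actual implementation: the function names are hypothetical, generalization drops a word from either end of the phrase, and specialization extends the phrase with words that surround it in the corpus.

```python
def generalize(phrase):
    """Drop one word from either end to produce broader candidate rules."""
    words = phrase.split()
    if len(words) <= 1:
        return []
    return [" ".join(words[1:]), " ".join(words[:-1])]

def specialize(phrase, corpus):
    """Extend the phrase with words that precede or follow it in the corpus."""
    candidates = set()
    n = len(phrase.split())
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            if " ".join(words[i:i + n]) == phrase:
                if i + n < len(words):            # append the next word
                    candidates.add(phrase + " " + words[i + n])
                if i > 0:                         # prepend the previous word
                    candidates.add(words[i - 1] + " " + phrase)
    return sorted(candidates)

corpus = [
    "what is the best way to get to the airport",
    "is there a shuttle to the convention center",
    "the best way to relax is the rooftop pool",
]

# If annotators approve "best way to get to", generalizing drops an end word:
print(generalize("best way to get to"))   # ['way to get to', 'best way to get']
# If "way to" is rejected as too noisy, specializing adds context words:
print(specialize("way to", corpus))       # ['best way to', 'way to get', 'way to relax']
```

Annotator feedback decides which of these candidates survive to the next round, which is the evolutionary loop that gives the tool its name.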
The second idea is to utilize a text classifier that is periodically retrained on the data labeled by Darwin during the process. Using this classifier, Darwin can guess which other sentences are likely to be positive as well, and suggest rules that capture them even if those rules are not generalizations or specializations of any current rule.
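The classifier-in-the-loop idea can be sketched with a toy bag-of-words scorer. Everything here is an assumption for illustration (Darwin's actual classifier is not specified in this post): we score unlabeled sentences by how much their words resemble the positive class, then surface phrases from the top-scoring sentence as new rule candidates.

```python
from collections import Counter

def train_scorer(positives, negatives):
    """Toy stand-in for the periodically retrained text classifier."""
    pos = Counter(w for s in positives for w in s.lower().split())
    neg = Counter(w for s in negatives for w in s.lower().split())
    def score(sentence):
        words = sentence.lower().split()
        # Higher score = the sentence's vocabulary leans positive.
        return sum(pos[w] - neg[w] for w in words) / max(len(words), 1)
    return score

def candidate_phrases(sentence, n=2):
    """Bigrams of a likely-positive sentence, proposed as candidate rules."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

positives = ["best way to get to the airport",
             "directions from the station to the hotel"]
negatives = ["can i get a late checkout",
             "please send extra towels to my room"]
unlabeled = ["is there a shuttle to the convention center",
             "i would like to book a spa appointment"]

score = train_scorer(positives, negatives)
best = max(unlabeled, key=score)
print(best)                      # the sentence the scorer deems most positive
print(candidate_phrases(best))   # phrases to propose as candidate rules
```

Note how a phrase like "shuttle to" can surface this way even though no sequence of generalizations or specializations of the current rules would reach it.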
To carry out the ideas above, Darwin requires a smart indexing scheme over the entire corpus. The index enables Darwin to efficiently fetch the sentences matching each candidate rule. The index is created offline by treating each sentence as a derivation of the rule grammar; that is, each sentence “s” is viewed as a very strict rule that matches only itself. Naturally, any generalization of this rule also matches the sentence “s”. The sentence “s” is then stored in the index along with all the rules that match it.
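For the phrase-rule language, this index reduces to mapping every phrase rule that matches a sentence (i.e., every contiguous n-gram of it) to that sentence's id, so fetching the sentences for a candidate rule is a single lookup. The layout below is a stdlib-only sketch under that assumption, not Darwin's actual index structure.

```python
from collections import defaultdict

def build_index(corpus, max_n=4):
    """Offline index: phrase rule -> ids of sentences the rule matches."""
    index = defaultdict(set)
    for sid, sentence in enumerate(corpus):
        words = sentence.lower().split()
        for n in range(1, min(max_n, len(words)) + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(sid)
    return index

corpus = [
    "what is the best way to get to the airport",
    "is there a shuttle to the convention center",
]
index = build_index(corpus)
print(sorted(index["shuttle to"]))   # sentence ids matching the candidate rule
```

Trading offline index-construction time for constant-time rule lookups is what keeps the interactive verify-and-evolve loop responsive.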
How well does Darwin work? (or Insights from experimental results)
To test how well Darwin facilitates the process of finding labeling rules, we considered four different labeling tasks on four different datasets. We observed that with only 100 annotations (i.e., approvals or rejections of rules), Darwin reaches a 0.8 F1-score on all datasets. Note that the entire cycle of reviewing the data, coming up with labeling rules, expressing them in the rule language, debugging them, evaluating whether each rule was effective, and repeating the process is now replaced by 100 annotations, which can be parallelized as well. Finally, it is worth mentioning that Darwin works particularly well on imbalanced datasets (i.e., when the number of positive instances is a tiny fraction of the entire corpus).