Text classification underlies many of the technologies that we use daily; email routing, news categorization, and social media intent identification are just a few examples. But labeling, an essential component of text classification, is a laborious, time-consuming process. To facilitate the labeling process, we developed an interactive system called Ruler. Thanks to a new framework known as data programming by demonstration (DPBD), Ruler can seamlessly synthesize labeling functions.
Ruler brings the power of data programming to domain experts by lowering the technical barriers to entry. With Ruler, you can generate large amounts of training data for text classification quickly and easily — no coding needed.
What Is Data Programming?
Before we delve into DPBD and Ruler, let’s take a brief detour to review data programming.
Most machine learning models used today are supervised and rely on large labeled training datasets. Thus, the success of machine learning is critically dependent on the availability of large amounts of high-quality labeled data. But because this data is expensive to obtain, machine learning remains hard to apply outside of resource-rich settings.
Weak supervision methods such as crowdsourcing and user-defined heuristics strive to alleviate this dilemma by enabling the use of noisy and imprecise sources to gather large training datasets. Unsurprisingly, the results can be noisy and imprecise. This is where data programming comes in.
Data programming aims to address the difficulties and exorbitant costs of collecting and curating training data by applying a programmatic approach to weak supervision. Essentially, data programming encodes domain knowledge from subject matter experts as labeling functions over the source data.
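To make the idea concrete, here is a minimal sketch in plain Python. The label names, the keyword heuristics, and the helper `apply_labeling_functions` are assumptions made for illustration; they are not taken from Ruler or Snorkel. Each labeling function votes for a class or abstains, and applying all of them to a corpus yields a matrix of weak labels.

```python
# Hypothetical label values; -1 means "this function has no opinion".
ABSTAIN, SPORTS, FINANCE = -1, 0, 1


def lf_mentions_score(text: str) -> int:
    """Vote SPORTS when the text mentions a final score."""
    return SPORTS if "final score" in text.lower() else ABSTAIN


def lf_mentions_stock(text: str) -> int:
    """Vote FINANCE when the text mentions stocks or shares."""
    lowered = text.lower()
    return FINANCE if "stock" in lowered or "shares" in lowered else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_score, lf_mentions_stock]


def apply_labeling_functions(texts):
    """Return one row per text and one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


docs = [
    "The final score was 3-1 after extra time.",
    "Shares fell sharply as the stock market opened.",
    "The weather will be mild this weekend.",
]
label_matrix = apply_labeling_functions(docs)
```

The third document matches neither heuristic, so both functions abstain on it. Gaps like this are expected in weak supervision and are why many overlapping functions are usually combined.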
However, data programming comes with a few drawbacks. While subject matter specialists possess the domain knowledge necessary to create useful labeling functions for their area of expertise, they usually lack programming experience. And even if the subject matter expert is proficient in programming, it is often difficult to translate domain knowledge into a set of rules.
To make matters worse, many subject matter experts are short on time. For instance, a healthcare expert’s knowledge would be undeniably useful for labeling functions in this domain; training models in medicine requires large volumes of high-accuracy data. But the typical medical expert’s time is limited and in high demand. They can’t possibly be expected to dedicate countless hours to the time-consuming development of labeling functions.
In short, the inaccessibility and time demands of writing labeling functions represent steep challenges to the wider adoption of data programming.
How Data Programming by Demonstration Simplifies Labeling Function Synthesis
DPBD is our new human-in-the-loop framework that relieves users from the burden of writing labeling functions. It does this by employing an intelligent synthesizer to take care of this process. With DPBD, users can steer the synthesis process at multiple semantic levels, from providing relevant rationales for their labeling choices to interactively filtering the proposed functions.
As seen in the figure below, the DPBD framework contains two input sources: the human labeler and the raw text data. In this scenario, the human labeler is a subject matter expert who has sufficient domain knowledge to extract useful data signals but does not necessarily have programming experience.
Labelers interact with the data through the labeling interface. The active sampler selects and shows them examples to maximize the benefit of interaction. The labeler then annotates the examples, selects a label class, and provides their rationale. Next, the labeling interface records the labeler’s interaction and compiles it into a set of conditions. After this, the synthesizer generates labeling functions from this information.
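The step from a recorded interaction to candidate labeling functions can be sketched as follows. This is an illustrative simplification, not the synthesizer used in Ruler: the `Demonstration` and `CandidateFunction` types, the "contains" conditions, and the choice to combine at most two conditions are all assumptions.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple

ABSTAIN = -1  # hypothetical value meaning "no vote"


@dataclass(frozen=True)
class Demonstration:
    """One recorded interaction: the text, the highlighted spans, the label."""
    text: str
    highlighted_spans: Tuple[str, ...]
    label: int


@dataclass(frozen=True)
class CandidateFunction:
    description: str
    apply: Callable[[str], int]


def compile_conditions(demo: Demonstration) -> List[str]:
    """Turn each highlighted span into a lowercase 'contains' condition."""
    return sorted({s.strip().lower() for s in demo.highlighted_spans if s.strip()})


def synthesize(demo: Demonstration) -> List[CandidateFunction]:
    """Propose one function per single condition and per pair of conditions."""
    conditions = compile_conditions(demo)
    candidates = []
    for size in (1, 2):
        for combo in combinations(conditions, size):
            # Default arguments bind the current combo and label to this closure.
            def apply(text: str, combo=combo, label=demo.label) -> int:
                lowered = text.lower()
                return label if all(c in lowered for c in combo) else ABSTAIN

            description = " AND ".join(f'contains "{c}"' for c in combo)
            candidates.append(CandidateFunction(description, apply))
    return candidates


demo = Demonstration(
    text="Claim your FREE prize now at www.example.com",
    highlighted_spans=("free prize", "www"),
    label=1,
)
candidates = synthesize(demo)
```

In this sketch the single-condition candidates generalize broadly, while the conjunction is more precise but covers fewer texts. Choosing among such candidates is the decision that is left to the labeler.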
After the labeling functions are synthesized, the user selects the most appropriate ones based on their domain knowledge. These selected labeling functions are then passed to the modeler, which combines and denoises the labels using Snorkel. The output is a labeling model that is trained to produce labels for the large unlabeled dataset automatically. The active sampler then selects the next data record to present to the labeler, and the cycle repeats until a stopping criterion is met (e.g., reaching the desired model quality) or the labeler decides to exit.
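To give a feel for what "combining" the votes means, the sketch below uses a plain majority vote. This is a deliberately simpler stand-in, not Snorkel's label model, which additionally estimates how accurate each labeling function is. The matrix values and the `majority_vote` helper are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical value meaning "no vote"


def majority_vote(votes, abstain=ABSTAIN):
    """Return the most common non-abstain vote; abstain on ties or no votes."""
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return abstain
    ranked = counts.most_common(2)
    # A tie between the two most frequent labels is treated as "no decision".
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return abstain
    return ranked[0][0]


# One row per document, one column per selected labeling function.
label_matrix = [
    [1, 1, ABSTAIN],
    [0, ABSTAIN, 0],
    [1, 0, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
training_labels = [majority_vote(row) for row in label_matrix]
```

Rows where the functions disagree evenly or all abstain receive no label here. A learned label model can often do better on such rows by weighting the functions it considers more reliable.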
Ruler: Document Classification Made Simple
Ruler is a first-of-its-kind interactive tool that operationalizes DPBD for document classification. As shown in Figure 1 above, a user iteratively annotates and labels text. Ruler then automatically generates labeling functions. After the user selects the most relevant functions, they can track their performance. The image below demonstrates these key interactions.
Using Ruler in Spam Classification
An example perhaps best illustrates Ruler’s capabilities. So let’s consider spam classification, also known as the reason we don’t all have to sift through hundreds of junk emails every morning. A typical labeling function for this task might check for “www” or “.com” in the text. In Python, such a function might look like this:
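As a minimal sketch (the label constants and the function name are assumptions made for illustration, not taken from Ruler or Snorkel):

```python
# Hypothetical label values; ABSTAIN means the function casts no vote.
ABSTAIN, SPAM = -1, 1


def lf_contains_url(text: str) -> int:
    """Vote SPAM when the text appears to contain a URL, else abstain."""
    lowered = text.lower()
    if "www" in lowered or ".com" in lowered:
        return SPAM
    return ABSTAIN
```

The function abstains rather than voting "not spam" when no URL is found, since the absence of a link says little about whether a message is legitimate.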
Instead of writing this labeling function as Python code, a Ruler user can simply annotate a URL as spam to achieve the same output. This is the “demonstration” part of DPBD. Ruler can also transform various interactions into functions that make use of word co-occurrence, recognize named entities like locations or dates, and more. Once the user is satisfied with the labeling functions they’ve created through Ruler, Snorkel aggregates and denoises their outputs to produce a label model.
Evaluating Ruler
Does all of this sound simple? Well, it is. But that doesn’t detract from the major benefits offered by this approach.
In our evaluation of Ruler, we found that many of the rules our participants created through manual data programming could be captured by Ruler interactions. On top of this, several participants felt that Ruler was easier to use. This suggests that even skilled programmers could benefit from exploring the space of labeling functions with Ruler, possibly in conjunction with manual data programming.
Are you curious about how to best use Ruler with custom Python functions? We’ve outlined some guidance in our documentation, which you can find here.
Summarizing the Benefits of Ruler
With Ruler, domain experts who aren’t proficient with coding can still reap the benefits of data programming. Going back to our healthcare expert example, Ruler’s capabilities translate to a much more efficient and effective labeling function generation process. The medical expert no longer needs to waste copious amounts of time, and machine learning models in this domain can still benefit from vast volumes of high-accuracy training data.
By limiting a user’s task to simple annotation and selection from suggested rules, Ruler allows for fast exploration over the space of labeling functions. With Ruler, a user can label as much training data as they like and then use it to train a more sophisticated supervised model.
To sum it all up, Ruler allows users to focus on:
✅ Choosing the right generalization of observed instances
✅ Capturing the tail end of their data distribution
And thanks to Ruler, users don’t have to worry about:
❌ Programming language implementation details
❌ Expressing rules in natural language
❌ Formalizing their intuition
We hope you’ve enjoyed this overview of DPBD and Ruler! Both hold immense promise for improving machine learning accessibility, especially for fields like healthcare, where domain experts’ time is limited and there is little tolerance for low-quality models. We’re excited to continue our research and development on both, and we hope to share more advancements soon. Stay tuned!
Interested in learning more about Ruler? Please read our research paper or check out our GitHub! Do you have any questions about how it works? Contact us today!
Written by Sara Evensen and Megagon Labs