Last month I participated in the 3rd Workshop on Data Science with Human in the Loop @ KDD 2021 (@dash_workshop, #KDD2021). At the workshop I gave a talk on the human-in-the-loop (HITL) from a practitioner’s perspective and also participated in a panel. The workshop overall was a success! There were many great presentations, papers, and lively discussions!
In this blog I will recap some of the points I made during my talk and the panel and identify some directions for the future of human-in-the-loop (HITL) research.
Let’s start with a few observations:
- HITL is actually many loops involving humans in varied roles, with stakeholders in product, engineering, and end-users.
- There is an increasing need to incorporate human feedback across the entire system lifecycle.
- Inputs to ML are increasingly more than just labeled data. Downstream feedback, in various forms (policies, rules, etc.) from stakeholders and users, needs to loop into the process.
- Data labeling is evolving with a more sophisticated toolset.
- There is an opportunity for novel techniques for collecting and incorporating such feedback.
Human Aspects of HITL
As an HCI researcher, I studied “human-in-the-loop” in the context of complex systems management, where systems do more than just take direct commands from humans: they operate through objectives and policies expressed in various forms. As I progressed in my career, I couldn’t help but notice the similarities between these systems and HITL in data science.
For example, I observed that there are many stakeholders in both complex systems management and the practice of data science. In data science, there are many loops and many people with different roles (e.g. data scientist, product manager, machine learning engineer) who provide input into the process that varies widely in form and semantics.
Human Input: Not just Labels
Traditionally, human input to machine learning has mostly come in the form of labeled data, where each instance is assigned a label, for example a class label such as ‘spam’ in a document classification task. Data augmentation techniques certainly help reduce the burden on users by expanding the training sets, yet human control in such approaches has been limited. There are also novel approaches in which humans create labeling functions, through code or through examples, that in turn do the labeling, as in data programming by demonstration.
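To make this concrete, here is a minimal sketch of what such a labeling function might look like (the task, phrases, and names are illustrative and not tied to any particular project or library):

```python
# Minimal sketch of programmatic labeling functions (illustrative only).
# Instead of labeling every document by hand, a domain expert encodes a
# heuristic once, and it labels many instances automatically.

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_contains_spam_phrase(text: str) -> int:
    """Vote SPAM if the document contains common spam phrases, else abstain."""
    phrases = ("free offer", "click here", "limited time")
    return SPAM if any(p in text.lower() for p in phrases) else ABSTAIN

def lf_trusted_sender(sender: str, trusted: set) -> int:
    """Vote NOT_SPAM for documents from trusted senders, else abstain."""
    return NOT_SPAM if sender in trusted else ABSTAIN

# Votes from many such functions are typically aggregated (e.g., by majority
# vote or a generative label model) into training labels.
docs = [("Free offer! Click here now", "unknown@example.com"),
        ("Meeting notes attached", "colleague@corp.com")]
trusted = {"colleague@corp.com"}
votes = [(lf_contains_spam_phrase(t), lf_trusted_sender(s, trusted)) for t, s in docs]
print(votes)  # [(1, -1), (-1, 0)]
```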
The problem is that there is still a large gap in capturing downstream human feedback and folding it into models, particularly feedback from people closer to production. Human input is crucial for defining and refining the requirements of the model(s), iterating over alternatives, identifying metrics to monitor during development and post-production, and supporting decision making in general, as people provide crucial context to the solution. Understanding the form, semantics, and context of these inputs is the key to the successful implementation of HITL in data science and machine learning in general.
Let’s take a look at one of the data science projects we did recently and use it as a use case to make observations on operationalizing human(s)-in-the-loop(s) processes effectively in data science and machine learning organizations.
A Case Study: Question Deduplication
Often websites have a Q/A platform where end-users can ask questions about a product, a restaurant, a company, etc. Over time, duplicate or similar questions can decrease the utility of these Q/A platforms. As such, businesses are interested in removing duplicate questions and thereby increasing search quality and improving user experience (see, for example, the Quora Question Pairs dataset). Then, when users search for information, they can find the right question easily and don’t need to jump from one question to another. The technical work involves identifying and consolidating user-generated questions: removing duplicates, selecting a representative question (one of the existing questions), and merging all answers.
To give a few examples from the HR domain, there are questions like: “If you were to leave X, what would be the reason?”, “Why did you quit your job?”, and “Why do so many leave at X warehouse?” As you can see, the problem is not straightforward. Even for questions phrased similarly, interpreting them requires a lot of context and domain knowledge. Since the penalty for merging the wrong questions is high, there was a high bar for precision (0.95 in our case), while recall was less of an issue, since missing a merge essentially means no change to existing content. Obviously, the preference was to achieve as much recall as possible.
While the technical details don’t matter to the points I want to make, at a high level we encode each question into a high-dimensional representation and do basic clustering, so that each cluster is treated as a set of questions that could be merged. We then select one of the questions as the representative question and merge all answers (see the sketch below).
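As an illustration, the encode-and-cluster step might look like the following sketch; the sentence-transformers model and the DBSCAN parameters here are assumptions for the example, not necessarily what we used:

```python
# Minimal sketch of the encode-and-cluster idea (encoder and clustering
# method are illustrative choices).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

questions = [
    "Why did you quit your job?",
    "If you were to leave X, what would be the reason?",
    "What benefits does X offer?",
]

# Encode each question into a dense vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(questions, normalize_embeddings=True)

# Cluster by cosine distance; each cluster is a candidate merge group.
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(embeddings)
print(list(zip(questions, labels)))
```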
Naturally, we did several iterations over the techniques, tried different ideas for encodings, employed different clustering techniques, played with different params, etc., so that we could meet the high bar. As for the selection of a representative question, our initial default was to select the question closest to the center of the cluster.
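That centroid-based default could look roughly like this (a sketch, not the production code):

```python
import numpy as np

def representative_question(embeddings: np.ndarray, members: list[int]) -> int:
    """Pick the cluster member whose embedding is closest to the cluster centroid."""
    centroid = embeddings[members].mean(axis=0)
    distances = np.linalg.norm(embeddings[members] - centroid, axis=1)
    return members[int(np.argmin(distances))]

# e.g., for a cluster containing questions 0 and 1 from the sketch above:
# rep_index = representative_question(embeddings, [0, 1])
```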
In the initial phases, the people in the loop were primarily the data scientists, as they observed output, examined results, and tried different approaches.
Naturally, the next step was to share the data with our partners, have them review it, and see what they thought. There were about 2M questions from around 100 companies, so we needed to do something to ease this human review. The idea was to have them review data that we thought were near the decision border, somewhat like uncertainty sampling. So, we implemented a basic classification algorithm that produced confidence scores and, based on those scores, selected questions for review by the content manager. This classification algorithm was trained on a sample dataset we labeled ourselves. We received feedback on the question clusters, which came back in a spreadsheet, with annotations on individual questions that should not be merged as well as on questions/clusters that should be merged.
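A rough sketch of this confidence-based selection for review, in the spirit of uncertainty sampling, is below; the pair features and numbers are made up for illustration:

```python
# Sketch of confidence-based selection for human review (uncertainty sampling).
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_labeled: features of question pairs (e.g., embedding similarity, length diff)
# y_labeled: 1 if the pair should be merged, 0 otherwise
X_labeled = np.array([[0.92, 0.1], [0.31, 0.8], [0.88, 0.2], [0.20, 0.5]])
y_labeled = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X_labeled, y_labeled)

# Score unlabeled candidate pairs and send the least certain ones to the
# content manager for review.
X_candidates = np.array([[0.55, 0.4], [0.95, 0.1], [0.50, 0.6], [0.10, 0.9]])
probs = clf.predict_proba(X_candidates)[:, 1]
uncertainty = np.abs(probs - 0.5)        # closest to the decision border
review_order = np.argsort(uncertainty)   # most uncertain first
print(review_order[:2])                  # indices of pairs to review first
```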
On our end, this feedback from the content manager (CM) and product manager (PM) was translated, through interactions with the data scientists on the project, into changes to the algorithms, params, and pre- and post-processing. We also received feedback on the cluster-representative question from the PM, and over several iterations the criteria for the representative question emerged as a ranked list of rules: “choose the seeded question if it is in the cluster”, “select questions that mention the company specifically”, “prefer questions that have more answers”, and “prefer questions that are well-formed, grammatically.”
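Encoded as code, that ranked list of rules might look roughly like the sketch below; the field names and scoring details are illustrative assumptions:

```python
# Sketch of ranked-rule selection of the representative question (illustrative).
def pick_representative(cluster: list[dict], seeded_ids: set, company: str) -> dict:
    def rank(q: dict):
        # Python compares tuples element by element, so earlier rules dominate.
        return (
            q["id"] not in seeded_ids,                 # 1. prefer the seeded question
            company.lower() not in q["text"].lower(),  # 2. prefer questions mentioning the company
            -q["num_answers"],                         # 3. prefer questions with more answers
            -q["grammar_score"],                       # 4. prefer well-formed questions
        )
    return min(cluster, key=rank)

cluster = [
    {"id": 1, "text": "Why did you quit your job?", "num_answers": 3, "grammar_score": 0.9},
    {"id": 2, "text": "Why do so many leave at X warehouse?", "num_answers": 7, "grammar_score": 0.7},
]
print(pick_representative(cluster, seeded_ids=set(), company="X")["id"])  # 2
```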
Feedback came during discussions in meetings and through many iterations in which we shared intermediate results; our partners reviewed and annotated them, we captured the feedback as to-dos, and we built it into a post-processing step as rules for selecting representative questions.
A key point to note here is the necessity of capturing provenance throughout this whole process. There were many sheets shared and annotated at different times, for different versions of the data, produced by different models and parameters. Verbal feedback was received at different times and encoded as rules in different versions of the software. This clearly illustrates the necessity of MLOps for operationalizing HITL data science, specifically the need to capture params for different runs, versions of code, models, artifacts, etc.
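Even a lightweight per-run provenance record goes a long way. Here is a minimal sketch; dedicated tools such as MLflow or DVC handle this far more thoroughly, and the field names below are illustrative:

```python
# Sketch of a minimal per-run provenance record (illustrative field names).
import hashlib
import json
import time
from pathlib import Path

def record_run(params: dict, code_version: str, data_path: str, output_path: str) -> dict:
    """Append a small provenance record tying params, code, data, and outputs together."""
    data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()[:12]
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,               # e.g., clustering eps, confidence threshold
        "code_version": code_version,   # e.g., a git commit hash
        "data_hash": data_hash,         # which version of the data produced this run
        "output": output_path,          # e.g., the sheet shared for review
    }
    with Path("runs.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```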
Once the staged results were acceptable and confirmed, the solution went into production with an A/B test, where part of the user population saw the merged questions and others did not. The results were evaluated against business metrics, such as organic traffic, among others, and measurements of these metrics provided feedback into the loop.
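As a sketch, comparing a business metric between the two arms can be as simple as a two-sample test; the metric and the numbers below are made up for illustration:

```python
# Sketch of comparing a business metric across A/B arms (illustrative numbers).
from scipy import stats

# e.g., daily organic search sessions per company page in each arm
control = [120, 135, 128, 140, 119, 131]
treatment = [138, 142, 150, 139, 145, 141]

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```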
The last stage is the promotion of the solution to full-scale production. But as we all know, it doesn’t end there: ML engineers continue to track metrics, examine data/model drift, and provide feedback into the process at any point, sometimes all the way back to the initial steps.
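A minimal sketch of such a drift check, comparing the current distribution of a model input or score against a reference snapshot (synthetic data for illustration):

```python
# Sketch of a simple drift check using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

reference = np.random.default_rng(0).normal(0.60, 0.10, 1000)   # scores at launch
current = np.random.default_rng(1).normal(0.55, 0.12, 1000)     # scores this week

ks_stat, p_value = stats.ks_2samp(reference, current)
if p_value < 0.01:
    print(f"Possible drift detected (KS = {ks_stat:.3f}, p = {p_value:.1e})")
```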
In short, human input came in at several junctures and in a variety of forms.
Let’s take a look at the forms of human input (from the above case study and many other engagements we conducted) in a bit more detail.
Forms of Human Input
Through interactions with data scientists, machine learning engineers, researchers, content managers, product managers, and beyond, human input came as:
- Ideas on params, algorithms, architecture, hyperparams, pre-/post-processing
- New datasets, such as lists of keywords, sites, documents, and dictionaries
- Instance-based qualitative/quantitative feedback on algorithm output, verification, classification, comments
- Instance-based corrections, edits
- (Aggregate) business metrics, such as traffic, new signups, views, etc.
- (Aggregate) technical metrics such as precision, recall, false positive rate
- Global high-level qualitative feedback on quality, sample errors
How did this feedback find its way into the system?
It was transformed into:
- Params and parameter changes, network architectures, hyperparameters of the neural network, configuration of the models
- Pre- and post-processing steps, rules, thresholds, exceptions
- Labeled data, corrections to labeled data, more diverse labels
- Additional data, additions to dictionaries, additional reference data, corpus of text
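In practice, much of this feedback lands in code as a small, versioned configuration rather than as new labels. An illustrative example (the values and names are made up):

```python
# Illustrative post-processing configuration shaped by downstream feedback.
POST_PROCESSING = {
    "min_merge_confidence": 0.95,   # raised after precision feedback
    "never_merge_ids": [],          # explicit exceptions from the content manager
    "representative_rules": [       # ranked rules from the product manager
        "seeded_question",
        "mentions_company",
        "most_answers",
        "well_formed",
    ],
    "extra_dictionary_terms": [],   # additions to reference dictionaries
}
```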
HITL: What is next?
When we look at the loops in data science, needless to say there were many: loops with different kinds of people, loops within loops, loops going back a single step or multiple steps, and loops going all the way back to the start.
Several things to note here. First, since feedback comes in a variety of forms and at different times, at a minimum we should capture all the artifacts that represent observations and feedback and couple them with the data under review, the code, the models, the params, and everything else that shaped the results. Second, because people have different backgrounds, they use different tools for reviewing the data, for example providing feedback in Google Sheets. Last, but one of the most important points: there are many different kinds of feedback and metrics, and to optimize the whole process end-to-end we need ways to bring them together.
So, what is next for HITL research? Some predictions:
- Interesting research will be done at the intersection of human-in-the-loop, data augmentation, and active learning, aimed at increasing the level of abstraction of human input and scaling human work.
- Data augmentation (for NLP in particular) will likely be more domain-specific and more controllable by humans through languages that support abstractions and compositions.
- MLOps (provenance, metadata) will be more closely connected to HITL, at an even finer-level of granularity, maybe even at the level of individual data points.
- We will see production systems that bridge the gaps, reduce friction, and close and tighten the loops.
- Novel user experiences that connect people, data, and models will surface and be incorporated into the existing tools for a variety of users in an integrated manner.
I believe the workshop was a great success in bringing people from the machine learning, natural language processing, data management, and human-computer interaction fields together to talk about this important topic! Thanks to Yunyao Li (@yunyao_li) and Lucian Popa (@lucian_popa_us) from IBM Research for inviting me.
Written by Eser Kandogan and Megagon Labs