Weedle: Composable Dashboard for Data-Centric NLP in Computational Notebooks

When you are building machine learning (ML) models for real-world applications, having high-quality data is just as important as having a good model. To help NLP researchers and practitioners understand and improve their data, we introduce Weedle, an exploratory text analysis tool for data-centric NLP. Here are Weedle’s biggest strengths:

  • Weedle’s design is grounded in our analysis of public NLP-related notebooks.
  • Weedle is implemented as a Python package containing a Jupyter widget, which can be seamlessly integrated into existing ML workflows, allowing users to explore their data at every stage of the ML process.
  • Weedle is equipped with built-in text support and an integrated composable dashboard. 

Before we dive deeper into Weedle, let us explain what “data-centric” means.

What is data-centric AI?

Traditionally, researchers have built machine learning (ML) models by tweaking the models to improve their performance on public benchmark datasets. However, with widely available models such as BERT performing well across a variety of tasks, the attention of ML practitioners is shifting toward improving the quality of the data rather than building better models. This approach is called data-centric AI.

For data-centric AI, you first have to understand your data. That is, you need to carefully examine and diagnose the underlying data throughout the entire ML lifecycle. For instance, before and during training, you need to check whether you have enough data, whether the labels are consistent and well-distributed, and whether there is any noise or bias. During modeling, you need to identify any underperforming data subsets. After deployment, you need to monitor the data to ensure it isn't drifting too much. Insights and findings from these analyses lead to additional rounds of data wrangling, debugging, collection, annotation, or augmentation to improve your data.
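A few of these pre-training checks take only a couple of lines of pandas. Below is a minimal sketch, assuming the data sits in a DataFrame with (illustrative) text and label columns:

```python
import pandas as pd

# Toy dataset standing in for a real corpus; the column names
# ("text", "label") are assumptions for illustration.
df = pd.DataFrame({
    "text": ["great product", "terrible support", "", "great product"],
    "label": ["pos", "neg", "pos", "pos"],
})

# Is the label distribution balanced?
print(df["label"].value_counts(normalize=True))

# Are there empty documents or exact duplicates (common noise sources)?
print("empty:", (df["text"].str.strip() == "").sum())
print("duplicates:", df.duplicated(subset="text").sum())
```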

Exploratory data analysis (EDA) is a critical tool for addressing these issues, as it helps you identify patterns, detect outliers, and test assumptions. EDA tools are typically used for the initial data analysis step, but EDA principles and techniques are useful throughout the rest of the ML lifecycle for uncovering insights into data, labels, and model outputs. Unfortunately, existing EDA tools are not enough to support data-centric NLP as is.

Why can’t we use existing data exploration tools?

First, they are separate from the rest of the NLP toolchain, so you would need to switch between multiple tools. Frequently importing and exporting data between spreadsheets, programming languages, and visualization tools can be a pain. So we need exploration tools that integrate seamlessly with existing model development tools and environments. Next, most EDA tools are designed for structured data, such as tabular data. As a result, users often have to transform their text into new forms (e.g., text → word frequency, embeddings, topic models) or inspect only a few samples, as described by Wongsuphasawat et al. (2019). This calls for a tool with a dedicated data model, operations, and interfaces with built-in text support.

More recently, tools such as B2, DataPrep.EDA, Altair, Symphony, and Leam try to mitigate these challenges by integrating code and visualizations. However, these tools either lack built-in text support, lack customizability and composability, or are bespoke tools that require environment switching. In addition, they do not offer the rich interactive visualizations supported in popular EDA tools.

To address these challenges, we first have to understand how NLP researchers and practitioners examine data for their projects and identify the key requirements of EDA for data-centric NLP.

NLP notebook analysis: understanding data exploration practices

To understand current data exploration practices for NLP, we analyzed 5k+ public computational notebooks from Kaggle and GitHub.

We manually examined the 30 most-voted-on NLP notebooks on Kaggle as of July 2022 and collected frequently used text transformation functions. We observed that text data is often first converted into a structured format (e.g., frequent word distribution, ngram count, document length, distribution by labels, and embedding), which allows notebook authors to run further analyses using the visualizations shown in Table 1. We also noticed a lot of repetition in the code snippets; for example, creating the same visualization for different slices of the data was common. This may be because the visualization libraries used are static, which calls for rich interaction capabilities in EDA for data-centric NLP (the sketch after Table 1 illustrates this repetitive pattern).

  • Table view: simple overview/statistics of data, class distribution
  • Bar chart / histogram: class distribution, word count, ngram count, data item count per matching condition, document length, punctuation analysis, feature importance, embedding visualization
  • Line chart: document length, numerical trend over time
  • Scatter plot: t-SNE distribution, bivariate correlation, data item distribution, data item distribution with clustering, numerical trend over time
  • KDE plot: word count, document length
  • Pie chart: class distribution
  • Treemap: word count

Table 1: Transformation functions per visualization type, from the top 30 most-voted-on Kaggle notebooks with the NLP tag.
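To make the repetition we observed concrete, the pattern often looks like the following (an illustrative reconstruction, not code from any particular notebook): the same static histogram is re-created cell after cell, once per data slice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; column names are illustrative.
df = pd.DataFrame({
    "text": ["a b c", "a b", "a", "a b c d"],
    "label": ["pos", "neg", "pos", "neg"],
})
df["doc_length"] = df["text"].str.split().str.len()

# With a static plotting library, every new subset of interest means
# another nearly identical snippet like this one.
for label in df["label"].unique():
    df[df["label"] == label]["doc_length"].hist()
    plt.title(f"Document length ({label})")
    plt.show()
```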

These top-voted-on Kaggle notebooks tend to be instructive and well-documented, like tutorials, but they do not necessarily represent the best-performing or most practical solutions for their tasks. As a follow-up analysis, we wanted to investigate whether our findings generalize to real-world settings, specifically, whether the various visualizations in Table 1 are actively used by NLP practitioners. From the GitHub notebook dataset, we sampled the most recent 80k notebooks and retrieved the 5k notebooks that use well-known NLP-specific packages. Then, we looked at the import statements to see which visualization packages were frequently used. Focusing on the top two libraries (matplotlib and pandas), we found that people usually rely on basic visualizations such as bar charts, line charts, and scatter plots, along with plain tables, rather than fancier visualizations. These visualizations are used to find patterns in a single variable or relationships between two variables. Given the popularity of bar charts and scatter plots, we decided to support uni- and bivariate analysis using these as the base visualization types.
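Since .ipynb files are plain JSON, this kind of import analysis is straightforward to reproduce. Here is a rough sketch of how one could collect imported package names from a notebook (our actual analysis scripts may differ):

```python
import json
import re
from pathlib import Path

# Matches the top-level package in "import foo ..." or "from foo import ...".
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+(\w+)", re.MULTILINE)

def imported_packages(notebook_path):
    """Collect top-level package names imported in a notebook's code cells."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    packages = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines
            source = "".join(source)
        packages.update(IMPORT_RE.findall(source))
    return packages
```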

Weedle: EDA for data-centric NLP

Figure 1: (a) Syntax for data/widget instance initialization, transformations, and loading the dashboard. (b) A dashboard widget with interconnected charts, which can be composed programmatically or interactively. Users can filter or group data items by brushing or selecting visual objects. The filter view displays currently applied filters, and the table view shows the raw data.

Weedle is a pip-installable Python package. Its widgets are implemented using ipywidgets and Svelte. The underlying data operations use various NLP packages, including nltk, spacy, gensim, and transformers, along with data management/analysis libraries such as numpy and pandas.

Data model for text support

Input data is stored as a DataFrame in which each row represents a data item and each column represents a feature. In response to user interactions, the data is filtered and aggregated for the dashboard visualizations. For both filtered and aggregated data, Weedle keeps a history of explorations for provenance management.
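As an illustration of what such provenance tracking can look like (a sketch of the general idea, not Weedle's internals), one can record every filter expression as it is applied so that any earlier exploration state can be reproduced:

```python
import pandas as pd

history = []  # provenance: the sequence of applied filters

def apply_filter(df, query):
    history.append(query)
    return df.query(query)

df = pd.DataFrame({"label": ["pos", "neg", "pos"],
                   "doc_length": [3, 2, 5]})
subset = apply_filter(df, "label == 'pos'")
subset = apply_filter(subset, "doc_length > 3")
print(history)  # ["label == 'pos'", "doc_length > 3"]
```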

Weedle offers built-in text transformations that can be triggered with simple parameters (Fig 1a). Common functions such as document length, word count, word embedding, named entity recognition, tf-idf, topic modeling, sentiment analysis, and part-of-speech (POS) tagging are included. Users can pass optional parameters, e.g., a custom stopword list when creating a bag of words. Custom transformations can also be plugged in as plain Python functions.

Each transformation operation appends a new feature column. Features can then be used to create descriptive statistics and visualizations.
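Fig 1a shows Weedle's actual syntax; as a stand-in, here is the underlying idea expressed in plain pandas, where each transformation (built-in or user-defined) appends a feature column that downstream statistics and charts can consume. The column names and the custom function are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"text": ["Great product!", "terrible support", "ok i guess"]})

# A built-in-style transformation: appends a new feature column.
df["doc_length"] = df["text"].str.split().str.len()

# A custom transformation plugged in as a plain Python function.
def exclamation_count(text):
    return text.count("!")

df["exclamations"] = df["text"].apply(exclamation_count)

# Appended features feed descriptive statistics and visualizations.
print(df[["doc_length", "exclamations"]].describe())
```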

Composable dashboard with multiple coordinated charts

Weedle's composable dashboard widget enables EDA within computational notebooks. Each dashboard has a chart view, a filter view, a table view, and a menu bar (Fig 1b). Users can build a dashboard on demand by providing configurations programmatically or interactively in the widget. All components in a dashboard are linked: selecting and filtering in one component automatically updates the others (Fig 2). Users can resize the widget and its view components to allocate more space where needed.

Figure 2: Interactive filtering for subset analysis. All charts are interconnected by shared filters. (b) Users can select visual components, i.e., bars and dots, by clicking (top) or brushing (bottom) to apply filters. (a) and (c) show the widget status before and after the filtering interaction, respectively.

Want to learn more? Here is a demo video. You can also track progress via our project page.

Written by: Hannah Kim and Megagon Labs

Follow us on LinkedIn and Twitter to stay up to date with us.
