Characterizing Human-Centered Information Extraction

Information extraction (IE) approaches often play a pivotal role in text analysis and require significant human intervention. Therefore, a deeper understanding of existing IE practices and related challenges from a human-in-the-loop perspective is warranted. We conducted semi-structured interviews in an industrial environment to study projects employing information extraction, and then analyzed the reported IE approaches and limitations. We observed that data practitioners often follow an iterative task model consisting of information foraging and sensemaking loops across all the phases of an IE workflow. The task model is generalizable and captures diverse goals across these phases (e.g., data preparation, modeling, evaluation). We found several limitations in both the foraging (e.g., data exploration) and sensemaking (e.g., qualitative debugging) loops, stemming from a lack of adherence to existing cognitive engineering principles. Moreover, we identified that, due to the iterative nature of an IE workflow, the requirement of provenance is often implied but rarely supported by existing systems. Based on these findings, we discussed design implications for supporting IE workflows. In particular, we proposed feature- and system-specific guidelines for designing human-centered data systems. The feature-specific guidelines, inspired by cognitive engineering principles for enhancing human-computer performance, recommend automating unwanted human workload. The system-specific guidelines suggest (a) integrating provenance management mechanisms for the reproducibility of workflows and (b) building end-to-end solutions that support multiple modalities, such as data, code, and visualizations, within a single tool to ensure ease of use.