Xiaolan Wang, Aaron Feng, Behzad Golshan, Alon Y. Halevy, George A. Mihaila, Hidekazu Oiwa, Wang-Chiew Tan
We present the KOKO system that takes declarative information extraction to a new level by incorporating advances in natural language processing techniques in its extraction language. KOKO is
novel in that its extraction language simultaneously supports conditions on the surface of the text and on the structure of the dependency parse tree of sentences, thereby allowing for more refined
extractions. KOKO also supports conditions that tolerate
linguistic variation in how concepts are expressed, and it can aggregate
evidence from the entire document in order to filter extractions.
To scale up, KOKO exploits a multi-indexing scheme and heuristics for efficient extractions. We extensively evaluate KOKO over
publicly available text corpora. We show that, compared with a number of
prior indexing schemes, KOKO’s indices take up the least space and are
notably faster and more effective. Finally, we demonstrate
KOKO’s scalability on a corpus of 5 million Wikipedia articles.