Towards Enterprise Compound AI Systems

The emergence of large language models (LLMs) as proficient agents — capable of accomplishing various knowledge-intensive tasks — has ushered in a new era of compound AI systems. Such systems support agentic workflows in which agents (such as LLMs) interact with tools and data retrievers to solve complex tasks involving natural language understanding, code generation, and complex reasoning. 

However, as we transition toward productizing such systems, production challenges persist, including issues related to consistency, availability, and monetary budget, among others. Researchers at Megagon Labs have been exploring how we can address the challenges of building compound AI systems for enterprises. In this blog post, we introduce three projects that we have undertaken: (1) developing a suitable architecture for productizing compound AI systems, (2) optimizing agentic workflows with real-world constraints, and (3) benchmarking the performance of agents within a compound AI system, specifically in an enterprise setting.

In the first project, we design and develop a blueprint architecture that lets compound AI systems operate cost-effectively and feasibly in enterprise settings. The proposed architecture aims for seamless integration with existing compute and data infrastructure, with streams serving as the key orchestration concept for coordinating data and instructions among agents and other components. Together, the task and data planners break down tasks, map them to the agents and data sources defined in their respective registries, and optimize the resulting plan under production constraints such as accuracy and latency.
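To make the planner-and-registry idea concrete, here is a minimal sketch of how a task planner might map subtasks to registered agents under accuracy and latency constraints. Everything in it is an assumption for illustration: the `AgentSpec` fields, the `plan` function, the cheapest-feasible-agent rule, and the numbers in `REGISTRY` are hypothetical and do not describe the actual architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical registry entry; field names are illustrative only."""
    name: str
    skill: str          # kind of subtask the agent can handle
    accuracy: float     # expected accuracy in [0, 1]
    latency_ms: float   # expected latency per call
    cost: float         # expected monetary cost per call


def plan(subtasks, registry, min_accuracy, max_latency_ms):
    """Map each subtask to the cheapest registered agent meeting the constraints."""
    assignment = {}
    for skill in subtasks:
        candidates = [
            a for a in registry
            if a.skill == skill
            and a.accuracy >= min_accuracy
            and a.latency_ms <= max_latency_ms
        ]
        if not candidates:
            raise LookupError(f"no agent satisfies the constraints for {skill!r}")
        assignment[skill] = min(candidates, key=lambda a: a.cost).name
    return assignment


# Made-up registry values, used only to exercise the sketch.
REGISTRY = [
    AgentSpec("large-llm", "summarize", accuracy=0.92, latency_ms=900, cost=1.0),
    AgentSpec("small-llm", "summarize", accuracy=0.85, latency_ms=200, cost=0.1),
    AgentSpec("sql-agent", "query", accuracy=0.90, latency_ms=300, cost=0.2),
]

print(plan(["query", "summarize"], REGISTRY, min_accuracy=0.8, max_latency_ms=500))
```

With a 500 ms latency ceiling the sketch routes summarization to the smaller agent; relaxing latency and raising the accuracy floor would route it to the larger one. A real planner would also have to account for data sources and for dependencies between subtasks, which this sketch leaves out.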

Orchestration platform to enterprise infrastructure

In the second project, we ask how to understand and analyze the performance of compound AI systems, given that they comprise many agents whose performance varies. Current approaches often optimize and evaluate against narrow, single-focus objectives and consequently overlook real-world constraints. We propose a novel criterion: reasoning capacity. This criterion not only enables a more holistic approach to optimization and evaluation but also provides essential tools for interpreting, analyzing, and debugging compound AI systems.

Reasoning capacity multi-agent system structure

Drawing inspiration from information theory and distributed computing, we define reasoning capacity (RC) as a system’s overall ability to effectively process input and generate output for a given task within a set of constraints. More specifically, it is defined as the maximum mutual information between input and output with respect to the input distribution. We explore the use of reasoning capacity in addressing bottlenecks across various components of compound AI systems. These limitations range from budget, ethical, privacy, and trust considerations in orchestration and planning to out-of-distribution tasks and data, as well as the lack of self-verification capabilities in agents.
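The definition above can be written compactly as follows. The notation is an assumption made here for illustration: X denotes the task input, Y the system output, p(x) the input distribution, and p(y | x) the system's input-to-output behavior, all taken to be discrete. The constraints mentioned in the definition are not modeled in this form.

```latex
\mathrm{RC} \;=\; \max_{p(x)} I(X;Y),
\qquad
I(X;Y) \;=\; \sum_{x,y} p(x)\,p(y \mid x)\,
\log \frac{p(y \mid x)}{\sum_{x'} p(x')\,p(y \mid x')}
```

Written this way, the expression has the same form as channel capacity in information theory, with the compound AI system playing the role of the channel.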

In the third project, we turn to benchmarking. Compound AI systems (CASs) have the potential to supplement the typical analysis workflows of data analysts on enterprise data platforms, but they are subject to the same data discovery challenges that analysts have encountered over the years. These include silos of multimodal data sources, created across teams and departments within an organization, which make it difficult to identify the appropriate data sources for the task at hand. Existing data discovery benchmarks do not model this multimodality and multiplicity of data sources. Moreover, existing CAS benchmarks evaluate only end-to-end task performance. To catalyze research on evaluating the data discovery performance of multimodal data retrievers in CASs in a real-world setting, we propose CMDBench, a benchmark that models the complexity of enterprise data platforms.

CMDBench from compound to source discovery and retrieval of enterprise data.

We adapt existing datasets and benchmarks in the open domain — from question answering and complex reasoning tasks to natural language querying over structured data — to evaluate coarse- and fine-grained data discovery and task execution performance. Our experiments reveal the impact of data retriever design on downstream task performance (a 46% drop in task accuracy on average) across various modalities, data sources, and task difficulty. The results indicate the need to develop optimization strategies to identify appropriate LLM agents and retrievers for efficient execution of CASs over enterprise data.
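As a rough illustration of the difference between coarse- and fine-grained data discovery, the sketch below scores a retriever at two levels: whether it picked the right data source, and whether it picked the right item inside that source. The `(source, item)` pair layout, the function name `discovery_scores`, and the toy data are all hypothetical; they are not CMDBench's format, metrics, or results.

```python
def discovery_scores(predictions, gold):
    """Return (coarse, fine) accuracy over a list of tasks.

    Each entry is a (source, item) pair: `source` names a data source
    (coarse-grained) and `item` a unit inside it, such as a table or a
    document (fine-grained). This layout is a hypothetical simplification.
    """
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    # Coarse: only the data source has to match.
    coarse = sum(p[0] == g[0] for p, g in zip(predictions, gold))
    # Fine: both the source and the item inside it have to match.
    fine = sum(p == g for p, g in zip(predictions, gold))
    return coarse / len(gold), fine / len(gold)


# Toy data, invented for the example.
gold = [("docs", "d7"), ("tables", "t3"), ("graph", "n1"), ("tables", "t9")]
pred = [("docs", "d7"), ("tables", "t4"), ("docs", "d2"), ("tables", "t9")]

coarse, fine = discovery_scores(pred, gold)
print(coarse, fine)
```

Separating the two scores helps show whether a failure comes from choosing the wrong source or from retrieving the wrong item within the right source.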

Compound AI systems offer a viable path to develop reliable, effective, and usable AI applications. While the best practices for developing AI systems are open to exploration, we believe interdisciplinary research in AI, NLP, databases, systems, and HCI will provide the right methodologies for effectively addressing the most persistent problems.

Written by Eser Kandogan, Sajjadur Rahman, Pouya Pezeshkpour and Megagon Labs


