
Commit 42798ca

azettl and yvrjsharma authored

Consilium: When Multiple LLMs Collaborate (#2977)

* Create consilium-multi-llm.md
* First draft
* Create thumbnail.png
* Add links and minor text changes
* Update consilium-multi-llm.md
* Update consilium-multi-llm.md
* Update consilium-multi-llm.md
* add _blog.yml entry
* Update consilium-multi-llm.md

Co-authored-by: Yuvraj Sharma <[email protected]>

1 parent 2c8337b commit 42798ca

File tree

3 files changed (+175, -1 lines)


_blog.yml

Lines changed: 11 additions & 1 deletion

```diff
@@ -6384,6 +6384,16 @@
   - evaluation
   - ai
 
+- local: consilium-multi-llm
+  title: "Consilium: When Multiple LLMs Collaborate"
+  author: azettl
+  thumbnail: /blog/assets/consilium-multi-llm/thumbnail.png
+  date: Jul 17, 2025
+  tags:
+  - multi-agents
+  - hackathon
+  - gradio
+  - mcp
 
 - local: virtual-cell-challenge
   title: "Arc Virtual Cell Challenge: A Primer"
@@ -6392,4 +6402,4 @@
   date: July 18, 2025
   tags:
   - collaboration
-  - guide
+  - guide
```
assets/consilium-multi-llm/thumbnail.png

76.7 KB (new binary image)

consilium-multi-llm.md

Lines changed: 164 additions & 0 deletions
---
title: "Consilium: When Multiple LLMs Collaborate"
thumbnail: /blog/assets/consilium-multi-llm/thumbnail.png
authors:
  - user: azettl
---
# Consilium: When Multiple LLMs Collaborate

Picture this: four AI experts sitting around a poker table, debating your toughest decisions in real time. That's exactly what Consilium, the multi-LLM platform I built during the [Gradio Agents & MCP Hackathon](https://huggingface.co/spaces/Agents-MCP-Hackathon/consilium_mcp), does. It lets AI models discuss complex questions and reach consensus through structured debate.
The platform works both as a visual Gradio interface and as an MCP (Model Context Protocol) server that integrates directly with applications like Cline (Claude Desktop had issues because its timeout could not be adjusted). The core idea was always about LLMs reaching consensus through discussion; that's where the name Consilium comes from. Other decision modes, such as majority voting and ranked choice, were added later to make the collaboration more sophisticated.
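For context on the MCP side: recent Gradio versions can expose an app as an MCP server with a single launch flag. Here is a minimal sketch of that mechanism; the `consensus` function and its wiring are illustrative, not Consilium's actual entry point:

```python
import gradio as gr

def consensus(question: str) -> str:
    """Run a multi-LLM roundtable discussion and return the consensus answer."""
    # Placeholder body; the docstring becomes the MCP tool description.
    return f"Consensus pending for: {question}"

demo = gr.Interface(fn=consensus, inputs="text", outputs="text")
# mcp_server=True serves an MCP endpoint alongside the web UI,
# which is what lets MCP clients like Cline connect to the Space.
demo.launch(mcp_server=True)
```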
## From Concept to Architecture

This wasn't my original hackathon idea. I initially wanted to build a simple MCP server to talk to my projects in RevenueCat, but I reconsidered when I realized that a multi-LLM platform, where models discuss questions and return well-reasoned answers, would be far more compelling.
The timing turned out to be perfect. Shortly after the hackathon, Microsoft published their [AI Diagnostic Orchestrator (MAI-DxO)](https://microsoft.ai/new/the-path-to-medical-superintelligence/), essentially an AI doctor panel with different roles, such as a "Dr. Challenger" agent, that iteratively diagnoses patients. In their setup with OpenAI o3, the panel correctly solved 85.5% of medical diagnosis benchmark cases, while practicing physicians achieved only 20% accuracy. This validates exactly what Consilium demonstrates: multiple AI perspectives collaborating can dramatically outperform individual analysis.
After settling on the concept, I needed something that worked both as an MCP server and as an engaging Hugging Face Space demo. I initially considered using the standard Gradio Chat component, but I wanted my submission to stand out. The idea was to seat the LLMs around a boardroom table with speech bubbles, which would capture the collaborative discussion while also making it visually engaging. Since I did not manage to style a standard table so that it was actually recognizable as a table, I went with a poker-style roundtable instead. This approach also let me submit to two hackathon tracks by building both a custom Gradio component and an MCP server.
## Building the Visual Foundation

The custom Gradio component became the heart of the submission: a poker-style roundtable where participants sit and display speech bubbles showing their responses, thinking status, and research activities, which immediately caught the eye of anyone visiting the Space. Component development was remarkably smooth thanks to Gradio's excellent developer experience, though I did encounter one documentation gap around PyPI publishing that led to my first contribution to the Gradio project.
```python
# The visual component integration
# (`consilium_roundtable` is the custom component; `json` is imported and
# `avatar_images` is defined earlier in the app)
roundtable = consilium_roundtable(
    label="AI Expert Roundtable",
    label_icon="https://huggingface.co/front/assets/huggingface_logo-noborder.svg",
    value=json.dumps({
        "participants": [],
        "messages": [],
        "currentSpeaker": None,
        "thinking": [],
        "showBubbles": [],
        "avatarImages": avatar_images
    })
)
```
The visual design proved robust throughout the hackathon; after the initial implementation, only features like user-defined avatars and center-table text were added, while the core interaction model remained unchanged.

If you are interested in creating your own custom Gradio component, take a look at [Custom Components in 5 minutes](https://www.gradio.app/guides/custom-components-in-five-minutes). And yes, the title does not lie; the basic setup literally takes only 5 minutes.
## Session State Management

The visual roundtable maintains state through a session-based dictionary system in which each user gets isolated state storage via `user_sessions[session_id]`. The core state object tracks the `participants`, `messages`, `currentSpeaker`, `thinking`, and `showBubbles` arrays, which are updated through `update_visual_state()` callbacks. When models are thinking or speaking, or research is being executed, the engine pushes incremental state updates to the frontend by appending to the messages array and toggling the speaker/thinking states. This creates the real-time visual flow without complex state machines: just direct JSON state mutations synchronized between backend processing and frontend rendering.
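A minimal sketch of that pattern, with names taken from the description above; the helper structure is illustrative, not the exact Consilium source:

```python
# Per-session state store; each user's roundtable is isolated by session_id.
user_sessions: dict = {}

def get_session_state(session_id: str) -> dict:
    # Create isolated state storage for new sessions.
    if session_id not in user_sessions:
        user_sessions[session_id] = {
            "participants": [],
            "messages": [],
            "currentSpeaker": None,
            "thinking": [],
            "showBubbles": [],
        }
    return user_sessions[session_id]

def update_visual_state(session_id: str, speaker: str, text: str) -> dict:
    # Append the new message and toggle the speaker/thinking flags;
    # the frontend simply re-renders from this JSON snapshot.
    state = get_session_state(session_id)
    state["messages"].append({"speaker": speaker, "text": text})
    state["currentSpeaker"] = speaker
    if speaker in state["thinking"]:
        state["thinking"].remove(speaker)
    return state
```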
## Making LLMs Actually Discuss

While implementing this, I realized there was no real discussion happening between the LLMs because they lacked clear roles. They received the full context of the ongoing discussion but didn't know how to engage with it meaningfully. I introduced distinct roles to create productive debate dynamics, which, after a few tweaks, ended up like this:
```python
self.roles = {
    'standard': "Provide expert analysis with clear reasoning and evidence.",
    'expert_advocate': "You are a PASSIONATE EXPERT advocating for your specialized position. Present compelling evidence with conviction.",
    'critical_analyst': "You are a RIGOROUS CRITIC. Identify flaws, risks, and weaknesses in arguments with analytical precision.",
    'strategic_advisor': "You are a STRATEGIC ADVISOR. Focus on practical implementation, real-world constraints, and actionable insights.",
    'research_specialist': "You are a RESEARCH EXPERT with deep domain knowledge. Provide authoritative analysis and evidence-based insights.",
    'innovation_catalyst': "You are an INNOVATION EXPERT. Challenge conventional thinking and propose breakthrough approaches."
}
```
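One natural way to apply these roles is to prepend the role string to each model's prompt together with the shared discussion context. A hedged sketch of what that assembly could look like; `build_prompt` is a hypothetical helper, not the exact source:

```python
def build_prompt(self, role: str, question: str, discussion_context: str) -> str:
    # Prepend the role instruction so the model argues in character,
    # then supply the question and whatever context the chosen
    # communication mode allows it to see.
    return (
        f"{self.roles[role]}\n\n"
        f"Question under discussion: {question}\n\n"
        f"Discussion so far:\n{discussion_context}\n\n"
        "Engage directly with the other participants' arguments."
    )
```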
This solved the discussion problem but raised a new question: how do you determine consensus or identify the strongest argument? I implemented a lead-analyst system where users select one LLM to synthesize the final result and evaluate whether consensus was reached.

I also wanted users to control the communication structure. Beyond the default full-context sharing, I added two alternative modes (sketched after the list):
* **Ring**: Each LLM only receives the previous participant's response
* **Star**: All messages flow through the lead analyst as a central coordinator
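A hedged sketch of how these modes could assemble the context each model sees; `build_context` and the shape of `history` are illustrative, not the exact source:

```python
def build_context(mode: str, history: list, lead: str, current: str) -> list:
    if mode == "ring" and history:
        # Ring: pass along only the previous participant's response.
        return history[-1:]
    if mode == "star" and current != lead:
        # Star: non-lead models see only messages routed through the lead.
        return [msg for msg in history if msg["speaker"] == lead]
    # Default full-context sharing (and the lead's own view in star mode).
    return history
```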
Finally, discussions need endpoints. I implemented configurable rounds (1-5), with testing showing that more rounds increase the likelihood of reaching consensus, though at a higher computational cost. The overall loop is sketched below.
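Putting the rounds and the lead analyst together, the discussion loop could look roughly like this; `query_model` and `lead_checks_consensus` are hypothetical helpers standing in for the actual engine calls:

```python
def run_discussion(question: str, models: list, lead: str, rounds: int = 3) -> str:
    history = []
    for _ in range(rounds):  # configurable, 1-5 in the UI
        for model in models:
            # query_model (hypothetical) sends the prompt and context to one LLM.
            reply = query_model(model, question, history)
            history.append({"speaker": model, "text": reply})
        # After each round, the lead analyst checks for consensus (hypothetical helper).
        if lead_checks_consensus(lead, question, history):
            break
    # Whether or not consensus was reached, the lead synthesizes the result.
    return query_model(lead, f"Synthesize a final answer to: {question}", history)
```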
## LLM Selection and Research Integration
The current model selection includes Mistral Large, DeepSeek-R1, Meta-Llama-3.3-70B, and QwQ-32B. While notable models like Claude Sonnet and OpenAI's o3 are absent, this reflects hackathon credit availability and sponsor award considerations rather than technical limitations.
```python
self.models = {
    'mistral': {
        'name': 'Mistral Large',
        'api_key': mistral_key,
        'available': bool(mistral_key)
    },
    'sambanova_deepseek': {
        'name': 'DeepSeek-R1',
        'api_key': sambanova_key,
        'available': bool(sambanova_key)
    }
    # ... remaining models follow the same pattern
}
```
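Keying availability off the API keys means a model simply doesn't get a seat if its key is missing; a minimal check might look like this (illustrative, not the exact source):

```python
# Only seat models whose API keys were actually configured.
available = [cfg['name'] for cfg in self.models.values() if cfg['available']]
```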
For models supporting function calling, I integrated a dedicated research agent that appears as another roundtable participant. Rather than giving models direct web access, this agent approach provides visual clarity about external resource availability and ensures consistent access across all function-calling models.
```python
def handle_function_calls(self, completion, original_prompt: str, calling_model: str) -> str:
    """UNIFIED function call handler with enhanced research capabilities"""

    message = completion.choices[0].message

    # If no function calls, return regular response
    if not hasattr(message, 'tool_calls') or not message.tool_calls:
        return message.content

    # Process each function call
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        # Execute research and show progress
        result = self._execute_research_function(function_name, arguments, calling_model)
        # ... (result handling continues)
```
The research agent accesses five sources: Web Search, Wikipedia, arXiv, GitHub, and SEC EDGAR. I built these tools on an extensible base class architecture for future expansion while focusing on freely embeddable resources.
```python
from abc import ABC, abstractmethod
from typing import Dict

class BaseTool(ABC):
    """Base class for all research tools"""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.last_request_time = 0
        self.rate_limit_delay = 1.0

    @abstractmethod
    def search(self, query: str, **kwargs) -> str:
        """Main search method - implemented by subclasses"""
        pass

    def score_research_quality(self, research_result: str, source: str = "web") -> Dict[str, float]:
        """Score research based on recency, authority, specificity, relevance"""
        quality_score = {
            "recency": self._check_recency(research_result),
            "authority": self._check_authority(research_result, source),
            "specificity": self._check_specificity(research_result),
            "relevance": self._check_relevance(research_result)
        }
        return quality_score
```
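To illustrate the extension pattern, here is what a minimal subclass could look like, written against Wikipedia's public page-summary endpoint. This is a sketch building on the `BaseTool` above, not Consilium's actual Wikipedia tool:

```python
import time
import requests

class WikipediaTool(BaseTool):
    """Illustrative subclass of the BaseTool shown above."""

    def __init__(self):
        super().__init__(
            name="wikipedia",
            description="Look up encyclopedia summaries for a topic."
        )

    def search(self, query: str, **kwargs) -> str:
        # Honor the base class's simple rate limit.
        wait = self.rate_limit_delay - (time.time() - self.last_request_time)
        if wait > 0:
            time.sleep(wait)
        self.last_request_time = time.time()

        # Wikipedia's public REST summary endpoint.
        title = query.replace(" ", "_")
        resp = requests.get(
            f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}",
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("extract", "No summary found.")
```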
Since research operations can be time-intensive, the speech bubbles display progress indicators and time estimates to maintain user engagement during longer research tasks.
## Discovering the Open Floor Protocol

After the hackathon, Deborah Dahl introduced me to the [Open Floor Protocol](https://github.com/open-voice-interoperability/openfloor-docs), which aligns perfectly with the roundtable approach. The protocol provides standardized JSON message formatting for cross-platform agent communication. Its key differentiator from other agent-to-agent protocols is that all agents maintain constant conversation awareness, exactly like sitting at the same table. Another feature I have not seen in other protocols is that the floor manager can dynamically invite agents to the floor and remove them from it.
The protocol's interaction patterns map directly to Consilium's architecture:
* **Delegation**: Transferring control between agents
* **Channeling**: Passing messages without modification
* **Mediation**: Coordinating behind the scenes
* **Orchestration**: Multiple agents collaborating
I'm currently integrating Open Floor Protocol support to allow users to add any OFP-compliant agent to their roundtable discussions. You can follow this development at [azettl/consilium_ofp](https://huggingface.co/spaces/azettl/consilium_ofp).
## Lessons Learned and Future Implications

The hackathon introduced me to multi-agent debate research I hadn't previously encountered, including foundational studies like [Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate](https://arxiv.org/abs/2305.19118). The community experience was remarkable; all participants actively supported each other through Discord feedback and collaboration. Seeing my roundtable component integrated into [another hackathon project](https://huggingface.co/spaces/Agents-MCP-Hackathon/multi-agent-chat) was one of the highlights of working on Consilium.
I will continue working on Consilium; with expanded model selection, Open Floor Protocol integration, and configurable agent roles, the platform could support virtually any multi-agent debate scenario imaginable.
Building Consilium reinforced my conviction that AI's future lies not just in more powerful individual models but in systems that enable effective AI collaboration. As specialized smaller language models become more efficient and resource-friendly, I believe roundtables of task-specific SLMs with dedicated research agents may provide compelling alternatives to general-purpose large language models for many use cases.
