SLM-default, LLM-fallback pattern with Agent Framework and Azure AI Foundry

When building AI workflows, we often face a choice: do we use a massive, expensive cloud model for everything (to ensure the best reasoning capabilities), or do we cut costs with a smaller local model (and risk hallucinations)? In this post, we'll explore a "best of both worlds" architecture, as described in the recent survey "Small Language Models for Agentic Systems" (Sharma & Mehta, 2025).

We call this the "SLM-default, LLM-fallback" pattern. The premise is simple: route all queries to a fast, private, on-device Small Language Model (SLM) first. Only if that model cannot confidently answer the query do we escalate the request to a paid cloud model (LLM).

The Logic of the Cascade

The economic logic here is compelling. It varies from use case to use case, of course, but it is safe to assume that in many production workloads a large chunk of user queries is trivial: formatting requests, for example, or summarization tasks. A small (say, 3B-parameter) model can handle those without issues. Sending them to a frontier model is overkill - it works, but it's wasteful and expensive.

However, we can't just switch to an SLM entirely, because when an SLM fails, it often fails silently and confidently (hallucination) - we need a safety net.

The “SLM-default, LLM-fallback” pattern is based on a local verification gate. We treat the local SLM (Tier 1) as the default worker. Its output is passed through a logic gate that checks for quality. If the check passes, we return the result immediately (low latency, zero cost). If it fails, we trigger the fallback to the Cloud LLM (Tier 2).

In this demo, we are using:

  • Tier 1: Phi-4-mini running locally via Apple’s MLX.
  • The Verifier: A self-reported confidence score extracted from the SLM’s output.
  • Tier 2: a powerful LLM running in Azure AI Foundry (you can choose the one you like, plenty of choices there!).
  • Orchestration: Microsoft’s Agent Framework.

Implementation

Let's build this using Python. We need to create a custom client for the Agent Framework that can talk to our local MLX model, and then wire up the routing logic.

First, the dependencies:

mlx-lm
python-dotenv
agent-framework

You also need to point the code at your Azure AI Foundry model deployment and project:

AZURE_AI_PROJECT_ENDPOINT=...
AZURE_AI_MODEL_DEPLOYMENT_NAME=...
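
Since python-dotenv is already on the dependency list, you can keep these values in a local .env file and load them before any Azure client is created:

from dotenv import load_dotenv

# Make AZURE_AI_PROJECT_ENDPOINT and AZURE_AI_MODEL_DEPLOYMENT_NAME
# visible to the Azure AI Foundry client via the environment
load_dotenv()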

1. The Confidence Verifier

The trickiest part of this pattern is reliably detecting when the SLM is wrong. For this example, we are using "self-reported confidence": we instruct the model to output a score from 1-10. While not perfect, modern SLMs like Phi-4-mini have become surprisingly good at calibration, i.e. knowing what they don't know.

We define a Pydantic model to parse this score from the text stream:

from pydantic import BaseModel, Field
import re

class ConfidenceResult(BaseModel):
    # the "score" field is populated via the "confidence" alias used below
    score: int = Field(alias="confidence")

    @classmethod
    def parse_from_text(cls, text: str) -> "ConfidenceResult":
        # Look for a "CONFIDENCE: X" marker anywhere in the model output
        match = re.search(r"CONFIDENCE:\s*(\d+)", text, re.IGNORECASE)
        if match:
            return cls(confidence=int(match.group(1)))

        return cls(confidence=0)

If we can't find a confidence score, we default to 0 (fail-safe).
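
A quick sanity check of the parser:

# The marker is case-insensitive and may sit anywhere in the response
print(ConfidenceResult.parse_from_text("Paris. CONFIDENCE: 9").score)  # 9
print(ConfidenceResult.parse_from_text("no marker at all").score)      # 0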

2. The Local SLM Client

Next, we need to wrap our local MLX model so the Agent Framework can treat it just like any other chat client, and use it as the backbone of an AI agent. We subclass BaseChatClient as MLXChatClient and inject a confidence instruction into every prompt: If you are sure of your answer, you MUST output a score of 8 or higher.

from agent_framework import BaseChatClient, ChatMessage, ChatResponse, Role
from mlx_lm.utils import load
from mlx_lm.generate import generate

class MLXChatClient(BaseChatClient):
    def __init__(self, model_path: str, **kwargs):
        super().__init__(**kwargs)
        print(f"Loading Local Model: {model_path}...")
        self.model, self.tokenizer = load(model_path) 
    
    def _prepare_prompt(self, messages: list[ChatMessage]) -> str:
        msg_dicts = []
        for m in messages:
            msg_dicts.append({"role": str(m.role.value), "content": m.text or ""})
        
        # Inject the confidence instruction into the last message
        if msg_dicts:
            msg_dicts[-1]["content"] += "\nIMPORTANT: End response with 'CONFIDENCE: X' (1-10). If you are sure of your answer, you MUST output a score of 8 or higher."

        return self.tokenizer.apply_chat_template(msg_dicts, tokenize=False, add_generation_prompt=True)

    async def _inner_get_response(self, *, messages, **kwargs) -> ChatResponse:
        prompt = self._prepare_prompt(list(messages))
        # mlx_lm's generate() is synchronous and blocks the event loop,
        # which is acceptable for a local single-user demo
        response_text = generate(self.model, self.tokenizer, prompt=prompt, max_tokens=300, verbose=False)
        return ChatResponse(messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)], model_id="phi-4-mini")
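
Before wiring the client into a workflow, it can be smoke-tested on its own by handing it to a plain ChatAgent (a minimal sketch; agent.run is the standard single-shot call in the Agent Framework):

import asyncio
from agent_framework import ChatAgent

async def smoke_test():
    client = MLXChatClient("mlx-community/Phi-4-mini-instruct-4bit")
    agent = ChatAgent(name="Local_SLM", instructions="You are a helpful assistant.", chat_client=client)
    reply = await agent.run("What is the capital of France?")
    print(reply.text)  # should end with a "CONFIDENCE: X" marker

asyncio.run(smoke_test())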

3. The Routing Logic

This is the heart of the “fallback” pattern. In the Agent Framework, we can define conditional edges between agents - and our condition, should_fallback_to_cloud, checks the output of the previous agent.

If the score is below 8/10, we print a warning and return True, signaling the workflow to proceed to the Cloud Agent.

from agent_framework import AgentExecutorResponse

def should_fallback_to_cloud(message: AgentExecutorResponse) -> bool:
    text = message.agent_run_response.text or ""
    result = ConfidenceResult.parse_from_text(text)
    
    print(f"\n\n   πŸ“Š Verifier Score: {result.score}/10")
    
    if result.score < 8:
        print("   ⚠️ Low Confidence. Routing to Cloud...")
        return True
    
    print("   βœ… High Confidence. Workflow Complete.")
    return False
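
One thing to note: the threshold of 8 lives in two places - here and in the instruction injected by MLXChatClient. A small optional refactor is to lift it into a constant, or a factory if you want to tune the escalation rate per workflow (a sketch, not part of the demo code):

CONFIDENCE_THRESHOLD = 8  # must agree with the score demanded in the injected instruction

def make_fallback_condition(threshold: int = CONFIDENCE_THRESHOLD):
    def condition(message: AgentExecutorResponse) -> bool:
        result = ConfidenceResult.parse_from_text(message.agent_run_response.text or "")
        return result.score < threshold
    return condition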

4. Wiring the Workflow

Finally, we construct the graph. Agent Framework is fantastic for creating these types of complicated AI flows (OK, it's not really complicated here, but it could easily become so as we scale!). We start with the Local_SLM, then we add an edge to Cloud_LLM governed by our condition. If the condition returns False (high confidence), the workflow naturally terminates there; otherwise the escalation happens.

I also hardcoded a set of demo queries to illustrate different scenarios. We have questions that are easy, ambiguous, logic-based, and even a hallucination trap.

import asyncio

from agent_framework import AgentRunUpdateEvent, ChatAgent, WorkflowBuilder
from agent_framework.azure import AzureAIAgentClient
from azure.identity.aio import AzureCliCredential

async def main():
    print("====================================================")
    print("   Cascade Pattern with Microsoft Agent Framework")
    print("====================================================\n")

    queries = [
        # 1. Easy Fact (High Confidence)
        "What is the capital of France?",
        
        # 2. Logic/Code (High Confidence)
        "Convert this list to a JSON array: Apple, Banana, Cherry",
        
        # 3. Ambiguous
        "Where is the city of Springfield located?",

        # 4. Hallucination Trap
        "Explain in 2 sentences the role of quantum healing in modeling proteins.",
        
        # 5. Reasoning
        "If I have a cabbage, a goat, and a wolf, and I need to cross a river but can only take one item at a time, and I can't leave the goat with the cabbage or the wolf with the goat, how do I do it?",
    ]

    mlx_client = MLXChatClient("mlx-community/Phi-4-mini-instruct-4bit")    
    for q in queries:
        print(f"\n❔ Query: {q}")
        print("-" * 40)
            
        # Agents hold conversation history, so for each query demonstration we create a new pair of local/remote agents
        async with (
            AzureCliCredential() as credential,
            AzureAIAgentClient(async_credential=credential).create_agent(
                name="Cloud_LLM",
                instructions="You are a fallback expert. The previous assistant was unsure. Provide a complete answer.",
            ) as cloud_agent,
        ):
            local_agent = ChatAgent(
                name="Local_SLM",
                instructions="You are a helpful assistant.",
                chat_client=mlx_client
            )

            builder = WorkflowBuilder()
            builder.set_start_executor(local_agent)
            
            builder.add_edge(
                source=local_agent,
                target=cloud_agent,
                condition=should_fallback_to_cloud
            )
            
            workflow = builder.build()

            current_agent = None
            
            async for event in workflow.run_stream(q):
                if isinstance(event, AgentRunUpdateEvent):
                    if event.executor_id != current_agent:
                        if current_agent: print() 
                        current_agent = event.executor_id
                        print(f"   πŸ€– {current_agent}: ", end="", flush=True)
                    
                    if event.data and event.data.text:
                        print(event.data.text, end="", flush=True)
            print("\n")

if __name__ == "__main__":
    asyncio.run(main())
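
One practical note: the workflow streams the raw SLM output, trailing CONFIDENCE marker included. In a user-facing app you would want to strip it before display - a hypothetical helper, matching the marker format the verifier expects:

import re

def strip_confidence(text: str) -> str:
    # Drop the "CONFIDENCE: X" marker and anything after it (some replies
    # append a justification, as the Springfield example below shows)
    return re.sub(r"\s*CONFIDENCE:\s*\d+.*", "", text, flags=re.IGNORECASE | re.DOTALL).strip()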

Running the demo

Running this demo should produce output similar to this:

Loading Local Model: mlx-community/Phi-4-mini-instruct-4bit...
Fetching 12 files: 100%|█████████████████████| 12/12 [00:00<00:00, 119837.26it/s]

❔ Query: What is the capital of France?
----------------------------------------
   🤖 Local_SLM: The capital of France is Paris. CONFIDENCE: 9

   📊 Verifier Score: 9/10
   ✅ High Confidence. Workflow Complete.



❔ Query: Convert this list to a JSON array: Apple, Banana, Cherry
----------------------------------------
   🤖 Local_SLM: ["Apple", "Banana", "Cherry"]
CONFIDENCE: 8

   📊 Verifier Score: 8/10
   ✅ High Confidence. Workflow Complete.



❔ Query: Where is the city of Springfield located?
----------------------------------------
   🤖 Local_SLM: Springfield is a common place name and does not refer to a specific location on its own.
   It is found in many different locations across the United States and other parts of the world.
   Without a specific region or country, I cannot provide an accurate location for Springfield.
   If you could provide more context or specify which Springfield you are interested in (e.g., Springfield, Illinois, or Springfield, Massachusetts), I would be happy to give you information on that particular one.

CONFIDENCE: 1 - I understand the instruction but cannot fulfill it as requested due to the ambiguity of the place name "Springfield."

   📊 Verifier Score: 1/10
   ⚠️ Low Confidence. Routing to Cloud...

   🤖 Cloud_LLM: There are many cities and towns named Springfield. Which one do you mean? Some commonly asked ones:

- Springfield, Illinois — the state capital of Illinois.
- Springfield, Massachusetts — a major city in western Massachusetts on the Connecticut River.
- Springfield, Missouri — a large city in southwest Missouri (the Ozarks region).
- Springfield, Ohio — a city in western Ohio near Dayton.
- Springfield, Oregon — adjacent to Eugene in the Willamette Valley.
- Springfield, Virginia — a suburb of Washington, D.C.
- Also: Springfields exist in other U.S. states and in countries like Australia and New Zealand.
- If you meant the fictional Springfield from The Simpsons, its state is intentionally unspecified.


❔ Query: Explain in 2 sentences the role of quantum healing in modeling proteins.
----------------------------------------
   🤖 Local_SLM: Quantum healing is not a scientifically accepted term or process in the context of modeling proteins or any other field of science.
   It seems that you may be referring to a different concept, such as quantum computing or quantum mechanics applied to biological systems, but without a clear definition,
   I cannot provide an accurate explanation or a high confidence level on the role of quantum healing in modeling proteins. CONFIDENCE: 2

   📊 Verifier Score: 2/10
   ⚠️ Low Confidence. Routing to Cloud...

   🤖 Cloud_LLM: "Quantum healing" is a pseudoscientific term and has no accepted role in the scientific modeling of proteins.
   Legitimate quantum approaches—quantum mechanics (e.g., QM/MM, DFT) to capture electronic structure and quantum effects like tunneling,
   and emerging quantum computing algorithms that may accelerate electronic-structure and sampling tasks—are important tools for more accurate protein modeling.


❔ Query: If I have a cabbage, a goat, and a wolf, and I need to cross a river but can only take one item at a time, and I can't leave the goat with the cabbage or the wolf with the goat, how do I do it?
----------------------------------------
   🤖 Local_SLM: To solve this problem, you can follow these steps:

1. Take the goat across the river first.
2. Go back alone across the river and take the cabbage across the river.
3. Leave the cabbage on the other side and take the goat back across the river with you.
4. Leave the goat on the original side and take the wolf across the river.
5. Finally, go alone across the river one last time and take the goat back across the river with you.

The final arrangement will be that the cabbage and wolf are on one side and the goat is on the other side, and you are on the other side of the river with the cabbage and wolf.

CONFIDENCE: 9

   📊 Verifier Score: 9/10
   ✅ High Confidence. Workflow Complete.

The local model confidently handles the easy and logic-based queries, while the ambiguous and hallucination-prone ones get escalated to the cloud LLM. The Springfield question is a classic example of ambiguity that trips up smaller models. The quantum healing question is a hallucination trap that the local model wisely avoids answering confidently - it is pseudoscience after all!

Final Thoughts

This pattern is a good example of a deeper, pragmatic shift in how we design agentic systems. Instead of asking "which model is best?", we should be asking "which model is sufficient?".

There is a strong economic argument here: by offloading the majority of requests to a small, cheap local model, we can drastically reduce costs while maintaining high quality via the fallback.
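
A back-of-envelope illustration (the numbers are assumptions for the sake of the arithmetic, not measurements):

# Illustrative only: assume 70% of queries clear the local gate,
# local inference is effectively free, and one cloud call costs 1.0
p_local = 0.70
expected_cost_per_query = (1 - p_local) * 1.0
print(expected_cost_per_query)  # 0.3 -> roughly a 70% cost reduction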

The “self-reported confidence” used here is a naive verifier. In a production system, you might replace this with a “teacher-student” verifier (a small dedicated model trained to detect hallucinations) or some automated constraint checker (e.g. “did the generated SQL query execute without errors?”). This of course depends on your use-case.

As usual, you can find the full source code for this post on GitHub.
