Exploring Microsoft Foundry Local

Last week at Build, Microsoft released a public preview of Foundry Local - a new tool for running AI models locally on your machine, with a focus on privacy and security. It is a spiritual sibling to Azure AI Foundry, which is a managed Azure cloud service for building and running AI applications.

However, using Foundry Local is independent of Azure, and it can run models locally without any cloud dependencies. It is currently available for Windows x64, Windows ARM and macOS ARM.

I was involved in the private testing of Foundry Local prior to last week's announcement, so I have been using it for a while now. In this post, I will share my experience with it.

How to think of Foundry Local?

There are several ways you can integrate AI models into your applications. The way I like to think about it is to use a tiered taxonomy.

Tier 1. Low-level inference frameworks & libraries

These are the engines that actually load and execute your model graphs. Examples:

  • llama.cpp – native C/C++ CPU inference for LLaMA-style models
  • Candle – Rust-based, CPU/GPU inference
  • PyTorch – research/production DL library with eager & TorchScript modes
  • Apple MLX – Apple's array framework (Python/C++/Swift APIs) + mlx.nn for lazy, just-in-time execution on Apple silicon
  • ONNX Runtime – C/C++ (with Python bindings) runtime that executes ONNX graphs with graph-level optimizations

Tier 2. Self-hosted orchestration & serving platforms

These wrap one or more low-level engines to give you REST endpoints, batching, model catalogs etc. They often come with a GUI for managing models and monitoring usage. Examples:

  • LM Studio – can use different engines: the built-in "llama" engine (via llama.cpp) or the Apple MLX engine on macOS. Provides a desktop GUI plus a local REST server and access to models on Hugging Face.
  • Ollama – CLI + REST wrapper around llama.cpp

Tier 3. Managed PaaS / Cloud APIs

Fully hosted inference endpoints you access over HTTP - the provider manages GPUs, scaling, SLAs, billing. Examples: OpenAI (and Azure OpenAI), Anthropic (Claude), Google AI Platform, Hugging Face Inference API, etc.

Within such a taxonomy, Foundry Local slots into Tier 2, as a self-hosted orchestration and model serving platform. Foundry Local is a de facto CLI/service engine for ONNX - it supports only ONNX Runtime models, exposes an OpenAI-compatible REST API on localhost, and auto-downloads ONNX models optimized for your hardware from the online Foundry catalog. It also allows you to run your own models, as long as they are in ONNX format.

Getting started

Installation is super simple, with installers currently available for Windows x64, Windows ARM and macOS ARM. The installer is a single executable that installs the Foundry Local service and the CLI.

The Foundry Local service is started upon first interaction with the foundry CLI. You can check the status of the service with the foundry service status command:

➜  /Users/filipw/dev  foundry service status
🟢 Model management service is running on http://localhost:5273/openai/status
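
The status endpoint printed above can also be hit programmatically. Here is a minimal sketch that only checks that the service responds - the exact payload format is not shown here, and the port should match whatever your own foundry service status prints:

import urllib.request

# Ping the status URL reported by `foundry service status`.
# Assumes the service is running on port 5273, as in the output above.
status_url = "http://localhost:5273/openai/status"

with urllib.request.urlopen(status_url, timeout=5) as response:
    print(response.status)           # 200 when the service is up
    print(response.read().decode())  # raw status payload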

On macOS, Foundry ships with the following models:

➜  /Users/filipw/dev  foundry model list
Alias                          Device     Task               File Size    License      Model ID            
-----------------------------------------------------------------------------------------------
phi-4                          GPU        chat-completion    8.37 GB      MIT          Phi-4-generic-gpu   
                               CPU        chat-completion    10.16 GB     MIT          Phi-4-generic-cpu   
--------------------------------------------------------------------------------------------------------
mistral-7b-v0.2                GPU        chat-completion    4.07 GB      apache-2.0   mistralai-Mistral-7B-Instruct-v0-2-generic-gpu
                               CPU        chat-completion    4.07 GB      apache-2.0   mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
-------------------------------------------------------------------------------------------------------------------------------------
phi-3.5-mini                   GPU        chat-completion    2.16 GB      MIT          Phi-3.5-mini-instruct-generic-gpu
                               CPU        chat-completion    2.53 GB      MIT          Phi-3.5-mini-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
phi-3-mini-128k                GPU        chat-completion    2.13 GB      MIT          Phi-3-mini-128k-instruct-generic-gpu
                               CPU        chat-completion    2.54 GB      MIT          Phi-3-mini-128k-instruct-generic-cpu
---------------------------------------------------------------------------------------------------------------------------
phi-3-mini-4k                  GPU        chat-completion    2.13 GB      MIT          Phi-3-mini-4k-instruct-generic-gpu
                               CPU        chat-completion    2.53 GB      MIT          Phi-3-mini-4k-instruct-generic-cpu
-------------------------------------------------------------------------------------------------------------------------
phi-4-mini-reasoning           GPU        chat-completion    3.15 GB      MIT          Phi-4-mini-reasoning-generic-gpu
                               CPU        chat-completion    4.52 GB      MIT          Phi-4-mini-reasoning-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
deepseek-r1-14b                GPU        chat-completion    10.27 GB     MIT          deepseek-r1-distill-qwen-14b-generic-gpu
-------------------------------------------------------------------------------------------------------------------------------
deepseek-r1-7b                 GPU        chat-completion    5.58 GB      MIT          deepseek-r1-distill-qwen-7b-generic-gpu
------------------------------------------------------------------------------------------------------------------------------
phi-4-mini                     GPU        chat-completion    3.72 GB      MIT          Phi-4-mini-instruct-generic-gpu
----------------------------------------------------------------------------------------------------------------------
qwen2.5-0.5b                   GPU        chat-completion    0.68 GB      apache-2.0   qwen2.5-0.5b-instruct-generic-gpu
                               CPU        chat-completion    0.80 GB      apache-2.0   qwen2.5-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-0.5b             GPU        chat-completion    0.52 GB      apache-2.0   qwen2.5-coder-0.5b-instruct-generic-gpu
                               CPU        chat-completion    0.80 GB      apache-2.0   qwen2.5-coder-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-1.5b                   GPU        chat-completion    1.51 GB      apache-2.0   qwen2.5-1.5b-instruct-generic-gpu
                               CPU        chat-completion    1.78 GB      apache-2.0   qwen2.5-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-7b                     GPU        chat-completion    5.20 GB      apache-2.0   qwen2.5-7b-instruct-generic-gpu
                               CPU        chat-completion    6.16 GB      apache-2.0   qwen2.5-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-1.5b             GPU        chat-completion    1.25 GB      apache-2.0   qwen2.5-coder-1.5b-instruct-generic-gpu
                               CPU        chat-completion    1.78 GB      apache-2.0   qwen2.5-coder-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-7b               GPU        chat-completion    4.73 GB      apache-2.0   qwen2.5-coder-7b-instruct-generic-gpu
                               CPU        chat-completion    6.16 GB      apache-2.0   qwen2.5-coder-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------------
qwen2.5-14b                    GPU        chat-completion    9.30 GB      apache-2.0   qwen2.5-14b-instruct-generic-gpu
                               CPU        chat-completion    11.06 GB     apache-2.0   qwen2.5-14b-instruct-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-14b              GPU        chat-completion    8.79 GB      apache-2.0   qwen2.5-coder-14b-instruct-generic-gpu
                               CPU        chat-completion    11.06 GB     apache-2.0   qwen2.5-coder-14b-instruct-generic-cpu

The GPU variants indicate that the models offer Metal acceleration on Apple silicon.

The model list depends on the hardware and platform, which leads us to one of the great value propositions of Foundry Local - its compatibility with Windows ARM, a notoriously neglected platform. Foundry Local not only works on Windows ARM but even includes models optimized for Copilot+ PCs and their NPUs. This is very exciting, as very few orchestrators support Windows ARM at all, let alone with optimized models.

When I run the same command on my Windows ARM machine, I get the following set of models (note some NPU models):

PS C:\Users\filip> foundry model list
Alias                          Device     Task               File Size    License      Model ID
-----------------------------------------------------------------------------------------------
phi-4                          CPU        chat-completion    10.16 GB     MIT          Phi-4-generic-cpu
--------------------------------------------------------------------------------------------------------
phi-3.5-mini                   CPU        chat-completion    2.53 GB      MIT          Phi-3.5-mini-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
deepseek-r1-14b                NPU        chat-completion    7.12 GB      MIT          deepseek-r1-distill-qwen-14b-qnn-npu
---------------------------------------------------------------------------------------------------------------------------
deepseek-r1-7b                 NPU        chat-completion    3.71 GB      MIT          deepseek-r1-distill-qwen-7b-qnn-npu
--------------------------------------------------------------------------------------------------------------------------
phi-4-mini-reasoning           NPU        chat-completion    2.78 GB      MIT          Phi-4-mini-reasoning-qnn-npu
                               CPU        chat-completion    4.52 GB      MIT          Phi-4-mini-reasoning-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
phi-3-mini-128k                CPU        chat-completion    2.54 GB      MIT          Phi-3-mini-128k-instruct-generic-cpu
---------------------------------------------------------------------------------------------------------------------------
phi-3-mini-4k                  CPU        chat-completion    2.53 GB      MIT          Phi-3-mini-4k-instruct-generic-cpu
-------------------------------------------------------------------------------------------------------------------------
mistral-7b-v0.2                CPU        chat-completion    4.07 GB      apache-2.0   mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
-------------------------------------------------------------------------------------------------------------------------------------
qwen2.5-0.5b                   CPU        chat-completion    0.80 GB      apache-2.0   qwen2.5-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-0.5b             CPU        chat-completion    0.80 GB      apache-2.0   qwen2.5-coder-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-1.5b                   CPU        chat-completion    1.78 GB      apache-2.0   qwen2.5-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-7b                     CPU        chat-completion    6.16 GB      apache-2.0   qwen2.5-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-1.5b             CPU        chat-completion    1.78 GB      apache-2.0   qwen2.5-coder-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-7b               CPU        chat-completion    6.16 GB      apache-2.0   qwen2.5-coder-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------------
qwen2.5-14b                    CPU        chat-completion    11.06 GB     apache-2.0   qwen2.5-14b-instruct-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-14b              CPU        chat-completion    11.06 GB     apache-2.0   qwen2.5-coder-14b-instruct-generic-cpu

The models are downloaded on demand and cached for future use. A model can be downloaded when starting an inference session (foundry model run {model id}) or with an explicit command (foundry model download {model id}):

➜  /Users/filipw/dev  foundry model download qwen2.5-1.5b
Downloading model...
[####################################] 100.00 % [Time remaining: about 0s]        34.6 MB/s
Tips:
- To find model cache location use: foundry cache location
- To find models already downloaded use: foundry cache ls

Once it's there, you can simply chat with it in the terminal:

➜  /Users/filipw/dev  foundry model run qwen2.5-coder-0.5b 
Model qwen2.5-coder-0.5b was found in the local cache.
🕔 Loading model...
🟢 Model qwen2.5-coder-0.5b-instruct-generic-gpu loaded successfully

Interactive Chat. Enter /? or /help for help.

Interactive mode, please enter your prompt
> hello who are you
🤖 Hello! I am a computer program designed to assist users with information and tasks. How can I assist you today?

Service mode

A model can also be loaded into the Foundry service - you can then access it via the OpenAI-compatible REST API, which is available at http://localhost:5273/v1.

➜  /Users/filipw/dev  foundry model load qwen2.5-1.5b  
🕕 Loading model...
🟢 Model qwen2.5-1.5b loaded successfully

The loaded models can then be viewed with:

➜  /Users/filipw/dev  foundry service list
Models running in service:
    Alias                          Model ID            
🟢  qwen2.5-coder-0.5b             qwen2.5-coder-0.5b-instruct-generic-gpu
🟢  qwen2.5-1.5b                   qwen2.5-1.5b-instruct-generic-gpu

The API is compatible with the OpenAI API, so you can use it with any OpenAI-compatible client library such as openai or azure-ai-inference. Beware that you should use the “long” model ID in the request - not the alias.
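
For example, here is a minimal sketch using the openai package against the local endpoint, assuming the qwen2.5-1.5b model has been loaded into the service as shown above. The API key is an arbitrary placeholder - the openai client requires a non-empty value, but the local service does not appear to validate it (the azure-ai-inference example below passes an empty credential):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",
    api_key="not-needed",  # placeholder; not validated by the local service
)

response = client.chat.completions.create(
    model="qwen2.5-1.5b-instruct-generic-gpu",  # the "long" model ID, not the alias
    messages=[{"role": "user", "content": "hello who are you"}],
)

print(response.choices[0].message.content)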

Some time ago I used a toy classification example to illustrate how to switch between models running at different providers when using the azure-ai-inference client library. The same example works with Foundry Local. Below is the code:

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

instruction = """You are a medical classification engine for health conditions. Classify the prompt into one of the following possible treatment options: 'doctor_required' (serious condition), 'pharmacist_required' (light condition) or 'rest_required' (general tiredness). If you cannot classify the prompt, output 'unknown'.
Only respond with the single word classification. Do not produce any additional output.

# Examples:
User: "I did not sleep well." Assistant: "rest_required"
User: "I chopped off my arm." Assistant: "doctor_required"

# Task
User: 
"""

user_inputs = [
    "I'm tired.", # rest_required
    "I'm bleeding from my eyes.", # doctor_required
    "I have a headache." # pharmacist_required
]

def run_inference():
    for user_input in user_inputs:
        messages = [{
            "role": "user",
            "content": f"{instruction}{user_input} Assistant: "
        }]
        print(f"{user_input} -> ", end="")
        stream = client.complete(
            messages=messages,
            stream=True,
            model="qwen2.5-1.5b-instruct-generic-gpu"
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="")
        print()

client = ChatCompletionsClient(
    endpoint="http://localhost:5273/v1",
    credential=AzureKeyCredential(""),
)

run_inference()

This uses our Foundry service running locally (the model being qwen2.5-1.5b-instruct-generic-gpu), and the output is:

I'm tired. -> rest_required
I'm bleeding from my eyes. -> doctor_required
I have a headache. -> pharmacist_required
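
And because the service also has the qwen2.5-coder-0.5b model loaded (see foundry service list above), pointing the same client at a different model only requires swapping the long model ID - a small sketch:

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="http://localhost:5273/v1",
    credential=AzureKeyCredential(""),
)

# Target another model already loaded into the local service.
response = client.complete(
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    model="qwen2.5-coder-0.5b-instruct-generic-gpu",
)

print(response.choices[0].message.content)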

Known limitations

I've been working with Foundry Local over the last little while, and it has certainly come a long way in that time. It does feel very stable and usable at this point, and is an integral part of my workflows now. But bear in mind that it is still in public preview and has some limitations - all of which are tracked on GitHub.

The two that I personally found most annoying: on Windows ARM, every version upgrade results in a broken state and a reboot is needed. It is also currently not possible to access the service REST API from inside WSL (or from outside of localhost, for that matter).

But all things considered, Foundry Local is a very promising tool for running AI models locally, especially on Windows ARM. It is easy to install, has a good set of pre-packaged models, and provides a simple CLI and REST API for interacting with them.

Have a go yourself!
