Last week at Build, Microsoft released a public preview of Foundry Local - a new tool for running AI models locally on your machine, with a focus on privacy and security. It is a spiritual sibling to Azure AI Foundry, which is a managed Azure cloud service for building and running AI applications.
However, Foundry Local is independent of Azure and can run models locally without any cloud dependencies. It is currently available for Windows x64, Windows ARM and macOS ARM.
I was involved in the private testing of Foundry Local prior to last week’s announcement, so I have been using it for a while now. In this post, I will share my experience with it.
How to think of Foundry Local?
There are several ways to integrate AI models into your applications. The way I like to think about it is as a tiered taxonomy.
Tier 1. Low-level inference frameworks & libraries
These are the engines that actually load and execute your model graphs. Examples:
- llama.cpp - native C/C++ CPU inference for LLaMA-style models
- Candle - Rust-based, CPU/GPU inference
- PyTorch - research/production DL library with eager & TorchScript modes
- Apple MLX - Apple's array framework (Python, Swift and C++ APIs) + mlx.nn, with lazy execution on Apple silicon
- ONNX Runtime - C/C++ (with Python bindings) runtime that executes ONNX graphs with graph-level optimizations
Tier 2. Self-hosted orchestration & serving platforms
These wrap one or more low-level engines to give you REST endpoints, batching, model catalogs etc. They often come with a GUI for managing models and monitoring usage. Examples:
- LM Studio - can use different engines: the built-in “llama” engine (via llama.cpp) or the Apple MLX engine on macOS. Provides a desktop GUI plus a local REST server and access to models on Hugging Face.
- Ollama - CLI + REST wrapper around llama.cpp
Tier 3. Managed PaaS / Cloud APIs
These are fully hosted inference endpoints you access over HTTP - the provider manages GPUs, scaling, SLAs and billing. Examples: OpenAI (and Azure OpenAI), Anthropic (Claude), Google AI Platform, Hugging Face Inference API, etc.
Within this taxonomy, Foundry Local slots into Tier 2, as a self-hosted orchestration and model serving platform. It is a de facto CLI/service engine for ONNX - it supports only ONNX Runtime models, exposes an OpenAI-compatible REST API on localhost and auto-downloads ONNX models optimized for your hardware from the online Foundry catalog. It also allows you to run your own models, as long as they are in ONNX format.
Getting started
Installation is super simple, with the installers currently being available for Windows x64, Windows ARM and macOS ARM. The installer is a single executable that installs the Foundry Local service and CLI.
The Foundry Local service is started upon first interaction with the foundry CLI. You can check the status of the service with the foundry service status command:
➜ /Users/filipw/dev foundry service status
🟢 Model management service is running on http://localhost:5273/openai/status
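Since the CLI prints the status URL, you can also check the service health programmatically. Here is a minimal sketch of mine (not part of the official tooling) - note that the port (5273 in my case) may differ on your machine:

import urllib.request

# The status URL printed by `foundry service status` - adjust to your machine
STATUS_URL = "http://localhost:5273/openai/status"

try:
    # A plain HTTP GET is enough to confirm that the model management service is up
    with urllib.request.urlopen(STATUS_URL, timeout=5) as response:
        print(f"Foundry Local service is reachable (HTTP {response.status})")
except OSError as error:
    print(f"Foundry Local service is not reachable: {error}")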
On macOS, Foundry Local offers the following models:
➜ /Users/filipw/dev foundry model list
Alias Device Task File Size License Model ID
-----------------------------------------------------------------------------------------------
phi-4 GPU chat-completion 8.37 GB MIT Phi-4-generic-gpu
CPU chat-completion 10.16 GB MIT Phi-4-generic-cpu
--------------------------------------------------------------------------------------------------------
mistral-7b-v0.2 GPU chat-completion 4.07 GB apache-2.0 mistralai-Mistral-7B-Instruct-v0-2-generic-gpu
CPU chat-completion 4.07 GB apache-2.0 mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
-------------------------------------------------------------------------------------------------------------------------------------
phi-3.5-mini GPU chat-completion 2.16 GB MIT Phi-3.5-mini-instruct-generic-gpu
CPU chat-completion 2.53 GB MIT Phi-3.5-mini-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
phi-3-mini-128k GPU chat-completion 2.13 GB MIT Phi-3-mini-128k-instruct-generic-gpu
CPU chat-completion 2.54 GB MIT Phi-3-mini-128k-instruct-generic-cpu
---------------------------------------------------------------------------------------------------------------------------
phi-3-mini-4k GPU chat-completion 2.13 GB MIT Phi-3-mini-4k-instruct-generic-gpu
CPU chat-completion 2.53 GB MIT Phi-3-mini-4k-instruct-generic-cpu
-------------------------------------------------------------------------------------------------------------------------
phi-4-mini-reasoning GPU chat-completion 3.15 GB MIT Phi-4-mini-reasoning-generic-gpu
CPU chat-completion 4.52 GB MIT Phi-4-mini-reasoning-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
deepseek-r1-14b GPU chat-completion 10.27 GB MIT deepseek-r1-distill-qwen-14b-generic-gpu
-------------------------------------------------------------------------------------------------------------------------------
deepseek-r1-7b GPU chat-completion 5.58 GB MIT deepseek-r1-distill-qwen-7b-generic-gpu
------------------------------------------------------------------------------------------------------------------------------
phi-4-mini GPU chat-completion 3.72 GB MIT Phi-4-mini-instruct-generic-gpu
----------------------------------------------------------------------------------------------------------------------
qwen2.5-0.5b GPU chat-completion 0.68 GB apache-2.0 qwen2.5-0.5b-instruct-generic-gpu
CPU chat-completion 0.80 GB apache-2.0 qwen2.5-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-0.5b GPU chat-completion 0.52 GB apache-2.0 qwen2.5-coder-0.5b-instruct-generic-gpu
CPU chat-completion 0.80 GB apache-2.0 qwen2.5-coder-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-1.5b GPU chat-completion 1.51 GB apache-2.0 qwen2.5-1.5b-instruct-generic-gpu
CPU chat-completion 1.78 GB apache-2.0 qwen2.5-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-7b GPU chat-completion 5.20 GB apache-2.0 qwen2.5-7b-instruct-generic-gpu
CPU chat-completion 6.16 GB apache-2.0 qwen2.5-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-1.5b GPU chat-completion 1.25 GB apache-2.0 qwen2.5-coder-1.5b-instruct-generic-gpu
CPU chat-completion 1.78 GB apache-2.0 qwen2.5-coder-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-7b GPU chat-completion 4.73 GB apache-2.0 qwen2.5-coder-7b-instruct-generic-gpu
CPU chat-completion 6.16 GB apache-2.0 qwen2.5-coder-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------------
qwen2.5-14b GPU chat-completion 9.30 GB apache-2.0 qwen2.5-14b-instruct-generic-gpu
CPU chat-completion 11.06 GB apache-2.0 qwen2.5-14b-instruct-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-14b GPU chat-completion 8.79 GB apache-2.0 qwen2.5-coder-14b-instruct-generic-gpu
CPU chat-completion 11.06 GB apache-2.0 qwen2.5-coder-14b-instruct-generic-cpu
The GPU variants indicate that the models offer Metal acceleration on Apple silicon.
The model list depends on the hardware and platform, which leads us to one of the great value propositions of Foundry Local - its compatibility with Windows ARM, a notoriously neglected platform. Foundry Local not only works on Windows ARM but even includes models optimized for Copilot+ PCs and their NPUs. This is very exciting, as very few orchestrators support Windows ARM at all, let alone with optimized models.
When I run the same command on my Windows ARM machine, I get the following set of models (note some NPU models):
PS C:\Users\filip> foundry model list
Alias Device Task File Size License Model ID
-----------------------------------------------------------------------------------------------
phi-4 CPU chat-completion 10.16 GB MIT Phi-4-generic-cpu
--------------------------------------------------------------------------------------------------------
phi-3.5-mini CPU chat-completion 2.53 GB MIT Phi-3.5-mini-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
deepseek-r1-14b NPU chat-completion 7.12 GB MIT deepseek-r1-distill-qwen-14b-qnn-npu
---------------------------------------------------------------------------------------------------------------------------
deepseek-r1-7b NPU chat-completion 3.71 GB MIT deepseek-r1-distill-qwen-7b-qnn-npu
--------------------------------------------------------------------------------------------------------------------------
phi-4-mini-reasoning NPU chat-completion 2.78 GB MIT Phi-4-mini-reasoning-qnn-npu
CPU chat-completion 4.52 GB MIT Phi-4-mini-reasoning-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
phi-3-mini-128k CPU chat-completion 2.54 GB MIT Phi-3-mini-128k-instruct-generic-cpu
---------------------------------------------------------------------------------------------------------------------------
phi-3-mini-4k CPU chat-completion 2.53 GB MIT Phi-3-mini-4k-instruct-generic-cpu
-------------------------------------------------------------------------------------------------------------------------
mistral-7b-v0.2 CPU chat-completion 4.07 GB apache-2.0 mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
-------------------------------------------------------------------------------------------------------------------------------------
qwen2.5-0.5b CPU chat-completion 0.80 GB apache-2.0 qwen2.5-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-0.5b CPU chat-completion 0.80 GB apache-2.0 qwen2.5-coder-0.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-1.5b CPU chat-completion 1.78 GB apache-2.0 qwen2.5-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------
qwen2.5-7b CPU chat-completion 6.16 GB apache-2.0 qwen2.5-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-1.5b CPU chat-completion 1.78 GB apache-2.0 qwen2.5-coder-1.5b-instruct-generic-cpu
------------------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-7b CPU chat-completion 6.16 GB apache-2.0 qwen2.5-coder-7b-instruct-generic-cpu
----------------------------------------------------------------------------------------------------------------------------
qwen2.5-14b CPU chat-completion 11.06 GB apache-2.0 qwen2.5-14b-instruct-generic-cpu
-----------------------------------------------------------------------------------------------------------------------
qwen2.5-coder-14b CPU chat-completion 11.06 GB apache-2.0 qwen2.5-coder-14b-instruct-generic-cpu
The models are downloaded on demand and cached for future use. A model can be downloaded either when starting an inference session (foundry model run {model id}) or with an explicit command (foundry model download {model id}):
➜ /Users/filipw/dev foundry model download qwen2.5-1.5b
Downloading model...
[####################################] 100.00 % [Time remaining: about 0s] 34.6 MB/s
Tips:
- To find model cache location use: foundry cache location
- To find models already downloaded use: foundry cache ls
Once it's there, you can simply chat with it in the terminal:
➜ /Users/filipw/dev foundry model run qwen2.5-coder-0.5b
Model qwen2.5-coder-0.5b was found in the local cache.
🕗 Loading model...
🟢 Model qwen2.5-coder-0.5b-instruct-generic-gpu loaded successfully
Interactive Chat. Enter /? or /help for help.
Interactive mode, please enter your prompt
> hello who are you
🤖 Hello! I am a computer program designed to assist users with information and tasks. How can I assist you today?
Service mode
A model can also be loaded into the Foundry Local service - you can then access it via the OpenAI-compatible REST API, which is available at http://localhost:5273/v1.
➜ /Users/filipw/dev foundry model load qwen2.5-1.5b
🕗 Loading model...
🟢 Model qwen2.5-1.5b loaded successfully
The loaded models can then be viewed with:
➜ /Users/filipw/dev foundry service list
Models running in service:
Alias Model ID
🟢 qwen2.5-coder-0.5b qwen2.5-coder-0.5b-instruct-generic-gpu
🟢 qwen2.5-1.5b qwen2.5-1.5b-instruct-generic-gpu
The API is compatible with the OpenAI API, so you can use it with any OpenAI-compatible client library, such as openai or azure-ai-inference. Note that you should use the “long” model ID in the request, not the alias.
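For example, here is a minimal sketch using the openai Python package against the local endpoint - my own illustration, not official sample code. It assumes the qwen2.5-1.5b model has been loaded as shown above; the API key is just a placeholder, since the local service does not appear to require one (as the empty credential in the example further down suggests).

from openai import OpenAI

# Point the standard OpenAI client at the local Foundry endpoint
client = OpenAI(
    base_url="http://localhost:5273/v1",
    api_key="not-needed-locally",  # placeholder - the local endpoint should not need a real key
)

# Remember to use the full model ID, not the alias
response = client.chat.completions.create(
    model="qwen2.5-1.5b-instruct-generic-gpu",
    messages=[{"role": "user", "content": "hello who are you"}],
)
print(response.choices[0].message.content)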
Some time ago I used a toy classification example to illustrate how to switch between models running at different providers when using azure-ai-inference client library. The same example works with Foundry Local. Below is the code:
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

instruction = """You are a medical classification engine for health conditions. Classify the prompt into one of the following possible treatment options: 'doctor_required' (serious condition), 'pharmacist_required' (light condition) or 'rest_required' (general tiredness). If you cannot classify the prompt, output 'unknown'.
Only respond with the single word classification. Do not produce any additional output.

# Examples:
User: "I did not sleep well." Assistant: "rest_required"
User: "I chopped off my arm." Assistant: "doctor_required"

# Task
User:
"""

user_inputs = [
    "I'm tired.",  # rest_required
    "I'm bleeding from my eyes.",  # doctor_required
    "I have a headache."  # pharmacist_required
]

def run_inference():
    for user_input in user_inputs:
        messages = [{
            "role": "user",
            "content": f"{instruction}{user_input} Assistant: "
        }]

        print(f"{user_input} -> ", end="")
        stream = client.complete(
            messages=messages,
            stream=True,
            model="qwen2.5-1.5b-instruct-generic-gpu"
        )

        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="")
        print()

client = ChatCompletionsClient(
    endpoint="http://localhost:5273/v1",
    credential=AzureKeyCredential(""),
)

run_inference()
This uses our Foundry Local service running locally (the model being qwen2.5-1.5b-instruct-generic-gpu), and the output is:
I'm tired. -> rest_required
I'm bleeding from my eyes. -> doctor_required
I have a headache. -> pharmacist_required
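And since azure-ai-inference is provider-agnostic, pointing the same code at a hosted endpoint is mostly a matter of swapping the endpoint, credential and model name. A hypothetical sketch - the endpoint URL, environment variable and model name below are placeholders, not real values, so check your provider's documentation:

import os

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for a hosted inference endpoint
client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_API_KEY"]),
)

response = client.complete(
    messages=[{"role": "user", "content": "hello who are you"}],
    model="<your-deployed-model>",  # placeholder model name
)
print(response.choices[0].message.content)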
Known limitations
I’ve been working with Foundry Local over the last little while, and it has certainly come a long way in that time. It does feel very stable and usable at this point, and is an integral part of my workflows now. But bear in mind that it is still in public preview and has some limitations - all of which are tracked on GitHub.
The two that I personally found most annoying: On Windows ARM, every version upgrade results in a broken state and a reboot is needed. It is also currently not possible to access the service REST API from inside WSL (or from outside of localhost, for that matter).
But all things considered, Foundry Local is a very promising tool for running AI models locally, especially on Windows ARM. It is easy to install, has a good set of pre-packaged models, and provides a simple CLI and REST API for interacting with them.
Have a go yourself!