Install the local LLM runtime (Ollama) and pull the recommended Qwen 2.5 instruct models on Windows, macOS, and Linux — fully offline, then point FrameworX at the endpoint.
AI Integration → Local AI → Installing Models (Windows, macOS, Linux)
FrameworX Local AI talks to an OpenAI-compatible chat-completions endpoint. The documented reference runtime is Ollama serving a Qwen 2.5 instruct model on http://localhost:11434/v1/chat/completions. Everything runs on the machine, with no internet connection — the runtime, the model weights, and every inference call stay local. Once a model is pulled, the network can be disconnected entirely and the AI features keep working.
This page covers installing the runtime and pulling the models on each operating system. For first-install orientation — hardware sizing, what to expect, and verifying the connection — see Local AI - First Install Walkthrough; for the full endpoint and tool-bit reference see Local AI Configuration.
FrameworX standardizes on the Qwen 2.5 instruct family. The choice between tiers is driven by tool-call reliability (the model must emit a well-formed JSON tool-call envelope for the chat + MCP-tool surface) and by the hardware on hand.
Model | Tier | Disk | Rough VRAM | When to use |
|---|---|---|---|---|
| Recommended default | ~4.7 GB | ~6 GB class | The default for most deployments — the recommended model for new solutions, demos, and templates. Best JSON tool-call reliability for the chat + MCP-tool surface; 7B is the floor below which the structured tool-call envelope starts to malform. Expects a machine with a GPU. |
| Limited-hardware fallback | ~2 GB | ~3 GB class | Lower speed and quality. Not recommended for the interactive chat experience; acceptable for atomic reporting, classification, and summary tasks. Use only when there is no GPU or the machine is very low spec. |
| Maximum performance | ~20 GB | ~20 GB VRAM class | Best reasoning and multi-step tool logic. Requires a strong GPU (roughly the 20 GB VRAM class). |
The model runs with no internet connection — everything is local. For the default 7B interactive chat experience a GPU is expected; CPU-only runs are slow and are best limited to the 3B atomic-task use. |
Ollama is the documented reference OpenAI-compatible local server. FrameworX also works with LM Studio (in OpenAI mode), vLLM, the llama.cpp server, or any other endpoint that speaks OpenAI-compatible chat-completions JSON — but Ollama is the runtime these instructions target, and the FrameworX defaults assume it. Whichever runtime you pick, the endpoint must answer at /v1/chat/completions for inference and (for Ollama, LM Studio, and vLLM) at /v1/models for the reachability listing.
Install Ollama with winget, or download the installer from https://ollama.com/download and run it. Ollama installs per-user — no administrator elevation is required.
winget install Ollama.Ollama |
After install, Ollama runs as a background service. Verify it is listening on port 11434:
# Confirm the service is reachable (lists installed models) curl http://localhost:11434/v1/models # Or check the listening port directly netstat -ano | findstr 11434 |
An NVIDIA CUDA GPU is auto-detected — Ollama offloads model layers to it transparently, with no configuration needed. This is the expected configuration for the 7B default.
To keep the model resident in memory between requests and avoid paying the cold-load cost on every call, set OLLAMA_KEEP_ALIVE as a system environment variable:
# System environment variable (run as admin), then restart Ollama setx OLLAMA_KEEP_ALIVE 24h /M |
The default keep-alive is 5 minutes; idle longer than that and the next call pays the cold-load again.
Install either the Ollama macOS app (download the .dmg from https://ollama.com/download and drag it to Applications) or via Homebrew:
brew install ollama |
Launch Ollama (open the app, or run ollama serve if installed via Homebrew). The Apple-Silicon GPU (Metal) is used automatically — no configuration needed. Verify the server:
curl http://localhost:11434/v1/models |
Set OLLAMA_KEEP_ALIVE so the model stays resident. For a launchd-managed Ollama, set it as a user environment variable with launchctl; for a shell-launched ollama serve, export it in your profile:
# launchd-managed install launchctl setenv OLLAMA_KEEP_ALIVE 24h # Or, for a shell-launched server (add to ~/.zshrc) export OLLAMA_KEEP_ALIVE=24h |
Install with the official one-line script from ollama.com:
curl -fsSL https://ollama.com/install.sh | sh |
The installer registers Ollama as a systemd service and starts it. Verify and check status:
systemctl status ollama curl http://localhost:11434/v1/models |
If an NVIDIA GPU is present with the proper drivers, the install script detects it and Ollama offloads layers automatically — the expected configuration for the 7B default. On a server with no GPU, expect slow inference; limit such machines to the 3B atomic-task use.
Set OLLAMA_KEEP_ALIVE in the systemd unit so the model stays resident across requests. Add it under [Service] via a drop-in:
# Create a drop-in: /etc/systemd/system/ollama.service.d/keepalive.conf [Service] Environment="OLLAMA_KEEP_ALIVE=24h" |
Then reload and restart:
sudo systemctl daemon-reload sudo systemctl restart ollama |
Pull the recommended default first. The 3B fallback and the 32B maximum-performance tier are optional — pull only what your hardware supports. The same commands work identically on Windows, macOS, and Linux.
# Recommended default (most deployments; expects a GPU) ollama pull qwen2.5:7b-instruct # Limited-hardware fallback (no GPU / very low spec) ollama pull qwen2.5:3b-instruct # Maximum performance (requires a strong GPU) ollama pull qwen2.5:32b-instruct |
Confirm what is installed with ollama list:
ollama list |
Multiple tiers can coexist on disk; pulling one does not remove another. Pulling requires internet access (or a pre-seeded model store); once pulled, inference is fully offline. Disk and rough VRAM per tier:
Model | Disk | Rough VRAM |
|---|---|---|
| ~2 GB | ~3 GB class |
| ~4.7 GB | ~6 GB class |
| ~20 GB | ~20 GB VRAM class |
VRAM figures are approximate and depend on quantization and context length. When the model does not fit in VRAM, Ollama runs the remaining layers on CPU — functional, but slower.
Once the runtime is serving and the model is pulled, connect the solution in Designer:
Name field to the model you pulled — for example qwen2.5:7b-instruct. This value goes into the POST body's "model" field and must match a model the endpoint can serve.URL field to http://localhost:11434/v1/chat/completions for a local runtime, or to http://<host-ip>:11434/v1/chat/completions to reach a runtime on a separate (typically GPU) host.Full field-by-field reference for the Settings JSON (URL, Name, Authorization, Headers, Info, TimeoutSeconds), the master enable, and the tool-category bits is on Local AI Configuration.