Where to physically run the LLM endpoint relative to TServer — the tradeoffs, the four topologies, and the sizing recommendations. The REST-over-OpenAI-compatible-POST design makes every topology a one-field configuration change; this page covers when to use which.
AI Integration → Local AI → Deployment and Performance
Version 10.1.5+
|
LLM inference is CPU-heavy. A single call on a 7B-class model uses every available core for 1–30 seconds. If the LLM runs on the same host as TServer, the runtime competes with the model for CPU and operator-facing latency suffers. For production deployments with non-trivial AI workload, run the LLM on a separate host or VM. The REST surface is OpenAI-compatible HTTP — only the |
FrameworX Local AI calls an OpenAI-compatible chat-completions endpoint over HTTP. Both consumption paths — the ChatRequest Display action and the AI.Execute script API — issue the POST from inside the TServer process or one of its server-domain children (ScriptTaskServer.exe, ReportServer.exe). See Local AI Architecture Reference § Where Local AI runs for the per-process breakdown.
What that page does not address is the endpoint on the other side of the POST — where the actual LLM model loads, decodes, and generates tokens. That host has different resource profile from TServer:
LLM inference saturates CPU. A modern 7B-class model running CPU-only generates roughly 3–5 tokens per second on a typical x64 server, and uses every available core to do it. A 50-token response is ~10–15 seconds of 100% CPU; a 200-token narrative is ~40–60 seconds. There is no "throttle" knob — the model uses what it has.
TServer also wants CPU. The runtime polls devices, runs scripts, evaluates expressions, dispatches alarms, and serves clients. None of those are CPU-heavy individually, but they are latency-sensitive: a missed scan slows control loops, a delayed heartbeat shows as Disconnected on operator panels.
Co-location starves both. When TServer and the LLM share a host, every active LLM call effectively pauses the runtime for the duration of inference. Symptoms: Property Watch panels freeze, the Designer→TServer heartbeat times out at 5 seconds, the HTML5 client feels sluggish, alarm reactions delay by seconds.
This is not a FrameworX limitation — it is the nature of CPU-bound large-model inference. The fix is topological: move the inference workload off the runtime's critical path.
Topology |
Hardware |
Latency profile |
When to choose |
|---|---|---|---|
1. Same host — LLM runs on the TServer machine |
16+ GB RAM, modern 8+ core CPU. GPU optional but unused if absent. |
Inference latency native (no network). TServer contention during every call. |
Workshop / demo / engineering laptop. Single operator. Occasional chat. Low alarm-narrative volume. Default for fresh installs. |
2. Separate VM on same hardware — LLM in a sibling VM |
32+ GB RAM split, host hypervisor pins CPU cores per VM. |
Inference latency native (loopback or same-host network). TServer isolated from LLM CPU use. |
Existing hardware budget allows separation; ops team manages one box. Good middle ground when adding a new physical host is friction. |
3. Separate physical host — dedicated LLM server |
Dedicated box, 16–64 GB RAM, optional GPU (cuts latency 3–5×). Can be a refurbished workstation or a small server. |
Inference latency native to that host; +1–3 ms network. TServer never sees LLM load. Multiple FX runtimes can share one LLM host. |
Production deployment with regular AI use. Multiple solutions sharing one LLM. GPU acceleration desired. |
4. Cloud LLM endpoint — off-premises managed model |
Outbound HTTPS only. SecuritySecrets-backed API key. |
Network round-trip dominates (50–500 ms) for short prompts; inference itself can be 2–10× faster than CPU-only local. Throughput limited by provider rate-limits. |
Strong external LLM justified (GPT-class quality, multi-modal needed). No air-gap constraint. Per-call cost acceptable. Sensitive data redacted before send. |
Local AI calls an OpenAI-compatible chat/completions endpoint over HTTP. The TServer process opens the TCP connection; the endpoint can be on localhost, on a sibling VM at 192.168.x.y, on a dedicated server at llm-host.lan, or at https://api.example-provider.com. From FrameworX's side, all four cases are the same code path with a different URL string.
Switching topology after the solution is built is a single field edit on the AI Engine tile in Solution → Capabilities. The JSON shapes for each topology — local Ollama, remote Ollama, OpenAI-compatible cloud with Bearer token, endpoint with extra headers — are documented on Local AI Configuration § Pointing at a different LLM endpoint.
Authentication and credential handling for non-local endpoints is covered on SecuritySecrets Authentication for Local AI. The short version: API keys go in the SecuritySecrets vault and are referenced from the Authorization or Headers field via /secret:<Name> tokens — they never appear in plain text in the solution.
Model: qwen2.5:3b-instruct (~2 GB) — the limited-hardware fallback. On a no-GPU workshop or demo laptop this is the model that runs end-to-end on CPU. The recommended default is qwen2.5:7b-instruct, which expects a GPU (see Production topology below).
RAM: 16 GB total (4 GB headroom for Ollama + the model, 4 GB for TServer, rest for the OS / client). Smaller works for the LLM but cuts other margins.
CPU: Any modern x64 with 8+ cores. More cores = faster token generation; the model is fully parallel.
GPU: Optional. If present, Ollama auto-detects and uses it — latency drops 3–5×. A 4 GB consumer GPU comfortably runs the 3B model.
Expect: ~10–15 sec per medium-length operator chat reply on CPU; ~3–5 sec with a small GPU.
Model: qwen2.5:7b-instruct (~4.7 GB) — the FrameworX-recommended default quality / reasoning tier — or qwen2.5:32b-instruct for maximum reasoning if the LLM host has the GPU and RAM. Tool-call reliability (the ChatRequest action’s autonomous tool dispatch) is solid at 7B and improves further at 32B.
RAM on the LLM host: 16 GB minimum for 7B, 32 GB+ recommended for the 32B tier. The model loads fully into RAM (or VRAM) and stays warm.
CPU on the LLM host: 8+ cores if no GPU; the inference parallelizes across all of them.
GPU on the LLM host: Strongly recommended for production. An entry consumer GPU (RTX 3060 / 4060 class) typically runs 7B at native human-conversation speed. A 24 GB GPU comfortably runs the 32B tier.
Network: any reliable link, sub-10 ms latency. The Local AI client uses a 60-second per-call wall-clock timeout (configurable via TimeoutSeconds in the Settings JSON), so even occasional network blips do not break the surface.
Expect: 1–3 sec per operator chat reply with a GPU; ~5–15 sec on CPU-only with a dedicated host (still much better than co-located because TServer is not waiting on CPU).
Model: match the provider’s recommended instruct model. FrameworX requires only OpenAI-compatible chat-completions; the model selection is opaque to the platform.
Authentication: always SecuritySecrets-backed (see referenced page above). Hard-coded API keys are a deployment anti-pattern.
Data sensitivity: the prompt the LLM sees can include live tag values, alarm context, batch IDs — treat that as exfiltrated data. Compliance review before pointing FX at a third-party endpoint.
Cost model: per-token billing. For a busy alarm-narrative workload (one narrative per critical alarm raise, ~150 input + ~80 output tokens), budget by alarm volume.
Workload |
Same host |
Sibling VM |
Dedicated host |
Cloud |
|---|---|---|---|---|
Demo / training class |
Yes |
Overkill |
Overkill |
Avoid (network / cost dependency) |
Single operator, occasional chat |
Yes |
Fine |
Fine |
Fine |
Multiple operators, regular chat |
Painful (queueing) |
Acceptable |
Yes |
Yes |
Alarm-narrative on every critical alarm |
No (TServer pause-on-narrate) |
Acceptable |
Yes |
Yes |
End-of-shift batch summaries |
Acceptable (off-peak) |
Acceptable |
Yes |
Yes |
Multiple FX solutions sharing one LLM |
No |
Yes |
Yes |
Yes (rate-limit-aware) |
Air-gapped site, IP-sensitive prompts |
Yes |
Yes |
Yes |
No |
Per-tick reasoning in fast scan loops |
No |
No (still CPU-bound) |
Only with GPU |
Only with cloud rate budget |
If you suspect the LLM is stealing CPU from TServer, watch for these in the same-host topology:
Designer Property Watch panels stop updating for tens of seconds at a time, then catch up in bursts. The Designer→TServer ServiceClient heartbeat is timing out at 5 seconds.
Alarm acknowledgment lag. Operator clicks Acknowledge; the alarm state visually updates several seconds later.
HTML5 client feels sluggish. Page refreshes pause; data bindings re-evaluate in chunks.
Scan-period overruns in the runtime log on Device or Script Tasks set to short periods (under 1 second).
Synchronous AI.Execute from a Script Task hangs the entire ScriptTaskServer.exe for the call duration. Other tasks queued behind it wait.
None of these are FrameworX defects — they are the operating-system scheduler responding to a CPU-bound peer process. Moving the LLM endpoint to another host eliminates all of them.
Fresh customer install of FrameworX 10.1.5+. The AI Engine endpoint is pre-configured for local Ollama at http://127.0.0.1:11434/v1/chat/completions with qwen2.5:7b-instruct — the recommended default, which expects a GPU. On a no-GPU machine, switch to the qwen2.5:3b-instruct fallback. To change the model or point at a remote endpoint, edit Solution → Capabilities → AI Engine → Edit Configuration.
Shipped demo solutions (BottlingLine ML Demo, SolarPanels MCP Demo, LocalAI KnowledgeGraph Demo) and the Chat Session Example are configured for a local Ollama model so they work end-to-end out of the box. The recommended model is qwen2.5:7b-instruct; on a no-GPU laptop, switch the Name field to qwen2.5:3b-instruct (the limited-hardware fallback) for evaluation. Pull the model named in the solution’s AI Engine settings with ollama pull <name>.
Production deployments override at Solution → Capabilities → AI Engine → Edit Configuration — pick qwen2.5:7b-instruct (or qwen2.5:32b-instruct) for the same host, or point URL at a remote Ollama / cloud LLM endpoint. The Settings JSON written by the dialog takes precedence over the installed default.
Local AI Configuration — JSON examples for each topology’s Settings value (local Ollama, remote Ollama, cloud, custom headers).
Local AI - First Install Walkthrough — getting the local Ollama running for same-host topology.
SecuritySecrets Authentication for Local AI — storing API keys safely for cloud / remote endpoints.
Local AI Architecture Reference § Where Local AI runs — which FX-side processes initiate the LLM POST.
In this section...