Where to physically run the LLM endpoint relative to TServer — the tradeoffs, the four topologies, and the sizing recommendations. The REST-over-OpenAI-compatible-POST design makes every topology a one-field configuration change; this page covers when to use which.
AI Integration → Local AI → Deployment and Performance
Version 10.1.5+
| Info | ||
|---|---|---|
| ||
LLM inference is CPU-heavy. A single call on a 7B-class model uses every available core for 1–30 seconds. If the LLM runs on the same host as TServer, the runtime competes with the model for CPU and operator-facing latency suffers. For production deployments with non-trivial AI workload, run the LLM on a separate host or VM. The REST surface is OpenAI-compatible HTTP — only the |
Why this matters
FrameworX Local AI calls an OpenAI-compatible chat-completions endpoint over HTTP. Both consumption paths — the ChatRequest Display action and the AI.Execute script API — issue the POST from inside the TServer process or one of its server-domain children (ScriptTaskServer.exe, ReportServer.exe). See Local AI Architecture Reference § Where Local AI runs for the per-process breakdown.
What that page does not address is the endpoint on the other side of the POST — where the actual LLM model loads, decodes, and generates tokens. That host has different resource profile from TServer:
LLM inference saturates CPU. A modern 7B-class model running CPU-only generates roughly 3–5 tokens per second on a typical x64 server, and uses every available core to do it. A 50-token response is ~10–15 seconds of 100% CPU; a 200-token narrative is ~40–60 seconds. There is no "throttle" knob — the model uses what it has.
TServer also wants CPU. The runtime polls devices, runs scripts, evaluates expressions, dispatches alarms, and serves clients. None of those are CPU-heavy individually, but they are latency-sensitive: a missed scan slows control loops, a delayed heartbeat shows as Disconnected on operator panels.
Co-location starves both. When TServer and the LLM share a host, every active LLM call effectively pauses the runtime for the duration of inference. Symptoms: Property Watch panels freeze, the Designer→TServer heartbeat times out at 5 seconds, the HTML5 client feels sluggish, alarm reactions delay by seconds.
This is not a FrameworX limitation — it is the nature of CPU-bound large-model inference. The fix is topological: move the inference workload off the runtime's critical path.
The four topologies
Topology | Hardware | Latency profile | When to choose |
|---|---|---|---|
1. Same host — LLM runs on the TServer machine | 16+ GB RAM, modern 8+ core CPU. GPU optional but unused if absent. | Inference latency native (no network). TServer contention during every call. | Workshop / demo / engineering laptop. Single operator. Occasional chat. Low alarm-narrative volume. Default for fresh installs. |
2. Separate VM on same hardware — LLM in a sibling VM | 32+ GB RAM split, host hypervisor pins CPU cores per VM. | Inference latency native (loopback or same-host network). TServer isolated from LLM CPU use. | Existing hardware budget allows separation; ops team manages one box. Good middle ground when adding a new physical host is friction. |
3. Separate physical host — dedicated LLM server | Dedicated box, 16–64 GB RAM, optional GPU (cuts latency 3–5×). Can be a refurbished workstation or a small server. | Inference latency native to that host; +1–3 ms network. TServer never sees LLM load. Multiple FX runtimes can share one LLM host. | Production deployment with regular AI use. Multiple solutions sharing one LLM. GPU acceleration desired. |
4. Cloud LLM endpoint — off-premises managed model | Outbound HTTPS only. SecuritySecrets-backed API key. | Network round-trip dominates (50–500 ms) for short prompts; inference itself can be 2–10× faster than CPU-only local. Throughput limited by provider rate-limits. | Strong external LLM justified (GPT-class quality, multi-modal needed). No air-gap constraint. Per-call cost acceptable. Sensitive data redacted before send. |
The REST surface enables all four
Local AI calls an OpenAI-compatible chat/completions endpoint over HTTP. The TServer process opens the TCP connection; the endpoint can be on localhost, on a sibling VM at 192.168.x.y, on a dedicated server at llm-host.lan, or at https://api.example-provider.com. From FrameworX's side, all four cases are the same code path with a different URL string.
Switching topology after the solution is built is a single field edit on the AI Engine tile in Solution → Capabilities. The JSON shapes for each topology — local Ollama, remote Ollama, OpenAI-compatible cloud with Bearer token, endpoint with extra headers — are documented on Local AI Configuration § Pointing at a different LLM endpoint.
Authentication and credential handling for non-local endpoints is covered on SecuritySecrets Authentication for Local AI. The short version: API keys go in the SecuritySecrets vault and are referenced from the Authorization or Headers field via /secret:<Name> tokens — they never appear in plain text in the solution.
Sizing recommendations
Same-host topology (workshop / demo)
Model:
qwen2.5:3b-instruct(~2 GB). This is what the shipping demos use precisely because it runs end-to-end on no-GPU workshop laptops.RAM: 16 GB total (4 GB headroom for Ollama + the model, 4 GB for TServer, rest for the OS / client). Smaller works for the LLM but cuts other margins.
CPU: Any modern x64 with 8+ cores. More cores = faster token generation; the model is fully parallel.
GPU: Optional. If present, Ollama auto-detects and uses it — latency drops 3–5×. A 4 GB consumer GPU comfortably runs the 3B model.
Expect: ~10–15 sec per medium-length operator chat reply on CPU; ~3–5 sec with a small GPU.
Production topology (separate host)
Model:
qwen2.5:7b-instruct(~4.7 GB) for the FrameworX-recommended quality / reasoning tier, or larger if the LLM host has the RAM and GPU. Tool-call reliability (theChatRequestaction’s autonomous tool dispatch) improves notably above 7B.RAM on the LLM host: 16 GB minimum, 32 GB recommended for the 14B tier. The model loads fully into RAM and stays warm.
CPU on the LLM host: 8+ cores if no GPU; the inference parallelizes across all of them.
GPU on the LLM host: Strongly recommended for production. An entry consumer GPU (RTX 3060 / 4060 class) typically runs 7B at native human-conversation speed. A 24 GB GPU comfortably runs 14B and above.
Network: any reliable link, sub-10 ms latency. The Local AI client uses a 60-second per-call wall-clock timeout (configurable via
TimeoutSecondsin the Settings JSON), so even occasional network blips do not break the surface.Expect: 1–3 sec per operator chat reply with a GPU; ~5–15 sec on CPU-only with a dedicated host (still much better than co-located because TServer is not waiting on CPU).
Cloud topology
Model: match the provider’s recommended instruct model. FrameworX requires only OpenAI-compatible chat-completions; the model selection is opaque to the platform.
Authentication: always SecuritySecrets-backed (see referenced page above). Hard-coded API keys are a deployment anti-pattern.
Data sensitivity: the prompt the LLM sees can include live tag values, alarm context, batch IDs — treat that as exfiltrated data. Compliance review before pointing FX at a third-party endpoint.
Cost model: per-token billing. For a busy alarm-narrative workload (one narrative per critical alarm raise, ~150 input + ~80 output tokens), budget by alarm volume.
Decision matrix — pick a topology by workload
Workload | Same host | Sibling VM | Dedicated host | Cloud |
|---|---|---|---|---|
Demo / training class | Yes | Overkill | Overkill | Avoid (network / cost dependency) |
Single operator, occasional chat | Yes | Fine | Fine | Fine |
Multiple operators, regular chat | Painful (queueing) | Acceptable | Yes | Yes |
Alarm-narrative on every critical alarm | No (TServer pause-on-narrate) | Acceptable | Yes | Yes |
End-of-shift batch summaries | Acceptable (off-peak) | Acceptable | Yes | Yes |
Multiple FX solutions sharing one LLM | No | Yes | Yes | Yes (rate-limit-aware) |
Air-gapped site, IP-sensitive prompts | Yes | Yes | Yes | No |
Per-tick reasoning in fast scan loops | No | No (still CPU-bound) | Only with GPU | Only with cloud rate budget |
Watch for these symptoms when co-located
If you suspect the LLM is stealing CPU from TServer, watch for these in the same-host topology:
Designer Property Watch panels stop updating for tens of seconds at a time, then catch up in bursts. The Designer→TServer ServiceClient heartbeat is timing out at 5 seconds.
Alarm acknowledgment lag. Operator clicks Acknowledge; the alarm state visually updates several seconds later.
HTML5 client feels sluggish. Page refreshes pause; data bindings re-evaluate in chunks.
Scan-period overruns in the runtime log on Device or Script Tasks set to short periods (under 1 second).
Synchronous
AI.Executefrom a Script Task hangs the entire ScriptTaskServer.exe for the call duration. Other tasks queued behind it wait.
None of these are FrameworX defects — they are the operating-system scheduler responding to a CPU-bound peer process. Moving the LLM endpoint to another host eliminates all of them.
What defaults ship
Fresh customer install of FrameworX 10.1.5+. The AI Engine endpoint is pre-configured for local Ollama at
http://127.0.0.1:11434/v1/chat/completionswithqwen2.5:3b-instruct— the demo-recommended 3B model that runs end-to-end on a no-GPU workshop laptop. Same-host topology. To move to the production-tier 7B model (or larger, or a remote endpoint), edit Solution → Capabilities → AI Engine → Edit Configuration.Shipped demo solutions (BottlingLine ML Demo, SolarPanels MCP Demo, LocalAI KnowledgeGraph Demo) and the Local AI Chat Example all ship with the same 3B configuration so they work end-to-end on a workshop laptop with no FrameworX configuration.
Production deployments override at Solution → Capabilities → AI Engine → Edit Configuration — pick
qwen2.5:7b-instruct(or a larger model) for the same host, or pointURLat a remote Ollama / cloud LLM endpoint. TheSettingsJSON written by the dialog takes precedence over the installed default.
See also
Local AI Configuration — JSON examples for each topology’s
Settingsvalue (local Ollama, remote Ollama, cloud, custom headers).Local AI - First Install Walkthrough — getting the local Ollama running for same-host topology.
SecuritySecrets Authentication for Local AI — storing API keys safely for cloud / remote endpoints.
Local AI Architecture Reference § Where Local AI runs — which FX-side processes initiate the LLM POST.
In this section...
| Page Tree | ||
|---|---|---|
|