Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Reframe defaults: 7b recommended default (GPU-expected), 3b limited-hardware fallback, 32b max-performance; fresh-install default 3b?7b; demo-ships wording made asset-agnostic; 14b?32b sizing

Where to physically run the LLM endpoint relative to TServer — the tradeoffs, the four topologies, and the sizing recommendations. The REST-over-OpenAI-compatible-POST design makes every topology a one-field configuration change; this page covers when to use which.

AI IntegrationLocal AI → Deployment and Performance

Version 10.1.5+


Info
titleTL;DR

LLM inference is CPU-heavy. A single call on a 7B-class model uses every available core for 1–30 seconds. If the LLM runs on the same host as TServer, the runtime competes with the model for CPU and operator-facing latency suffers. For production deployments with non-trivial AI workload, run the LLM on a separate host or VM. The REST surface is OpenAI-compatible HTTP — only the URL field in SolutionCapabilities[LocalAI].Settings changes.

Why this matters

FrameworX Local AI calls an OpenAI-compatible chat-completions endpoint over HTTP. Both consumption paths — the ChatRequest Display action and the AI.Execute script API — issue the POST from inside the TServer process or one of its server-domain children (ScriptTaskServer.exe, ReportServer.exe). See Local AI Architecture Reference § Where Local AI runs for the per-process breakdown.

What that page does not address is the endpoint on the other side of the POST — where the actual LLM model loads, decodes, and generates tokens. That host has different resource profile from TServer:

  • LLM inference saturates CPU. A modern 7B-class model running CPU-only generates roughly 3–5 tokens per second on a typical x64 server, and uses every available core to do it. A 50-token response is ~10–15 seconds of 100% CPU; a 200-token narrative is ~40–60 seconds. There is no "throttle" knob — the model uses what it has.

  • TServer also wants CPU. The runtime polls devices, runs scripts, evaluates expressions, dispatches alarms, and serves clients. None of those are CPU-heavy individually, but they are latency-sensitive: a missed scan slows control loops, a delayed heartbeat shows as Disconnected on operator panels.

  • Co-location starves both. When TServer and the LLM share a host, every active LLM call effectively pauses the runtime for the duration of inference. Symptoms: Property Watch panels freeze, the Designer→TServer heartbeat times out at 5 seconds, the HTML5 client feels sluggish, alarm reactions delay by seconds.

This is not a FrameworX limitation — it is the nature of CPU-bound large-model inference. The fix is topological: move the inference workload off the runtime's critical path.


The four topologies

Topology

Hardware

Latency profile

When to choose

1. Same host — LLM runs on the TServer machine

16+ GB RAM, modern 8+ core CPU. GPU optional but unused if absent.

Inference latency native (no network). TServer contention during every call.

Workshop / demo / engineering laptop. Single operator. Occasional chat. Low alarm-narrative volume. Default for fresh installs.

2. Separate VM on same hardware — LLM in a sibling VM

32+ GB RAM split, host hypervisor pins CPU cores per VM.

Inference latency native (loopback or same-host network). TServer isolated from LLM CPU use.

Existing hardware budget allows separation; ops team manages one box. Good middle ground when adding a new physical host is friction.

3. Separate physical host — dedicated LLM server

Dedicated box, 16–64 GB RAM, optional GPU (cuts latency 3–5×). Can be a refurbished workstation or a small server.

Inference latency native to that host; +1–3 ms network. TServer never sees LLM load. Multiple FX runtimes can share one LLM host.

Production deployment with regular AI use. Multiple solutions sharing one LLM. GPU acceleration desired.

4. Cloud LLM endpoint — off-premises managed model

Outbound HTTPS only. SecuritySecrets-backed API key.

Network round-trip dominates (50–500 ms) for short prompts; inference itself can be 2–10× faster than CPU-only local. Throughput limited by provider rate-limits.

Strong external LLM justified (GPT-class quality, multi-modal needed). No air-gap constraint. Per-call cost acceptable. Sensitive data redacted before send.


The REST surface enables all four

Local AI calls an OpenAI-compatible chat/completions endpoint over HTTP. The TServer process opens the TCP connection; the endpoint can be on localhost, on a sibling VM at 192.168.x.y, on a dedicated server at llm-host.lan, or at https://api.example-provider.com. From FrameworX's side, all four cases are the same code path with a different URL string.

Switching topology after the solution is built is a single field edit on the AI Engine tile in Solution → Capabilities. The JSON shapes for each topology — local Ollama, remote Ollama, OpenAI-compatible cloud with Bearer token, endpoint with extra headers — are documented on Local AI Configuration § Pointing at a different LLM endpoint.

Authentication and credential handling for non-local endpoints is covered on SecuritySecrets Authentication for Local AI. The short version: API keys go in the SecuritySecrets vault and are referenced from the Authorization or Headers field via /secret:<Name> tokens — they never appear in plain text in the solution.


Sizing recommendations

Same-host topology (workshop / demo)

  • Model: qwen2.5:3b-instruct (~2 GB) . This is what the shipping demos use precisely because it — the limited-hardware fallback. On a no-GPU workshop or demo laptop this is the model that runs end-to-end on no-GPU workshop laptopsCPU. The recommended default is qwen2.5:7b-instruct, which expects a GPU (see Production topology below).

  • RAM: 16 GB total (4 GB headroom for Ollama + the model, 4 GB for TServer, rest for the OS / client). Smaller works for the LLM but cuts other margins.

  • CPU: Any modern x64 with 8+ cores. More cores = faster token generation; the model is fully parallel.

  • GPU: Optional. If present, Ollama auto-detects and uses it — latency drops 3–5×. A 4 GB consumer GPU comfortably runs the 3B model.

  • Expect: ~10–15 sec per medium-length operator chat reply on CPU; ~3–5 sec with a small GPU.

Production topology (separate host)

  • Model: qwen2.5:7b-instruct (~4.7 GB) for the FrameworX-recommended default quality / reasoning tier , or larger — or qwen2.5:32b-instruct for maximum reasoning if the LLM host has the GPU and RAM and GPU. Tool-call reliability (the ChatRequest action’s autonomous tool dispatch) improves notably above 7Bis solid at 7B and improves further at 32B.

  • RAM on the LLM host: 16 GB minimum for 7B, 32 GB+ recommended for the 14B 32B tier. The model loads fully into RAM (or VRAM) and stays warm.

  • CPU on the LLM host: 8+ cores if no GPU; the inference parallelizes across all of them.

  • GPU on the LLM host: Strongly recommended for production. An entry consumer GPU (RTX 3060 / 4060 class) typically runs 7B at native human-conversation speed. A 24 GB GPU comfortably runs 14B and abovethe 32B tier.

  • Network: any reliable link, sub-10 ms latency. The Local AI client uses a 60-second per-call wall-clock timeout (configurable via TimeoutSeconds in the Settings JSON), so even occasional network blips do not break the surface.

  • Expect: 1–3 sec per operator chat reply with a GPU; ~5–15 sec on CPU-only with a dedicated host (still much better than co-located because TServer is not waiting on CPU).

Cloud topology

  • Model: match the provider’s recommended instruct model. FrameworX requires only OpenAI-compatible chat-completions; the model selection is opaque to the platform.

  • Authentication: always SecuritySecrets-backed (see referenced page above). Hard-coded API keys are a deployment anti-pattern.

  • Data sensitivity: the prompt the LLM sees can include live tag values, alarm context, batch IDs — treat that as exfiltrated data. Compliance review before pointing FX at a third-party endpoint.

  • Cost model: per-token billing. For a busy alarm-narrative workload (one narrative per critical alarm raise, ~150 input + ~80 output tokens), budget by alarm volume.


Decision matrix — pick a topology by workload

Workload

Same host

Sibling VM

Dedicated host

Cloud

Demo / training class

Yes

Overkill

Overkill

Avoid (network / cost dependency)

Single operator, occasional chat

Yes

Fine

Fine

Fine

Multiple operators, regular chat

Painful (queueing)

Acceptable

Yes

Yes

Alarm-narrative on every critical alarm

No (TServer pause-on-narrate)

Acceptable

Yes

Yes

End-of-shift batch summaries

Acceptable (off-peak)

Acceptable

Yes

Yes

Multiple FX solutions sharing one LLM

No

Yes

Yes

Yes (rate-limit-aware)

Air-gapped site, IP-sensitive prompts

Yes

Yes

Yes

No

Per-tick reasoning in fast scan loops

No

No (still CPU-bound)

Only with GPU

Only with cloud rate budget


Watch for these symptoms when co-located

If you suspect the LLM is stealing CPU from TServer, watch for these in the same-host topology:

  • Designer Property Watch panels stop updating for tens of seconds at a time, then catch up in bursts. The Designer→TServer ServiceClient heartbeat is timing out at 5 seconds.

  • Alarm acknowledgment lag. Operator clicks Acknowledge; the alarm state visually updates several seconds later.

  • HTML5 client feels sluggish. Page refreshes pause; data bindings re-evaluate in chunks.

  • Scan-period overruns in the runtime log on Device or Script Tasks set to short periods (under 1 second).

  • Synchronous AI.Execute from a Script Task hangs the entire ScriptTaskServer.exe for the call duration. Other tasks queued behind it wait.

None of these are FrameworX defects — they are the operating-system scheduler responding to a CPU-bound peer process. Moving the LLM endpoint to another host eliminates all of them.


What defaults ship

  • Fresh customer install of FrameworX 10.1.5+. The AI Engine endpoint is pre-configured for local Ollama at http://127.0.0.1:11434/v1/chat/completions with qwen2.5:3b7b-instruct — the demo-recommended 3B model that runs end-to-end on recommended default, which expects a GPU. On a no-GPU workshop laptop. Same-host topology. To move to the production-tier 7B model (or larger, or a remote endpoint)machine, switch to the qwen2.5:3b-instruct fallback. To change the model or point at a remote endpoint, edit Solution → Capabilities → AI Engine → Edit Configuration.

  • Shipped demo solutions (BottlingLine ML Demo, SolarPanels MCP Demo, LocalAI KnowledgeGraph Demo) and the Local AI Chat Example all ship with the same 3B configuration are configured for a local Ollama model so they work end-to-end on a workshop laptop with no FrameworX configurationout of the box. The recommended model is qwen2.5:7b-instruct; a copy configured for qwen2.5:3b-instruct runs on a no-GPU laptop for evaluation. Pull the model named in the solution’s AI Engine settings with ollama pull <name>.

  • Production deployments override at Solution → Capabilities → AI Engine → Edit Configuration — pick qwen2.5:7b-instruct (or a larger model qwen2.5:32b-instruct) for the same host, or point URL at a remote Ollama / cloud LLM endpoint. The Settings JSON written by the dialog takes precedence over the installed default.


See also


In this section...

Page Tree
root@parent