Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Off-host LLM stance: demote same-host (throwaway test only), demos steer to GPU host, LAN-IP placeholder, 3B/CPU framing, cross-link Remote and Cloud LLM Models

Where to physically run the LLM endpoint relative to TServer — the tradeoffs, the four topologies, and the sizing recommendations. The REST-over-OpenAI-compatible-POST design makes every topology a one-field configuration change; this page covers when to use which.

AI IntegrationLocal AI → Deployment and Performance

Version 10.1.5+


Info
titleTL;DR

LLM inference is CPU-heavy. A single call on a 7B-class model uses every available core for 1–30 seconds. If the LLM runs on the same host as TServer, the runtime competes with the model for CPU and operator-facing latency suffers. For production deployments with non-trivial AI workload, run Run the LLM on a separate host or VMGPU machine — or a remote / cloud endpoint — not on the FrameworX/TServer host. This is the floor even for demos. A small CPU-only model on the TServer box produces only ~2-4 tokens/sec, too slow even for a demonstration. The REST surface is OpenAI-compatible HTTP — only the URL field in SolutionCapabilities[LocalAI].Settings changes. For remote and cloud placement specifically, see Remote and Cloud LLM Models.

Why this matters

FrameworX Local AI calls an OpenAI-compatible chat-completions endpoint over HTTP. Both consumption paths — the ChatRequest Display action and the AI.Execute script API — issue the POST from inside the TServer process or one of its server-domain children (ScriptTaskServer.exe, ReportServer.exe). See Local AI Architecture Reference § Where Local AI runs for the per-process breakdown.

What that page does not address is the endpoint on the other side of the POST — where the actual LLM model loads, decodes, and generates tokens. That host has different resource profile from TServer:

  • LLM inference saturates CPU. A modern 7B-class model running CPU-only generates roughly 3–5 only ~2–4 tokens per second on a typical x64 server , — too slow even for a demonstration — and uses every available core to do it. A 50-token response is ~10–15 ~15–25 seconds of 100% CPU; a 200-token narrative is ~40–60 secondsover a minute. There is no "throttle" knob — the model uses what it has.

  • TServer also wants CPU. The runtime polls devices, runs scripts, evaluates expressions, dispatches alarms, and serves clients. None of those are CPU-heavy individually, but they are latency-sensitive: a missed scan slows control loops, a delayed heartbeat shows as Disconnected on operator panels.

  • Co-location starves both. When TServer and the LLM share a host, every active LLM call effectively pauses the runtime for the duration of inference. Symptoms: Property Watch panels freeze, the Designer→TServer heartbeat times out at 5 seconds, the HTML5 client feels sluggish, alarm reactions delay by seconds.

This is not a FrameworX limitation — it is the nature of CPU-bound large-model inference. The fix is topological: move the inference workload off the runtime's critical pathhost entirely — onto a separate GPU machine or a remote / cloud endpoint.


The four topologies

Topology

Hardware

Latency profile

When to choose

1. Same host — LLM runs on the TServer machine

16+ GB RAM, modern 8+ core CPU. GPU optional but unused if absent.

Inference latency native (no network). TServer contention during every call.Workshop / demo / engineering laptop. Single operator. Occasional chat. Low alarm-narrative volume. Default for fresh installs; CPU-only runs at only ~2-4 tok/s.

Throwaway test only — not recommended. The model competes with the runtime for CPU, and CPU-only inference is too slow (~2-4 tok/s) even for a demonstration. For any demo, training class, or real use, run the model on a separate GPU machine (topology 3) or a cloud endpoint (topology 4) instead.

2. Separate VM on same hardware — LLM in a sibling VM

32+ GB RAM split, host hypervisor pins CPU cores per VM.

Inference latency native (loopback or same-host network). TServer isolated from LLM CPU use, but the LLM still shares the physical CPU unless a GPU is passed through.

Existing hardware budget allows separation; ops team manages one box. Good A middle ground when adding a new physical host is friction — but without a passed-through GPU the model is still CPU-bound and slow.

3. Separate physical host — dedicated LLM server

Dedicated box, 16–64 GB RAM, optional GPU strongly recommended (cuts latency 3–5×). Can be a gaming PC, dev workstation, refurbished workstation, or a small server.

Inference latency native to that host; +1–3 ms network. TServer never sees LLM load. Multiple FX runtimes can share one LLM host.

Production deployment with regular Recommended for demos, training, and production. Regular AI use. Multiple solutions sharing one LLM. GPU acceleration desired.

4. Cloud LLM endpoint — off-premises managed model

Outbound HTTPS only. SecuritySecrets-backed API key.

Network round-trip dominates (50–500 ms) for short prompts; inference itself can be 2–10× faster than CPU-only local. Throughput limited by provider rate-limits.

Strong external LLM justified (GPT-class quality, multi-modal needed). No air-gap constraint. Per-call cost acceptable. Sensitive data redacted before send. See Remote and Cloud LLM Models.


The REST surface enables all four

Local AI calls an OpenAI-compatible chat/completions endpoint over HTTP. The TServer process opens the TCP connection; the endpoint can be on localhost, on a sibling VM separate GPU machine at 192.168.x1.y50, on a dedicated server at llm-host.lan, or at https://api.example-provider.com — replace 192.168.1.50 with the IP of the machine running your model (the model must not run on the FrameworX/TServer host). From FrameworX's side, all four cases are the same code path with a different URL string.

Switching topology after the solution is built is a single field edit on the AI Engine tile in Solution → Capabilities. The JSON shapes for each topology — local Ollama, remote Ollama, OpenAI-compatible cloud with Bearer token, endpoint with extra headers — are documented on Local AI Configuration § Pointing at a different LLM endpoint, and the remote / cloud guidance on Remote and Cloud LLM Models.

Authentication and credential handling for non-local endpoints is covered on SecuritySecrets Authentication for Local AI. The short version: API keys go in the SecuritySecrets vault and are referenced from the Authorization or Headers field via /secret:<Name> tokens — they never appear in plain text in the solution.


Sizing recommendations

Same-

Separate GPU host

topology (workshop / demo)

(demos, training, production) — recommended

  • Model: qwen2.5:3b-instruct (~2 GB) — the limited-hardware fallback. On a no-GPU workshop or demo laptop this is the model that runs end-to-end on CPU. The recommended default is qwen2.5:7b-instruct, which expects a GPU (see Production topology below).

  • RAM: 16 GB total (4 GB headroom for Ollama + the model, 4 GB for TServer, rest for the OS / client). Smaller works for the LLM but cuts other margins.

  • CPU: Any modern x64 with 8+ cores. More cores = faster token generation; the model is fully parallel.

  • GPU: Optional. If present, Ollama auto-detects and uses it — latency drops 3–5×. A 4 GB consumer GPU comfortably runs the 3B model.

  • Expect: ~10–15 sec per medium-length operator chat reply on CPU; ~3–5 sec with a small GPU.

Production topology (separate host)

  • Model: qwen2.5:7b-instruct (~4.7 GB) — the FrameworX-recommended default quality / reasoning tier, the floor even for demos — or qwen2.5:32b-instruct for maximum reasoning if the LLM host has the GPU and RAM. Tool-call reliability (the ChatRequest action’s autonomous tool dispatch) is solid at 7B and improves further at 32B.

  • RAM on the LLM host: 16 GB minimum for 7B, 32 GB+ recommended for the 32B tier. The model loads fully into RAM (or VRAM) and stays warm.

  • CPU on the LLM host: 8+ cores if no GPU; the inference parallelizes across all of them — but CPU-only is still only ~2-4 tok/s, so a GPU is strongly preferred.

  • GPU on the LLM host: Strongly recommended for production. An entry consumer GPU (RTX 3060 / 4060 class) typically runs 7B at native human-conversation speed. A 24 GB GPU comfortably runs the 32B tier.

  • Network: any reliable link, sub-10 ms latency. The Local AI client uses a 60-second per-call wall-clock timeout (configurable via TimeoutSeconds in the Settings JSON), so even occasional network blips do not break the surface.

  • Expect: 1–3 sec per operator chat reply with a GPU; ~5–15 sec on . CPU-only with on a dedicated host (still is still ~2-4 tok/s (much better than co-located because TServer is not waiting on CPU, but still too slow for interactive chat — add a GPU).

Cloud topology

  • Model: match the provider’s recommended instruct model. FrameworX requires only OpenAI-compatible chat-completions; the model selection is opaque to the platform. See Remote and Cloud LLM Models.

  • Authentication: always SecuritySecrets-backed (see referenced page above). Hard-coded API keys are a deployment anti-pattern.

  • Data sensitivity: the prompt the LLM sees can include live tag values, alarm context, batch IDs — treat that as exfiltrated data. Compliance review before pointing FX at a third-party endpoint.

  • Cost model: per-token billing. For a busy alarm-narrative workload (one narrative per critical alarm raise, ~150 input + ~80 output tokens), budget by alarm volume.

Same host — not recommended

  • Throwaway test only. If you must run the model on the TServer box for a quick wiring check, qwen2.5:3b-instruct (~2 GB) is the only model small enough to load — but CPU-only it produces only ~2-4 tokens/sec, too slow even for a demonstration, and it competes with the runtime for CPU. It is a last-resort for single-shot atomic tasks only, never interactive chat. For a demo or a training class, do not use the same host — run 7B on a separate GPU machine (topology 3) instead.


Decision matrix — pick a topology by workload

Workload

Same host

Sibling VM

Dedicated host

Cloud

Demo / training class

Yes

Overkill

Overkill

No — too slow, ~2-4 tok/s

Only with GPU passthrough

Yes

YesAvoid (network / cost dependency)

Single operator, occasional chat

Yes

Fine

No — too slow on CPU

Only with GPU passthrough

YesFine

Fine

Multiple operators, regular chat

Painful No (queueing + CPU starvation)

Acceptable with GPU passthrough

Yes

Yes

Alarm-narrative on every critical alarm

No (TServer pause-on-narrate)

Acceptable with GPU passthrough

Yes

Yes

End-of-shift batch summaries

Acceptable (off-peak, single-shot atomic)

Acceptable

Yes

Yes

Multiple FX solutions sharing one LLM

No

Yes

Yes

Yes (rate-limit-aware)

Air-gapped site, IP-sensitive prompts

YesTest only

Yes (GPU passthrough)

Yes

No

Per-tick reasoning in fast scan loops

No

No (still CPU-bound)

Only with GPU

Only with cloud rate budget


Watch for these symptoms when co-located

If you suspect the LLM is stealing CPU from TServer, watch for these in the same-host topology (a reason to move the model off the host onto a separate GPU machine):

  • Designer Property Watch panels stop updating for tens of seconds at a time, then catch up in bursts. The Designer→TServer ServiceClient heartbeat is timing out at 5 seconds.

  • Alarm acknowledgment lag. Operator clicks Acknowledge; the alarm state visually updates several seconds later.

  • HTML5 client feels sluggish. Page refreshes pause; data bindings re-evaluate in chunks.

  • Scan-period overruns in the runtime log on Device or Script Tasks set to short periods (under 1 second).

  • Synchronous AI.Execute from a Script Task hangs the entire ScriptTaskServer.exe for the call duration. Other tasks queued behind it wait.

None of these are FrameworX defects — they are the operating-system scheduler responding to a CPU-bound peer process. Moving the LLM endpoint to another host a separate GPU machine (or a remote / cloud endpoint) eliminates all of them.


What defaults ship

  • Fresh customer install of FrameworX 10.1.5+. The AI Engine endpoint is pre-configured for local Ollama at ships a placeholder — http://127192.0168.01.150:11434/v1/chat/completions with qwen2.5:7b-instruct , the recommended default, which expects a GPU. On a no-GPU machine, switch to the qwen2.5:3b-instruct fallback. To change the model or point at a remote endpoint. The install may seed a loopback (127.0.0.1 / localhost) value, but that is a placeholder you MUST change to the IP of the machine running your model (the model must not run on the FrameworX/TServer host). 7B expects a GPU — run it on a separate GPU machine. To set the URL or model, edit Solution → Capabilities → AI Engine → Edit Configuration.

  • Shipped demo solutions (BottlingLine ML Demo, SolarPanels MCP Demo, LocalAI KnowledgeGraph Demo) and the Local AI Chat Session Example are configured for ship a local placeholder Ollama model so endpoint and model name — you must point the URL at the machine running your model before they work end-to-end out of the box. The recommended model is qwen2.5:7b-instruct; on a no-GPU laptop, switch the Name field to qwen2.5:3b-instruct (the limited-hardware fallback) for evaluation. separate GPU machine. Pull the model named in the solution’s AI Engine settings with ollama pull <name> on that host.

  • Production deployments override at Solution → Capabilities → AI Engine → Edit Configuration — pick qwen2.5:7b-instruct (or qwen2.5:32b-instruct) for the same host, or and point URL at a separate GPU machine, a remote Ollama / , or a cloud LLM endpoint. The Settings JSON written by the dialog takes precedence over the installed default.


See also


In this section...

Page Tree
root@parent