Gilligan Tech
Platform · Cloudflare Workers AI

Inference at the edge. Milliseconds from any user.

When response time is everything, Cloudflare Workers AI runs inference at the network edge — in 300+ data centres globally, milliseconds from your users, with no cold starts and no GPU provisioning. For real-time suggestions, autocomplete, and lightweight AI triage, edge inference changes what's possible.

  • <50ms Edge Latency: Inference runs in the same data centre that serves your web traffic.
  • 300+ Global PoPs: Cloudflare's network spans 300+ cities, so your users are never far from a compute node.
  • 0 Cold Starts: Workers are always warm. No container spin-up delays, no tail latency spikes.
  • In-region Data Residency: Data processed at the edge never leaves the region it enters. GDPR-friendly by design.
Capabilities

What edge AI unlocks for your product.

Real-Time Inference
Sub-50ms · No cold starts · Always-on

Edge inference is the right choice for any user-facing feature where latency is perceptible, such as smart autocomplete, real-time content suggestions, and inline sentiment scoring on support ticket input. Workers AI eliminates the round-trip to a cloud data centre.

  • Llama 3.1 8B (edge) — Sub-50ms reasoning for common queries at the network edge
  • Mistral 7B (edge) — Lightweight general reasoning with fast first-token latency
  • Streaming responses — Token-by-token streaming for chat interfaces that feel instant
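
To make the streaming bullet concrete, here is a minimal sketch of how a client might pull tokens out of a streamed response. It assumes the stream arrives as server-sent events whose `data:` lines carry JSON with a `response` field and end with `data: [DONE]`; that is our understanding of the Workers AI streaming format, so check it against the current documentation. `extractTokens` is a hypothetical helper name, not part of any SDK.

```typescript
// Hypothetical helper: pull token text out of a chunk of server-sent events.
// Assumed wire format (verify against the current Workers AI docs):
//   data: {"response":"Hel"}
//   data: {"response":"lo"}
//   data: [DONE]
function extractTokens(sseText: string): string[] {
  const tokens: string[] = [];
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and comments
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    try {
      const parsed: unknown = JSON.parse(payload);
      if (typeof parsed === "object" && parsed !== null && "response" in parsed) {
        const token = (parsed as { response: unknown }).response;
        if (typeof token === "string") tokens.push(token);
      }
    } catch {
      // Ignore incomplete JSON; a real client would buffer until the line completes.
    }
  }
  return tokens;
}

// In a Worker, the stream itself would come from the AI binding, for example:
//   const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct",
//     { messages, stream: true });
//   return new Response(stream, { headers: { "content-type": "text/event-stream" } });
```

Appending each token to the chat UI as it arrives is what makes the interface feel instant, even when the full answer takes longer to complete.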
Edge Audio Transcription
Whisper · In-region · GDPR-safe

Whisper Large v3 on Workers AI transcribes audio at the edge — meeting recordings, voice notes, call-centre audio — without audio data leaving the region it was recorded in. Ideal for GDPR-sensitive audio processing pipelines.

  • Whisper Large v3 — State-of-the-art transcription, processed in-region
  • Multi-language — Transcription in 99 languages with automatic language detection
  • Timestamped output — Word-level timestamps for meeting minutes and search indexing
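
As an illustration of what word-level timestamps make possible, the sketch below groups timed words into transcript lines for meeting minutes. The `TimedWord` shape (a word plus start and end times in seconds) is an assumption about the transcription output, and the 0.8-second pause threshold is an arbitrary illustrative default.

```typescript
// Assumed shape of one word-level timestamp entry (seconds from start of audio).
interface TimedWord {
  word: string;
  start: number;
  end: number;
}

interface TranscriptLine {
  start: number;
  end: number;
  text: string;
}

// Group words into lines, starting a new line whenever the pause between
// consecutive words exceeds `maxGapSeconds` (an illustrative default).
function groupIntoLines(words: TimedWord[], maxGapSeconds = 0.8): TranscriptLine[] {
  const lines: TranscriptLine[] = [];
  for (const w of words) {
    const last = lines[lines.length - 1];
    if (last !== undefined && w.start - last.end <= maxGapSeconds) {
      last.text += " " + w.word.trim();
      last.end = w.end;
    } else {
      lines.push({ start: w.start, end: w.end, text: w.word.trim() });
    }
  }
  return lines;
}

// Format seconds as mm:ss for a minutes document or a search index.
function formatTimestamp(seconds: number): string {
  const total = Math.floor(seconds);
  const mm = String(Math.floor(total / 60)).padStart(2, "0");
  const ss = String(total % 60).padStart(2, "0");
  return mm + ":" + ss;
}
```

Each resulting line carries its own start time, so a search hit can link straight to the matching moment in the recording.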
Intelligent Triage
Pre-classify · Route · Cost-gate

Run a fast, lightweight model at the edge before routing to an expensive cloud model. Workers AI acts as a cost gate: simple queries are resolved at the edge, and complex ones are escalated to GPT-4o or Gemini 1.5 Pro. Depending on the share of queries that resolve at the edge, this alone can cut inference costs by 40–60%.

  • Fast pre-classification — Tag query complexity at the edge, route accordingly
  • Intent detection — Identify user intent before invoking a full RAG pipeline
  • Cost-gating — Resolve simple queries at edge cost; escalate complex ones
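
The cost gate described above can be sketched in a few lines. This is a simplified model, not our production router: the `Classification` shape, the 0.8 confidence threshold, and the cost model (every query pays the edge cost, only escalated queries also pay the cloud cost) are all assumptions made for illustration.

```typescript
type Route = "edge" | "cloud";

// Hypothetical output of the edge pre-classifier.
interface Classification {
  complexity: "simple" | "complex";
  confidence: number; // 0..1, as reported by the classifier
}

// Resolve at the edge only when the query is simple AND the classifier is confident.
// The 0.8 threshold is illustrative; tune it against your own traffic.
function chooseRoute(c: Classification, minConfidence = 0.8): Route {
  return c.complexity === "simple" && c.confidence >= minConfidence ? "edge" : "cloud";
}

// Estimate the fractional saving from cost-gating under a simple model:
// every query pays the edge cost, and only escalated queries also pay the cloud cost.
// The baseline is sending every query to the cloud model.
function estimatedSaving(edgeShare: number, edgeCost: number, cloudCost: number): number {
  const gatedCostPerQuery = edgeCost + (1 - edgeShare) * cloudCost;
  return 1 - gatedCostPerQuery / cloudCost;
}
```

Under this model the saving is roughly the share of queries resolved at the edge minus the edge-to-cloud cost ratio. For example, if half of all queries resolve at the edge and an edge call costs a tenth of a cloud call, `estimatedSaving(0.5, 1, 10)` gives about 0.4, a 40% reduction. These inputs are illustrative, not measurements.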
Global Reach, Local Compliance
300+ PoPs · In-region processing · Zero egress

With 300+ global points of presence, Workers AI ensures users in any geography get low-latency inference. Regional data boundaries are respected automatically — EU users' data is processed in EU data centres, US data stays in the US.

  • Regional routing — Requests routed to the nearest Cloudflare PoP automatically
  • GDPR data residency — EU inference stays within EU; configurable per-region
  • Zero egress costs — Cloudflare's flat-rate model eliminates data transfer fees
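
For teams that want an explicit residency check in their own code, here is a minimal application-level sketch. It does not describe how Cloudflare itself pins processing to a region; it only shows a guard you could add before forwarding data to an upstream service. `regionForCountry` and `mayForward` are hypothetical helpers, the region buckets are an assumption, and the country list covers the 27 EU member states only.

```typescript
type Region = "eu" | "us" | "other";

// ISO 3166-1 alpha-2 codes of the 27 EU member states.
// Extend this list if your policy also covers the EEA or the UK.
const EU_COUNTRIES: string[] = [
  "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI",
  "FR", "DE", "GR", "HU", "IE", "IT", "LV", "LT", "LU",
  "MT", "NL", "PL", "PT", "RO", "SK", "SI", "ES", "SE",
];

// Map a visitor's country code to the region bucket whose policy should apply.
// In a Worker, the code would typically come from `request.cf?.country`.
function regionForCountry(country: string | undefined): Region {
  if (country === undefined) return "other";
  const code = country.toUpperCase();
  if (code === "US") return "us";
  return EU_COUNTRIES.indexOf(code) !== -1 ? "eu" : "other";
}

// Illustrative policy: EU and US payloads may only be forwarded to an upstream
// in the same region; traffic from other regions is not restricted here.
function mayForward(userRegion: Region, upstreamRegion: Region): boolean {
  return userRegion === "other" || userRegion === upstreamRegion;
}
```

A guard like this is most useful at the escalation step, where a request may leave the edge for a cloud model hosted elsewhere.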
Architecture

How Gilligan Tech uses edge inference.

  1. Edge entry: User requests arrive at the nearest Cloudflare PoP. A Worker intercepts the request before it reaches origin servers, so there is no round-trip latency to a cloud region.
  2. Fast classification: A lightweight edge model (Llama 3.1 8B or Mistral 7B) classifies the request in under 50ms. Simple, high-confidence queries are resolved here.
  3. Smart escalation: Complex queries, multi-document lookups, or low-confidence edge results are forwarded to the appropriate cloud model — Gemini, GPT-4o, or Llama on Bedrock.
  4. Response caching: Cloudflare's edge cache stores common query responses. Repeated queries on the same content return in single-digit milliseconds with zero inference cost.
  5. Unified logging: Edge inference events are forwarded to the same audit log as cloud inference — giving a complete picture of cost, latency, and resolution tier per query.
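
The steps above can be sketched as a single request handler. This is an illustration under stated assumptions rather than our production code: the model calls are injected as plain functions, an in-memory `Map` stands in for Cloudflare's edge cache, the cache is checked first so repeated queries skip inference entirely, and the 0.8 confidence threshold is arbitrary. All names here are hypothetical.

```typescript
type Tier = "cache" | "edge" | "cloud";

interface EdgeResult {
  answer: string;
  confidence: number; // 0..1, hypothetical score from the edge model
}

interface LogEvent {
  query: string;
  tier: Tier;
  latencyMs: number;
}

// Injected dependencies, so the flow can be exercised without real models.
interface Deps {
  runEdge: (query: string) => Promise<EdgeResult>;
  runCloud: (query: string) => Promise<string>;
  log: (event: LogEvent) => void;
  now: () => number;
}

// Normalise a query so trivially different spellings share one cache entry.
function cacheKey(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, " ");
}

// Check the cache, try the edge model, escalate on low confidence,
// cache the answer, and log which tier resolved the query.
async function handleQuery(
  query: string,
  cache: Map<string, string>,
  deps: Deps,
  minConfidence = 0.8
): Promise<string> {
  const started = deps.now();
  const key = cacheKey(query);
  let tier: Tier;
  let answer = cache.get(key);
  if (answer !== undefined) {
    tier = "cache";
  } else {
    const edge = await deps.runEdge(query);
    if (edge.confidence >= minConfidence) {
      tier = "edge";
      answer = edge.answer;
    } else {
      tier = "cloud";
      answer = await deps.runCloud(query);
    }
    cache.set(key, answer);
  }
  deps.log({ query, tier, latencyMs: deps.now() - started });
  return answer;
}
```

Because every path ends in the same `log` call, the audit trail records the resolution tier and latency for cached, edge, and cloud answers alike.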
Model Reference

Workers AI models we deploy.

Model                    Latency          Best for
Llama 3.1 8B Instruct    <50ms            Edge reasoning, fast Q&A, real-time suggestions
Mistral 7B Instruct      <40ms            Lightweight general reasoning, autocomplete, triage
Whisper Large v3         Near-real-time   Audio transcription in-region (EU, US, APAC)
BGE Small EN v1.5        <10ms            Fast edge embeddings for semantic classification
BAAI BGE M3              <15ms            Multilingual edge embeddings for global deployments
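
To show how the embedding models in the table support semantic classification, here is a minimal sketch that assigns a query to the closest labelled prototype by cosine similarity. The embeddings themselves are assumed to come from an embedding model such as BGE Small; the vectors in the usage note are toy values, and `classifyByEmbedding` is a hypothetical helper.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// A label with a representative embedding, e.g. the embedding of a typical query.
interface LabelPrototype {
  label: string;
  embedding: number[];
}

// Return the label whose prototype is most similar to the query embedding.
function classifyByEmbedding(
  query: number[],
  prototypes: LabelPrototype[]
): { label: string; score: number } {
  if (prototypes.length === 0) throw new Error("at least one prototype is required");
  let best = { label: prototypes[0].label, score: cosineSimilarity(query, prototypes[0].embedding) };
  for (const p of prototypes.slice(1)) {
    const score = cosineSimilarity(query, p.embedding);
    if (score > best.score) best = { label: p.label, score };
  }
  return best;
}

// In a Worker, embeddings would come from the AI binding, for example:
//   const out = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: [query] });
// The exact response shape is an assumption to verify against the current docs.
```

With toy prototypes `billing = [1, 0, 0]` and `support = [0, 1, 0]`, a query embedding of `[0.9, 0.1, 0]` is classified as `billing`. The similarity score can double as the confidence signal that decides whether to escalate.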

See edge inference in your product.

We'll show you how Workers AI can add real-time AI features to your existing web application — with sub-100ms response and no infrastructure changes.