◈ Open Source Project

Enox AI

A production-grade, multi-provider AI platform with streaming chat, multimodal generation, custom agents, and a full admin dashboard.

Full Technical Case Study
Created by Yad Qasim
Published March 2026

01 — Executive Summary

Enox AI is a full-stack, open-source AI platform that unifies access to the world's leading language models, image generators, video creators, and text-to-speech engines through a single, beautifully designed interface.

The platform implements a three-tier monorepo architecture: a Next.js 16 user-facing application, an Express.js API server, and a separate Next.js admin dashboard, all backed by Supabase (PostgreSQL) with Row-Level Security.

Key capabilities include real-time token-by-token streaming via Server-Sent Events, multimodal input support (images, voice, PDFs, files), inline media generation through AI tool calling, a custom agent studio with live testing, per-user rate limiting with admin override controls, and a Bring-Your-Own-Key (BYOK) system that gives users unlimited access with their own API keys.

6 AI Providers · 4 Media Modalities · 18 API Endpoints · 841 Lines in AI Core

01b — Product Thinking & UX Decisions

Before writing a single line of code, I identified real user problems in the AI tool landscape and made deliberate product decisions to solve them. This section explains the why behind every major design choice.

The User Problems

When I started building Enox, I kept running into the same frustrations that every AI power-user hits:

Vendor Lock-In

ChatGPT, Claude, and Gemini each live in their own walled garden. Switching between them means separate accounts, separate UIs, and separate billing. If you want to compare model outputs on the same prompt, you're juggling browser tabs.

💰 Cost Confusion

Users either pay $20/month per provider (expensive if you use multiple) or manage raw API keys with no usage visibility. There's no middle ground between "consumer subscription" and "raw developer API."

🔓 No Customization

ChatGPT's "Custom GPTs" are limited — you can't set temperature, top_p, or max_tokens. You can't live-test different system prompts side by side. Power users need an Agent Studio, not a wizard.

📷 Fragmented Media

Image generation, TTS, and video are separate tools. You can't ask a chat model to "make me a logo" and get an image inline — you have to switch to DALL-E or Midjourney. The AI experience is scattered across a dozen apps.

Who Is This For?

Enox targets two distinct user personas, and every feature maps to one or both:

🧑 AI Power Users

Developers, writers, and researchers who use multiple AI models daily. They want model comparison, BYOK for unlimited usage, fine-grained agent control (temperature, top_p), and a single interface that does text + image + audio + video.

Key features for them: BYOK system, Agent Studio with live testing, model pinning, thinking mode toggle, multimodal attachments.

🏭 Platform Operators

Team leads or small companies who want to host a private AI portal for their team. They need admin controls: model management, per-user rate limits, usage analytics, and the ability to rotate API keys without disrupting users.

Key features for them: Admin dashboard, per-user model limits, usage tracking with own-key separation, centralized model CRUD.

UX Design Decisions

Every UI choice was driven by a specific user need or pain point I observed:

🎨 Model-Adaptive Composer

Problem: Users were confused when selecting a TTS model but seeing a standard chat input. Solution: The composer dynamically changes based on model type — emerald-themed with voice pills for TTS, purple-themed with Sparkles icon for image/video, and standard chat with attachments for text. This immediately communicates what the model expects without any instructions.

💬 Inline Media Generation (Not Separate Pages)

Problem: Switching between "chat" and "image generator" tabs destroys conversational context. Solution: Media generation happens inline via tool calling. You ask "draw a sunset" in a normal chat, and the image appears in the conversation. The AI decides when to generate, keeping the experience conversational rather than transactional.

👁 Thinking Mode as a Toggle, Not Default

Problem: Extended thinking is useful for complex reasoning but annoying for quick questions — it adds latency and visual noise. Solution: A one-click toggle in the composer lets users opt into thinking mode per-message. When active, thinking streams into a collapsible panel so it doesn't overwhelm the response. Casual users never see it; power users love it.

📌 Model Pinning with Quick-Switch Dropdown

Problem: With 6+ providers and dozens of models, the model selector becomes overwhelming. Solution: Users pin up to 3 favorite models. The dropdown organizes by Pinned → Recent → All, so your daily-drivers are always one click away. Type badges (IMG/VID/TTS) provide instant visual cues about model capability.

🔒 Privacy-First Attachment Handling

Problem: Users hesitate to upload sensitive documents (contracts, medical images) to AI platforms that might store them. Solution: Attachments are explicitly never stored in the database. Only a placeholder like [Image sent] is persisted. The user can verify this — chat history shows only placeholders, not their actual files. This was a deliberate trust-building UX decision, not just a technical one.

Skeleton Shimmer for Media Generation

Problem: Image/video generation takes 5-30 seconds. Without feedback, users think the app is broken and re-submit. Solution: A generating SSE event immediately shows a type-appropriate shimmer skeleton (image aspect ratio, audio waveform, or video player shape). This communicates "I'm working on it" within 200ms, even though the media won't arrive for seconds.

Design Philosophy: The guiding principle was "one interface, zero context-switching." Every modal, every tab, every separate page is a place users get lost. Enox keeps everything in the chat — text, images, audio, video, agent testing — because that's where the user's mental model already lives.

02 — Technology Stack

Every technology was chosen for a specific engineering reason — performance, developer experience, or production reliability.

Frontend (User App & Admin App)

Next.js 16 · React 19 · TypeScript · Tailwind CSS 3.4 · Framer Motion · Radix UI · Zustand · Lucide Icons · React Markdown

Backend (API Server)

Express.js 4.18 · Node.js 18+ · OpenAI SDK 4.24 · Google GenAI SDK · Zod Validation · Helmet Security · pdf-parse

Infrastructure & Database

Supabase · PostgreSQL · Row-Level Security · Google OAuth 2.0 · JWT Authentication

AI Providers Supported

OpenAI (GPT-4o, DALL-E 3, TTS-1) · Anthropic (Claude) · Google (Gemini 2.5, Imagen 3, Veo 2) · Mistral · Groq · OpenRouter

02b — Technology Tradeoffs

Every technology choice involved rejecting an alternative. This section documents what I chose, what I didn't, and — most importantly — why.

Server-Sent Events over WebSockets

I chose SSE because the data flow is strictly unidirectional: the server streams tokens to the client. WebSockets are bidirectional, which adds complexity (heartbeat pings, reconnection state, socket lifecycle management) for zero benefit here. SSE also works natively with HTTP/2 multiplexing, auto-reconnects on disconnect via the EventSource API, and passes cleanly through every reverse proxy and CDN without special upgrade headers. WebSockets require an HTTP Upgrade handshake that some corporate firewalls and load balancers block.

When I'd pick WebSockets instead: If Enox had real-time collaborative editing or live presence indicators (multiple users typing), I'd need true bidirectional communication. For a chat streaming use case, SSE is strictly simpler and more reliable.

Supabase over Raw PostgreSQL

I chose Supabase because it bundles three things I'd otherwise have to build myself: authentication (Google OAuth with JWT out of the box), Row-Level Security policies with a clean SDK, and a hosted Postgres instance with connection pooling. Setting up raw Postgres + a custom auth server + session management + OAuth provider integration would have taken weeks and introduced security surface area I'd have to maintain forever.

The tradeoff: Supabase's RLS is powerful but opaque — policy bugs are hard to debug because queries silently return empty results instead of throwing errors. I also can't use advanced Postgres features like logical replication or custom extensions without their approval. For this project, the speed-to-ship and built-in auth far outweighed those limitations.

Express.js over Fastify

I chose Express because the AI SDK ecosystem is built around it. The OpenAI SDK examples, Anthropic streaming guides, and Google GenAI tutorials all use Express. Fastify is ~2× faster in raw benchmarks, but the bottleneck in an AI chat app is never the HTTP framework — it's the AI provider response time (800-3000ms). Saving 0.5ms on routing is meaningless when you're waiting 2 seconds for GPT-4o to think. Express also has a vastly larger middleware ecosystem (Helmet, CORS, express-rate-limit) that Just Works.

When I'd pick Fastify instead: For a high-throughput API that handles 10,000+ req/s where framework overhead matters (e.g., a REST API serving cached data). Enox's backend handles maybe 50 concurrent streams — Express is not the bottleneck.

Zustand over Redux / Context API

I chose Zustand because streaming chat creates extreme state update pressure — every token triggers a store update. Redux's reducer dispatch overhead and middleware chain add measurable latency at 30+ updates/second. React Context causes full subtree re-renders on every update (catastrophic for a chat UI). Zustand's useShallow selectors and direct state mutation give me surgical re-render control with zero boilerplate.

The tradeoff: Zustand has no built-in dev tools as mature as Redux DevTools, and the "single store" pattern can get unwieldy. I mitigated this by keeping the store interface clean and using localStorage persistence selectively (7 keys) rather than syncing everything.

Monorepo over Separate Repositories

I chose a monorepo because the three projects (app, backend, admin-app) share types, constants, and deployment context. With separate repos, a database schema change would require coordinated PRs across 3 repositories. In a monorepo, one commit can update the schema, backend route, and frontend type simultaneously. The mental overhead of "which repo has the bug?" disappears.

The tradeoff: No shared package extraction (like a /packages/types workspace) — types are duplicated between frontend and backend. I accepted this because the duplication is small (a few interfaces) and a Turborepo/Nx setup would add tooling complexity disproportionate to the project's size.

Native Gemini SDK + OpenAI SDK (Dual Path) over Single SDK

I chose to maintain two code paths because Google's OpenAI-compatible endpoint lies about streaming. It buffers the entire Gemini response (~5-8 seconds), then "streams" it as a burst of chunks. The native @google/genai SDK delivers true token-by-token streaming with ~200ms time-to-first-token. For the most popular free model (Gemini 2.5 Flash), this made the difference between "feels instant" and "feels broken."

The tradeoff: Two code paths means double the maintenance surface for streaming logic, error handling, and tool calling. Every new feature has to work on both paths. I mitigated this with the yield* delegation pattern — both paths yield the same typed events to the same consumer, so the downstream code is unified.

Decision Framework: My general rule was: "Optimize for the bottleneck, not the benchmark." The bottleneck in an AI chat app is always AI provider latency (800-3000ms). Any technology choice that saves <10ms is irrelevant. Choices that save 200ms+ (like native Gemini streaming, connection pooling, parallel queries) are where I invested engineering effort.

03 — System Architecture

A clean three-tier monorepo with strict separation of concerns. The backend never exposes API keys to the frontend, and all data access is governed by Supabase RLS policies.

┌─────────────────────────────────────────────────────────────────────┐
│                         MONOREPO STRUCTURE                          │
├──────────────┬──────────────────┬──────────────┬───────────────────┤
│  /app        │  /backend        │  /admin-app  │  /supabase        │
│  Next.js 16  │  Express.js      │  Next.js 16  │  schema.sql       │
│  Port 3000   │  Port 3001       │  Port 3002   │  RLS Policies     │
│  User UI     │  REST + SSE API  │  Admin UI    │  DB Functions     │
└──────┬───────┴────────┬─────────┴──────┬───────┴───────────────────┘
       │                │                │
       │    ┌───────────▼───────────┐    │
       └────►     Supabase Auth     ◄────┘
            │    (Google OAuth)     │
            └───────────┬───────────┘
                        │
       ┌────────────────▼────────────────┐
       │        PostgreSQL (RLS)         │
       │ users │ models │ agents │ chats │
       │ messages │ usage_logs │ keys    │
       └────────────────┬────────────────┘
                        │
       ┌────────────────▼────────────────┐
       │     AI Provider Abstraction     │
       │   OpenAI │ Anthropic │ Google   │
       │  Mistral │ Groq │ OpenRouter    │
       └─────────────────────────────────┘

Directory Layout

enox/
├── app/                    # User-facing Next.js application
│   └── src/
│       ├── app/(app)/      # Route groups: chat, agents, models, settings...
│       ├── components/     # 21 component directories
│       │   ├── agents/     # AgentStudio, AgentsView
│       │   ├── chat/       # ChatView, MessageBubble, ModelSelector, VoiceRecorder...
│       │   ├── layout/     # Sidebar, AppShell
│       │   ├── settings/   # SettingsView, ApiKeysView
│       │   └── ...         # explore, models, auth, legal, usage, providers
│       ├── lib/            # api.ts, supabase.ts, utils.ts
│       └── store/          # useStore.ts (Zustand)
├── backend/                # Express.js API server
│   └── src/
│       ├── lib/            # aiProvider.js, rateLimit.js, supabase.js...
│       ├── middleware/     # auth.js (JWT + cache), errorHandler.js
│       └── routes/         # chat.js, agents.js, admin.js, generate.js...
├── admin-app/              # Admin dashboard (Next.js)
│   └── src/app/            # Models, Users, Usage management
└── supabase/               # schema.sql — 8 tables, 14 indexes, RLS policies

04 — Database Schema & Security

Eight tables with 14 optimized indexes, automatic timestamp triggers, and comprehensive Row-Level Security policies that enforce data isolation at the database level.

Table | Purpose | Key Columns | RLS Policy
users | User profiles (linked to auth.users) | id, email, name, avatar_url, role | Own profile read/write; admins full access
models | Admin-managed AI models | id, name, provider, model_id, api_key, daily_limit, model_type | Public read (active only); api_key never exposed to client
agents | Custom AI agents with system prompts | id, user_id, name, username, system_prompt, model_id, temperature, top_p, max_tokens | Own agents CRUD; public agents read-only
chats | Chat sessions | id, user_id, model_id, agent_id, title | Own chats only
messages | Chat messages (user/assistant/system) | id, chat_id, role, content | Messages of own chats only
usage_logs | Per-user, per-model daily usage tracking | user_id, model_id, message_count, own_key_count, date | Own usage read-only
user_model_limits | Admin-set per-user model limit overrides | user_id, model_id, daily_limit | Admin-only
user_api_keys | User's own provider API keys (BYOK) | user_id, provider, api_key | Own keys only

Database Functions

Security Note: The models.api_key column is never exposed to any frontend client. The backend uses a Supabase service-role key to access API keys server-side only. All client-facing model queries explicitly exclude api_key from the SELECT.

05 — Backend API Server

The Express.js backend handles authentication, rate limiting, AI provider abstraction, SSE streaming, multimodal processing, and media generation — all optimized for minimal latency.

5.1 Authentication & Middleware

Every protected route passes through authMiddleware, which validates the Supabase JWT, fetches the user profile, and caches both for 5 minutes to avoid redundant database lookups:

1. Token Extraction

Extracts Bearer token from the Authorization header. Returns 401 immediately if missing.

2. In-Memory Cache Check

Checks a Map<token, {profile, ts}> cache with a 5-minute TTL. If valid, skips DB calls entirely.

3. Supabase Verification

On cache miss: calls supabase.auth.getUser(token), then fetches the full user profile from public.users. Caches the result.

Admin routes use an additional adminMiddleware that checks req.user.role === 'admin' and returns 403 if unauthorized.
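The three steps above can be sketched as a small factory (a minimal sketch — the Supabase lookups are abstracted behind a `verifyToken` parameter, which is hypothetical here; the real middleware calls supabase.auth.getUser and queries public.users):

```javascript
// Token-cache layer of authMiddleware: 5-minute TTL, keyed by raw bearer token
const AUTH_CACHE_TTL_MS = 5 * 60 * 1000;
const authCache = new Map(); // token → { profile, ts }

function makeAuthMiddleware(verifyToken) {
  return async function authMiddleware(req, res, next) {
    // Step 1: extract Bearer token, 401 immediately if missing
    const header = req.headers['authorization'] || '';
    const token = header.startsWith('Bearer ') ? header.slice(7) : null;
    if (!token) return res.status(401).json({ error: 'Missing token' });

    // Step 2: in-memory cache check — skips all DB calls on a hit
    const cached = authCache.get(token);
    if (cached && Date.now() - cached.ts < AUTH_CACHE_TTL_MS) {
      req.user = cached.profile;
      return next();
    }

    // Step 3: verification + profile fetch on cache miss, then cache the result
    const profile = await verifyToken(token);
    if (!profile) return res.status(401).json({ error: 'Invalid token' });
    authCache.set(token, { profile, ts: Date.now() });
    req.user = profile;
    return next();
  };
}
```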

5.2 AI Provider Abstraction Layer

The aiProvider.js module (841 lines) is the core intelligence layer. It provides a unified streaming interface across all six providers:

OpenAI-Compatible Path

Uses the OpenAI SDK with configurable baseURL for OpenAI, Anthropic, Google (via compatibility layer), Mistral, Groq, and OpenRouter. TCP connections are cached per provider+key pair.

Native Gemini Path

Uses the @google/genai SDK directly for true token-by-token streaming. The OpenAI-compatible wrapper buffers Gemini's entire response before "streaming" it, adding ~8s of latency. The native path eliminates this.

Progressive Fallback

Gemini models attempt configurations from most features to least: (1) thinking + tools, (2) tools only, (3) plain. Each failure falls back to the next level automatically.

Client Caching

Both OpenAI and Google clients are cached in Map keyed by provider:apiKey. This reuses TCP/TLS connections, saving ~100-300ms per request.
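The client cache described above reduces to a few lines; a sketch where `createClient` stands in for the OpenAI/Google SDK constructors:

```javascript
const clientCache = new Map(); // "provider:apiKey" → SDK client instance

function getClient(provider, apiKey, createClient) {
  const cacheKey = `${provider}:${apiKey}`;
  let client = clientCache.get(cacheKey);
  if (!client) {
    // Cold path: constructing a new client implies a fresh TCP/TLS handshake
    client = createClient(provider, apiKey);
    clientCache.set(cacheKey, client);
  }
  return client; // warm path: reused connection (~100-300ms saved)
}
```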

5.3 Streaming Chat Pipeline

The POST /api/chat/send endpoint implements a highly optimized streaming pipeline designed to minimize perceived latency:

1. Instant SSE Open

The SSE stream opens immediately — before any database work. This cuts ~1s off perceived latency. A 2KB comment padding flushes proxy buffers (Cloudflare, nginx, Traefik).

2. Mega Batch Pre-flight

All pre-flight checks run in a single Promise.all(): model lookup, agent data, rate limit check, and user API keys fetch. This parallelizes 4 database queries into one round-trip.

3. API Key Resolution

If BYOK is enabled, uses the user's key (bypasses rate limiting). Otherwise, uses the platform's key. For tool calling (image/TTS generation), merges both key pools to maximize available providers.

4. Typed SSE Event Streaming

The stream yields typed events: thinking_start, thinking_content, thinking_done, chunk (text), generating (skeleton), media (base64), clear_content, done, and error.

5. Fire-and-Forget Persistence

After [DONE] is sent, the response ends immediately. Message saving and usage increment happen asynchronously in background promises — the user never waits for DB writes.
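The mega-batch pre-flight (step 2) can be sketched as a single Promise.all over four independent lookups — the total cost is the max of the four, not their sum. The query functions here are stand-ins for the real Supabase calls:

```javascript
async function preflight(q) {
  // All four lookups are independent, so they run concurrently
  const [model, agent, rateLimit, userKeys] = await Promise.all([
    q.fetchModel(),
    q.fetchAgent(),
    q.checkRateLimit(),
    q.fetchUserKeys(),
  ]);
  return { model, agent, rateLimit, userKeys };
}
```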

5.4 Tool Calling & Media Generation

Text models can invoke two tools via function calling: generate_image and generate_tts. The system handles both real function calling (OpenAI, Gemini) and fake tool-call detection for models that don't support it.

Fake Tool-Call Detection: When models like Claude or some OpenRouter models receive tool definitions but output raw JSON instead of proper function calls, the detectFakeToolCall() function parses the JSON output, identifies the intended tool (supporting DALL-E style, direct format, and type-hint formats), clears the text from the UI, and executes the generation transparently.

Image Generation Pipeline

Text-to-Speech Pipeline

Video Generation Pipeline

5.5 Multimodal Input Processing

The platform accepts images, voice recordings, PDFs, and text files as attachments. Each provider requires different multimodal formats:

Input Type | Gemini Format | OpenAI Format
Images | inlineData: { mimeType, data } | image_url: { url: "data:..." }
Audio | inlineData | input_audio: { data, format }
PDFs | inlineData | Extracted to text via pdf-parse, injected as [Content of attached PDF]
Text files | inlineData | Base64 decoded to UTF-8, injected as text block

Privacy Design: Attachments are sent as base64 in-memory to AI providers but never stored in the database. Only privacy-safe placeholder text like [Image sent], [Voice message sent], or [File sent: report.pdf] is persisted in the messages table.

5.6 Rate Limiting System

Rate limiting uses a monthly window with separate tracking for platform usage vs. own-key usage:

Rate Limit Computation — Mathematical Model

The rate-limiting algorithm uses a monthly sliding window with owner-key exemption. The effective usage is computed by subtracting own-key requests from total requests, ensuring BYOK users never consume platform quota:

Effective Usage Calculation

U_platform  = Σ_d max(message_count_d - own_key_count_d, 0)   // for each day d in [month_start, month_end)
L_effective = override.daily_limit ?? model.daily_limit ?? 25
allowed     = U_platform < L_effective
remaining   = max(L_effective - U_platform, 0)

The nullish coalescing chain (??) implements a three-tier priority system. If an admin has set a custom limit for this specific user+model pair, it takes precedence. Otherwise, the model's default limit is used. As a final fallback, the system defaults to 25.
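The formula and priority chain translate directly into code; a sketch, assuming `usageRows` holds the month's usage_logs rows for this user and model:

```javascript
const DEFAULT_DAILY_LIMIT = 25; // final fallback in the ?? chain

function checkRateLimit(usageRows, modelLimit, overrideLimit) {
  // Platform usage excludes BYOK requests, clamped at zero per day
  const platformUsage = usageRows.reduce(
    (sum, d) => sum + Math.max(d.message_count - d.own_key_count, 0), 0);
  // Three-tier priority: admin override → model default → system default
  const effectiveLimit = overrideLimit ?? modelLimit ?? DEFAULT_DAILY_LIMIT;
  return {
    allowed: platformUsage < effectiveLimit,
    remaining: Math.max(effectiveLimit - platformUsage, 0),
  };
}
```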

Usage Increment — Atomic Upsert

// Two separate counters maintained per (user, model, date) tuple:
message_count += 1                      // always incremented
own_key_count += usedOwnKey ? 1 : 0     // only if BYOK

// ON CONFLICT (user_id, model_id, date) DO UPDATE
//   → Guarantees atomic increment even under concurrent requests
//   → UNIQUE constraint on (user_id, model_id, date) prevents duplicate rows

05b — Request Lifecycle & Timing Analysis

A complete end-to-end trace of a single chat message, from the user pressing Enter to the final token rendered in the browser. Every millisecond is accounted for.

Full Request Sequence

Participants: Browser · Express · Supabase · AI Provider
1. Browser → Express  POST /api/chat/send

Client sends the message payload (modelId, message, attachments, settings). Express validates with Zod.parse(body).

2. Express → Browser  T+0ms

SSE stream opens instantly — before any DB work. Response headers flushed with a 2KB padding comment to force proxy buffer flush.

3. Express → Supabase  Promise.all()

Four parallel queries execute simultaneously: model lookup, agent data, rate limit check, user API keys. All 4 results return in a single round-trip.

Parallel results: model · agent · rateLimit · userKeys
4. Express (internal)  T+80ms

resolveApiKey() — O(1) lookup from pre-fetched keys. buildGenApiKeys() — merges user + platform keys for tool calling. insertChat() — creates new chat session (new chats only).

5. Express → Browser  meta:{chatId}  T+120ms

Chat ID is sent to the client. The frontend updates activeChatId in the Zustand store and prepends the new chat to the sidebar list.

6. Express → AI Provider  T+200ms

buildMessages() constructs the messages array, then stream.open() initiates an HTTP/2 stream to the AI provider via cached client connection.

7. AI Provider → Express → Browser  Thinking Phase

The AI model's internal reasoning streams back in real-time:

Events: thinking_start → thinking_content ×N → thinking_done
8. AI Provider → Express → Browser  Token Streaming  T+800ms first token

Text tokens stream through the async generator pipeline. Each token is forwarded as a chunk SSE event with <1ms relay overhead. The stream may also yield generating and media events for tool-called content.

Events: chunk ×N · generating · media · clear_content
9. Express → Browser  done + [DONE]  T+3200ms

Stream completes. done:{chatId} event signals the frontend to finalize the message and enable regeneration. [DONE] terminates the SSE connection.

10. Express → Supabase  Fire-and-Forget  T+end (async)

After the client connection is closed, two background promises persist data without blocking the user: message INSERT and usage increment via atomic upsert. Zero impact on perceived latency.

Background tasks: saveMessage() · incrementUsage()

Timing Breakdown — Critical Path Analysis

The critical path is the sequence of operations that cannot be parallelized. By moving all possible work off the critical path, time-to-first-token is minimized:

T+0ms: SSE Stream Opened

Response headers are flushed immediately. Content-Type: text/event-stream is set, X-Accel-Buffering: no disables nginx buffering, and socket.setNoDelay(true) disables Nagle's algorithm. A ~2KB SSE comment (':' followed by 2048 spaces and '\n\n') forces the proxy buffer to flush.

T+5ms → T+80ms: Mega-Batch Parallel Queries

Four independent database queries execute simultaneously via Promise.all(). Amortized cost: max(T_model, T_agent, T_rateLimit, T_keys) instead of T_model + T_agent + T_rateLimit + T_keys. Typical savings: ~60-200ms depending on Supabase region latency.

T+80ms → T+120ms: API Key Resolution & Chat Insert

Key resolution is O(1) lookup from the pre-fetched userKeyMap object — zero additional DB calls. For new chats, a single INSERT returns the UUID. For existing chats, message history is fetched (limited to last 20 messages to bound context window cost).

T+120ms → T+200ms: AI Provider Connection

Client lookup from the connection cache is O(1). If a cached client exists, TLS handshake is skipped (saving ~100-300ms). The stream.open() call initiates an HTTP/2 stream to the provider.

T+200ms → T+800ms: Provider-Side Processing (Thinking)

The AI model processes the prompt. During this period, thinking tokens stream back if supported. The client sees thinking_start instantly, with thinking_content events streaming reasoning in real-time.

T+800ms+: Token Streaming

First text token arrives. Each token is forwarded to the client as a chunk SSE event with <1ms relay overhead. The async generator yields tokens as they arrive — no buffering.

T+end (post-stream): Fire-and-Forget Persistence

[DONE] is sent to the client, then res.end() closes the connection. Two background Promises execute asynchronously: message INSERT and usage increment. The user never waits for these writes — they have zero impact on perceived latency.

Latency Budget Formula

Time-To-First-Token (TTFT)

TTFT = T_sse_open
     + max(T_model_q, T_agent_q, T_rate_q, T_keys_q)
     + T_key_resolve
     + T_chat_insert
     + T_ai_connect
     + T_ai_thinking

// Where:
T_sse_open    ≈ 0ms          // sync — no await
max(queries)  ≈ 60-80ms      // parallel, NOT sequential
T_key_resolve ≈ 0ms          // O(1) Map lookup from pre-fetched data
T_chat_insert ≈ 30-50ms      // single INSERT (new chat only)
T_ai_connect  ≈ 50-100ms     // with cached client (vs 200-400ms cold)
T_ai_thinking ≈ 400-2000ms   // model-dependent, not optimizable

// Typical TTFT: ~600-2300ms (dominated by AI provider latency)
// Without optimizations: ~1800-4500ms (2-3x slower)

05c — Async Generator Streaming Pipeline

The streaming system is built on JavaScript async generators — a composable pipeline pattern that yields typed events from the AI provider through the SSE transport layer.

Generator Chain Architecture

The streaming pipeline is a three-stage chain of async generators. Each stage transforms or enriches the data before passing it downstream:

┌──────────────────────────────────────────────────────────────────────────┐
│                        ASYNC GENERATOR PIPELINE                          │
│                                                                          │
│  ┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐   │
│  │     STAGE 1     │    │     STAGE 2      │    │      STAGE 3       │   │
│  │ Provider Stream │───►│  Tool Execution  │───►│ SSE Serialization  │   │
│  │                 │    │                  │    │                    │   │
│  │ Yields:         │    │ Yields:          │    │ Writes:            │   │
│  │ • string tokens │    │ • typed events   │    │ • data: {JSON}\n\n │   │
│  │ • thinking parts│    │ • media chunks   │    │ • data: [DONE]\n\n │   │
│  │ • function calls│    │ • text content   │    │                    │   │
│  │ • inline media  │    │ • clear signals  │    │                    │   │
│  └────────┬────────┘    └────────┬─────────┘    └────────┬───────────┘   │
│           │ async function*      │ yield*                │ res.write()   │
│           │ streamChatCompletion │ + executeToolCall     │               │
│           └──────────────────────┴───────────────────────┘               │
└──────────────────────────────────────────────────────────────────────────┘

Stage 1 — Provider-Specific Stream

The entry point is streamChatCompletion(), an async function* (async generator function) that routes to provider-specific implementations:

async function* streamChatCompletion(provider, apiKey, modelId, messages, maxTokens,
                                     temperature, topP, genApiKeys, modelType, ttsOptions) {
  // Route 1: TTS models → dedicated AUDIO modality path
  if (modelType === 'tts' && provider === 'google') {
    yield { type: 'generating', mediaType: 'audio' };
    yield* streamGeminiTTS(apiKey, modelId, messages, ttsOptions);
    return;
  }

  // Route 2: Google models → native SDK (true token streaming)
  if (provider === 'google') {
    yield* streamGeminiNative(apiKey, modelId, messages, ...);
    return;
  }

  // Route 3: All others → OpenAI-compatible SDK
  // ... streams tokens, accumulates tool calls, detects fake calls
}
yield* (Delegation): The yield* syntax delegates to a sub-generator, forwarding all yielded values directly to the consumer. This enables composable pipeline stages without manual iteration. Each sub-generator is itself an async function* that can yield typed event objects or plain strings.
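A toy example of the delegation pattern (event shapes borrowed from the typed SSE events listed earlier in this document):

```javascript
// A provider-specific sub-generator yielding typed events
async function* subStream() {
  yield { type: 'thinking_start' };
  yield { type: 'chunk', content: 'Hello' };
  yield { type: 'done' };
}

// The router generator: yield* forwards every event from the
// sub-generator to the consumer without manual iteration
async function* routeStream() {
  yield* subStream();
}

// A consumer that drains any async generator into an array
async function collect(gen) {
  const events = [];
  for await (const ev of gen) events.push(ev);
  return events;
}
```

Because every stage yields the same typed event objects, the downstream consumer never needs to know which provider path produced them.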

Stage 2 — Tool Call Accumulation & Execution

For OpenAI-compatible providers, tool calls arrive as streamed deltas across multiple chunks. The system accumulates them before execution:

// Tool calls arrive fragmented across chunks:
// chunk[0].delta.tool_calls = [{ index: 0, id: "call_abc", function: { name: "gene" } }]
// chunk[1].delta.tool_calls = [{ index: 0, function: { name: "rate_image" } }]
// chunk[2].delta.tool_calls = [{ index: 0, function: { arguments: '{"pro' } }]
// chunk[3].delta.tool_calls = [{ index: 0, function: { arguments: 'mpt":"cat"}' } }]

const pendingToolCalls = {}; // Map<index, {id, name, arguments}>

for await (const chunk of stream) {
  const toolCalls = chunk.choices?.[0]?.delta?.tool_calls;
  if (toolCalls) {
    for (const tc of toolCalls) {
      if (!pendingToolCalls[tc.index]) pendingToolCalls[tc.index] = { id: '', name: '', arguments: '' };
      if (tc.id) pendingToolCalls[tc.index].id += tc.id;
      if (tc.function?.name) pendingToolCalls[tc.index].name += tc.function.name;
      if (tc.function?.arguments) pendingToolCalls[tc.index].arguments += tc.function.arguments;
    }
  }
}
// After stream ends: parse accumulated JSON and execute each tool
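A runnable version of the accumulator, fed the same fragmented deltas shown in the comments above (chunk shapes mirror the OpenAI streaming format):

```javascript
function accumulateToolCalls(chunks) {
  const pending = {}; // index → { id, name, arguments }
  for (const chunk of chunks) {
    const toolCalls = chunk.choices?.[0]?.delta?.tool_calls;
    if (!toolCalls) continue;
    for (const tc of toolCalls) {
      // Create the slot on first sight of this index, then concatenate fragments
      const slot = (pending[tc.index] ??= { id: '', name: '', arguments: '' });
      if (tc.id) slot.id += tc.id;
      if (tc.function?.name) slot.name += tc.function.name;
      if (tc.function?.arguments) slot.arguments += tc.function.arguments;
    }
  }
  return pending;
}

// The four fragmented deltas from the example:
const chunks = [
  { choices: [{ delta: { tool_calls: [{ index: 0, id: 'call_abc', function: { name: 'gene' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { name: 'rate_image' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: '{"pro' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: 'mpt":"cat"}' } }] } }] },
];
const calls = accumulateToolCalls(chunks);
// calls[0] → { id: 'call_abc', name: 'generate_image', arguments: '{"prompt":"cat"}' }
```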

Fake Tool-Call Detection — Pattern Matching Algorithm

Models that don't support native function calling sometimes emit JSON that looks like a tool call. The detectFakeToolCall() function uses a multi-pattern matching algorithm:

detectFakeToolCall(fullText) → { toolName, args } | null

// Step 1: Extract JSON — try markdown-wrapped, then raw, then substring
parsed = tryParse(text.match(/```json?\s*([\s\S]*?)```/)?.[1])
      ?? tryParse(text)
      ?? tryParse(text.substring(text.indexOf('{'), text.lastIndexOf('}') + 1))

// Step 2: Match against 5 known patterns:
// Pattern A: DALL-E style   → { action: "dalle.text2im", action_input: "{...}" }
// Pattern B: Direct tool    → { tool: "generate_image", prompt: "..." }
// Pattern C: Function style → { function: "generate_image", arguments: {...} }
// Pattern D: Type-hint      → { prompt: "...", type: "image" }
// Pattern E: TTS style      → { action: "tts", text: "..." }

// Step 3: Extract args using cascading property access:
prompt = parsed.prompt ?? parsed.input?.prompt ?? parsed.arguments?.prompt ?? parsed.parameters?.prompt
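A simplified, runnable sketch of this detector covering patterns A, B, and D (the full version also handles the function-style and TTS payloads; helper names are illustrative):

```javascript
// The fence string is built at runtime to keep literal backtick fences
// out of this example's own source
const FENCE = '`'.repeat(3);
const FENCE_RE = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE);

function tryParse(text) {
  if (!text) return null;
  try { return JSON.parse(text); } catch { return null; }
}

function detectFakeToolCall(fullText) {
  // Step 1: markdown-wrapped JSON, then raw text, then brace-delimited substring
  const parsed = tryParse(fullText.match(FENCE_RE)?.[1])
    ?? tryParse(fullText)
    ?? tryParse(fullText.substring(fullText.indexOf('{'), fullText.lastIndexOf('}') + 1));
  if (!parsed || typeof parsed !== 'object') return null;

  // Pattern A: DALL-E style — { action: "dalle.text2im", action_input: "{...}" }
  if (typeof parsed.action === 'string' && parsed.action.includes('text2im')) {
    const inner = tryParse(parsed.action_input) ?? parsed.action_input;
    return { toolName: 'generate_image', args: { prompt: inner?.prompt } };
  }
  // Pattern B: direct tool — { tool: "generate_image", prompt: "..." }
  if (parsed.tool) return { toolName: parsed.tool, args: { prompt: parsed.prompt } };
  // Pattern D: type hint — { prompt: "...", type: "image" }
  if (parsed.type === 'image' && parsed.prompt) {
    return { toolName: 'generate_image', args: { prompt: parsed.prompt } };
  }
  return null; // plain prose or unknown shape → not a tool call
}
```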

05d — Server-Sent Events Protocol Internals

The SSE transport layer is hand-optimized for minimum latency across reverse proxies, CDNs, and mobile networks. Every byte of the protocol is deliberate.

SSE Wire Format

Each event is a JSON-serialized line prefixed with data:  and terminated by \n\n. The format is defined by the W3C EventSource specification:

SSE Frame Structure

Padding Event (T+0ms):
: <2048 spaces>\n\n
  │ colon       = SSE comment (ignored by EventSource)
  │ 2048 spaces = fills proxy buffers
  │ \n\n        = event terminator

Meta Event:
data: {"type":"meta","chatId":"550e8400-e29b-41d4-a716-446655440000"}\n\n

Thinking Events:
data: {"type":"thinking_start"}\n\n
data: {"type":"thinking_content","content":"Let me analyze..."}\n\n
data: {"type":"thinking_done","thinkingTime":1.2}\n\n

Text Chunk Event:
data: {"type":"chunk","content":"Hello"}\n\n

Media Event (base64 payload):
data: {"type":"media","mimeType":"image/png","data":"iVBORw0KGgo..."}\n\n

Terminal Events:
data: {"type":"done","chatId":"550e8400-..."}\n\n
data: [DONE]\n\n
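The frames above can be produced by a tiny serializer (a sketch — the real code writes the same strings directly to the Express response object):

```javascript
// One wire frame per typed event: "data: " + JSON + blank-line terminator
function sseFrame(event) {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// The non-JSON sentinel that terminates the stream
function sseDone() {
  return 'data: [DONE]\n\n';
}
```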

Proxy Buffer Flush Strategy

Reverse proxies (nginx, Cloudflare, Traefik) buffer responses until they hit a minimum size threshold. Without the 2KB padding, the first SSE event may be delayed by up to 30 seconds until enough data accumulates:

Proxy Buffer Flush Condition

buffer_size ≥ proxy_threshold → flush to client

// Typical proxy thresholds:
//   nginx:      4KB (proxy_buffer_size default)
//   Cloudflare: 1KB (automatic edge buffering)
//   Traefik:    4KB (default buffer)

// Solution: 2KB SSE comment = ':' + ' '×2048 + '\n\n' = 2051 bytes
// Combined with response headers (~500 bytes) = ~2.5KB
// Exceeds Cloudflare threshold → instant flush

// Additional transport hints:
X-Accel-Buffering: no                   // disables nginx proxy buffering
Cache-Control: no-cache, no-transform   // prevents CDN caching
socket.setNoDelay(true)                 // disables Nagle's algorithm (TCP)
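A minimal Express-style sketch of this anti-buffering setup. The header names and the 2KB comment come from the text above; sseOpen and sseEvent are illustrative helpers, not the production API:

```javascript
// Open an SSE response with all three anti-buffering measures applied.
function sseOpen(res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform', // keep CDNs from caching/transforming
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no',                 // disable nginx proxy buffering
  });
  res.socket?.setNoDelay(true);                // disable Nagle's algorithm (TCP)
  res.write(ssePadding());                     // overflow proxy buffers immediately
}

// ':' starts an SSE comment (ignored by EventSource); 2048 spaces + '\n\n'
// pushes the response past typical proxy flush thresholds.
function ssePadding() {
  return ':' + ' '.repeat(2048) + '\n\n';
}

// Serialize one SSE data frame.
function sseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```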

SSE Event Type System — Finite State Machine

The client-side SSE parser implements a state machine that processes events in strict order. Invalid transitions are handled gracefully:

SSE EVENT STATE MACHINE

INITIAL    ──meta──────────────► CONNECTED
CONNECTED  ──thinking_start────► THINKING
CONNECTED  ──chunk─────────────► STREAMING
THINKING   ──thinking_content──► THINKING    (loop)
THINKING   ──thinking_done─────► STREAMING
STREAMING  ──chunk─────────────► STREAMING   (append text)
STREAMING  ──generating────────► STREAMING   (show skeleton)
STREAMING  ──clear_content─────► STREAMING   (reset text)
STREAMING  ──media─────────────► STREAMING   (render media)
STREAMING  ──done──────────────► COMPLETE
STREAMING  ──error─────────────► COMPLETE
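The transitions above can be sketched as a small client-side reducer. The event names come from the protocol; the state shape and helper name are illustrative:

```javascript
// Minimal event reducer mirroring the SSE state machine; unknown events
// are ignored so invalid transitions degrade gracefully.
function createSseReducer() {
  const state = { phase: 'INITIAL', chatId: null, text: '', thinking: '', media: [], error: null };
  return function reduce(evt) {
    switch (evt.type) {
      case 'meta':             state.phase = 'CONNECTED'; state.chatId = evt.chatId; break;
      case 'thinking_start':   state.phase = 'THINKING'; break;
      case 'thinking_content': state.thinking += evt.content; break;
      case 'thinking_done':    state.phase = 'STREAMING'; break;
      case 'chunk':            state.phase = 'STREAMING'; state.text += evt.content; break;
      case 'generating':       /* UI: show shimmer skeleton */ break;
      case 'clear_content':    state.text = ''; break; // wipe fake tool-call JSON
      case 'media':            state.media.push({ mimeType: evt.mimeType, data: evt.data }); break;
      case 'done':             state.phase = 'COMPLETE'; break;
      case 'error':            state.phase = 'COMPLETE'; state.error = evt.message ?? 'error'; break;
      default: break; // unknown event types are ignored
    }
    return state;
  };
}
```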

05e — Audio EncodingPCM-to-WAV Binary Encoding

Google's Gemini TTS returns raw PCM audio samples. Browsers cannot play raw PCM — the pcmToWav() function constructs a valid WAV file by manually writing the 44-byte RIFF/WAVE header.

WAV File Structure — Byte-Level Layout

WAV File Format (RIFF/WAVE)

Offset  Size  Field          Value
──────  ────  ─────          ─────
0x00    4     ChunkID        "RIFF"            // ASCII magic bytes
0x04    4     ChunkSize      36 + dataSize     // file size - 8
0x08    4     Format         "WAVE"            // ASCII format identifier
── fmt sub-chunk ──
0x0C    4     Subchunk1ID    "fmt "            // with trailing space
0x10    4     Subchunk1Size  16                // PCM = 16 bytes
0x14    2     AudioFormat    1                 // 1 = PCM (uncompressed)
0x16    2     NumChannels    1                 // mono
0x18    4     SampleRate     24000             // 24kHz (from Gemini)
0x1C    4     ByteRate       48000             // SampleRate × BlockAlign
0x20    2     BlockAlign     2                 // NumChannels × BitsPerSample/8
0x22    2     BitsPerSample  16                // 16-bit samples
── data sub-chunk ──
0x24    4     Subchunk2ID    "data"
0x28    4     Subchunk2Size  dataSize          // raw PCM byte count
0x2C    N     Data           [PCM samples...]  // little-endian int16

Audio Mathematics

WAV Encoding Formulas

ByteRate   = SampleRate × NumChannels × (BitsPerSample / 8)
           = 24000 × 1 × (16 / 8) = 48,000 bytes/sec

BlockAlign = NumChannels × (BitsPerSample / 8)
           = 1 × (16 / 8) = 2 bytes per sample frame

ChunkSize  = 36 + dataSize   // 36 = header bytes (44) minus RIFF header (8)
FileSize   = 44 + dataSize   // 44-byte header + raw PCM

// Duration of generated audio:
Duration = dataSize / ByteRate   // in seconds
         = dataSize / 48,000

// Sample rate extraction from MIME type:
// Gemini returns: "audio/L16;rate=24000"
rate = parseInt(mimeType.match(/rate=(\d+)/)?.[1]) ?? 24000

Format Auto-Detection

Before wrapping with a WAV header, the function checks if the data is already in a playable format by inspecting magic bytes:

// Check for existing WAV header (RIFF...WAVE)
if (raw[0..3] === "RIFF" && raw[8..11] === "WAVE") → return as-is

// Check for MP3 sync word (frame header)
if (raw[0] === 0xFF && (raw[1] & 0xE0) === 0xE0) → return as-is

// Otherwise: raw PCM → wrap with 44-byte WAV header
const wav = Buffer.alloc(44 + dataSize);
// ... write RIFF, fmt, data sub-chunks at byte offsets
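A self-contained Node sketch of this logic, following the byte offsets in the table above. The production pcmToWav may differ in details:

```javascript
// Wrap raw PCM in a 44-byte RIFF/WAVE header, passing already-containered
// audio (WAV or MP3) through unchanged via magic-byte detection.
function pcmToWav(raw, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  // Already WAV? ("RIFF" at offset 0, "WAVE" at offset 8)
  if (raw.length >= 12 && raw.toString('ascii', 0, 4) === 'RIFF' &&
      raw.toString('ascii', 8, 12) === 'WAVE') return raw;
  // MP3 sync word (0xFFEx frame header)?
  if (raw.length >= 2 && raw[0] === 0xff && (raw[1] & 0xe0) === 0xe0) return raw;

  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + raw.length, 4);   // ChunkSize = file size - 8
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');          // note the trailing space
  header.writeUInt32LE(16, 16);               // fmt sub-chunk size (PCM)
  header.writeUInt16LE(1, 20);                // AudioFormat 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(raw.length, 40);       // Subchunk2Size = raw PCM bytes
  return Buffer.concat([header, raw]);
}
```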

05f — ConcurrencyConcurrency & Parallelization Model

The system maximizes throughput through strategic parallelization of independent operations, connection reuse, and non-blocking I/O patterns.

Promise.all() Parallelization Map

Every request triggers multiple independent operations. The codebase uses Promise.all() at every opportunity to convert sequential I/O into parallel I/O:

Location              | Parallel Operations                     | Sequential Cost   | Parallel Cost
Chat pre-flight       | model + agent + rateLimit + userKeys    | ~320ms (4 × 80ms) | ~80ms (max of 4)
Existing chat         | insertUserMsg + fetchHistory            | ~160ms            | ~80ms
Regenerate pre-flight | lastMsg + agent + rateLimit + userKeys  | ~320ms            | ~80ms
Regenerate setup      | deleteLastMsg + fetchHistory            | ~160ms            | ~80ms
Auth init (frontend)  | profile + models + agents + chatHistory | ~400ms            | ~100ms
Chat history endpoint | count + dataFetch                       | ~160ms            | ~80ms
Messages endpoint     | chatVerify + count + messages           | ~240ms            | ~80ms
Admin stats           | users + models + agents + monthUsage    | ~320ms            | ~80ms
Parallelization Efficiency

T_sequential = Σ Ti       // sum of all query times
T_parallel   = max(Ti)    // bottleneck only
Speedup      = T_sequential / T_parallel

// For N queries of equal cost T:
Speedup = N × T / T = N   // → 4 parallel queries = 4× speedup (linear scaling)

// Total saved per chat request: ~240ms (pre-flight) + ~80ms (history)
// Over 1000 daily requests: 320 seconds of cumulative latency eliminated
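The mega-batch pre-flight can be sketched as follows. The dependency functions (getModel, getAgent, checkRateLimit, getUserKeys) are hypothetical stand-ins for the real Supabase queries:

```javascript
// Issue all four independent pre-flight lookups in one parallel batch,
// so total latency is the slowest query rather than the sum of all four.
async function chatPreflight(deps, { userId, modelId, agentId }) {
  const [model, agent, rateLimit, userKeys] = await Promise.all([
    deps.getModel(modelId),
    agentId ? deps.getAgent(agentId) : Promise.resolve(null),
    deps.checkRateLimit(userId, modelId),
    deps.getUserKeys(userId),
  ]);
  if (!model) throw new Error('Model not found');
  return { model, agent, rateLimit, userKeys };
}
```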

Connection Pool Cache — O(1) Client Lookup

AI provider clients are cached in Map data structures, providing O(1) amortized lookup by composite key:

// Cache key: "provider:apiKey" → deduplicated per provider+credential
const clientCache = new Map();        // OpenAI-compatible clients
const googleClientCache = new Map();  // Native Google AI clients

// Lookup: O(1) average case (hash map)
// Memory: O(P × K) where P = providers, K = unique keys
// Typical: ~6-12 cached clients (6 providers × 1-2 keys each)

// What's saved per cache hit:
//   1. Object construction (~5ms)
//   2. TCP connection establishment (~50ms)
//   3. TLS handshake (~100-200ms)
// Total savings per hit: ~155-255ms
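A minimal sketch of the composite-key cache; createClient stands in for the real SDK constructors (e.g. new OpenAI({ apiKey })):

```javascript
// Client pool keyed by "provider:apiKey" — construction and TLS setup
// are paid once per provider+credential pair, then reused.
const clientCache = new Map();

function getCachedClient(provider, apiKey, createClient) {
  const key = `${provider}:${apiKey}`;        // composite dedup key
  let client = clientCache.get(key);
  if (!client) {
    client = createClient(provider, apiKey);  // pay construction + TLS once
    clientCache.set(key, client);
  }
  return client;                              // O(1) amortized lookup
}
```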

Auth Cache — Token-Indexed Profile Store

Auth Cache Parameters

Structure:  Map<JWT_token, { profile: User, ts: number }>
TTL:        300,000ms (5 minutes)
Eviction:   Lazy — checked on access: if (Date.now() - ts > TTL) → miss
Hit ratio:  ~95% for active users (tokens refresh every ~60min)

// Per-request savings on cache hit:
// Skips: supabase.auth.getUser() (~40ms) + users.select() (~40ms) = ~80ms

// Frontend mirror: 30s TTL auth header cache
// Skips: supabase.auth.getSession() (~50-150ms)
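A sketch of the lazily-evicted cache described above, with the clock injectable so expiry is testable; function names are illustrative:

```javascript
// Token-indexed profile cache with lazy TTL eviction on access.
const AUTH_TTL_MS = 5 * 60 * 1000;
const authCache = new Map(); // token → { profile, ts }

function cacheProfile(token, profile, now = Date.now()) {
  authCache.set(token, { profile, ts: now });
}

function getCachedProfile(token, now = Date.now()) {
  const entry = authCache.get(token);
  if (!entry) return null;              // miss: token never seen
  if (now - entry.ts > AUTH_TTL_MS) {   // lazy eviction on access
    authCache.delete(token);
    return null;                        // miss: entry expired
  }
  return entry.profile;                 // hit: skips the Supabase round-trips
}
```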

05g — FallbackGemini Progressive Fallback State Machine

Gemini models have varying capabilities (thinking, tools, image output). The system implements a three-stage fallback that automatically downgrades features when a model doesn't support them.

START: streamGeminiNative()
  │
  ▼
ATTEMPT 1 (Full Features)
  thinkingConfig: { include: ✓ }
  tools: GENERATION_TOOLS_GEMINI
  ├─ success ──────────────────► STREAM RESPONSE
  └─ failure + isUnsupported ──► ATTEMPT 2

ATTEMPT 2 (No Thinking)
  thinkingConfig: none
  tools: GENERATION_TOOLS_GEMINI
  ├─ success ──────────────────► STREAM RESPONSE
  └─ failure + isUnsupported ──► ATTEMPT 3

ATTEMPT 3 (Plain Text)
  thinkingConfig: none
  tools: none
  ├─ success ──────────────────► STREAM RESPONSE
  └─ failure ──────────────────► THROW ERROR

Unsupported Feature Detection Heuristic

const isUnsupported =
  msg.includes('Thinking is not enabled') ||
  msg.includes('not supported') ||
  msg.includes('INVALID_ARGUMENT') ||
  (err?.status === 400 && (
    msg.includes('think') ||
    msg.includes('tool') ||
    msg.includes('function')
  ));

// If isUnsupported && more attempts remain → retry with simpler config
// If isUnsupported && no attempts remain   → throw (fatal)
// If !isUnsupported → throw immediately (don't waste retries on auth/rate errors)
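Wired into a retry loop, the heuristic drives the three-stage fallback roughly like this; attempt stands in for the actual Gemini streaming call:

```javascript
// Classify an error as "unsupported feature" (retryable with a simpler
// config) vs fatal (auth, rate limit, network).
function isUnsupported(err) {
  const msg = String(err?.message ?? '');
  return msg.includes('Thinking is not enabled')
    || msg.includes('not supported')
    || msg.includes('INVALID_ARGUMENT')
    || (err?.status === 400 && /think|tool|function/.test(msg));
}

// Try the most capable config first, degrading on unsupported-feature errors.
async function streamWithFallback(attempt) {
  const configs = [
    { thinking: true,  tools: true  },  // Attempt 1: full features
    { thinking: false, tools: true  },  // Attempt 2: no thinking
    { thinking: false, tools: false },  // Attempt 3: plain text
  ];
  for (let i = 0; i < configs.length; i++) {
    try {
      return await attempt(configs[i]);
    } catch (err) {
      // Fatal errors throw immediately; unsupported-feature errors only
      // retry while simpler configs remain.
      if (!isUnsupported(err) || i === configs.length - 1) throw err;
    }
  }
}
```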

05h — Token BudgetToken Budget & Context Window Management

The system implements a multi-tier priority chain for determining the token budget for each request, with hard caps based on key ownership.

Max Tokens Resolution Chain

// Three-tier priority with ownership-based hard cap:
hardCap = useOwnKeys ? 131,072 : 10,000

maxTokens = min(
  body.maxTokens          // Priority 1: Frontend override (Agent Studio)
    ?? agent.max_tokens   // Priority 2: Agent setting from DB
    ?? model.max_tokens
    ?? 4096,              // Priority 3: Model default / global fallback
  hardCap                 // Clamp: prevent abuse
)

// Same pattern for temperature and top_p:
temperature = body.temperature ?? agent.temperature ?? 0.7
topP        = body.topP ?? agent.top_p ?? 0.95
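The same chain as a small function; the input shapes are illustrative stand-ins for the request body and DB rows:

```javascript
// Resolve the effective token budget: three-tier priority, then clamp
// to an ownership-based hard cap.
function resolveMaxTokens({ body = {}, agent = {}, model = {}, useOwnKeys = false }) {
  const hardCap = useOwnKeys ? 131072 : 10000;
  const requested =
    body.maxTokens       // Priority 1: frontend override (Agent Studio)
    ?? agent.max_tokens  // Priority 2: agent setting from DB
    ?? model.max_tokens  // Priority 3: model default…
    ?? 4096;             // …or global fallback
  return Math.min(requested, hardCap);  // clamp to prevent abuse
}
```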

Context Window Bounding

To prevent unbounded context growth and keep costs predictable, the system limits conversation history to the most recent 20 messages:

Context Window Strategy

// For existing chats:
history = SELECT role, content FROM messages
          WHERE chat_id = chatId
          ORDER BY created_at ASC
          LIMIT 20

// This bounds the context window to approximately:
max_context_tokens ≈ 20 messages × ~500 tokens/msg avg = ~10,000 tokens

// For Agent Studio (conversationHistory from frontend):
// No server-side limit — frontend manages via localStorage
// System prompt is prepended as first message in array

// Message array construction order:
// [system_prompt?, ...history_messages, current_user_message]

Dual Usage Tracking for Public Agents

When a user chats with a public agent created by another user, usage is counted against both the consumer and the agent creator (unless the consumer uses their own key):

Dual Usage Attribution

// Always:
incrementUsage(userId, modelId, actuallyUsedOwnKey)

// Additionally, if all conditions met:
if (agentCreatorId                   // agent exists
    && agentCreatorId !== userId     // not the creator themselves
    && !actuallyUsedOwnKey) {        // using platform key
  incrementUsage(agentCreatorId, modelId, false)
}

// This prevents creators from consuming unlimited platform quota
// by publishing popular agents that others use.
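As a testable function (incrementUsage is a stand-in for the real DB write; the conditions mirror the logic above):

```javascript
// Charge the consumer always; charge the agent creator only when a
// *different* user consumed platform-key quota through their public agent.
function attributeUsage({ userId, modelId, agentCreatorId, usedOwnKey }, incrementUsage) {
  incrementUsage(userId, modelId, usedOwnKey);
  if (agentCreatorId && agentCreatorId !== userId && !usedOwnKey) {
    incrementUsage(agentCreatorId, modelId, false);
  }
}
```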

05i — MultimodalMultimodal Processing Pipeline

Input attachments flow through a provider-specific transformation pipeline. The system handles images, audio, PDFs, and text files with automatic format conversion and privacy-safe storage.

Attachment Processing Flow

MULTIMODAL INPUT PIPELINE

User Attachments (base64 in request)
  │
  ▼
Zod Validation
  type ∈ { image, voice, file }
  max: 5 attachments
  │
  ├─► provider === 'google' → buildGeminiParts()
  │     image → inlineData
  │     audio → inlineData
  │     pdf   → inlineData
  │     text  → inlineData
  │
  ├─► provider !== 'google' → buildOpenAIMultimodalContent()
  │     image/*         → { type: "image_url", url: "data:..." }
  │     audio/*         → { type: "input_audio", data, format }
  │     application/pdf → extractPdfText(base64) → text block
  │                       "[Content of attached PDF file]:\n\n..."
  │     text/*          → Base64.decode() → text block
  │     + user message text → final content array
  │
  └─► DB Storage (parallel) — placeholder only
        "[Image sent]"  "[Voice sent]"  "[File: x.pdf]"

PDF Text Extraction

For non-Google providers that don't support inline PDF, the system uses pdf-parse to extract text content:

// Input: base64-encoded PDF bytes
const buffer = Buffer.from(base64Data, 'base64');
const result = await pdfParse(buffer);
const text = result.text?.trim() || '';

// Output: injected as a structured text block
content.push({
  type: 'text',
  text: `[Content of attached PDF file]:\n\n${pdfText}`
});

// Error handling: graceful degradation
// On parse failure → "[PDF content could not be extracted]"
// PDF is still sent as base64 to Gemini (native PDF support)

Body Size Limits

Request Size Constraints

JSON body limit = 50MB           // express.json({ limit: '50mb' })
Max attachments = 5              // Zod: z.array().max(5)
Max message     = 32,000 chars   // Zod: z.string().max(32000)
Max prompt      = 10,000 chars   // Agent system_prompt
Max username    = 30 chars       // /^[a-z0-9_]{3,30}$/

// Typical base64 overhead: ~33% larger than binary
// → 50MB JSON limit supports ~37.5MB of actual file data
// → A single 4K image ≈ 8-15MB base64
// → 5 images at max quality: ~50-75MB → may exceed limit
// → In practice: compressed images are 200KB-2MB each

06 — FrontendUser Application

A premium dark-themed SPA built with Next.js 16 App Router, Zustand state management, and Framer Motion animations. Designed for speed — all critical data is preloaded in parallel on login.

6.1 State Management (Zustand)

A single Zustand store manages the entire application state, giving every component one source of truth; heavy components subscribe to narrow slices via useShallow to limit re-renders.

6.2 Auth Header Caching

The frontend API layer caches auth headers for 30 seconds, avoiding redundant supabase.auth.getSession() calls (~50-150ms each). On sign-in, seedAuthCache() pre-populates the cache so the initial data load runs without delay.

6.3 Data Preloading Strategy

On authentication, the AuthProvider fires a parallel mega-batch:

const [profile] = await Promise.all([
  usersAPI.getMe(),   // User profile
  loadAllData(),      // Models + Agents + Chat History (parallel)
]);

// Non-blocking background loads:
usersAPI.getUsage().then(setUsage);
usersAPI.getApiKeys().then(setApiKeys);

This ensures the UI is fully interactive in a single network round-trip, with usage data and API keys loading in the background.

6.4 SSE Event Processing

The frontend processes 10 distinct SSE event types from the streaming chat endpoint:

Event Type       | Purpose                  | Frontend Action
meta             | Chat ID assignment       | Updates store activeChatId, prepends to chat list
thinking_start   | AI is reasoning          | Shows animated "Thinking..." indicator
thinking_content | Internal reasoning text  | Streams into collapsible thinking panel
thinking_done    | Reasoning complete       | Collapses panel, shows elapsed time badge
chunk            | Text token               | Appends to assistant message content
generating       | Media generation started | Shows shimmer skeleton (image/audio/video)
media            | Generated media data     | Renders inline image, audio player, or video player
clear_content    | Clear fake tool JSON     | Resets accumulated text content
done             | Stream complete          | Finalizes message, enables regeneration
error            | Error occurred           | Shows error message in chat

6.5 Model-Specific Composers

The chat input adapts to the selected model type:

💬
Text Composer

Standard chat input with file attachments (images, PDFs, text files), voice recording, expandable input, and thinking toggle.

🎨
Generation Composer

Purple-themed prompt input for image/video models with Sparkles icon. Focuses on creative prompt entry.

🎤
TTS Composer

Emerald-themed with voice selection pills (8 Google voices) and a Volume2 icon. Accepts the text to be spoken.

6.6 Agent Studio

The Agent Studio provides a full-featured IDE for creating and testing custom AI agents, with live testing, real-time settings reflection, and conversation memory.


07 — AdminAdmin Dashboard

A separate Next.js application that provides complete platform management: model CRUD, user management with per-user rate limits, and usage analytics.

Model Management

Full CRUD for AI models: name, provider, model_id, API key, daily limit, max tokens, model type (text/image/video/tts), and active status. API keys are masked in the UI.

👥
User Management

View all users, promote/demote admin roles, and set per-user per-model custom rate limits that override the global defaults.

📈
Usage Analytics

Dashboard stats (total users, active models, total agents, monthly requests) and detailed per-user, per-model usage logs filterable by date.

The admin API is protected by both authMiddleware (JWT validation) and adminMiddleware (role check). All admin endpoints require role === 'admin' in the user's profile.


08 — SecuritySecurity Architecture

Security is enforced at every layer: database (RLS), middleware (JWT), transport (HTTPS/CORS/Helmet), and application (Zod validation, API key isolation).

Row-Level Security

Every table has RLS enabled. Users can only access their own data. Public agents are readable by anyone. Admins have unrestricted access via a policy that checks role = 'admin'.

API Key Isolation

Platform API keys live exclusively in the models table and are only accessed server-side via the service-role key. The public models endpoint explicitly excludes api_key from the SELECT query. Admin endpoints mask keys as sk-xxxxx...xxxx.

Input Validation

Every API endpoint validates input with Zod schemas before processing. Messages are capped at 32,000 chars, attachments at 5 per request, usernames must match /^[a-z0-9_]{3,30}$/.

CORS & Headers

Helmet sets security headers. CORS is configured per-origin from environment variables with normalized trailing-slash handling. Only the frontend and admin URLs are allowed.

Privacy by Design

Binary attachments (images, voice, files) are processed in-memory and sent directly to AI providers. Only text placeholders like [Image sent] are stored in the database. No user media is ever persisted.

BYOK Safety

User API keys are stored in the user_api_keys table with RLS ensuring only the owner can read/write their keys. The backend resolves keys server-side — they're never sent to the frontend.


09 — PerformancePerformance Optimizations

Every millisecond matters for perceived AI response speed. The platform uses aggressive parallelization, caching, and streaming to minimize time-to-first-token.

Optimization              | Impact                       | Technique
Instant SSE open          | ~1s faster perceived latency | Stream opens before DB work; 2KB padding flushes proxy buffers
Mega-batch pre-flight     | 4 queries in 1 round-trip    | Promise.all() for model, agent, rate limit, and API keys
Auth token caching        | ~50-150ms saved per request  | Backend: 5min in-memory Map. Frontend: 30s header cache
Client connection pooling | ~100-300ms saved per request | OpenAI and Google clients cached by provider+key, reuse TCP/TLS
Native Gemini streaming   | ~8s faster than wrapper      | Direct @google/genai SDK instead of OpenAI-compatible endpoint
Fire-and-forget saves     | 0ms user wait for DB writes  | Message and usage inserts happen after [DONE] is sent
Parallel data preload     | Single round-trip on login   | Promise.all() for profile, models, agents, chat history
Zustand useShallow        | Reduced re-renders           | Heavy components select only needed state slices

10 — FeaturesKey Feature Summary

A comprehensive list of every user-facing and system-level feature in the platform.

💬 Streaming Chat

Real-time token-by-token streaming with thinking indicators, copy, regenerate, and auto-scroll.

🤖 Custom Agents

Create agents with system prompts, custom temperature/top_p/max_tokens, unique usernames, and public sharing.

🎨 Image Generation

Inline via tool calling (Imagen 3 + DALL-E 3). Expandable previews and one-click downloads.

🎤 Text-to-Speech

8 Google voices + 6 OpenAI voices. Custom audio player with progress bar. PCM-to-WAV conversion.

🎥 Video Generation

Google Veo 2 with async polling. Configurable aspect ratios. Inline video player.

📎 Multimodal Input

Attach images, voice recordings (iOS-compatible), PDFs (extracted to text), and text files.

🔑 Bring Your Own Key

Users add their own provider keys for unlimited access. Keys bypass rate limiting and are stored securely.

📈 Usage Tracking

Per-model monthly usage with separate platform vs. own-key tracking. Admin override limits per user.

💡 Thinking Mode

Toggle AI reasoning visibility. Collapsible thinking panel shows internal reasoning with elapsed time.

📌 Model Pinning

Pin up to 3 favorite models. Quick-switch dropdown shows Recent, Pinned, and Latest sections.

🌐 Public Agents

Share agents publicly via unique usernames. Discoverable in the Explore page. Creators' usage is tracked separately.

⚡ Agent Studio

Live testing with real-time settings reflection, conversation memory (localStorage), and full media generation support.


11 — APIAPI Endpoint Reference

All endpoints require Bearer token authentication unless noted. The backend exposes 18 REST endpoints across 7 route modules.

Method | Endpoint                 | Description
POST   | /api/chat/send           | Send message & stream response (SSE)
GET    | /api/chat/history        | List user's chats (paginated)
GET    | /api/chat/:id/messages   | Get messages for a chat (paginated)
PATCH  | /api/chat/:id            | Rename a chat
DELETE | /api/chat/:id            | Delete a chat
POST   | /api/chat/:id/regenerate | Regenerate last response (SSE)
GET    | /api/models              | List active models (no api_key)
GET    | /api/models/:id/usage    | User's usage for a specific model
GET    | /api/agents              | List user's agents
POST   | /api/agents              | Create agent
PATCH  | /api/agents/:id          | Update agent
DELETE | /api/agents/:id          | Delete agent
GET    | /api/agents/public       | List public agents (no auth)
POST   | /api/generate/image      | Generate image from prompt
POST   | /api/generate/tts        | Text-to-speech generation
POST   | /api/generate/video      | Video generation (Veo 2)
GET    | /api/users/me            | Get user profile
PUT    | /api/users/me/api-keys   | Update BYOK API keys

11b — FailuresReal Challenges & Failures

This section documents the bugs, edge cases, and painful discoveries that shaped the system's architecture. Every "optimized" solution in this case study was born from something that broke first.

💥 The Gemini Streaming Delay Discovery

!
What happened

Early on, all 6 providers used the same OpenAI-compatible SDK code path. Everything seemed fine — until I tested Gemini 2.5 Flash side-by-side with GPT-4o. Gemini had a consistent 5-8 second delay before the first token appeared, while GPT-4o started streaming in ~800ms. Users reported "Gemini is broken" even though it was technically working.

Root cause: Google's OpenAI-compatible endpoint (generativelanguage.googleapis.com/v1beta/openai/) doesn't actually stream. It buffers the entire response server-side, then sends all chunks in a rapid burst. The "streaming" is fake — you get 0 tokens for 6 seconds, then all 500 tokens in 200ms.

The fix: I built an entirely separate native Gemini code path using the @google/genai SDK, which supports real token-by-token streaming. This required duplicating all streaming logic, tool-call handling, and error management for Gemini specifically. The result cut Gemini's time-to-first-token from ~6s to ~200ms — but the cost was maintaining two parallel streaming implementations (§5g).

Lesson: Never trust "OpenAI-compatible" claims without benchmarking the actual streaming behavior. Compatibility layers optimize for correctness, not latency.

💥 Tool Call Parsing Across Providers

!
What happened

Tool calling (for image/TTS generation) worked perfectly on OpenAI and Gemini. Then I tested it on Claude via OpenRouter, and the chat just... printed raw JSON. The model understood the tool definition but instead of making a function call, it wrote {"tool": "generate_image", "prompt": "a sunset over mountains"} as plain text into the chat.

It got worse: Every model that "faked" tool calls did it differently. Some wrapped JSON in markdown code blocks. Some used action/action_input DALL-E format. Some just inferred a type: "image" field. I found at least 5 distinct JSON patterns across providers, and new ones kept appearing as I tested more models.

The fix: The detectFakeToolCall() function (§5c) — a multi-pattern matching pipeline that extracts JSON from markdown blocks or raw text, then tries to match it against 5 known tool-call formats. When detected, it clears the JSON from the UI via a clear_content SSE event and executes the generation transparently. This was the messiest code I wrote, and it still occasionally fails on novel model outputs.

Lesson: The AI ecosystem's "function calling" standard is a lie. Every provider implements it differently, and many models will simply ignore the schema and do their own thing. You need both a clean path and a dirty fallback.

💥 The Proxy Buffer Problem

!
What happened

Streaming worked perfectly in local development. The moment I deployed behind Cloudflare, SSE events were delayed by 15-30 seconds. The entire AI response would accumulate silently, then dump to the client all at once. Users saw a blank screen for half a minute, then the full response appeared instantly. It looked completely broken.

Root cause: Reverse proxies (Cloudflare, nginx, Traefik) buffer response data until they accumulate enough bytes to justify a network flush. SSE events are tiny (~50-200 bytes each), so they sit in the proxy buffer waiting for more data that never comes. The proxy's "optimization" was destroying the entire streaming UX.

The fix: Three-layer workaround: (1) A 2KB SSE comment (: + 2048 spaces) sent immediately to overflow the proxy buffer threshold. (2) X-Accel-Buffering: no header to explicitly disable nginx buffering. (3) socket.setNoDelay(true) to disable Nagle's TCP algorithm, which was batching small SSE frames. All three were necessary — removing any one brought the delay back in certain deployment configs.

Lesson: Local development is a lie for streaming applications. Always test SSE through at least one reverse proxy layer before calling it "done."

💥 Gemini's Unpredictable Feature Support

!
What happened

Gemini 2.5 Flash supports thinking mode + tool calling. Gemini 2.0 Flash supports tool calling but not thinking. Gemini 1.5 Pro supports neither. There is no API endpoint to query which features a model supports. I only discovered this through trial and error — sending a request with thinking enabled, getting a 400 error, and then realizing I had to maintain a compatibility matrix in my head.

It got worse: Google sometimes updates model capabilities silently. A model that didn't support tools on Monday might support them on Wednesday. Hardcoding a feature matrix would become stale instantly.

The fix: The three-stage progressive fallback state machine (§5g). Instead of trying to know in advance what each model supports, I attempt the most capable configuration first (thinking + tools), catch the error, check if it's an "unsupported feature" error, and retry with a simpler configuration. This makes the system self-healing — it adapts to model capabilities at runtime without any hardcoded knowledge.

Lesson: When working with third-party AI APIs, design for capability discovery at runtime rather than static configuration. APIs change under you.

💥 PCM Audio That Wouldn't Play

!
What happened

Gemini TTS returned audio data with MIME type audio/L16;rate=24000. I Base64-encoded it, sent it to the browser, and... silence. The <audio> element refused to play it. No error, no console warning — just nothing. Chrome, Firefox, Safari all silently failed.

Root cause: Browsers cannot play raw PCM audio. They need a container format (WAV, MP3, OGG). Gemini returns headerless 16-bit linear PCM samples — just raw bytes with no metadata about sample rate, channels, or bit depth. The browser has no way to interpret the data.

The fix: I wrote a manual pcmToWav() function (§5e) that constructs a 44-byte RIFF/WAVE header byte-by-byte using DataView, then prepends it to the raw PCM data. The sample rate is extracted from the MIME type string via regex. I also had to add magic byte detection (checking for RIFF and 0xFF 0xE0 MP3 sync headers) because OpenAI TTS returns MP3 directly, and wrapping an MP3 in a WAV header produces garbage.

Lesson: "Returns audio" in an API doc doesn't mean "returns playable audio." Always check the actual byte-level format, not just the MIME type.

💥 Rate Limiting Edge Case — The Public Agent Exploit

!
What happened

I built a public agents feature where users can share their custom agents. Another user can chat with your agent for free — great for discoverability. Then I realized the exploit: a user creates 10 public agents, shares them, and 50 people use them. All 500 daily requests consume the platform's API key budget, but nobody's individual rate limit is hit because usage was only tracked against each consumer individually.

The fix: Dual usage attribution (§5h). When someone uses a public agent on the platform key, usage is counted against both the consumer (who sent the message) and the agent creator (who published the agent). This prevents creators from bypassing their rate limits by laundering usage through public agents. The creator's count is only incremented when the platform key is used — if the consumer brings their own key, the creator isn't penalized.

Lesson: Any "sharing" feature in a rate-limited system is a potential bypass vector. Always ask: "Who pays for the compute when content goes viral?"

💥 iOS Voice Recording Incompatibility

!
What happened

Voice recording worked perfectly on Chrome desktop and Android. On iOS Safari, the MediaRecorder API silently produced empty blobs. The recording UI appeared to work — the timer ticked, the animation played — but the resulting audio file was 0 bytes.

Root cause: iOS Safari doesn't support audio/webm (the default format on Chrome). It supports audio/mp4 and audio/aac, but doesn't throw an error when you request an unsupported format — it just produces garbage output.

The fix: The VoiceRecorder component now probes format support at initialization with a priority list: audio/mp4 → audio/aac → audio/webm. It uses MediaRecorder.isTypeSupported() to find the first working format. This cascading approach handles iOS, Android, and desktop without user-agent sniffing.
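A sketch of that cascading probe, with the support check injected so the logic can run outside a browser (in production it would be MediaRecorder.isTypeSupported):

```javascript
// Priority list: iOS-friendly containers first, Chrome's default last.
const PREFERRED_MIME_TYPES = ['audio/mp4', 'audio/aac', 'audio/webm'];

// Return the first MIME type the platform claims to support.
function pickRecordingMimeType(isTypeSupported) {
  for (const mime of PREFERRED_MIME_TYPES) {
    if (isTypeSupported(mime)) return mime;   // first supported wins
  }
  return undefined; // let MediaRecorder fall back to its platform default
}
```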

Lesson: Never trust MediaRecorder to fail loudly. Always probe format support before recording, and always test audio features on actual iOS hardware (simulators lie about codec support).

What's Still Imperfect

Honest accounting of known limitations I haven't solved yet:

⚠ Auth Cache Staleness

The 5-minute backend auth cache means a user's role change (e.g., promoted to admin) doesn't take effect for up to 5 minutes. I accepted this because role changes are rare and the latency savings (~80ms/request) affect every single request.

⚠ No Shared Type Package

TypeScript interfaces (Model, Message, Agent) are duplicated between frontend and backend. A schema change requires manual sync in both codebases. A /packages/types workspace would fix this but adds Turborepo/Nx complexity I haven't justified yet.

⚠ Fire-and-Forget Risk

If the server crashes between sending [DONE] and the background message INSERT completing, the user sees the response but it's not saved to DB. On next page load, the message disappears. This is rare (~0.01% chance) but a real data consistency gap.

⚠ 50MB Body Limit

The 50MB JSON body limit accommodates most attachments, but 5 high-resolution images at full quality could exceed it. There's no chunked upload or compression — the entire payload must fit in one request. A proper solution would use presigned URLs and storage buckets.

Philosophy on Failures: Every "clean" architecture in this case study started as a messy workaround for a real bug. The progressive fallback (§5g) exists because Gemini crashed without it. The 2KB SSE padding (§5d) exists because Cloudflare swallowed my streams. The fake tool-call detector (§5c) exists because Claude ignored my function definitions. Production systems aren't designed in advance — they're shaped by failure.

12 — ConclusionTechnical Summary

Enox AI is a production-grade system that demonstrates mastery of full-stack engineering, distributed systems patterns, binary protocol encoding, and AI pipeline orchestration.

Engineering Depth — By the Numbers

8
Async Generator Pipelines
10
SSE Event Types
3
Fallback Stages
44
WAV Header Bytes

The project showcases deep expertise across every layer of the stack.

Total Sections: This case study covers 20 technical sections spanning system architecture, database design, streaming pipelines, async generators, SSE protocol internals, binary audio encoding, concurrency models, state machines, rate limiting mathematics, token budgeting, multimodal processing, and more.

Every line of code is available as open source at github.com/yad-anakin/enox.