◈ Open Source Project

Enox AI

A production-grade, multi-provider AI platform with streaming chat, multimodal generation, custom agents, and a full admin dashboard.

Full Technical Case Study
Created by Yad Qasim
Published March 2026

01 — Executive Summary

Enox AI is a full-stack, open-source AI platform that unifies access to the world's leading language models, image generators, video creators, and text-to-speech engines through a single, beautifully designed interface.

The platform implements a three-tier monorepo architecture: a Next.js 16 user-facing application, an Express.js API server, and a separate Next.js admin dashboard, all backed by Supabase (PostgreSQL) with Row-Level Security.

Key capabilities include real-time token-by-token streaming via Server-Sent Events, multimodal input support (images, voice, PDFs, files), inline media generation through AI tool calling, a custom agent studio with live testing, per-user rate limiting with admin override controls, and a Bring-Your-Own-Key (BYOK) system that gives users unlimited access with their own API keys.

6 AI Providers · 4 Media Modalities · 18 API Endpoints · 841 Lines in AI Core

01b — Product Thinking & UX Decisions

Before writing a single line of code, I identified real user problems in the AI tool landscape and made deliberate product decisions to solve them. This section explains the why behind every major design choice.

The User Problems

When I started building Enox, I kept running into the same frustrations that every AI power-user hits:

Vendor Lock-In

ChatGPT, Claude, and Gemini each live in their own walled garden. Switching between them means separate accounts, separate UIs, and separate billing. If you want to compare model outputs on the same prompt, you're juggling browser tabs.

💰 Cost Confusion

Users either pay $20/month per provider (expensive if you use multiple) or manage raw API keys with no usage visibility. There's no middle ground between "consumer subscription" and "raw developer API."

🔓 No Customization

ChatGPT's "Custom GPTs" are limited — you can't set temperature, top_p, or max_tokens. You can't live-test different system prompts side by side. Power users need an Agent Studio, not a wizard.

📷 Fragmented Media

Image generation, TTS, and video are separate tools. You can't ask a chat model to "make me a logo" and get an image inline — you have to switch to DALL-E or Midjourney. The AI experience is scattered across a dozen apps.

Who Is This For?

Enox targets two distinct user personas, and every feature maps to one or both:

🧑 AI Power Users

Developers, writers, and researchers who use multiple AI models daily. They want model comparison, BYOK for unlimited usage, fine-grained agent control (temperature, top_p), and a single interface that does text + image + audio + video.

Key features for them: BYOK system, Agent Studio with live testing, model pinning, thinking mode toggle, multimodal attachments.

🏭 Platform Operators

Team leads or small companies who want to host a private AI portal for their team. They need admin controls: model management, per-user rate limits, usage analytics, and the ability to rotate API keys without disrupting users.

Key features for them: Admin dashboard, per-user model limits, usage tracking with own-key separation, centralized model CRUD.

UX Design Decisions

Every UI choice was driven by a specific user need or pain point I observed:

🎨 Model-Adaptive Composer

Problem: Users were confused when selecting a TTS model but seeing a standard chat input. Solution: The composer dynamically changes based on model type — emerald-themed with voice pills for TTS, purple-themed with Sparkles icon for image/video, and standard chat with attachments for text. This immediately communicates what the model expects without any instructions.

💬 Inline Media Generation (Not Separate Pages)

Problem: Switching between "chat" and "image generator" tabs destroys conversational context. Solution: Media generation happens inline via tool calling. You ask "draw a sunset" in a normal chat, and the image appears in the conversation. The AI decides when to generate, keeping the experience conversational rather than transactional.

👁 Thinking Mode as a Toggle, Not Default

Problem: Extended thinking is useful for complex reasoning but annoying for quick questions — it adds latency and visual noise. Solution: A one-click toggle in the composer lets users opt into thinking mode per-message. When active, thinking streams into a collapsible panel so it doesn't overwhelm the response. Casual users never see it; power users love it.

📌 Model Pinning with Quick-Switch Dropdown

Problem: With 6+ providers and dozens of models, the model selector becomes overwhelming. Solution: Users pin up to 3 favorite models. The dropdown organizes by Pinned → Recent → All, so your daily-drivers are always one click away. Type badges (IMG/VID/TTS) provide instant visual cues about model capability.

🔒 Privacy-First Attachment Handling

Problem: Users hesitate to upload sensitive documents (contracts, medical images) to AI platforms that might store them. Solution: Attachments are explicitly never stored in the database. Only a placeholder like [Image sent] is persisted. The user can verify this — chat history shows only placeholders, not their actual files. This was a deliberate trust-building UX decision, not just a technical one.

Skeleton Shimmer for Media Generation

Problem: Image/video generation takes 5-30 seconds. Without feedback, users think the app is broken and re-submit. Solution: A generating SSE event immediately shows a type-appropriate shimmer skeleton (image aspect ratio, audio waveform, or video player shape). This communicates "I'm working on it" within 200ms, even though the media won't arrive for seconds.

Design Philosophy: The guiding principle was "one interface, zero context-switching." Every modal, every tab, every separate page is a place users get lost. Enox keeps everything in the chat — text, images, audio, video, agent testing — because that's where the user's mental model already lives.

02 — Technology Stack

Every technology was chosen for a specific engineering reason — performance, developer experience, or production reliability.

Frontend (User App & Admin App)

Next.js 16 · React 19 · TypeScript · Tailwind CSS 3.4 · Framer Motion · Radix UI · Zustand · Lucide Icons · React Markdown

Backend (API Server)

Express.js 4.18 · Node.js 18+ · OpenAI SDK 4.24 · Google GenAI SDK · Zod Validation · Helmet Security · pdf-parse

Infrastructure & Database

Supabase · PostgreSQL · Row-Level Security · Google OAuth 2.0 · JWT Authentication

AI Providers Supported

OpenAI (GPT-4o, DALL-E 3, TTS-1) · Anthropic (Claude) · Google (Gemini 2.5, Imagen 3, Veo 2) · Mistral · Groq · OpenRouter

02b — Technology Tradeoffs

Every technology choice involved rejecting an alternative. This section documents what I chose, what I didn't, and — most importantly — why.

Server-Sent Events over WebSockets

I chose SSE because the data flow is strictly unidirectional: the server streams tokens to the client. WebSockets are bidirectional, which adds complexity (heartbeat pings, reconnection state, socket lifecycle management) for zero benefit here. SSE also works natively with HTTP/2 multiplexing, auto-reconnects on disconnect via the EventSource API, and passes cleanly through every reverse proxy and CDN without special upgrade headers. WebSockets require an HTTP Upgrade handshake that some corporate firewalls and load balancers block.

When I'd pick WebSockets instead: If Enox had real-time collaborative editing or live presence indicators (multiple users typing), I'd need true bidirectional communication. For a chat streaming use case, SSE is strictly simpler and more reliable.

Supabase over Raw PostgreSQL

I chose Supabase because it bundles three things I'd otherwise have to build myself: authentication (Google OAuth with JWT out of the box), Row-Level Security policies with a clean SDK, and a hosted Postgres instance with connection pooling. Setting up raw Postgres + a custom auth server + session management + OAuth provider integration would have taken weeks and introduced security surface area I'd have to maintain forever.

The tradeoff: Supabase's RLS is powerful but opaque — policy bugs are hard to debug because queries silently return empty results instead of throwing errors. I also can't use advanced Postgres features like logical replication or custom extensions without their approval. For this project, the speed-to-ship and built-in auth far outweighed those limitations.

Express.js over Fastify

I chose Express because the AI SDK ecosystem is built around it. The OpenAI SDK examples, Anthropic streaming guides, and Google GenAI tutorials all use Express. Fastify is ~2× faster in raw benchmarks, but the bottleneck in an AI chat app is never the HTTP framework — it's the AI provider response time (800-3000ms). Saving 0.5ms on routing is meaningless when you're waiting 2 seconds for GPT-4o to think. Express also has a vastly larger middleware ecosystem (Helmet, CORS, express-rate-limit) that Just Works.

When I'd pick Fastify instead: For a high-throughput API that handles 10,000+ req/s where framework overhead matters (e.g., a REST API serving cached data). Enox's backend handles maybe 50 concurrent streams — Express is not the bottleneck.

Zustand over Redux / Context API

I chose Zustand because streaming chat creates extreme state update pressure — every token triggers a store update. Redux's reducer dispatch overhead and middleware chain add measurable latency at 30+ updates/second. React Context causes full subtree re-renders on every update (catastrophic for a chat UI). Zustand's useShallow selectors and direct state mutation give me surgical re-render control with zero boilerplate.

The tradeoff: Zustand has no built-in dev tools as mature as Redux DevTools, and the "single store" pattern can get unwieldy. I mitigated this by keeping the store interface clean and using localStorage persistence selectively (7 keys) rather than syncing everything.

Monorepo over Separate Repositories

I chose a monorepo because the three projects (app, backend, admin-app) share types, constants, and deployment context. With separate repos, a database schema change would require coordinated PRs across 3 repositories. In a monorepo, one commit can update the schema, backend route, and frontend type simultaneously. The mental overhead of "which repo has the bug?" disappears.

The tradeoff: No shared package extraction (like a /packages/types workspace) — types are duplicated between frontend and backend. I accepted this because the duplication is small (a few interfaces) and a Turborepo/Nx setup would add tooling complexity disproportionate to the project's size.

Native Gemini SDK + OpenAI SDK (Dual Path) over Single SDK

I chose to maintain two code paths because Google's OpenAI-compatible endpoint lies about streaming. It buffers the entire Gemini response (~5-8 seconds), then "streams" it as a burst of chunks. The native @google/genai SDK delivers true token-by-token streaming with ~200ms time-to-first-token. For the most popular free model (Gemini 2.5 Flash), this made the difference between "feels instant" and "feels broken."

The tradeoff: Two code paths means double the maintenance surface for streaming logic, error handling, and tool calling. Every new feature has to work on both paths. I mitigated this with the yield* delegation pattern — both paths yield the same typed events to the same consumer, so the downstream code is unified.

Decision Framework: My general rule was: "Optimize for the bottleneck, not the benchmark." The bottleneck in an AI chat app is always AI provider latency (800-3000ms). Any technology choice that saves <10ms is irrelevant. Choices that save 200ms+ (like native Gemini streaming, connection pooling, parallel queries) are where I invested engineering effort.

03 — System Architecture

A clean three-tier monorepo with strict separation of concerns. The backend never exposes API keys to the frontend, and all data access is governed by Supabase RLS policies.

┌─────────────────────────────────────────────────────────────────────┐
│                         MONOREPO STRUCTURE                          │
├──────────────┬──────────────────┬──────────────┬───────────────────┤
│  /app        │  /backend        │  /admin-app  │  /supabase        │
│  Next.js 16  │  Express.js      │  Next.js 16  │  schema.sql       │
│  Port 3000   │  Port 3001       │  Port 3002   │  RLS Policies     │
│  User UI     │  REST + SSE API  │  Admin UI    │  DB Functions     │
└──────┬───────┴────────┬─────────┴──────┬───────┴───────────────────┘
       │                │                │
       │    ┌───────────▼───────────┐    │
       └────►     Supabase Auth     ◄────┘
            │    (Google OAuth)     │
            └───────────┬───────────┘
                        │
       ┌────────────────▼────────────────┐
       │        PostgreSQL (RLS)         │
       │ users │ models │ agents │ chats │
       │ messages │ usage_logs │ keys    │
       └────────────────┬────────────────┘
                        │
       ┌────────────────▼────────────────┐
       │     AI Provider Abstraction     │
       │   OpenAI │ Anthropic │ Google   │
       │  Mistral │ Groq │ OpenRouter    │
       └─────────────────────────────────┘

Directory Layout

enox/
├── app/                    # User-facing Next.js application
│   └── src/
│       ├── app/(app)/      # Route groups: chat, agents, models, settings...
│       ├── components/     # 21 component directories
│       │   ├── agents/     # AgentStudio, AgentsView
│       │   ├── chat/       # ChatView, MessageBubble, ModelSelector, VoiceRecorder...
│       │   ├── layout/     # Sidebar, AppShell
│       │   ├── settings/   # SettingsView, ApiKeysView
│       │   └── ...         # explore, models, auth, legal, usage, providers
│       ├── lib/            # api.ts, supabase.ts, utils.ts
│       └── store/          # useStore.ts (Zustand)
├── backend/                # Express.js API server
│   └── src/
│       ├── lib/            # aiProvider.js, rateLimit.js, supabase.js...
│       ├── middleware/     # auth.js (JWT + cache), errorHandler.js
│       └── routes/         # chat.js, agents.js, admin.js, generate.js...
├── admin-app/              # Admin dashboard (Next.js)
│   └── src/app/            # Models, Users, Usage management
└── supabase/               # schema.sql — 8 tables, 14 indexes, RLS policies

04 — Database Schema & Security

Eight tables with 14 optimized indexes, automatic timestamp triggers, and comprehensive Row-Level Security policies that enforce data isolation at the database level.

Table | Purpose | Key Columns | RLS Policy
users | User profiles (linked to auth.users) | id, email, name, avatar_url, role | Own profile read/write; admins full access
models | Admin-managed AI models | id, name, provider, model_id, api_key, daily_limit, model_type | Public read (active only); api_key never exposed to client
agents | Custom AI agents with system prompts | id, user_id, name, username, system_prompt, model_id, temperature, top_p, max_tokens | Own agents CRUD; public agents read-only
chats | Chat sessions | id, user_id, model_id, agent_id, title | Own chats only
messages | Chat messages (user/assistant/system) | id, chat_id, role, content | Messages of own chats only
usage_logs | Per-user, per-model daily usage tracking | user_id, model_id, message_count, own_key_count, date | Own usage read-only
user_model_limits | Admin-set per-user model limit overrides | user_id, model_id, daily_limit | Admin-only
user_api_keys | User's own provider API keys (BYOK) | user_id, provider, api_key | Own keys only

Database Functions

Security Note: The models.api_key column is never exposed to any frontend client. The backend uses a Supabase service-role key to access API keys server-side only. All client-facing model queries explicitly exclude api_key from the SELECT.

05 — Backend API Server

The Express.js backend handles authentication, rate limiting, AI provider abstraction, SSE streaming, multimodal processing, and media generation — all optimized for minimal latency.

5.1 Authentication & Middleware

Every protected route passes through authMiddleware, which validates the Supabase JWT, fetches the user profile, and caches both for 5 minutes to avoid redundant database lookups:

1. Token Extraction

Extracts Bearer token from the Authorization header. Returns 401 immediately if missing.

2. In-Memory Cache Check

Checks a Map<token, {profile, ts}> cache with a 5-minute TTL. If valid, skips DB calls entirely.

3. Supabase Verification

On cache miss: calls supabase.auth.getUser(token), then fetches the full user profile from public.users. Caches the result.

Admin routes use an additional adminMiddleware that checks req.user.role === 'admin' and returns 403 if unauthorized.
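The three steps above can be sketched as a small factory (a minimal sketch — the Supabase lookups are abstracted behind a `verifyToken` parameter, which is hypothetical here; the real middleware calls supabase.auth.getUser and queries public.users):

```javascript
// Token-cache layer of authMiddleware: 5-minute TTL, keyed by raw bearer token
const AUTH_CACHE_TTL_MS = 5 * 60 * 1000;
const authCache = new Map(); // token → { profile, ts }

function makeAuthMiddleware(verifyToken) {
  return async function authMiddleware(req, res, next) {
    // Step 1: extract Bearer token, 401 immediately if missing
    const header = req.headers['authorization'] || '';
    const token = header.startsWith('Bearer ') ? header.slice(7) : null;
    if (!token) return res.status(401).json({ error: 'Missing token' });

    // Step 2: in-memory cache check — skips all DB calls on a hit
    const cached = authCache.get(token);
    if (cached && Date.now() - cached.ts < AUTH_CACHE_TTL_MS) {
      req.user = cached.profile;
      return next();
    }

    // Step 3: verification + profile fetch on cache miss, then cache the result
    const profile = await verifyToken(token);
    if (!profile) return res.status(401).json({ error: 'Invalid token' });
    authCache.set(token, { profile, ts: Date.now() });
    req.user = profile;
    return next();
  };
}
```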

5.2 AI Provider Abstraction Layer

The aiProvider.js module (841 lines) is the core intelligence layer. It provides a unified streaming interface across all six providers:

OpenAI-Compatible Path

Uses the OpenAI SDK with configurable baseURL for OpenAI, Anthropic, Google (via compatibility layer), Mistral, Groq, and OpenRouter. TCP connections are cached per provider+key pair.

Native Gemini Path

Uses the @google/genai SDK directly for true token-by-token streaming. The OpenAI-compatible wrapper buffers Gemini's entire response before "streaming" it, adding ~8s of latency. The native path eliminates this.

Progressive Fallback

Gemini models attempt configurations from most features to least: (1) thinking + tools, (2) tools only, (3) plain. Each failure falls back to the next level automatically.

Client Caching

Both OpenAI and Google clients are cached in Map keyed by provider:apiKey. This reuses TCP/TLS connections, saving ~100-300ms per request.
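The client cache described above reduces to a few lines; a sketch where `createClient` stands in for the OpenAI/Google SDK constructors:

```javascript
const clientCache = new Map(); // "provider:apiKey" → SDK client instance

function getClient(provider, apiKey, createClient) {
  const cacheKey = `${provider}:${apiKey}`;
  let client = clientCache.get(cacheKey);
  if (!client) {
    // Cold path: constructing a new client implies a fresh TCP/TLS handshake
    client = createClient(provider, apiKey);
    clientCache.set(cacheKey, client);
  }
  return client; // warm path: reused connection (~100-300ms saved)
}
```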

5.3 Streaming Chat Pipeline

The POST /api/chat/send endpoint implements a highly optimized streaming pipeline designed to minimize perceived latency:

1. Instant SSE Open

The SSE stream opens immediately — before any database work. This cuts ~1s off perceived latency. A 2KB comment padding flushes proxy buffers (Cloudflare, nginx, Traefik).

2. Mega Batch Pre-flight

All pre-flight checks run in a single Promise.all(): model lookup, agent data, rate limit check, and user API keys fetch. This parallelizes 4 database queries into one round-trip.

3. API Key Resolution

If BYOK is enabled, uses the user's key (bypasses rate limiting). Otherwise, uses the platform's key. For tool calling (image/TTS generation), merges both key pools to maximize available providers.

4. Typed SSE Event Streaming

The stream yields typed events: thinking_start, thinking_content, thinking_done, chunk (text), generating (skeleton), media (base64), clear_content, done, and error.

5. Fire-and-Forget Persistence

After [DONE] is sent, the response ends immediately. Message saving and usage increment happen asynchronously in background promises — the user never waits for DB writes.
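The mega-batch pre-flight (step 2) can be sketched as a single Promise.all over four independent lookups — the total cost is the max of the four, not their sum. The query functions here are stand-ins for the real Supabase calls:

```javascript
async function preflight(q) {
  // All four lookups are independent, so they run concurrently
  const [model, agent, rateLimit, userKeys] = await Promise.all([
    q.fetchModel(),
    q.fetchAgent(),
    q.checkRateLimit(),
    q.fetchUserKeys(),
  ]);
  return { model, agent, rateLimit, userKeys };
}
```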

5.4 Tool Calling & Media Generation

Text models can invoke two tools via function calling: generate_image and generate_tts. The system handles both real function calling (OpenAI, Gemini) and fake tool-call detection for models that don't support it.

Fake Tool-Call Detection: When models like Claude or some OpenRouter models receive tool definitions but output raw JSON instead of proper function calls, the detectFakeToolCall() function parses the JSON output, identifies the intended tool (supporting DALL-E style, direct format, and type-hint formats), clears the text from the UI, and executes the generation transparently.

Image Generation Pipeline

Text-to-Speech Pipeline

Video Generation Pipeline

5.5 Multimodal Input Processing

The platform accepts images, voice recordings, PDFs, and text files as attachments. Each provider requires different multimodal formats:

Input Type | Gemini Format | OpenAI Format
Images | inlineData: { mimeType, data } | image_url: { url: "data:..." }
Audio | inlineData | input_audio: { data, format }
PDFs | inlineData | Extracted to text via pdf-parse, injected as [Content of attached PDF]
Text files | inlineData | Base64 decoded to UTF-8, injected as text block

Privacy Design: Attachments are sent as base64 in-memory to AI providers but never stored in the database. Only privacy-safe placeholder text like [Image sent], [Voice message sent], or [File sent: report.pdf] is persisted in the messages table.

5.6 Rate Limiting System

Rate limiting uses a monthly window with separate tracking for platform usage vs. own-key usage:

Rate Limit Computation — Mathematical Model

The rate-limiting algorithm uses a monthly sliding window with owner-key exemption. The effective usage is computed by subtracting own-key requests from total requests, ensuring BYOK users never consume platform quota:

Effective Usage Calculation

U_platform  = Σ_d max(message_count_d - own_key_count_d, 0)   // for each day d in [month_start, month_end)
L_effective = override.daily_limit ?? model.daily_limit ?? 25
allowed     = U_platform < L_effective
remaining   = max(L_effective - U_platform, 0)

The nullish coalescing chain (??) implements a three-tier priority system. If an admin has set a custom limit for this specific user+model pair, it takes precedence. Otherwise, the model's default limit is used. As a final fallback, the system defaults to 25.
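The formula and priority chain translate directly into code; a sketch, assuming `usageRows` holds the month's usage_logs rows for this user and model:

```javascript
const DEFAULT_DAILY_LIMIT = 25; // final fallback in the ?? chain

function checkRateLimit(usageRows, modelLimit, overrideLimit) {
  // Platform usage excludes BYOK requests, clamped at zero per day
  const platformUsage = usageRows.reduce(
    (sum, d) => sum + Math.max(d.message_count - d.own_key_count, 0), 0);
  // Three-tier priority: admin override → model default → system default
  const effectiveLimit = overrideLimit ?? modelLimit ?? DEFAULT_DAILY_LIMIT;
  return {
    allowed: platformUsage < effectiveLimit,
    remaining: Math.max(effectiveLimit - platformUsage, 0),
  };
}
```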

Usage Increment — Atomic Upsert

// Two separate counters maintained per (user, model, date) tuple:
message_count += 1                      // always incremented
own_key_count += usedOwnKey ? 1 : 0     // only if BYOK

// ON CONFLICT (user_id, model_id, date) DO UPDATE
//   → Guarantees atomic increment even under concurrent requests
//   → UNIQUE constraint on (user_id, model_id, date) prevents duplicate rows

05b — Request Lifecycle & Timing Analysis

A complete end-to-end trace of a single chat message, from the user pressing Enter to the final token rendered in the browser. Every millisecond is accounted for.

Full Request Sequence

Participants: Browser · Express · Supabase · AI Provider
1. Browser → Express  POST /api/chat/send

Client sends the message payload (modelId, message, attachments, settings). Express validates with Zod.parse(body).

2. Express → Browser  T+0ms

SSE stream opens instantly — before any DB work. Response headers flushed with a 2KB padding comment to force proxy buffer flush.

3. Express → Supabase  Promise.all()

Four parallel queries execute simultaneously: model lookup, agent data, rate limit check, user API keys. All 4 results return in a single round-trip.

Parallel results: model · agent · rateLimit · userKeys
4. Express (internal)  T+80ms

resolveApiKey() — O(1) lookup from pre-fetched keys. buildGenApiKeys() — merges user + platform keys for tool calling. insertChat() — creates new chat session (new chats only).

5. Express → Browser  meta:{chatId}  T+120ms

Chat ID is sent to the client. The frontend updates activeChatId in the Zustand store and prepends the new chat to the sidebar list.

6. Express → AI Provider  T+200ms

buildMessages() constructs the messages array, then stream.open() initiates an HTTP/2 stream to the AI provider via cached client connection.

7. AI Provider → Express → Browser  Thinking Phase

The AI model's internal reasoning streams back in real-time:

Events: thinking_start → thinking_content ×N → thinking_done
8. AI Provider → Express → Browser  Token Streaming  T+800ms first token

Text tokens stream through the async generator pipeline. Each token is forwarded as a chunk SSE event with <1ms relay overhead. The stream may also yield generating and media events for tool-called content.

Events: chunk ×N · generating · media · clear_content
9. Express → Browser  done + [DONE]  T+3200ms

Stream completes. done:{chatId} event signals the frontend to finalize the message and enable regeneration. [DONE] terminates the SSE connection.

10. Express → Supabase  Fire-and-Forget  T+end (async)

After the client connection is closed, two background promises persist data without blocking the user: message INSERT and usage increment via atomic upsert. Zero impact on perceived latency.

Background tasks: saveMessage() · incrementUsage()

Timing Breakdown — Critical Path Analysis

The critical path is the sequence of operations that cannot be parallelized. By moving all possible work off the critical path, time-to-first-token is minimized:

T+0ms: SSE Stream Opened

Response headers are flushed immediately. Content-Type: text/event-stream is set, X-Accel-Buffering: no disables nginx buffering, and socket.setNoDelay(true) disables Nagle's algorithm. A ~2KB SSE comment (':' followed by 2048 spaces and '\n\n') forces the proxy buffer to flush.

T+5ms → T+80ms: Mega-Batch Parallel Queries

Four independent database queries execute simultaneously via Promise.all(). Amortized cost: max(T_model, T_agent, T_rateLimit, T_keys) instead of T_model + T_agent + T_rateLimit + T_keys. Typical savings: ~60-200ms depending on Supabase region latency.

T+80ms → T+120ms: API Key Resolution & Chat Insert

Key resolution is O(1) lookup from the pre-fetched userKeyMap object — zero additional DB calls. For new chats, a single INSERT returns the UUID. For existing chats, message history is fetched (limited to last 20 messages to bound context window cost).

T+120ms → T+200ms: AI Provider Connection

Client lookup from the connection cache is O(1). If a cached client exists, TLS handshake is skipped (saving ~100-300ms). The stream.open() call initiates an HTTP/2 stream to the provider.

T+200ms → T+800ms: Provider-Side Processing (Thinking)

The AI model processes the prompt. During this period, thinking tokens stream back if supported. The client sees thinking_start instantly, with thinking_content events streaming reasoning in real-time.

T+800ms+: Token Streaming

First text token arrives. Each token is forwarded to the client as a chunk SSE event with <1ms relay overhead. The async generator yields tokens as they arrive — no buffering.

T+end (post-stream): Fire-and-Forget Persistence

[DONE] is sent to the client, then res.end() closes the connection. Two background Promises execute asynchronously: message INSERT and usage increment. The user never waits for these writes — they have zero impact on perceived latency.

Latency Budget Formula

Time-To-First-Token (TTFT)

TTFT = T_sse_open
     + max(T_model_q, T_agent_q, T_rate_q, T_keys_q)
     + T_key_resolve
     + T_chat_insert
     + T_ai_connect
     + T_ai_thinking

// Where:
T_sse_open    ≈ 0ms          // sync — no await
max(queries)  ≈ 60-80ms      // parallel, NOT sequential
T_key_resolve ≈ 0ms          // O(1) Map lookup from pre-fetched data
T_chat_insert ≈ 30-50ms      // single INSERT (new chat only)
T_ai_connect  ≈ 50-100ms     // with cached client (vs 200-400ms cold)
T_ai_thinking ≈ 400-2000ms   // model-dependent, not optimizable

// Typical TTFT: ~600-2300ms (dominated by AI provider latency)
// Without optimizations: ~1800-4500ms (2-3x slower)

05c — Async Generator Streaming Pipeline

The streaming system is built on JavaScript async generators — a composable pipeline pattern that yields typed events from the AI provider through the SSE transport layer.

Generator Chain Architecture

The streaming pipeline is a three-stage chain of async generators. Each stage transforms or enriches the data before passing it downstream:

┌──────────────────────────────────────────────────────────────────────────┐
│                        ASYNC GENERATOR PIPELINE                          │
│                                                                          │
│  ┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐   │
│  │     STAGE 1     │    │     STAGE 2      │    │      STAGE 3       │   │
│  │ Provider Stream │───►│  Tool Execution  │───►│ SSE Serialization  │   │
│  │                 │    │                  │    │                    │   │
│  │ Yields:         │    │ Yields:          │    │ Writes:            │   │
│  │ • string tokens │    │ • typed events   │    │ • data: {JSON}\n\n │   │
│  │ • thinking parts│    │ • media chunks   │    │ • data: [DONE]\n\n │   │
│  │ • function calls│    │ • text content   │    │                    │   │
│  │ • inline media  │    │ • clear signals  │    │                    │   │
│  └────────┬────────┘    └────────┬─────────┘    └────────┬───────────┘   │
│           │ async function*      │ yield*                │ res.write()   │
│           │ streamChatCompletion │ + executeToolCall     │               │
│           └──────────────────────┴───────────────────────┘               │
└──────────────────────────────────────────────────────────────────────────┘

Stage 1 — Provider-Specific Stream

The entry point is streamChatCompletion(), an async function* (async generator function) that routes to provider-specific implementations:

async function* streamChatCompletion(provider, apiKey, modelId, messages, maxTokens,
                                     temperature, topP, genApiKeys, modelType, ttsOptions) {
  // Route 1: TTS models → dedicated AUDIO modality path
  if (modelType === 'tts' && provider === 'google') {
    yield { type: 'generating', mediaType: 'audio' };
    yield* streamGeminiTTS(apiKey, modelId, messages, ttsOptions);
    return;
  }

  // Route 2: Google models → native SDK (true token streaming)
  if (provider === 'google') {
    yield* streamGeminiNative(apiKey, modelId, messages, ...);
    return;
  }

  // Route 3: All others → OpenAI-compatible SDK
  // ... streams tokens, accumulates tool calls, detects fake calls
}
yield* (Delegation): The yield* syntax delegates to a sub-generator, forwarding all yielded values directly to the consumer. This enables composable pipeline stages without manual iteration. Each sub-generator is itself an async function* that can yield typed event objects or plain strings.
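A toy example of the delegation pattern (event shapes borrowed from the typed SSE events listed earlier in this document):

```javascript
// A provider-specific sub-generator yielding typed events
async function* subStream() {
  yield { type: 'thinking_start' };
  yield { type: 'chunk', content: 'Hello' };
  yield { type: 'done' };
}

// The router generator: yield* forwards every event from the
// sub-generator to the consumer without manual iteration
async function* routeStream() {
  yield* subStream();
}

// A consumer that drains any async generator into an array
async function collect(gen) {
  const events = [];
  for await (const ev of gen) events.push(ev);
  return events;
}
```

Because every stage yields the same typed event objects, the downstream consumer never needs to know which provider path produced them.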

Stage 2 — Tool Call Accumulation & Execution

For OpenAI-compatible providers, tool calls arrive as streamed deltas across multiple chunks. The system accumulates them before execution:

// Tool calls arrive fragmented across chunks:
// chunk[0].delta.tool_calls = [{ index: 0, id: "call_abc", function: { name: "gene" } }]
// chunk[1].delta.tool_calls = [{ index: 0, function: { name: "rate_image" } }]
// chunk[2].delta.tool_calls = [{ index: 0, function: { arguments: '{"pro' } }]
// chunk[3].delta.tool_calls = [{ index: 0, function: { arguments: 'mpt":"cat"}' } }]

const pendingToolCalls = {}; // Map<index, {id, name, arguments}>

for await (const chunk of stream) {
  const toolCalls = chunk.choices?.[0]?.delta?.tool_calls;
  if (toolCalls) {
    for (const tc of toolCalls) {
      if (!pendingToolCalls[tc.index]) pendingToolCalls[tc.index] = { id: '', name: '', arguments: '' };
      if (tc.id) pendingToolCalls[tc.index].id += tc.id;
      if (tc.function?.name) pendingToolCalls[tc.index].name += tc.function.name;
      if (tc.function?.arguments) pendingToolCalls[tc.index].arguments += tc.function.arguments;
    }
  }
}
// After stream ends: parse accumulated JSON and execute each tool
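A runnable version of the accumulator, fed the same fragmented deltas shown in the comments above (chunk shapes mirror the OpenAI streaming format):

```javascript
function accumulateToolCalls(chunks) {
  const pending = {}; // index → { id, name, arguments }
  for (const chunk of chunks) {
    const toolCalls = chunk.choices?.[0]?.delta?.tool_calls;
    if (!toolCalls) continue;
    for (const tc of toolCalls) {
      // Create the slot on first sight of this index, then concatenate fragments
      const slot = (pending[tc.index] ??= { id: '', name: '', arguments: '' });
      if (tc.id) slot.id += tc.id;
      if (tc.function?.name) slot.name += tc.function.name;
      if (tc.function?.arguments) slot.arguments += tc.function.arguments;
    }
  }
  return pending;
}

// The four fragmented deltas from the example:
const chunks = [
  { choices: [{ delta: { tool_calls: [{ index: 0, id: 'call_abc', function: { name: 'gene' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { name: 'rate_image' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: '{"pro' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: 'mpt":"cat"}' } }] } }] },
];
const calls = accumulateToolCalls(chunks);
// calls[0] → { id: 'call_abc', name: 'generate_image', arguments: '{"prompt":"cat"}' }
```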

Fake Tool-Call Detection — Pattern Matching Algorithm

Models that don't support native function calling sometimes emit JSON that looks like a tool call. The detectFakeToolCall() function uses a multi-pattern matching algorithm:

detectFakeToolCall(fullText) → { toolName, args } | null

// Step 1: Extract JSON — try markdown-wrapped, then raw, then substring
parsed = tryParse(text.match(/```json?\s*([\s\S]*?)```/)?.[1])
      ?? tryParse(text)
      ?? tryParse(text.substring(text.indexOf('{'), text.lastIndexOf('}') + 1))

// Step 2: Match against 5 known patterns:
// Pattern A: DALL-E style   → { action: "dalle.text2im", action_input: "{...}" }
// Pattern B: Direct tool    → { tool: "generate_image", prompt: "..." }
// Pattern C: Function style → { function: "generate_image", arguments: {...} }
// Pattern D: Type-hint      → { prompt: "...", type: "image" }
// Pattern E: TTS style      → { action: "tts", text: "..." }

// Step 3: Extract args using cascading property access:
prompt = parsed.prompt ?? parsed.input?.prompt ?? parsed.arguments?.prompt ?? parsed.parameters?.prompt
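A simplified, runnable sketch of this detector covering patterns A, B, and D (the full version also handles the function-style and TTS payloads; helper names are illustrative):

```javascript
// The fence string is built at runtime to keep literal backtick fences
// out of this example's own source
const FENCE = '`'.repeat(3);
const FENCE_RE = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE);

function tryParse(text) {
  if (!text) return null;
  try { return JSON.parse(text); } catch { return null; }
}

function detectFakeToolCall(fullText) {
  // Step 1: markdown-wrapped JSON, then raw text, then brace-delimited substring
  const parsed = tryParse(fullText.match(FENCE_RE)?.[1])
    ?? tryParse(fullText)
    ?? tryParse(fullText.substring(fullText.indexOf('{'), fullText.lastIndexOf('}') + 1));
  if (!parsed || typeof parsed !== 'object') return null;

  // Pattern A: DALL-E style — { action: "dalle.text2im", action_input: "{...}" }
  if (typeof parsed.action === 'string' && parsed.action.includes('text2im')) {
    const inner = tryParse(parsed.action_input) ?? parsed.action_input;
    return { toolName: 'generate_image', args: { prompt: inner?.prompt } };
  }
  // Pattern B: direct tool — { tool: "generate_image", prompt: "..." }
  if (parsed.tool) return { toolName: parsed.tool, args: { prompt: parsed.prompt } };
  // Pattern D: type hint — { prompt: "...", type: "image" }
  if (parsed.type === 'image' && parsed.prompt) {
    return { toolName: 'generate_image', args: { prompt: parsed.prompt } };
  }
  return null; // plain prose or unknown shape → not a tool call
}
```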

05d — Server-Sent Events Protocol Internals

The SSE transport layer is hand-optimized for minimum latency across reverse proxies, CDNs, and mobile networks. Every byte of the protocol is deliberate.

SSE Wire Format

Each event is a JSON-serialized line prefixed with data:  and terminated by \n\n. The format is defined by the W3C EventSource specification:

SSE Frame Structure

Padding Event (T+0ms):
: <2048 spaces>\n\n
  │ colon       = SSE comment (ignored by EventSource)
  │ 2048 spaces = fills proxy buffers
  │ \n\n        = event terminator

Meta Event:
data: {"type":"meta","chatId":"550e8400-e29b-41d4-a716-446655440000"}\n\n

Thinking Events:
data: {"type":"thinking_start"}\n\n
data: {"type":"thinking_content","content":"Let me analyze..."}\n\n
data: {"type":"thinking_done","thinkingTime":1.2}\n\n

Text Chunk Event:
data: {"type":"chunk","content":"Hello"}\n\n

Media Event (base64 payload):
data: {"type":"media","mimeType":"image/png","data":"iVBORw0KGgo..."}\n\n

Terminal Events:
data: {"type":"done","chatId":"550e8400-..."}\n\n
data: [DONE]\n\n
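The frames above can be produced by a tiny serializer (a sketch — the real code writes the same strings directly to the Express response object):

```javascript
// One wire frame per typed event: "data: " + JSON + blank-line terminator
function sseFrame(event) {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// The non-JSON sentinel that terminates the stream
function sseDone() {
  return 'data: [DONE]\n\n';
}
```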

Proxy Buffer Flush Strategy

Reverse proxies (nginx, Cloudflare, Traefik) buffer responses until they hit a minimum size threshold. Without the 2KB padding, the first SSE event may be delayed by up to 30 seconds until enough data accumulates:

Proxy Buffer Flush Condition

buffer_size ≥ proxy_threshold → flush to client

// Typical proxy thresholds:
//   nginx:      4KB (proxy_buffer_size default)
//   Cloudflare: 1KB (automatic edge buffering)
//   Traefik:    4KB (default buffer)

// Solution: 2KB SSE comment = ':' + ' '×2048 + '\n\n' = 2051 bytes
// Combined with response headers (~500 bytes) = ~2.5KB
// Exceeds Cloudflare threshold → instant flush

// Additional transport hints:
X-Accel-Buffering: no                   // disables nginx proxy buffering
Cache-Control: no-cache, no-transform   // prevents CDN caching
socket.setNoDelay(true)                 // disables Nagle's algorithm (TCP)
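A minimal Express-style sketch of this anti-buffering setup. The header names and the 2KB comment come from the text above; sseOpen and sseEvent are illustrative helpers, not the production API:

```javascript
// Open an SSE response with all three anti-buffering measures applied.
function sseOpen(res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform', // keep CDNs from caching/transforming
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no',                 // disable nginx proxy buffering
  });
  res.socket?.setNoDelay(true);                // disable Nagle's algorithm (TCP)
  res.write(ssePadding());                     // overflow proxy buffers immediately
}

// ':' starts an SSE comment (ignored by EventSource); 2048 spaces + '\n\n'
// pushes the response past typical proxy flush thresholds.
function ssePadding() {
  return ':' + ' '.repeat(2048) + '\n\n';
}

// Serialize one SSE data frame.
function sseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```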

SSE Event Type System — Finite State Machine

The client-side SSE parser implements a state machine that processes events in strict order. Invalid transitions are handled gracefully:

SSE EVENT STATE MACHINE

INITIAL    ──meta──────────────► CONNECTED
CONNECTED  ──thinking_start────► THINKING
CONNECTED  ──chunk─────────────► STREAMING
THINKING   ──thinking_content──► THINKING    (loop)
THINKING   ──thinking_done─────► STREAMING
STREAMING  ──chunk─────────────► STREAMING   (append text)
STREAMING  ──generating────────► STREAMING   (show skeleton)
STREAMING  ──clear_content─────► STREAMING   (reset text)
STREAMING  ──media─────────────► STREAMING   (render media)
STREAMING  ──done──────────────► COMPLETE
STREAMING  ──error─────────────► COMPLETE
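The transitions above can be sketched as a small client-side reducer. The event names come from the protocol; the state shape and helper name are illustrative:

```javascript
// Minimal event reducer mirroring the SSE state machine; unknown events
// are ignored so invalid transitions degrade gracefully.
function createSseReducer() {
  const state = { phase: 'INITIAL', chatId: null, text: '', thinking: '', media: [], error: null };
  return function reduce(evt) {
    switch (evt.type) {
      case 'meta':             state.phase = 'CONNECTED'; state.chatId = evt.chatId; break;
      case 'thinking_start':   state.phase = 'THINKING'; break;
      case 'thinking_content': state.thinking += evt.content; break;
      case 'thinking_done':    state.phase = 'STREAMING'; break;
      case 'chunk':            state.phase = 'STREAMING'; state.text += evt.content; break;
      case 'generating':       /* UI: show shimmer skeleton */ break;
      case 'clear_content':    state.text = ''; break; // wipe fake tool-call JSON
      case 'media':            state.media.push({ mimeType: evt.mimeType, data: evt.data }); break;
      case 'done':             state.phase = 'COMPLETE'; break;
      case 'error':            state.phase = 'COMPLETE'; state.error = evt.message ?? 'error'; break;
      default: break; // unknown event types are ignored
    }
    return state;
  };
}
```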

05e — Audio EncodingPCM-to-WAV Binary Encoding

Google's Gemini TTS returns raw PCM audio samples. Browsers cannot play raw PCM — the pcmToWav() function constructs a valid WAV file by manually writing the 44-byte RIFF/WAVE header.

WAV File Structure — Byte-Level Layout

WAV File Format (RIFF/WAVE)

Offset  Size  Field          Value
──────  ────  ─────          ─────
0x00    4     ChunkID        "RIFF"            // ASCII magic bytes
0x04    4     ChunkSize      36 + dataSize     // file size - 8
0x08    4     Format         "WAVE"            // ASCII format identifier
── fmt sub-chunk ──
0x0C    4     Subchunk1ID    "fmt "            // with trailing space
0x10    4     Subchunk1Size  16                // PCM = 16 bytes
0x14    2     AudioFormat    1                 // 1 = PCM (uncompressed)
0x16    2     NumChannels    1                 // mono
0x18    4     SampleRate     24000             // 24kHz (from Gemini)
0x1C    4     ByteRate       48000             // SampleRate × BlockAlign
0x20    2     BlockAlign     2                 // NumChannels × BitsPerSample/8
0x22    2     BitsPerSample  16                // 16-bit samples
── data sub-chunk ──
0x24    4     Subchunk2ID    "data"
0x28    4     Subchunk2Size  dataSize          // raw PCM byte count
0x2C    N     Data           [PCM samples...]  // little-endian int16

Audio Mathematics

WAV Encoding Formulas

ByteRate   = SampleRate × NumChannels × (BitsPerSample / 8)
           = 24000 × 1 × (16 / 8) = 48,000 bytes/sec

BlockAlign = NumChannels × (BitsPerSample / 8)
           = 1 × (16 / 8) = 2 bytes per sample frame

ChunkSize  = 36 + dataSize   // 36 = header bytes (44) minus RIFF header (8)
FileSize   = 44 + dataSize   // 44-byte header + raw PCM

// Duration of generated audio:
Duration = dataSize / ByteRate   // in seconds
         = dataSize / 48,000

// Sample rate extraction from MIME type:
// Gemini returns: "audio/L16;rate=24000"
rate = parseInt(mimeType.match(/rate=(\d+)/)?.[1]) ?? 24000

Format Auto-Detection

Before wrapping with a WAV header, the function checks if the data is already in a playable format by inspecting magic bytes:

// Check for existing WAV header (RIFF...WAVE)
if (raw[0..3] === "RIFF" && raw[8..11] === "WAVE") → return as-is

// Check for MP3 sync word (frame header)
if (raw[0] === 0xFF && (raw[1] & 0xE0) === 0xE0) → return as-is

// Otherwise: raw PCM → wrap with 44-byte WAV header
const wav = Buffer.alloc(44 + dataSize);
// ... write RIFF, fmt, data sub-chunks at byte offsets
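A self-contained Node sketch of this logic, following the byte offsets in the table above. The production pcmToWav may differ in details:

```javascript
// Wrap raw PCM in a 44-byte RIFF/WAVE header, passing already-containered
// audio (WAV or MP3) through unchanged via magic-byte detection.
function pcmToWav(raw, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  // Already WAV? ("RIFF" at offset 0, "WAVE" at offset 8)
  if (raw.length >= 12 && raw.toString('ascii', 0, 4) === 'RIFF' &&
      raw.toString('ascii', 8, 12) === 'WAVE') return raw;
  // MP3 sync word (0xFFEx frame header)?
  if (raw.length >= 2 && raw[0] === 0xff && (raw[1] & 0xe0) === 0xe0) return raw;

  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + raw.length, 4);   // ChunkSize = file size - 8
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');          // note the trailing space
  header.writeUInt32LE(16, 16);               // fmt sub-chunk size (PCM)
  header.writeUInt16LE(1, 20);                // AudioFormat 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(raw.length, 40);       // Subchunk2Size = raw PCM bytes
  return Buffer.concat([header, raw]);
}
```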

05f — ConcurrencyConcurrency & Parallelization Model

The system maximizes throughput through strategic parallelization of independent operations, connection reuse, and non-blocking I/O patterns.

Promise.all() Parallelization Map

Every request triggers multiple independent operations. The codebase uses Promise.all() at every opportunity to convert sequential I/O into parallel I/O:

Location              | Parallel Operations                     | Sequential Cost   | Parallel Cost
Chat pre-flight       | model + agent + rateLimit + userKeys    | ~320ms (4 × 80ms) | ~80ms (max of 4)
Existing chat         | insertUserMsg + fetchHistory            | ~160ms            | ~80ms
Regenerate pre-flight | lastMsg + agent + rateLimit + userKeys  | ~320ms            | ~80ms
Regenerate setup      | deleteLastMsg + fetchHistory            | ~160ms            | ~80ms
Auth init (frontend)  | profile + models + agents + chatHistory | ~400ms            | ~100ms
Chat history endpoint | count + dataFetch                       | ~160ms            | ~80ms
Messages endpoint     | chatVerify + count + messages           | ~240ms            | ~80ms
Admin stats           | users + models + agents + monthUsage    | ~320ms            | ~80ms
Parallelization Efficiency

T_sequential = Σ Ti       // sum of all query times
T_parallel   = max(Ti)    // bottleneck only
Speedup      = T_sequential / T_parallel

// For N queries of equal cost T:
Speedup = N × T / T = N   // → 4 parallel queries = 4× speedup (linear scaling)

// Total saved per chat request: ~240ms (pre-flight) + ~80ms (history)
// Over 1000 daily requests: 320 seconds of cumulative latency eliminated
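The mega-batch pre-flight can be sketched as follows. The dependency functions (getModel, getAgent, checkRateLimit, getUserKeys) are hypothetical stand-ins for the real Supabase queries:

```javascript
// Issue all four independent pre-flight lookups in one parallel batch,
// so total latency is the slowest query rather than the sum of all four.
async function chatPreflight(deps, { userId, modelId, agentId }) {
  const [model, agent, rateLimit, userKeys] = await Promise.all([
    deps.getModel(modelId),
    agentId ? deps.getAgent(agentId) : Promise.resolve(null),
    deps.checkRateLimit(userId, modelId),
    deps.getUserKeys(userId),
  ]);
  if (!model) throw new Error('Model not found');
  return { model, agent, rateLimit, userKeys };
}
```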

Connection Pool Cache — O(1) Client Lookup

AI provider clients are cached in Map data structures, providing O(1) amortized lookup by composite key:

// Cache key: "provider:apiKey" → deduplicated per provider+credential
const clientCache = new Map();        // OpenAI-compatible clients
const googleClientCache = new Map();  // Native Google AI clients

// Lookup: O(1) average case (hash map)
// Memory: O(P × K) where P = providers, K = unique keys
// Typical: ~6-12 cached clients (6 providers × 1-2 keys each)

// What's saved per cache hit:
//   1. Object construction (~5ms)
//   2. TCP connection establishment (~50ms)
//   3. TLS handshake (~100-200ms)
// Total savings per hit: ~155-255ms
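A minimal sketch of the composite-key cache; createClient stands in for the real SDK constructors (e.g. new OpenAI({ apiKey })):

```javascript
// Client pool keyed by "provider:apiKey" — construction and TLS setup
// are paid once per provider+credential pair, then reused.
const clientCache = new Map();

function getCachedClient(provider, apiKey, createClient) {
  const key = `${provider}:${apiKey}`;        // composite dedup key
  let client = clientCache.get(key);
  if (!client) {
    client = createClient(provider, apiKey);  // pay construction + TLS once
    clientCache.set(key, client);
  }
  return client;                              // O(1) amortized lookup
}
```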

Auth Cache — Token-Indexed Profile Store

Auth Cache Parameters

Structure:  Map<JWT_token, { profile: User, ts: number }>
TTL:        300,000ms (5 minutes)
Eviction:   Lazy — checked on access: if (Date.now() - ts > TTL) → miss
Hit ratio:  ~95% for active users (tokens refresh every ~60min)

// Per-request savings on cache hit:
// Skips: supabase.auth.getUser() (~40ms) + users.select() (~40ms) = ~80ms

// Frontend mirror: 30s TTL auth header cache
// Skips: supabase.auth.getSession() (~50-150ms)
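A sketch of the lazily-evicted cache described above, with the clock injectable so expiry is testable; function names are illustrative:

```javascript
// Token-indexed profile cache with lazy TTL eviction on access.
const AUTH_TTL_MS = 5 * 60 * 1000;
const authCache = new Map(); // token → { profile, ts }

function cacheProfile(token, profile, now = Date.now()) {
  authCache.set(token, { profile, ts: now });
}

function getCachedProfile(token, now = Date.now()) {
  const entry = authCache.get(token);
  if (!entry) return null;              // miss: token never seen
  if (now - entry.ts > AUTH_TTL_MS) {   // lazy eviction on access
    authCache.delete(token);
    return null;                        // miss: entry expired
  }
  return entry.profile;                 // hit: skips the Supabase round-trips
}
```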

05g — FallbackGemini Progressive Fallback State Machine

Gemini models have varying capabilities (thinking, tools, image output). The system implements a three-stage fallback that automatically downgrades features when a model doesn't support them.

START: streamGeminiNative()
  │
  ▼
ATTEMPT 1 (Full Features)
  thinkingConfig: { include: ✓ }
  tools: GENERATION_TOOLS_GEMINI
  ├─ success ──────────────────► STREAM RESPONSE
  └─ failure + isUnsupported ──► ATTEMPT 2

ATTEMPT 2 (No Thinking)
  thinkingConfig: none
  tools: GENERATION_TOOLS_GEMINI
  ├─ success ──────────────────► STREAM RESPONSE
  └─ failure + isUnsupported ──► ATTEMPT 3

ATTEMPT 3 (Plain Text)
  thinkingConfig: none
  tools: none
  ├─ success ──────────────────► STREAM RESPONSE
  └─ failure ──────────────────► THROW ERROR

Unsupported Feature Detection Heuristic

const isUnsupported =
  msg.includes('Thinking is not enabled') ||
  msg.includes('not supported') ||
  msg.includes('INVALID_ARGUMENT') ||
  (err?.status === 400 && (
    msg.includes('think') ||
    msg.includes('tool') ||
    msg.includes('function')
  ));

// If isUnsupported && more attempts remain → retry with simpler config
// If isUnsupported && no attempts remain   → throw (fatal)
// If !isUnsupported → throw immediately (don't waste retries on auth/rate errors)
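Wired into a retry loop, the heuristic drives the three-stage fallback roughly like this; attempt stands in for the actual Gemini streaming call:

```javascript
// Classify an error as "unsupported feature" (retryable with a simpler
// config) vs fatal (auth, rate limit, network).
function isUnsupported(err) {
  const msg = String(err?.message ?? '');
  return msg.includes('Thinking is not enabled')
    || msg.includes('not supported')
    || msg.includes('INVALID_ARGUMENT')
    || (err?.status === 400 && /think|tool|function/.test(msg));
}

// Try the most capable config first, degrading on unsupported-feature errors.
async function streamWithFallback(attempt) {
  const configs = [
    { thinking: true,  tools: true  },  // Attempt 1: full features
    { thinking: false, tools: true  },  // Attempt 2: no thinking
    { thinking: false, tools: false },  // Attempt 3: plain text
  ];
  for (let i = 0; i < configs.length; i++) {
    try {
      return await attempt(configs[i]);
    } catch (err) {
      // Fatal errors throw immediately; unsupported-feature errors only
      // retry while simpler configs remain.
      if (!isUnsupported(err) || i === configs.length - 1) throw err;
    }
  }
}
```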

05h — Token BudgetToken Budget & Context Window Management

The system implements a multi-tier priority chain for determining the token budget for each request, with hard caps based on key ownership.

Max Tokens Resolution Chain

// Three-tier priority with ownership-based hard cap:
hardCap = useOwnKeys ? 131,072 : 10,000

maxTokens = min(
  body.maxTokens          // Priority 1: Frontend override (Agent Studio)
    ?? agent.max_tokens   // Priority 2: Agent setting from DB
    ?? model.max_tokens
    ?? 4096,              // Priority 3: Model default / global fallback
  hardCap                 // Clamp: prevent abuse
)

// Same pattern for temperature and top_p:
temperature = body.temperature ?? agent.temperature ?? 0.7
topP        = body.topP ?? agent.top_p ?? 0.95
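The same chain as a small function; the input shapes are illustrative stand-ins for the request body and DB rows:

```javascript
// Resolve the effective token budget: three-tier priority, then clamp
// to an ownership-based hard cap.
function resolveMaxTokens({ body = {}, agent = {}, model = {}, useOwnKeys = false }) {
  const hardCap = useOwnKeys ? 131072 : 10000;
  const requested =
    body.maxTokens       // Priority 1: frontend override (Agent Studio)
    ?? agent.max_tokens  // Priority 2: agent setting from DB
    ?? model.max_tokens  // Priority 3: model default…
    ?? 4096;             // …or global fallback
  return Math.min(requested, hardCap);  // clamp to prevent abuse
}
```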

Context Window Bounding

To prevent unbounded context growth and keep costs predictable, the system limits conversation history to the most recent 20 messages:

Context Window Strategy

// For existing chats:
history = SELECT role, content FROM messages
          WHERE chat_id = chatId
          ORDER BY created_at ASC
          LIMIT 20

// This bounds the context window to approximately:
max_context_tokens ≈ 20 messages × ~500 tokens/msg avg = ~10,000 tokens

// For Agent Studio (conversationHistory from frontend):
// No server-side limit — frontend manages via localStorage
// System prompt is prepended as first message in array

// Message array construction order:
// [system_prompt?, ...history_messages, current_user_message]

Dual Usage Tracking for Public Agents

When a user chats with a public agent created by another user, usage is counted against both the consumer and the agent creator (unless the consumer uses their own key):

Dual Usage Attribution

// Always:
incrementUsage(userId, modelId, actuallyUsedOwnKey)

// Additionally, if all conditions met:
if (agentCreatorId                   // agent exists
    && agentCreatorId !== userId     // not the creator themselves
    && !actuallyUsedOwnKey) {        // using platform key
  incrementUsage(agentCreatorId, modelId, false)
}

// This prevents creators from consuming unlimited platform quota
// by publishing popular agents that others use.
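As a testable function (incrementUsage is a stand-in for the real DB write; the conditions mirror the logic above):

```javascript
// Charge the consumer always; charge the agent creator only when a
// *different* user consumed platform-key quota through their public agent.
function attributeUsage({ userId, modelId, agentCreatorId, usedOwnKey }, incrementUsage) {
  incrementUsage(userId, modelId, usedOwnKey);
  if (agentCreatorId && agentCreatorId !== userId && !usedOwnKey) {
    incrementUsage(agentCreatorId, modelId, false);
  }
}
```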

05i — MultimodalMultimodal Processing Pipeline

Input attachments flow through a provider-specific transformation pipeline. The system handles images, audio, PDFs, and text files with automatic format conversion and privacy-safe storage.

Attachment Processing Flow

MULTIMODAL INPUT PIPELINE

User Attachments (base64 in request)
  │
  ▼
Zod Validation
  type ∈ { image, voice, file }
  max: 5 attachments
  │
  ├─► provider === 'google' → buildGeminiParts()
  │     image → inlineData
  │     audio → inlineData
  │     pdf   → inlineData
  │     text  → inlineData
  │
  ├─► provider !== 'google' → buildOpenAIMultimodalContent()
  │     image/*         → { type: "image_url", url: "data:..." }
  │     audio/*         → { type: "input_audio", data, format }
  │     application/pdf → extractPdfText(base64) → text block
  │                       "[Content of attached PDF file]:\n\n..."
  │     text/*          → Base64.decode() → text block
  │     + user message text → final content array
  │
  └─► DB Storage (parallel) — placeholder only
        "[Image sent]"  "[Voice sent]"  "[File: x.pdf]"

PDF Text Extraction

For non-Google providers that don't support inline PDF, the system uses pdf-parse to extract text content:

// Input: base64-encoded PDF bytes
const buffer = Buffer.from(base64Data, 'base64');
const result = await pdfParse(buffer);
const text = result.text?.trim() || '';

// Output: injected as a structured text block
content.push({
  type: 'text',
  text: `[Content of attached PDF file]:\n\n${pdfText}`
});

// Error handling: graceful degradation
// On parse failure → "[PDF content could not be extracted]"
// PDF is still sent as base64 to Gemini (native PDF support)

Body Size Limits

Request Size Constraints

JSON body limit = 50MB           // express.json({ limit: '50mb' })
Max attachments = 5              // Zod: z.array().max(5)
Max message     = 32,000 chars   // Zod: z.string().max(32000)
Max prompt      = 10,000 chars   // Agent system_prompt
Max username    = 30 chars       // /^[a-z0-9_]{3,30}$/

// Typical base64 overhead: ~33% larger than binary
// → 50MB JSON limit supports ~37.5MB of actual file data
// → A single 4K image ≈ 8-15MB base64
// → 5 images at max quality: ~50-75MB → may exceed limit
// → In practice: compressed images are 200KB-2MB each

06 — FrontendUser Application

A premium dark-themed SPA built with Next.js 16 App Router, Zustand state management, and Framer Motion animations. Designed for speed — all critical data is preloaded in parallel on login.

6.1 State Management (Zustand)

A single Zustand store manages the entire application state, giving every component one source of truth; heavy components subscribe to narrow slices via useShallow to limit re-renders.

6.2 Auth Header Caching

The frontend API layer caches auth headers for 30 seconds, avoiding redundant supabase.auth.getSession() calls (~50-150ms each). On sign-in, seedAuthCache() pre-populates the cache so the initial data load runs without delay.

6.3 Data Preloading Strategy

On authentication, the AuthProvider fires a parallel mega-batch:

const [profile] = await Promise.all([
  usersAPI.getMe(),   // User profile
  loadAllData(),      // Models + Agents + Chat History (parallel)
]);

// Non-blocking background loads:
usersAPI.getUsage().then(setUsage);
usersAPI.getApiKeys().then(setApiKeys);

This ensures the UI is fully interactive in a single network round-trip, with usage data and API keys loading in the background.

6.4 SSE Event Processing

The frontend processes 10 distinct SSE event types from the streaming chat endpoint:

Event Type       | Purpose                  | Frontend Action
meta             | Chat ID assignment       | Updates store activeChatId, prepends to chat list
thinking_start   | AI is reasoning          | Shows animated "Thinking..." indicator
thinking_content | Internal reasoning text  | Streams into collapsible thinking panel
thinking_done    | Reasoning complete       | Collapses panel, shows elapsed time badge
chunk            | Text token               | Appends to assistant message content
generating       | Media generation started | Shows shimmer skeleton (image/audio/video)
media            | Generated media data     | Renders inline image, audio player, or video player
clear_content    | Clear fake tool JSON     | Resets accumulated text content
done             | Stream complete          | Finalizes message, enables regeneration
error            | Error occurred           | Shows error message in chat

6.5 Model-Specific Composers

The chat input adapts to the selected model type:

💬
Text Composer

Standard chat input with file attachments (images, PDFs, text files), voice recording, expandable input, and thinking toggle.

🎨
Generation Composer

Purple-themed prompt input for image/video models with Sparkles icon. Focuses on creative prompt entry.

🎤
TTS Composer

Emerald-themed with voice selection pills (8 Google voices) and a Volume2 icon. Accepts the text to be spoken.

6.6 Agent Studio

The Agent Studio provides a full-featured IDE for creating and testing custom AI agents, with live testing, real-time settings reflection, and conversation memory.


07 — AdminAdmin Dashboard

A separate Next.js application that provides complete platform management: model CRUD, user management with per-user rate limits, and usage analytics.

Model Management

Full CRUD for AI models: name, provider, model_id, API key, daily limit, max tokens, model type (text/image/video/tts), and active status. API keys are masked in the UI.

👥
User Management

View all users, promote/demote admin roles, and set per-user per-model custom rate limits that override the global defaults.

📈
Usage Analytics

Dashboard stats (total users, active models, total agents, monthly requests) and detailed per-user, per-model usage logs filterable by date.

The admin API is protected by both authMiddleware (JWT validation) and adminMiddleware (role check). All admin endpoints require role === 'admin' in the user's profile.


08 — SecuritySecurity Architecture

Security is enforced at every layer: database (RLS), middleware (JWT), transport (HTTPS/CORS/Helmet), and application (Zod validation, API key isolation).

Row-Level Security

Every table has RLS enabled. Users can only access their own data. Public agents are readable by anyone. Admins have unrestricted access via a policy that checks role = 'admin'.

API Key Isolation

Platform API keys live exclusively in the models table and are only accessed server-side via the service-role key. The public models endpoint explicitly excludes api_key from the SELECT query. Admin endpoints mask keys as sk-xxxxx...xxxx.

Input Validation

Every API endpoint validates input with Zod schemas before processing. Messages are capped at 32,000 chars, attachments at 5 per request, usernames must match /^[a-z0-9_]{3,30}$/.

CORS & Headers

Helmet sets security headers. CORS is configured per-origin from environment variables with normalized trailing-slash handling. Only the frontend and admin URLs are allowed.

Privacy by Design

Binary attachments (images, voice, files) are processed in-memory and sent directly to AI providers. Only text placeholders like [Image sent] are stored in the database. No user media is ever persisted.

BYOK Safety

User API keys are stored in the user_api_keys table with RLS ensuring only the owner can read/write their keys. The backend resolves keys server-side — they're never sent to the frontend.


09 — PerformancePerformance Optimizations

Every millisecond matters for perceived AI response speed. The platform uses aggressive parallelization, caching, and streaming to minimize time-to-first-token.

Optimization              | Impact                       | Technique
Instant SSE open          | ~1s faster perceived latency | Stream opens before DB work; 2KB padding flushes proxy buffers
Mega-batch pre-flight     | 4 queries in 1 round-trip    | Promise.all() for model, agent, rate limit, and API keys
Auth token caching        | ~50-150ms saved per request  | Backend: 5min in-memory Map. Frontend: 30s header cache
Client connection pooling | ~100-300ms saved per request | OpenAI and Google clients cached by provider+key, reuse TCP/TLS
Native Gemini streaming   | ~8s faster than wrapper      | Direct @google/genai SDK instead of OpenAI-compatible endpoint
Fire-and-forget saves     | 0ms user wait for DB writes  | Message and usage inserts happen after [DONE] is sent
Parallel data preload     | Single round-trip on login   | Promise.all() for profile, models, agents, chat history
Zustand useShallow        | Reduced re-renders           | Heavy components select only needed state slices

10 — FeaturesKey Feature Summary

A comprehensive list of every user-facing and system-level feature in the platform.

💬 Streaming Chat

Real-time token-by-token streaming with thinking indicators, copy, regenerate, and auto-scroll.

🤖 Custom Agents

Create agents with system prompts, custom temperature/top_p/max_tokens, unique usernames, and public sharing.

🎨 Image Generation

Inline via tool calling (Imagen 3 + DALL-E 3). Expandable previews and one-click downloads.

🎤 Text-to-Speech

8 Google voices + 6 OpenAI voices. Custom audio player with progress bar. PCM-to-WAV conversion.

🎥 Video Generation

Google Veo 2 with async polling. Configurable aspect ratios. Inline video player.

📎 Multimodal Input

Attach images, voice recordings (iOS-compatible), PDFs (extracted to text), and text files.

🔑 Bring Your Own Key

Users add their own provider keys for unlimited access. Keys bypass rate limiting and are stored securely.

📈 Usage Tracking

Per-model monthly usage with separate platform vs. own-key tracking. Admin override limits per user.

💡 Thinking Mode

Toggle AI reasoning visibility. Collapsible thinking panel shows internal reasoning with elapsed time.

📌 Model Pinning

Pin up to 3 favorite models. Quick-switch dropdown shows Recent, Pinned, and Latest sections.

🌐 Public Agents

Share agents publicly via unique usernames. Discoverable in the Explore page. Creators' usage is tracked separately.

⚡ Agent Studio

Live testing with real-time settings reflection, conversation memory (localStorage), and full media generation support.


11 — APIAPI Endpoint Reference

All endpoints require Bearer token authentication unless noted. The backend exposes 18 REST endpoints across 7 route modules.

Method | Endpoint                 | Description
POST   | /api/chat/send           | Send message & stream response (SSE)
GET    | /api/chat/history        | List user's chats (paginated)
GET    | /api/chat/:id/messages   | Get messages for a chat (paginated)
PATCH  | /api/chat/:id            | Rename a chat
DELETE | /api/chat/:id            | Delete a chat
POST   | /api/chat/:id/regenerate | Regenerate last response (SSE)
GET    | /api/models              | List active models (no api_key)
GET    | /api/models/:id/usage    | User's usage for a specific model
GET    | /api/agents              | List user's agents
POST   | /api/agents              | Create agent
PATCH  | /api/agents/:id          | Update agent
DELETE | /api/agents/:id          | Delete agent
GET    | /api/agents/public       | List public agents (no auth)
POST   | /api/generate/image      | Generate image from prompt
POST   | /api/generate/tts        | Text-to-speech generation
POST   | /api/generate/video      | Video generation (Veo 2)
GET    | /api/users/me            | Get user profile
PUT    | /api/users/me/api-keys   | Update BYOK API keys

11b — FailuresReal Challenges & Failures

This section documents the bugs, edge cases, and painful discoveries that shaped the system's architecture. Every "optimized" solution in this case study was born from something that broke first.

💥 The Gemini Streaming Delay Discovery

!
What happened

Early on, all 6 providers used the same OpenAI-compatible SDK code path. Everything seemed fine — until I tested Gemini 2.5 Flash side-by-side with GPT-4o. Gemini had a consistent 5-8 second delay before the first token appeared, while GPT-4o started streaming in ~800ms. Users reported "Gemini is broken" even though it was technically working.

Root cause: Google's OpenAI-compatible endpoint (generativelanguage.googleapis.com/v1beta/openai/) doesn't actually stream. It buffers the entire response server-side, then sends all chunks in a rapid burst. The "streaming" is fake — you get 0 tokens for 6 seconds, then all 500 tokens in 200ms.

The fix: I built an entirely separate native Gemini code path using the @google/genai SDK, which supports real token-by-token streaming. This required duplicating all streaming logic, tool-call handling, and error management for Gemini specifically. The result cut Gemini's time-to-first-token from ~6s to ~200ms — but the cost was maintaining two parallel streaming implementations (§5g).

Lesson: Never trust "OpenAI-compatible" claims without benchmarking the actual streaming behavior. Compatibility layers optimize for correctness, not latency.

💥 Tool Call Parsing Across Providers

!
What happened

Tool calling (for image/TTS generation) worked perfectly on OpenAI and Gemini. Then I tested it on Claude via OpenRouter, and the chat just... printed raw JSON. The model understood the tool definition but instead of making a function call, it wrote {"tool": "generate_image", "prompt": "a sunset over mountains"} as plain text into the chat.

It got worse: Every model that "faked" tool calls did it differently. Some wrapped JSON in markdown code blocks. Some used action/action_input DALL-E format. Some just inferred a type: "image" field. I found at least 5 distinct JSON patterns across providers, and new ones kept appearing as I tested more models.

The fix: The detectFakeToolCall() function (§5c) — a multi-pattern matching pipeline that extracts JSON from markdown blocks or raw text, then tries to match it against 5 known tool-call formats. When detected, it clears the JSON from the UI via a clear_content SSE event and executes the generation transparently. This was the messiest code I wrote, and it still occasionally fails on novel model outputs.

Lesson: The AI ecosystem's "function calling" standard is a lie. Every provider implements it differently, and many models will simply ignore the schema and do their own thing. You need both a clean path and a dirty fallback.

💥 The Proxy Buffer Problem

!
What happened

Streaming worked perfectly in local development. The moment I deployed behind Cloudflare, SSE events were delayed by 15-30 seconds. The entire AI response would accumulate silently, then dump to the client all at once. Users saw a blank screen for half a minute, then the full response appeared instantly. It looked completely broken.

Root cause: Reverse proxies (Cloudflare, nginx, Traefik) buffer response data until they accumulate enough bytes to justify a network flush. SSE events are tiny (~50-200 bytes each), so they sit in the proxy buffer waiting for more data that never comes. The proxy's "optimization" was destroying the entire streaming UX.

The fix: Three-layer workaround: (1) A 2KB SSE comment (: + 2048 spaces) sent immediately to overflow the proxy buffer threshold. (2) X-Accel-Buffering: no header to explicitly disable nginx buffering. (3) socket.setNoDelay(true) to disable Nagle's TCP algorithm, which was batching small SSE frames. All three were necessary — removing any one brought the delay back in certain deployment configs.

Lesson: Local development is a lie for streaming applications. Always test SSE through at least one reverse proxy layer before calling it "done."

💥 Gemini's Unpredictable Feature Support

!
What happened

Gemini 2.5 Flash supports thinking mode + tool calling. Gemini 2.0 Flash supports tool calling but not thinking. Gemini 1.5 Pro supports neither. There is no API endpoint to query which features a model supports. I only discovered this through trial and error — sending a request with thinking enabled, getting a 400 error, and then realizing I had to maintain a compatibility matrix in my head.

It got worse: Google sometimes updates model capabilities silently. A model that didn't support tools on Monday might support them on Wednesday. Hardcoding a feature matrix would become stale instantly.

The fix: The three-stage progressive fallback state machine (§5g). Instead of trying to know in advance what each model supports, I attempt the most capable configuration first (thinking + tools), catch the error, check if it's an "unsupported feature" error, and retry with a simpler configuration. This makes the system self-healing — it adapts to model capabilities at runtime without any hardcoded knowledge.

Lesson: When working with third-party AI APIs, design for capability discovery at runtime rather than static configuration. APIs change under you.

💥 PCM Audio That Wouldn't Play

!
What happened

Gemini TTS returned audio data with MIME type audio/L16;rate=24000. I Base64-encoded it, sent it to the browser, and... silence. The <audio> element refused to play it. No error, no console warning — just nothing. Chrome, Firefox, Safari all silently failed.

Root cause: Browsers cannot play raw PCM audio. They need a container format (WAV, MP3, OGG). Gemini returns headerless 16-bit linear PCM samples — just raw bytes with no metadata about sample rate, channels, or bit depth. The browser has no way to interpret the data.

The fix: I wrote a manual pcmToWav() function (§5e) that constructs a 44-byte RIFF/WAVE header byte-by-byte using DataView, then prepends it to the raw PCM data. The sample rate is extracted from the MIME type string via regex. I also had to add magic byte detection (checking for RIFF and 0xFF 0xE0 MP3 sync headers) because OpenAI TTS returns MP3 directly, and wrapping an MP3 in a WAV header produces garbage.

Lesson: "Returns audio" in an API doc doesn't mean "returns playable audio." Always check the actual byte-level format, not just the MIME type.

💥 Rate Limiting Edge Case — The Public Agent Exploit

!
What happened

I built a public agents feature where users can share their custom agents. Another user can chat with your agent for free — great for discoverability. Then I realized the exploit: a user creates 10 public agents, shares them, and 50 people use them. All 500 daily requests consume the platform's API key budget, but nobody's individual rate limit is hit because usage was only tracked against each consumer individually.

The fix: Dual usage attribution (§5h). When someone uses a public agent on the platform key, usage is counted against both the consumer (who sent the message) and the agent creator (who published the agent). This prevents creators from bypassing their rate limits by laundering usage through public agents. The creator's count is only incremented when the platform key is used — if the consumer brings their own key, the creator isn't penalized.

Lesson: Any "sharing" feature in a rate-limited system is a potential bypass vector. Always ask: "Who pays for the compute when content goes viral?"

💥 iOS Voice Recording Incompatibility

!
What happened

Voice recording worked perfectly on Chrome desktop and Android. On iOS Safari, the MediaRecorder API silently produced empty blobs. The recording UI appeared to work — the timer ticked, the animation played — but the resulting audio file was 0 bytes.

Root cause: iOS Safari doesn't support audio/webm (the default format on Chrome). It supports audio/mp4 and audio/aac, but doesn't throw an error when you request an unsupported format — it just produces garbage output.

The fix: The VoiceRecorder component now probes format support at initialization with a priority list: audio/mp4 → audio/aac → audio/webm. It uses MediaRecorder.isTypeSupported() to find the first working format. This cascading approach handles iOS, Android, and desktop without user-agent sniffing.
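A sketch of that cascading probe, with the support check injected so the logic can run outside a browser (in production it would be MediaRecorder.isTypeSupported):

```javascript
// Priority list: iOS-friendly containers first, Chrome's default last.
const PREFERRED_MIME_TYPES = ['audio/mp4', 'audio/aac', 'audio/webm'];

// Return the first MIME type the platform claims to support.
function pickRecordingMimeType(isTypeSupported) {
  for (const mime of PREFERRED_MIME_TYPES) {
    if (isTypeSupported(mime)) return mime;   // first supported wins
  }
  return undefined; // let MediaRecorder fall back to its platform default
}
```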

Lesson: Never trust MediaRecorder to fail loudly. Always probe format support before recording, and always test audio features on actual iOS hardware (simulators lie about codec support).

What's Still Imperfect

Honest accounting of known limitations I haven't solved yet:

⚠ Auth Cache Staleness

The 5-minute backend auth cache means a user's role change (e.g., promoted to admin) doesn't take effect for up to 5 minutes. I accepted this because role changes are rare and the latency savings (~80ms/request) affect every single request.

⚠ No Shared Type Package

TypeScript interfaces (Model, Message, Agent) are duplicated between frontend and backend. A schema change requires manual sync in both codebases. A /packages/types workspace would fix this but adds Turborepo/Nx complexity I haven't justified yet.

⚠ Fire-and-Forget Risk

If the server crashes between sending [DONE] and the background message INSERT completing, the user sees the response but it's not saved to DB. On next page load, the message disappears. This is rare (~0.01% chance) but a real data consistency gap.

⚠ 50MB Body Limit

The 50MB JSON body limit accommodates most attachments, but 5 high-resolution images at full quality could exceed it. There's no chunked upload or compression — the entire payload must fit in one request. A proper solution would use presigned URLs and storage buckets.

Philosophy on Failures: Every "clean" architecture in this case study started as a messy workaround for a real bug. The progressive fallback (§5g) exists because Gemini crashed without it. The 2KB SSE padding (§5d) exists because Cloudflare swallowed my streams. The fake tool-call detector (§5c) exists because Claude ignored my function definitions. Production systems aren't designed in advance — they're shaped by failure.

12 — ConclusionTechnical Summary

Enox AI is a production-grade system that demonstrates mastery of full-stack engineering, distributed systems patterns, binary protocol encoding, and AI pipeline orchestration.

Engineering Depth — By the Numbers

8
Async Generator Pipelines
10
SSE Event Types
3
Fallback Stages
44
WAV Header Bytes

The project showcases deep expertise across every layer of the stack.

Total Sections: This case study covers 20 technical sections spanning system architecture, database design, streaming pipelines, async generators, SSE protocol internals, binary audio encoding, concurrency models, state machines, rate limiting mathematics, token budgeting, multimodal processing, and more.

Every line of code is available as open source at github.com/yad-anakin/enox.