A production-grade, multi-provider AI platform with streaming chat, multimodal generation, custom agents, and a full admin dashboard.
Enox AI is a full-stack, open-source AI platform that unifies access to the world's leading language models, image generators, video creators, and text-to-speech engines through a single, beautifully designed interface.
The platform implements a three-tier monorepo architecture: a Next.js 16 user-facing application, an Express.js API server, and a separate Next.js admin dashboard, all backed by Supabase (PostgreSQL) with Row-Level Security.
Key capabilities include real-time token-by-token streaming via Server-Sent Events, multimodal input support (images, voice, PDFs, files), inline media generation through AI tool calling, a custom agent studio with live testing, per-user rate limiting with admin override controls, and a Bring-Your-Own-Key (BYOK) system that gives users unlimited access with their own API keys.
Before writing a single line of code, I identified real user problems in the AI tool landscape and made deliberate product decisions to solve them. This section explains the why behind every major design choice.
When I started building Enox, I kept running into the same frustrations that every AI power-user hits:
ChatGPT, Claude, and Gemini each live in their own walled garden. Switching between them means separate accounts, separate UIs, and separate billing. If you want to compare model outputs on the same prompt, you're juggling browser tabs.
Users either pay $20/month per provider (expensive if you use multiple) or manage raw API keys with no usage visibility. There's no middle ground between "consumer subscription" and "raw developer API."
ChatGPT's "Custom GPTs" are limited — you can't set temperature, top_p, or max_tokens. You can't live-test different system prompts side by side. Power users need an Agent Studio, not a wizard.
Image generation, TTS, and video are separate tools. You can't ask a chat model to "make me a logo" and get an image inline — you have to switch to DALL-E or Midjourney. The AI experience is scattered across a dozen apps.
Enox targets two distinct user personas, and every feature maps to one or both:
Developers, writers, and researchers who use multiple AI models daily. They want model comparison, BYOK for unlimited usage, fine-grained agent control (temperature, top_p), and a single interface that does text + image + audio + video.
Key features for them: BYOK system, Agent Studio with live testing, model pinning, thinking mode toggle, multimodal attachments.
Team leads or small companies who want to host a private AI portal for their team. They need admin controls: model management, per-user rate limits, usage analytics, and the ability to rotate API keys without disrupting users.
Key features for them: Admin dashboard, per-user model limits, usage tracking with own-key separation, centralized model CRUD.
Every UI choice was driven by a specific user need or pain point I observed:
Problem: Users were confused when selecting a TTS model but seeing a standard chat input. Solution: The composer dynamically changes based on model type — emerald-themed with voice pills for TTS, purple-themed with Sparkles icon for image/video, and standard chat with attachments for text. This immediately communicates what the model expects without any instructions.
Problem: Switching between "chat" and "image generator" tabs destroys conversational context. Solution: Media generation happens inline via tool calling. You ask "draw a sunset" in a normal chat, and the image appears in the conversation. The AI decides when to generate, keeping the experience conversational rather than transactional.
Problem: Extended thinking is useful for complex reasoning but annoying for quick questions — it adds latency and visual noise. Solution: A one-click toggle in the composer lets users opt into thinking mode per-message. When active, thinking streams into a collapsible panel so it doesn't overwhelm the response. Casual users never see it; power users love it.
Problem: With 6+ providers and dozens of models, the model selector becomes overwhelming. Solution: Users pin up to 3 favorite models. The dropdown organizes by Pinned → Recent → All, so your daily-drivers are always one click away. Type badges (IMG/VID/TTS) provide instant visual cues about model capability.
Problem: Users hesitate to upload sensitive documents (contracts, medical images) to AI platforms that might store them. Solution: Attachments are explicitly never stored in the database. Only a placeholder like [Image sent] is persisted. The user can verify this — chat history shows only placeholders, not their actual files. This was a deliberate trust-building UX decision, not just a technical one.
Problem: Image/video generation takes 5-30 seconds. Without feedback, users think the app is broken and re-submit. Solution: A generating SSE event immediately shows a type-appropriate shimmer skeleton (image aspect ratio, audio waveform, or video player shape). This communicates "I'm working on it" within 200ms, even though the media won't arrive for seconds.
Every technology was chosen for a specific engineering reason — performance, developer experience, or production reliability.
Every technology choice involved rejecting an alternative. This section documents what I chose, what I didn't, and — most importantly — why.
I chose SSE because the data flow is strictly unidirectional: the server streams tokens to the client. WebSockets are bidirectional, which adds complexity (heartbeat pings, reconnection state, socket lifecycle management) for zero benefit here. SSE also works natively with HTTP/2 multiplexing, auto-reconnects on disconnect via the EventSource API, and passes cleanly through every reverse proxy and CDN without special upgrade headers. WebSockets require an HTTP Upgrade handshake that some corporate firewalls and load balancers block.
When I'd pick WebSockets instead: If Enox had real-time collaborative editing or live presence indicators (multiple users typing), I'd need true bidirectional communication. For a chat streaming use case, SSE is strictly simpler and more reliable.
I chose Supabase because it bundles three things I'd otherwise have to build myself: authentication (Google OAuth with JWT out of the box), Row-Level Security policies with a clean SDK, and a hosted Postgres instance with connection pooling. Setting up raw Postgres + a custom auth server + session management + OAuth provider integration would have taken weeks and introduced security surface area I'd have to maintain forever.
The tradeoff: Supabase's RLS is powerful but opaque — policy bugs are hard to debug because queries silently return empty results instead of throwing errors. I also can't use advanced Postgres features like logical replication or custom extensions without their approval. For this project, the speed-to-ship and built-in auth far outweighed those limitations.
I chose Express because the AI SDK ecosystem is built around it. The OpenAI SDK examples, Anthropic streaming guides, and Google GenAI tutorials all use Express. Fastify is ~2× faster in raw benchmarks, but the bottleneck in an AI chat app is never the HTTP framework — it's the AI provider response time (800-3000ms). Saving 0.5ms on routing is meaningless when you're waiting 2 seconds for GPT-4o to think. Express also has a vastly larger middleware ecosystem (Helmet, CORS, express-rate-limit) that Just Works.
When I'd pick Fastify instead: For a high-throughput API that handles 10,000+ req/s where framework overhead matters (e.g., a REST API serving cached data). Enox's backend handles maybe 50 concurrent streams — Express is not the bottleneck.
I chose Zustand because streaming chat creates extreme state update pressure — every token triggers a store update. Redux's reducer dispatch overhead and middleware chain add measurable latency at 30+ updates/second. React Context causes full subtree re-renders on every update (catastrophic for a chat UI). Zustand's useShallow selectors and direct state mutation give me surgical re-render control with zero boilerplate.
The tradeoff: Zustand has no built-in dev tools as mature as Redux DevTools, and the "single store" pattern can get unwieldy. I mitigated this by keeping the store interface clean and using localStorage persistence selectively (7 keys) rather than syncing everything.
I chose a monorepo because the three projects (app, backend, admin-app) share types, constants, and deployment context. With separate repos, a database schema change would require coordinated PRs across 3 repositories. In a monorepo, one commit can update the schema, backend route, and frontend type simultaneously. The mental overhead of "which repo has the bug?" disappears.
The tradeoff: No shared package extraction (like a /packages/types workspace) — types are duplicated between frontend and backend. I accepted this because the duplication is small (a few interfaces) and a Turborepo/Nx setup would add tooling complexity disproportionate to the project's size.
I chose to maintain two code paths because Google's OpenAI-compatible endpoint lies about streaming. It buffers the entire Gemini response (~5-8 seconds), then "streams" it as a burst of chunks. The native @google/genai SDK delivers true token-by-token streaming with ~200ms time-to-first-token. For the most popular free model (Gemini 2.5 Flash), this made the difference between "feels instant" and "feels broken."
The tradeoff: Two code paths means double the maintenance surface for streaming logic, error handling, and tool calling. Every new feature has to work on both paths. I mitigated this with the yield* delegation pattern — both paths yield the same typed events to the same consumer, so the downstream code is unified.
A clean three-tier monorepo with strict separation of concerns. The backend never exposes API keys to the frontend, and all data access is governed by Supabase RLS policies.
enox/
├── app/ # User-facing Next.js application
│ └── src/
│ ├── app/(app)/ # Route groups: chat, agents, models, settings...
│ ├── components/ # 21 component directories
│ │ ├── agents/ # AgentStudio, AgentsView
│ │ ├── chat/ # ChatView, MessageBubble, ModelSelector, VoiceRecorder...
│ │ ├── layout/ # Sidebar, AppShell
│ │ ├── settings/ # SettingsView, ApiKeysView
│ │ └── ... # explore, models, auth, legal, usage, providers
│ ├── lib/ # api.ts, supabase.ts, utils.ts
│ └── store/ # useStore.ts (Zustand)
├── backend/ # Express.js API server
│ └── src/
│ ├── lib/ # aiProvider.js, rateLimit.js, supabase.js...
│ ├── middleware/ # auth.js (JWT + cache), errorHandler.js
│ └── routes/ # chat.js, agents.js, admin.js, generate.js...
├── admin-app/ # Admin dashboard (Next.js)
│ └── src/app/ # Models, Users, Usage management
└── supabase/ # schema.sql — 8 tables, 14 indexes, RLS policies
Eight tables with 14 optimized indexes, automatic timestamp triggers, and comprehensive Row-Level Security policies that enforce data isolation at the database level.
| Table | Purpose | Key Columns | RLS Policy |
|---|---|---|---|
| users | User profiles (linked to auth.users) | id, email, name, avatar_url, role | Own profile read/write; admins full access |
| models | Admin-managed AI models | id, name, provider, model_id, api_key, daily_limit, model_type | Public read (active only); api_key never exposed to client |
| agents | Custom AI agents with system prompts | id, user_id, name, username, system_prompt, model_id, temperature, top_p, max_tokens | Own agents CRUD; public agents read-only |
| chats | Chat sessions | id, user_id, model_id, agent_id, title | Own chats only |
| messages | Chat messages (user/assistant/system) | id, chat_id, role, content | Messages of own chats only |
| usage_logs | Per-user, per-model daily usage tracking | user_id, model_id, message_count, own_key_count, date | Own usage read-only |
| user_model_limits | Admin-set per-user model limit overrides | user_id, model_id, daily_limit | Admin-only |
| user_api_keys | User's own provider API keys (BYOK) | user_id, provider, api_key | Own keys only |
- Profile auto-creation: a trigger on auth.users INSERT auto-creates a public.users profile using OAuth metadata (name, avatar).
- Atomic usage upserts: usage increments are atomic upserts via ON CONFLICT DO UPDATE.
- Rate-limit function: a database function returns allowed, used, limit, and remaining for real-time rate enforcement.
- Timestamp triggers: row updates automatically set updated_at = NOW().
- API key isolation: the models.api_key column is never exposed to any frontend client. The backend uses a Supabase service-role key to access API keys server-side only. All client-facing model queries explicitly exclude api_key from the SELECT.
The Express.js backend handles authentication, rate limiting, AI provider abstraction, SSE streaming, multimodal processing, and media generation — all optimized for minimal latency.
Every protected route passes through authMiddleware, which validates the Supabase JWT, fetches the user profile, and caches both for 5 minutes to avoid redundant database lookups:
Extracts Bearer token from the Authorization header. Returns 401 immediately if missing.
Checks a Map<token, {profile, ts}> cache with a 5-minute TTL. If valid, skips DB calls entirely.
On cache miss: calls supabase.auth.getUser(token), then fetches the full user profile from public.users. Caches the result.
Admin routes use an additional adminMiddleware that checks req.user.role === 'admin' and returns 403 if unauthorized.
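A minimal sketch of the cache layer described above (names and shape are illustrative, not the actual middleware code):

```javascript
// Token-keyed profile cache with a 5-minute TTL. On a hit, JWT validation
// and the profile query are skipped entirely. `fetchProfile` stands in
// for the real supabase.auth.getUser + public.users lookup.
const TTL_MS = 5 * 60 * 1000;
const authCache = new Map(); // token -> { profile, ts }

async function getCachedProfile(token, fetchProfile) {
  const hit = authCache.get(token);
  if (hit && Date.now() - hit.ts < TTL_MS) return hit.profile; // cache hit
  const profile = await fetchProfile(token); // cache miss: hits the DB
  authCache.set(token, { profile, ts: Date.now() });
  return profile;
}
```

In production the Map would also need periodic eviction (or an LRU cap) to avoid unbounded growth; how the real middleware handles that is not specified here.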
The aiProvider.js module (841 lines) is the core intelligence layer. It provides a unified streaming interface across all six providers:
Uses the OpenAI SDK with configurable baseURL for OpenAI, Anthropic, Google (via compatibility layer), Mistral, Groq, and OpenRouter. TCP connections are cached per provider+key pair.
Uses the @google/genai SDK directly for true token-by-token streaming. The OpenAI-compatible wrapper buffers Gemini's entire response before "streaming" it, adding ~8s of latency. The native path eliminates this.
Gemini models attempt configurations from most features to least: (1) thinking + tools, (2) tools only, (3) plain. Each failure falls back to the next level automatically.
Both OpenAI and Google clients are cached in Map keyed by provider:apiKey. This reuses TCP/TLS connections, saving ~100-300ms per request.
The POST /api/chat/send endpoint implements a highly optimized streaming pipeline designed to minimize perceived latency:
The SSE stream opens immediately — before any database work. This cuts ~1s off perceived latency. A 2KB comment padding flushes proxy buffers (Cloudflare, nginx, Traefik).
All pre-flight checks run in a single Promise.all(): model lookup, agent data, rate limit check, and user API keys fetch. This parallelizes 4 database queries into one round-trip.
If BYOK is enabled, uses the user's key (bypasses rate limiting). Otherwise, uses the platform's key. For tool calling (image/TTS generation), merges both key pools to maximize available providers.
The stream yields typed events: thinking_start, thinking_content, thinking_done, chunk (text), generating (skeleton), media (base64), clear_content, done, and error.
After [DONE] is sent, the response ends immediately. Message saving and usage increment happen asynchronously in background promises — the user never waits for DB writes.
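Step 2's mega-batch can be sketched as follows (the query helpers on `ctx` are hypothetical stand-ins for the real Supabase calls):

```javascript
// Four independent lookups resolve concurrently; total wall time is the
// max of the four, not the sum.
async function preflight(ctx) {
  const [model, agent, rate, userKeys] = await Promise.all([
    ctx.getModel(),       // model row: provider, model_id, limits
    ctx.getAgent(),       // optional agent: system prompt + sampling params
    ctx.checkRateLimit(), // { allowed, used, limit, remaining }
    ctx.getUserKeys(),    // BYOK keys, if the user has any
  ]);
  if (!rate.allowed) throw new Error('Rate limit exceeded');
  return { model, agent, rate, userKeys };
}
```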
Text models can invoke two tools via function calling: generate_image and generate_tts. The system handles both real function calling (OpenAI, Gemini) and fake tool-call detection for models that don't support it.
The detectFakeToolCall() function parses the JSON output, identifies the intended tool (supporting DALL-E style, direct format, and type-hint formats), clears the text from the UI, and executes the generation transparently.
- Image — Google Imagen (imagen-3.0-generate-002) — preferred path. Supports configurable aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4). Returns base64 PNG.
- Image — OpenAI DALL-E (fallback) — returns b64_json.
- TTS — Gemini native audio via responseModalities: ['AUDIO']. Raw PCM output is converted to WAV using a manual pcmToWav() function that writes the 44-byte RIFF/WAV header.
- Video — Google Veo (veo-2.0-generate-001) — asynchronous generation with polling (5s intervals, up to a 5-minute timeout). Returns MP4 video bytes.

The platform accepts images, voice recordings, PDFs, and text files as attachments. Each provider requires different multimodal formats:
| Input Type | Gemini Format | OpenAI Format |
|---|---|---|
| Images | inlineData: { mimeType, data } | image_url: { url: "data:..." } |
| Audio | inlineData | input_audio: { data, format } |
| PDFs | inlineData | Extracted to text via pdf-parse, injected as [Content of attached PDF] |
| Text files | inlineData | Base64 decoded to UTF-8, injected as text block |
Attachments themselves are never persisted; only a placeholder like [Image sent], [Voice message sent], or [File sent: report.pdf] is stored in the messages table.
Rate limiting uses a monthly window with separate tracking for platform usage vs. own-key usage:
- Own-key exemption: requests made with a user's own API key (BYOK) are logged, but the own_key_count column tracks them separately so they don't consume the user's platform quota.
- Per-user overrides: admins can set custom limits in user_model_limits. The effective limit is: override.daily_limit ?? model.daily_limit ?? 25.

The rate-limiting algorithm uses a monthly sliding window with own-key exemption. The effective usage is computed by subtracting own-key requests from total requests, ensuring BYOK users never consume platform quota:
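A sketch of that check (column names follow the usage_logs schema above; the exact query shape is an assumption):

```javascript
// Effective usage = total requests minus own-key requests, so BYOK
// traffic never eats into the platform quota. `log` is the user's
// usage_logs row for the current window (may be missing on first use).
function checkRateLimit(log, limit) {
  const used = (log?.message_count ?? 0) - (log?.own_key_count ?? 0);
  return { allowed: used < limit, used, limit, remaining: Math.max(0, limit - used) };
}
```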
The nullish coalescing chain (??) implements a three-tier priority system. If an admin has set a custom limit for this specific user+model pair, it takes precedence. Otherwise, the model's default limit is used. As a final fallback, the system defaults to 25.
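The chain itself is one line; note that ?? only skips null/undefined, so an explicit limit of 0 is respected:

```javascript
// Three-tier priority: per-user admin override, then the model's default,
// then the hard fallback of 25.
function effectiveLimit(override, model) {
  return override?.daily_limit ?? model?.daily_limit ?? 25;
}
```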
A complete end-to-end trace of a single chat message, from the user pressing Enter to the final token rendered in the browser. Every millisecond is accounted for.
Client sends the message payload (modelId, message, attachments, settings). Express validates the body against a Zod schema via parse().
SSE stream opens instantly — before any DB work. Response headers flushed with a 2KB padding comment to force proxy buffer flush.
Four parallel queries execute simultaneously: model lookup, agent data, rate limit check, user API keys. All 4 results return in a single round-trip.
resolveApiKey() — O(1) lookup from pre-fetched keys. buildGenApiKeys() — merges user + platform keys for tool calling. insertChat() — creates new chat session (new chats only).
Chat ID is sent to the client. The frontend updates activeChatId in the Zustand store and prepends the new chat to the sidebar list.
buildMessages() constructs the messages array, then stream.open() initiates an HTTP/2 stream to the AI provider via cached client connection.
The AI model's internal reasoning streams back in real-time:
Text tokens stream through the async generator pipeline. Each token is forwarded as a chunk SSE event with <1ms relay overhead. May also yield generating → media events for tool-called content.
Stream completes. done:{chatId} event signals the frontend to finalize the message and enable regeneration. [DONE] terminates the SSE connection.
After the client connection is closed, two background promises persist data without blocking the user: message INSERT and usage increment via atomic upsert. Zero impact on perceived latency.
The critical path is the sequence of operations that cannot be parallelized. By moving all possible work off the critical path, time-to-first-token is minimized:
Response headers are flushed immediately: Content-Type: text/event-stream is set, X-Accel-Buffering: no disables nginx buffering, and socket.setNoDelay(true) disables Nagle's algorithm. A 2048-byte SSE comment line forces proxy buffers to flush.
Four independent database queries execute simultaneously via Promise.all(). Amortized cost: max(Tmodel, Tagent, TrateLimit, Tkeys) instead of Tmodel + Tagent + TrateLimit + Tkeys. Typical savings: ~60-200ms depending on Supabase region latency.
Key resolution is O(1) lookup from the pre-fetched userKeyMap object — zero additional DB calls. For new chats, a single INSERT returns the UUID. For existing chats, message history is fetched (limited to last 20 messages to bound context window cost).
Client lookup from the connection cache is O(1). If a cached client exists, TLS handshake is skipped (saving ~100-300ms). The stream.open() call initiates an HTTP/2 stream to the provider.
The AI model processes the prompt. During this period, thinking tokens stream back if supported. The client sees thinking_start instantly, with thinking_content events streaming reasoning in real-time.
First text token arrives. Each token is forwarded to the client as a chunk SSE event with <1ms relay overhead. The async generator yields tokens as they arrive — no buffering.
[DONE] is sent to the client, then res.end() closes the connection. Two background Promises execute asynchronously: message INSERT and usage increment. The user never waits for these writes — they have zero impact on perceived latency.
The streaming system is built on JavaScript async generators — a composable pipeline pattern that yields typed events from the AI provider through the SSE transport layer.
The streaming pipeline is a three-stage chain of async generators. Each stage transforms or enriches the data before passing it downstream:
The entry point is streamChatCompletion(), an async function* (async generator function) that routes to provider-specific implementations:
async function* streamChatCompletion(provider, apiKey, modelId, messages,
maxTokens, temperature, topP, genApiKeys, modelType, ttsOptions) {
// Route 1: TTS models → dedicated AUDIO modality path
if (modelType === 'tts' && provider === 'google') {
yield { type: 'generating', mediaType: 'audio' };
yield* streamGeminiTTS(apiKey, modelId, messages, ttsOptions);
return;
}
// Route 2: Google models → native SDK (true token streaming)
if (provider === 'google') {
yield* streamGeminiNative(apiKey, modelId, messages, ...);
return;
}
// Route 3: All others → OpenAI-compatible SDK
// ... streams tokens, accumulates tool calls, detects fake calls
}
The yield* syntax delegates to a sub-generator, forwarding all yielded values directly to the consumer. This enables composable pipeline stages without manual iteration. Each sub-generator is itself an async function* that can yield typed event objects or plain strings.
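A minimal, self-contained demonstration of the delegation pattern (event shapes simplified):

```javascript
// The outer generator forwards every event the inner one yields without
// touching them, which is how provider-specific streams plug into a
// shared consumer.
async function* inner() {
  yield { type: 'chunk', text: 'Hello' };
  yield { type: 'chunk', text: ' world' };
}

async function* outer() {
  yield { type: 'meta', chatId: 'abc' };
  yield* inner(); // delegation: inner's events pass straight through
  yield { type: 'done' };
}
```

Consumers iterate with for await...of and never need to know which sub-generator produced an event.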
For OpenAI-compatible providers, tool calls arrive as streamed deltas across multiple chunks. The system accumulates them before execution:
// Tool calls arrive fragmented across chunks:
// chunk[0].delta.tool_calls = [{ index:0, id:"call_abc", function:{name:"gene"} }]
// chunk[1].delta.tool_calls = [{ index:0, function:{name:"rate_image"} }]
// chunk[2].delta.tool_calls = [{ index:0, function:{arguments:'{"pro'} }]
// chunk[3].delta.tool_calls = [{ index:0, function:{arguments:'mpt":"cat"}'}]
const pendingToolCalls = {}; // Map<index, {id, name, arguments}>
for await (const chunk of stream) {
const toolCalls = chunk.choices?.[0]?.delta?.tool_calls;
if (toolCalls) {
for (const tc of toolCalls) {
if (!pendingToolCalls[tc.index])
pendingToolCalls[tc.index] = { id:'', name:'', arguments:'' };
if (tc.id) pendingToolCalls[tc.index].id += tc.id;
if (tc.function?.name) pendingToolCalls[tc.index].name += tc.function.name;
if (tc.function?.arguments) pendingToolCalls[tc.index].arguments += tc.function.arguments;
}
}
}
// After stream ends: parse accumulated JSON and execute each tool
Models that don't support native function calling sometimes emit JSON that looks like a tool call. The detectFakeToolCall() function uses a multi-pattern matching algorithm:
detectFakeToolCall(fullText) → { toolName, args } | null
// Step 1: Extract JSON — try markdown-wrapped, then raw, then substring
parsed = tryParse(text.match(/```json?\s*([\s\S]*?)```/)?.[1])
?? tryParse(text)
?? tryParse(text.substring(text.indexOf('{'), text.lastIndexOf('}')+1))
// Step 2: Match against 5 known patterns:
// Pattern A: DALL-E style → { action: "dalle.text2im", action_input: "{...}" }
// Pattern B: Direct tool → { tool: "generate_image", prompt: "..." }
// Pattern C: Function style → { function: "generate_image", arguments: {...} }
// Pattern D: Type-hint → { prompt: "...", type: "image" }
// Pattern E: TTS style → { action: "tts", text: "..." }
// Step 3: Extract args using cascading property access:
prompt = parsed.prompt ?? parsed.input?.prompt ?? parsed.arguments?.prompt ?? parsed.parameters?.prompt
The SSE transport layer is hand-optimized for minimum latency across reverse proxies, CDNs, and mobile networks. Every byte of the protocol is deliberate.
Each event is a JSON-serialized line prefixed with data: and terminated by \n\n. The format is defined by the W3C EventSource specification:
Reverse proxies (nginx, Cloudflare, Traefik) buffer responses until they hit a minimum size threshold. Without the 2KB padding, the first SSE event may be delayed by up to 30 seconds until enough data accumulates:
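A sketch of the wire writes this implies (Express-style handler; the exact header set and helper names are illustrative):

```javascript
// Open the stream, pad past proxy buffers immediately, then emit events
// in the `data: <json>\n\n` framing.
function openSSE(req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'X-Accel-Buffering': 'no', // tell nginx not to buffer this response
  });
  req.socket.setNoDelay(true); // disable Nagle's algorithm
  res.write(`: ${' '.repeat(2048)}\n\n`); // 2KB comment flushes proxy buffers
}

function sendEvent(res, event) {
  res.write(`data: ${JSON.stringify(event)}\n\n`);
}
```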
The client-side SSE parser implements a state machine that processes events in strict order. Invalid transitions are handled gracefully:
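A minimal line-buffered version of such a parser (the real client additionally enforces event ordering and handles reconnects; the `__done` sentinel is illustrative):

```javascript
// Accumulates raw text, splits on the `\n\n` frame boundary, skips
// comment/padding frames, and emits parsed JSON events in arrival order.
function createSSEParser(onEvent) {
  let buffer = '';
  return function feed(text) {
    buffer += text;
    let idx;
    while ((idx = buffer.indexOf('\n\n')) !== -1) {
      const frame = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 2);
      if (frame.startsWith(':')) continue;       // padding / keep-alive comment
      if (!frame.startsWith('data: ')) continue; // ignore non-data fields
      const payload = frame.slice(6);
      if (payload === '[DONE]') { onEvent({ type: '__done' }); continue; }
      try { onEvent(JSON.parse(payload)); } catch { /* drop malformed frame */ }
    }
  };
}
```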
Google's Gemini TTS returns raw PCM audio samples. Browsers cannot play raw PCM — the pcmToWav() function constructs a valid WAV file by manually writing the 44-byte RIFF/WAVE header.
Before wrapping with a WAV header, the function checks if the data is already in a playable format by inspecting magic bytes:
// Check for existing WAV header (RIFF...WAVE)
if (raw[0..3] === "RIFF" && raw[8..11] === "WAVE") → return as-is
// Check for MP3 sync word (frame header)
if (raw[0] === 0xFF && (raw[1] & 0xE0) === 0xE0) → return as-is
// Otherwise: raw PCM → wrap with 44-byte WAV header
const wav = Buffer.alloc(44 + dataSize);
// ... write RIFF, fmt, data sub-chunks at byte offsets
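A runnable version of the raw-PCM branch (the sample-rate and channel defaults are assumptions; Gemini's actual output parameters may differ):

```javascript
// Wrap raw 16-bit PCM in a minimal RIFF/WAVE container by writing the
// 44-byte header field by field, in little-endian order.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const wav = Buffer.alloc(44 + pcm.length);
  wav.write('RIFF', 0);
  wav.writeUInt32LE(36 + pcm.length, 4); // chunk size after this field
  wav.write('WAVE', 8);
  wav.write('fmt ', 12);
  wav.writeUInt32LE(16, 16);             // fmt sub-chunk size (PCM = 16)
  wav.writeUInt16LE(1, 20);              // audio format: 1 = uncompressed PCM
  wav.writeUInt16LE(channels, 22);
  wav.writeUInt32LE(sampleRate, 24);
  wav.writeUInt32LE(byteRate, 28);
  wav.writeUInt16LE(blockAlign, 32);
  wav.writeUInt16LE(bitsPerSample, 34);
  wav.write('data', 36);
  wav.writeUInt32LE(pcm.length, 40);     // data sub-chunk size
  pcm.copy(wav, 44);
  return wav;
}
```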
The system maximizes throughput through strategic parallelization of independent operations, connection reuse, and non-blocking I/O patterns.
Every request triggers multiple independent operations. The codebase uses Promise.all() at every opportunity to convert sequential I/O into parallel I/O:
| Location | Parallel Operations | Sequential Cost | Parallel Cost |
|---|---|---|---|
| Chat pre-flight | model + agent + rateLimit + userKeys | ~320ms (4 × 80ms) | ~80ms (max of 4) |
| Existing chat | insertUserMsg + fetchHistory | ~160ms | ~80ms |
| Regenerate pre-flight | lastMsg + agent + rateLimit + userKeys | ~320ms | ~80ms |
| Regenerate setup | deleteLastMsg + fetchHistory | ~160ms | ~80ms |
| Auth init (frontend) | profile + models + agents + chatHistory | ~400ms | ~100ms |
| Chat history endpoint | count + dataFetch | ~160ms | ~80ms |
| Messages endpoint | chatVerify + count + messages | ~240ms | ~80ms |
| Admin stats | users + models + agents + monthUsage | ~320ms | ~80ms |
AI provider clients are cached in Map data structures, providing O(1) amortized lookup by composite key:
// Cache key: "provider:apiKey" → deduplicated per provider+credential
const clientCache = new Map(); // OpenAI-compatible clients
const googleClientCache = new Map(); // Native Google AI clients
// Lookup: O(1) average case (hash map)
// Memory: O(P × K) where P=providers, K=unique keys
// Typical: ~6-12 cached clients (6 providers × 1-2 keys each)
// What's saved per cache hit:
// 1. Object construction (~5ms)
// 2. TCP connection establishment (~50ms)
// 3. TLS handshake (~100-200ms)
// Total savings per hit: ~155-255ms
Gemini models have varying capabilities (thinking, tools, image output). The system implements a three-stage fallback that automatically downgrades features when a model doesn't support them.
const isUnsupported =
msg.includes('Thinking is not enabled') ||
msg.includes('not supported') ||
msg.includes('INVALID_ARGUMENT') ||
(err?.status === 400 && (
msg.includes('think') ||
msg.includes('tool') ||
msg.includes('function')
));
// If isUnsupported && more attempts remain → retry with simpler config
// If isUnsupported && no attempts remain → throw (fatal)
// If !isUnsupported → throw immediately (don't waste retries on auth/rate errors)
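The retry loop wrapping this check might look like the following (a sketch; the real attempt configs carry thinking/tool settings):

```javascript
// Walk the attempt list from richest config to plainest. Capability
// errors fall through to the next attempt; everything else is fatal.
async function runWithFallback(attempts, run) {
  for (let i = 0; i < attempts.length; i++) {
    try {
      return await run(attempts[i]);
    } catch (err) {
      const msg = String(err?.message ?? '');
      const isUnsupported =
        msg.includes('Thinking is not enabled') ||
        msg.includes('not supported') ||
        msg.includes('INVALID_ARGUMENT');
      // Fatal if the error isn't a capability problem, or if there is no
      // simpler config left to try.
      if (!isUnsupported || i === attempts.length - 1) throw err;
    }
  }
}
```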
The system implements a multi-tier priority chain for determining the token budget for each request, with hard caps based on key ownership.
To prevent unbounded context growth and keep costs predictable, the system limits conversation history to the most recent 20 messages:
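A sketch of the trim (the constant matches the stated limit; whether the system prompt is exempt from it is not specified here):

```javascript
// Keep only the most recent N messages before building the provider
// request; older turns are dropped, bounding context cost.
const HISTORY_LIMIT = 20;

function trimHistory(messages) {
  return messages.length > HISTORY_LIMIT ? messages.slice(-HISTORY_LIMIT) : messages;
}
```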
When a user chats with a public agent created by another user, usage is counted against both the consumer and the agent creator (unless the consumer uses their own key):
Input attachments flow through a provider-specific transformation pipeline. The system handles images, audio, PDFs, and text files with automatic format conversion and privacy-safe storage.
For non-Google providers that don't support inline PDF, the system uses pdf-parse to extract text content:
// Input: base64-encoded PDF bytes
const buffer = Buffer.from(base64Data, 'base64');
const result = await pdfParse(buffer);
const text = result.text?.trim() || '';
// Output: injected as a structured text block
content.push({
type: 'text',
text: `[Content of attached PDF file]:\n\n${pdfText}`
});
// Error handling: graceful degradation
// On parse failure → "[PDF content could not be extracted]"
// PDF is still sent as base64 to Gemini (native PDF support)
A premium dark-themed SPA built with Next.js 16 App Router, Zustand state management, and Framer Motion animations. Designed for speed — all critical data is preloaded in parallel on login.
A single Zustand store manages the entire application state. Key design decisions:
- Streaming-safe updates: replaceLastAssistantContent() uses structural equality checks to skip unnecessary re-renders during rapid streaming updates.
- Selective subscriptions: components like ChatView use useShallow() selectors to prevent re-renders when unrelated state changes.

The frontend API layer caches auth headers for 30 seconds, avoiding redundant supabase.auth.getSession() calls (~50-150ms each). On sign-in, seedAuthCache() pre-populates the cache so the initial data load runs without delay.
On authentication, the AuthProvider fires a parallel mega-batch:
const [profile] = await Promise.all([
usersAPI.getMe(), // User profile
loadAllData(), // Models + Agents + Chat History (parallel)
]);
// Non-blocking background loads:
usersAPI.getUsage().then(setUsage);
usersAPI.getApiKeys().then(setApiKeys);
This ensures the UI is fully interactive in a single network round-trip, with usage data and API keys loading in the background.
The frontend processes 10 distinct SSE event types from the streaming chat endpoint:
| Event Type | Purpose | Frontend Action |
|---|---|---|
meta | Chat ID assignment | Updates store activeChatId, prepends to chat list |
thinking_start | AI is reasoning | Shows animated "Thinking..." indicator |
thinking_content | Internal reasoning text | Streams into collapsible thinking panel |
thinking_done | Reasoning complete | Collapses panel, shows elapsed time badge |
chunk | Text token | Appends to assistant message content |
generating | Media generation started | Shows shimmer skeleton (image/audio/video) |
media | Generated media data | Renders inline image, audio player, or video player |
clear_content | Clear fake tool JSON | Resets accumulated text content |
done | Stream complete | Finalizes message, enables regeneration |
error | Error occurred | Shows error message in chat |
The chat input adapts to the selected model type:
Standard chat input with file attachments (images, PDFs, text files), voice recording, expandable input, and thinking toggle.
Purple-themed prompt input for image/video models with Sparkles icon. Focuses on creative prompt entry.
Emerald-themed with voice selection pills (8 Google voices). Volume2 icon. Enters text-to-speak content.
The Agent Studio provides a full-featured IDE for creating and testing custom AI agents:
- Persistent test conversations: the live-test panel keeps a conversationHistory in localStorage, sending it to the backend so the AI remembers previous messages across page reloads.

A separate Next.js application provides complete platform management: model CRUD, user management with per-user rate limits, and usage analytics.
Full CRUD for AI models: name, provider, model_id, API key, daily limit, max tokens, model type (text/image/video/tts), and active status. API keys are masked in the UI.
View all users, promote/demote admin roles, and set per-user per-model custom rate limits that override the global defaults.
Dashboard stats (total users, active models, total agents, monthly requests) and detailed per-user, per-model usage logs filterable by date.
The admin API is protected by both authMiddleware (JWT validation) and adminMiddleware (role check). All admin endpoints require role === 'admin' in the user's profile.
Security is enforced at every layer: database (RLS), middleware (JWT), transport (HTTPS/CORS/Helmet), and application (Zod validation, API key isolation).
Every table has RLS enabled. Users can only access their own data. Public agents are readable by anyone. Admins have unrestricted access via a policy that checks role = 'admin'.
Platform API keys live exclusively in the models table and are only accessed server-side via the service-role key. The public models endpoint explicitly excludes api_key from the SELECT query. Admin endpoints mask keys as sk-xxxxx...xxxx.
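The masking step can be sketched as a one-liner; the exact prefix and suffix lengths here are assumptions, not the platform's actual choice:

```typescript
// Mask a provider API key for admin UI display, e.g. "sk-proj-...MNOP".
// Keeping the first 8 and last 4 characters is an assumption; the point is
// that the full key never reaches the admin UI unmasked.
function maskApiKey(key: string): string {
  if (key.length <= 12) return "*".repeat(key.length); // too short to safely reveal
  return `${key.slice(0, 8)}...${key.slice(-4)}`;
}
```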
Every API endpoint validates input with Zod schemas before processing. Messages are capped at 32,000 chars, attachments at 5 per request, usernames must match /^[a-z0-9_]{3,30}$/.
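The actual code uses Zod schemas; a dependency-free sketch of the equivalent checks (limits taken from the text above, the payload shape is an assumption) looks like:

```typescript
const MAX_MESSAGE_CHARS = 32_000;
const MAX_ATTACHMENTS = 5;
const USERNAME_RE = /^[a-z0-9_]{3,30}$/;

interface SendPayload {
  message: string;
  attachments?: unknown[];
}

// Returns a list of human-readable problems; empty list means valid.
function validateSend(p: SendPayload): string[] {
  const errors: string[] = [];
  if (typeof p.message !== "string" || p.message.length === 0) errors.push("message required");
  else if (p.message.length > MAX_MESSAGE_CHARS) errors.push("message too long");
  if ((p.attachments?.length ?? 0) > MAX_ATTACHMENTS) errors.push("too many attachments");
  return errors;
}

const isValidUsername = (u: string) => USERNAME_RE.test(u);
```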
Helmet sets security headers. CORS is configured per-origin from environment variables with normalized trailing-slash handling. Only the frontend and admin URLs are allowed.
Binary attachments (images, voice, files) are processed in-memory and sent directly to AI providers. Only text placeholders like [Image sent] are stored in the database. No user media is ever persisted.
User API keys are stored in the user_api_keys table with RLS ensuring only the owner can read/write their keys. The backend resolves keys server-side — they're never sent to the frontend.
Every millisecond matters for perceived AI response speed. The platform uses aggressive parallelization, caching, and streaming to minimize time-to-first-token.
| Optimization | Impact | Technique |
|---|---|---|
| Instant SSE open | ~1s faster perceived latency | Stream opens before DB work; 2KB padding flushes proxy buffers |
| Mega-batch pre-flight | 4 queries in 1 round-trip | Promise.all() for model, agent, rate limit, and API keys |
| Auth token caching | ~50-150ms saved per request | Backend: 5min in-memory Map. Frontend: 30s header cache |
| Client connection pooling | ~100-300ms saved per request | OpenAI and Google clients cached by provider+key, reuse TCP/TLS |
| Native Gemini streaming | ~8s faster than wrapper | Direct @google/genai SDK instead of OpenAI-compatible endpoint |
| Fire-and-forget saves | 0ms user wait for DB writes | Message and usage inserts happen after [DONE] is sent |
| Parallel data preload | Single round-trip on login | Promise.all() for profile, models, agents, chat history |
| Zustand useShallow | Reduced re-renders | Heavy components select only needed state slices |
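The auth-token cache from the table can be sketched as a small TTL map. The 5-minute TTL comes from the table; the rest of the shape is an assumption:

```typescript
// In-memory TTL cache, e.g. JWT -> resolved user, so repeat requests within
// the TTL skip re-validation. The clock is injectable for testing.
interface CacheEntry<T> { value: T; expiresAt: number }

class TtlCache<T> {
  private map = new Map<string, CacheEntry<T>>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): T | undefined {
    const hit = this.map.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) { // expired: evict lazily on read
      this.map.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T): void {
    this.map.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

const AUTH_TTL_MS = 5 * 60 * 1000; // 5 minutes, per the table above
```

This is also the trade-off behind the known limitation that a role change can take up to five minutes to propagate: cached entries are served until they expire.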
A comprehensive list of every user-facing and system-level feature in the platform.
Real-time token-by-token streaming with thinking indicators, copy, regenerate, and auto-scroll.
Create agents with system prompts, custom temperature/top_p/max_tokens, unique usernames, and public sharing.
Inline via tool calling (Imagen 3 + DALL-E 3). Expandable previews and one-click downloads.
8 Google voices + 6 OpenAI voices. Custom audio player with progress bar. PCM-to-WAV conversion.
Google Veo 2 with async polling. Configurable aspect ratios. Inline video player.
Attach images, voice recordings (iOS-compatible), PDFs (extracted to text), and text files.
Users add their own provider keys for unlimited access. Keys bypass rate limiting and are stored securely.
Per-model monthly usage with separate platform vs. own-key tracking. Admin override limits per user.
Toggle AI reasoning visibility. Collapsible thinking panel shows internal reasoning with elapsed time.
Pin up to 3 favorite models. Quick-switch dropdown shows Recent, Pinned, and Latest sections.
Share agents publicly via unique usernames. Discoverable in the Explore page. Creators' usage is tracked separately.
Live testing with real-time settings reflection, conversation memory (localStorage), and full media generation support.
All endpoints require Bearer token authentication unless noted. The backend exposes 18 REST endpoints across 7 route modules.
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/chat/send | Send message & stream response (SSE) |
| GET | /api/chat/history | List user's chats (paginated) |
| GET | /api/chat/:id/messages | Get messages for a chat (paginated) |
| PATCH | /api/chat/:id | Rename a chat |
| DELETE | /api/chat/:id | Delete a chat |
| POST | /api/chat/:id/regenerate | Regenerate last response (SSE) |
| GET | /api/models | List active models (no api_key) |
| GET | /api/models/:id/usage | User's usage for a specific model |
| GET | /api/agents | List user's agents |
| POST | /api/agents | Create agent |
| PATCH | /api/agents/:id | Update agent |
| DELETE | /api/agents/:id | Delete agent |
| GET | /api/agents/public | List public agents (no auth) |
| POST | /api/generate/image | Generate image from prompt |
| POST | /api/generate/tts | Text-to-Speech generation |
| POST | /api/generate/video | Video generation (Veo 2) |
| GET | /api/users/me | Get user profile |
| PUT | /api/users/me/api-keys | Update BYOK API keys |
This section documents the bugs, edge cases, and painful discoveries that shaped the system's architecture. Every "optimized" solution in this case study was born from something that broke first.
Early on, all 6 providers used the same OpenAI-compatible SDK code path. Everything seemed fine — until I tested Gemini 2.5 Flash side-by-side with GPT-4o. Gemini had a consistent 5-8 second delay before the first token appeared, while GPT-4o started streaming in ~800ms. Users reported "Gemini is broken" even though it was technically working.
Root cause: Google's OpenAI-compatible endpoint (generativelanguage.googleapis.com/v1beta/openai/) doesn't actually stream. It buffers the entire response server-side, then sends all chunks in a rapid burst. The "streaming" is fake — you get 0 tokens for 6 seconds, then all 500 tokens in 200ms.
The fix: I built an entirely separate native Gemini code path using the @google/genai SDK, which supports real token-by-token streaming. This required duplicating all streaming logic, tool calling handling, and error management for Gemini specifically. The result cut Gemini's time-to-first-token from ~6s to ~200ms — but the cost was maintaining two parallel streaming implementations (§5g).
Lesson: Never trust "OpenAI-compatible" claims without benchmarking the actual streaming behavior. Compatibility layers optimize for correctness, not latency.
Tool calling (for image/TTS generation) worked perfectly on OpenAI and Gemini. Then I tested it on Claude via OpenRouter, and the chat just... printed raw JSON. The model understood the tool definition but instead of making a function call, it wrote {"tool": "generate_image", "prompt": "a sunset over mountains"} as plain text into the chat.
It got worse: Every model that "faked" tool calls did it differently. Some wrapped JSON in markdown code blocks. Some used an action/action_input format. Some just inferred a type: "image" field. I found at least 5 distinct JSON patterns across providers, and new ones kept appearing as I tested more models.
The fix: The detectFakeToolCall() function (§5c) — a multi-pattern matching pipeline that extracts JSON from markdown blocks or raw text, then tries to match it against 5 known tool-call formats. When detected, it clears the JSON from the UI via a clear_content SSE event and executes the generation transparently. This was the messiest code I wrote, and it still occasionally fails on novel model outputs.
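A simplified sketch of that detection pipeline, covering two of the patterns mentioned above (the real function matches at least five, and this return shape is an assumption):

```typescript
interface ToolCall { tool: string; prompt: string }

// Built at runtime to avoid writing a literal markdown fence in this document.
const FENCE = "`".repeat(3);
const FENCED_RE = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Pull a JSON object out of a markdown code fence, or treat the whole
// text as JSON if it looks like a bare object.
function extractJson(text: string): string | null {
  const fenced = text.match(FENCED_RE);
  if (fenced) return fenced[1].trim();
  const trimmed = text.trim();
  return trimmed.startsWith("{") && trimmed.endsWith("}") ? trimmed : null;
}

// Try known "fake tool call" shapes that some models emit as plain text.
function detectFakeToolCall(text: string): ToolCall | null {
  const raw = extractJson(text);
  if (!raw) return null;
  let obj: any;
  try { obj = JSON.parse(raw); } catch { return null; }
  // Pattern 1: { "tool": "...", "prompt": "..." }
  if (typeof obj.tool === "string" && typeof obj.prompt === "string")
    return { tool: obj.tool, prompt: obj.prompt };
  // Pattern 2: { "action": "...", "action_input": "..." }
  if (typeof obj.action === "string" && typeof obj.action_input === "string")
    return { tool: obj.action, prompt: obj.action_input };
  return null;
}
```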
Lesson: The AI ecosystem's "function calling" standard is a lie. Every provider implements it differently, and many models will simply ignore the schema and do their own thing. You need both a clean path and a dirty fallback.
Streaming worked perfectly in local development. The moment I deployed behind Cloudflare, SSE events were delayed by 15-30 seconds. The entire AI response would accumulate silently, then dump to the client all at once. Users saw a blank screen for half a minute, then the full response appeared instantly. It looked completely broken.
Root cause: Reverse proxies (Cloudflare, nginx, Traefik) buffer response data until they accumulate enough bytes to justify a network flush. SSE events are tiny (~50-200 bytes each), so they sit in the proxy buffer waiting for more data that never comes. The proxy's "optimization" was destroying the entire streaming UX.
The fix: Three-layer workaround: (1) A 2KB SSE comment (: + 2048 spaces) sent immediately to overflow the proxy buffer threshold. (2) X-Accel-Buffering: no header to explicitly disable nginx buffering. (3) socket.setNoDelay(true) to disable Nagle's TCP algorithm, which was batching small SSE frames. All three were necessary — removing any one brought the delay back in certain deployment configs.
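The first two measures reduce to headers and a padding string; a sketch (the helper name and shape are assumptions, not the project's actual code):

```typescript
// Headers + padding that coax reverse proxies into flushing SSE frames
// immediately instead of buffering them.
function sseAntiBufferPrelude(): { headers: Record<string, string>; padding: string } {
  return {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
      "X-Accel-Buffering": "no", // explicitly disable nginx response buffering
    },
    // An SSE comment line (starts with ':') large enough to overflow typical
    // proxy buffer thresholds, so subsequent tiny events flush immediately.
    padding: ":" + " ".repeat(2048) + "\n\n",
  };
}
```

On a real Node/Express response you would also call `res.socket?.setNoDelay(true)` to disable Nagle's algorithm, then write the padding before the first event.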
Lesson: Local development is a lie for streaming applications. Always test SSE through at least one reverse proxy layer before calling it "done."
Gemini 2.5 Flash supports thinking mode + tool calling. Gemini 2.0 Flash supports tool calling but not thinking. Gemini 1.5 Pro supports neither. There is no API endpoint to query which features a model supports. I only discovered this through trial and error — sending a request with thinking enabled, getting a 400 error, and then realizing I had to maintain a compatibility matrix in my head.
It got worse: Google sometimes updates model capabilities silently. A model that didn't support tools on Monday might support them on Wednesday. Hardcoding a feature matrix would become stale instantly.
The fix: The three-stage progressive fallback state machine (§5g). Instead of trying to know in advance what each model supports, I attempt the most capable configuration first (thinking + tools), catch the error, check if it's an "unsupported feature" error, and retry with a simpler configuration. This makes the system self-healing — it adapts to model capabilities at runtime without any hardcoded knowledge.
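The fallback ladder can be sketched as a loop over configurations, from most to least capable. The error predicate here is an assumption — the real check would inspect provider-specific error payloads:

```typescript
interface GenConfig { thinking: boolean; tools: boolean }

// Most capable first; each "unsupported feature" failure retries with the
// next, simpler configuration.
const LADDER: GenConfig[] = [
  { thinking: true, tools: true },
  { thinking: false, tools: true },
  { thinking: false, tools: false },
];

// Assumption: unsupported-feature errors are recognizable from the message.
function isUnsupportedFeatureError(e: unknown): boolean {
  return e instanceof Error && /unsupported/i.test(e.message);
}

async function generateWithFallback<T>(
  attempt: (cfg: GenConfig) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (const cfg of LADDER) {
    try {
      return await attempt(cfg);
    } catch (e) {
      if (!isUnsupportedFeatureError(e)) throw e; // real failure: don't mask it
      lastError = e; // capability miss: degrade and retry
    }
  }
  throw lastError;
}
```

Because the ladder is data, adding a new capability tier is a one-line change rather than new branching logic.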
Lesson: When working with third-party AI APIs, design for capability discovery at runtime rather than static configuration. APIs change under you.
Gemini TTS returned audio data with MIME type audio/L16;rate=24000. I Base64-encoded it, sent it to the browser, and... silence. The <audio> element refused to play it. No error, no console warning — just nothing. Chrome, Firefox, Safari all silently failed.
Root cause: Browsers cannot play raw PCM audio. They need a container format (WAV, MP3, OGG). Gemini returns headerless 16-bit linear PCM samples — just raw bytes with no metadata about sample rate, channels, or bit depth. The browser has no way to interpret the data.
The fix: I wrote a manual pcmToWav() function (§5e) that constructs a 44-byte RIFF/WAVE header byte-by-byte using DataView, then prepends it to the raw PCM data. The sample rate is extracted from the MIME type string via regex. I also had to add magic byte detection (checking for RIFF and 0xFF 0xE0 MP3 sync headers) because OpenAI TTS returns MP3 directly, and wrapping an MP3 in a WAV header produces garbage.
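A sketch of that header construction — the standard 44-byte RIFF/WAVE header for 16-bit PCM, built with DataView (the real function additionally parses the sample rate out of the MIME type and does magic-byte detection):

```typescript
// Prepend a 44-byte RIFF/WAVE header to raw 16-bit PCM so browsers can play it.
function pcmToWav(pcm: Uint8Array, sampleRate: number, channels = 1): Uint8Array {
  const bitsPerSample = 16;
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const buf = new ArrayBuffer(44 + pcm.length);
  const view = new DataView(buf);
  const ascii = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };

  ascii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // total file size minus 8
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true);   // fmt chunk size
  view.setUint16(20, 1, true);    // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  ascii(36, "data");
  view.setUint32(40, pcm.length, true);

  const out = new Uint8Array(buf);
  out.set(pcm, 44); // raw samples follow the header unchanged
  return out;
}
```

All multi-byte fields are little-endian (the `true` flag on every DataView write), which is what the RIFF container requires.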
Lesson: "Returns audio" in an API doc doesn't mean "returns playable audio." Always check the actual byte-level format, not just the MIME type.
I built a public agents feature where users can share their custom agents. Another user can chat with your agent for free — great for discoverability. Then I realized the exploit: a user creates 10 public agents, shares them, and 50 people use them. All 500 daily requests consume the platform's API key budget, but nobody's individual rate limit is hit because usage was only tracked against each consumer individually.
The fix: Dual usage attribution (§5h). When someone uses a public agent on the platform key, usage is counted against both the consumer (who sent the message) and the agent creator (who published the agent). This prevents creators from bypassing their rate limits by laundering usage through public agents. The creator's count is only incremented when the platform key is used — if the consumer brings their own key, the creator isn't penalized.
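The attribution rule reduces to a small pure function. The shape below is an assumption, as is the self-use guard (not charging a creator twice when they use their own agent):

```typescript
interface UsageIncrements { consumerId: string; creatorId?: string }

// Who gets charged for one public-agent request?
// - The consumer is always charged.
// - The agent creator is additionally charged only when the platform key
//   paid for the call (BYOK consumers don't penalize the creator).
function attributeUsage(
  consumerId: string,
  agentCreatorId: string | null,
  usedPlatformKey: boolean,
): UsageIncrements {
  const result: UsageIncrements = { consumerId };
  if (agentCreatorId && agentCreatorId !== consumerId && usedPlatformKey) {
    result.creatorId = agentCreatorId;
  }
  return result;
}
```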
Lesson: Any "sharing" feature in a rate-limited system is a potential bypass vector. Always ask: "Who pays for the compute when content goes viral?"
Voice recording worked perfectly on Chrome desktop and Android. On iOS Safari, the MediaRecorder API silently produced empty blobs. The recording UI appeared to work — the timer ticked, the animation played — but the resulting audio file was 0 bytes.
Root cause: iOS Safari doesn't support audio/webm (the default format on Chrome). It supports audio/mp4 and audio/aac, but doesn't throw an error when you request an unsupported format — it just produces garbage output.
The fix: The VoiceRecorder component now probes format support at initialization with a priority list: audio/mp4 → audio/aac → audio/webm. It uses MediaRecorder.isTypeSupported() to find the first working format. This cascading approach handles iOS, Android, and desktop without user-agent sniffing.
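The probing order can be sketched as a tiny helper that takes the support check as a parameter, so the logic is testable outside a browser (in the browser, the predicate is `MediaRecorder.isTypeSupported`):

```typescript
// Priority order: iOS-friendly formats first, Chrome's webm default last.
const CANDIDATES = ["audio/mp4", "audio/aac", "audio/webm"] as const;

// isSupported is MediaRecorder.isTypeSupported in a browser; injected here
// so the selection logic can run (and be tested) in Node.
function pickRecordingMimeType(
  isSupported: (mime: string) => boolean,
): string | undefined {
  return CANDIDATES.find(isSupported);
}
```

In the browser this would be used roughly as `new MediaRecorder(stream, { mimeType: pickRecordingMimeType((m) => MediaRecorder.isTypeSupported(m)) })`, with a fallback when nothing matches.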
Lesson: Never trust MediaRecorder to fail loudly. Always probe format support before recording, and always test audio features on actual iOS hardware (simulators lie about codec support).
Honest accounting of known limitations I haven't solved yet:
The 5-minute backend auth cache means a user's role change (e.g., promoted to admin) doesn't take effect for up to 5 minutes. I accepted this because role changes are rare and the latency savings (~80ms/request) affect every single request.
TypeScript interfaces (Model, Message, Agent) are duplicated between frontend and backend. A schema change requires manual sync in both codebases. A /packages/types workspace would fix this but adds Turborepo/Nx complexity I haven't justified yet.
If the server crashes between sending [DONE] and the background message INSERT completing, the user sees the response but it's not saved to DB. On next page load, the message disappears. This is rare (~0.01% chance) but a real data consistency gap.
The 50MB JSON body limit accommodates most attachments, but 5 high-resolution images at full quality could exceed it. There's no chunked upload or compression — the entire payload must fit in one request. A proper solution would use presigned URLs and storage buckets.
Enox AI is a production-grade system that demonstrates mastery of full-stack engineering, distributed systems patterns, binary protocol encoding, and AI pipeline orchestration.
The project showcases deep expertise across every layer of the stack:
Streaming: an async generator pipeline (async function* → yield* delegation → SSE serialization) with composable tool-call execution, and typed SSE events with a formal finite state machine governing transitions (§5c, §5d).

Performance: Promise.all() parallelization points across the codebase achieving up to 4× speedup over sequential I/O, and a fire-and-forget persistence pattern that eliminates all post-stream DB write latency (§5f).

Database: ON CONFLICT DO UPDATE upserts, a three-tier nullish coalescing priority chain, and dual usage attribution for public agents (§5h).

File handling: pdf-parse with graceful degradation and a privacy-safe placeholder storage pattern (§5i).

Frontend: useShallow selectors to minimize re-renders, and a 30-second auth header cache that eliminates redundant session calls.

Every line of code is available as open source at github.com/yad-anakin/enox.