- Core trio: Next.js 15 (App Router) + Anthropic SDK + Vercel AI SDK handles 80% of AI application patterns
- Vector database: pgvector first (if you're already on Postgres), Qdrant for high-scale or multi-tenancy requirements
- Structured output: Zod + Claude's `tool_use` API is the most reliable pattern; don't use JSON mode for production
- Streaming: Edge Runtime + Server-Sent Events via Vercel AI SDK's `streamText` covers most use cases
- Observability: Helicone for hosted, custom OpenTelemetry spans for self-managed; LangSmith is great for LangChain shops but adds overhead otherwise
Section 1 — Why TypeScript for AI Engineering?
The AI application layer has consolidated around two languages: Python (model training, research, data pipelines) and TypeScript (production applications, user-facing products).
TypeScript's advantages for AI application development:
- End-to-end type safety from API response to UI — Zod schemas validate AI outputs at runtime
- Next.js App Router — streaming, edge functions, and server components are first-class
- Vercel AI SDK — the best streaming + multi-provider abstraction in any language right now
- Ecosystem maturity — Anthropic, OpenAI, and Mistral all publish TypeScript SDKs with full type definitions
If you're building a product that users interact with, TypeScript is the right choice. If you're running data science pipelines or fine-tuning models, stay in Python.
Section 2 — The Core Application Layer
Next.js 15 + App Router
Next.js 15 with the App Router is the foundation. Key configuration for AI workloads:
// next.config.ts
import type { NextConfig } from "next";
const config: NextConfig = {
  // Next.js 15 renamed experimental.serverComponentsExternalPackages
  // to the top-level serverExternalPackages option
  serverExternalPackages: ["@anthropic-ai/sdk"],
};
export default config;
Vercel AI SDK — The Streaming Abstraction
Vercel AI SDK (ai package) handles the hard parts of streaming AI responses:
// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
export const runtime = "edge";
export async function POST(req: Request) {
const { messages } = await req.json();
const result = await streamText({
model: anthropic("claude-sonnet-4-6"),
system: "You are a helpful assistant.",
messages,
maxTokens: 2048,
});
return result.toDataStreamResponse();
}
On the client side:
// components/Chat.tsx
"use client";
import { useChat } from "@ai-sdk/react"; // moved out of "ai/react" in AI SDK 4
export function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading } =
useChat({ api: "/api/chat" });
return (
<div>
{messages.map((m) => (
<div key={m.id}>
<strong>{m.role}:</strong> {m.content}
</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} disabled={isLoading} />
<button type="submit" disabled={isLoading}>Send</button>
</form>
</div>
);
}
This pattern handles reconnection, partial streaming, and multi-modal content out of the box.
For streaming AI responses, Edge Runtime is the right choice — it starts faster and handles long-lived streaming connections better. However, Edge Runtime has constraints: no native Node.js modules, limited file system access. If your route needs heavy Node.js dependencies (like langchain or puppeteer), use Node.js runtime and accept the cold start penalty. For pure Anthropic/OpenAI SDK calls, always use Edge.
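A route opts into a runtime with a single export. As a sketch, a hypothetical document-ingestion route pinned to Node.js (the route path and body are illustrative) looks like this:

```typescript
// app/api/ingest/route.ts (hypothetical): pin this route to Node.js,
// while streaming routes elsewhere export runtime = "edge"
export const runtime = "nodejs";

export async function POST(req: Request) {
  const { url } = await req.json();
  // ...heavy Node-only work (PDF parsing, puppeteer, etc.) would go here...
  return Response.json({ ok: true, url });
}
```

Because `runtime` is a per-route segment config, you can mix Edge streaming endpoints and Node.js ingestion endpoints in the same app.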
Section 3 — Structured Output with Zod
Getting reliable structured data from LLMs is one of the most important problems in AI engineering. JSON mode is unreliable in production — models hallucinate extra fields, miss required ones, or produce malformed JSON under load.
The correct pattern: Zod schema + Claude's tool_use API.
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
const client = new Anthropic();
// Define your schema with Zod
const ProductSchema = z.object({
name: z.string().describe("Product name"),
price: z.number().positive().describe("Price in USD"),
category: z.enum(["electronics", "clothing", "food", "other"]),
inStock: z.boolean(),
tags: z.array(z.string()).max(5),
});
type Product = z.infer<typeof ProductSchema>;
// Convert a Zod schema to a Claude tool definition. The third-party
// zod-to-json-schema package handles nested objects, enum values,
// optionals, and .describe() descriptions, which a hand-rolled
// mapper gets wrong.
import { zodToJsonSchema } from "zod-to-json-schema";

function zodToClaudeTool(
  name: string,
  description: string,
  schema: z.ZodObject<z.ZodRawShape>
) {
  const jsonSchema = zodToJsonSchema(schema) as {
    properties?: Record<string, unknown>;
    required?: string[];
  };
  return {
    name,
    description,
    input_schema: {
      type: "object" as const,
      properties: jsonSchema.properties ?? {},
      required: jsonSchema.required ?? [],
    },
  };
}
async function extractProduct(rawText: string): Promise<Product> {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
tools: [zodToClaudeTool("extract_product", "Extract product information", ProductSchema)],
tool_choice: { type: "tool", name: "extract_product" },
messages: [
{
role: "user",
content: `Extract product information from this text:\n\n${rawText}`,
},
],
});
const toolUse = response.content.find((block) => block.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") {
throw new Error("No tool use response");
}
// Zod validates at runtime — throws on schema mismatch
return ProductSchema.parse(toolUse.input);
}
Using Vercel AI SDK's built-in generateObject:
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";
const { object } = await generateObject({
model: anthropic("claude-sonnet-4-6"),
schema: z.object({
sentiment: z.enum(["positive", "negative", "neutral"]),
confidence: z.number().min(0).max(1),
summary: z.string().max(100),
}),
prompt: "Analyze the sentiment of this review: " + reviewText,
});
// object is fully typed — TypeScript knows the shape
console.log(object.sentiment); // "positive" | "negative" | "neutral"
OpenAI-style JSON mode (`response_format: { type: "json_object" }`) and prompting Claude to "respond only with JSON" are tempting but dangerous in production. Under high load, models occasionally produce truncated or malformed JSON. The tool_use approach forces the model to fill a schema; if it can't, you get an error rather than corrupt data. Always use tool_use for structured extraction in production.
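When `ProductSchema.parse` does throw, the cheapest recovery is a bounded retry of the whole extraction call. A minimal generic helper (a sketch; this is not part of the Anthropic SDK):

```typescript
// Retry a flaky async call a bounded number of times,
// rethrowing the last error if every attempt fails
async function withRetry<T>(fn: () => Promise<T>, attempts = 2): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Wrap the call site, e.g. `withRetry(() => extractProduct(rawText))`, so one malformed tool response doesn't fail the whole request.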
Section 4 — Vector Database Selection
Choosing a vector database is a decision you'll live with for years. Here's an honest comparison for 2026:
| Factor | pgvector | Qdrant | Pinecone | Weaviate |
|---|---|---|---|---|
| Infrastructure | Add-on to Postgres | Standalone service | Managed cloud only | Standalone or cloud |
| Self-hosted | Yes | Yes (Docker) | No | Yes (Docker) |
| Vectors per collection | ~10M practical | Unlimited (sharding) | Unlimited (expensive) | ~50M practical |
| Multi-tenancy | Schema/table isolation | Native collections | Namespaces | Multi-tenancy API |
| Filtering | SQL WHERE clauses | Payload filtering (fast) | Metadata filtering | GraphQL-based |
| Managed cost (1M vectors) | $0 (your Postgres) | ~$25/month | ~$70/month | ~$25/month |
| TypeScript SDK quality | Drizzle / Prisma | Official TS SDK | Official TS SDK | Official TS SDK |
pgvector is the right default if you're already on Postgres. Add the extension, store vectors in a column, and you're done — no new infrastructure.
// With Drizzle ORM
import { sql } from "drizzle-orm";
import { pgTable, text, vector, index } from "drizzle-orm/pg-core";
export const documents = pgTable(
"documents",
{
id: text("id").primaryKey(),
content: text("content").notNull(),
embedding: vector("embedding", { dimensions: 1536 }),
},
(table) => ({
embeddingIndex: index("embedding_idx").using(
"hnsw",
table.embedding.op("vector_cosine_ops")
),
})
);
// Similarity search: <=> is pgvector's cosine distance operator;
// pass the query vector as a JSON-style '[...]' string literal
const similar = await db
  .select()
  .from(documents)
  .orderBy(sql`${documents.embedding} <=> ${JSON.stringify(queryEmbedding)}`)
  .limit(10);
Switch to Qdrant when:
- You need sub-10ms p99 search at 10M+ vectors
- Your multi-tenancy requirements are complex (isolating by user, organization)
- You want native sparse+dense hybrid search
import { QdrantClient } from "@qdrant/js-client-rest";
const client = new QdrantClient({ url: "http://localhost:6333" });
// Upsert vectors
await client.upsert("documents", {
wait: true,
points: [
{
id: 123, // Qdrant point IDs must be unsigned integers or UUID strings
vector: embeddingArray,
payload: { content: "...", userId: "user_456", createdAt: Date.now() },
},
],
});
// Search with filter
const results = await client.search("documents", {
vector: queryEmbedding,
limit: 10,
filter: {
must: [{ key: "userId", match: { value: "user_456" } }],
},
});
Pinecone is overpriced for most use cases. Unless you're an enterprise team with dedicated managed infrastructure requirements, Qdrant Cloud at $25/month serves the same use case at 1/3 the cost.
Section 5 — Streaming Architecture
Full streaming architecture for a production AI application:
User Browser
↕ EventSource / fetch streaming
Next.js Edge Function
↕ Anthropic SDK (streaming)
Claude API
↕ (parallel, if needed)
MCP Server / Tool calls
↕
Postgres / Qdrant / External APIs
For complex multi-step flows (RAG + generation), structure your route handler carefully:
// app/api/rag/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText, tool } from "ai";
import { z } from "zod";
import { getEmbedding, searchDocuments } from "@/lib/vector";
export const runtime = "edge";
export async function POST(req: Request) {
const { query, userId } = await req.json();
const result = await streamText({
model: anthropic("claude-sonnet-4-6"),
system: `You are a helpful assistant with access to a knowledge base.
Use the search_knowledge_base tool to find relevant information before answering.`,
messages: [{ role: "user", content: query }],
tools: {
search_knowledge_base: tool({
description: "Search the knowledge base for relevant documents",
parameters: z.object({
query: z.string().describe("Search query"),
limit: z.number().min(1).max(10).default(5),
}),
execute: async ({ query, limit }) => {
const embedding = await getEmbedding(query);
const docs = await searchDocuments(embedding, { userId, limit });
return docs.map((d) => ({ id: d.id, content: d.content, score: d.score }));
},
}),
},
maxSteps: 3, // Allow up to 3 tool-use rounds
onFinish: async ({ usage, finishReason }) => {
// Log usage for billing/observability
await logUsage({ userId, tokens: usage, finishReason });
},
});
return result.toDataStreamResponse();
}
Section 6 — Observability
You cannot optimize what you cannot measure. AI applications have three observability layers:
Layer 1: LLM Call Tracking (Helicone)
Helicone is a proxy that sits between your application and the LLM API. Basic tracking needs only a configuration change to the client, not a code rewrite:
// Before
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
// After (Helicone proxy)
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
baseURL: "https://anthropic.helicone.ai",
defaultHeaders: {
"Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
"Helicone-User-Id": userId, // Track per-user costs
"Helicone-Session-Id": sessionId, // Group related calls
},
});
Helicone gives you: latency histograms, cost per request, error rates, prompt version comparison. It's $0 for the first 10K requests/month — essential for early-stage AI products.
Layer 2: Application Traces (OpenTelemetry)
For production, instrument your AI routes with OpenTelemetry:
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("ai-service");
export async function generateWithTracing(prompt: string, userId: string) {
return tracer.startActiveSpan("ai.generate", async (span) => {
span.setAttributes({
"ai.model": "claude-sonnet-4-6",
"user.id": userId,
"ai.prompt_length": prompt.length,
});
try {
const result = await generate(prompt);
span.setAttributes({
"ai.output_tokens": result.usage.outputTokens,
"ai.input_tokens": result.usage.inputTokens,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
throw error;
} finally {
span.end();
}
});
}
Layer 3: Output Quality (Evals)
The hardest and most important layer. For production AI apps, build a lightweight eval pipeline:
// scripts/eval.ts
const testCases = [
{ input: "What is the refund policy?", expectedTopics: ["30 days", "receipt"] },
{ input: "How do I reset my password?", expectedTopics: ["email", "reset link"] },
];
for (const testCase of testCases) {
const output = await runRag(testCase.input);
const score = await gradeOutput(output, testCase.expectedTopics);
console.log({ input: testCase.input, score, output: output.slice(0, 100) });
}
Run evals on every model change, prompt change, and weekly in CI.
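The `gradeOutput` helper above is left abstract; the simplest version is keyword coverage (an illustrative sketch, with an LLM grader usually layered on top in production):

```typescript
// Hypothetical gradeOutput: fraction of expected topics the output mentions
function gradeOutput(output: string, expectedTopics: string[]): number {
  const lower = output.toLowerCase();
  const hits = expectedTopics.filter((t) => lower.includes(t.toLowerCase()));
  return hits.length / expectedTopics.length; // 0..1 coverage score
}
```

Awaiting a synchronous grader, as the eval script does, is harmless, and the signature leaves room to swap in an async LLM-based grader later.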
Section 7 — Complete Architecture Reference
Here's the full directory structure for a production TypeScript AI application:
my-ai-app/
├── app/
│ ├── api/
│ │ ├── chat/route.ts # Streaming chat endpoint (Edge)
│ │ ├── rag/route.ts # RAG endpoint (Edge)
│ │ └── ingest/route.ts # Document ingestion (Node.js)
│ └── (ui pages)
├── lib/
│ ├── ai/
│ │ ├── client.ts # Configured Anthropic client (Helicone proxy)
│ │ ├── structured.ts # Zod + tool_use helpers
│ │ └── prompts.ts # Versioned system prompts
│ ├── vector/
│ │ ├── embed.ts # Embedding generation
│ │ ├── search.ts # Vector similarity search
│ │ └── ingest.ts # Document chunking + upsert
│ └── db/
│ ├── schema.ts # Drizzle schema (includes pgvector)
│ └── queries.ts # Typed database queries
├── scripts/
│ ├── eval.ts # Output quality evaluation
│ └── ingest-docs.ts # Batch document ingestion
└── instrumentation.ts # OpenTelemetry initialization
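The chunking step in `lib/vector/ingest.ts` can be sketched as a fixed-size sliding window (an illustrative sketch; real pipelines often split on sentence or heading boundaries instead):

```typescript
// Fixed-size chunks with overlap so context isn't cut mid-thought
function chunkText(text: string, size = 800, overlap = 100): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached
  }
  return chunks;
}
```

Each chunk is then embedded and upserted with its source document ID so retrieval can cite the original.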
Most teams default to OpenAI's text-embedding-3-small for embeddings, and it's a solid choice. But if you're using Claude for generation and would rather not couple retrieval to OpenAI, Voyage AI's voyage-3 benchmarks well on retrieval quality and has a clean TypeScript integration. For most applications the difference is marginal; for search-heavy RAG apps, it's measurable.
Section 8 — Cost Management at Scale
AI applications have a fundamentally different cost structure from traditional software. Token costs scale with usage in ways that CPU costs don't.
Practical cost controls:
// 1. Cache embeddings: never re-embed the same content
import { createHash } from "node:crypto";

// In-memory cache is fine for a single long-lived process; on serverless,
// where memory resets between invocations, back it with Redis or a table
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string): Promise<number[]> {
  const key = createHash("sha256").update(text).digest("hex");
  const cached = embeddingCache.get(key);
  if (cached) return cached;
  const embedding = await generateEmbedding(text);
  embeddingCache.set(key, embedding);
  return embedding;
}
// 2. Cache LLM responses for identical prompts (semantic cache in production)
// 3. Set hard token limits per user per day
// 4. Use smaller models for classification tasks
const classificationModel = anthropic("claude-haiku-3-5"); // 10x cheaper
const generationModel = anthropic("claude-sonnet-4-6"); // full quality
// 5. Implement context window management
type Message = { role: "system" | "user" | "assistant"; content: string };

function trimMessagesToFit(messages: Message[], maxTokens = 80000): Message[] {
  // Keep the system message, then fit the most recent messages within budget
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let tokenCount = 0;
  const trimmed: Message[] = [];
  for (const msg of [...rest].reverse()) {
    const estimated = msg.content.length / 4; // rough estimate: ~4 chars per token
    if (tokenCount + estimated > maxTokens) break;
    trimmed.unshift(msg);
    tokenCount += estimated;
  }
  return [...system, ...trimmed];
}
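Control #3 above (hard token limits per user per day) can be sketched as a simple budget check. This assumes an in-memory ledger and a hypothetical limit; production would back it with Redis or Postgres so limits survive restarts:

```typescript
const DAILY_TOKEN_LIMIT = 200_000; // illustrative per-user budget

// In-memory usage ledger keyed by user; the day string resets it implicitly
const usageLedger = new Map<string, { day: string; tokens: number }>();

function consumeTokens(userId: string, tokens: number, now = new Date()): boolean {
  const day = now.toISOString().slice(0, 10); // e.g. "2026-03-01"
  const entry = usageLedger.get(userId);
  const used = entry?.day === day ? entry.tokens : 0;
  if (used + tokens > DAILY_TOKEN_LIMIT) return false; // reject: over budget
  usageLedger.set(userId, { day, tokens: used + tokens });
  return true;
}
```

Call `consumeTokens(userId, estimatedTokens)` before each LLM request and return a 429 when it comes back false.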
Section 9 — Takeaways
The TypeScript AI stack has matured enough in 2026 that there are clear right answers for most decisions:
- Next.js 15 + Vercel AI SDK: Use it. The streaming and multi-provider abstractions are genuinely excellent.
- Zod + tool_use: Non-negotiable for structured outputs in production.
- pgvector first, Qdrant when you scale: Don't add new infrastructure before you need it.
- Helicone from day one: Visibility into token costs and latency is essential, not optional.
- Evals are infrastructure: Treat them that way.
The teams shipping the best AI products in 2026 aren't using exotic tools — they're using this stack and executing well. The framework choices are the easy part.
Stack versions: Next.js 15.2, Vercel AI SDK 4.x, Anthropic SDK 0.39, Drizzle ORM 0.30, Qdrant JS 1.9. Current as of March 2026.
— iBuidl Research Team