- Core trio: Next.js 15 (App Router) + Anthropic SDK + Vercel AI SDK handles 80% of AI application patterns
- Vector database: pgvector first (if you're already on Postgres), Qdrant for high-scale or multi-tenancy requirements
- Structured output: Zod + Claude's `tool_use` API is the most reliable pattern; don't use JSON mode for production
- Streaming: Edge Runtime + Server-Sent Events via Vercel AI SDK's `streamText` covers most use cases
- Observability: Helicone for hosted, custom OpenTelemetry spans for self-managed; LangSmith is great for LangChain shops but adds overhead otherwise
Section 1 — Why TypeScript for AI Engineering?
The AI application layer has consolidated around two languages: Python (model training, research, data pipelines) and TypeScript (production applications, user-facing products).
TypeScript's advantages for AI application development:
- End-to-end type safety from API response to UI — Zod schemas validate AI outputs at runtime
- Next.js App Router — streaming, edge functions, and server components are first-class
- Vercel AI SDK — the best streaming + multi-provider abstraction in any language right now
- Ecosystem maturity — Anthropic, OpenAI, and Mistral all publish TypeScript SDKs with full type definitions
If you're building a product that users interact with, TypeScript is the right choice. If you're running data science pipelines or fine-tuning models, stay in Python.
Section 2 — The Core Application Layer
Next.js 15 + App Router
Next.js 15 with the App Router is the foundation. Key configuration for AI workloads:
// next.config.ts
import type { NextConfig } from "next";
const config: NextConfig = {
  // Next.js 15 renamed experimental.serverComponentsExternalPackages
  // to the top-level serverExternalPackages option
  serverExternalPackages: ["@anthropic-ai/sdk"],
};
export default config;
Vercel AI SDK — The Streaming Abstraction
Vercel AI SDK (ai package) handles the hard parts of streaming AI responses:
// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
export const runtime = "edge";
export async function POST(req: Request) {
const { messages } = await req.json();
const result = await streamText({
model: anthropic("claude-sonnet-4-6"),
system: "You are a helpful assistant.",
messages,
maxTokens: 2048,
});
return result.toDataStreamResponse();
}
On the client side:
// components/Chat.tsx
"use client";
import { useChat } from "@ai-sdk/react"; // moved out of "ai/react" in AI SDK 4
export function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading } =
useChat({ api: "/api/chat" });
return (
<div>
{messages.map((m) => (
<div key={m.id}>
<strong>{m.role}:</strong> {m.content}
</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} disabled={isLoading} />
<button type="submit" disabled={isLoading}>Send</button>
</form>
</div>
);
}
This pattern handles reconnection, partial streaming, and multi-modal content out of the box.
For streaming AI responses, Edge Runtime is the right choice — it starts faster and handles long-lived streaming connections better. However, Edge Runtime has constraints: no native Node.js modules, limited file system access. If your route needs heavy Node.js dependencies (like langchain or puppeteer), use Node.js runtime and accept the cold start penalty. For pure Anthropic/OpenAI SDK calls, always use Edge.
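A route opts into a runtime with a single export. As a sketch, a hypothetical document-ingestion route pinned to Node.js (the route path and body are illustrative) looks like this:

```typescript
// app/api/ingest/route.ts (hypothetical): pin this route to Node.js,
// while streaming routes elsewhere export runtime = "edge"
export const runtime = "nodejs";

export async function POST(req: Request) {
  const { url } = await req.json();
  // ...heavy Node-only work (PDF parsing, puppeteer, etc.) would go here...
  return Response.json({ ok: true, url });
}
```

Because `runtime` is a per-route segment config, you can mix Edge streaming endpoints and Node.js ingestion endpoints in the same app.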
Section 3 — Structured Output with Zod
Getting reliable structured data from LLMs is one of the most important problems in AI engineering. JSON mode is unreliable in production — models hallucinate extra fields, miss required ones, or produce malformed JSON under load.
The correct pattern: Zod schema + Claude's tool_use API.
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
const client = new Anthropic();
// Define your schema with Zod
const ProductSchema = z.object({
name: z.string().describe("Product name"),
price: z.number().positive().describe("Price in USD"),
category: z.enum(["electronics", "clothing", "food", "other"]),
inStock: z.boolean(),
tags: z.array(z.string()).max(5),
});
type Product = z.infer<typeof ProductSchema>;
// Convert a Zod schema to a Claude tool definition. The third-party
// zod-to-json-schema package handles nested objects, enum values,
// optionals, and .describe() descriptions, which a hand-rolled
// mapper gets wrong.
import { zodToJsonSchema } from "zod-to-json-schema";

function zodToClaudeTool(
  name: string,
  description: string,
  schema: z.ZodObject<z.ZodRawShape>
) {
  const jsonSchema = zodToJsonSchema(schema) as {
    properties?: Record<string, unknown>;
    required?: string[];
  };
  return {
    name,
    description,
    input_schema: {
      type: "object" as const,
      properties: jsonSchema.properties ?? {},
      required: jsonSchema.required ?? [],
    },
  };
}
async function extractProduct(rawText: string): Promise<Product> {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
tools: [zodToClaudeTool("extract_product", "Extract product information", ProductSchema)],
tool_choice: { type: "tool", name: "extract_product" },
messages: [
{
role: "user",
content: `Extract product information from this text:\n\n${rawText}`,
},
],
});
const toolUse = response.content.find((block) => block.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") {
throw new Error("No tool use response");
}
// Zod validates at runtime — throws on schema mismatch
return ProductSchema.parse(toolUse.input);
}
Using Vercel AI SDK's built-in generateObject:
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";
const { object } = await generateObject({
model: anthropic("claude-sonnet-4-6"),
schema: z.object({
sentiment: z.enum(["positive", "negative", "neutral"]),
confidence: z.number().min(0).max(1),
summary: z.string().max(100),
}),
prompt: "Analyze the sentiment of this review: " + reviewText,
});
// object is fully typed — TypeScript knows the shape
console.log(object.sentiment); // "positive" | "negative" | "neutral"
OpenAI-style JSON mode (`response_format: { type: "json_object" }`) and prompting Claude to "respond only with JSON" are tempting but dangerous in production. Under high load, models occasionally produce truncated or malformed JSON. The tool_use approach forces the model to fill a schema; if it can't, you get an error rather than corrupt data. Always use tool_use for structured extraction in production.
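When `ProductSchema.parse` does throw, the cheapest recovery is a bounded retry of the whole extraction call. A minimal generic helper (a sketch; this is not part of the Anthropic SDK):

```typescript
// Retry a flaky async call a bounded number of times,
// rethrowing the last error if every attempt fails
async function withRetry<T>(fn: () => Promise<T>, attempts = 2): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Wrap the call site, e.g. `withRetry(() => extractProduct(rawText))`, so one malformed tool response doesn't fail the whole request.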
Section 4 — Vector Database Selection
Choosing a vector database is a decision you'll live with for years. Here's an honest comparison for 2026:
| Factor | pgvector | Qdrant | Pinecone | Weaviate |
|---|---|---|---|---|
| Infrastructure | Add-on to Postgres | Standalone service | Managed cloud only | Standalone or cloud |
| Self-hosted | Yes | Yes (Docker) | No | Yes (Docker) |
| Vectors per collection | ~10M practical | Unlimited (sharding) | Unlimited (expensive) | ~50M practical |
| Multi-tenancy | Schema/table isolation | Native collections | Namespaces | Multi-tenancy API |
| Filtering | SQL WHERE clauses | Payload filtering (fast) | Metadata filtering | GraphQL-based |
| Managed cost (1M vectors) | $0 (your Postgres) | ~$25/month | ~$70/month | ~$25/month |
| TypeScript SDK quality | Drizzle / Prisma | Official TS SDK | Official TS SDK | Official TS SDK |
pgvector is the right default if you're already on Postgres. Add the extension, store vectors in a column, and you're done — no new infrastructure.
// With Drizzle ORM
import { sql } from "drizzle-orm";
import { pgTable, text, vector, index } from "drizzle-orm/pg-core";
export const documents = pgTable(
"documents",
{
id: text("id").primaryKey(),
content: text("content").notNull(),
embedding: vector("embedding", { dimensions: 1536 }),
},
(table) => ({
embeddingIndex: index("embedding_idx").using(
"hnsw",
table.embedding.op("vector_cosine_ops")
),
})
);
// Similarity search: <=> is pgvector's cosine distance operator;
// pass the query vector as a JSON-style '[...]' string literal
const similar = await db
  .select()
  .from(documents)
  .orderBy(sql`${documents.embedding} <=> ${JSON.stringify(queryEmbedding)}`)
  .limit(10);
Switch to Qdrant when:
- You need sub-10ms p99 search at 10M+ vectors
- Your multi-tenancy requirements are complex (isolating by user, organization)
- You want native sparse+dense hybrid search
import { QdrantClient } from "@qdrant/js-client-rest";
const client = new QdrantClient({ url: "http://localhost:6333" });
// Upsert vectors
await client.upsert("documents", {
wait: true,
points: [
{
id: 123, // Qdrant point IDs must be unsigned integers or UUID strings
vector: embeddingArray,
payload: { content: "...", userId: "user_456", createdAt: Date.now() },
},
],
});
// Search with filter
const results = await client.search("documents", {
vector: queryEmbedding,
limit: 10,
filter: {
must: [{ key: "userId", match: { value: "user_456" } }],
},
});
Pinecone is overpriced for most use cases. Unless you're an enterprise team with dedicated managed infrastructure requirements, Qdrant Cloud at $25/month serves the same use case at 1/3 the cost.
Section 5 — Streaming Architecture
Full streaming architecture for a production AI application:
User Browser
↕ EventSource / fetch streaming
Next.js Edge Function
↕ Anthropic SDK (streaming)
Claude API
↕ (parallel, if needed)
MCP Server / Tool calls
↕
Postgres / Qdrant / External APIs
For complex multi-step flows (RAG + generation), structure your route handler carefully:
// app/api/rag/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText, tool } from "ai";
import { z } from "zod";
import { getEmbedding, searchDocuments } from "@/lib/vector";
export const runtime = "edge";
export async function POST(req: Request) {
const { query, userId } = await req.json();
const result = await streamText({
model: anthropic("claude-sonnet-4-6"),
system: `You are a helpful assistant with access to a knowledge base.
Use the search_knowledge_base tool to find relevant information before answering.`,
messages: [{ role: "user", content: query }],
tools: {
search_knowledge_base: tool({
description: "Search the knowledge base for relevant documents",
parameters: z.object({
query: z.string().describe("Search query"),
limit: z.number().min(1).max(10).default(5),
}),
execute: async ({ query, limit }) => {
const embedding = await getEmbedding(query);
const docs = await searchDocuments(embedding, { userId, limit });
return docs.map((d) => ({ id: d.id, content: d.content, score: d.score }));
},
}),
},
maxSteps: 3, // Allow up to 3 tool-use rounds
onFinish: async ({ usage, finishReason }) => {
// Log usage for billing/observability
await logUsage({ userId, tokens: usage, finishReason });
},
});
return result.toDataStreamResponse();
}
Section 6 — Observability
You cannot optimize what you cannot measure. AI applications have three observability layers:
Layer 1: LLM Call Tracking (Helicone)
Helicone is a proxy that sits between your application and the LLM API. Basic tracking needs only a configuration change to the client, not a code rewrite:
// Before
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
// After (Helicone proxy)
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
baseURL: "https://anthropic.helicone.ai",
defaultHeaders: {
"Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
"Helicone-User-Id": userId, // Track per-user costs
"Helicone-Session-Id": sessionId, // Group related calls
},
});
Helicone gives you: latency histograms, cost per request, error rates, prompt version comparison. It's $0 for the first 10K requests/month — essential for early-stage AI products.
Layer 2: Application Traces (OpenTelemetry)
For production, instrument your AI routes with OpenTelemetry:
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("ai-service");
export async function generateWithTracing(prompt: string, userId: string) {
return tracer.startActiveSpan("ai.generate", async (span) => {
span.setAttributes({
"ai.model": "claude-sonnet-4-6",
"user.id": userId,
"ai.prompt_length": prompt.length,
});
try {
const result = await generate(prompt);
span.setAttributes({
"ai.output_tokens": result.usage.outputTokens,
"ai.input_tokens": result.usage.inputTokens,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
throw error;
} finally {
span.end();
}
});
}
Layer 3: Output Quality (Evals)
The hardest and most important layer. For production AI apps, build a lightweight eval pipeline:
// scripts/eval.ts
const testCases = [
{ input: "What is the refund policy?", expectedTopics: ["30 days", "receipt"] },
{ input: "How do I reset my password?", expectedTopics: ["email", "reset link"] },
];
for (const testCase of testCases) {
const output = await runRag(testCase.input);
const score = await gradeOutput(output, testCase.expectedTopics);
console.log({ input: testCase.input, score, output: output.slice(0, 100) });
}
Run evals on every model change, prompt change, and weekly in CI.
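The `gradeOutput` helper above is left abstract; the simplest version is keyword coverage (an illustrative sketch, with an LLM grader usually layered on top in production):

```typescript
// Hypothetical gradeOutput: fraction of expected topics the output mentions
function gradeOutput(output: string, expectedTopics: string[]): number {
  const lower = output.toLowerCase();
  const hits = expectedTopics.filter((t) => lower.includes(t.toLowerCase()));
  return hits.length / expectedTopics.length; // 0..1 coverage score
}
```

Awaiting a synchronous grader, as the eval script does, is harmless, and the signature leaves room to swap in an async LLM-based grader later.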
Section 7 — Complete Architecture Reference
Here's the full directory structure for a production TypeScript AI application:
my-ai-app/
├── app/
│ ├── api/
│ │ ├── chat/route.ts # Streaming chat endpoint (Edge)
│ │ ├── rag/route.ts # RAG endpoint (Edge)
│ │ └── ingest/route.ts # Document ingestion (Node.js)
│ └── (ui pages)
├── lib/
│ ├── ai/
│ │ ├── client.ts # Configured Anthropic client (Helicone proxy)
│ │ ├── structured.ts # Zod + tool_use helpers
│ │ └── prompts.ts # Versioned system prompts
│ ├── vector/
│ │ ├── embed.ts # Embedding generation
│ │ ├── search.ts # Vector similarity search
│ │ └── ingest.ts # Document chunking + upsert
│ └── db/
│ ├── schema.ts # Drizzle schema (includes pgvector)
│ └── queries.ts # Typed database queries
├── scripts/
│ ├── eval.ts # Output quality evaluation
│ └── ingest-docs.ts # Batch document ingestion
└── instrumentation.ts # OpenTelemetry initialization
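The chunking step in `lib/vector/ingest.ts` can be sketched as a fixed-size sliding window (an illustrative sketch; real pipelines often split on sentence or heading boundaries instead):

```typescript
// Fixed-size chunks with overlap so context isn't cut mid-thought
function chunkText(text: string, size = 800, overlap = 100): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached
  }
  return chunks;
}
```

Each chunk is then embedded and upserted with its source document ID so retrieval can cite the original.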
Most teams default to OpenAI's text-embedding-3-small for embeddings, and it's a solid choice. But if you're using Claude for generation and would rather not couple retrieval to OpenAI, Voyage AI's voyage-3 benchmarks well on retrieval quality and has a clean TypeScript integration. For most applications the difference is marginal; for search-heavy RAG apps, it's measurable.
Section 8 — Cost Management at Scale
AI applications have a fundamentally different cost structure from traditional software. Token costs scale with usage in ways that CPU costs don't.
Practical cost controls:
// 1. Cache embeddings: never re-embed the same content
import { createHash } from "node:crypto";

// In-memory cache is fine for a single long-lived process; on serverless,
// where memory resets between invocations, back it with Redis or a table
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string): Promise<number[]> {
  const key = createHash("sha256").update(text).digest("hex");
  const cached = embeddingCache.get(key);
  if (cached) return cached;
  const embedding = await generateEmbedding(text);
  embeddingCache.set(key, embedding);
  return embedding;
}
// 2. Cache LLM responses for identical prompts (semantic cache in production)
// 3. Set hard token limits per user per day
// 4. Use smaller models for classification tasks
const classificationModel = anthropic("claude-haiku-3-5"); // 10x cheaper
const generationModel = anthropic("claude-sonnet-4-6"); // full quality
// 5. Implement context window management
type Message = { role: "system" | "user" | "assistant"; content: string };

function trimMessagesToFit(messages: Message[], maxTokens = 80000): Message[] {
  // Keep the system message, then fit the most recent messages within budget
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let tokenCount = 0;
  const trimmed: Message[] = [];
  for (const msg of [...rest].reverse()) {
    const estimated = msg.content.length / 4; // rough estimate: ~4 chars per token
    if (tokenCount + estimated > maxTokens) break;
    trimmed.unshift(msg);
    tokenCount += estimated;
  }
  return [...system, ...trimmed];
}
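Control #3 above (hard token limits per user per day) can be sketched as a simple budget check. This assumes an in-memory ledger and a hypothetical limit; production would back it with Redis or Postgres so limits survive restarts:

```typescript
const DAILY_TOKEN_LIMIT = 200_000; // illustrative per-user budget

// In-memory usage ledger keyed by user; the day string resets it implicitly
const usageLedger = new Map<string, { day: string; tokens: number }>();

function consumeTokens(userId: string, tokens: number, now = new Date()): boolean {
  const day = now.toISOString().slice(0, 10); // e.g. "2026-03-01"
  const entry = usageLedger.get(userId);
  const used = entry?.day === day ? entry.tokens : 0;
  if (used + tokens > DAILY_TOKEN_LIMIT) return false; // reject: over budget
  usageLedger.set(userId, { day, tokens: used + tokens });
  return true;
}
```

Call `consumeTokens(userId, estimatedTokens)` before each LLM request and return a 429 when it comes back false.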
Section 9 — Takeaways
The TypeScript AI stack has matured enough in 2026 that there are clear right answers for most decisions:
- Next.js 15 + Vercel AI SDK: Use it. The streaming and multi-provider abstractions are genuinely excellent.
- Zod + tool_use: Non-negotiable for structured outputs in production.
- pgvector first, Qdrant when you scale: Don't add new infrastructure before you need it.
- Helicone from day one: Visibility into token costs and latency is essential, not optional.
- Evals are infrastructure: Treat them that way.
The teams shipping the best AI products in 2026 aren't using exotic tools — they're using this stack and executing well. The framework choices are the easy part.
Stack versions: Next.js 15.2, Vercel AI SDK 4.x, Anthropic SDK 0.39, Drizzle ORM 0.30, Qdrant JS 1.9. Current as of March 2026.
— iBuidl Research Team