nn-deploy

Neural Network Compiler Stack & Deployment Platform

nn-deploy is a full-stack neural network compiler and deployment platform that runs entirely in the browser. Define models using a simple DSL, compile them through an optimization pipeline, generate executable code, and run inference — all client-side.

Compilation Pipeline

Multi-Level IR

High-level graph IR with immutable transformations and full pass history for visualization.

7 Optimization Passes

Shape inference, constant folding, DCE, operator fusion, quantization, layout optimization, memory planning.

3 Code Gen Backends

Target JavaScript (reference), WebGPU WGSL compute shaders, or WASM dispatch from the same IR.

In-Browser Inference

Run compiled models directly in the browser with auto-detection of the best available engine.

Quick Example

import { parseDSL, compileModel } from '@nn-deploy/compiler';
import { Tensor, InferenceSession } from '@nn-deploy/runtime';

// 1. Parse model DSL
const graph = parseDSL(`
  model MLP {
    input x: Tensor<float32>[1, 784]
    h = MatMul(x, w)
    out = Softmax(h)
    output out
  }
`);

// 2. Compile with optimizations
const { model } = compileModel(graph, { target: 'js' });

// 3. Run inference
const session = await InferenceSession.create(model);
const result = await session.run({ x: Tensor.rand([1, 784]) });
console.log(result.outputs);
session.dispose();

Getting Started

nn-deploy is a Turborepo monorepo with three packages and a Next.js web application.

Project Structure

nn-deploy/
  apps/
    web/ # Next.js frontend (Playground, Inference, Landing)
  packages/
    compiler/ # @nn-deploy/compiler - IR, passes, codegen
    runtime/ # @nn-deploy/runtime - Tensor, engines, session
    ui/ # @nn-deploy/ui - Shared components
  examples/ # JSON model definitions
  turbo.json # Build pipeline config
  vercel.json # Deployment config

Installation

git clone https://github.com/0xtkey256/nn-deploy.git
cd nn-deploy
npm install

Development

# Start dev server (compiles packages + starts Next.js)
npm run dev

# Build all packages
npm run build

Tip: Open the playground to try the compiler interactively without cloning.

DSL Guide

nn-deploy uses a custom domain-specific language for defining neural network models. The DSL compiles to an immutable graph IR that passes through the optimization pipeline.

Grammar

A model definition follows this structure:

model ModelName {
  // Declare inputs with tensor type
  input x: Tensor<float32>[1, 784]

  // Operations: target = Op(args, kwargs)
  h1 = MatMul(x, weights)
  h1b = Add(h1, bias)
  activated = ReLU(h1b)

  // Declare output
  output activated
}

Syntax Details

  • Comments: // line comments
  • Tensor types: Tensor<dtype>[dim1, dim2, ...]
  • Arrows: -> or → (Unicode) for data flow annotations
  • Keyword arguments: key=value (e.g., filters=16, kernel=3)
  • Array values: [1, 2, 3] in kwargs
  • Auto constants: Undefined references (e.g., w1, bias) are automatically created as Constant nodes with random weights

Data Types

| Type | Bytes | Description |
|------|-------|-------------|
| float32 | 4 | 32-bit floating point (default) |
| float16 | 2 | 16-bit floating point |
| int32 | 4 | 32-bit integer |
| int8 | 1 | 8-bit integer (quantized) |
| uint8 | 1 | Unsigned 8-bit integer |
| bool | 1 | Boolean |
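These element sizes determine a tensor's memory footprint. A minimal sketch of the calculation (the helper names here are illustrative, not part of the @nn-deploy API):

```typescript
// Bytes per element for each supported dtype (matches the table above).
const DTYPE_BYTES: Record<string, number> = {
  float32: 4,
  float16: 2,
  int32: 4,
  int8: 1,
  uint8: 1,
  bool: 1,
};

// Total byte size of a tensor: product of dims times element size.
function tensorByteSize(dtype: string, shape: number[]): number {
  const numel = shape.reduce((acc, dim) => acc * dim, 1);
  return numel * DTYPE_BYTES[dtype];
}
```

For example, a float32 tensor of shape [1, 784] occupies 784 × 4 = 3136 bytes, while its int8-quantized counterpart needs only 784.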

Operations Reference

The compiler supports 38 operations across 10 categories:

I/O

| Op | Inputs | Description |
|----|--------|-------------|
| Input | 0 | Model input tensor |
| Output | 1 | Model output tensor |
| Constant | 0 | Constant tensor value (auto-created for weights) |

Linear Algebra

| Op | Inputs | Description |
|----|--------|-------------|
| MatMul | 2 | Matrix multiplication |
| Add | 2 | Element-wise addition |
| Sub | 2 | Element-wise subtraction |
| Mul | 2 | Element-wise multiplication |
| Div | 2 | Element-wise division |

Convolution

OpInputsDescriptionKey Args
Conv2D2-32D convolutionfilters, kernel, stride, padding
DepthwiseConv2D2-3Depthwise separable convolutionkernel, stride, padding
ConvTranspose2D2-3Transposed 2D convolutionfilters, kernel, stride

Normalization

| Op | Inputs | Description |
|----|--------|-------------|
| BatchNorm | 1-5 | Batch normalization |
| LayerNorm | 1-3 | Layer normalization |
| GroupNorm | 1-3 | Group normalization |
| InstanceNorm | 1-3 | Instance normalization |

Activation

| Op | Description | Formula |
|----|-------------|---------|
| ReLU | Rectified linear unit | max(0, x) |
| GELU | Gaussian error linear unit | x * Φ(x) |
| Sigmoid | Sigmoid activation | 1 / (1 + e^(-x)) |
| Tanh | Hyperbolic tangent | tanh(x) |
| Softmax | Softmax normalization | e^(xi) / Σe^(xj) |
| SiLU | Sigmoid linear unit (Swish) | x * sigmoid(x) |

Pooling

| Op | Description | Key Args |
|----|-------------|----------|
| MaxPool2D | Max pooling 2D | kernel, stride |
| AvgPool2D | Average pooling 2D | kernel, stride |
| GlobalAvgPool | Global average pooling | (none) |
| AdaptiveAvgPool | Adaptive average pooling | output_size |

Shape

| Op | Description |
|----|-------------|
| Reshape | Reshape tensor dimensions |
| Transpose | Transpose tensor axes |
| Flatten | Flatten to 2D |
| Concat | Concatenate tensors along axis |
| Split | Split tensor along axis |
| Squeeze | Remove size-1 dimensions |
| Unsqueeze | Insert size-1 dimension |

Reduction & Attention

| Op | Category | Description |
|----|----------|-------------|
| ReduceSum | Reduce | Sum reduction along axes |
| ReduceMean | Reduce | Mean reduction along axes |
| ReduceMax | Reduce | Max reduction along axes |
| Embedding | Embedding | Embedding lookup |
| ScaledDotProductAttention | Attention | Scaled dot-product attention (Q, K, V) |

Fused Operations

Created automatically by the operator fusion pass:

| Op | Fuses | Description |
|----|-------|-------------|
| FusedConvBNReLU | Conv2D + BatchNorm + ReLU | Single fused convolution kernel |
| FusedConvBN | Conv2D + BatchNorm | Fused conv with batch norm |
| FusedMatMulAdd | MatMul + Add | Fused linear layer |
| FusedLinearReLU | MatMul + Add + ReLU | Fused linear + activation |

DSL Examples

A simple multi-layer perceptron for MNIST digit classification (784 → 128 → 10):

model MNIST_MLP {
  input x: Tensor<float32>[1, 784]

  // Hidden layer
  h1 = MatMul(x, w1)
  h1b = Add(h1, b1)
  a1 = ReLU(h1b)

  // Output layer
  h2 = MatMul(a1, w2)
  h2b = Add(h2, b2)
  probs = Softmax(h2b)

  output probs
}

Fusion: The operator fusion pass will fuse MatMul + Add into FusedMatMulAdd, and the first MatMul + Add + ReLU chain into FusedLinearReLU.

A small CNN with two convolution blocks followed by a classifier:

model TinyCNN {
  input x: Tensor<float32>[1, 3, 32, 32]

  // Conv block 1
  c1 = Conv2D(x, w1, filters=16, kernel=3, stride=1, padding=same)
  bn1 = BatchNorm(c1)
  r1 = ReLU(bn1)
  p1 = MaxPool2D(r1, kernel=2, stride=2)

  // Conv block 2
  c2 = Conv2D(p1, w2, filters=32, kernel=3, stride=1, padding=same)
  bn2 = BatchNorm(c2)
  r2 = ReLU(bn2)
  p2 = MaxPool2D(r2, kernel=2, stride=2)

  // Classifier
  gap = GlobalAvgPool(p2)
  flat = Flatten(gap)
  logits = MatMul(flat, wfc)
  out = Softmax(logits)

  output out
}

Fusion: Each Conv2D → BatchNorm → ReLU chain fuses into a single FusedConvBNReLU node.

A residual block with skip connection:

model ResNetBlock {
  input x: Tensor<float32>[1, 64, 16, 16]

  // Main path
  c1 = Conv2D(x, w1, filters=64, kernel=3, stride=1, padding=same)
  bn1 = BatchNorm(c1)
  r1 = ReLU(bn1)

  c2 = Conv2D(r1, w2, filters=64, kernel=3, stride=1, padding=same)
  bn2 = BatchNorm(c2)

  // Residual connection
  res = Add(x, bn2)
  out = ReLU(res)

  output out
}

A single transformer layer with self-attention and feed-forward network:

model TransformerBlock {
  input tokens: Tensor<float32>[1, 32, 64]

  // Self-attention
  ln1 = LayerNorm(tokens)
  q = MatMul(ln1, wq)
  k = MatMul(ln1, wk)
  v = MatMul(ln1, wv)
  attn = ScaledDotProductAttention(q, k, v)
  proj = MatMul(attn, wo)
  res1 = Add(tokens, proj)

  // Feed-forward
  ln2 = LayerNorm(res1)
  ff1 = MatMul(ln2, w1)
  ff1b = Add(ff1, b1)
  act = GELU(ff1b)
  ff2 = MatMul(act, w2)
  ff2b = Add(ff2, b2)
  out = Add(res1, ff2b)

  output out
}

MobileNet-style depthwise separable convolution:

model DepthwiseSeparable {
  input x: Tensor<float32>[1, 32, 16, 16]

  // Depthwise conv
  dw = DepthwiseConv2D(x, dw_w, kernel=3, stride=1, padding=same)
  dw_bn = BatchNorm(dw)
  dw_relu = ReLU(dw_bn)

  // Pointwise conv (1x1)
  pw = Conv2D(dw_relu, pw_w, filters=64, kernel=1, stride=1)
  pw_bn = BatchNorm(pw)
  out = ReLU(pw_bn)

  output out
}

JSON Format

Alternatively, models can be defined in an ONNX-like JSON format:

{
  "name": "SimpleMLP",
  "nodes": [
    { "name": "x", "op": "Input", "inputs": [],
      "outputs": [{ "name": "x", "tensorType": { "dtype": "float32", "shape": [1, 784] } }] },
    { "name": "h1", "op": "MatMul", "inputs": ["x", "w1"],
      "outputs": [{ "name": "h1" }] },
    { "name": "out", "op": "Softmax", "inputs": ["h1"],
      "outputs": [{ "name": "out" }] }
  ]
}

Optimization Passes

The compiler includes 7 optimization passes that transform the graph to improve performance. Passes run sequentially, and the full history is preserved for visualization in the playground.

1 Shape Inference

Propagates tensor shapes through the graph in topological order. Each operation computes its output shape from its input shapes (e.g., MatMul computes [M,K] × [K,N] = [M,N]).

  • Annotates every edge with a tensorType (dtype + shape)
  • Handles Conv2D output: floor((H - K + 2P) / S) + 1
  • Enables all downstream passes that need shape information
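The Conv2D rule above can be sketched as a small helper (the function name is illustrative, not the compiler's internal API):

```typescript
// Conv2D output spatial size: floor((H - K + 2P) / S) + 1,
// where H = input size, K = kernel size, P = padding, S = stride.
function convOutputSize(h: number, k: number, p: number, s: number): number {
  return Math.floor((h - k + 2 * p) / s) + 1;
}
```

With kernel=3, stride=1, padding=1 a 32×32 input stays 32×32 ("same" padding); a 2×2 pool with stride 2 and no padding halves it to 16×16.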
2 Constant Folding

Evaluates subgraphs where all inputs are constants at compile time. Iterates until a fixed point (no more folding possible).

  • Replaces computed constant chains with single Constant nodes
  • Reduces runtime computation by moving work to compile time
3 Dead Code Elimination

Removes nodes that don't contribute to any model output. Performs backward BFS from Output nodes to find reachable nodes.

  • Eliminates unreachable nodes and their edges
  • Cleans up artifacts from other passes
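The backward reachability walk can be sketched as follows (the simplified node shape here is illustrative; the real pass operates on the Graph IR's nodes and edges):

```typescript
// Minimal node view for reachability: each node lists its producer ids.
interface SimpleNode {
  id: string;
  inputs: string[];
}

// Backward BFS from the output node ids; anything not visited is dead code.
function liveNodes(nodes: SimpleNode[], outputIds: string[]): Set<string> {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const live = new Set<string>();
  const queue = [...outputIds];
  while (queue.length > 0) {
    const id = queue.pop()!;
    if (live.has(id)) continue;
    live.add(id);
    for (const producer of byId.get(id)?.inputs ?? []) queue.push(producer);
  }
  return live;
}
```

Nodes outside the returned set (and their incident edges) are removed from the graph.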
4 Operator Fusion

Detects and fuses common operation patterns into optimized single-kernel nodes. Requires each intermediate node to have exactly one consumer.

Four fusion patterns:

  • Conv2D + BatchNorm + ReLU → FusedConvBNReLU
  • Conv2D + BatchNorm → FusedConvBN
  • MatMul + Add + ReLU → FusedLinearReLU
  • MatMul + Add → FusedMatMulAdd
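The single-consumer requirement can be expressed as a small predicate (a sketch; the chain and consumer-count shapes here are illustrative, not the real IR types):

```typescript
// A fusion candidate chain is valid only if each intermediate node has
// exactly one consumer, so fusing it cannot break another user of its output.
function chainIsFusible(
  chain: string[],                    // node ids in producer -> consumer order
  consumerCount: Map<string, number>, // consumers per node id in the graph
): boolean {
  // Every node except the last must feed only the next node in the chain;
  // the final node may fan out freely.
  return chain.slice(0, -1).every((id) => consumerCount.get(id) === 1);
}
```

A residual block is a good counterexample: its Conv2D output also feeds the skip Add, so the Conv2D + BatchNorm pair there stays fusible while a chain through a multiply-consumed intermediate does not.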
5 Quantization

Converts eligible operations from float32 to int8 symmetric quantization for faster inference and smaller model size.

  • Targets: MatMul, Conv2D, DepthwiseConv2D, Add, and all fused variants
  • Annotates nodes with _quantized, _quantScheme, _quantBits
  • Updates edge tensor types along quantized paths
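Symmetric int8 quantization maps each value through a single per-tensor scale with a zero-point of 0. A minimal sketch (helper names are illustrative):

```typescript
// Symmetric int8 quantization: scale = max|x| / 127, zero-point fixed at 0.
function quantizeInt8(data: Float32Array): { q: Int8Array; scale: number } {
  let absMax = 0;
  for (const x of data) absMax = Math.max(absMax, Math.abs(x));
  const scale = absMax / 127 || 1; // avoid divide-by-zero for all-zero tensors
  const q = new Int8Array(data.length);
  for (let i = 0; i < data.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(data[i] / scale)));
  }
  return { q, scale };
}

// Dequantize: x ≈ q * scale.
function dequantizeInt8(q: Int8Array, scale: number): Float32Array {
  return Float32Array.from(q, (v) => v * scale);
}
```

The quantized tensor is 4× smaller than float32, at the cost of rounding error bounded by half a scale step.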
6 Layout Optimization

Converts tensor layouts from NCHW to NHWC for GPU-friendly memory access patterns.

  • Targets spatial operations: Conv2D, DepthwiseConv2D, pooling, BatchNorm
  • Transposes shapes: [N,C,H,W] → [N,H,W,C]
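The layout change affects both the shape metadata and the element order of row-major data. A sketch of both (helper names are illustrative):

```typescript
// Shape-level layout conversion: [N, C, H, W] -> [N, H, W, C].
function nchwToNhwcShape(shape: number[]): number[] {
  const [n, c, h, w] = shape;
  return [n, h, w, c];
}

// Data-level permute for a single row-major tensor.
function nchwToNhwcData(src: Float32Array, shape: number[]): Float32Array {
  const [n, c, h, w] = shape;
  const dst = new Float32Array(src.length);
  for (let b = 0; b < n; b++)
    for (let ic = 0; ic < c; ic++)
      for (let ih = 0; ih < h; ih++)
        for (let iw = 0; iw < w; iw++)
          // Read at NCHW offset, write at the corresponding NHWC offset.
          dst[((b * h + ih) * w + iw) * c + ic] =
            src[((b * c + ic) * h + ih) * w + iw];
  return dst;
}
```

After the permute, the channel values for one spatial position sit contiguously, which is the access pattern GPU convolution kernels prefer.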
7 Memory Planning

Performs liveness analysis and allocates memory offsets using a greedy first-fit decreasing algorithm to minimize peak memory usage.

  • Computes tensor lifetimes (first use → last use)
  • Allocates non-overlapping memory blocks
  • Annotates nodes with _memOffset, _memSize, _peakMemory
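A sketch of the greedy planner under these assumptions (simplified lifetime records stand in for the real pass's per-node annotations):

```typescript
interface Lifetime { id: string; size: number; start: number; end: number }
interface Allocation { id: string; offset: number; size: number }

// First-fit decreasing: place the largest tensors first, reusing any byte
// range whose current occupants are not alive at the same time.
function planMemory(tensors: Lifetime[]): { allocs: Allocation[]; peak: number } {
  const sorted = [...tensors].sort((a, b) => b.size - a.size);
  const allocs: Allocation[] = [];
  let peak = 0;
  for (const t of sorted) {
    // Byte ranges claimed by tensors whose lifetimes overlap with t.
    const busy = allocs
      .filter((a) => {
        const other = tensors.find((x) => x.id === a.id)!;
        return other.start <= t.end && t.start <= other.end;
      })
      .sort((a, b) => a.offset - b.offset);
    // First fit: slide past conflicting ranges until a gap is large enough.
    let offset = 0;
    for (const a of busy) {
      if (offset + t.size <= a.offset) break; // fits in the gap before `a`
      offset = Math.max(offset, a.offset + a.size);
    }
    allocs.push({ id: t.id, offset, size: t.size });
    peak = Math.max(peak, offset + t.size);
  }
  return { allocs, peak };
}
```

Because lifetimes that do not overlap may share an offset, peak memory can be far below the sum of all tensor sizes.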

Pass Pipeline API

import { runPipeline, ALL_PASSES } from '@nn-deploy/compiler';

// Run all 7 passes
const result = runPipeline(graph, ALL_PASSES);
// result.graph: optimized graph
// result.history: array of { passName, graph } for each step

// Or run individual passes
import { shapeInferencePass, operatorFusionPass } from '@nn-deploy/compiler';
const result = runPipeline(graph, [shapeInferencePass, operatorFusionPass]);

Code Generation

After optimization, the compiler generates executable code for one of three backends:

  • JavaScript (target: 'js'): reference implementation; always works in any browser.
  • WebGPU WGSL (target: 'webgpu'): compute shaders for GPU acceleration.
  • WASM (target: 'wasm'): structured op dispatch for near-native speed.

JavaScript Backend

Generates a self-contained JS module with a Tensor class and per-operation kernel functions:

// Example generated kernel for MatMul
function kernel_h1(inputs, output) {
  const A = inputs[0], B = inputs[1];
  const M = A.shape[A.shape.length - 2];
  const K = A.shape[A.shape.length - 1];
  const N = B.shape[B.shape.length - 1];
  for (let m = 0; m < M; m++) {
    for (let n = 0; n < N; n++) {
      let sum = 0;
      for (let k = 0; k < K; k++) {
        sum += A.data[m * K + k] * B.data[k * N + n];
      }
      output.data[m * N + n] = sum;
    }
  }
}

WebGPU WGSL Backend

Generates compute shaders for GPU execution:

@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;

struct Params { M: u32, N: u32, K: u32 }
@group(0) @binding(3) var<uniform> params: Params;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.x;
  let col = gid.y;
  if (row >= params.M || col >= params.N) { return; }
  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < params.K; k = k + 1u) {
    sum = sum + A[row * params.K + k] * B[k * params.N + col];
  }
  C[row * params.N + col] = sum;
}

Compile API

import { compileModel } from '@nn-deploy/compiler';

const result = compileModel(graph, {
  target: 'js',              // 'js' | 'webgpu' | 'wasm'
  passes: ALL_PASSES,        // which passes to run (default: all)
  enableQuantization: false,  // enable INT8 quantization
});

// result.model: CompiledModel (kernels + memory plan)
// result.code: GeneratedCode (source string + kernel list)
// result.history: pass-by-pass graph snapshots
// result.metrics: { before, after } GraphMetrics

Runtime API

The @nn-deploy/runtime package provides the execution engine for running compiled models in the browser.

Tensor

The Tensor class manages typed array data with shape and stride information:

import { Tensor } from '@nn-deploy/runtime';

// Create tensors
const zeros = Tensor.zeros([2, 3]);          // 2x3 zero tensor
const ones  = Tensor.ones([4, 4]);           // 4x4 ones tensor
const rand  = Tensor.rand([1, 784]);         // random uniform [0, 1)
const randn = Tensor.randn([1, 128]);        // random normal (0, 1)

// From data
const t = new Tensor(
  new Float32Array([1, 2, 3, 4, 5, 6]),
  [2, 3]  // shape
);

// Properties
t.shape;      // [2, 3]
t.numel;      // 6
t.ndim;       // 2
t.byteSize;   // 24
t.strides;    // [3, 1]

// Methods
t.reshape([3, 2]);  // new tensor with different shape
t.clone();          // deep copy
t.toArray();        // Float32Array -> number[]
t.toString();       // "Tensor<float32>[2,3]"
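The strides shown above follow standard row-major (C-order) layout, where each stride is the product of all later dimensions. A sketch of the computation (helper names are illustrative, not the Tensor class internals):

```typescript
// Row-major strides: stride of dim i = product of dims after i.
// e.g. shape [2, 3] -> strides [3, 1].
function computeStrides(shape: number[]): number[] {
  const strides = new Array<number>(shape.length);
  let acc = 1;
  for (let i = shape.length - 1; i >= 0; i--) {
    strides[i] = acc;
    acc *= shape[i];
  }
  return strides;
}

// Flat buffer offset of a multi-dimensional index: dot(index, strides).
function flatIndex(index: number[], strides: number[]): number {
  return index.reduce((sum, ix, d) => sum + ix * strides[d], 0);
}
```

This is why reshape is free: only the shape and strides change, while the underlying typed array stays put.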

InferenceSession

The main API for running compiled models. Automatically selects the best available engine.

import { InferenceSession, Tensor } from '@nn-deploy/runtime';
import { parseDSL, compileModel } from '@nn-deploy/compiler';

// Compile
const graph = parseDSL(dslSource);
const { model } = compileModel(graph, { target: 'js' });

// Create session (auto-selects best engine)
const session = await InferenceSession.create(model);

// Run inference
const result = await session.run({
  x: Tensor.rand([1, 784])
});

// Result
result.outputs;    // Record<string, Tensor>
result.latencyMs;  // execution time in ms
result.backend;    // 'js' | 'webgpu'

// Metadata
session.getMetadata();
// { name, target, nodeCount, edgeCount }

// Cleanup
session.dispose();

Engine Selection

The runtime supports two execution engines:

| Engine | Requirement | Speed | Compatibility |
|--------|-------------|-------|---------------|
| JSEngine | None | Baseline | All browsers |
| WebGPUEngine | WebGPU API | GPU-accelerated | Chrome 113+, Edge 113+ |

When target: 'webgpu' is specified but WebGPU is unavailable, the session automatically falls back to the JS engine.
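The fallback decision itself is simple to state as a pure function (a sketch; in the browser the availability flag would come from something like `'gpu' in navigator`, and the names here are illustrative):

```typescript
type Backend = 'js' | 'webgpu';

// Prefer WebGPU when requested and available; otherwise fall back to the
// always-available JS engine.
function selectBackend(requested: Backend, webgpuAvailable: boolean): Backend {
  return requested === 'webgpu' && webgpuAvailable ? 'webgpu' : 'js';
}
```

Keeping the selection pure makes the fallback path easy to unit-test without a real GPU.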

Examples

nn-deploy ships with 5 pre-built model examples covering different architectures. All are available in the playground.

| Model | Category | Input Shape | Architecture |
|-------|----------|-------------|--------------|
| MNIST MLP | Classic | [1, 784] | 2 hidden layers (MatMul + Add + ReLU) with Softmax output |
| Tiny CNN | CNN | [1, 3, 32, 32] | 2 conv blocks (Conv2D + BN + ReLU + Pool) + FC classifier |
| ResNet Block | CNN | [1, 64, 16, 16] | 2 conv layers with residual skip connection |
| Transformer Block | Transformer | [1, 32, 64] | Self-attention (Q/K/V) + FFN (GELU) with residual connections |
| DepthSep Conv | Efficient | [1, 32, 16, 16] | MobileNet-style depthwise + pointwise (1x1) conv |

Optimization Effects

Here's what the optimization passes do to each model:

  • MNIST MLP: MatMul + Add fuses to FusedMatMulAdd, first chain fuses to FusedLinearReLU (9 nodes → ~7 nodes)
  • Tiny CNN: Both Conv2D + BN + ReLU chains fuse to FusedConvBNReLU (16 nodes → ~12 nodes)
  • ResNet Block: First conv chain fuses to FusedConvBNReLU, second to FusedConvBN
  • Transformer: MatMul + Add chains fuse in the FFN, attention weights get quantized
  • DepthSep Conv: Both DW and PW blocks undergo BN fusion and layout optimization to NHWC

Try it: Open the playground, select any example, and click "Compile & Optimize" to see the full pass-by-pass transformation timeline.

Architecture

nn-deploy follows a classic compiler architecture: frontend (parser) → IR → optimization passes → backend (codegen) → runtime.

Immutable Graph IR

The core data structure is an immutable graph. Every transformation returns a new Graph object, preserving the full history for visualization and debugging:

interface Graph {
  name: string;
  nodes: Node[];
  edges: Edge[];
  passHistory: PassRecord[];  // full transformation log
}

interface Node {
  id: string;
  op: OpType;           // one of 38 operation types
  name: string;
  inputs: Port[];
  outputs: Port[];
  attributes: Record<string, any>;
}

interface Edge {
  id: string;
  sourceNodeId: string;
  sourcePort: number;
  targetNodeId: string;
  targetPort: number;
  tensorType?: TensorType;  // annotated by shape inference
}

Data Flow

  1. Parse: parseDSL(source) → tokenize → parse → build Graph
  2. Optimize: runPipeline(graph, passes) → each pass returns a new Graph
  3. Compile: compileModel(graph, options) → runs passes + codegen → CompiledModel
  4. Execute: InferenceSession.create(model) → select engine → session.run(inputs)

Key Design Decisions

  • Immutable transformations: Each pass returns a new graph. No mutations, no side effects. Enables pass history visualization and easy debugging.
  • Auto-constant creation: Undefined weight references in DSL are auto-created as Constant nodes, simplifying model definitions.
  • Multi-target codegen: Same optimized IR compiles to JS, WebGPU, or WASM. Backend choice is deferred to compile time.
  • Sandboxed execution: JS engine uses new Function() for sandboxed code execution in the browser.
  • Greedy memory planning: First-fit decreasing algorithm minimizes peak memory without complex ILP solvers.

Tech Stack

| Layer | Technology |
|-------|------------|
| Language | TypeScript (strict mode) |
| Build | Turborepo + npm workspaces |
| Frontend | Next.js 15, React 19 |
| State | Zustand |
| Visualization | D3.js + ELK.js (graph layout) |
| GPU | WebGPU (WGSL compute shaders) |
| Styling | Tailwind CSS v4 |
| Deployment | Vercel |