nn-deploy

Neural Network Compiler Stack & Deployment Platform

nn-deploy is a full-stack neural network compiler and deployment platform that runs entirely in the browser. Define models using a simple DSL, compile them through an optimization pipeline, generate executable code, and run inference — all client-side.

Compilation Pipeline

Multi-Level IR

High-level graph IR with immutable transformations and full pass history for visualization.

7 Optimization Passes

Shape inference, constant folding, DCE, operator fusion, quantization, layout optimization, memory planning.

3 Code Gen Backends

Target JavaScript (reference), WebGPU WGSL compute shaders, or WASM dispatch from the same IR.

In-Browser Inference

Run compiled models directly in the browser with auto-detection of the best available engine.

Quick Example

import { parseDSL, compileModel } from '@nn-deploy/compiler';
import { Tensor, InferenceSession } from '@nn-deploy/runtime';

// 1. Parse model DSL
const graph = parseDSL(`
  model MLP {
    input x: Tensor<float32>[1, 784]
    h = MatMul(x, w)
    out = Softmax(h)
    output out
  }
`);

// 2. Compile with optimizations
const { model } = compileModel(graph, { target: 'js' });

// 3. Run inference
const session = await InferenceSession.create(model);
const result = await session.run({ x: Tensor.rand([1, 784]) });
console.log(result.outputs);
session.dispose();

Getting Started

nn-deploy is a Turborepo monorepo with three packages and a Next.js web application.

Project Structure

nn-deploy/
  apps/
    web/ # Next.js frontend (Playground, Inference, Landing)
  packages/
    compiler/ # @nn-deploy/compiler - IR, passes, codegen
    runtime/ # @nn-deploy/runtime - Tensor, engines, session
    ui/ # @nn-deploy/ui - Shared components
  examples/ # JSON model definitions
  turbo.json # Build pipeline config
  vercel.json # Deployment config

Installation

git clone https://github.com/0xtkey256/nn-deploy.git
cd nn-deploy
npm install

Development

# Start dev server (compiles packages + starts Next.js)
npm run dev

# Build all packages
npm run build

Tip: Open the playground to try the compiler interactively without cloning.

DSL Guide

nn-deploy uses a custom domain-specific language for defining neural network models. The DSL compiles to an immutable graph IR that passes through the optimization pipeline.

Grammar

A model definition follows this structure:

model ModelName {
  // Declare inputs with tensor type
  input x: Tensor<float32>[1, 784]

  // Operations: target = Op(args, kwargs)
  h1 = MatMul(x, weights)
  h1b = Add(h1, bias)
  activated = ReLU(h1b)

  // Declare output
  output activated
}

Syntax Details

  • Comments: // line comments
  • Tensor types: Tensor<dtype>[dim1, dim2, ...]
  • Arrows: -> or → (Unicode) for data flow annotations
  • Keyword arguments: key=value (e.g., filters=16, kernel=3)
  • Array values: [1, 2, 3] in kwargs
  • Auto constants: Undefined references (e.g., w1, bias) are automatically created as Constant nodes with random weights

Data Types

| Type | Bytes | Description |
|------|-------|-------------|
| float32 | 4 | 32-bit floating point (default) |
| float16 | 2 | 16-bit floating point |
| int32 | 4 | 32-bit integer |
| int8 | 1 | 8-bit integer (quantized) |
| uint8 | 1 | Unsigned 8-bit integer |
| bool | 1 | Boolean |
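These element sizes determine a tensor's memory footprint. A minimal sketch of the calculation (the helper names here are illustrative, not part of the @nn-deploy API):

```typescript
// Bytes per element for each supported dtype (matches the table above).
const DTYPE_BYTES: Record<string, number> = {
  float32: 4,
  float16: 2,
  int32: 4,
  int8: 1,
  uint8: 1,
  bool: 1,
};

// Total byte size of a tensor: product of dims times element size.
function tensorByteSize(dtype: string, shape: number[]): number {
  const numel = shape.reduce((acc, dim) => acc * dim, 1);
  return numel * DTYPE_BYTES[dtype];
}
```

For example, a float32 tensor of shape [1, 784] occupies 784 × 4 = 3136 bytes, while its int8-quantized counterpart needs only 784.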

Operations Reference

The compiler supports 38 operations across 10 categories:

I/O

| Op | Inputs | Description |
|----|--------|-------------|
| Input | 0 | Model input tensor |
| Output | 1 | Model output tensor |
| Constant | 0 | Constant tensor value (auto-created for weights) |

Linear Algebra

| Op | Inputs | Description |
|----|--------|-------------|
| MatMul | 2 | Matrix multiplication |
| Add | 2 | Element-wise addition |
| Sub | 2 | Element-wise subtraction |
| Mul | 2 | Element-wise multiplication |
| Div | 2 | Element-wise division |

Convolution

OpInputsDescriptionKey Args
Conv2D2-32D convolutionfilters, kernel, stride, padding
DepthwiseConv2D2-3Depthwise separable convolutionkernel, stride, padding
ConvTranspose2D2-3Transposed 2D convolutionfilters, kernel, stride

Normalization

| Op | Inputs | Description |
|----|--------|-------------|
| BatchNorm | 1-5 | Batch normalization |
| LayerNorm | 1-3 | Layer normalization |
| GroupNorm | 1-3 | Group normalization |
| InstanceNorm | 1-3 | Instance normalization |

Activation

| Op | Description | Formula |
|----|-------------|---------|
| ReLU | Rectified linear unit | max(0, x) |
| GELU | Gaussian error linear unit | x * Φ(x) |
| Sigmoid | Sigmoid activation | 1 / (1 + e^(-x)) |
| Tanh | Hyperbolic tangent | tanh(x) |
| Softmax | Softmax normalization | e^(xi) / Σe^(xj) |
| SiLU | Sigmoid linear unit (Swish) | x * sigmoid(x) |

Pooling

| Op | Description | Key Args |
|----|-------------|----------|
| MaxPool2D | Max pooling 2D | kernel, stride |
| AvgPool2D | Average pooling 2D | kernel, stride |
| GlobalAvgPool | Global average pooling | (none) |
| AdaptiveAvgPool | Adaptive average pooling | output_size |

Shape

| Op | Description |
|----|-------------|
| Reshape | Reshape tensor dimensions |
| Transpose | Transpose tensor axes |
| Flatten | Flatten to 2D |
| Concat | Concatenate tensors along axis |
| Split | Split tensor along axis |
| Squeeze | Remove size-1 dimensions |
| Unsqueeze | Insert size-1 dimension |

Reduction & Attention

| Op | Category | Description |
|----|----------|-------------|
| ReduceSum | Reduce | Sum reduction along axes |
| ReduceMean | Reduce | Mean reduction along axes |
| ReduceMax | Reduce | Max reduction along axes |
| Embedding | Embedding | Embedding lookup |
| ScaledDotProductAttention | Attention | Scaled dot-product attention (Q, K, V) |

Fused Operations

Created automatically by the operator fusion pass:

| Op | Fuses | Description |
|----|-------|-------------|
| FusedConvBNReLU | Conv2D + BatchNorm + ReLU | Single fused convolution kernel |
| FusedConvBN | Conv2D + BatchNorm | Fused conv with batch norm |
| FusedMatMulAdd | MatMul + Add | Fused linear layer |
| FusedLinearReLU | MatMul + Add + ReLU | Fused linear + activation |

DSL Examples

A simple multi-layer perceptron for MNIST digit classification (784 → 128 → 10):

model MNIST_MLP {
  input x: Tensor<float32>[1, 784]

  // Hidden layer
  h1 = MatMul(x, w1)
  h1b = Add(h1, b1)
  a1 = ReLU(h1b)

  // Output layer
  h2 = MatMul(a1, w2)
  h2b = Add(h2, b2)
  probs = Softmax(h2b)

  output probs
}

Fusion: The operator fusion pass will fuse MatMul + Add into FusedMatMulAdd, and the first MatMul + Add + ReLU chain into FusedLinearReLU.

A small CNN with two convolution blocks followed by a classifier:

model TinyCNN {
  input x: Tensor<float32>[1, 3, 32, 32]

  // Conv block 1
  c1 = Conv2D(x, w1, filters=16, kernel=3, stride=1, padding=same)
  bn1 = BatchNorm(c1)
  r1 = ReLU(bn1)
  p1 = MaxPool2D(r1, kernel=2, stride=2)

  // Conv block 2
  c2 = Conv2D(p1, w2, filters=32, kernel=3, stride=1, padding=same)
  bn2 = BatchNorm(c2)
  r2 = ReLU(bn2)
  p2 = MaxPool2D(r2, kernel=2, stride=2)

  // Classifier
  gap = GlobalAvgPool(p2)
  flat = Flatten(gap)
  logits = MatMul(flat, wfc)
  out = Softmax(logits)

  output out
}

Fusion: Each Conv2D → BatchNorm → ReLU chain fuses into a single FusedConvBNReLU node.

A residual block with skip connection:

model ResNetBlock {
  input x: Tensor<float32>[1, 64, 16, 16]

  // Main path
  c1 = Conv2D(x, w1, filters=64, kernel=3, stride=1, padding=same)
  bn1 = BatchNorm(c1)
  r1 = ReLU(bn1)

  c2 = Conv2D(r1, w2, filters=64, kernel=3, stride=1, padding=same)
  bn2 = BatchNorm(c2)

  // Residual connection
  res = Add(x, bn2)
  out = ReLU(res)

  output out
}

A single transformer layer with self-attention and feed-forward network:

model TransformerBlock {
  input tokens: Tensor<float32>[1, 32, 64]

  // Self-attention
  ln1 = LayerNorm(tokens)
  q = MatMul(ln1, wq)
  k = MatMul(ln1, wk)
  v = MatMul(ln1, wv)
  attn = ScaledDotProductAttention(q, k, v)
  proj = MatMul(attn, wo)
  res1 = Add(tokens, proj)

  // Feed-forward
  ln2 = LayerNorm(res1)
  ff1 = MatMul(ln2, w1)
  ff1b = Add(ff1, b1)
  act = GELU(ff1b)
  ff2 = MatMul(act, w2)
  ff2b = Add(ff2, b2)
  out = Add(res1, ff2b)

  output out
}

MobileNet-style depthwise separable convolution:

model DepthwiseSeparable {
  input x: Tensor<float32>[1, 32, 16, 16]

  // Depthwise conv
  dw = DepthwiseConv2D(x, dw_w, kernel=3, stride=1, padding=same)
  dw_bn = BatchNorm(dw)
  dw_relu = ReLU(dw_bn)

  // Pointwise conv (1x1)
  pw = Conv2D(dw_relu, pw_w, filters=64, kernel=1, stride=1)
  pw_bn = BatchNorm(pw)
  out = ReLU(pw_bn)

  output out
}

JSON Format

Alternatively, models can be defined in an ONNX-like JSON format:

{
  "name": "SimpleMLP",
  "nodes": [
    { "name": "x", "op": "Input", "inputs": [],
      "outputs": [{ "name": "x", "tensorType": { "dtype": "float32", "shape": [1, 784] } }] },
    { "name": "h1", "op": "MatMul", "inputs": ["x", "w1"],
      "outputs": [{ "name": "h1" }] },
    { "name": "out", "op": "Softmax", "inputs": ["h1"],
      "outputs": [{ "name": "out" }] }
  ]
}

Optimization Passes

The compiler includes 7 optimization passes that transform the graph to improve performance. Passes run sequentially, and the full history is preserved for visualization in the playground.

1 Shape Inference

Propagates tensor shapes through the graph in topological order. Each operation computes its output shape from its input shapes (e.g., MatMul computes [M,K] × [K,N] = [M,N]).

  • Annotates every edge with a tensorType (dtype + shape)
  • Handles Conv2D output: floor((H - K + 2P) / S) + 1
  • Enables all downstream passes that need shape information
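The Conv2D rule above can be sketched as a small helper (the function name is illustrative, not the compiler's internal API):

```typescript
// Conv2D output spatial size: floor((H - K + 2P) / S) + 1,
// where H = input size, K = kernel size, P = padding, S = stride.
function convOutputSize(h: number, k: number, p: number, s: number): number {
  return Math.floor((h - k + 2 * p) / s) + 1;
}
```

With kernel=3, stride=1, padding=1 a 32×32 input stays 32×32 ("same" padding); a 2×2 pool with stride 2 and no padding halves it to 16×16.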
2 Constant Folding

Evaluates subgraphs where all inputs are constants at compile time. Iterates until a fixed point (no more folding possible).

  • Replaces computed constant chains with single Constant nodes
  • Reduces runtime computation by moving work to compile time
3 Dead Code Elimination

Removes nodes that don't contribute to any model output. Performs backward BFS from Output nodes to find reachable nodes.

  • Eliminates unreachable nodes and their edges
  • Cleans up artifacts from other passes
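The backward reachability walk can be sketched as follows (the simplified node shape here is illustrative; the real pass operates on the Graph IR's nodes and edges):

```typescript
// Minimal node view for reachability: each node lists its producer ids.
interface SimpleNode {
  id: string;
  inputs: string[];
}

// Backward BFS from the output node ids; anything not visited is dead code.
function liveNodes(nodes: SimpleNode[], outputIds: string[]): Set<string> {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const live = new Set<string>();
  const queue = [...outputIds];
  while (queue.length > 0) {
    const id = queue.pop()!;
    if (live.has(id)) continue;
    live.add(id);
    for (const producer of byId.get(id)?.inputs ?? []) queue.push(producer);
  }
  return live;
}
```

Nodes outside the returned set (and their incident edges) are removed from the graph.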
4 Operator Fusion

Detects and fuses common operation patterns into optimized single-kernel nodes. Requires each intermediate node to have exactly one consumer.

Four fusion patterns:

  • Conv2D + BatchNorm + ReLU → FusedConvBNReLU
  • Conv2D + BatchNorm → FusedConvBN
  • MatMul + Add + ReLU → FusedLinearReLU
  • MatMul + Add → FusedMatMulAdd
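The single-consumer requirement can be expressed as a small predicate (a sketch; the chain and consumer-count shapes here are illustrative, not the real IR types):

```typescript
// A fusion candidate chain is valid only if each intermediate node has
// exactly one consumer, so fusing it cannot break another user of its output.
function chainIsFusible(
  chain: string[],                    // node ids in producer -> consumer order
  consumerCount: Map<string, number>, // consumers per node id in the graph
): boolean {
  // Every node except the last must feed only the next node in the chain;
  // the final node may fan out freely.
  return chain.slice(0, -1).every((id) => consumerCount.get(id) === 1);
}
```

A residual block is a good counterexample: its Conv2D output also feeds the skip Add, so the Conv2D + BatchNorm pair there stays fusible while a chain through a multiply-consumed intermediate does not.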
5 Quantization

Converts eligible operations from float32 to int8 symmetric quantization for faster inference and smaller model size.

  • Targets: MatMul, Conv2D, DepthwiseConv2D, Add, and all fused variants
  • Annotates nodes with _quantized, _quantScheme, _quantBits
  • Updates edge tensor types along quantized paths
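Symmetric int8 quantization maps each value through a single per-tensor scale with a zero-point of 0. A minimal sketch (helper names are illustrative):

```typescript
// Symmetric int8 quantization: scale = max|x| / 127, zero-point fixed at 0.
function quantizeInt8(data: Float32Array): { q: Int8Array; scale: number } {
  let absMax = 0;
  for (const x of data) absMax = Math.max(absMax, Math.abs(x));
  const scale = absMax / 127 || 1; // avoid divide-by-zero for all-zero tensors
  const q = new Int8Array(data.length);
  for (let i = 0; i < data.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(data[i] / scale)));
  }
  return { q, scale };
}

// Dequantize: x ≈ q * scale.
function dequantizeInt8(q: Int8Array, scale: number): Float32Array {
  return Float32Array.from(q, (v) => v * scale);
}
```

The quantized tensor is 4× smaller than float32, at the cost of rounding error bounded by half a scale step.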
6 Layout Optimization

Converts tensor layouts from NCHW to NHWC for GPU-friendly memory access patterns.

  • Targets spatial operations: Conv2D, DepthwiseConv2D, pooling, BatchNorm
  • Transposes shapes: [N,C,H,W] → [N,H,W,C]
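The layout change affects both the shape metadata and the element order of row-major data. A sketch of both (helper names are illustrative):

```typescript
// Shape-level layout conversion: [N, C, H, W] -> [N, H, W, C].
function nchwToNhwcShape(shape: number[]): number[] {
  const [n, c, h, w] = shape;
  return [n, h, w, c];
}

// Data-level permute for a single row-major tensor.
function nchwToNhwcData(src: Float32Array, shape: number[]): Float32Array {
  const [n, c, h, w] = shape;
  const dst = new Float32Array(src.length);
  for (let b = 0; b < n; b++)
    for (let ic = 0; ic < c; ic++)
      for (let ih = 0; ih < h; ih++)
        for (let iw = 0; iw < w; iw++)
          // Read at NCHW offset, write at the corresponding NHWC offset.
          dst[((b * h + ih) * w + iw) * c + ic] =
            src[((b * c + ic) * h + ih) * w + iw];
  return dst;
}
```

After the permute, the channel values for one spatial position sit contiguously, which is the access pattern GPU convolution kernels prefer.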
7 Memory Planning

Performs liveness analysis and allocates memory offsets using a greedy first-fit decreasing algorithm to minimize peak memory usage.

  • Computes tensor lifetimes (first use → last use)
  • Allocates non-overlapping memory blocks
  • Annotates nodes with _memOffset, _memSize, _peakMemory
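A sketch of the greedy planner under these assumptions (simplified lifetime records stand in for the real pass's per-node annotations):

```typescript
interface Lifetime { id: string; size: number; start: number; end: number }
interface Allocation { id: string; offset: number; size: number }

// First-fit decreasing: place the largest tensors first, reusing any byte
// range whose current occupants are not alive at the same time.
function planMemory(tensors: Lifetime[]): { allocs: Allocation[]; peak: number } {
  const sorted = [...tensors].sort((a, b) => b.size - a.size);
  const allocs: Allocation[] = [];
  let peak = 0;
  for (const t of sorted) {
    // Byte ranges claimed by tensors whose lifetimes overlap with t.
    const busy = allocs
      .filter((a) => {
        const other = tensors.find((x) => x.id === a.id)!;
        return other.start <= t.end && t.start <= other.end;
      })
      .sort((a, b) => a.offset - b.offset);
    // First fit: slide past conflicting ranges until a gap is large enough.
    let offset = 0;
    for (const a of busy) {
      if (offset + t.size <= a.offset) break; // fits in the gap before `a`
      offset = Math.max(offset, a.offset + a.size);
    }
    allocs.push({ id: t.id, offset, size: t.size });
    peak = Math.max(peak, offset + t.size);
  }
  return { allocs, peak };
}
```

Because lifetimes that do not overlap may share an offset, peak memory can be far below the sum of all tensor sizes.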

Pass Pipeline API

import { runPipeline, ALL_PASSES } from '@nn-deploy/compiler';

// Run all 7 passes
const result = runPipeline(graph, ALL_PASSES);
// result.graph: optimized graph
// result.history: array of { passName, graph } for each step

// Or run individual passes
import { shapeInferencePass, operatorFusionPass } from '@nn-deploy/compiler';
const result = runPipeline(graph, [shapeInferencePass, operatorFusionPass]);

Code Generation

After optimization, the compiler generates executable code for one of three backends:

  • JavaScript (target: 'js'): reference implementation; always works in any browser.
  • WebGPU WGSL (target: 'webgpu'): compute shaders for GPU acceleration.
  • WASM (target: 'wasm'): structured op dispatch for near-native speed.

JavaScript Backend

Generates a self-contained JS module with a Tensor class and per-operation kernel functions:

// Example generated kernel for MatMul
function kernel_h1(inputs, output) {
  const A = inputs[0], B = inputs[1];
  const M = A.shape[A.shape.length - 2];
  const K = A.shape[A.shape.length - 1];
  const N = B.shape[B.shape.length - 1];
  for (let m = 0; m < M; m++) {
    for (let n = 0; n < N; n++) {
      let sum = 0;
      for (let k = 0; k < K; k++) {
        sum += A.data[m * K + k] * B.data[k * N + n];
      }
      output.data[m * N + n] = sum;
    }
  }
}

WebGPU WGSL Backend

Generates compute shaders for GPU execution:

@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;

struct Params { M: u32, N: u32, K: u32 }
@group(0) @binding(3) var<uniform> params: Params;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.x;
  let col = gid.y;
  if (row >= params.M || col >= params.N) { return; }
  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < params.K; k = k + 1u) {
    sum = sum + A[row * params.K + k] * B[k * params.N + col];
  }
  C[row * params.N + col] = sum;
}

Compile API

import { compileModel } from '@nn-deploy/compiler';

const result = compileModel(graph, {
  target: 'js',              // 'js' | 'webgpu' | 'wasm'
  passes: ALL_PASSES,        // which passes to run (default: all)
  enableQuantization: false,  // enable INT8 quantization
});

// result.model: CompiledModel (kernels + memory plan)
// result.code: GeneratedCode (source string + kernel list)
// result.history: pass-by-pass graph snapshots
// result.metrics: { before, after } GraphMetrics

Runtime API

The @nn-deploy/runtime package provides the execution engine for running compiled models in the browser.

Tensor

The Tensor class manages typed array data with shape and stride information:

import { Tensor } from '@nn-deploy/runtime';

// Create tensors
const zeros = Tensor.zeros([2, 3]);          // 2x3 zero tensor
const ones  = Tensor.ones([4, 4]);           // 4x4 ones tensor
const rand  = Tensor.rand([1, 784]);         // random uniform [0, 1)
const randn = Tensor.randn([1, 128]);        // random normal (0, 1)

// From data
const t = new Tensor(
  new Float32Array([1, 2, 3, 4, 5, 6]),
  [2, 3]  // shape
);

// Properties
t.shape;      // [2, 3]
t.numel;      // 6
t.ndim;       // 2
t.byteSize;   // 24
t.strides;    // [3, 1]

// Methods
t.reshape([3, 2]);  // new tensor with different shape
t.clone();          // deep copy
t.toArray();        // Float32Array -> number[]
t.toString();       // "Tensor<float32>[2,3]"
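The strides shown above follow standard row-major (C-order) layout, where each stride is the product of all later dimensions. A sketch of the computation (helper names are illustrative, not the Tensor class internals):

```typescript
// Row-major strides: stride of dim i = product of dims after i.
// e.g. shape [2, 3] -> strides [3, 1].
function computeStrides(shape: number[]): number[] {
  const strides = new Array<number>(shape.length);
  let acc = 1;
  for (let i = shape.length - 1; i >= 0; i--) {
    strides[i] = acc;
    acc *= shape[i];
  }
  return strides;
}

// Flat buffer offset of a multi-dimensional index: dot(index, strides).
function flatIndex(index: number[], strides: number[]): number {
  return index.reduce((sum, ix, d) => sum + ix * strides[d], 0);
}
```

This is why reshape is free: only the shape and strides change, while the underlying typed array stays put.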

InferenceSession

The main API for running compiled models. Automatically selects the best available engine.

import { InferenceSession, Tensor } from '@nn-deploy/runtime';
import { parseDSL, compileModel } from '@nn-deploy/compiler';

// Compile
const graph = parseDSL(dslSource);
const { model } = compileModel(graph, { target: 'js' });

// Create session (auto-selects best engine)
const session = await InferenceSession.create(model);

// Run inference
const result = await session.run({
  x: Tensor.rand([1, 784])
});

// Result
result.outputs;    // Record<string, Tensor>
result.latencyMs;  // execution time in ms
result.backend;    // 'js' | 'webgpu'

// Metadata
session.getMetadata();
// { name, target, nodeCount, edgeCount }

// Cleanup
session.dispose();

Engine Selection

The runtime supports two execution engines:

| Engine | Requirement | Speed | Compatibility |
|--------|-------------|-------|---------------|
| JSEngine | None | Baseline | All browsers |
| WebGPUEngine | WebGPU API | GPU-accelerated | Chrome 113+, Edge 113+ |

When target: 'webgpu' is specified but WebGPU is unavailable, the session automatically falls back to the JS engine.
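The fallback decision itself is simple to state as a pure function (a sketch; in the browser the availability flag would come from something like `'gpu' in navigator`, and the names here are illustrative):

```typescript
type Backend = 'js' | 'webgpu';

// Prefer WebGPU when requested and available; otherwise fall back to the
// always-available JS engine.
function selectBackend(requested: Backend, webgpuAvailable: boolean): Backend {
  return requested === 'webgpu' && webgpuAvailable ? 'webgpu' : 'js';
}
```

Keeping the selection pure makes the fallback path easy to unit-test without a real GPU.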

Examples

nn-deploy ships with 5 pre-built model examples covering different architectures. All are available in the playground.

| Model | Category | Input Shape | Architecture |
|-------|----------|-------------|--------------|
| MNIST MLP | Classic | [1, 784] | 2 hidden layers (MatMul + Add + ReLU) with Softmax output |
| Tiny CNN | CNN | [1, 3, 32, 32] | 2 conv blocks (Conv2D + BN + ReLU + Pool) + FC classifier |
| ResNet Block | CNN | [1, 64, 16, 16] | 2 conv layers with residual skip connection |
| Transformer Block | Transformer | [1, 32, 64] | Self-attention (Q/K/V) + FFN (GELU) with residual connections |
| DepthSep Conv | Efficient | [1, 32, 16, 16] | MobileNet-style depthwise + pointwise (1x1) conv |

Optimization Effects

Here's what the optimization passes do to each model:

  • MNIST MLP: MatMul + Add fuses to FusedMatMulAdd, first chain fuses to FusedLinearReLU (9 nodes → ~7 nodes)
  • Tiny CNN: Both Conv2D + BN + ReLU chains fuse to FusedConvBNReLU (16 nodes → ~12 nodes)
  • ResNet Block: First conv chain fuses to FusedConvBNReLU, second to FusedConvBN
  • Transformer: MatMul + Add chains fuse in the FFN, attention weights get quantized
  • DepthSep Conv: Both DW and PW blocks undergo BN fusion and layout optimization to NHWC

Try it: Open the playground, select any example, and click "Compile & Optimize" to see the full pass-by-pass transformation timeline.

Architecture

nn-deploy follows a classic compiler architecture: frontend (parser) → IR → optimization passes → backend (codegen) → runtime.

Immutable Graph IR

The core data structure is an immutable graph. Every transformation returns a new Graph object, preserving the full history for visualization and debugging:

interface Graph {
  name: string;
  nodes: Node[];
  edges: Edge[];
  passHistory: PassRecord[];  // full transformation log
}

interface Node {
  id: string;
  op: OpType;           // one of 38 operation types
  name: string;
  inputs: Port[];
  outputs: Port[];
  attributes: Record<string, any>;
}

interface Edge {
  id: string;
  sourceNodeId: string;
  sourcePort: number;
  targetNodeId: string;
  targetPort: number;
  tensorType?: TensorType;  // annotated by shape inference
}

Data Flow

  1. Parse: parseDSL(source) → tokenize → parse → build Graph
  2. Optimize: runPipeline(graph, passes) → each pass returns a new Graph
  3. Compile: compileModel(graph, options) → runs passes + codegen → CompiledModel
  4. Execute: InferenceSession.create(model) → select engine → session.run(inputs)

Key Design Decisions

  • Immutable transformations: Each pass returns a new graph. No mutations, no side effects. Enables pass history visualization and easy debugging.
  • Auto-constant creation: Undefined weight references in DSL are auto-created as Constant nodes, simplifying model definitions.
  • Multi-target codegen: Same optimized IR compiles to JS, WebGPU, or WASM. Backend choice is deferred to compile time.
  • Sandboxed execution: JS engine uses new Function() for sandboxed code execution in the browser.
  • Greedy memory planning: First-fit decreasing algorithm minimizes peak memory without complex ILP solvers.

Tech Stack

| Layer | Technology |
|-------|------------|
| Language | TypeScript (strict mode) |
| Build | Turborepo + npm workspaces |
| Frontend | Next.js 15, React 19 |
| State | Zustand |
| Visualization | D3.js + ELK.js (graph layout) |
| GPU | WebGPU (WGSL compute shaders) |
| Styling | Tailwind CSS v4 |
| Deployment | Vercel |