# nn-deploy

nn-deploy is a full-stack neural network compiler and deployment platform that runs entirely in the browser. Define models using a simple DSL, compile them through an optimization pipeline, generate executable code, and run inference — all client-side.
## Compilation Pipeline

### Multi-Level IR
High-level graph IR with immutable transformations and full pass history for visualization.

### 7 Optimization Passes
Shape inference, constant folding, DCE, operator fusion, quantization, layout optimization, memory planning.

### 3 Code Gen Backends
Target JavaScript (reference), WebGPU WGSL compute shaders, or WASM dispatch from the same IR.

### In-Browser Inference
Run compiled models directly in the browser with auto-detection of the best available engine.
## Quick Example

```ts
import { parseDSL, compileModel } from '@nn-deploy/compiler';
import { Tensor, InferenceSession } from '@nn-deploy/runtime';

// 1. Parse model DSL
const graph = parseDSL(`
  model MLP {
    input x: Tensor<float32>[1, 784]
    h = MatMul(x, w)
    out = Softmax(h)
    output out
  }
`);

// 2. Compile with optimizations
const { model } = compileModel(graph, { target: 'js' });

// 3. Run inference
const session = await InferenceSession.create(model);
const result = await session.run({ x: Tensor.rand([1, 784]) });
console.log(result.outputs);
session.dispose();
```
## Getting Started

nn-deploy is a Turborepo monorepo with three packages and a Next.js web application.

### Project Structure

```text
apps/
  web/        # Next.js frontend (Playground, Inference, Landing)
packages/
  compiler/   # @nn-deploy/compiler - IR, passes, codegen
  runtime/    # @nn-deploy/runtime - Tensor, engines, session
  ui/         # @nn-deploy/ui - Shared components
examples/     # JSON model definitions
turbo.json    # Build pipeline config
vercel.json   # Deployment config
```

### Installation

```bash
git clone https://github.com/0xtkey256/nn-deploy.git
cd nn-deploy
npm install
```

### Development

```bash
# Start dev server (compiles packages + starts Next.js)
npm run dev

# Build all packages
npm run build
```

> Tip: Open the playground to try the compiler interactively without cloning.
## DSL Guide
nn-deploy uses a custom domain-specific language for defining neural network models. The DSL compiles to an immutable graph IR that passes through the optimization pipeline.
### Grammar

A model definition follows this structure:

```text
model ModelName {
  // Declare inputs with tensor type
  input x: Tensor<float32>[1, 784]

  // Operations: target = Op(args, kwargs)
  h1 = MatMul(x, weights)
  h1b = Add(h1, bias)
  activated = ReLU(h1b)

  // Declare output
  output activated
}
```
### Syntax Details

- **Comments:** `//` line comments
- **Tensor types:** `Tensor<dtype>[dim1, dim2, ...]`
- **Arrows:** `->` or `→` (Unicode) for data flow annotations
- **Keyword arguments:** `key=value` (e.g., `filters=16`, `kernel=3`)
- **Array values:** `[1, 2, 3]` in kwargs
- **Auto constants:** undefined references (e.g., `w1`, `bias`) are automatically created as Constant nodes with random weights
### Data Types

| Type | Bytes | Description |
|---|---|---|
| `float32` | 4 | 32-bit floating point (default) |
| `float16` | 2 | 16-bit floating point |
| `int32` | 4 | 32-bit integer |
| `int8` | 1 | 8-bit integer (quantized) |
| `uint8` | 1 | Unsigned 8-bit integer |
| `bool` | 1 | Boolean |
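The byte sizes above map directly to buffer allocation. A minimal sketch (hypothetical helper, not the actual `@nn-deploy/runtime` API) of deriving a tensor's byte size from its dtype and shape:

```typescript
// Hypothetical helper mirroring the dtype table above.
type DType = "float32" | "float16" | "int32" | "int8" | "uint8" | "bool";

const DTYPE_BYTES: Record<DType, number> = {
  float32: 4,
  float16: 2,
  int32: 4,
  int8: 1,
  uint8: 1,
  bool: 1,
};

// Total bytes = element count * bytes per element.
function tensorByteSize(shape: number[], dtype: DType): number {
  const numel = shape.reduce((acc, dim) => acc * dim, 1);
  return numel * DTYPE_BYTES[dtype];
}
```

For example, a `[1, 784]` float32 input occupies 784 × 4 = 3136 bytes, and int8 quantization cuts that to a quarter.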
## Operations Reference

The compiler supports 38 operations across 10 categories:

### I/O

| Op | Inputs | Description |
|---|---|---|
| `Input` | 0 | Model input tensor |
| `Output` | 1 | Model output tensor |
| `Constant` | 0 | Constant tensor value (auto-created for weights) |

### Linear Algebra

| Op | Inputs | Description |
|---|---|---|
| `MatMul` | 2 | Matrix multiplication |
| `Add` | 2 | Element-wise addition |
| `Sub` | 2 | Element-wise subtraction |
| `Mul` | 2 | Element-wise multiplication |
| `Div` | 2 | Element-wise division |
### Convolution

| Op | Inputs | Description | Key Args |
|---|---|---|---|
| `Conv2D` | 2-3 | 2D convolution | `filters`, `kernel`, `stride`, `padding` |
| `DepthwiseConv2D` | 2-3 | Depthwise separable convolution | `kernel`, `stride`, `padding` |
| `ConvTranspose2D` | 2-3 | Transposed 2D convolution | `filters`, `kernel`, `stride` |
### Normalization

| Op | Inputs | Description |
|---|---|---|
| `BatchNorm` | 1-5 | Batch normalization |
| `LayerNorm` | 1-3 | Layer normalization |
| `GroupNorm` | 1-3 | Group normalization |
| `InstanceNorm` | 1-3 | Instance normalization |
### Activation

| Op | Description | Formula |
|---|---|---|
| `ReLU` | Rectified linear unit | `max(0, x)` |
| `GELU` | Gaussian error linear unit | `x * Φ(x)` |
| `Sigmoid` | Sigmoid activation | `1 / (1 + e^(-x))` |
| `Tanh` | Hyperbolic tangent | `tanh(x)` |
| `Softmax` | Softmax normalization | `e^(x_i) / Σ e^(x_j)` |
| `SiLU` | Sigmoid linear unit (Swish) | `x * sigmoid(x)` |
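The Softmax formula above is typically computed with a max-subtraction trick for numerical stability; subtracting the row maximum leaves the result unchanged but keeps `exp()` from overflowing. A plain-TypeScript sketch, independent of the generated kernels:

```typescript
// Numerically stable softmax: e^(x_i - m) / Σ e^(x_j - m) equals the
// textbook formula, but avoids Infinity from exp() on large inputs.
function softmax(x: number[]): number[] {
  const m = Math.max(...x);
  const exps = x.map((v) => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```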
### Pooling

| Op | Description | Key Args |
|---|---|---|
| `MaxPool2D` | Max pooling 2D | `kernel`, `stride` |
| `AvgPool2D` | Average pooling 2D | `kernel`, `stride` |
| `GlobalAvgPool` | Global average pooling | — |
| `AdaptiveAvgPool` | Adaptive average pooling | `output_size` |
### Shape

| Op | Description |
|---|---|
| `Reshape` | Reshape tensor dimensions |
| `Transpose` | Transpose tensor axes |
| `Flatten` | Flatten to 2D |
| `Concat` | Concatenate tensors along an axis |
| `Split` | Split tensor along an axis |
| `Squeeze` | Remove size-1 dimensions |
| `Unsqueeze` | Insert a size-1 dimension |
### Reduction & Attention

| Op | Category | Description |
|---|---|---|
| `ReduceSum` | Reduce | Sum reduction along axes |
| `ReduceMean` | Reduce | Mean reduction along axes |
| `ReduceMax` | Reduce | Max reduction along axes |
| `Embedding` | Embedding | Embedding lookup |
| `ScaledDotProductAttention` | Attention | Scaled dot-product attention (Q, K, V) |
### Fused Operations

Created automatically by the operator fusion pass:

| Op | Fuses | Description |
|---|---|---|
| `FusedConvBNReLU` | Conv2D + BatchNorm + ReLU | Single fused convolution kernel |
| `FusedConvBN` | Conv2D + BatchNorm | Fused conv with batch norm |
| `FusedMatMulAdd` | MatMul + Add | Fused linear layer |
| `FusedLinearReLU` | MatMul + Add + ReLU | Fused linear + activation |
## DSL Examples

A simple multi-layer perceptron for MNIST digit classification (784 → 128 → 10):

```text
model MNIST_MLP {
  input x: Tensor<float32>[1, 784]

  // Hidden layer
  h1 = MatMul(x, w1)
  h1b = Add(h1, b1)
  a1 = ReLU(h1b)

  // Output layer
  h2 = MatMul(a1, w2)
  h2b = Add(h2, b2)
  probs = Softmax(h2b)

  output probs
}
```
> Fusion: The operator fusion pass will fuse MatMul + Add into `FusedMatMulAdd`, and the first MatMul + Add + ReLU chain into `FusedLinearReLU`.
A small CNN with two convolution blocks followed by a classifier:

```text
model TinyCNN {
  input x: Tensor<float32>[1, 3, 32, 32]

  // Conv block 1
  c1 = Conv2D(x, w1, filters=16, kernel=3, stride=1, padding=same)
  bn1 = BatchNorm(c1)
  r1 = ReLU(bn1)
  p1 = MaxPool2D(r1, kernel=2, stride=2)

  // Conv block 2
  c2 = Conv2D(p1, w2, filters=32, kernel=3, stride=1, padding=same)
  bn2 = BatchNorm(c2)
  r2 = ReLU(bn2)
  p2 = MaxPool2D(r2, kernel=2, stride=2)

  // Classifier
  gap = GlobalAvgPool(p2)
  flat = Flatten(gap)
  logits = MatMul(flat, wfc)
  out = Softmax(logits)

  output out
}
```
> Fusion: Each Conv2D → BatchNorm → ReLU chain fuses into a single `FusedConvBNReLU` node.
A residual block with a skip connection:

```text
model ResNetBlock {
  input x: Tensor<float32>[1, 64, 16, 16]

  // Main path
  c1 = Conv2D(x, w1, filters=64, kernel=3, stride=1, padding=same)
  bn1 = BatchNorm(c1)
  r1 = ReLU(bn1)
  c2 = Conv2D(r1, w2, filters=64, kernel=3, stride=1, padding=same)
  bn2 = BatchNorm(c2)

  // Residual connection
  res = Add(x, bn2)
  out = ReLU(res)

  output out
}
```
A single transformer layer with self-attention and a feed-forward network:

```text
model TransformerBlock {
  input tokens: Tensor<float32>[1, 32, 64]

  // Self-attention
  ln1 = LayerNorm(tokens)
  q = MatMul(ln1, wq)
  k = MatMul(ln1, wk)
  v = MatMul(ln1, wv)
  attn = ScaledDotProductAttention(q, k, v)
  proj = MatMul(attn, wo)
  res1 = Add(tokens, proj)

  // Feed-forward
  ln2 = LayerNorm(res1)
  ff1 = MatMul(ln2, w1)
  ff1b = Add(ff1, b1)
  act = GELU(ff1b)
  ff2 = MatMul(act, w2)
  ff2b = Add(ff2, b2)
  out = Add(res1, ff2b)

  output out
}
```
MobileNet-style depthwise separable convolution:

```text
model DepthwiseSeparable {
  input x: Tensor<float32>[1, 32, 16, 16]

  // Depthwise conv
  dw = DepthwiseConv2D(x, dw_w, kernel=3, stride=1, padding=same)
  dw_bn = BatchNorm(dw)
  dw_relu = ReLU(dw_bn)

  // Pointwise conv (1x1)
  pw = Conv2D(dw_relu, pw_w, filters=64, kernel=1, stride=1)
  pw_bn = BatchNorm(pw)
  out = ReLU(pw_bn)

  output out
}
```
## JSON Format

Alternatively, models can be defined in an ONNX-like JSON format:

```json
{
  "name": "SimpleMLP",
  "nodes": [
    { "name": "x", "op": "Input", "inputs": [],
      "outputs": [{ "name": "x", "tensorType": { "dtype": "float32", "shape": [1, 784] } }] },
    { "name": "h1", "op": "MatMul", "inputs": ["x", "w1"],
      "outputs": [{ "name": "h1" }] },
    { "name": "out", "op": "Softmax", "inputs": ["h1"],
      "outputs": [{ "name": "out" }] }
  ]
}
```
## Optimization Passes
The compiler includes 7 optimization passes that transform the graph to improve performance. Passes run sequentially, and the full history is preserved for visualization in the playground.
### 1. Shape Inference

Propagates tensor shapes through the graph in topological order. Each operation computes its output shape from its input shapes (e.g., MatMul computes [M, K] × [K, N] = [M, N]).

- Annotates every edge with a `tensorType` (dtype + shape)
- Computes Conv2D output size as `floor((H - K + 2P) / S) + 1`
- Enables all downstream passes that need shape information
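To make these two shape rules concrete, here is a standalone sketch (illustrative only, not the compiler's internal code):

```typescript
// MatMul: [..., M, K] x [K, N] -> [..., M, N]; throws on inner-dim mismatch.
function matmulShape(a: number[], b: number[]): number[] {
  const [m, k1] = a.slice(-2);
  const [k2, n] = b.slice(-2);
  if (k1 !== k2) throw new Error(`MatMul shape mismatch: ${k1} vs ${k2}`);
  return [...a.slice(0, -2), m, n];
}

// Conv2D spatial output size: floor((H - K + 2P) / S) + 1, per dimension.
function convOutDim(h: number, kernel: number, pad: number, stride: number): number {
  return Math.floor((h - kernel + 2 * pad) / stride) + 1;
}
```

With kernel 3, padding 1, stride 1 (as in the TinyCNN example), a 32-pixel dimension stays 32 — the "same" padding case.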
### 2. Constant Folding

Evaluates subgraphs whose inputs are all constants at compile time. Iterates until a fixed point (no more folding possible).

- Replaces computed constant chains with single Constant nodes
- Reduces runtime computation by moving work to compile time
### 3. Dead Code Elimination (DCE)

Removes nodes that don't contribute to any model output. Performs a backward BFS from Output nodes to find reachable nodes.

- Eliminates unreachable nodes and their edges
- Cleans up artifacts left behind by other passes
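The backward-BFS reachability idea can be sketched over a minimal edge list (a hypothetical graph shape, much simpler than the real IR):

```typescript
// Hypothetical minimal graph: directed edges as [sourceId, targetId] pairs.
type EdgePair = [string, string];

// Backward BFS from the output nodes; any node never visited is dead code.
function liveNodes(edges: EdgePair[], outputs: string[]): Set<string> {
  // Index producers: target node -> source nodes feeding it.
  const producers = new Map<string, string[]>();
  for (const [src, dst] of edges) {
    if (!producers.has(dst)) producers.set(dst, []);
    producers.get(dst)!.push(src);
  }
  const live = new Set<string>(outputs);
  const queue = [...outputs];
  while (queue.length > 0) {
    const node = queue.pop()!;
    for (const src of producers.get(node) ?? []) {
      if (!live.has(src)) {
        live.add(src);
        queue.push(src);
      }
    }
  }
  return live;
}
```

DCE would then drop every node (and its edges) absent from the returned set.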
### 4. Operator Fusion

Detects and fuses common operation patterns into optimized single-kernel nodes. Requires each intermediate node to have exactly one consumer.

Four fusion patterns:

- Conv2D + BatchNorm + ReLU → `FusedConvBNReLU`
- Conv2D + BatchNorm → `FusedConvBN`
- MatMul + Add + ReLU → `FusedLinearReLU`
- MatMul + Add → `FusedMatMulAdd`
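One way to picture the matching step: try the patterns longest-first against a linear op chain, so Conv2D + BatchNorm + ReLU wins over the shorter Conv2D + BatchNorm. A hypothetical sketch, not the compiler's internals:

```typescript
// Fusion patterns, longest first so greedy matching prefers bigger fusions.
const FUSION_PATTERNS: [string[], string][] = [
  [["Conv2D", "BatchNorm", "ReLU"], "FusedConvBNReLU"],
  [["MatMul", "Add", "ReLU"], "FusedLinearReLU"],
  [["Conv2D", "BatchNorm"], "FusedConvBN"],
  [["MatMul", "Add"], "FusedMatMulAdd"],
];

// Try to match a pattern starting at position i of a linear op chain.
// Returns [fusedOpName, patternLength] or null if nothing matches.
function matchFusionAt(chain: string[], i: number): [string, number] | null {
  for (const [pattern, fused] of FUSION_PATTERNS) {
    if (pattern.every((op, j) => chain[i + j] === op)) {
      return [fused, pattern.length];
    }
  }
  return null;
}
```

The real pass additionally checks that each intermediate node has exactly one consumer before rewriting, as noted above.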
### 5. Quantization

Converts eligible operations from float32 to int8 symmetric quantization for faster inference and a smaller model size.

- Targets: MatMul, Conv2D, DepthwiseConv2D, Add, and all fused variants
- Annotates nodes with `_quantized`, `_quantScheme`, `_quantBits`
- Updates edge tensor types along quantized paths
### 6. Layout Optimization

Converts tensor layouts from NCHW to NHWC for GPU-friendly memory access patterns.

- Targets spatial operations: Conv2D, DepthwiseConv2D, pooling, BatchNorm
- Transposes shapes: `[N, C, H, W]` → `[N, H, W, C]`
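The shape transposition and the corresponding index math can be sketched as follows (illustrative helpers, not the pass's actual code):

```typescript
// NCHW -> NHWC: move channels to the innermost (fastest-varying) axis.
function nchwToNhwcShape(shape: [number, number, number, number]): number[] {
  const [n, c, h, w] = shape;
  return [n, h, w, c];
}

// Flat index of logical element (n, c, h, w) in the NHWC buffer,
// i.e. where that element lands after the layout change.
function nhwcIndex(
  n: number, c: number, h: number, w: number,
  shape: [number, number, number, number],
): number {
  const [, C, H, W] = shape;
  return ((n * H + h) * W + w) * C + c;
}
```

Putting channels innermost means neighboring channel values of one pixel sit contiguously, which is the access pattern per-pixel GPU kernels prefer.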
### 7. Memory Planning

Performs liveness analysis and allocates memory offsets using a greedy first-fit decreasing algorithm to minimize peak memory usage.

- Computes tensor lifetimes (first use → last use)
- Allocates non-overlapping memory blocks
- Annotates nodes with `_memOffset`, `_memSize`, `_peakMemory`
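The greedy first-fit decreasing idea, sketched over simple lifetime intervals (assumed shapes; the real pass works on IR annotations):

```typescript
// A tensor's size in bytes and its lifetime [start, end], inclusive.
interface Alloc { id: string; size: number; start: number; end: number; }

// First-fit decreasing: place the largest tensors first; each goes at the
// lowest offset that does not overlap an already-placed, lifetime-conflicting
// tensor. Tensors with disjoint lifetimes may reuse the same offset.
function planMemory(allocs: Alloc[]): Map<string, number> {
  const placed: (Alloc & { offset: number })[] = [];
  const offsets = new Map<string, number>();
  const sorted = [...allocs].sort((a, b) => b.size - a.size);
  for (const a of sorted) {
    // Conflicts: placed tensors whose lifetimes overlap a's, by offset.
    const conflicts = placed
      .filter((p) => p.start <= a.end && a.start <= p.end)
      .sort((p, q) => p.offset - q.offset);
    let offset = 0;
    for (const c of conflicts) {
      if (offset + a.size <= c.offset) break;       // fits in the gap before c
      offset = Math.max(offset, c.offset + c.size); // otherwise skip past c
    }
    placed.push({ ...a, offset });
    offsets.set(a.id, offset);
  }
  return offsets;
}
```

Two tensors that are never live at the same time share an offset, which is exactly how the pass keeps peak memory low without an ILP solver.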
### Pass Pipeline API

```ts
import { runPipeline, ALL_PASSES } from '@nn-deploy/compiler';

// Run all 7 passes
const result = runPipeline(graph, ALL_PASSES);
// result.graph: optimized graph
// result.history: array of { passName, graph } for each step
```

Individual passes can also be run selectively:

```ts
import { runPipeline, shapeInferencePass, operatorFusionPass } from '@nn-deploy/compiler';

const partial = runPipeline(graph, [shapeInferencePass, operatorFusionPass]);
```
## Code Generation

After optimization, the compiler generates executable code for one of three backends:

- **JavaScript**: reference implementation; always works in any browser.
- **WebGPU (WGSL)**: compute shaders for GPU acceleration.
- **WASM**: structured op dispatch for near-native speed.
### JavaScript Backend

Generates a self-contained JS module with a Tensor class and per-operation kernel functions:

```js
// Example generated kernel for MatMul (2D case)
function kernel_h1(inputs, output) {
  const A = inputs[0], B = inputs[1];
  const M = A.shape[A.shape.length - 2];
  const K = A.shape[A.shape.length - 1];
  const N = B.shape[B.shape.length - 1];
  for (let m = 0; m < M; m++) {
    for (let n = 0; n < N; n++) {
      let sum = 0;
      for (let k = 0; k < K; k++) {
        sum += A.data[m * K + k] * B.data[k * N + n];
      }
      output.data[m * N + n] = sum;
    }
  }
}
```
### WebGPU WGSL Backend

Generates compute shaders for GPU execution:

```wgsl
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;

struct Params { M: u32, N: u32, K: u32 }
@group(0) @binding(3) var<uniform> params: Params;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.x;
  let col = gid.y;
  if (row >= params.M || col >= params.N) { return; }
  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < params.K; k = k + 1u) {
    sum = sum + A[row * params.K + k] * B[k * params.N + col];
  }
  C[row * params.N + col] = sum;
}
```
### Compile API

```ts
import { compileModel, ALL_PASSES } from '@nn-deploy/compiler';

const result = compileModel(graph, {
  target: 'js',              // 'js' | 'webgpu' | 'wasm'
  passes: ALL_PASSES,        // which passes to run (default: all)
  enableQuantization: false, // enable INT8 quantization
});
// result.model:   CompiledModel (kernels + memory plan)
// result.code:    GeneratedCode (source string + kernel list)
// result.history: pass-by-pass graph snapshots
// result.metrics: { before, after } GraphMetrics
```
## Runtime API

The `@nn-deploy/runtime` package provides the execution engine for running compiled models in the browser.
### Tensor

The Tensor class manages typed array data with shape and stride information:

```ts
import { Tensor } from '@nn-deploy/runtime';

// Create tensors
const zeros = Tensor.zeros([2, 3]);   // 2x3 zero tensor
const ones = Tensor.ones([4, 4]);     // 4x4 ones tensor
const rand = Tensor.rand([1, 784]);   // random uniform [0, 1)
const randn = Tensor.randn([1, 128]); // random normal (0, 1)

// From data
const t = new Tensor(
  new Float32Array([1, 2, 3, 4, 5, 6]),
  [2, 3] // shape
);

// Properties
t.shape;    // [2, 3]
t.numel;    // 6
t.ndim;     // 2
t.byteSize; // 24
t.strides;  // [3, 1]

// Methods
t.reshape([3, 2]); // new tensor with a different shape
t.clone();         // deep copy
t.toArray();       // Float32Array -> number[]
t.toString();      // "Tensor<float32>[2,3]"
```
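The `strides` value above ( `[3, 1]` for shape `[2, 3]` ) follows standard row-major layout: the innermost dimension is contiguous. A sketch of how strides can be derived from a shape:

```typescript
// Row-major (C-order) strides: stride of axis i is the product of all
// dimensions to its right, so the last axis always has stride 1.
function computeStrides(shape: number[]): number[] {
  const strides = new Array(shape.length).fill(1);
  for (let i = shape.length - 2; i >= 0; i--) {
    strides[i] = strides[i + 1] * shape[i + 1];
  }
  return strides;
}
```

The flat index of element `(i, j)` in a `[2, 3]` tensor is then `i * 3 + j * 1`, which is what the generated MatMul kernel's index arithmetic relies on.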
### InferenceSession

The main API for running compiled models. Automatically selects the best available engine.

```ts
import { InferenceSession, Tensor } from '@nn-deploy/runtime';
import { parseDSL, compileModel } from '@nn-deploy/compiler';

// Compile
const graph = parseDSL(dslSource);
const { model } = compileModel(graph, { target: 'js' });

// Create session (auto-selects best engine)
const session = await InferenceSession.create(model);

// Run inference
const result = await session.run({
  x: Tensor.rand([1, 784])
});

// Result
result.outputs;   // Record<string, Tensor>
result.latencyMs; // execution time in ms
result.backend;   // 'js' | 'webgpu'

// Metadata
session.getMetadata();
// { name, target, nodeCount, edgeCount }

// Cleanup
session.dispose();
```
### Engine Selection
The runtime supports two execution engines:
| Engine | Requirement | Speed | Compatibility |
|---|---|---|---|
| JSEngine | None | Baseline | All browsers |
| WebGPUEngine | WebGPU API | GPU-accelerated | Chrome 113+, Edge 113+ |
When `target: 'webgpu'` is specified but WebGPU is unavailable, the session automatically falls back to the JS engine.
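The fallback rule reduces to a small pure function. A sketch (hypothetical helper; in a browser, availability would typically be probed via `navigator.gpu`, which is an assumption about the runtime's detection rather than something this document specifies):

```typescript
type Backend = "js" | "webgpu";

// Honor the requested target only when the environment supports it;
// otherwise degrade to the always-available JS engine.
function selectBackend(target: Backend, webgpuAvailable: boolean): Backend {
  return target === "webgpu" && webgpuAvailable ? "webgpu" : "js";
}
```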
## Examples

nn-deploy ships with 5 pre-built model examples covering different architectures. All are available in the playground.

| Model | Category | Input Shape | Architecture |
|---|---|---|---|
| MNIST MLP | Classic | `[1, 784]` | 2 hidden layers (MatMul + Add + ReLU) with Softmax output |
| Tiny CNN | CNN | `[1, 3, 32, 32]` | 2 conv blocks (Conv2D + BN + ReLU + Pool) + FC classifier |
| ResNet Block | CNN | `[1, 64, 16, 16]` | 2 conv layers with residual skip connection |
| Transformer Block | Transformer | `[1, 32, 64]` | Self-attention (Q/K/V) + FFN (GELU) with residual connections |
| DepthSep Conv | Efficient | `[1, 32, 16, 16]` | MobileNet-style depthwise + pointwise (1x1) conv |
### Optimization Effects

Here's what the optimization passes do to each model:

- **MNIST MLP:** MatMul + Add fuses to `FusedMatMulAdd`; the first chain fuses to `FusedLinearReLU` (9 nodes → ~7 nodes)
- **Tiny CNN:** both Conv2D + BN + ReLU chains fuse to `FusedConvBNReLU` (16 nodes → ~12 nodes)
- **ResNet Block:** the first conv chain fuses to `FusedConvBNReLU`, the second to `FusedConvBN`
- **Transformer:** MatMul + Add chains fuse in the FFN, and attention weights get quantized
- **DepthSep Conv:** both the depthwise and pointwise blocks undergo BN fusion and layout optimization to NHWC

> Try it: Open the playground, select any example, and click "Compile & Optimize" to see the full pass-by-pass transformation timeline.
## Architecture
nn-deploy follows a classic compiler architecture: frontend (parser) → IR → optimization passes → backend (codegen) → runtime.
### Immutable Graph IR

The core data structure is an immutable graph. Every transformation returns a new Graph object, preserving the full history for visualization and debugging:

```ts
interface Graph {
  name: string;
  nodes: Node[];
  edges: Edge[];
  passHistory: PassRecord[]; // full transformation log
}

interface Node {
  id: string;
  op: OpType; // one of 38 operation types
  name: string;
  inputs: Port[];
  outputs: Port[];
  attributes: Record<string, any>;
}

interface Edge {
  id: string;
  sourceNodeId: string;
  sourcePort: number;
  targetNodeId: string;
  targetPort: number;
  tensorType?: TensorType; // annotated by shape inference
}
```
### Data Flow

1. **Parse:** `parseDSL(source)` → tokenize → parse → buildGraph
2. **Optimize:** `runPipeline(graph, passes)` → each pass returns a new `Graph`
3. **Compile:** `compileModel(graph, options)` → runs passes + codegen → `CompiledModel`
4. **Execute:** `InferenceSession.create(model)` → select engine → `session.run(inputs)`
### Key Design Decisions

- **Immutable transformations:** each pass returns a new graph. No mutations, no side effects. Enables pass-history visualization and easy debugging.
- **Auto-constant creation:** undefined weight references in the DSL are auto-created as Constant nodes, simplifying model definitions.
- **Multi-target codegen:** the same optimized IR compiles to JS, WebGPU, or WASM. Backend choice is deferred to compile time.
- **Sandboxed execution:** the JS engine uses `new Function()` for sandboxed code execution in the browser.
- **Greedy memory planning:** a first-fit decreasing algorithm minimizes peak memory without complex ILP solvers.
## Tech Stack
| Layer | Technology |
|---|---|
| Language | TypeScript (strict mode) |
| Build | Turborepo + npm workspaces |
| Frontend | Next.js 15, React 19 |
| State | Zustand |
| Visualization | D3.js + ELK.js (graph layout) |
| GPU | WebGPU (WGSL compute shaders) |
| Styling | Tailwind CSS v4 |
| Deployment | Vercel |