Production Setup

Deploy Agent Tool Protocol in production with confidence using this comprehensive guide.

Architecture Considerations

Deployment Topologies

Single Server (Small Scale)

┌──────────────────┐
│  Load Balancer   │
│   (Optional)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    ATP Server    │
│  - Memory Cache  │
│  - Local State   │
└──────────────────┘

Use when:

  • < 100 requests/second
  • Development/staging
  • Single region deployment

Horizontal Scaling (Large Scale)

        ┌──────────────────┐
        │  Load Balancer   │
        └────────┬─────────┘
                 │
   ┌─────────┬───┴─────┬─────────┐
   ▼         ▼         ▼         ▼
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ ATP │   │ ATP │   │ ATP │   │ ATP │
│ #1  │   │ #2  │   │ #3  │   │ #4  │
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   │         │         │         │
   └─────────┴───┬─────┴─────────┘
                 │
                 ▼
        ┌──────────────────┐
        │      Redis       │
        │  (Shared State)  │
        └──────────────────┘

Use when:

  • > 100 requests/second
  • High availability required
  • Multi-region deployment
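
The main configuration difference from the single-server topology is that shared state moves out of process memory into Redis. A minimal sketch (assuming only the RedisCache provider shown under Production Configuration below) that every replica runs identically:

import { createServer } from '@mondaydotcomorg/atp-server';
import { RedisCache } from '@mondaydotcomorg/atp-providers';

// Every replica behind the load balancer points at the same REDIS_URL,
// so cached data is shared across instances. See Production Configuration
// below for the full provider setup.
const server = createServer({
  providers: {
    cache: new RedisCache({
      url: process.env.REDIS_URL || 'redis://localhost:6379',
      keyPrefix: 'atp:',
    }),
  },
});

server.listen(parseInt(process.env.PORT || '3333', 10));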

Environment Setup

Node.js Configuration

# Recommended Node.js version
node --version # v20.x LTS

# Environment variables
NODE_ENV=production
NODE_OPTIONS=--max-old-space-size=4096
UV_THREADPOOL_SIZE=128

System Requirements

Minimum:

  • CPU: 2 cores
  • RAM: 2 GB
  • Node.js: 18.x
  • OS: Linux, macOS, Windows

Recommended:

  • CPU: 4+ cores
  • RAM: 8+ GB
  • Node.js: 20.x LTS
  • OS: Linux (Ubuntu 20.04+, Debian 11+)

For High Scale:

  • CPU: 8+ cores
  • RAM: 16+ GB
  • Node.js: 20.x LTS
  • Redis: 6.x or 7.x
  • OS: Linux (Ubuntu 22.04 LTS)

Production Configuration

Basic Production Server

import { createServer, MB, HOUR, MINUTE } from '@mondaydotcomorg/atp-server';
import { RedisCache } from '@mondaydotcomorg/atp-providers';

const server = createServer({
  execution: {
    timeout: 60000,             // 60 seconds
    memory: 512 * MB,           // 512 MB per execution
    llmCalls: 50,               // Max 50 LLM calls
    provenanceMode: 'track',    // Enable tracking
    securityPolicies: [
      // Add security policies
    ],
  },

  clientInit: {
    tokenTTL: 24 * HOUR,        // 24 hour tokens
    tokenRotation: 30 * MINUTE, // Rotate every 30 min
  },

  executionState: {
    ttl: 3600,                  // 1 hour state retention
    maxPauseDuration: 3600,     // Max 1 hour pause
  },

  discovery: {
    embeddings: true,           // Enable semantic search
  },

  audit: {
    enabled: true,              // Enable audit logs
  },

  otel: {
    enabled: true,
    serviceName: process.env.SERVICE_NAME || 'atp-server',
    traceEndpoint: process.env.OTEL_TRACE_ENDPOINT,
    metricsEndpoint: process.env.OTEL_METRICS_ENDPOINT,
  },

  providers: {
    cache: new RedisCache({
      url: process.env.REDIS_URL || 'redis://localhost:6379',
      keyPrefix: 'atp:',
      defaultTTL: 3600,
    }),
  },

  logger: 'warn', // Only log warnings and errors
});

// Register your APIs
server.registerAPI('your-api', {
  // API functions
});

// Health check endpoint
server.use(async (context, next) => {
  if (context.path === '/health') {
    return {
      status: 'ok',
      timestamp: Date.now(),
      version: process.env.VERSION || '1.0.0',
    };
  }
  return next();
});

// Graceful shutdown
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, shutting down gracefully...');
  await server.stop();
  process.exit(0);
});

process.on('SIGINT', async () => {
  console.log('SIGINT received, shutting down gracefully...');
  await server.stop();
  process.exit(0);
});

// Start server
const PORT = parseInt(process.env.PORT || '3333');
server.listen(PORT, () => {
  console.log(`ATP Server running on port ${PORT}`);
  console.log(`Environment: ${process.env.NODE_ENV}`);
  console.log(`Version: ${process.env.VERSION || '1.0.0'}`);
});

Environment Variables

# Server Configuration
PORT=3333
NODE_ENV=production
SERVICE_NAME=atp-server
VERSION=1.0.0

# Redis Configuration
REDIS_URL=redis://:password@redis-host:6379
REDIS_TLS=true
REDIS_KEY_PREFIX=atp:

# Authentication
ATP_API_KEY=your-secret-api-key-here
JWT_SECRET=your-jwt-secret-here

# OpenTelemetry
OTEL_ENABLED=true
OTEL_SERVICE_NAME=atp-server
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics

# Logging
LOG_LEVEL=warn
LOG_FORMAT=json

# Security
CORS_ORIGINS=https://yourdomain.com,https://app.yourdomain.com
RATE_LIMIT_REQUESTS_PER_MINUTE=100
RATE_LIMIT_EXECUTIONS_PER_HOUR=1000

# Execution Limits
EXECUTION_TIMEOUT=60000
EXECUTION_MEMORY=536870912 # 512 MB
EXECUTION_MAX_LLM_CALLS=50
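
Most of these values have no safe defaults, so it helps to fail fast at boot when one is missing. A minimal sketch (the variable list is illustrative, not exhaustive):

// Fail-fast check for required configuration at startup
const required = ['REDIS_URL', 'ATP_API_KEY', 'JWT_SECRET'] as const;

const missing = required.filter((name) => !process.env[name]);
if (missing.length > 0) {
  console.error(`Missing required environment variables: ${missing.join(', ')}`);
  process.exit(1);
}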

Docker Deployment

Dockerfile

FROM node:20-alpine AS builder

WORKDIR /app

# Install dependencies
COPY package*.json ./
COPY yarn.lock ./
RUN yarn install --frozen-lockfile

# Copy source
COPY . .

# Build
RUN yarn build

# Production image
FROM node:20-alpine

WORKDIR /app

# Copy built files
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001

USER nodejs

# Expose port
EXPOSE 3333

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD node -e "require('http').get('http://localhost:3333/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"

# Start server
CMD ["node", "dist/server.js"]

docker-compose.yml

version: '3.8'

services:
  atp-server:
    build: .
    ports:
      - "3333:3333"
    environment:
      - NODE_ENV=production
      - PORT=3333
      - REDIS_URL=redis://redis:6379
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
    depends_on:
      - redis
      - otel-collector
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2G

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: --config=/etc/otel-collector-config.yaml
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4318:4318" # OTLP HTTP
      - "8888:8888" # Prometheus metrics
    restart: unless-stopped

volumes:
  redis-data:

Build and Run

# Build image
docker build -t atp-server:latest .

# Run with docker-compose
docker-compose up -d

# Check logs
docker-compose logs -f atp-server

# Stop
docker-compose down

Kubernetes Deployment

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-server
  labels:
    app: atp-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: atp-server
  template:
    metadata:
      labels:
        app: atp-server
    spec:
      containers:
        - name: atp-server
          image: your-registry/atp-server:latest
          ports:
            - containerPort: 3333
              name: http
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "3333"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: atp-secrets
                  key: redis-url
            - name: ATP_API_KEY
              valueFrom:
                secretKeyRef:
                  name: atp-secrets
                  key: api-key
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3333
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 3333
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: atp-server
spec:
  selector:
    app: atp-server
  ports:
    - port: 80
      targetPort: 3333
  type: LoadBalancer
---
apiVersion: v1
kind: Secret
metadata:
  name: atp-secrets
type: Opaque
data:
  redis-url: <base64-encoded-redis-url>
  api-key: <base64-encoded-api-key>
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Apply Configuration

# Create namespace
kubectl create namespace atp

# Apply configuration
kubectl apply -f deployment.yaml -n atp

# Check status
kubectl get pods -n atp
kubectl get svc -n atp

# Scale manually
kubectl scale deployment atp-server --replicas=5 -n atp

# View logs
kubectl logs -f deployment/atp-server -n atp

Load Balancing

Nginx Configuration

upstream atp_backend {
    least_conn;

    server atp-server-1:3333 max_fails=3 fail_timeout=30s;
    server atp-server-2:3333 max_fails=3 fail_timeout=30s;
    server atp-server-3:3333 max_fails=3 fail_timeout=30s;

    keepalive 64;
}

server {
    listen 80;
    server_name api.yourdomain.com;

    # Redirect to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate /etc/ssl/certs/cert.pem;
    ssl_certificate_key /etc/ssl/private/key.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;

    # Timeouts for long-running executions
    proxy_connect_timeout 60s;
    proxy_send_timeout 300s;
    proxy_read_timeout 300s;

    location / {
        proxy_pass http://atp_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (if needed). These must live in the location block:
        # once any proxy_set_header is defined here, directives from the
        # server block are no longer inherited.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /health {
        proxy_pass http://atp_backend/health;
        access_log off;
    }
}

Monitoring

OpenTelemetry Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: 0.0.0.0:8888

  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  logging:
    loglevel: warn

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]

Prometheus Metrics

Configure Prometheus to scrape the collector's metrics endpoint, then watch the key ATP metrics listed below:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'atp-server'
    static_configs:
      - targets: ['otel-collector:8888']

Important Metrics:

  • atp_execution_duration_seconds: Execution time
  • atp_execution_memory_bytes: Memory usage
  • atp_llm_calls_total: LLM call count
  • atp_cache_hits_total: Cache hit rate
  • atp_errors_total: Error count
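
If you want to emit additional counters alongside these (for example, per-API error counts), one option is to use the OpenTelemetry JS API directly. This sketch assumes the server's OTel integration registers a global MeterProvider when otel.enabled is set; if it does not, the calls below are harmless no-ops:

import { metrics } from '@opentelemetry/api';

// Uses whatever MeterProvider is registered globally in the process.
const meter = metrics.getMeter('atp-server');
const apiErrors = meter.createCounter('atp_api_errors_total', {
  description: 'Errors per registered API',
});

// Somewhere in an API handler's error path:
apiErrors.add(1, { api: 'your-api' });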

Grafana Dashboard

{
  "dashboard": {
    "title": "ATP Server Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(atp_requests_total[5m])" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(atp_errors_total[5m])" }
        ]
      },
      {
        "title": "Execution Duration",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(atp_execution_duration_seconds_bucket[5m]))" }
        ]
      }
    ]
  }
}

Security Best Practices

1. API Key Management

// Use environment variables
const API_KEY = process.env.ATP_API_KEY;

// Rotate keys regularly
// Use AWS Secrets Manager, HashiCorp Vault, etc.

// Implement middleware
server.use(async (context, next) => {
  const apiKey = context.headers['x-api-key'];

  if (!apiKey || apiKey !== API_KEY) {
    throw new Error('Unauthorized');
  }

  return next();
});
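
If you would rather not keep the key in an environment variable at all, you can pull it from a secrets store at startup. A sketch using the AWS SDK v3 Secrets Manager client (the secret name 'atp/api-key' is just an example):

import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

// Fetch the API key once at startup; region and credentials come from the
// standard AWS environment. 'atp/api-key' is an example secret name.
async function loadApiKey(): Promise<string> {
  const client = new SecretsManagerClient({});
  const result = await client.send(
    new GetSecretValueCommand({ SecretId: 'atp/api-key' }),
  );
  if (!result.SecretString) {
    throw new Error('Secret atp/api-key has no string value');
  }
  return result.SecretString;
}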

2. Rate Limiting

import { RateLimiterMemory } from 'rate-limiter-flexible';

const rateLimiter = new RateLimiterMemory({
  points: 100,  // 100 requests
  duration: 60, // per 60 seconds
});

server.use(async (context, next) => {
  try {
    await rateLimiter.consume(context.clientId || context.ip);
    return next();
  } catch {
    throw new Error('Rate limit exceeded');
  }
});
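
RateLimiterMemory counts requests per process, so with several ATP replicas behind a load balancer each instance gets its own budget. For a cluster-wide limit, rate-limiter-flexible also ships a Redis-backed limiter; a sketch assuming an ioredis client:

import Redis from 'ioredis';
import { RateLimiterRedis } from 'rate-limiter-flexible';

// Shared counters across all ATP instances
const redisClient = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const rateLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: 'atp:ratelimit',
  points: 100,  // 100 requests
  duration: 60, // per 60 seconds
});

The middleware above works unchanged; only the limiter construction differs.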

3. CORS Configuration

server.use(async (context, next) => {
  const allowedOrigins = process.env.CORS_ORIGINS?.split(',') || [];
  const origin = context.headers['origin'];

  if (allowedOrigins.includes(origin)) {
    context.setHeader('Access-Control-Allow-Origin', origin);
    context.setHeader('Access-Control-Allow-Methods', 'POST, GET, OPTIONS');
    context.setHeader('Access-Control-Allow-Headers', 'Content-Type, Authorization');
  }

  return next();
});

4. Security Headers

server.use(async (context, next) => {
  context.setHeader('X-Content-Type-Options', 'nosniff');
  context.setHeader('X-Frame-Options', 'DENY');
  context.setHeader('X-XSS-Protection', '1; mode=block');
  context.setHeader('Strict-Transport-Security', 'max-age=31536000; includeSubDomains');

  return next();
});

Performance Optimization

1. Connection Pooling

// Database connection pooling
import { Pool } from 'pg';

const pool = new Pool({
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});
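
Checked-out clients must be released back to the pool; for one-off statements, pool.query handles acquisition and release automatically. A short usage sketch:

// One-off query: the pool acquires and releases a client internally
const { rows } = await pool.query('SELECT now() AS server_time');

// Multi-statement work: check out a client and always release it
const client = await pool.connect();
try {
  await client.query('BEGIN');
  // ... queries that must share one connection/transaction ...
  await client.query('COMMIT');
} catch (err) {
  await client.query('ROLLBACK');
  throw err;
} finally {
  client.release();
}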

2. Caching Strategy

import { RedisCache } from '@mondaydotcomorg/atp-providers';

const cache = new RedisCache({
  url: process.env.REDIS_URL,
  keyPrefix: 'atp:',
  defaultTTL: 3600,
  // Connection pooling
  maxRetriesPerRequest: 3,
  enableReadyCheck: true,
});

3. Code Optimization

// Pre-compile frequently used APIs
const apiDefinitions = await precompileAPIs();

// Cache type definitions
const typeDefsCache = new Map();

// Reuse V8 isolates
const isolatePool = createIsolatePool({ size: 10 });

Backup and Recovery

Redis Backup

#!/bin/bash
# Automated backup script
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
redis-cli --rdb /backup/redis-$TIMESTAMP.rdb

# Keep only last 7 days
find /backup -name "redis-*.rdb" -mtime +7 -delete

Disaster Recovery Plan

  1. Regular Backups: Automated daily backups
  2. Multi-Region: Deploy in multiple regions
  3. Monitoring: Alert on failures
  4. Runbooks: Document recovery procedures

Troubleshooting

High Memory Usage

  • Reduce execution.memory limit
  • Implement memory pooling
  • Monitor for memory leaks
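
To reduce the per-execution memory limit, adjust execution.memory in the server config (see Production Configuration above). A minimal sketch showing just the relevant field; merge it into your existing config rather than replacing it:

import { createServer, MB } from '@mondaydotcomorg/atp-server';

// Tighter per-execution cap than the 512 MB used earlier in this guide
const server = createServer({
  execution: {
    memory: 256 * MB,
  },
});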

Slow Performance

  • Enable Redis caching
  • Implement connection pooling
  • Optimize API handlers
  • Use horizontal scaling

Connection Issues

  • Check network configuration
  • Verify firewall rules
  • Test Redis connectivity
  • Review load balancer config