Production Setup

Deploy Agent Tool Protocol in production with confidence using this comprehensive guide.

Architecture Considerations

Deployment Topologies

Single Server (Small Scale)

┌──────────────────┐
│  Load Balancer   │
│   (Optional)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    ATP Server    │
│  - Memory Cache  │
│  - Local State   │
└──────────────────┘

Use when:

  • < 100 requests/second
  • Development/staging
  • Single region deployment

Horizontal Scaling (Large Scale)

        ┌──────────────────┐
        │  Load Balancer   │
        └────────┬─────────┘
                 │
   ┌─────────┬───┴─────┬─────────┐
   ▼         ▼         ▼         ▼
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ ATP │   │ ATP │   │ ATP │   │ ATP │
│ #1  │   │ #2  │   │ #3  │   │ #4  │
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   │         │         │         │
   └─────────┴───┬─────┴─────────┘
                 │
                 ▼
        ┌──────────────────┐
        │      Redis       │
        │  (Shared State)  │
        └──────────────────┘

Use when:

  • > 100 requests/second
  • High availability required
  • Multi-region deployment
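
The main configuration difference from the single-server topology is that shared state moves out of process memory into Redis. A minimal sketch (assuming only the RedisCache provider shown under Production Configuration below) that every replica runs identically:

import { createServer } from '@mondaydotcomorg/atp-server';
import { RedisCache } from '@mondaydotcomorg/atp-providers';

// Every replica behind the load balancer points at the same REDIS_URL,
// so cached data is shared across instances. See Production Configuration
// below for the full provider setup.
const server = createServer({
  providers: {
    cache: new RedisCache({
      url: process.env.REDIS_URL || 'redis://localhost:6379',
      keyPrefix: 'atp:',
    }),
  },
});

server.listen(parseInt(process.env.PORT || '3333', 10));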

Environment Setup

Node.js Configuration

# Recommended Node.js version
node --version # v20.x LTS

# Environment variables
NODE_ENV=production
NODE_OPTIONS=--max-old-space-size=4096
UV_THREADPOOL_SIZE=128

System Requirements

Minimum:

  • CPU: 2 cores
  • RAM: 2 GB
  • Node.js: 18.x
  • OS: Linux, macOS, Windows

Recommended:

  • CPU: 4+ cores
  • RAM: 8+ GB
  • Node.js: 20.x LTS
  • OS: Linux (Ubuntu 20.04+, Debian 11+)

For High Scale:

  • CPU: 8+ cores
  • RAM: 16+ GB
  • Node.js: 20.x LTS
  • Redis: 6.x or 7.x
  • OS: Linux (Ubuntu 22.04 LTS)

Production Configuration

Basic Production Server

import { createServer, MB, HOUR, MINUTE } from '@mondaydotcomorg/atp-server';
import { RedisCache } from '@mondaydotcomorg/atp-providers';

const server = createServer({
  execution: {
    timeout: 60000,             // 60 seconds
    memory: 512 * MB,           // 512 MB per execution
    llmCalls: 50,               // Max 50 LLM calls
    provenanceMode: 'track',    // Enable tracking
    securityPolicies: [
      // Add security policies
    ],
  },

  clientInit: {
    tokenTTL: 24 * HOUR,        // 24 hour tokens
    tokenRotation: 30 * MINUTE, // Rotate every 30 min
  },

  executionState: {
    ttl: 3600,                  // 1 hour state retention
    maxPauseDuration: 3600,     // Max 1 hour pause
  },

  discovery: {
    embeddings: true,           // Enable semantic search
  },

  audit: {
    enabled: true,              // Enable audit logs
  },

  otel: {
    enabled: true,
    serviceName: process.env.SERVICE_NAME || 'atp-server',
    traceEndpoint: process.env.OTEL_TRACE_ENDPOINT,
    metricsEndpoint: process.env.OTEL_METRICS_ENDPOINT,
  },

  providers: {
    cache: new RedisCache({
      url: process.env.REDIS_URL || 'redis://localhost:6379',
      keyPrefix: 'atp:',
      defaultTTL: 3600,
    }),
  },

  logger: 'warn', // Only log warnings and errors
});

// Register your APIs
server.registerAPI('your-api', {
  // API functions
});

// Health check endpoint
server.use(async (context, next) => {
  if (context.path === '/health') {
    return {
      status: 'ok',
      timestamp: Date.now(),
      version: process.env.VERSION || '1.0.0',
    };
  }
  return next();
});

// Graceful shutdown
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, shutting down gracefully...');
  await server.stop();
  process.exit(0);
});

process.on('SIGINT', async () => {
  console.log('SIGINT received, shutting down gracefully...');
  await server.stop();
  process.exit(0);
});

// Start server
const PORT = parseInt(process.env.PORT || '3333');
server.listen(PORT, () => {
  console.log(`ATP Server running on port ${PORT}`);
  console.log(`Environment: ${process.env.NODE_ENV}`);
  console.log(`Version: ${process.env.VERSION || '1.0.0'}`);
});

Environment Variables

# Server Configuration
PORT=3333
NODE_ENV=production
SERVICE_NAME=atp-server
VERSION=1.0.0

# Redis Configuration
REDIS_URL=redis://:password@redis-host:6379
REDIS_TLS=true
REDIS_KEY_PREFIX=atp:

# Authentication
ATP_API_KEY=your-secret-api-key-here
JWT_SECRET=your-jwt-secret-here

# OpenTelemetry
OTEL_ENABLED=true
OTEL_SERVICE_NAME=atp-server
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics

# Logging
LOG_LEVEL=warn
LOG_FORMAT=json

# Security
CORS_ORIGINS=https://yourdomain.com,https://app.yourdomain.com
RATE_LIMIT_REQUESTS_PER_MINUTE=100
RATE_LIMIT_EXECUTIONS_PER_HOUR=1000

# Execution Limits
EXECUTION_TIMEOUT=60000
EXECUTION_MEMORY=536870912 # 512 MB
EXECUTION_MAX_LLM_CALLS=50
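
Most of these values have no safe defaults, so it helps to fail fast at boot when one is missing. A minimal sketch (the variable list is illustrative, not exhaustive):

// Fail-fast check for required configuration at startup
const required = ['REDIS_URL', 'ATP_API_KEY', 'JWT_SECRET'] as const;

const missing = required.filter((name) => !process.env[name]);
if (missing.length > 0) {
  console.error(`Missing required environment variables: ${missing.join(', ')}`);
  process.exit(1);
}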

Docker Deployment

Dockerfile

FROM node:20-alpine AS builder

WORKDIR /app

# Install dependencies
COPY package*.json ./
COPY yarn.lock ./
RUN yarn install --frozen-lockfile

# Copy source
COPY . .

# Build
RUN yarn build

# Production image
FROM node:20-alpine

WORKDIR /app

# Copy built files
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001

USER nodejs

# Expose port
EXPOSE 3333

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD node -e "require('http').get('http://localhost:3333/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"

# Start server
CMD ["node", "dist/server.js"]

docker-compose.yml

version: '3.8'

services:
  atp-server:
    build: .
    ports:
      - "3333:3333"
    environment:
      - NODE_ENV=production
      - PORT=3333
      - REDIS_URL=redis://redis:6379
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
    depends_on:
      - redis
      - otel-collector
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2G

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: --config=/etc/otel-collector-config.yaml
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4318:4318" # OTLP HTTP
      - "8888:8888" # Prometheus metrics
    restart: unless-stopped

volumes:
  redis-data:

Build and Run

# Build image
docker build -t atp-server:latest .

# Run with docker-compose
docker-compose up -d

# Check logs
docker-compose logs -f atp-server

# Stop
docker-compose down

Kubernetes Deployment

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-server
  labels:
    app: atp-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: atp-server
  template:
    metadata:
      labels:
        app: atp-server
    spec:
      containers:
        - name: atp-server
          image: your-registry/atp-server:latest
          ports:
            - containerPort: 3333
              name: http
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "3333"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: atp-secrets
                  key: redis-url
            - name: ATP_API_KEY
              valueFrom:
                secretKeyRef:
                  name: atp-secrets
                  key: api-key
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3333
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 3333
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: atp-server
spec:
  selector:
    app: atp-server
  ports:
    - port: 80
      targetPort: 3333
  type: LoadBalancer
---
apiVersion: v1
kind: Secret
metadata:
  name: atp-secrets
type: Opaque
data:
  redis-url: <base64-encoded-redis-url>
  api-key: <base64-encoded-api-key>
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Apply Configuration

# Create namespace
kubectl create namespace atp

# Apply configuration
kubectl apply -f deployment.yaml -n atp

# Check status
kubectl get pods -n atp
kubectl get svc -n atp

# Scale manually
kubectl scale deployment atp-server --replicas=5 -n atp

# View logs
kubectl logs -f deployment/atp-server -n atp

Load Balancing

Nginx Configuration

upstream atp_backend {
    least_conn;

    server atp-server-1:3333 max_fails=3 fail_timeout=30s;
    server atp-server-2:3333 max_fails=3 fail_timeout=30s;
    server atp-server-3:3333 max_fails=3 fail_timeout=30s;

    keepalive 64;
}

server {
    listen 80;
    server_name api.yourdomain.com;

    # Redirect to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate /etc/ssl/certs/cert.pem;
    ssl_certificate_key /etc/ssl/private/key.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;

    # Timeouts for long-running executions
    proxy_connect_timeout 60s;
    proxy_send_timeout 300s;
    proxy_read_timeout 300s;

    location / {
        proxy_pass http://atp_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (if needed). These must live in the location block:
        # once any proxy_set_header is defined here, directives from the
        # server block are no longer inherited.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /health {
        proxy_pass http://atp_backend/health;
        access_log off;
    }
}

Monitoring

OpenTelemetry Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: 0.0.0.0:8888

  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  logging:
    loglevel: warn

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]

Prometheus Metrics

Configure Prometheus to scrape the collector's metrics endpoint, then watch the key ATP metrics listed below:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'atp-server'
    static_configs:
      - targets: ['otel-collector:8888']

Important Metrics:

  • atp_execution_duration_seconds: Execution time
  • atp_execution_memory_bytes: Memory usage
  • atp_llm_calls_total: LLM call count
  • atp_cache_hits_total: Cache hit rate
  • atp_errors_total: Error count
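
If you want to emit additional counters alongside these (for example, per-API error counts), one option is to use the OpenTelemetry JS API directly. This sketch assumes the server's OTel integration registers a global MeterProvider when otel.enabled is set; if it does not, the calls below are harmless no-ops:

import { metrics } from '@opentelemetry/api';

// Uses whatever MeterProvider is registered globally in the process.
const meter = metrics.getMeter('atp-server');
const apiErrors = meter.createCounter('atp_api_errors_total', {
  description: 'Errors per registered API',
});

// Somewhere in an API handler's error path:
apiErrors.add(1, { api: 'your-api' });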

Grafana Dashboard

{
  "dashboard": {
    "title": "ATP Server Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(atp_requests_total[5m])" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(atp_errors_total[5m])" }
        ]
      },
      {
        "title": "Execution Duration",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(atp_execution_duration_seconds_bucket[5m]))" }
        ]
      }
    ]
  }
}

Security Best Practices

1. API Key Management

// Use environment variables
const API_KEY = process.env.ATP_API_KEY;

// Rotate keys regularly
// Use AWS Secrets Manager, HashiCorp Vault, etc.

// Implement middleware
server.use(async (context, next) => {
  const apiKey = context.headers['x-api-key'];

  if (!apiKey || apiKey !== API_KEY) {
    throw new Error('Unauthorized');
  }

  return next();
});
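
If you would rather not keep the key in an environment variable at all, you can pull it from a secrets store at startup. A sketch using the AWS SDK v3 Secrets Manager client (the secret name 'atp/api-key' is just an example):

import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

// Fetch the API key once at startup; region and credentials come from the
// standard AWS environment. 'atp/api-key' is an example secret name.
async function loadApiKey(): Promise<string> {
  const client = new SecretsManagerClient({});
  const result = await client.send(
    new GetSecretValueCommand({ SecretId: 'atp/api-key' }),
  );
  if (!result.SecretString) {
    throw new Error('Secret atp/api-key has no string value');
  }
  return result.SecretString;
}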

2. Rate Limiting

import { RateLimiterMemory } from 'rate-limiter-flexible';

const rateLimiter = new RateLimiterMemory({
  points: 100,  // 100 requests
  duration: 60, // per 60 seconds
});

server.use(async (context, next) => {
  try {
    await rateLimiter.consume(context.clientId || context.ip);
    return next();
  } catch {
    throw new Error('Rate limit exceeded');
  }
});
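
RateLimiterMemory counts requests per process, so with several ATP replicas behind a load balancer each instance gets its own budget. For a cluster-wide limit, rate-limiter-flexible also ships a Redis-backed limiter; a sketch assuming an ioredis client:

import Redis from 'ioredis';
import { RateLimiterRedis } from 'rate-limiter-flexible';

// Shared counters across all ATP instances
const redisClient = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const rateLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: 'atp:ratelimit',
  points: 100,  // 100 requests
  duration: 60, // per 60 seconds
});

The middleware above works unchanged; only the limiter construction differs.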

3. CORS Configuration

server.use(async (context, next) => {
  const allowedOrigins = process.env.CORS_ORIGINS?.split(',') || [];
  const origin = context.headers['origin'];

  if (allowedOrigins.includes(origin)) {
    context.setHeader('Access-Control-Allow-Origin', origin);
    context.setHeader('Access-Control-Allow-Methods', 'POST, GET, OPTIONS');
    context.setHeader('Access-Control-Allow-Headers', 'Content-Type, Authorization');
  }

  return next();
});

4. Security Headers

server.use(async (context, next) => {
  context.setHeader('X-Content-Type-Options', 'nosniff');
  context.setHeader('X-Frame-Options', 'DENY');
  context.setHeader('X-XSS-Protection', '1; mode=block');
  context.setHeader('Strict-Transport-Security', 'max-age=31536000; includeSubDomains');

  return next();
});

Performance Optimization

1. Connection Pooling

// Database connection pooling
import { Pool } from 'pg';

const pool = new Pool({
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});
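
Checked-out clients must be released back to the pool; for one-off statements, pool.query handles acquisition and release automatically. A short usage sketch:

// One-off query: the pool acquires and releases a client internally
const { rows } = await pool.query('SELECT now() AS server_time');

// Multi-statement work: check out a client and always release it
const client = await pool.connect();
try {
  await client.query('BEGIN');
  // ... queries that must share one connection/transaction ...
  await client.query('COMMIT');
} catch (err) {
  await client.query('ROLLBACK');
  throw err;
} finally {
  client.release();
}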

2. Caching Strategy

import { RedisCache } from '@mondaydotcomorg/atp-providers';

const cache = new RedisCache({
  url: process.env.REDIS_URL,
  keyPrefix: 'atp:',
  defaultTTL: 3600,
  // Connection pooling
  maxRetriesPerRequest: 3,
  enableReadyCheck: true,
});

3. Code Optimization

// Pre-compile frequently used APIs
const apiDefinitions = await precompileAPIs();

// Cache type definitions
const typeDefsCache = new Map();

// Reuse V8 isolates
const isolatePool = createIsolatePool({ size: 10 });

Backup and Recovery

Redis Backup

#!/bin/bash
# Automated backup script
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
redis-cli --rdb /backup/redis-$TIMESTAMP.rdb

# Keep only last 7 days
find /backup -name "redis-*.rdb" -mtime +7 -delete

Disaster Recovery Plan

  1. Regular Backups: Automated daily backups
  2. Multi-Region: Deploy in multiple regions
  3. Monitoring: Alert on failures
  4. Runbooks: Document recovery procedures

Troubleshooting

High Memory Usage

  • Reduce execution.memory limit
  • Implement memory pooling
  • Monitor for memory leaks
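
To reduce the per-execution memory limit, adjust execution.memory in the server config (see Production Configuration above). A minimal sketch showing just the relevant field; merge it into your existing config rather than replacing it:

import { createServer, MB } from '@mondaydotcomorg/atp-server';

// Tighter per-execution cap than the 512 MB used earlier in this guide
const server = createServer({
  execution: {
    memory: 256 * MB,
  },
});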

Slow Performance

  • Enable Redis caching
  • Implement connection pooling
  • Optimize API handlers
  • Use horizontal scaling

Connection Issues

  • Check network configuration
  • Verify firewall rules
  • Test Redis connectivity
  • Review load balancer config