
Global Metrics Monitoring & System Benchmarking

32.1 System Overview

The Global Metrics Monitoring & System Benchmarking module provides real-time, comprehensive observability and performance benchmarking for all microservices in the quantitative trading system. It enables SLA-driven operations, automated alerting, historical trend analysis, and seamless integration with Prometheus and Grafana for visualization.

32.1.1 Core Objectives

  • Real-Time Metrics: Collect latency, TPS, CPU, memory, disk I/O, network, and account/strategy risk metrics from all microservices
  • SLA Enforcement: Define and monitor Service Level Agreements (e.g., matching latency < 500μs, order success rate > 99.99%)
  • Automated Alerting: Trigger alerts on threshold breaches (e.g., high latency, overload)
  • Historical Analysis: Store and analyze trends (e.g., 7-day system load)
  • Benchmark Automation: Periodically benchmark core modules and generate reports
  • Visualization: Grafana or custom dashboards for all key metrics

32.2 Architecture Design

32.2.1 Microservice Architecture

Global Metrics Center Service:

services/global-metrics-center/
├── src/
│   ├── main.py
│   ├── collector/
│   │   └── metrics_collector.py
│   ├── aggregator/
│   │   └── metrics_aggregator.py
│   ├── alert/
│   │   └── metrics_alert.py
│   ├── benchmark/
│   │   └── benchmark_runner.py
│   ├── api/
│   │   └── metrics_api.py
│   ├── config.py
│   └── requirements.txt
└── Dockerfile

32.2.2 Core Components

  • Metrics Collector: Each microservice reports its own metrics (CPU, memory, I/O, latency, TPS, etc.); a metrics-exposure sketch follows this list
  • Central Aggregator: Aggregates and normalizes metrics from all services
  • Metrics Storage: Stores metrics in a time-series database (e.g., Prometheus TSDB)
  • Alert System: Monitors thresholds and triggers alerts (e.g., via Telegram/Slack)
  • Benchmark Runner: Periodically runs performance tests on core modules
  • API Interface: Exposes metrics and system status via REST API
  • Frontend Dashboard: Grafana or custom React dashboard for visualization
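
As a concrete illustration of the collect-and-scrape flow, the minimal sketch below shows a service exposing a Prometheus-compatible /metrics endpoint via the prometheus_client library; the metric name, port, and sample value are illustrative assumptions, not part of this design.

from prometheus_client import Gauge, start_http_server

# Hypothetical gauge; each microservice registers its own metrics
order_latency_us = Gauge("order_latency_microseconds", "Order handling latency")

start_http_server(9100)      # Prometheus scrapes http://<host>:9100/metrics
order_latency_us.set(420.0)  # updated from the service's hot path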

32.3 Module Design

32.3.1 Metrics Collector (metrics_collector.py)

  • Uses psutil and internal hooks to collect:
      ◦ CPU, memory, disk I/O, network bandwidth
      ◦ Service-specific metrics (latency, TPS, order success rate, etc.)
      ◦ Account/strategy PnL and risk exposure
  • Periodically pushes metrics to the aggregator (see the push-loop sketch after the snippet below)
import time

import psutil

class MetricsCollector:
    def collect_system_metrics(self):
        """Snapshot host-level metrics for this service instance."""
        return {
            "cpu_percent": psutil.cpu_percent(interval=None),  # % since previous call
            "memory_percent": psutil.virtual_memory().percent,
            "disk_io": psutil.disk_io_counters()._asdict(),
            "net_io": psutil.net_io_counters()._asdict(),
            "timestamp": time.time(),
        }
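
The push to the aggregator can be a simple timed loop. The sketch below assumes the aggregator exposes a hypothetical POST /metrics/{service_name} endpoint; the URL, service name, and interval are placeholders.

import time

import requests

AGGREGATOR_URL = "http://global-metrics-center:8000"  # assumed address

def push_loop(collector, service_name, interval_s=5):
    # Report a fresh snapshot every interval_s seconds
    while True:
        metrics = collector.collect_system_metrics()
        requests.post(f"{AGGREGATOR_URL}/metrics/{service_name}",
                      json=metrics, timeout=2)
        time.sleep(interval_s)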

32.3.2 Central Aggregator (metrics_aggregator.py)

  • Receives and aggregates metrics from all services
  • Stores in Prometheus-compatible format
  • Supports querying by service, metric, and time range
class MetricsAggregator:
    def __init__(self):
        self.metrics_storage = []  # in-memory store; production would use a TSDB

    def aggregate(self, service_name, metrics):
        # Normalize and append one report from a service
        self.metrics_storage.append({"service": service_name, "metrics": metrics})

    def query_service_metrics(self, service_name):
        # Used by the REST API in 32.3.5
        return [r for r in self.metrics_storage if r["service"] == service_name]
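
To honor the "Prometheus-compatible format" requirement above, stored records can be rendered in the Prometheus text exposition format. This is a hand-rolled illustration only; a real deployment would rely on prometheus_client or remote write rather than string formatting.

def to_prometheus_text(record):
    # Emit one 'metric{service="..."} value' line per numeric field
    lines = []
    for key, value in record["metrics"].items():
        if isinstance(value, (int, float)):
            lines.append(f'{key}{{service="{record["service"]}"}} {value}')
    return "\n".join(lines)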

32.3.3 Alert System (metrics_alert.py)

  • Checks metrics against thresholds (e.g., CPU > 90%, latency > 1ms)
  • Triggers alerts via messaging (Telegram, Slack, Email)
class MetricsAlert:
    # Static thresholds; the CPU and memory limits mirror the examples above
    def check_thresholds(self, metrics):
        alerts = []
        if metrics["cpu_percent"] > 90:
            alerts.append("High CPU Usage Alert")
        if metrics["memory_percent"] > 85:
            alerts.append("High Memory Usage Alert")
        return alerts
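
Dispatching the returned alerts is a thin messaging call. The sketch below uses the Telegram Bot API's sendMessage method; BOT_TOKEN and CHAT_ID are placeholders you must provision, and Slack/Email channels would follow the same pattern.

import requests

BOT_TOKEN = "<telegram-bot-token>"  # placeholder credential
CHAT_ID = "<ops-chat-id>"           # placeholder destination

def send_telegram_alert(text):
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    requests.post(url, json={"chat_id": CHAT_ID, "text": text}, timeout=5)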

32.3.4 Benchmark Runner (benchmark_runner.py)

  • Periodically runs performance tests:
      ◦ Matching engine TPS
      ◦ Backtest engine throughput
      ◦ Data playback speed
      ◦ DB query latency
  • Archives weekly benchmark reports (see the archiving sketch after the snippet below)
import time

class BenchmarkRunner:
    def __init__(self, match_engine):
        self.match_engine = match_engine  # matching-engine client, injected

    def run_tps_test(self, num_orders=100000):
        # Fire a fixed batch of orders and derive sustained TPS
        start_time = time.time()
        for _ in range(num_orders):
            self.match_engine.place_order(...)  # order parameters elided
        duration = time.time() - start_time
        return {"tps": num_orders / duration}
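
For the weekly archive, results can be serialized to timestamped JSON files. A minimal sketch, assuming a local /var/benchmarks directory; the path and file naming are illustrative.

import json
import time

def archive_report(results, directory="/var/benchmarks"):
    # One JSON report per run, named by date
    path = f"{directory}/benchmark_{time.strftime('%Y%m%d')}.json"
    with open(path, "w") as f:
        json.dump({"generated_at": time.time(), "results": results}, f, indent=2)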

32.3.5 API Interface (metrics_api.py)

  • FastAPI-based endpoints for querying metrics and system status
from fastapi import APIRouter

router = APIRouter()
aggregator = MetricsAggregator()  # shared instance of the aggregator from 32.3.2

@router.get("/metrics/{service_name}")
async def get_metrics(service_name: str):
    return aggregator.query_service_metrics(service_name)
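
Wiring the router into the service entry point (main.py in the layout of 32.2.1) takes a few lines; the application title is an assumption.

from fastapi import FastAPI

app = FastAPI(title="Global Metrics Center")
app.include_router(router)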

32.3.6 Frontend Dashboard

  • Grafana or custom React dashboard
  • Visualizes:
      ◦ CPU/memory/disk/network curves
      ◦ Latency and TPS distributions
      ◦ Account risk and PnL trends
      ◦ SLA achievement rates
  • Real-time and historical views

32.4 SLA & Alerting

  • SLA Examples (codified in the sketch after this list):
      ◦ Matching latency < 500μs
      ◦ Order success rate > 99.99%
  • Alerting:
      ◦ Threshold-based, real-time notification
      ◦ Multi-channel (Telegram, Slack, Email)
  • Trend Analysis:
      ◦ 7-day/30-day system load and performance trends
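
The SLA examples above can be expressed as data and checked uniformly. The target table and function below are illustrative assumptions, not an existing configuration format.

SLA_TARGETS = {
    "matching_latency_us": {"max": 500},    # matching latency < 500μs
    "order_success_rate": {"min": 0.9999},  # success rate > 99.99%
}

def sla_met(metric_name, value):
    # An SLA is met when the value stays within its min/max bounds
    target = SLA_TARGETS[metric_name]
    if "max" in target and value > target["max"]:
        return False
    if "min" in target and value < target["min"]:
        return False
    return True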

32.5 Technology Stack

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboarding
  • Python (FastAPI, psutil): Service implementation
  • Docker: Containerization
  • Alertmanager: Alert routing and notification

32.6 API Design

  • GET /metrics/{service_name}: Query latest metrics for a service
  • GET /metrics/history/{service_name}: Query historical metrics
  • GET /benchmark/report: Get latest benchmark report
  • GET /system/status: System health and SLA status
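
For reference, the endpoints above can be exercised with a plain HTTP client; the base URL and service name are assumptions.

import requests

base = "http://global-metrics-center:8000"  # assumed base URL
latest = requests.get(f"{base}/metrics/matching-engine").json()
history = requests.get(f"{base}/metrics/history/matching-engine").json()
report = requests.get(f"{base}/benchmark/report").json()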

32.7 Frontend Integration

  • Grafana: Plug-and-play dashboards for all metrics
  • Custom React Dashboard: For advanced visualization and SLA tracking
  • Alert Visualization: Real-time alert banners and notifications

32.8 Implementation Roadmap

  • Phase 1: Metrics collector and aggregator, Prometheus integration
  • Phase 2: Alert system and SLA enforcement, basic dashboard
  • Phase 3: Benchmark automation, historical analysis, advanced visualization

32.9 Integration with Existing System

  • All microservices embed the metrics collector client
  • Central aggregator and alert system run as core ops services
  • Prometheus scrapes all metrics endpoints
  • Grafana dashboards available to ops and engineering

32.10 Business Value

Benefit               Impact
Full Observability    Real-time and historical system health
SLA Management        Quantifiable, enforceable reliability
Automated Alerting    Proactive incident response
Benchmarking          Continuous performance improvement
Transparency          Stakeholder trust and operational excellence