Realtime Monitoring System

Design a realtime server monitoring and alert system like Datadog.

Functional requirements:

  • The monitoring system must continuously collect data from a fleet of servers.
  • Data to be collected on each machine: CPU usage, memory usage, system logs, and web server access and error logs.
  • Users can define rules whose conditions trigger alerts. For example, a user can set the condition "average CPU usage over the last 5 minutes exceeds 80%". When the condition is met, the system alerts the user by email, SMS, or push notification.
  • Users should be able to view visualized data in a web-based dashboard. The dashboard should display data in realtime and allow users to view historical data.
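
As an illustration of the alert rule in the example above, here is a minimal sketch of a sliding-window threshold check. The class name `ThresholdRule`, its fields, and the `observe` method are hypothetical names chosen for this sketch and are not part of any real monitoring product's API. It assumes samples arrive in timestamp order and keeps the whole window in memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ThresholdRule:
    """Hypothetical rule: fire when the windowed average exceeds a threshold."""
    metric: str
    threshold: float      # e.g. 80.0 for "80% CPU"
    window_seconds: int   # e.g. 300 for "the last 5 minutes"
    samples: deque = field(default_factory=deque)  # (timestamp, value) pairs

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample and return True if the rule's condition is met."""
        self.samples.append((timestamp, value))
        # Drop samples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        # The sample just appended is always inside the window, so the
        # deque is never empty here.
        average = sum(v for _, v in self.samples) / len(self.samples)
        return average > self.threshold
```

A caller would create one rule per user condition, for example `ThresholdRule("cpu", 80.0, 300)`, call `observe` for each incoming sample, and hand off to the notification channel (email, SMS, or push) when it returns True. A production system would more likely evaluate rules against pre-aggregated data rather than raw samples held in memory.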

Scale requirements:

  • Monitor 10,000 servers initially, and scale to handle fleet growth of 20% per year.
  • Retain data for 5 years.
  • Assume each server reports 6 metrics and each metric is submitted every 10 seconds (8,640 times per day).
  • Assume a read-to-write ratio of 1:100.
  • Assume each metric sample is 100 bytes.
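
The assumptions above translate into throughput and storage figures with simple arithmetic. The sketch below is a back-of-the-envelope calculation only. It assumes a 365-day year, that the 20% growth is applied as a step at the start of each year, and that storage is raw sample bytes with no compression, replication, or indexes. It covers metric samples only, not log volume. All constant and function names are made up for this sketch.

```python
SERVERS = 10_000
METRICS_PER_SERVER = 6
SUBMIT_INTERVAL_S = 10
BYTES_PER_SAMPLE = 100
RETENTION_YEARS = 5
ANNUAL_GROWTH = 0.20
WRITES_PER_READ = 100      # read-to-write ratio of 1:100
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365        # assumption: leap days ignored


def estimate() -> dict:
    """Return rough throughput and raw storage figures for the stated scale."""
    writes_per_second = SERVERS * METRICS_PER_SERVER / SUBMIT_INTERVAL_S
    reads_per_second = writes_per_second / WRITES_PER_READ

    submissions_per_day = SECONDS_PER_DAY // SUBMIT_INTERVAL_S  # 8,640
    samples_per_day = SERVERS * METRICS_PER_SERVER * submissions_per_day
    bytes_per_day = samples_per_day * BYTES_PER_SAMPLE
    bytes_per_year = bytes_per_day * DAYS_PER_YEAR

    # Storage over the retention period with a constant fleet size.
    flat_bytes = bytes_per_year * RETENTION_YEARS
    # Storage if the fleet grows 20% at the start of each following year.
    growing_bytes = sum(
        bytes_per_year * (1 + ANNUAL_GROWTH) ** year
        for year in range(RETENTION_YEARS)
    )
    return {
        "writes_per_second": writes_per_second,
        "reads_per_second": reads_per_second,
        "bytes_per_day": bytes_per_day,
        "flat_bytes": flat_bytes,
        "growing_bytes": growing_bytes,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the figures come to about 6,000 writes per second, 60 reads per second, roughly 51.84 GB of raw samples per day, about 94.6 TB over 5 years at constant fleet size, and about 140.8 TB if the fleet grows 20% each year.
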
1. Resource Estimation