Design a realtime server monitoring and alert system like Datadog.
Functional requirements:
The system must continuously collect monitoring data from a fleet of servers.
Data collected on each machine: CPU usage, memory usage, system logs, and web server access and error logs.
Users can define rules whose conditions trigger alerts. For example, a user can set a rule that fires when the average CPU usage over the last 5 minutes exceeds 80%. When the condition is met, the system alerts the user by email, SMS, or push notification.
Users should be able to view visualized data in a web-based dashboard. The dashboard should display data in realtime and allow users to view historical data.
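One way to picture the example rule (average CPU over the last 5 minutes exceeding 80%) is a sliding-window check. The sketch below is only an illustration under assumptions: the class name AvgThresholdRule, its parameters, and the in-memory deque are hypothetical and not part of the requirements, and a real system would evaluate rules against a metrics store rather than a single process.

```python
from collections import deque


class AvgThresholdRule:
    """Hypothetical rule: breached when the mean of the samples in the
    trailing window is strictly greater than the threshold."""

    def __init__(self, window_seconds=300, threshold=80.0):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.samples = deque()  # (timestamp_seconds, value) pairs, oldest first
        self.total = 0.0        # running sum of the values currently in the window

    def add(self, timestamp, value):
        """Record one sample and return True if the rule is currently breached."""
        self.samples.append((timestamp, value))
        self.total += value
        # Evict samples that have fallen out of the trailing window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.is_breached()

    def is_breached(self):
        if not self.samples:
            return False
        return self.total / len(self.samples) > self.threshold


# Usage: 5 minutes of 90% CPU reported every 10 seconds breaches the rule.
rule = AvgThresholdRule(window_seconds=300, threshold=80.0)
fired = False
for t in range(0, 300, 10):
    fired = rule.add(t, 90.0)
```

Keeping a running sum makes each update cheap, at the cost of holding one window of samples per rule and per server in memory. The notification step (email, SMS, push) would hang off the True result and is left out here.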
Scale requirements:
Monitor 10,000 servers initially, and scale to handle 20% annual growth in the number of servers.
Retain data for 5 years.
Assume each server reports 6 metrics and each metric is submitted every 10 seconds (8,640 times per day).
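The scale numbers above translate into a rough write load and data volume. The following back-of-envelope sketch covers metrics only (not logs). The constant BYTES_PER_POINT is an assumption added for illustration (an 8-byte timestamp plus an 8-byte value, before any compression or indexing overhead), as are the function names and the choice to model growth as compounding once per year.

```python
# Figures taken from the scale requirements.
SERVERS = 10_000
METRICS_PER_SERVER = 6
SUBMIT_INTERVAL_S = 10
ANNUAL_GROWTH = 0.20
RETENTION_YEARS = 5

# ASSUMPTION (not in the requirements): raw size of one data point.
BYTES_PER_POINT = 16

SECONDS_PER_DAY = 86_400


def writes_per_second(servers):
    """Average metric writes per second for a fleet of the given size."""
    return servers * METRICS_PER_SERVER / SUBMIT_INTERVAL_S


def points_per_day(servers):
    """Data points ingested per day for a fleet of the given size."""
    submissions_per_day = SECONDS_PER_DAY // SUBMIT_INTERVAL_S  # 8640
    return servers * METRICS_PER_SERVER * submissions_per_day


def servers_in_year(year):
    """Fleet size in a given year (year 0 = launch), compounding annually."""
    return SERVERS * (1 + ANNUAL_GROWTH) ** year


def retained_points(years=RETENTION_YEARS):
    """Total data points held after `years` years, with the fleet growing yearly."""
    return sum(points_per_day(servers_in_year(y)) * 365 for y in range(years))


def retained_bytes(years=RETENTION_YEARS):
    """Raw storage estimate under the BYTES_PER_POINT assumption."""
    return retained_points(years) * BYTES_PER_POINT
```

For the initial fleet this gives 6,000 writes per second and 518,400,000 data points per day. The storage figure depends entirely on the assumed point size, so it is best read as an order-of-magnitude guide rather than a sizing target.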