Evaluating the Prometheus Monitoring Tool

A client asked us to evaluate the Prometheus monitoring solution for its AWS infrastructure. After two days of reading about and testing Prometheus, we can say several things about the tool:

  1. The modular build of the application is confusing at first and can be challenging for someone who is used to having the core product handle all the functions (comparison, alerting, test logic, etc.). Once you adjust your way of thinking, though, it makes sense and the logical division is easy to see.
  2. Another “shift” from the Nagios approach is the way Prometheus evaluates when and how to alert. In Nagios, and in any system that evolved from its school of thought, triggering is evaluated on the individual data check (service), whereas in Prometheus the individual check is irrelevant. The evaluation is done in the alerting logic, based on multiple criteria: node names, logical grouping, and the data point relative to its time series. You can also add arithmetic calculations for predictive alerting based on historical data.
  3. The Prometheus clients capture many data points on your remote nodes and require very simple configuration for the server to read the data. The advantages of the “pull” method (or “active” to those coming from Nagios-evolved systems) are apparent: many servers can read the data from a single client for redundancy, and you quickly become aware when a remote agent is no longer responding.
  4. A fully evolved query language allows you to build complex logic for parsing and slicing the data to present the metric you wish to get.
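
To make the predictive alerting from point 2 more concrete, here is a minimal Python sketch of the underlying arithmetic: fit a straight line through recent samples and alert if the extrapolated value crosses a threshold. It is similar in spirit to the predict_linear function in the Prometheus query language, but it is not the Prometheus implementation. The function names, the sample data and the threshold are assumptions made up for illustration.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the timestamp of the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


def should_alert(samples: List[Tuple[float, float]],
                 seconds_ahead: float,
                 threshold: float = 0.0) -> bool:
    """Alert when the extrapolated value is expected to drop below threshold."""
    return predict_linear(samples, seconds_ahead) < threshold


if __name__ == "__main__":
    # Hypothetical "free disk space" readings (in GB), one per minute.
    history = [(0, 100.0), (60, 90.0), (120, 80.0), (180, 70.0)]
    print("alert within 10 minutes:", should_alert(history, 600))
    print("alert within 1 minute:", should_alert(history, 60))
```

The point of the sketch is that the alert depends on the trend of the whole series rather than on any single check result, which is the shift away from the Nagios model described above.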

With those good points (and there are more), there are some areas that seem to be lacking:

  1. The built-in interface does not update at regular intervals. To get a visualisation that keeps the graphs current you need a third-party tool; the recommended one is Grafana, which can already use Prometheus as a data backend for querying.
  2. The modular build of the product may be an issue when an internal part (such as Alertmanager) fails: you will not be aware of the failure, since no alerts will be sent and the only indication will be the dashboard. Granted, you may define multiple Alertmanager instances to eliminate that issue, but for a small implementation it still feels like a problem.
  3. “More is Less”: the abundant metrics supplied by the client can be daunting to begin with, and understanding how to handle and use them for a basic monitoring setup can be overwhelming, causing the novice user to shy away and seek “simpler” solutions.
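
One way to make the flood of metrics from point 3 less daunting is to start from a short list of the ones you actually care about. The Python sketch below parses a small, made-up sample of the text format a client exposes and keeps only a chosen few series. The sample values, the select_metrics helper and the chosen names are assumptions for illustration, and the parser is deliberately simplified (it assumes no timestamps and no spaces inside label values). In practice this narrowing would usually be done in queries or in the scrape configuration rather than in custom code.

```python
from typing import Dict, Iterable

# Illustrative sample only; a real client exposes far more series than this.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.1e+09
node_filesystem_avail_bytes{device="/dev/xvda1",mountpoint="/"} 5.0e+09
node_network_receive_bytes_total{device="eth0"} 123456
"""


def select_metrics(text: str, wanted: Iterable[str]) -> Dict[str, float]:
    """Return {series: value} for every series whose metric name is in wanted.

    Simplified parser: skips comment lines and treats the text after the
    last space on each line as the value.
    """
    wanted_set = set(wanted)
    picked: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name in wanted_set:
            picked[series] = float(value)
    return picked


if __name__ == "__main__":
    basics = ["node_load1", "node_filesystem_avail_bytes"]
    for series, value in select_metrics(SAMPLE, basics).items():
        print(series, value)
```

Starting from a handful of series like this and adding more only when a concrete question requires them keeps a basic setup manageable.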

There are many more points that can be made about the system, both pros and cons, as I am sure many in the monitoring world will point out. As a whole, Prometheus provides a good, solid tool, and as always, you need to consider two points when you choose a monitoring tool:

  1. What do you want to achieve?
  2. How much time do you want to invest (Time = Money)?

When those two are defined and agreed upon, Prometheus could be one of the tools for consideration.