Production distributed system – pt. 1

A customer came to AikiLinux requesting our assistance in designing and implementing a highly distributed and resilient monitoring system based on Icinga, with a planned scope of monitoring its own internal cloud service and some of the services it provides to its external customers.

In the initial step we evaluated the requirements of the cluster and built a small-scale lab for them (a master, two satellites, and a host to monitor), and then set out to understand the network topology and limitations that might impact performance.
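For reference, a lab that size maps onto an Icinga 2 zone tree along these lines. This is a minimal sketch with placeholder endpoint names and addresses, not the customer's actual zones.conf; we also chose here to put both satellites in a single HA zone, which is one of several valid layouts:

# zones.conf – minimal sketch of the lab topology (placeholder names/IPs)
object Endpoint "master1" {
}

object Zone "master" {
  endpoints = [ "master1" ]
}

object Endpoint "satellite1" {
  host = "10.0.0.11"
}

object Endpoint "satellite2" {
  host = "10.0.0.12"
}

# both satellites in one zone, so they share the load and back each other up
object Zone "satellites" {
  endpoints = [ "satellite1", "satellite2" ]
  parent = "master"
}

# global zone for templates shared across all nodes
object Zone "global-templates" {
  global = true
}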

The things that we found were “normal” for a large, multi-continent organisation:

  • remote, separated data centres
  • a very restrictive IT department
  • ESX resources
    … nothing new, and nothing we hadn’t encountered before.

So we set out to design the solution and considered which components would help us provide a truly redundant system without relying on any cloud provider's services; everything done in house.

The stack we ended up with was fairly simple: MariaDB Galera, HashiCorp Consul and named (the BIND DNS daemon) for the database layer, and a standard HA setup for Icinga itself.
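Consul's part in that stack is worth a sentence: it health-checks each Galera node and its DNS interface only hands out the addresses of healthy nodes, while named forwards the .consul zone to Consul, so database clients fail over through plain DNS resolution. A quick way to check the chain, assuming a service registered under the (placeholder) name mariadb:

# query Consul's DNS interface directly (it listens on port 8600 by default);
# only nodes whose health checks pass are returned
dig @127.0.0.1 -p 8600 mariadb.service.consul +short

# the same lookup through named, which forwards the .consul zone to Consul
dig mariadb.service.consul +short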

The first challenge in building this system was ensuring that the Galera cluster was up and running, so we modified /etc/my.cnf.d/server.cnf:

# this is read by the standalone daemon and embedded servers
[server]

# this is only for the mysqld standalone daemon
[mysqld]
log_error=/var/log/mariadb.log
#
# * Galera-related settings
#
[galera]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
# allow the server to accept connections on all interfaces
bind-address=0.0.0.0
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://172.32.6.15,172.31.6.15,172.33.6.15"
innodb_locks_unsafe_for_binlog = 1

## Galera Cluster Configuration
wsrep_cluster_name="icinga-galera"
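One step that is easy to miss: with all three IPs listed in gcomm://, no node will form a cluster on its own, so the very first node has to be explicitly bootstrapped. On MariaDB 10.1+ with systemd that looks like:

# on the first node only – bootstrap a brand new cluster
galera_new_cluster

# on the remaining nodes – start normally and join the running cluster
systemctl start mariadb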

We started the nodes… and got no sync between the master and the other nodes.

We tested several solutions, modifying the security policy and the firewall, but in the end the only way to get the cluster up and running was to disable SELinux (mind you, this was after the third firewall you need to get through to gain access to the server).
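For the record, “disabling SELinux” boils down to the usual two commands; and for anyone who can get a policy change approved, the cleaner fix is to label the Galera ports for mysqld instead of turning enforcement off:

# the blunt fix: stop enforcing now, and keep it off across reboots
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# the cleaner alternative: allow mysqld to use the Galera ports
# (4567 group communication, 4568 IST, 4444 SST)
semanage port -a -t mysqld_port_t -p tcp 4567
semanage port -a -t mysqld_port_t -p tcp 4568
semanage port -a -t mysqld_port_t -p tcp 4444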

Once the nodes “saw” each other, we started testing data replication, and we saw that two nodes replicated data but the third did not.
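The test itself is simple enough to reproduce: check the cluster size, write on one node, read on another (the schema name here is, of course, just an example):

# on any node – should report 3 when all nodes are joined
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"

# on node 1 – create something small to replicate
mysql -e "CREATE DATABASE repl_test; CREATE TABLE repl_test.t (id INT PRIMARY KEY) ENGINE=InnoDB; INSERT INTO repl_test.t VALUES (1);"

# on nodes 2 and 3 – the row should be there almost immediately
mysql -e "SELECT * FROM repl_test.t;"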

It turned out that NTP was disabled and the time difference between the servers was more than 1900 seconds. We uncommented the NTP server records, made sure the clocks were in sync… and now we have replication. YAY!!
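For completeness, on a CentOS/RHEL-style box running ntpd (as opposed to chrony), the fix was roughly:

# with a ~1900 second offset, a one-off step before starting ntpd can help,
# since ntpd only corrects an offset that large when started with -g
ntpdate 0.pool.ntp.org

# uncomment the server lines in /etc/ntp.conf, then enable and start the daemon
systemctl enable ntpd
systemctl start ntpd

# verify that peers are reachable and the offset is shrinking
ntpq -p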