KeyChest Blog

KeyChest and Reliability

Jun 17, 2019 2:59:01 PM / by Dan

Those who have been with us for a while may know that we changed our cloud provider to DigitalOcean in January. At the same time, we started experimenting with an HA database cluster, and we learnt a lot.

A wise man would say that you learn most from mistakes, and it's true that if things simply work, you are likely to be caught off guard when they eventually fail. We experimented with HA database clusters for several months and believed we had mastered the technology. Well, you can judge for yourself in this blog post of mine :) One big lesson, however, was that you can't do much without information.

Figure: Netdata provides an excellent source of real-time system data.

We eventually decided to use Netdata, which has so far satisfied all our requirements. We can access detailed reports when we need them, all alarms are forwarded to our Slack, and we have also added the alarms and the main system stats to the KeyChest website. Both are available at https://keychest.net/status, a page that also includes notes about problems and about planned system upgrades during which we expect downtime or issues.

Another aspect of KeyChest monitoring is access to the application logs. We have completely refactored logging in the KeyChest audit engine. The aim was to create data that is easy to analyse and query. We decided to go for the JSON format, with simple text messages and separate parameters, and implemented a new log dispatcher that collects logs from multiple audit processes.

Each log message uniquely identifies its source and the version of KeyChest. It has two timestamps, so we know both when the logged event happened and when it was added to the logs. All variable parameters are defined as separate JSON data items, so we can automate analysis with very little loss of detail.
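To make that record layout a bit more concrete, here is a minimal Python sketch of what such a log line could look like. The field names (source, version, event_time, logged_time, message, params), the version string and the helper names are illustrative assumptions, not the actual KeyChest schema.

```python
import json
import os
import socket
from datetime import datetime, timezone

# Hypothetical version string; a real service would read this from its build.
APP_VERSION = "0.0.0-example"


def build_log_record(message, params, event_time, source=None):
    """Build one structured log record as a plain dict.

    message    - short, constant text describing the event
    params     - dict of variable values, kept out of the message text
    event_time - timezone-aware datetime of when the event happened
    source     - identifier of the emitting process (assumed: host:pid)
    """
    return {
        "source": source or f"{socket.gethostname()}:{os.getpid()}",
        "version": APP_VERSION,
        # First timestamp: when the logged event happened.
        "event_time": event_time.isoformat(),
        # Second timestamp: when the record was added to the logs.
        "logged_time": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "params": params,
    }


def to_json_line(record):
    """Serialise a record as a single JSON line, ready to append to a file."""
    return json.dumps(record, sort_keys=True)
```

Keeping the message text constant and moving everything variable into params means a query can group by message and filter on individual parameters without parsing free text.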

Analysis is crucial for the effective use of any logs. In our case, we store logs locally in files to ensure they don't get lost due to networking issues, and we also push the data to a Splunk instance for visualisation and querying.
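The "local file first, remote second" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name store_and_forward is made up, and forward stands for any callable that ships a line to a remote collector (for example a Splunk instance); it is not our actual dispatcher code.

```python
import json


def store_and_forward(record, log_path, forward):
    """Append the record to a local file, then try to forward it.

    record   - dict to log
    log_path - path of the local log file (one JSON object per line)
    forward  - hypothetical callable that sends one line to a remote
               collector and raises OSError on a network failure
    Returns True if forwarding succeeded, False otherwise.
    """
    line = json.dumps(record, sort_keys=True)

    # Write locally first, so the record survives any networking issue.
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(line + "\n")

    try:
        forward(line)
        return True
    except OSError:
        # ConnectionError and socket timeouts are subclasses of OSError.
        # The record is already on disk, so nothing is lost here.
        return False
```

Because the local write happens before the network call, a failed push only delays remote analysis; the file remains the complete copy.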

Figure: Splunk visualisation of memory use by KeyChest audit engine

We really do put a lot of effort into making KeyChest a reliable service. Sometimes it feels like everything is stacked against us, but we are making it better week by week. So if something doesn't work, it only means we missed something, and we'd love to hear from you!

Tags: keychest, incident response

Written by Dan