Opinions are mine

Scale

I've been using Consul since one of the first releases back in 2014 and have set it up multiple times in various environments - from standalone nodes for convenience purposes to multi-datacenter federated clusters partitioned over WAN and LAN segments.

Over time, dealing with larger and larger setups, I learned some tricks that came in handy as the scale grew, especially once node and service counts started running into the hundreds or thousands.

I must note that I have not pushed Consul to its limits and my experience is not outstanding in any way, as there seem to be setups that are orders of magnitude larger than anything I have dealt with so far.

On the other hand, I rarely see people talking about the tweaks I've been using, and I definitely would have saved myself a lot of problems had I known about them before I needed them.

Scope

I only deal with Linux here and assume there are no resource bottlenecks - that is, Consul is not starved of CPU/IO/RAM/etc.

Stability

There are a few things that make Consul servers way more stable and save time otherwise spent investigating sporadic API call failures. In other words, let's start by making sure that as few requests as possible fail.

Open File Descriptors

Opening a file takes a file descriptor. So does opening a socket - and not only for listening for incoming connections, but also for each connection that is kept open. This is especially important for Consul as it talks a lot over the network - both to keep in touch with other nodes and to serve API requests that update or retrieve service and KV records.

Linux enforces limits on how many file descriptors a given process can have. There is a system-wide limit that one can set via sysctl, and there is a limit for any given process.
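The system-wide ceiling can be checked and raised via sysctl; for example (the number below is an arbitrary placeholder, not a recommendation):

$ sysctl fs.file-max
$ sudo sysctl -w fs.file-max=1000000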

The per-process limit is easy to adjust. In the case of systemd it is just an entry under the Service section:

[Service]
LimitNOFILE=16384

and with Docker it's just an extra command line parameter:

$ docker run --ulimit nofile=16384 consul
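It is also worth double-checking that the limit actually took effect and keeping an eye on real usage. Assuming a single process named consul, a quick look could be (these are plain /proc inspections, nothing Consul-specific):

$ grep 'open files' /proc/$(pidof consul)/limits
$ ls /proc/$(pidof consul)/fd | wc -l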

raft_multiplier

Of all the items on the list, this one is the closest to "tweaking" Consul itself. It is very well documented in Consul's performance guide, but the bottom line is to put this into the Consul servers' configuration:

{
  "performance": {
    "raft_multiplier": 1
  }
}

This essentially speeds up leader failure detection and re-election, and therefore minimizes the window during which requests involving the leader may fail.

Stale Reads

This is also briefly mentioned in Consul's performance guide, but I personally think it is a way under-appreciated optimization.

In my experience, even in a federated setup with hundreds of nodes and services, when a KV entry or a service state changes, that information propagates around very quickly - within tens of milliseconds, even when using stale reads. Obviously, such latencies are only observable when one uses long-polling to listen for state changes. On top of that, I am not sure I have ever seen health checks executed more often than once every second or few. This means that the propagation latency can be considered negligible.

The official documentation says (highlights are mine):

This mode allows any server to service the read regardless of whether it is the leader. This means reads can be **arbitrarily** stale; however, results are generally consistent to within **50 milliseconds** of the leader. The trade-off is very fast and scalable reads with a higher likelihood of stale values. Since this mode **allows reads without a leader, a cluster that is unavailable will still be able to respond to queries**.

What does that mean? If we recall the CAP theorem, it means that we opt into AP - in other words, being available in the face of a network partition. This might not be suitable for all cases, but in my experience that is a safe assumption in 99% of them - to the degree that I always default to stale reads.

And since reads are "AP" now - even a leader failure would not impact data retrieval.
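As an illustration, a stale long-polling read of a service's health could look like the following (the service name and index value are made up for the example). The X-Consul-Index, X-Consul-KnownLeader and X-Consul-LastContact response headers then tell what index to block on next and how stale the answer actually was:

$ curl 'http://localhost:8500/v1/health/service/web?stale&index=1234&wait=30s'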

Intervals

When dealing with large distributed systems, relying too much on time is not the best idea: high-precision clocks are hard, and clocks will drift or won't be synchronized properly at all.

Even though Consul's performance is such that durations are usually measured in milliseconds, that is not a sane precision to count on in a general-purpose distributed system built on commodity hardware.

Going an order of magnitude or two slower - at per-second precision - would simplify a lot of things and would generally eliminate the need to worry about stale data, rate limits, abusive API clients and so on. Even though one can use long-polling to retrieve updates from Consul immediately, adding a second-scale interval before re-starting a watch would greatly reduce the load both on Consul and on the software that processes the updates. Going in batches is always more efficient, and it may even eliminate the need to process updates at all if some services flap within the interval period (unless catching that is the goal).
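A minimal sketch of such a throttled watch loop in shell - the endpoint, the service name and the one-second pause are all just examples, and process_update is a hypothetical handler:

INDEX=0
while true; do
  # Blocking query: returns when something changes or when the wait expires
  RESP=$(curl -s -D headers.txt \
    "http://localhost:8500/v1/health/service/web?stale&index=${INDEX}&wait=5m")
  # Remember the Raft index to block on next time
  INDEX=$(grep -i '^x-consul-index:' headers.txt | awk '{print $2}' | tr -d '\r')
  process_update "$RESP"   # hypothetical: handle the whole batch at once
  sleep 1                  # second-scale pause before re-arming the watch
done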

Reliability

Approaching Consul from a user's perspective, "reliability" means not Consul's own reliability but rather making sure that the operations being executed are performed the way one expects, with as few unpredicted edge cases as possible.

Durable Writes

Speaking about the stale consistency mode above, I only mentioned reads and not writes. To the best of my knowledge, writes are not subject to consistency modes, and this is also briefly mentioned in Consul's documentation:

consistency query parameters will be ignored, since writes are always managed by the leader via the Raft consensus protocol

In other words, writes are durable by default and one does not have to worry about consistency modes when modifying data.
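For completeness, a plain KV write is just a PUT with no consistency parameter to think about - the call returns only once the leader has committed the change (the key and value below are placeholders):

$ curl -X PUT --data 'some value' 'http://localhost:8500/v1/kv/app/config/example'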

Federated Atomic Reads

For a pretty long time now, Consul has had a Transactions API that can be used to perform atomic sets of operations. Unfortunately, I have not had a chance to measure its overhead yet, but it is definitely the way to go if one needs to combine several operations of different types (read a key, modify another one, or maybe delete it).
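A minimal sketch of such a mixed transaction via the HTTP API - the keys are illustrative, and the Value field must be base64-encoded ("dHJ1ZQ==" is just "true"):

$ curl -X PUT 'http://localhost:8500/v1/txn' -d '[
    {"KV": {"Verb": "get",    "Key": "app/config/feature-x"}},
    {"KV": {"Verb": "set",    "Key": "app/config/feature-y", "Value": "dHJ1ZQ=="}},
    {"KV": {"Verb": "delete", "Key": "app/config/feature-z"}}
  ]'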

Sometimes, though, one needs to fetch data from another datacenter in an "atomic way" - as in, imagine watching a few KV sub-trees in a federated cluster. This may lead to unexpected results in the face of network issues, especially intermittent ones - some of the watches can succeed while others may fail and fall into a retry loop.

A "trick" around such a problem is to avoid watching several KV subtrees in a single context when talking to a federated cluster. Instead, one can just watch a superset of all trees that should be monitored. While potentially resulting in more data transfer, this would make the whole process smoother as it's always will be easy to reason about state of such a KV observation.

Performance

When talking to Consul's APIs, one usually takes Consul's own performance for granted, but how exactly one talks to those APIs can make a huge difference.

ACLs for performance

Consul's KV operations can be performed either on individual entries or on entire sub-trees. Catalog operations are usually done on concrete service or node entries. Nevertheless, when dealing with large clusters, one may face situations where it is impossible to filter out unnecessary data.

Watching a service's state on a few particular nodes would result either in several watches with complicated synchronization, or in watching the service's state across the whole cluster and then extracting a tiny piece of the retrieved data.

The same goes for KV - as in the federated atomic reads example above, one may need to watch an entire KV tree just to be able to detect changes in a few key entries under different paths.

This might be a non-obvious solution, but ACLs can be a great help here. In most cases, it is pretty easy to create a Consul access token with ACL rules that restrict the available data to only what is needed. This can not only tremendously reduce data transfer, but also reduce load by skipping updates to irrelevant data, and even simplify the data filtering itself, given that the ACL rules engine is domain-specific.
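As a sketch, ACL rules for a token used by such a watcher might look like this under the current ACL system (Consul 1.4+); the key prefix and service name are illustrative:

key_prefix "app/config/frontend/" {
  policy = "read"
}
service "billing" {
  policy = "read"
}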

HTTP/2

Since version 1.0.1, Consul's API supports HTTP/2 when TLS is enabled, which means one can use persistent connections and multiplex many API calls over a single TCP connection. This reduces the overhead of establishing new connections and also reduces the number of file descriptors needed on Consul agents that serve a lot of API calls.
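Assuming TLS is enabled on the conventional HTTPS port 8501 (and with the CA path below as a placeholder), HTTP/2 negotiation can be verified with curl:

$ curl -sv --http2 --cacert /etc/consul/ca.pem \
    'https://127.0.0.1:8501/v1/agent/self' -o /dev/null 2>&1 | grep -i 'HTTP/2'

Official client libraries generally take care of connection reuse on their own once pointed at the TLS endpoint.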

Conclusion

While Consul's documentation has many great guides on optimizing Consul's own performance, stability and reliability, there is a lot that can be done from the user's perspective. Not only can such "tricks" make Consul's life easier, they can also simplify the applications that talk to Consul's APIs.