While working with elasticsearch it’s impossible to avoid the topic of shard optimization. Sooner or later every user of a more complex system will have to take up this topic. An unoptimized elasticsearch environment can result in slow data indexing, returning responses, and even unstable performance of the entire environment.
The sooner we understand where this problem comes from and the sooner we address it, the better. Planning elasticsearch shard policies is essential to ensure a long-lasting and stable cluster performance. It is also worth remembering that shards are nothing else than the Lucene engine on the basis of which elasticsearch was created.
What is shard?
Index is built with shards and we divide them into primary and replica. Each shard holds some part of the data stored in the index, so the set of primary shards in the index will act as RAID 0. Additionally, each primary shard can have its 1:1 replica. This guarantees data availability in the event of a cluster failure. If any primary shard becomes unavailable, its replica takes his place.
Often times, the elasticsearch environments that receive the data are not properly managed in terms of space. This means that the data released into the environment remain there for a long time, somehow forgotten. Indexing large amounts of data without proper management can quickly consume even huge disk resources. The reason for this is that the replicas are 1: 1 copies of the primary shards.
Holding an index (e.g. 100GB) of 4 shards and each of them has its own three replicas means that we have a total of 15 shards (4 primary + 12 replicas) and 400 GB of data. It is not hard to see how inattention can lead to a quick full disk in such a scenario.
Optimization consists in categorizing data and assigning them an appropriate number of shards and replicas. Of course, each replica taken means a greater susceptibility to permanent data loss in the event of a failure. It is known, however, that not every index is critical and for those with lower priority it is worth considering how many replicas they require.
Elasticsearch is software made in java. The assigned heap is therefore crucial for the proper functioning of the environment. Incorrectly scheduling the server’s memory resources can contribute to a serious failure due to insufficient memory.
How to judge how much memory an elasticsearch node requires It depends on the size of the cluster and the amount of data we collect. Elasticsearch holds a lot of data in memory for quick access. It is recommended that an elasticsearch node does not have more than 25 shards (primary and replicas) per 1 GB of memory in the heap. It is worth noting that elasticsearch is not able to limit this for us. This is one of the fundamental steps of an administrator to monitor the number of shards per heap memory.
When it comes to query optimization, there is no secret that the main element is the structure of the query and the scope of data on which the query is triggered. In other words: 100GB data will be responded faster than 500GB data.
The more complicated the query and the more data is run, the later we will get the answer. Therefore, it is important to balance the relationship between the number of shards and their size.
It is recommended that one shard contains between 20GB and 40GB of data. Therefore, if we have an index of 100GB of data, it is worth allocating a total of 4 shards primary. Replicas are not included in this optimization procedure as they hold 1: 1 copies of the data of the primary shards.
The above aspects show that although elasticsearch is a powerful tool and one of the best engines for managing large amounts of data, it still needs careful attention in terms of optimization. Good planning of the indexes and shards structure will allow you to enjoy a stable and very efficient cluster environment.