2024-07-12
ELK optimization can be carried out from the following aspects:
Features of ES as log storage:
①Increase server memory and JVM heap memory
②Use multiple instances for load balancing
③Use filebeat instead of logstash to collect log data
- To make sure no data is lost, the translog file has to be protected:
- Since Elasticsearch 2.0, every write request (index, delete, update, bulk, etc.) triggers an fsync that flushes the translog to disk before the 200 OK response is returned; alternatively, by default the data in the translog is force-flushed to disk via fsync every 5 seconds.
- This improves data safety, at a small cost in performance.
- ==> Frequent fsync calls can block and make some operations take noticeably longer. If losing a small amount of data is acceptable, the translog can be flushed asynchronously to improve efficiency, and the flush threshold size can be raised so flushes happen less often. The optimized settings are as follows:
- "index.translog.durability": "async",
- "index.translog.flush_threshold_size":"1024mb",
- "index.translog.sync_interval": "120s"
- Data written to Lucene is not searchable in real time; ES must first turn the in-memory data into a complete Lucene segment through the refresh process before it can be searched.
- With the default of 1 second, newly written data becomes searchable very quickly, but this inevitably produces a large number of segments and hurts search performance, so increasing the interval reduces system overhead. For log search the real-time requirement is not that strict, so 5s or 10s is enough; for SkyWalking the requirement is even looser, and we can set it to 30s.
- The setting is as follows:
- "index.refresh_interval":"5s"
- index.merge.scheduler.max_thread_count controls the number of concurrent merge threads. If the storage is an SSD with good concurrent IO, the system default of max(1, min(4, availableProcessors / 2)) can be used; however, on nodes with many CPU cores, merging may then consume too many resources and hurt cluster performance. For ordinary spinning disks it should be set to 1, otherwise disk IO congestion will occur. After setting max_thread_count, up to max_thread_count + 2 threads perform disk operations at the same time, so a value of 1 allows 3 threads.
- The setting is as follows:
- "index.merge.scheduler.max_thread_count":"1"
You need to close the indexes first, then apply the settings, and reopen them once the update succeeds.
curl -XPOST 'http://localhost:9200/_all/_close'
curl -XPUT -H "Content-Type:application/json" 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
  "index.merge.scheduler.max_thread_count": "1",
  "index.refresh_interval": "10s",
  "index.translog.durability": "async",
  "index.translog.flush_threshold_size": "1024mb",
  "index.translog.sync_interval": "120s"
}'
curl -XPOST 'http://localhost:9200/_all/_open'
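After reopening the indexes it is worth confirming that the new values actually took effect; the filter_path below only narrows the output and can be dropped:
curl 'http://localhost:9200/_all/_settings?pretty&filter_path=*.settings.index.translog,*.settings.index.refresh_interval,*.settings.index.merge'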
When the write thread pool is full, tasks are rejected and some data cannot be written. After the optimizations above the rejection rate dropped a lot, but tasks are still occasionally rejected, so the write thread pool itself also needs tuning.
The write thread pool is a fixed thread pool, i.e. the number of core threads equals the maximum. The thread count defaults to the number of CPU cores, and the highest value it can be set to is the number of CPU cores plus 1; for a 16-core CPU that means at most 17 threads.
# Thread count settings
thread_pool:
  write:
    # The thread count defaults to the number of CPU cores, i.e. 16
    size: 17
    # Tasks get rejected when the load spikes, so enlarge the queue: bursts of tasks are buffered in the queue and drained gradually once the peak has passed.
    queue_size: 10000
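Whether the larger queue is enough can be checked at runtime: the _cat thread pool API shows the current queue depth and the cumulative number of rejected tasks per node, for example:
curl 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed'
If the rejected count keeps growing even with the larger queue, the cluster is simply receiving more than it can index and needs more capacity rather than a bigger buffer.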
Swap partition: when physical memory runs short, the system frees part of it for the programs that are currently running; the freed pages usually belong to programs that have been idle for a long time and are saved to Swap. When those programs run again, their data is restored from Swap back into memory. In other words, the system swaps whenever physical memory is insufficient.
Swap is very detrimental to performance and node stability and should be disabled. It can cause garbage collection to take minutes instead of milliseconds and can cause nodes to become unresponsive or even disconnected from the cluster.
1) Disable Swap on Linux (takes effect temporarily)
- Run sudo swapoff -a
- This disables Swap immediately, but the change is lost after the operating system reboots.
2) Minimize the use of Swap on Linux (takes effect permanently)
- Run the following command:
- echo "vm.swappiness = 1" >> /etc/sysctl.conf
- With this setting Swap is normally not used at all; it is only touched in emergencies.
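Note that appending to /etc/sysctl.conf does not change the running system by itself; reload the configuration (or reboot) and then verify:
sudo sysctl -p
sysctl vm.swappiness    # should print vm.swappiness = 1
free -h                 # the Swap line shows how much swap is currently in use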
3) Enable bootstrap.memory_lock
- Add the following to the config/elasticsearch.yml file:
- # Lock the memory so the JVM is not written to Swap, which would degrade ES performance
- bootstrap.memory_lock: true
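bootstrap.memory_lock only works if the operating system allows the Elasticsearch user to lock memory (for example a memlock entry in /etc/security/limits.conf, or LimitMEMLOCK=infinity when ES runs under systemd). Whether the lock actually succeeded can be checked through the nodes API; every node should report mlockall as true:
curl 'http://localhost:9200/_nodes?filter_path=**.mlockall&pretty'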
Index performance depends on the size of the shards and segments. If the shards are too small, the segments will also be too small, which increases overhead; if the shards are too large, segment merges become heavy, generating a lot of IO and hurting write performance.
Since each of our indexes is smaller than 15 GB and the default of 5 shards is more than we need, we reduce the shard count to 3.
"index.number_of_shards": "3"
Reduce the number of replica shards in the cluster. Too many replicas cause write amplification inside ES. The default number of replicas is 1: if the node holding an index goes down, another machine holding a replica still has a copy of the data, so the index remains usable. However, every write also has to be made on the replicas, which affects write performance. For log data, 1 replica is enough; for indexes with very large volumes, the replica count can even be set to 0 to reduce the impact on performance.
"index.number_of_replicas": "1"
①Increase server memory and JVM heap memory
②Use multiple instances for load balancing
③Use filebeat instead of logstash to collect log data
①Optimize the index: optimize fsync and increase the disk flushing interval appropriately
②Optimize the write thread pool configuration to reduce the number of rejected tasks: modify the ES configuration file elasticsearch.yml and set the write thread to the number of CPU cores + 1
③Lock the memory and prevent ES from using Swap: turn off Swap with swapoff -a
④ Appropriately reduce the number of shards and replicas of the index