Oct 15, 2012

Zabbix poller processes more than 75% busy and queue delay (I)

In my previous job, I had to set up a Zabbix infrastructure to monitor more than 400 devices, including switches and servers. The main feature of that architecture was a large number of machines, each with a small number of items and a long update interval (around 30 seconds).

At the time, I wrote a couple of articles related to this issue, including a tuning guide.


But in my current position, I am starting to introduce Zabbix (2.0.3 on Ubuntu Server 12.04) to monitor a few devices that require a large number of items and a short update interval. This overloads the Zabbix server: on the one hand, more monitored items are delayed in the queue, and on the other, the poller processes stay busy most of the time.
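To get a rough feel for why the second setup is heavier than the first, a back-of-the-envelope estimate helps. The Python sketch below assumes that every check occupies one poller for a fixed average time, which is a simplification. The function names and all the figures (item counts, 20 ms per check, 5 pollers) are hypothetical and are not measurements from my installation.

```python
def values_per_second(hosts, items_per_host, interval_s):
    """New values the server has to collect per second."""
    return hosts * items_per_host / interval_s


def poller_busy_pct(nvps, avg_check_s, pollers):
    """Approximate share of time the pollers spend working, capped at 100."""
    return min(100.0, 100.0 * nvps * avg_check_s / pollers)


# Hypothetical figures, for illustration only.
many_hosts = values_per_second(hosts=400, items_per_host=5, interval_s=30)
few_hosts = values_per_second(hosts=10, items_per_host=500, interval_s=5)

for label, nvps in (("many hosts, few items", many_hosts),
                    ("few hosts, many items", few_hosts)):
    busy = poller_busy_pct(nvps, avg_check_s=0.02, pollers=5)
    print(f"{label}: {nvps:.1f} values/s, pollers ~{busy:.0f}% busy")
```

Under these assumptions, a handful of hosts with many items and a short interval can demand far more values per second than hundreds of lightly monitored machines.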

In addition, I have observed that, from time to time, the agent goes down unexpectedly. If you look at the client's log file (in debug mode), the following error lines are written:

root@zabbix-client:~# tail -f /var/log/zabbix/zabbix_agentd.log
...
zabbix_agentd [17271]: [file:'cpustat.c',line:155] lock failed: [22] Invalid argument
 17270:20121015:092010.216 One child process died (PID:17271,exitcode/signal:255). Exiting ...
 17270:20121015:092010.216 zbx_on_exit() called
 17272:20121015:092010.216 Got signal [signal:15(SIGTERM),sender_pid:17270,sender_uid:0,reason:0]. Exiting ...
 17273:20121015:092010.216 Got signal [signal:15(SIGTERM),sender_pid:17270,sender_uid:0,reason:0]. Exiting ...
 17274:20121015:092010.216 Got signal [signal:15(SIGTERM),sender_pid:17270,sender_uid:0,reason:0]. Exiting ...
 17270:20121015:092012.216 Zabbix Agent stopped. Zabbix 2.0.3 (revision 30485).
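To see how often the agent dies, the log can be scanned for the "One child process died" line. The following Python sketch relies only on the line format visible in the excerpt above (PID, date, time and message). The function name `find_child_deaths` and the regular expression are my own, and other Zabbix versions may format these lines differently.

```python
import re
from datetime import datetime

# Matches lines such as:
#  17270:20121015:092010.216 One child process died (PID:17271,exitcode/signal:255). Exiting ...
DIED_RE = re.compile(
    r"^\s*(?P<parent>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.(?P<ms>\d{3})\s+"
    r"One child process died \(PID:(?P<child>\d+),exitcode/signal:(?P<code>\d+)\)"
)


def find_child_deaths(lines):
    """Return (timestamp, parent_pid, child_pid, code) for each matching line."""
    events = []
    for line in lines:
        m = DIED_RE.match(line)
        if m:
            stamp = f"{m['date']} {m['time']}.{m['ms']}"
            ts = datetime.strptime(stamp, "%Y%m%d %H%M%S.%f")
            events.append((ts, int(m["parent"]), int(m["child"]), int(m["code"])))
    return events


# Lines copied from the log excerpt above.
sample = [
    "zabbix_agentd [17271]: [file:'cpustat.c',line:155] lock failed: [22] Invalid argument",
    " 17270:20121015:092010.216 One child process died (PID:17271,exitcode/signal:255). Exiting ...",
    " 17270:20121015:092010.216 zbx_on_exit() called",
]

for event in find_child_deaths(sample):
    print(event)
```

In practice you would pass it the opened log file instead of the `sample` list, since iterating over a file yields its lines.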

The figure below shows the Zabbix server performance (queue) for this case.




The second figure shows the Zabbix data gathering process (pay attention to the series "Zabbix busy poller processes, in %").




In the first figure, the Zabbix queue averages more than 50 delayed items, and in the second, the poller processes are busy close to 100% of the time. As a result, Zabbix sometimes draws sporadic dots rather than lines in the graphs. Another effect is that, if you set a short update interval for an item, you may find data missing when you later check the gathered values.
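The missing data can be made visible by comparing the timestamps of the collected values against the configured update interval. This is a minimal Python sketch, assuming you already have the collection times of one item as Unix seconds, however you obtain them. The function name `find_gaps`, the tolerance factor and the sample values are hypothetical.

```python
def find_gaps(timestamps, interval_s, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are too far apart."""
    ordered = sorted(timestamps)
    limit = interval_s * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical collection times for an item with a 5-second update interval.
clocks = [0, 5, 10, 30, 35, 60]

for start, end in find_gaps(clocks, interval_s=5):
    print(f"no data between {start} and {end} ({end - start} s)")
```

Each reported pair corresponds to a stretch where the graph would show isolated dots instead of a continuous line.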




I should also say that I followed the tuning guide mentioned earlier but, as you can see, the Zabbix server was still struggling.


Next week I will publish the continuation of this article, in which I will explain how I solved it.