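A few days ago, after writing "pnp4nagios templates", I found that processing 35,000 performance-data records with npcd in bulk mode took nearly 30 minutes, whereas I wanted it done within 1 minute. A different approach was needed, so I switched to gearman mode to process the data concurrently. My notes: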
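- Switch to an SSD
- Add one more preprocessing step to the 35,000 source records: merge records with the same host and the same servicedesc, which leaves only about 3,400 (3.4k) records
- Stop npcd and run gearmand as the server
- Start process_perfdata.pl with 4 worker processes, giving 4 workers
- Send the 3.4k records to the server as client data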
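###SSD disk
Disk space is fairly tight, so I reduced the size of the RRD files. I had installed pnp4nagios under /home, so the file to edit is /home/pnp4nagios/etc/rra.cfg; the configuration below makes each RRD file keep at most 90 days of data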
RRA_STEP=60
#
# PNP default RRA config
#
# you will get 400kb of data per datasource
#
# 2880 entries with 1 minute step = 48 hours
#
RRA:AVERAGE:0.5:1:2880
#
# 2880 entries with 5 minute step = 10 days
#
RRA:AVERAGE:0.5:5:2880
#
# 4320 entries with 30 minute step = 90 days
#
RRA:AVERAGE:0.5:30:4320
#
# 5840 entries with 360 minute step = 4 years
#
#RRA:AVERAGE:0.5:360:5840
RRA:MAX:0.5:1:2880
RRA:MAX:0.5:5:2880
RRA:MAX:0.5:30:4320
#RRA:MAX:0.5:360:5840
RRA:MIN:0.5:1:2880
RRA:MIN:0.5:5:2880
RRA:MIN:0.5:30:4320
#RRA:MIN:0.5:360:5840
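
As a quick sanity check of the retention figures in the comments above, here is a minimal sketch that computes how much history each RRA keeps, assuming every consolidated row covers steps × RRA_STEP seconds. The helper name rra_retention_seconds is hypothetical and not part of pnp4nagios.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of pnp4nagios): seconds of history kept by
# one "RRA:CF:xff:steps:rows" line, given the base step in seconds.
sub rra_retention_seconds {
    my ($rra_line, $step) = @_;
    my (undef, undef, undef, $steps, $rows) = split /:/, $rra_line;
    return $step * $steps * $rows;
}

my $step = 60;    # RRA_STEP from rra.cfg
for my $rra (qw(RRA:AVERAGE:0.5:1:2880 RRA:AVERAGE:0.5:5:2880 RRA:AVERAGE:0.5:30:4320)) {
    my $days = rra_retention_seconds($rra, $step) / 86400;
    print "$rra keeps $days days\n";
}
```

With the three active AVERAGE lines this gives 2, 10 and 90 days, so the longest archive left enabled is the 90-day one.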
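###Merging the data
Whether npcd invokes process_perfdata.pl or process_perfdata.pl runs as a gearman worker, it is this script that actually writes the RRD files, so reducing the amount of source data it has to parse is a natural optimization. For example: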
- DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rta=92.412ms;200.000;500.000;0;
- DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::pl=0%;40;80;;
- DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rtmax=97.281ms;;;;
- DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rtmin=86.727ms;;;;
These 4 records share the same HOSTNAME and SERVICEDESC, so, ignoring the timestamp TIMET (which may differ between them), they can be merged into 1 record
- DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rta=92.412ms;200.000;500.000;0; pl=0%;40;80;; rtmax=97.281ms;;;; rtmin=86.727ms;;;;
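
The merge step could look roughly like the sketch below. It is not the script actually used here; it assumes the fields of each record are tab-separated KEY::VALUE pairs (which is what parse_env splits on), keeps the TIMET of the first record seen for each host/service pair, and uses a hypothetical helper name, merge_perfdata.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge perfdata lines that share HOSTNAME and SERVICEDESC.
# Assumes tab-separated KEY::VALUE fields; the first record's TIMET is kept.
sub merge_perfdata {
    my @lines = @_;
    my (%merged, @order);
    for my $line (@lines) {
        chomp $line;
        my %f;
        for my $field (split /\t/, $line) {
            $f{$1} = $2 if $field =~ /^([A-Z0-9_]+)::(.*)$/;
        }
        next unless defined $f{HOSTNAME} && defined $f{SERVICEDESC}
                 && defined $f{SERVICEPERFDATA};
        my $key = "$f{HOSTNAME}\t$f{SERVICEDESC}";
        if (my $seen = $merged{$key}) {
            # same host and service: append this record's perfdata
            $seen->{SERVICEPERFDATA} .= " $f{SERVICEPERFDATA}";
        } else {
            $merged{$key} = \%f;
            push @order, $key;    # remember first-seen order
        }
    }
    my @fields = qw(DATATYPE TIMET HOSTNAME SERVICEDESC SERVICECHECKCOMMAND SERVICEPERFDATA);
    return map {
        my $r = $merged{$_};
        join "\t", map { $_ . '::' . $r->{$_} } grep { defined $r->{$_} } @fields;
    } @order;
}

my $common = join "\t", qw(DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542
                           SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp);
print "$_\n" for merge_perfdata(
    "$common\tSERVICEPERFDATA::rta=92.412ms;200.000;500.000;0;",
    "$common\tSERVICEPERFDATA::pl=0%;40;80;;",
);
```

Keying on HOSTNAME plus SERVICEDESC means each service's RRD file is touched once per batch instead of once per metric.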
###Server
Stop npcd: /etc/init.d/npcd stop
Edit the gearmand configuration; the main change is adding -t 4 to start 4 I/O threads
$ cat /etc/sysconfig/gearmand
### Settings for gearmand
# OPTIONS=""
OPTIONS="--listen=127.0.0.1 --port=4730 --log-file=/diskb/pnp4nagios/gearmand/gearmand.log -t 4"
Start it on the default port 4730 with /etc/init.d/gearmand start; it then listens for data sent by clients
$ netstat -an | grep 4730
tcp 0 0 127.0.0.1:4730 0.0.0.0:* LISTEN
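###Worker processes
In the pnp4nagios installation directory, find /home/pnp4nagios/libexec/process_perfdata.pl and change how the worker runs, so that it keeps executing jobs without ever needing a restart. The change is in the new_child function (the "$worker->work( %opt ) while 1;" line below)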
#
# start a new worker process
#
sub new_child {
    my $pid;
    my $sigset;
    my $req = 0;
    # block signal for fork
    $sigset = POSIX::SigSet->new(SIGINT);
    sigprocmask(SIG_BLOCK, $sigset)
        or die "Can't block SIGINT for fork: $!\n";
    die "fork: $!" unless defined ($pid = fork);
    if ($pid) {
        # Parent records the child's birth and returns.
        sigprocmask(SIG_UNBLOCK, $sigset)
            or die "Can't unblock SIGINT for fork: $!\n";
        $children{$pid} = 1;
        $children++;
        return;
    } else {
        # Child can *not* return from this subroutine.
        $SIG{INT} = 'DEFAULT'; # make SIGINT kill us as it did before
        # unblock signals
        sigprocmask(SIG_UNBLOCK, $sigset)
            or die "Can't unblock SIGINT for fork: $!\n";
        my $worker = Gearman::Worker->new();
        my @job_servers = split(/,/, $conf{'GEARMAN_HOST'}); # allow multiple gearman job servers
        $worker->job_servers(@job_servers);
        # the worker registers the function name "perfdata" with the server;
        # jobs for it are handled by main()
        $worker->register_function("perfdata", 2, sub { return main(@_); });
        my %opt = (
            on_complete => sub { $req++; },
            stop_if => sub { if ( $req > $conf{'REQUESTS_PER_CHILD'} ) { return 1;}; }
        );
        print_log("connecting to gearmand '".$conf{'GEARMAN_HOST'}."'",0);
        # ignore the REQUESTS_PER_CHILD setting: keep working and never exit
        $worker->work( %opt ) while 1;
        print_log("max requests per child reached (".$conf{'REQUESTS_PER_CHILD'}.")",1);
        # this exit is VERY important, otherwise the child will become
        # a producer of more and more children, forking yourself into
        # process death.
        exit;
    }
}
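The other change in the script is in the parse_env function. When it parses the data $job_data sent by the client, it base64-decodes it first, but my client does not base64-encode the data (nor encrypt it), so I commented out the decode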
#
# Parse %ENV and return a global hash %NAGIOS
#
sub parse_env {
    my $job_data = shift;
    %NAGIOS = ();
    $NAGIOS{DATATYPE} = "SERVICEPERFDATA";
    if(defined $opt_gm){
        # Gearman Worker
        # my client does not base64-encode the data, so no decoding is needed here
        #$job_data = decode_base64($job_data);
        if($conf{ENCRYPTION} == 1){
            $job_data = $cypher->decrypt( $job_data )
        }
        my @LINE = split(/\t/, $job_data);
        foreach my $k (@LINE) {
            $k =~ /([A-Z 0-9_]+)::(.*)$/;
            $NAGIOS{$1} = $2 if ($2);
        }
        if ( !$NAGIOS{HOSTNAME} ) {
            print_log( "Gearman job data missmatch. Please check your encryption key.", 0 );
            return %NAGIOS;
        }
    }
}
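
An alternative that was not used here would be to keep the decode_base64 call in the worker and have the client encode each record instead. A minimal sketch, assuming encryption stays disabled so that base64 is the only transformation applied; the record below is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Made-up record for illustration; fields are tab-separated KEY::VALUE pairs.
my $record = join "\t", "HOSTNAME::zh_v542", "SERVICEDESC::PING",
                        "SERVICEPERFDATA::pl=0%;40;80;;";

# Client side: an empty second argument keeps encode_base64 from adding newlines.
my $payload = encode_base64($record, '');

# Worker side: what an uncommented decode_base64($job_data) call would recover.
my $decoded = decode_base64($payload);
print $decoded eq $record ? "round trip ok\n" : "round trip failed\n";
```

Either way, the point is that both sides must agree: the client encodes if and only if the worker decodes.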
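Also edit the gearman-related settings in this Perl script's configuration file, /home/pnp4nagios/etc/process_perfdata.cfg. The main points are starting 4 processes as 4 workers, setting the server's IP and port, and disabling encryption (it is enabled by default):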
#
# File with RRA options used to create new RRD files
#
RRA_CFG = /home/pnp4nagios/etc/rra.cfg
# Gearman Worker Config
# Only used when running as Gearman worker
#
# How many child processes
#
PREFORK = 4
#
# Gearman server to connect to
# Comma separated list of Gearman job servers
#
GEARMAN_HOST = localhost:4730
#
# Enables or disables encryption.
# It is strongly advised to not disable encryption, or
# anybody will be able to inject packages to your worker.
# When using encryption, you will have to specify a shared
# secret either via the KEY or the KEY_FILE option.
# Default is 1.
#
ENCRYPTION = 0
With the changes and configuration above in place, start the worker processes by passing the two arguments --gearman --daemon to process_perfdata.pl
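$ cat run.sh
#!/bin/sh
# kill any workers that are still running
ps -ef | grep process_perfdata | grep -v grep | awk '{print $2}' | xargs kill -9
sleep 1
# start the workers as the nagios user, as gearman workers in daemon mode
su - nagios -c "perl /home/pnp4nagios/libexec/process_perfdata.pl --gearman --daemon"
sleep 1
# confirm that the workers are up
ps -ef | grep process_perfdata | grep -v grep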
The complete process_perfdata.pl is in a gist: https://gist.github.com/popozhu/9509554
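###Client
Submit the merged data to the server asynchronously. Because the process_perfdata.pl workers register the function name perfdata with the server, the client must use the same function name, perfdata, when it submits data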
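#!/usr/bin/perl
use strict;
use warnings;
use Gearman::Client;

my $client = Gearman::Client->new;
$client->job_servers("localhost:4730");

my %perfdata;   # filled by the other data processing (omitted); holds the merged data
foreach my $data (keys %perfdata){
    # submit in the background (asynchronously) to the "perfdata" function;
    # this assumes each key of %perfdata is one merged record
    $client->dispatch_background("perfdata", $data);
}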
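You can check progress with /usr/bin/gearadmin --status. The 4 output fields are, in order: the registered function, the number of jobs in the queue, the number of workers currently working, and the total number of workers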
$ /usr/bin/gearadmin --status
perfdata 3061 4 5
.
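
To watch the queue drain from a script rather than by eye, the status text can be parsed. A minimal sketch, assuming whitespace-separated fields in the order described above and a terminating "." line as in the output shown; parse_gearman_status is a hypothetical helper name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse `gearadmin --status` text into a hash keyed by
# function name.
sub parse_gearman_status {
    my ($text) = @_;
    my %status;
    for my $line (split /\n/, $text) {
        last if $line eq '.';          # "." marks the end of the listing
        my ($func, $queued, $running, $workers) = split ' ', $line;
        next unless defined $workers;  # skip malformed lines
        $status{$func} = { queued => $queued, running => $running, workers => $workers };
    }
    return \%status;
}

my $status = parse_gearman_status("perfdata 3061 4 5\n.\n");
printf "%d jobs queued, %d of %d workers busy\n",
    @{ $status->{perfdata} }{qw(queued running workers)};
```

In practice the input would be the captured output of /usr/bin/gearadmin --status instead of the literal string used here.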
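In the end, the 3.4k performance records were written/updated into the RRD files on the SSD by 4 concurrent workers in about 40 seconds, under 1 minute. Done.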
– EOF –