Concurrency in pnp4nagios

March 12 2014

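A few days ago I wrote "pnp4nagios templates". Afterwards I found that processing 35,000 performance-data records with npcd's bulk mode took nearly 30 minutes, while I wanted it done within 1 minute. A different approach was needed, so I switched to gearman mode to process the data concurrently. My notes: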

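Switch to an SSD disk
Add one preprocessing step over the 35,000 source records: merge the records that share the same host and the same servicedesc, which leaves only about 3,400 records
Stop npcd and run gearmand as the server
Start process_perfdata.pl with 4 worker processes, giving 4 workers
Send the 3,400 merged records to the server as client data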

###SSD disk
Disk space is limited, so I reduced the size of the RRD files. I had installed pnp4nagios under /home, so I edited /home/pnp4nagios/etc/rra.cfg to make each RRD file keep at most 90 days of data:

RRA_STEP=60
#
# PNP default RRA config
#
# you will get 400kb of data per datasource
#
# 2880 entries with 1 minute step = 48 hours
#
RRA:AVERAGE:0.5:1:2880
#
# 2880 entries with 5 minute step = 10 days
#
RRA:AVERAGE:0.5:5:2880
#
# 4320 entries with 30 minute step = 90 days
#
RRA:AVERAGE:0.5:30:4320
#
# 5840 entries with 360 minute step = 4 years
#
#RRA:AVERAGE:0.5:360:5840

RRA:MAX:0.5:1:2880
RRA:MAX:0.5:5:2880
RRA:MAX:0.5:30:4320
#RRA:MAX:0.5:360:5840

RRA:MIN:0.5:1:2880
RRA:MIN:0.5:5:2880
RRA:MIN:0.5:30:4320
#RRA:MIN:0.5:360:5840
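
The retention figures in the comments above follow directly from the step size and row counts. The Python sketch below (Python is used here only for illustration) redoes that arithmetic for the three active RRAs and estimates the data size per datasource. The 8 bytes per stored value is an assumption about RRD's on-disk format, the file header is ignored, and all names are hypothetical.

```python
# Sketch: retention and rough size implied by the active RRA lines above.
RRA_STEP = 60  # seconds, from RRA_STEP=60

# (steps per row, rows) for each active RRA:AVERAGE line; MAX and MIN repeat them.
ACTIVE_RRAS = [(1, 2880), (5, 2880), (30, 4320)]
CONSOLIDATION_FUNCTIONS = 3  # AVERAGE, MAX, MIN
BYTES_PER_VALUE = 8  # assumption: one 8-byte double per stored value


def retention_days(steps_per_row, rows, step_seconds=RRA_STEP):
    """Return how many days a single RRA covers."""
    return steps_per_row * rows * step_seconds / 86400


def approx_bytes_per_datasource(rras=ACTIVE_RRAS):
    """Rough data size per datasource, ignoring the RRD header."""
    total_rows = sum(rows for _, rows in rras)
    return total_rows * CONSOLIDATION_FUNCTIONS * BYTES_PER_VALUE


if __name__ == "__main__":
    for steps, rows in ACTIVE_RRAS:
        days = retention_days(steps, rows)
        print(f"{steps}-minute step x {rows} rows = {days:g} days")
    print(f"about {approx_bytes_per_datasource() / 1024:.0f} KiB per datasource")
```

Under these assumptions the three active RRAs cover 2, 10 and 90 days, and the estimate is below the 400kb quoted in the default comment because the 4-year RRA is commented out.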

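###Merging data
Whether npcd calls process_perfdata.pl or process_perfdata.pl runs as a gearman worker, it is this script that actually writes the RRD files. Reducing the amount of source data it has to parse is therefore a natural optimization. For example: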

DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rta=92.412ms;200.000;500.000;0;
DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::pl=0%;40;80;;
DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rtmax=97.281ms;;;;
DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rtmin=86.727ms;;;;

These 4 records share the same HOSTNAME and SERVICEDESC. Ignoring the TIMET timestamp (which may differ between records), they can be merged into 1 record:

DATATYPE::SERVICEPERFDATA TIMET::1378779841 HOSTNAME::zh_v542 SERVICEDESC::PING SERVICECHECKCOMMAND::check_icmp SERVICEPERFDATA::rta=92.412ms;200.000;500.000;0; pl=0%;40;80;; rtmax=97.281ms;;;; rtmin=86.727ms;;;;
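
The merge step can be sketched as follows. This is a minimal Python illustration, not the script I actually ran. It assumes the fields of each record are separated by tabs (parse_env in process_perfdata.pl splits on tabs) and that the first record of each group supplies the TIMET that is kept. The function names parse_record and merge_records are hypothetical.

```python
def parse_record(line):
    """Split one tab-separated perfdata line into a KEY -> value dict."""
    fields = {}
    for token in line.rstrip("\n").split("\t"):
        key, sep, value = token.partition("::")
        if sep:  # skip tokens that are not KEY::VALUE
            fields[key] = value
    return fields


def merge_records(lines):
    """Merge records that share the same HOSTNAME and SERVICEDESC.

    The first record of each group supplies TIMET and the other fields;
    later records only contribute their SERVICEPERFDATA.
    """
    merged = {}
    for line in lines:
        record = parse_record(line)
        group = (record.get("HOSTNAME"), record.get("SERVICEDESC"))
        if group not in merged:
            merged[group] = record
            continue
        extra = record.get("SERVICEPERFDATA", "")
        if extra:
            first = merged[group]
            combined = first.get("SERVICEPERFDATA", "") + " " + extra
            first["SERVICEPERFDATA"] = combined.strip()
    # Rebuild one tab-separated line per group, in first-seen order.
    return ["\t".join(f"{k}::{v}" for k, v in rec.items())
            for rec in merged.values()]
```

Each returned line has the same layout as the merged record shown above, with the perfdata values joined by single spaces.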

###Server side
Stop npcd: /etc/init.d/npcd stop

Edit gearmand's configuration. The main change is adding -t 4 to start 4 I/O threads:

$ cat /etc/sysconfig/gearmand
### Settings for gearmand
# OPTIONS=""
OPTIONS="--listen=127.0.0.1 --port=4730 --log-file=/diskb/pnp4nagios/gearmand/gearmand.log -t 4"

Start it on the default port 4730 with /etc/init.d/gearmand start. It then listens for data sent by clients:

$ netstat -an | grep 4730
tcp        0      0 127.0.0.1:4730              0.0.0.0:*                   LISTEN

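###Worker processes
In the pnp4nagios installation directory, open /home/pnp4nagios/libexec/process_perfdata.pl and change how the worker runs so that it keeps processing jobs without needing to be restarted. The change is in the new_child function (the $worker->work( %opt ) while 1; line below):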

#
# start a new worker process
#
sub new_child {
    my $pid;
    my $sigset;
    my $req = 0;
    # block signal for fork
    $sigset = POSIX::SigSet->new(SIGINT);
    sigprocmask(SIG_BLOCK, $sigset)
        or die "Can't block SIGINT for fork: $!\n";

    die "fork: $!" unless defined ($pid = fork);

    if ($pid) {
        # Parent records the child's birth and returns.
        sigprocmask(SIG_UNBLOCK, $sigset)
            or die "Can't unblock SIGINT for fork: $!\n";
        $children{$pid} = 1;
        $children++;
        return;
    } else {
        # Child can *not* return from this subroutine.
        $SIG{INT} = 'DEFAULT';      # make SIGINT kill us as it did before

        # unblock signals
        sigprocmask(SIG_UNBLOCK, $sigset)
            or die "Can't unblock SIGINT for fork: $!\n";

        my $worker = Gearman::Worker->new();
        my @job_servers = split(/,/, $conf{'GEARMAN_HOST'}); # allow multiple gearman job servers 
        $worker->job_servers(@job_servers);
        # the worker registers the function name "perfdata" with the server; its handler is main()
        $worker->register_function("perfdata", 2, sub { return main(@_); });
        my %opt = ( 
                    on_complete => sub { $req++; }, 
                    stop_if => sub { if ( $req > $conf{'REQUESTS_PER_CHILD'} ) { return 1;}; } 
                  );
        print_log("connecting to gearmand '".$conf{'GEARMAN_HOST'}."'",0);

        # ignore the REQUESTS_PER_CHILD setting: keep working and never exit
        $worker->work( %opt ) while 1;
        print_log("max requests per child reached (".$conf{'REQUESTS_PER_CHILD'}.")",1);
        # this exit is VERY important, otherwise the child will become
        # a producer of more and more children, forking yourself into
        # process death.
        exit;
    }
}

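The other change to the script is in the parse_env function. When it parses the $job_data sent by the client, it first base64-decodes it. My client does not base64-encode the data (nor encrypt it), so I commented out the decoding: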

#
# Parse %ENV and return a global hash %NAGIOS
#
sub parse_env {
    my $job_data = shift;
    %NAGIOS = ();
    $NAGIOS{DATATYPE} = "SERVICEPERFDATA";

    if(defined $opt_gm){
        # Gearman Worker
        # my client does not base64-encode the data, so no decoding is needed here
        #$job_data = decode_base64($job_data);
        if($conf{ENCRYPTION} == 1){
            $job_data = $cypher->decrypt( $job_data )        
        }
        my @LINE = split(/\t/, $job_data);
        foreach my $k (@LINE) {
            $k =~ /([A-Z 0-9_]+)::(.*)$/;
            $NAGIOS{$1} = $2 if ($2);
        }
        if ( !$NAGIOS{HOSTNAME} ) {
            print_log( "Gearman job data missmatch. Please check your encryption key.", 0 );
            return %NAGIOS;
        }
    }
}
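
To make the parsing in the Gearman branch concrete, here is a Python sketch of what it does with one job payload: split on tabs, then match each token against the same KEY::VALUE pattern. It is an illustration of the Perl above, not a replacement for it, and the name parse_job_data is hypothetical.

```python
import re

# Same pattern as the Perl code: an upper-case key, "::", then the value.
FIELD_PATTERN = re.compile(r"([A-Z 0-9_]+)::(.*)$")


def parse_job_data(job_data):
    """Parse a tab-separated job payload into a dict, like %NAGIOS."""
    nagios = {"DATATYPE": "SERVICEPERFDATA"}
    for token in job_data.split("\t"):
        match = FIELD_PATTERN.search(token)
        if not match:
            continue
        key, value = match.group(1), match.group(2)
        # Perl's `if ($2)` treats "" and "0" as false, so both are skipped.
        if value not in ("", "0"):
            nagios[key] = value
    return nagios
```

A payload without a usable HOSTNAME field leaves that key unset, which is the case the Perl code reports as a job data mismatch.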

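Also edit the gearman-related options in the script's configuration file, /home/pnp4nagios/etc/process_perfdata.cfg: run 4 processes as the 4 workers, set the server's IP and port, and disable encryption (it is enabled by default):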

#
# File with RRA options used to create new RRD files
#
RRA_CFG = /home/pnp4nagios/etc/rra.cfg

# Gearman Worker Config
# Only used when running as Gearman worker

#
# How many child processes
#
PREFORK = 4

#
# Gearman server to connect to
# Comma separated list of Gearman job servers
#
GEARMAN_HOST = localhost:4730

#
# Enables or disables encryption.
# It is strongly advised to not disable encryption, or
# anybody will be able to inject packages to your worker.
# When using encryption, you will have to specify a shared
# secret either via the KEY or the KEY_FILE option.
# Default is 1.
#
ENCRYPTION = 0

With these changes and this configuration in place, start the worker processes by passing the two options --gearman --daemon to process_perfdata.pl:

$ cat run.sh
#!/bin/sh

ps -ef | grep process_perfdata | grep -v grep | awk '{print $2}' | xargs kill -9
sleep 1
su - nagios -c "perl /home/pnp4nagios/libexec/process_perfdata.pl --gearman --daemon"
sleep 1
ps -ef | grep process_perfdata | grep -v grep

The complete process_perfdata.pl is in this gist: https://gist.github.com/popozhu/9509554

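###Client side
Send the merged data to the server as asynchronous (background) jobs. Because the process_perfdata.pl workers register the function name perfdata with the server, the client must submit its data under the same function name, perfdata: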

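#!/usr/bin/perl
use strict;
use warnings;
use Gearman::Client;

my $client = Gearman::Client->new;
$client->job_servers("localhost:4730");

# Other data processing goes here; the merged records end up in %perfdata.
my %perfdata;

# Assumes each key of %perfdata is one merged record; if the records are
# stored as values instead, send $perfdata{$data}.
foreach my $data (keys %perfdata){
    # Submit asynchronously to the "perfdata" function the workers registered.
    $client->dispatch_background("perfdata", $data);
}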

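You can check progress with /usr/bin/gearadmin --status. The 4 output fields are, in order: the registered function, the number of jobs in the queue, the number of workers currently working, and the total number of workers.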

$ /usr/bin/gearadmin --status
perfdata        3061    4       5
.
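
If you want to watch the queue drain from a script, the status output can be parsed along these lines. This is a Python sketch that assumes the whitespace-separated, four-column layout shown above, terminated by a line containing a single dot. The names FunctionStatus and parse_status are hypothetical.

```python
from collections import namedtuple

# Field names are my own labels for the four columns.
FunctionStatus = namedtuple("FunctionStatus", "function queued running workers")


def parse_status(output):
    """Parse `gearadmin --status` text into a list of FunctionStatus tuples."""
    statuses = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line == ".":  # the lone dot ends the listing
            continue
        parts = line.split()
        if len(parts) != 4:
            continue  # not a status row
        name, queued, running, workers = parts
        statuses.append(
            FunctionStatus(name, int(queued), int(running), int(workers))
        )
    return statuses
```

For the sample output above, this yields one entry for perfdata with 3061 queued jobs.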

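In the end, the 3,400 merged performance-data records were written to the RRD files on the SSD by 4 concurrent workers in about 40 s, under the 1-minute target. Done.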

– EOF –

Categories: in_lib

Tags: gearman,pnp4nagios