Friday, September 30, 2016

Openstack Mitaka - Changing ceilometer default polling period in pipeline.yaml

I am currently working on a four-node installation of Openstack Mitaka. While testing Ceilometer alarm functionality, I ran into a problem: the alarm state always displayed insufficient data. The four nodes are mgmt01, the storage node running glance and cinder; mgmt02, the control node running nova-manager, keystone, ceilometer, heat, horizon, etc.; and two compute nodes, compute01 and compute02.

By default, Ceilometer telemetry gathers data every 600 seconds, but you can change the interval: 600 setting (in seconds) in pipeline.yaml to a smaller value. Here's a link to the default version of /etc/ceilometer/pipeline.yaml for Openstack Mitaka:

https://gist.github.com/gojun077/8d7b9e8afc22c8f5d5014c883f8c1cf9

On my control node, mgmt02, I made sure to edit this file so that ceilometer would poll gauges every 60 seconds by using interval: 60 in several places throughout the file.
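
The edit described above can be scripted rather than done by hand. The following is a minimal sketch, not the exact procedure I used: the function name set_poll_interval and the sample file contents are made up for illustration, and it runs against a scratch file rather than the real /etc/ceilometer/pipeline.yaml.

```shell
# Hypothetical helper: change every "interval: 600" line in a pipeline file
# to a new value. A .bak copy of the original file is kept next to it.
set_poll_interval() {
    file="$1"
    new="$2"
    # Only lines whose value is exactly 600 are rewritten; indentation is untouched.
    sed -i.bak -E "s/interval:[[:space:]]*600[[:space:]]*\$/interval: ${new}/" "$file"
}

# Demo on a scratch file with made-up contents (not the real pipeline.yaml).
demo=$(mktemp)
printf 'sources:\n    - name: meter_source\n      interval: 600\n    - name: cpu_source\n      interval: 600\n' > "$demo"

set_poll_interval "$demo" 60
grep -n 'interval:' "$demo"
```

On a real node you would point the function at /etc/ceilometer/pipeline.yaml and then restart ceilometer so the new interval is picked up.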

Next I created a new Cirros VM named cirros-test-meter with only an internal network interface:

# openstack server create --image cirros-d160722-x86_64 \
--flavor m1.tiny \
--nic net-id=bc7730ce-80e8-47e1-96e5-c4103ed8e37c cirros-test-meter


To get the UUID of cirros-test-meter:

# openstack server list
...
f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cirros-test-meter | ACTIVE  | private=192.168.95.142

I then created a ceilometer alarm for the cirros vm that would track cpu usage (using ceilometer gauge cpu_util) and trigger an alarm if average cpu utilization stayed above 70% for two consecutive 60-second periods.

# ceilometer alarm-threshold-create --name cpu_high \
--description 'CPU usage high' --meter-name cpu_util \
--threshold 70 --comparison-operator gt --statistic avg \
--period 60 --evaluation-periods 2 --alarm-action 'log://' \
--query resource_id=f3280890-1b60-4a6c-8df5-7195dbb00ca3


Note that since Openstack Liberty, the alarm-action 'log://' will log alarms to /var/log/aodh/notifications.log instead of to /var/log/ceilometer/alarm-notifier.log so don't go looking for alarm logs in the wrong path!

Verify that the alarm cpu_high was created:

# ceilometer alarm-list
...
| Alarm ID                             | Name     | State             | Severity | Enabled | Continuous | Alarm condition                     | Time constraints |
+--------------------------------------+----------+-------------------+----------+---------+------------+-------------------------------------+------------------+
| 23651a53-19cf-4bb0-97e0-09fab14445cd | cpu_high | insufficient data | low      | True    | False      | avg(cpu_util) > 70.0 during 2 x 60s | None             |

Since the alarm was just created, I will have to wait at least two 60 sec periods before the alarm has enough data.


I create high cpu load inside the cirros vm with the following while loop:

while [ 1 ] ; do
  echo $((13**99)) 1>/dev/null 2>&1
done &


This calculates 13 to the 99th power in an infinite loop. You can later kill this process by running top, finding the PID of the /bin/sh process running the above shell command, and killing it with sudo kill -15 PID.

This will immediately start generating 100% cpu load.

Just to make sure, let's see what meters are available for the cirros vm:

# ceilometer meter-list --query \
  resource=f3280890-1b60-4a6c-8df5-7195dbb00ca3

...
| Name                     | Type       | Unit      | Resource ID                          | User ID                          | Project ID                       |
+--------------------------+------------+-----------+--------------------------------------+----------------------------------+----------------------------------+
| cpu                      | cumulative | ns        | f3280890-1b60-4a6c-8df5-7195dbb00ca3 | dfb630234e4e4155871611d5e60dc1d4 | ada4ee7cb446439abbe887601c87c900 |
| cpu.delta                | delta      | ns        | f3280890-1b60-4a6c-8df5-7195dbb00ca3 | dfb630234e4e4155871611d5e60dc1d4 | ada4ee7cb446439abbe887601c87c900 |
| cpu_util                 | gauge      | %         | f3280890-1b60-4a6c-8df5-7195dbb00ca3 | dfb630234e4e4155871611d5e60dc1d4 | ada4ee7cb446439abbe887601c87c900 |
...


You will see that the cpu_util ceilometer gauge exists for cirros-test-meter.

After several minutes had passed, I listed the cpu_util sample values from ceilometer:

# ceilometer sample-list --meter cpu_util --query \
  resource=f3280890-1b60-4a6c-8df5-7195dbb00ca3

+--------------------------------------+----------+-------+---------------+------+----------------------------+
| Resource ID                          | Name     | Type  | Volume        | Unit | Timestamp                  |
+--------------------------------------+----------+-------+---------------+------+----------------------------+
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.75779497  | %    | 2016-09-27T04:56:31.816000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.483886216 | %    | 2016-09-27T04:46:31.852000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 42.2381838593 | %    | 2016-09-27T04:36:31.826000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 4.27827845015 | %    | 2016-09-27T04:26:31.942000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 4.42085822432 | %    | 2016-09-27T04:16:31.935000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 4.41009081847 | %    | 2016-09-27T04:06:31.825000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 3.69494435414 | %    | 2016-09-27T03:56:31.837000 |


But something is not right. Although the latest readings for cpu_util are more than 100% (remember my alarm cpu_high should be triggered if cpu_util > 70%), you will notice that the samples are still being taken 10 minutes apart:

2016-09-27T04:56
2016-09-27T04:46
2016-09-27T04:36
2016-09-27T04:26
2016-09-27T04:16
...
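
Rather than eyeballing the timestamps, the gap between samples can be computed. This is a rough sketch under a few assumptions: the function name sample_gaps is made up, the timestamps are fed in newest first (as sample-list prints them), and they all fall on the same day.

```shell
# Hypothetical helper: print the gap in seconds between consecutive timestamps.
# Input: one timestamp per line, newest first, e.g. 2016-09-27T04:56:31.816000
# Assumes all timestamps are on the same day (only the time part is used).
sample_gaps() {
    awk '{
        split($1, dt, "T")          # dt[2] holds HH:MM:SS.ffffff
        split(dt[2], t, ":")
        secs = t[1] * 3600 + t[2] * 60 + int(t[3])
        if (NR > 1) print prev - secs
        prev = secs
    }'
}

# Three of the timestamps from the sample-list output above
printf '%s\n' \
    2016-09-27T04:56:31.816000 \
    2016-09-27T04:46:31.852000 \
    2016-09-27T04:36:31.826000 | sample_gaps
```

For these three timestamps each gap comes out as 600 seconds, which matches the default polling interval and not the 60 seconds I had configured.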

I definitely edited /etc/ceilometer/pipeline.yaml on my control node mgmt02 so that it reads interval: 60 instead of interval: 600, and then restarted ceilometer on the control node with openstack-service restart ceilometer.

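It turns out that I have to edit /etc/ceilometer/pipeline.yaml on both of my Nova compute nodes as well, because the ceilometer-compute agent that polls the instances runs there and reads its own copy of the file! Running openstack-service status on mgmt02 I get: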

[root@osmgmt02 ~(keystone_admin)]# openstack-service status
MainPID=26342 Id=neutron-dhcp-agent.service ActiveState=active
MainPID=26371 Id=neutron-l3-agent.service ActiveState=active
MainPID=26317 Id=neutron-lbaas-agent.service ActiveState=active
MainPID=26551 Id=neutron-metadata-agent.service ActiveState=active
MainPID=26299 Id=neutron-metering-agent.service ActiveState=active
MainPID=26411 Id=neutron-openvswitch-agent.service ActiveState=active
MainPID=26315 Id=neutron-server.service ActiveState=active
MainPID=26265 Id=openstack-aodh-evaluator.service ActiveState=active
MainPID=26455 Id=openstack-aodh-listener.service ActiveState=active
MainPID=26508 Id=openstack-aodh-notifier.service ActiveState=active
MainPID=19412 Id=openstack-ceilometer-central.service ActiveState=active
MainPID=19416 Id=openstack-ceilometer-collector.service ActiveState=active
MainPID=19414 Id=openstack-ceilometer-notification.service ActiveState=active
MainPID=26577 Id=openstack-gnocchi-metricd.service ActiveState=active
MainPID=26236 Id=openstack-gnocchi-statsd.service ActiveState=active
MainPID=26535 Id=openstack-heat-api.service ActiveState=active
MainPID=26861 Id=openstack-heat-engine.service ActiveState=active

(Note: I am missing heat-api-cfn.service, which is necessary for autoscaling with Heat templates) 
MainPID=26781 Id=openstack-nova-api.service ActiveState=active
MainPID=26753 Id=openstack-nova-cert.service ActiveState=active
MainPID=26691 Id=openstack-nova-conductor.service ActiveState=active
MainPID=26316 Id=openstack-nova-consoleauth.service ActiveState=active
MainPID=26603 Id=openstack-nova-novncproxy.service ActiveState=active
MainPID=26702 Id=openstack-nova-scheduler.service ActiveState=active


On both of my Nova compute nodes, compute01 and compute02, openstack-service status returns:

MainPID=845 Id=neutron-openvswitch-agent.service ActiveState=active
MainPID=812 Id=openstack-ceilometer-compute.service ActiveState=active
MainPID=822 Id=openstack-nova-compute.service ActiveState=active


The compute nodes are running Neutron OVS agent, ceilometer-compute, and nova-compute services.
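
Because ceilometer-compute runs on every compute node, the pipeline.yaml change and the service restart have to be repeated on each of them. Here is a hedged sketch of doing that in a loop. It assumes root SSH access to compute01 and compute02, and the function name update_compute_nodes and the DRY_RUN switch are my own inventions. It only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run by default: print the commands instead of running them.
# Set DRY_RUN=0 to really execute (assumes root SSH access to each node).
DRY_RUN="${DRY_RUN:-1}"
COMPUTE_NODES="compute01 compute02"

update_compute_nodes() {
    # Same edit as on the control node, followed by a ceilometer restart.
    remote_cmd="sed -i.bak 's/interval: 600\$/interval: 60/' /etc/ceilometer/pipeline.yaml && openstack-service restart ceilometer"
    for node in $COMPUTE_NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh root@$node \"$remote_cmd\""
        else
            ssh "root@$node" "$remote_cmd"
        fi
    done
}

update_compute_nodes
```

Review the printed commands first, and only then run again with DRY_RUN=0.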

Once I edit pipeline.yaml on compute01 and compute02 and restart ceilometer on each Nova node, my cpu_high alarm finally gets triggered:

# ceilometer alarm-list
+--------------------------------------+----------+-------+----------+---------+------------+-------------------------------------+------------------+
| Alarm ID                             | Name     | State | Severity | Enabled | Continuous | Alarm condition                     | Time constraints |
+--------------------------------------+----------+-------+----------+---------+------------+-------------------------------------+------------------+
| 23651a53-19cf-4bb0-97e0-09fab14445cd | cpu_high | alarm | low      | True    | False      | avg(cpu_util) > 70.0 during 2 x 60s | None             |
+--------------------------------------+----------+-------+----------+---------+------------+-------------------------------------+------------------+

And you can also see that the cpu_util samples are now taken at one-minute intervals:

# ceilometer sample-list --meter cpu_util --query \
  resource=f3280890-1b60-4a6c-8df5-7195dbb00ca3 -l 10

+--------------------------------------+----------+-------+---------------+------+----------------------------+
| Resource ID                          | Name     | Type  | Volume        | Unit | Timestamp                  |
+--------------------------------------+----------+-------+---------------+------+----------------------------+
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.504582533 | %    | 2016-09-27T06:45:50.867000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.478356119 | %    | 2016-09-27T06:44:50.880000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.424959609 | %    | 2016-09-27T06:43:50.945000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.431860726 | %    | 2016-09-27T06:42:50.872000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.486230975 | %    | 2016-09-27T06:41:50.881000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.458524549 | %    | 2016-09-27T06:40:50.873000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.540472746 | %    | 2016-09-27T06:39:50.878000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.494100804 | %    | 2016-09-27T06:38:50.940000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.499692501 | %    | 2016-09-27T06:37:50.869000 |
| f3280890-1b60-4a6c-8df5-7195dbb00ca3 | cpu_util | gauge | 102.415145802 | %    | 2016-09-27T06:36:50.868000 |
+--------------------------------------+----------+-------+---------------+------+----------------------------+
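
The takeaway from this exercise is that the polling interval should not be longer than the alarm period, otherwise some evaluation windows contain no samples at all. A small sanity check along those lines might look like the sketch below. The function name check_alarm_period is hypothetical, and the rule it encodes is my reading of the behavior seen above, not something taken from the Ceilometer or Aodh documentation.

```shell
# Hypothetical check: compare the polling interval (seconds) with the alarm
# period (seconds) and warn when samples arrive less often than the alarm
# evaluates them.
check_alarm_period() {
    interval="$1"
    period="$2"
    if [ "$interval" -gt "$period" ]; then
        echo "WARN: interval ${interval}s > period ${period}s; some evaluation windows will have no samples"
        return 1
    fi
    echo "OK: interval ${interval}s fits within period ${period}s"
}

check_alarm_period 600 60 || true   # my original setup on the compute nodes
check_alarm_period 60 60            # after editing pipeline.yaml everywhere
```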