WWW.DEVCO.NET Aggregating Nagios Checks With MCollective (3.7.2010, 15:12)

A very typical scenario I come across on many sites is the requirement to monitor something like Puppet across 100s or 1000s of machines.

The typical approaches are to add a central check on your puppet master, or to check using NRPE or NSCA on every node. For this example the option exists to check on the master and get a single check, but that isn’t always easily achievable.

Think for example about monitoring mail queues on all your machines to make sure things like root mail aren’t getting stuck. In those cases you are forced to do per-node checks, which inevitably result in huge notification storms if your mail server is down and not receiving mail from the many nodes.

MCollective has had a plugin that can run NRPE commands for a long time. I’ve now added a Nagios plugin that uses this agent to combine results from many hosts.

Sticking with the Puppet example, here are my needs:

  • I want to know if anywhere some puppet machine isn’t successfully doing runs.
  • I want to be able to do puppetd --disable and not get alerts for those machines.
  • I do not want to change any configs when I am adding new machines, it should just work.
  • I want the ability to do monitoring on subsets of machines on different probes.

This is a pretty painful set of requirements for Nagios to achieve on its own, but it is easy with the help of MCollective.

Ultimately, I just want this:

OK: 42 WARNING: 0 CRITICAL: 0 UNKNOWN: 0

Meaning 42 machines – only ones currently enabled – are all running happily.
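
As a rough sketch of the aggregation idea (not the actual plugin code), the following counts check results for enabled hosts only and produces the one-line summary above. The function name, the input shape and the host names are assumptions made for illustration.

```python
from collections import Counter

# Standard Nagios plugin exit codes and their names.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def summarize(results, enabled):
    """Count check states for enabled hosts only.

    results: dict of hostname -> NRPE exit code (hypothetical input shape)
    enabled: set of hostnames where Puppet is currently enabled
    """
    counts = Counter({name: 0 for name in STATES.values()})
    for host, code in results.items():
        if host not in enabled:
            continue  # disabled hosts never contribute to the summary
        # Any exit code outside 0-3 is treated as UNKNOWN.
        counts[STATES.get(code, "UNKNOWN")] += 1
    return " ".join(f"{name}: {counts[name]}" for name in STATES.values())


if __name__ == "__main__":
    # db1 is failing but disabled, so it is left out of the counts.
    results = {"web1": 0, "web2": 0, "db1": 2}
    print(summarize(results, enabled={"web1", "web2"}))
```

Because disabled hosts are filtered out before counting, a machine that is disabled on purpose cannot turn the aggregate check red.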

The NRPE Check

We put the NRPE logic on every node. A simple check command in /etc/nagios/nrpe.d/check_puppet_run.cfg:

command[check_puppet_run]=/usr/lib/nagios/plugins/check_file_age -f /var/lib/puppet/state/state.yaml -w 5400 -c 7200

In my case I just want to know that successful runs are happening. If I wanted to know that the code is actually compiling correctly, I’d monitor the local cache age and size.
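
To make the thresholds in the NRPE command concrete, here is a minimal sketch of the same idea in Python. It is not the check_file_age plugin itself: the function name and message text are made up, and treating a missing file as critical is an assumption.

```python
import os
import time


def check_file_age(path, warn=5400, crit=7200, now=None):
    """Return (nagios_exit_code, message) based on the age of path.

    warn and crit are ages in seconds, mirroring -w 5400 -c 7200.
    """
    now = time.time() if now is None else now
    try:
        age = int(now - os.stat(path).st_mtime)
    except OSError:
        # Assumption: a missing state file is treated as critical.
        return 2, f"CRITICAL: {path} not found"
    if age > crit:
        return 2, f"CRITICAL: {path} is {age} seconds old"
    if age > warn:
        return 1, f"WARNING: {path} is {age} seconds old"
    return 0, f"OK: {path} is {age} seconds old"
```

With these values, a state file untouched for more than 90 minutes produces a warning and one untouched for more than two hours produces a critical.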

Determining if Puppet is enabled or not

Currently this is a bit hacky; I’ve filed tickets with Puppet Labs to improve this. The way to determine if puppet is disabled is to check whether the lock file exists and is 0 bytes. If it’s not zero bytes, either a puppetd is currently doing a run, in which case there will be a pid in it, or the puppetd crashed and left a stale pid that prevents other runs.

To automate this and integrate into MCollective I’ve made a fact puppet_enabled. We’ll use this in MCollective discovery to only monitor machines that are enabled. Get this onto all your nodes perhaps using Plugins in Modules.
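
The lock file rule described above can be sketched as follows. This is an illustration in Python rather than the actual fact; the function name and the returned labels are hypothetical, and the lock file path is left as a parameter because it is not given here.

```python
import os


def puppet_lock_state(lock_path):
    """Classify the Puppet lock file at lock_path.

    Returns one of three hypothetical labels:
      "enabled"          - no lock file exists
      "disabled"         - lock file exists and is 0 bytes
      "running_or_stale" - lock file holds a pid (active run or crash leftover)
    """
    try:
        size = os.stat(lock_path).st_size
    except FileNotFoundError:
        return "enabled"
    return "disabled" if size == 0 else "running_or_stale"
```

Telling an active run apart from a stale pid would need an extra check that the pid in the file still belongs to a live process, which this sketch leaves out.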

The MCollective Agent

You want to deploy the MCollective NRPE Agent to all your nodes. Once you’ve got it right you can test it easily using something like this:

% mc-nrpe -W puppet_enabled=1 check_puppet_run
 
 * [ ============================================================> ] 47 / 47
 
Finished processing 47 / 47 hosts in 395.51 ms
              OK: 47
         WARNING: 0
        CRITICAL: 0
         UNKNOWN: 0

Note we’re restricting the run to only enabled hosts.

Integrating into Nagios

The last step is to add this to Nagios. I create SSL certs and a specific client configuration for Nagios and put these in its home directory.

The check-mc-nrpe plugin works best with Nagios 3, as it returns subsequent lines of output indicating which machines are in what state, so you get the details hidden behind the aggregation in alerts. It also outputs performance data for the total number of nodes, each status, and how long the check took.
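
For readers unfamiliar with Nagios 3 multi-line output, here is a hedged sketch of what such a result could look like: a summary line, performance data after a pipe, then one extra line per non-OK state. The performance data labels and the function name are assumptions, not the plugin's real output.

```python
STATE_NAMES = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]


def format_plugin_output(states, elapsed_seconds):
    """Build Nagios 3 style plugin output.

    states: dict of hostname -> one of STATE_NAMES (hypothetical input shape)
    elapsed_seconds: how long the aggregated check took
    """
    by_state = {name: [] for name in STATE_NAMES}
    for host in sorted(states):
        by_state[states[host]].append(host)

    summary = " ".join(f"{name}: {len(by_state[name])}" for name in STATE_NAMES)
    # Perfdata uses label=value pairs; these label names are made up.
    perfdata = " ".join(
        [f"total={len(states)}"]
        + [f"{name.lower()}={len(by_state[name])}" for name in STATE_NAMES]
        + [f"checktime={elapsed_seconds:.2f}s"]
    )
    # Long output: list the hosts behind every non-OK count.
    details = [
        f"{name}: {', '.join(by_state[name])}"
        for name in STATE_NAMES[1:]
        if by_state[name]
    ]
    return "\n".join([f"{summary} | {perfdata}"] + details)
```

Nagios shows the first line as the status text and keeps the following lines as long output, which is where the per-host detail ends up in alerts.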

The nagios command would be something like this:

define command{
        command_name                    check_mc_nrpe
        command_line                    /usr/sbin/check-mc-nrpe  --config /var/log/nagios/.mcollective/client.cfg  -W $ARG1$ $ARG2$
}

Truncated by Planet-DevOps, read more at the original (another 3018 bytes)

Link
Agile Web Operations Empower Your Team – You Won’t Regret It (29.6.2010, 20:12)

Image by _dougie



It’s hard to find the right structure for any organization. A lot of existing management wisdom comes from a time when you had to organize a physical work force. However, with today’s “knowledge workers” those structures don’t work as nicely anymore.

Everyone needs to prioritize

Every developer or devops has to prioritize the current work – nearly constantly. Do I work on a new feature next? Shall I answer those emails first? My box needs an upgrade – is now the right time for it? You hardly find a pointy haired boss standing behind you dictating your every step (hopefully ;-) ). Because of this, it’s extremely important to ensure that everyone shares the same goals and has all information needed to decide what’s most important right now. If only an archaic elite circle of managers know the whys and whats of a project and just delegate tasks, bad decisions by the executing team are inevitable.

Empowerment leads to better results, faster

Giving your team the responsibility for the success of their work is the next logical step. Letting the team know all the surrounding conditions and letting them add to the task list produces the best picture of the work to be done that you can get. I see it way too often that the technical people in a team know exactly what needs to be done but do not feel they have the power to voice their concerns. Of course, concerns like “we need to rethink this module as it will not scale as it is now” have to be taken into account when prioritizing the backlog. If you ignore them for too long, you’ll see two bad effects:

  1. You run into serious technical issues
  2. You stress out your best and most dedicated people

Both effects can and should be avoided by choosing a fitting organizational structure: Empowerment of your best people.

Link
morethanseven Python: What To Use? (28.6.2010, 23:00)

My friend Jamie Rumbelow has started a new project and decided to use Python. He asked a great question over on Stack Overflow which basically came down to: what should I use for my first proper Python web application project? After a quick prompting on Twitter I decided to have a go. I’ve cross-posted my answer below, mostly because it took as long as a typical blog post to write.

Frameworks

OK, so I’m a little biased here as I currently make extensive use of Django and organise the Django User Group in London so bear that in mind when reading the following.

Start with Django because it’s a great gateway drug. Lots of documentation and literature, a very active community of people to talk to and lots of example code around the web.

That’s a completely non-technical reason. Pylons is probably purer in terms of Python philosophy (being much more a collection of discrete bits and pieces) but lots of the technical stuff is personal preference, at least until you get into Python more. Compare the very active Django tag on Stack Overflow with that of pylons or turbogears though and I’d argue getting started is simply easier with Django irrespective of anything to do with code.

Personally I default to Django, but increasingly often I opt for simpler micro frameworks (think Sinatra rather than Rails). There are lots of things to choose from (good list here). I tend to use MNML (because I wrote parts of it and it’s tiny) but others are actively developed. I tend to do this for small, stupid web services which are then strung together with a Django project in the middle serving people.

Worth noting here is appengine. You have to work within its limitations and it’s not designed for everything, but it’s a great way to just play with Python and get something up and working quickly. It makes a great testbed for learning and experimentation.

Mongo/ORM

On the MongoDB front you’ll likely want to look at the basic Python mongo library first to see if it has everything you need. If you really do want something a little more ORM-like then mongoengine might be what you’re looking for. A bunch of folks are also working on making Django integrate more seamlessly with NoSQL backends. Some of that is for future Django releases, but django-nonrel has code now.

For relational data SQLAlchemy is good if you want something standalone. Django’s ORM is also excellent if you’re using Django.

API

The most official OAuth library is python-oauth2, which handily has a Django example as part of its docs.

Piston is a Django app which provides lots of tools for building APIs. It has the advantage of being pretty active and well maintained and in production all over the place. Other projects exist too, including Dagny which is an early attempt to create something akin to RESTful resources in Rails.

In reality any Python framework (or even just raw WSGI code) should be reasonably good for this sort of task.
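
To show how little raw WSGI needs for a small API endpoint, here is a minimal sketch using only the standard library. It assumes Python 3, and the JSON payload and the name `app` are invented for illustration.

```python
import json


def app(environ, start_response):
    """A bare WSGI application that answers every request with JSON."""
    payload = {"path": environ.get("PATH_INFO", "/"), "status": "ok"}
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # WSGI expects an iterable of byte strings.
    return [body]
```

To try it locally, pass `app` to `wsgiref.simple_server.make_server("localhost", 8000, app)` and call `serve_forever()` on the result. Any of the frameworks mentioned here does essentially this underneath, plus routing and other conveniences.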

Testing

Python has unittest as part of its standard library, and unittest2 is in Python 2.7 (but backported to previous versions too). Some people also like Nose, which is an alternative test runner with some additional features. Twill is also nice; it’s “a simple scripting language for Web browsing”, so handy for some functional testing. Freshen is a port of Cucumber to Python. I haven’t yet gotten round to using it in anger, but a quick look now suggests it’s much better than when I last looked.

I actually also use Ruby for high level testing of Python apps and apis because I love the combination of celerity and cucumber. But I’m weird and get funny looks from other Python people for this.
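
For anyone who has not used the standard library unittest module mentioned above, a minimal sketch looks like this. The `slugify` function is a made-up example to have something to test, and the code assumes Python 3.

```python
import unittest


def slugify(title):
    """Hypothetical function under test: lower-case and hyphenate a title."""
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Python What To Use"), "python-what-to-use")

    def test_empty_title(self):
        self.assertEqual(slugify(""), "")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project you would normally drop `run_tests` and let `python -m unittest` discover and run the test case instead.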

Message Queues

For a message queue, whatever language I’m using, I now always use RabbitMQ

Truncated by Planet-DevOps, read more at the original (another 1153 bytes)

Link
WWW.DEVCO.NET Tutorial: Writing MCollective Agents (27.6.2010, 15:41)

I’ve recorded a screencast that walks you through the process of developing a SimpleRPC Agent, giving it a DDL, and writing a simple client to communicate with it.

The tutorial creates a small echo agent that takes input and returns it unmodified. It validates that you are sending a string and includes a sample of dealing with intermittent failure.

Once you’ve watched this, or even during, you can use the following links as reference material: Writing Agents, Data Definition Language and Writing Clients.

You can view it directly on blip.tv which will hopefully be better quality.

I used a few VIM Snippets during the demo to boilerplate the agent and DDL. You’ll find these in the ext/vim directory of the tarball for the upcoming 0.4.7 release; they are already on GitHub too.

Link
Kitchen Soap Ops Meta-Metrics: Velocity 2010 Slides (25.6.2010, 04:05)

As expected, Velocity was excellent this year. What an awesome time to be in this field.

Caveat for those who didn’t see/hear my talk: the graphs and numbers in the slides are, for the most part, made up. But they’re also in line with what I’ve seen at Flickr and Etsy.

Link
WWW.DEVCO.NET MCollective Data Definition Language (24.6.2010, 23:08)

I mentioned the DDL in my recent post about the MCollective Road Map.

The DDL is used to describe agents in a way that is accessible by other programs, web applications, client libraries and so forth to help those various client tools to configure themselves correctly.

An actual example of a DDL file can be found here if you want to have a good look at it and full docs here.

I’ve created a short video showing the DDL and some of the features of the upcoming 0.4.7 release. You probably want to view it full screen to really see what’s going on.

And a quick note about the colors, I know people tend to feel strongly about this kind of thing, you can disable them in the config file of the client :)

This is also my first attempt at using blip.tv, please let me know if you see any problems.

Link
Agile Web Operations Size Matters – Why You Should Prefer Small User Stories (22.6.2010, 20:26)

Image by couchlearner



If you have a lot of big user stories, your velocity will jump up and down wildly. This makes it extremely difficult to tell when a user story will be done. Breaking down your huge user stories into smaller ones will help you smooth the flow and give you a clearer picture.

User Stories Start Big

Often you have an idea for a new feature. There is a vague picture in your mind of how it could look and work. You put it on the backlog to be able to estimate and prioritize it. Teams new to agile often make the mistake of keeping it that way. They assign it a huge number of story points and start hacking away.

Big User Stories Are Abstract

The problem with this approach is that you skip an important design step in your process. A rough feature idea needs further breakdown to find out exactly what it should look like. By doing that you will replace the big, abstract story with a lot of smaller, more concrete ones. If your feature idea was “mobile support” you might start breaking it down into stories like “show my location”, “show nearby friends”, etc. These stories will replace the “mobile support” story. Now it might be possible to estimate those stories. Usually the sum of story points for the descendant stories is bigger than the very vague estimate for the original user story.

Big User Stories Block Your Flow

Another issue with keeping user stories too big is that they will block your flow. If you use one-week iterations, a big story might span multiple iterations. That means you deliver 0 story points for a couple of weeks and then suddenly earn all 40 or so points for that one huge story. Unfortunately, even after quite some weeks you have no idea what velocity you can sustain. Dividing the 40 story points by the number of iterations will not cut it.

Only if you use stories small enough to complete within one iteration will you get a reasonable value for your velocity, iteration after iteration. Only then will you be able to tell when stories could be finished in the future.
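
The effect on velocity is easy to see with a little arithmetic. The sketch below uses made-up numbers: one 40-point story finished in the fourth iteration versus eight 5-point stories finished two per iteration. For simplicity the totals are kept equal, even though a real breakdown usually adds up to more points.

```python
from statistics import mean, pstdev


def velocities(stories, iterations):
    """Sum completed story points per iteration.

    stories: list of (points, iteration_completed) tuples, 1-based iterations
    """
    per_iteration = [0] * iterations
    for points, done_in in stories:
        per_iteration[done_in - 1] += points
    return per_iteration


# Hypothetical backlog: one huge story versus the same work in small pieces.
big = velocities([(40, 4)], iterations=4)
small = velocities([(5, i) for i in (1, 1, 2, 2, 3, 3, 4, 4)], iterations=4)

if __name__ == "__main__":
    for label, series in (("big", big), ("small", small)):
        print(label, series, "mean:", mean(series), "spread:", round(pstdev(series), 1))
```

Both series average 10 points per iteration, but only the small-story series shows that number every week. The big-story series reads 0, 0, 0, 40, so the average says nothing useful until the story finally lands.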

Truncated by Planet-DevOps, read more at the original (another 2174 bytes)

Link
Everything is a Freaking DNS problem A parallel universe (21.6.2010, 19:19)

What happens when you mention Open Office and Firewall in one sentence, in public?

People start actually building it (French Article)

Then add to that list that there are also people out there who think that running MySQL over NFS provides them High Availability, or that using DNS Round Robin will provide them a scalable setup.

So yes .. apparently there is indeed a parallel universe out there.

And no .. I don't want to see Webmin in any Appliance .. that is a joke..., or rather a rant ..

Link
Everything is a Freaking DNS problem Inuits Day (21.6.2010, 19:14)

A couple of Fridays ago we had one of our @Inuits days again. Rather than having some people give talks and presentations about what they have been doing for the past couple of months, this time we set out to research, test, and build stuff.

We split up into 3 different groups: one focusing on CI and testing freshly built stuff with cucumber, and a second one that set up and tested Galera.

We set up a 3 node Galera cluster, though not as smoothly as we'd have liked.

Our first bump was that the installation of the package on CentOS is hell: it needs manual interaction such as replacing packages. Deploying this from a repository is probably not going to be a straightforward option.

Galera only takes care of replicating data. Just as with MySQL MM replication, there is still a need for an external tool to define where to access the database, and to implement monitoring in such a way that you are connecting to an up-to-date database.

Karl started wondering about Galera's locking. It turns out the locks aren't cluster-wide, although locks within the same node work fine. So if Galera is used solely for HA with 1 active node and X failover nodes (so all transactions happen on 1 node), it will work.

We also ran into some issues when trying to start a node which couldn't contact the wsrep_cluster_address point (a node it will sync from at startup if specified in the wsrep.cnf file): it just didn't want to start. This means that when the referenced node (configured in wsrep_cluster_address) is down, you will need to comment it out before you are able to start the MySQL server.

The fact that Galera replicates everything brought us to the discussion of whether we really wanted that, or whether we wanted more fine-grained control over which databases or even tables to replicate. A minority of people wanted to replicate everything; the majority of our group wanted finer-grained control over what is being replicated to another node.

I'm sure Lefred will shortly be writing about the progress his group made on Banquise.

The day ended as it should .. with BBQ and plenty of drinks

Link