Recently in Joyent Category

Checking OpenSolaris physical memory inside a zone

Joyent uses OpenSolaris zones for its accelerators. At some point I needed to verify the physical memory size of one of these zones but was unable to use the webmin tool that Joyent provides. This seemingly simple operation was actually pretty tricky to figure out. Here are the steps I followed:

 
$ sudo rcapadm -E
                                      state: enabled
           memory cap enforcement threshold: 0%
                    process scan rate (sec): 15
                 reconfiguration rate (sec): 60
                          report rate (sec): 5
                    RSS sampling rate (sec): 5
$ rcapstat -z 1 1
    id zone            nproc    vm   rss   cap    at avgat    pg avgpg
    46 foo            -    0K    0K 2048M    0K    0K    0K    0K
$ sudo rcapadm -D
                                      state: disabled
           memory cap enforcement threshold: 0%
                    process scan rate (sec): 15
                 reconfiguration rate (sec): 60
                          report rate (sec): 5
                    RSS sampling rate (sec): 5
$ 

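rcapadm administers the resource capping daemon (rcapd), and rcapstat reports what that daemon sees. The cap column (2048M here) is the zone’s physical memory cap. I’m not sure if there’s an impact to leaving rcapd running, but I disabled it just in case.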
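If you only want the cap itself, a small filter along these lines might do. This is a sketch: zone_caps is a made-up helper name, and it assumes the column order shown in the output above, where the cap is the sixth field.

```shell
# zone_caps: read `rcapstat -z` style output on stdin and print
# "<zone> <cap>" for each data row. Assumes the column order shown
# above (id, zone, nproc, vm, rss, cap, ...); the helper name is made up.
zone_caps() {
    # Data rows start with a numeric zone id; the header row does not.
    awk '$1 ~ /^[0-9]+$/ { print $2, $6 }'
}

# Example usage (rcapd has to be enabled first, as above):
#   rcapstat -z 1 1 | zone_caps
```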

Zeus ZXTM Upgrade Cannot fork: Not enough space

While upgrading one of my Zeus ZXTM traffic managers from v5.1r2 to v6.0r4 it crashed on startup. This was pretty surprising because the upgrade process appeared to have proceeded without a hitch. Here’s the error I saw in the logs and when I attempted to fire up the zxtm program using SSH:

$ ./start-zeus
Initializing Zeus Application Framework. (C) 1995 - 2010 Zeus Technology Limited
Zeus Administration Server already running: 235
Zeus Traffic Manager - (C) 1995 - 2010 Zeus Technology Limited
Version 6.0r4, Build date: Feb 10 2010 08:32:37
Process permissions set to zeus:zeus
 INFO   Zeus Traffic Manager starting
 INFO   Version 6.0r4, Build date: Feb 10 2010 08:32:37
 FATAL Parent 1234 hit FATAL at Cannot fork:Not enough space
[0x6ac417] function __1cOcommkeyChanged6FpknNConfigSection_rknKStringBase_pknKConfigFile_p6_nIRetValue__ + 0x417
[0x8d4376] function __1cFFATAL6Fpkc1i_v_ + 0x66
[0x6ad2b3] function __1cUreally_nice_shutdown6F_v_ + 0x883
[0x6b3a3b] function __1cKParentBoot6Fpkc_v_ + 0xa8b
[0x5e83c1] function main + 0x571
[0x5c532c] function _start + 0x6c
[0x0] function ?? + 0xffffffffffa3ad40
$

Not enough space? Something was seriously amiss. The solution turned out to be pretty simple, but was not something I could find in the manual.

In v6.0, a new configuration parameter was added: sharedpoolsize. It was not set because I was upgrading from v5.1. The ZXTM made its best guess, but it guessed wrong in my virtualized environment and picked a value that exceeded the memory available to my zone. The fix was simple: set sharedpoolsize in $ZEUSHOME/zxtm-6.0r4/conf/settings.cfg to a size small enough to fit into my available memory. Since this setting did not exist yet, I had to add it at the bottom of the file.
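Adding the line by hand is easy enough, but here is a rough sketch of doing it idempotently from a script. The ensure_setting helper is hypothetical, the 256MB value is purely illustrative, and the whitespace-separated "key value" layout is an assumption, so check the existing entries in your own settings.cfg first.

```shell
# ensure_setting FILE KEY VALUE
# Append "KEY VALUE" to FILE unless a line starting with KEY already exists.
# Assumption: the file uses whitespace-separated "key value" lines.
ensure_setting() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}[[:space:]]" "$file" 2>/dev/null; then
        echo "$key is already set in $file" >&2
        return 0
    fi
    printf '%s %s\n' "$key" "$value" >> "$file"
}

# Example (the value is a placeholder, size it to fit your zone):
#   ensure_setting "$ZEUSHOME/zxtm-6.0r4/conf/settings.cfg" sharedpoolsize 256MB
```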

GoDaddy SSL Certificate in Zeus ZXTM Traffic Managers

Almost a year ago I purchased a wildcard SSL certificate from GoDaddy. When it was first issued I simply loaded it into my Zeus ZXTM load balancer with the import button.

cert1.jpg

Everything seemed fine for quite a while. I visited my web site in Firefox and in all of the flavors of IE. It seemed to work great. That is, until someone called me to let me know that Safari was not accepting the certificate! I thought I was in trouble until I googled around and found many blog entries about the root cause. It turns out that my server, or in this case my Zeus ZXTM load balancer, was not configured to send the whole certificate chain back to the root authority.

This makes me wonder how this worked at all in every other web browser. Perhaps this is such a common problem that the other browsers hack around it?

Anyway, the fix was easy enough but the terminology was different. Rather than an SSLCertificateChainFile, my ZXTM called it an ‘Intermediate Certificate’. One click of a button, browsing to gd_bundle.crt (provided with my original certificate), and it was loaded up and the issue was fixed.

cert2.jpg
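One way to check what a server actually sends, without depending on any particular browser, is to count the certificates in the output of openssl s_client. The sketch below assumes openssl is installed; count_certs is a made-up helper name and www.example.com is a placeholder host.

```shell
# count_certs: count the PEM certificates in `openssl s_client -showcerts`
# output read from stdin. A count of one suggests the server sends only
# its own certificate; two or more suggests intermediates are included.
count_certs() {
    grep -c -- '-----BEGIN CERTIFICATE-----'
}

# Example (placeholder host name):
#   openssl s_client -connect www.example.com:443 -showcerts </dev/null 2>/dev/null | count_certs
```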

I ran into this very cryptic error while setting up Nagios at Joyent. I copied my plugins from one nrpe client to a new server. Three of my checks used check_procs, and all of them failed with a message like this:

check_procs
System call sent warnings to stderr: pst3: This program can only be run by the root user!

To make this even more annoying, sudo did not fix it. The same error message was displayed. What was the problem? File ownership! The error message should say “This program must be owned by the root user!” The fix:

sudo chown root:root pst3
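To confirm who owns the binary before and after the fix, something like the following could be used. It is only a sketch: both helper names are made up, and it relies on the third field of `ls -ln` being the owner's numeric uid.

```shell
# file_owner_uid FILE: print the numeric uid that owns FILE.
# Relies on the third field of `ls -ln` being the owner's uid.
file_owner_uid() {
    ls -lnd -- "$1" | awk '{ print $3 }'
}

# check_root_owned FILE: succeed only if FILE is owned by uid 0 (root).
check_root_owned() {
    [ "$(file_owner_uid "$1")" = "0" ]
}

# Example:
#   check_root_owned pst3 || echo "pst3 is not owned by root"
```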

I'm using Nagios to monitor some services on my Solaris 10 systems hosted at Joyent. Until now I've just been using check_http to monitor everything that I cared about. Times change, though, and now I need to monitor disk space, free memory, and cpu load on many systems. I like to keep things simple, so I decided that it's time to install NRPE.

Building Nagios 3 and the other plugins was a breeze so I figured that this would be no problem. I downloaded NRPE and did the typical install steps. This is what I saw:

$  ./configure
... lots of configure output ...
$  gmake all
cd ./src/; gmake ; cd ..
gmake[1]: Entering directory `/home/eng/nrpe-2.12/src'
gcc -g -O2 -I/usr/local/include/openssl -I/usr/local/include -DHAVE_CONFIG_H -o nrpe nrpe.c utils.c -L/usr/local/lib  -lssl -lcrypto -lnsl -lsocket  ./snprintf.o 
nrpe.c: In function `get_log_facility':
nrpe.c:617: error: `LOG_AUTHPRIV' undeclared (first use in this function)
nrpe.c:617: error: (Each undeclared identifier is reported only once
nrpe.c:617: error: for each function it appears in.)
nrpe.c:619: error: `LOG_FTP' undeclared (first use in this function)
gmake[1]: *** [nrpe] Error 1
gmake[1]: Leaving directory `/home/eng/nrpe-2.12/src'

*** Compile finished ***

If the NRPE daemon and client compiled without any errors, you
can continue with the installation or upgrade process.

Read the PDF documentation (NRPE.pdf) for information on the next
steps you should take to complete the installation or upgrade.

Eeek! That sure is an ugly error. At first I assumed that this was a configuration issue, but that should have come up during the ./configure. I ended up doing what you're never supposed to do: I hacked the code. The rest of the installation went by the book.

Just go into src/nrpe.c and delete the only two references to LOG_AUTHPRIV and LOG_FTP. In v2.12 I found them in the middle of an if-else series.
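To locate the offending lines before editing, a grep along these lines should work. The find_syslog_refs helper name is made up for illustration.

```shell
# find_syslog_refs FILE: print "line-number:text" for every line that
# mentions LOG_AUTHPRIV or LOG_FTP. The helper name is made up.
find_syslog_refs() {
    grep -nE 'LOG_(AUTHPRIV|FTP)' "$1"
}

# Example, run from the top of the NRPE source tree:
#   find_syslog_refs src/nrpe.c
```

If each reference turns out to be the body of its own else-if branch, the condition line above it presumably has to go along with it so that the chain still compiles.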

svc.configd Memory Leak

I was making a release today to one of my servers at Joyent. As part of the release I ran a short script written in Java. Java complained that it could not allocate memory to create a JVM! This is a bad sign on a production system. After some poking using top and the much more useful (on Solaris) prstat, I discovered that /lib/svc/bin/svc.configd was taking up 95% of my memory! It appears to have a memory leak.

I checked out the brief man page. It seemed pretty important so I was afraid to kill the process. Some googling around for a restart solution proved my fears baseless. It's OK to kill this process. It will restart by itself.

I killed svc.configd and it came back right away without incident. My memory was freed up.

Time to start monitoring memory usage on my OpenSolaris zones.
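As a starting point for that monitoring, here is a minimal sketch that flags processes above a memory threshold. It assumes a ps that supports the pmem, pid, and comm output fields; mem_hogs is a hypothetical helper name and the 50 percent threshold is arbitrary.

```shell
# mem_hogs THRESHOLD: read `ps -eo pmem,pid,comm` output on stdin and print
# "pid command pmem" for every process at or above THRESHOLD percent.
mem_hogs() {
    # NR > 1 skips the header row; "+ 0" forces numeric comparison.
    awk -v limit="$1" 'NR > 1 && $1 + 0 >= limit + 0 { print $2, $3, $1 }'
}

# Example: list anything using 50% or more of memory.
#   ps -eo pmem,pid,comm | mem_hogs 50
```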

As I mentioned in an earlier entry, I'm using PostgreSQL 8.3.3 at Joyent. It works great, but the documentation is not as good as that of the ubiquitous MySQL.

I required high availability but not real-time failover, so I opted for warm standby, a pseudo-built-in feature of postgres as of version 8.3 that provides easy failover and can be automated if necessary. It's pseudo-built-in because no coding is required, but you need to get your hands pretty dirty to get it working.

To get it working you'll need to do some work on the primary server so that it spits out incremental write ahead log (WAL) files as well as a bit more work on the warm standby server(s) to consume these logs.

You can follow these steps for both a brand new postgres installation as well as an existing heavy traffic installation. Furthermore, no downtime is required to set this up.

Assumptions

  • You're running PostgreSql 8.3.3 (the version that came with my Joyent node)
  • You have at least two identical nodes for this purpose at Joyent
  • You have NFS space for your write ahead log files
  • You've at least scanned the official docs on this subject

Primary Setup

All we need to do on the primary server is set it up to push archived WAL files to our NFS mount. It will push WAL files out in 16 MB chunks. Since 16 MB is a lot of data to lose on a low-traffic DB like mine, I'll be forcing it to flush every hour. This makes it a space hog, but disk space is cheap (at least at Joyent).
  1. Modify the /var/pgsql/data/postgresql.conf. Set the following parameters
    archive_mode = on
    archive_command = 'cp -i %p /shared/psql_wal/%f </dev/null'
    archive_timeout = 3600
  2. Restart postgres to apply the changes
    sudo -u postgres pg_ctl restart -D /var/pgsql/data
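After the restart it is worth checking that segments are really landing on the NFS mount. The sketch below counts files whose names look like WAL segments (24 hexadecimal characters); count_wal_segments is a made-up helper name.

```shell
# count_wal_segments DIR: count files in DIR whose names look like WAL
# segments (exactly 24 hexadecimal characters). Backup history files,
# which carry a suffix, are not counted.
count_wal_segments() {
    ls "$1" | grep -cE '^[0-9A-F]{24}$'
}

# Example, using the archive directory from the config above:
#   count_wal_segments /shared/psql_wal
# To force a segment switch instead of waiting for archive_timeout:
#   echo "SELECT pg_switch_xlog();" | psql -U postgres
```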

Standby Setup

I only set up one standby node, but you can have as many as you like as long as they all have access to the same NFS mount.

This is where the configuration gets a bit hairy. We'll need to build a small, but very useful, C utility from source called pg_standby. We'll also be doing quite a bit of configuration.


  1. Before we configure stuff we'll need to build pg_standby from source. It's not provided with the default Joyent postgres installation.

    1. Download the PostgreSql source code to your standby server. I could not find the source for v8.3.3, so I grabbed v8.3.4. wget works great for this.

    2. Decompress the tarball and run configure and gmake
      cd postgresql
      ./configure
      ... lots of output from configure ...
      gmake all
      ... lots of output from gmake ...

    3. Next build the contents of contrib. This is the folder where all the cool semi-supported utilities (including pg_standby) live.
      cd contrib
      gmake all
      ... lots of output from gmake ...

    4. It's built. Time to install it.
      cd pg_standby
      sudo cp ./pg_standby /opt/local/bin/pg_standby
      sudo chmod 755 /opt/local/bin/pg_standby

  2. We now have pg_standby. You can verify that it is on your PATH by running which pg_standby. Next we'll stop the standby server to prepare for a checkpoint backup from our primary system. Run this command on the standby server
    sudo -u postgres pg_ctl stop -D /var/pgsql/data

  3. With the standby server down we'll log into the primary server and start the hot backup to our NFS mount.
    echo "SELECT pg_start_backup('mybackup');" | psql -U postgres
    sudo tar -cvf /shared/mybackup.tar /var/pgsql/data

  4. While the backup flag is still set, log into the standby system and restore this backup.
    sudo rm -rf /var/pgsql/data
    sudo cp /shared/mybackup.tar /var/pgsql
    cd /var/pgsql
    sudo tar -xvf mybackup.tar
    sudo chown -R postgres:postgres data

  5. With the backup completed, log back into the primary system and clear the backup flag.
    echo "SELECT pg_stop_backup();" | psql -U postgres

  6. Time to configure. On the standby server create a recovery.conf file in /var/pgsql/data/ with the following contents. This tells postgres to slurp up the log files.
    restore_command = 'pg_standby -l -d -s 2 -t /tmp/pgsql.trigger.5432 /shared/psql_wal/ %f %p %r 2>>standby.log'

  7. Start up postgres on the standby server. It will start restoring data right away (since our live DB has already produced some data for it to consume)
    sudo -u postgres pg_ctl start -D /var/pgsql/data

  8. Take a peek at the restore to make sure there aren't too many errors. It may complain about missing files. Ignore these warnings
    sudo -u postgres tail -f /var/pgsql/data/standby.log

  9. Verify that it's still in standby mode. It should complain with this message: 'psql: FATAL: the database system is starting up'
    psql
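The standby check in the last step can be scripted. This is a rough sketch: in_recovery is a hypothetical helper that just looks for the message quoted above in whatever psql prints.

```shell
# in_recovery: read psql's output on stdin and succeed if it contains the
# "starting up" message that a standby in recovery is expected to give.
in_recovery() {
    grep -q 'the database system is starting up'
}

# Example: psql writes the error to stderr, so fold it into stdout.
#   if psql -U postgres -c 'SELECT 1' 2>&1 | in_recovery; then
#       echo "still in standby mode"
#   fi
```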

Doing the Failover

Your primary server is busy serving up requests and your warm standbys are slurping up logs every hour. This is great and all, but what do we do when the primary fails? Here's my process for flipping over to a standby.
  1. Log into the primary server and make sure that postgresql is all the way down.
  2. If the primary server is not totally shut down, take it down the rest of the way. kill -9 the primary postgres process if necessary.
    sudo kill -9 `sudo cat /var/pgsql/data/postmaster.pid | head -n 1`
  3. Log into the warm standby.
  4. Switch to the postgres user.
    sudo -u postgres bash
  5. Verify that the server is still in standby mode. It should report 'psql: FATAL: the database system is starting up'.
    psql
  6. Assuming that it is not up, create the trigger file. This tells our standby server that it's time to become a primary node. It's very important that only one server is primary at a time, so I hope you followed steps 1 and 2.
    touch /tmp/pgsql.trigger.5432
  7. Verify that the standby server is running.
    psql
  8. Get your clients pointing to the new server (change DNS, change IPs in configuration files, etc.)
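The promotion step could be wrapped in a small guard so that it is harder to trigger twice by accident. This is only a sketch: promote_standby is a made-up helper, and the default path is the trigger file named in the restore_command above. It does not check that the old primary is down, which still has to be done by hand.

```shell
# promote_standby [TRIGGER]: create the trigger file that pg_standby polls
# for. Refuses to run if the trigger is already present. Only call this
# once the old primary is confirmed down.
promote_standby() {
    trigger=${1:-/tmp/pgsql.trigger.5432}
    if [ -e "$trigger" ]; then
        echo "trigger file $trigger already exists" >&2
        return 1
    fi
    touch "$trigger"
}

# Example, run as the postgres user on the standby:
#   promote_standby
```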

Credits

This tutorial was largely adapted from Ichsan's Using pg_standby for high availability of Postgresql

Nagios at Joyent

I've recently started using Joyent for my production hosting needs at work.  It's been a great experience so far.  The virtual hardware is great (very fast and never crashes), their support has been very helpful, and if you have a bandwidth-intensive app their prices are impossible to beat.

Now that things are starting to get up and running at Joyent, I decided it was time to set up monitoring.  Many people on Joyent use munin, but I'm more familiar with Nagios so I decided to give it a go.  It went pretty well, but I did run into a couple hiccups so I'm documenting my installation here.
  1. Create the nagios user and group if they're not there already
    groupadd nagios
    useradd -g nagios -s /usr/bin/false -d /wherever -c 'Nagios User' nagios
  2. Run the package installer if nagios isn't already installed (as it turns out I already had nagios installed)
    pkg_add http://pkgsrc.joyent.com/2008Q3/net/nagios-base-3.0.3.tgz
  3. Create a virtual host for Nagios
    sudo vi /opt/local/etc/httpd/virtualhosts/nagios.conf
    with these contents:
    ScriptAlias /nagios/cgi-bin "/home/jill/cgi-bin/nagios"
    
    <Directory "/home/jill/cgi-bin/nagios">
        AllowOverride None
        Options ExecCGI
        Order allow,deny
        Allow from all
        AuthName "Nagios Access"
        AuthType Basic
        AuthUserFile /opt/local/etc/nagios/htpasswd.users
        Require valid-user
    </Directory>
    
    Alias /nagios "/opt/nagios/share"
    
    <Directory "/opt/nagios/share">
       Options None
       AllowOverride None
       Order allow,deny
       Allow from all
       AuthName "Nagios Access"
       AuthType Basic
       AuthUserFile /opt/local/etc/nagios/htpasswd.users
       Require valid-user
    </Directory>
    
  4. This should have been it, but I was seeing a cryptic error from suExec. As it turns out, suExec was preventing the Nagios CGI scripts from executing. Since I didn't have the time to learn suExec, I took the easy way out and moved the CGI files to the docroot (which was the seed user's folder)