This week we ran into a rather small (or is it large?) problem with our Puppet instance. We logged into the Puppet console and noticed that there were over 37,000 pending tasks, and the list was growing fast. Checking the logs, we saw an "out of range" exception. An out of range exception in an enterprise product is never a good thing. It's almost as bad as a segmentation fault in an enterprise product - something you can do nothing about if you don't have access to the source code. In this case, though, we actually can do something about this particular issue.
Here's the exact error we were seeing...
2015-04-17T22:30:15+0000: [Worker(delayed_job.7 host:http://foosite.com pid:17446)] Class#create_from_yaml failed with ActiveRecord::StatementInvalid: PG::Error: ERROR: value "2147716789" is out of range for type integer: INSERT INTO "resource_events" ("audited", "desired_value", "historical_value", "message", "name", "previous_value", "property", "resource_status_id", "status", "time") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING "id" - 2 failed attempts
2015-04-17T22:30:15+0000: [Worker(delayed_job.7 host:http://foosite.com pid:17446)] PERMANENTLY removing Class#create_from_yaml because of 3 consecutive failures.
2015-04-17T22:30:15+0000: [Worker(delayed_job.7 host:http://foosite.com pid:17446)] 1 jobs processed at 0.3037 j/s, 1 failed ...
2015-04-17T22:30:15+0000: [Worker(delayed_job.2 host:http://foosite.com pid:17361)] Class#create_from_yaml failed with ActiveRecord::StatementInvalid: PG::Error: ERROR: value "2147716814" is out of range for type integer
It turns out that the functionality that uses this was deprecated as of early 2014, so this supposedly isn't an issue with newer Puppet installs. However, if you're running an older Puppet (3.0 or older), you might run into this problem.
The problem lies in the database schema for the Puppet console. Basically, every time a node checks in, it inserts a row into the database. The database has some tables with columns that auto-increment (0, 1, 2, 3, etc.). If you have a lot of nodes reporting back frequently, this number will increase a lot over time. In our case, we have 333 nodes reporting at least every 30 minutes (we do development, and thus we often manually run puppet agent with the -t switch). That's 333 * (24*60 / 30) = 333 * 48 = 15,984 reports per day, so accumulating 37,000 would have taken a little over 2 days.
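The arithmetic above can be sanity-checked with a quick script. The node count and check-in interval are the ones from our setup; treat the numbers as illustrative:

```python
# Back-of-the-envelope check of the report rate described above.
nodes = 333               # nodes in our environment
interval_minutes = 30     # each node reports at least every 30 minutes

checkins_per_node_per_day = (24 * 60) // interval_minutes  # 48 check-ins/day
reports_per_day = nodes * checkins_per_node_per_day        # 15,984 rows/day

pending_tasks = 37_000
days_to_backlog = pending_tasks / reports_per_day

print(reports_per_day)             # 15984
print(round(days_to_backlog, 1))   # 2.3
```

Note that manual `puppet agent -t` runs only push the rate higher, so 2.3 days is an upper bound for us.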
The columns that auto-increment use the int data type. This data type, as documented here, uses 4 bytes. In case anyone doesn't remember, there are 8 bits in a byte, which means 4 * 8 = 32 bits. One of those bits is the sign bit, so the maximum number that will fit in any column with the int data type is 2^31 - 1, which equals 2,147,483,647. That means about 2 billion puppet reports. It seems like a number that wouldn't be easy to reach, but it is quite possible - we did it.
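To make the limit concrete, here is the same calculation, compared against the value from the error log above:

```python
# A signed 32-bit integer spends one bit on the sign, leaving 31 bits
# of magnitude, so the largest value Postgres can store in an int column is:
int_max = 2**31 - 1
print(int_max)  # 2147483647

# The value from our delayed_job error log is just past that ceiling,
# which is exactly why the INSERT was rejected.
failing_value = 2147716789
print(failing_value > int_max)  # True
```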
The solution here is to change the data type on the columns of concern from integer to bigint. Again, as documented by the Postgres folks here, a bigint is 8 bytes - a 64-bit number - which means the largest value it can hold is 2^63 - 1, or 9,223,372,036,854,775,807 (about 9 quintillion). That said, let's get to it.
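If you want to check how close your own install is to the ceiling before committing to the migration, a quick query against the affected columns works. This is a sketch; the table and column names are the ones the alter statements later in this post touch:

```sql
-- How close are the auto-increment ids to the 32-bit ceiling?
select max(id) from resource_statuses;
select max(id) from resource_events;
select max(resource_status_id) from resource_events;
-- Anything approaching 2147483647 means an overflow is imminent.
```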
Before performing the fix, we should probably take a backup of the database - unless you're the type who likes causing a fire you have to put out on a Friday.
To execute a backup (we'll assume your database name is console), run
pg_dump -U console -W -h localhost console > console.$(date '+%Y%m%d.%H%M').sql
Once that backup is complete (mine was 86 GB, so it took a while), shut down all of your puppet services to be safe. A list of services you might want to shut down can be found here. A general rule of thumb, though: anything in /etc/init.d/ that starts with pe- is something that should be stopped, except pe-postgresql.
Once that's done, execute this fun console trick:
$ psql -U console -W -h localhost
-- Connect to the console database
console=> \c console
-- This one might take a *very* long time (mine took an hour)
console=> alter table resource_statuses alter column id type bigint;
console=> alter table resource_events alter column id type bigint;
console=> alter table resource_events alter column resource_status_id type bigint;
console=> \q
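Before bringing everything back up, it's worth confirming the columns actually changed type. A sketch using information_schema, with the same table and column names as the alters above:

```sql
-- Verify the migrated columns are now bigint.
select table_name, column_name, data_type
from information_schema.columns
where (table_name, column_name) in
      (('resource_statuses', 'id'),
       ('resource_events', 'id'),
       ('resource_events', 'resource_status_id'));
-- Each row should report data_type = 'bigint'.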
With that, restart the pe-postgresql service for good measure. Once that's done restarting, start up the other pe-* services and everything should be working now.
This is a bug that was reported about three years ago. They have since migrated ticket tracking systems, so the links can be found at...
Written on: 2015-05-31 01:07:36 -0600
Last edited: 2025-03-07 00:12:21 UTC