This last week I ran into a bit of shell scripting that really caused me some grief for a few days. The person who wrote it had named the script stopService.sh. Now, one should assume that if a script as simple as a service stop script has been working for the last two years and nothing has changed, it would continue working. It turns out that with various other environmental changes, a bug introduced two years ago was suddenly showing up.
That said, I took apart this stop script and noticed at the bottom that the original author had written a check for a service to stop that looked like...
# Get the pid of tomcat
pid=$(ps -ef | grep tomcat | grep -v grep | tr -s ' ' | cut -d ' ' -f 2 | head -1)
# Send kill command to the process
kill -0 ${pid}
while [ true ]; do
sleep 15
ps -f ${pid} 2>/dev/null 1>/dev/null
if [[ $? -gt 0 ]]; then
break
fi
break
done
If you read much bash or code at all, you'll notice that while loop there and think "Oh, how clever - looping until the process exits". Then you'll arrive at the break statement at the end and think "Wait, why put all this code in a loop and always break on the first iteration?".
When I saw that block of code at the end of a 243-line script designed to stop a single process type, I again realized that, regardless of title (senior principal architect in this case), not everyone who scripts knows how to do process management or understands basic programming logic.
I don't claim to be an expert at all. I do however have many years of experience with this, so perhaps I can contribute something new to some people's knowledge. If you disagree with how I go about solving this problem, please feel free to send me an email. I'd be happy to learn something new if you've got a better way to do it! With that, let's get started.
First, there were several problems with that code excerpt that, with a little knowledge, could be solved easily and in a portable, reproducible, and maintainable manner. The problems are more than just technical as well. Here are the problems that I see.
It doesn't ensure the process stops before proceeding
Kill signal 0 sends nothing to the process; it only performs error checking (eg: whether the process is still running). Check man 1 kill for more information. TL;DR: kill -0 shouldn't be used for asking a process to stop.
The pid status check relies on the output from a subshell
There is no contingency for when the process won't shut down.
The code exists outside of a function, and thus is more difficult to reuse
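To make the kill -0 point concrete, here is a minimal sketch of what signal 0 is actually good for: asking whether a pid exists without disturbing it. The helper name is_running is my own invention for illustration, and the sleep process is just a throwaway target.

```bash
#!/usr/bin/env bash

# is_running is a hypothetical helper name, not part of the original script.
# kill -0 delivers no signal. It only reports whether the pid exists and
# whether we are allowed to signal it.
is_running() {
  kill -0 "${1}" 2>/dev/null
}

sleep 30 &                          # throwaway background process
pid=$!

is_running "${pid}" && printf "%s is running\n" "${pid}"

kill -15 "${pid}"                   # this is the actual stop request
wait "${pid}" 2>/dev/null || true   # reap the child so the pid disappears

is_running "${pid}" || printf "%s is gone\n" "${pid}"
```

Note that kill -0 also fails for a process that exists but belongs to another user, so a failure here means "gone or not mine to signal".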
Whether shutting down a service or simply blocking a process until another exits (like waiting for a backgrounded download to finish for instance), the humble waitpid function can often help out.
The concept of a waitpid function is actually standardized in many places, such as POSIX C and glibc. That said, let's write our own for bash.
#!/usr/bin/env bash
set -e

#
# Waits the requested time for the specified pid to exit. If the pid does not
# exit in that time, the function return code is 1 (error). If the specified
# pid does exit within the given threshold, then return code is 0.
#
# Note: relies on /proc, so this works on Linux but not on systems without it.
#
# @param pid Pid to wait for exit
# @param threshold Max amount of time in seconds to wait for the pid to exit
#
waitpid() {
  local _pid="${1:-}"
  local _threshold="${2:-}"
  local i

  # Check that arguments were specified (errors go to stderr)
  [ -z "${_pid}" ] && printf "Pid required\n" >&2 && return 1
  [ -z "${_threshold}" ] && printf "Wait threshold required\n" >&2 && return 1

  # Check every second up to the threshold wait time
  for (( i=0; i<_threshold; i++ )); do
    [ ! -d "/proc/${_pid}" ] && return 0
    sleep 1
  done

  return 1
}
This function takes two arguments: the first is the pid number, the second is the wait threshold. This is particularly useful because the code can be re-used without being rewritten.
If you don't want your program waiting forever for a process that's stuck, you can specify a wait threshold and it will return within that time with an error code if the process did not exit within the specified time (it returns 0 if the process did exit within the wait threshold).
This function makes use of return codes. As mentioned, the return code tells you whether the process exited in the specified time or not. That means we can write a procedure that checks if the process exited in the specified time, and if not, sends a kill -9 to the process. Something like this...
How to kill a stubborn process
pid=7932 # Pid to wait for shutdown
threshold=12 # Wait threshold in seconds
# Send SIGTERM
kill -15 "${pid}"
# Wait for pid exit. If waitpid returned 1, send SIGKILL
waitpid "${pid}" "${threshold}" || kill -9 "${pid}"
That code excerpt will wait 12 seconds for the process to exit. If it does not exit within that time, it sends a SIGKILL signal to the process, forcing it to shut down.
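If you'd rather have the whole stop sequence in one reusable function, something like the following could work. This is only a sketch under my own assumptions: the name stop_pid and its return codes are made up for illustration, and it polls with kill -0 instead of looking in /proc, which is a deliberate swap so that it doesn't depend on a proc filesystem.

```bash
#!/usr/bin/env bash

# stop_pid is a hypothetical name. It sends SIGTERM, polls once per second
# with kill -0 (instead of checking /proc), and falls back to SIGKILL.
# Returns 0 if the pid went away in time, 2 if SIGKILL was sent,
# and 1 on bad usage.
stop_pid() {
  local _pid="${1:-}"
  local _threshold="${2:-}"
  local _waited=0

  if [ -z "${_pid}" ] || [ -z "${_threshold}" ]; then
    printf "Usage: stop_pid <pid> <threshold>\n" >&2
    return 1
  fi

  # Ask nicely first
  kill -15 "${_pid}" 2>/dev/null

  # Check every second up to the threshold wait time
  while [ "${_waited}" -lt "${_threshold}" ]; do
    kill -0 "${_pid}" 2>/dev/null || return 0
    sleep 1
    _waited=$(( _waited + 1 ))
  done

  # Still there after the threshold, so force it
  kill -9 "${_pid}" 2>/dev/null
  return 2
}
```

Calling stop_pid "${pid}" 12 would then behave like the excerpt above. One caveat of the kill -0 approach: a process you aren't permitted to signal looks the same as one that is gone.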
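In the original code excerpt, the pid was looked up via a very complicated 'ps' command daisy-chained to several greps, tr, cut, and head. This is dangerous for several reasons: it matches any process whose command line happens to contain "tomcat", it silently takes the first match when there are several, and it depends on the exact column layout of the ps output.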
The better way to do it is to read the process from a pid file that was written at startup (if you're not writing pid files at startup, you should be). This process number is then state checked by looking in the system's proc filesystem at /proc/${pid}. This is much safer, and doesn't rely on screenscraping the output of a tool that has several different 'standard' (eg: gnu, bsd, posix, etc) versions.
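As a rough sketch of the pid file approach, assuming your service writes a plain numeric pid to a file at startup: the helper name pid_from_file is hypothetical, and the /proc check makes it Linux-specific.

```bash
#!/usr/bin/env bash

# pid_from_file is a hypothetical helper. It prints the pid and returns 0
# only if the file holds a number and that process exists under /proc.
pid_from_file() {
  local _pidfile="${1:-}"
  local _pid

  [ -r "${_pidfile}" ] || return 1

  # read returns non-zero when the file lacks a trailing newline, so also
  # accept the case where it still managed to fill in the variable
  read -r _pid < "${_pidfile}" || [ -n "${_pid}" ] || return 1

  # Reject empty or non-numeric content
  case "${_pid}" in
    ''|*[!0-9]*) return 1 ;;
  esac

  [ -d "/proc/${_pid}" ] || return 1
  printf "%s\n" "${_pid}"
}

# Usage: write our own pid to a temporary pid file, then read it back
pidfile="$(mktemp)"
printf "%s\n" "$$" > "${pidfile}"
pid_from_file "${pidfile}" && printf "process from pid file is running\n"
rm -f "${pidfile}"
```

A pid file can go stale if the service crashed and the pid was later reused by something else, so this is still a check to treat with some care.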
One final thing to note: if you want functionality that simply waits for a process to exit without limit, read about the wait POSIX shell builtin. The downsides are that it only works on child processes of the current shell, and that it waits without a timeout, so it may never return.
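For reference, a minimal use of the wait builtin looks something like this. The subshell is just a stand-in for a real background job such as a download, and the exit status of 3 is arbitrary.

```bash
#!/usr/bin/env bash

# Start a background job (a stand-in for something like a download)
( sleep 1; exit 3 ) &
job_pid=$!

# Block until the job exits. wait only works on children of this shell,
# and its return code is the exit status of the job it waited for.
status=0
wait "${job_pid}" || status=$?

printf "Job %s exited with status %s\n" "${job_pid}" "${status}"
```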
As I mentioned, kill signal 0 sends no signal, but rather is useful for checking whether a process is still running. If you want to request a process exit like a real friend, use kill -15. If you are interested in what other signals are available, check out the man page for signal, section 7 (man 7 signal). It contains a full list of standard signals and what they do (very useful for doing things like we are in this blog post).
For the purposes of this post, though, we're really only interested in two signals, which together will ensure your process quits most of the time (they don't account for zombies). Those signals are 15 and 9.
Signal 15 is called SIGTERM. It effectively requests a given process to exit nicely. This is the default signal sent by the kill (man 1 kill) command.
Signal 9 is called SIGKILL. Per the signal man page, this signal cannot be caught, so the process has no choice but to exit.
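To see the difference between the two signals, here is a small experiment you can run. The subshell is a made-up stand-in for a stubborn process: it ignores SIGTERM, so only SIGKILL gets rid of it.

```bash
#!/usr/bin/env bash

# A stand-in for a stubborn process: a subshell that ignores SIGTERM
(
  trap '' TERM
  while true; do sleep 1; done
) >/dev/null 2>&1 &
pid=$!

sleep 1                 # give the subshell time to install its trap

kill -15 "${pid}"       # polite request, ignored by the subshell
sleep 1
kill -0 "${pid}" 2>/dev/null && printf "%s survived SIGTERM\n" "${pid}"

kill -9 "${pid}"        # cannot be caught or ignored
kill_status=0
wait "${pid}" 2>/dev/null || kill_status=$?

# A process killed by signal N is reported by the shell as 128 + N
printf "wait reported status %s\n" "${kill_status}"
```

With SIGKILL being signal 9, the reported status should be 137.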
Last edited: 2020-12-30 16:01:00 UTC