Statistics on Linux with /proc
I've been wanting to make a little page showing statistics for my webserver (the system, not the program). One day, on a whim and with no intention of actually starting, I began researching the APIs I'd need, got grabbed by it, and knew I had to start.
Check it out: starlight.html
a /proc foreword
The /proc filesystem, on Linux, is a sort of window into the kernel. It lets you view some pretty detailed information by simply reading some files (thanks, everything-is-a-file Linux).
There's a lot of information about it in the man pages. They might all be in one big one at man proc or, like on my server, broken into separate pages for distinct sections.

I have linked the relevant pages at the top of their sections. The links go to man7.org, which seems to be the home of the Linux kernel man pages on the internet. man7 is linked from kernel.org, which lends it credibility at least.
Memory
This one isn't too hard. I open the file /proc/meminfo and look for the lines starting with MemTotal and MemAvailable, which are the total memory and currently available memory, respectively. They are very well named :). For usage, I just subtract available from total.
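If you're curious, here's a minimal sketch of that in Rust (not the exact code from my project; the function name is made up, and note that meminfo reports values in kB):

use std::fs;

// Sketch: pull MemTotal and MemAvailable (both in kB) out of /proc/meminfo
// and return (used, total).
fn memory_usage_kb() -> Option<(u64, u64)> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    let mut total = None;
    let mut available = None;
    for line in meminfo.lines() {
        // Lines look like "MemTotal:        2014276 kB".
        if let Some(rest) = line.strip_prefix("MemTotal:") {
            total = rest.split_whitespace().next()?.parse::<u64>().ok();
        } else if let Some(rest) = line.strip_prefix("MemAvailable:") {
            available = rest.split_whitespace().next()?.parse::<u64>().ok();
        }
    }
    let (total, available) = (total?, available?);
    // Usage is just total minus available.
    Some((total - available, total))
}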
Network
If you cat /proc/net/dev you can see some stats about your networking interfaces. This is what I parse, with some pain.

I read the bytes columns from the receive and transmit sections. These are total counts of bytes received or transmitted since boot, so you'll have to take two samples and subtract to get the number of bytes in some time span.
Looking at it in the terminal, you might assume that the separator between the columns is a tab character. I sure did! It is not a tab, but many spaces.

Because of spaces-and-not-tabs (not the usual tabs vs. spaces debate, but with similarities), it proved to be a bit annoying to parse. It made me finally pull in a regex crate because I didn't feel like dealing with it at the time. Eventually™ I want to write a skip-arbitrarily-many-whitespace iterator, but for now regex-lite lives in my Cargo.toml.
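For what it's worth, std's split_whitespace will also happily swallow runs of spaces. A sketch of that route (not my actual regex-lite code; the function name is made up):

use std::fs;

// Sketch: total receive/transmit byte counters for one interface from
// /proc/net/dev. These are totals since boot, so sample twice and subtract
// to get a rate.
fn interface_bytes(iface: &str) -> Option<(u64, u64)> {
    let dev = fs::read_to_string("/proc/net/dev").ok()?;
    // The first two lines are headers.
    for line in dev.lines().skip(2) {
        let Some((name, rest)) = line.split_once(':') else { continue };
        if name.trim() != iface {
            continue;
        }
        let fields: Vec<&str> = rest.split_whitespace().collect();
        // After the interface name: field 0 is receive bytes, field 8 is
        // transmit bytes.
        let rx = fields.first()?.parse().ok()?;
        let tx = fields.get(8)?.parse().ok()?;
        return Some((rx, tx));
    }
    None
}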
CPU
/proc/stat is the least obvious of the triplet. It has more than just the CPU's information, but the CPU is what we're after. You'll probably notice many CPU lines! I'm using the one starting with just "cpu", without a number (cpu0, cpu1, etc.), because I only have the 1 core. If I had more than one core it'd work similarly, since the just-cpu line sums the per-core ones, but then it could show >100% usage 'cause it's per-core usage just added together.
First things uh, second? To summarize from the man page: the units of these values are ticks. There are USER_HZ ticks per second. On most platforms it's 100, but you can check the value for your system with sysconf(_SC_CLK_TCK).
small C program to check _SC_CLK_TCK :)

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // sysconf returns a long, so print it with %ld
    printf("USER_HZ is %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}
But what columns of data do we use? From this stackoverflow answer it seems that summing the user, nice, and system columns gets you the total ticks. User and system make sense to me, time spent in user and system mode, but what on earth is nice? I sure hope it is.
The Internet tells me to check man nice (man7.org/nice). That page says that the niceness of a process can be adjusted to change how the kernel schedules that process. Making it less nice (down to -20) increases its priority, and increasing its niceness (up to 19) lowers it. I guess that makes sense. Lowering the niceness makes the process greedier and in want of more attention from the scheduler? I'm unsure how well that personification tracks to reality, but it helped me think about it.

The nice column, then, seems to be time spent in processes that would otherwise go in the user column, but they have a different priority and I guess differentiating that is important.
Oh, but there might be more columns we want! There's another S.O. answer that I found while writing this that says the sixth and seventh columns should be used as well. These are irq/softirq and are the time spent servicing interrupts. I think it makes sense that we'd want that, too.
So you have all these columns—user, nice, system, irq, and softirq—that add together to give you the total number of ticks spent Doing Things since boot, and you have the number of ticks in a second. Can you see where I'm going with this?
Yup, take two samples some time span apart, subtract the former from the latter, and then you have how much time the processor spent Doing Things. You can use that and the number of ticks in your time span to calculate utilization. Or you just have how much actual time The Computer spent Doing Work, which is also pretty neat. Maybe you can pay it an hourly wage. Is that just AWS?
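As a rough sketch of the whole dance (hard-coding USER_HZ as 100 and assuming a single core, so treat it as illustrative rather than my actual code):

use std::{fs, thread, time::Duration};

// Sketch: busy ticks from the aggregate "cpu " line of /proc/stat.
// Columns after the label: user nice system idle iowait irq softirq ...
fn busy_ticks() -> Option<u64> {
    let stat = fs::read_to_string("/proc/stat").ok()?;
    let line = stat.lines().find(|l| l.starts_with("cpu "))?;
    let cols: Vec<u64> = line
        .split_whitespace()
        .skip(1) // skip the "cpu" label
        .filter_map(|v| v.parse().ok())
        .collect();
    // user + nice + system + irq + softirq
    Some(cols.get(0)? + cols.get(1)? + cols.get(2)? + cols.get(5)? + cols.get(6)?)
}

fn main() {
    let before = busy_ticks().unwrap();
    thread::sleep(Duration::from_secs(1));
    let after = busy_ticks().unwrap();
    // One second at USER_HZ = 100 is 100 ticks per core.
    let user_hz = 100.0;
    println!("cpu usage: {:.1}%", (after - before) as f64 / user_hz * 100.0);
}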
Something to watch out for: apparently the numbers in /proc/stat can overflow and wrap back to zero. I don't know what size integers they are, so I'm unsure how real of a risk that is, but it seemed worth mentioning here.
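A cheap way to be defensive about it, without knowing the counter width, is to clamp a backwards-looking delta to zero instead of letting the subtraction underflow. Something like:

// Sketch: if the new sample is smaller than the old one (wrap or reset),
// call the delta zero for that interval rather than underflowing.
fn counter_delta(older: u64, newer: u64) -> u64 {
    newer.saturating_sub(older)
}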
So you've parsed the stats, now to graphs!
My main trouble here was selecting a range that makes sense for the data it's representing.
Again, memory was easy. There is a total, normally-unchanging amount of RAM, so I just use that as the max. Perhaps there's something to be said about zooming further in to see the megabyte-by-megabyte variance, but I am much more interested in a "how close am I to the ceiling" kind of graph. Like, would I hit my head if I jumped? That kind of thing.
The CPU graph, though, is very variable and a bit spiky. I don't really care what the max value was if it's a spike (it can go off the top for all I care); what I want to see is the typical usage.
If I just ranged to the max then I'd have what I call The Linode Problem. I call it that, rather predictably, because that's what Linode's graphs do and it makes them kind of useless? Great, I love to see that spike up to 100%, but that's all that I can see now.
So instead of max-grabbing, I sort the data and take a value that's almost the max. My series are 256 samples long, so in practice that means taking the 240th value of the sorted array, rounding it up to the nearest whole percent, and using that as the top of the range.
This does mean if it's very spiky, I get The Linode Problem again, but in that case I'm kind of okay with it. I sample every minute, so my 256 pixel long graphs are roughly 4 hours long. If it spikes more than 16 times in that period, perhaps that's worth looking into.
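A sketch of the range picking, with the index scaled so it still works if the series isn't exactly 256 samples long (the function name and the round-up detail are my paraphrase, not verbatim code):

// Sketch: take the ~94th-percentile sample (the 240th of 256 when sorted)
// as the top of the graph's range, so a lone spike can run off the top.
fn cpu_range_top(samples: &[f32]) -> f32 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let idx = (sorted.len() * 240 / 256).min(sorted.len().saturating_sub(1));
    // Round up to the next whole percent so the range lands on a tidy number.
    sorted[idx].ceil()
}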
Okay, CPU done. Network time! It's the same, pretty much. Where there was one line, there are now two. And lots more spikes! I combine the receive and transmit series into one vec, sort it, and take the 32nd-highest value.
I draw the area under the line, too, because it was nigh impossible to see the line when it was so... discontinuous? That creates another problem, though, where the second-drawn line-and-underfill will obscure the one drawn first. So, to not overdraw an entire measurement, I try to draw the larger-on-average one first. Which is to say, I take the average of both series separately and draw the one with the bigger average first. That way the smaller one will hopefully nestle under the larger, like a baby bird hiding from the rain under their parent's wing.
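Sketched out, both decisions together look something like this (names made up, but the logic is as described above):

// Sketch: pick the network graph's range and which series to draw first.
fn network_plan(rx: &[f64], tx: &[f64]) -> (f64, bool) {
    // Range: combine both series, sort, and take the 32nd-highest value.
    let mut combined: Vec<f64> = rx.iter().chain(tx.iter()).copied().collect();
    combined.sort_by(|a, b| a.total_cmp(b));
    let top = combined[combined.len().saturating_sub(32)];

    // Draw order: the series with the larger average is drawn (and filled)
    // first, so the smaller one can nestle underneath it.
    let avg = |s: &[f64]| s.iter().sum::<f64>() / s.len() as f64;
    let draw_rx_first = avg(rx) >= avg(tx);
    (top, draw_rx_first)
}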
That's how the range selection works, anyway.
The graphs themselves are drawn as 256x160 GIFs because I like GIF, 256 is a good number, and they seem to compress better than PNG for this use case.
One day I'd love to try and generate alternative text to describe the general look of the graph. "The memory usage is steady at 300MB", or something like "The network usage is variable, but averages 15.4kbps".
That's it!
bye :)