Category: procps

  • Linux Memory Statistics

    Pretty much everyone who has spent some time on a command line in Linux would have looked at the free command. This command provides some overall statistics on the memory and how it is used. Typical output looks something like this:

                 total        used        free      shared  buff/cache  available
    Mem:      32717924     3101156    26950016      143608     2666752  29011928
    Swap:      1000444           0     1000444
    

    Memory sits in the first row after the headers, then we have the swap statistics. Most of the numbers are fetched directly from the procfs file /proc/meminfo, then scaled and presented to the user. A good example of a “simple” stat is total, which is just the MemTotal row located in that file. For the rest of this post, I’ll make the rows from /proc/meminfo have an amber background.
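
    To make “fetched directly” concrete, here is a minimal sketch (my own illustration, not the procps code) that pulls a single row, such as MemTotal, out of /proc/meminfo:

      #include <stdio.h>
      #include <string.h>

      /* Return the value (in KiB) of one /proc/meminfo row, or -1 on error.
       * The name should include the trailing colon, e.g. "MemTotal:". */
      static long meminfo_row(const char *name)
      {
          FILE *fp = fopen("/proc/meminfo", "r");
          char line[256];
          long kib = -1;

          if (!fp)
              return -1;
          while (fgets(line, sizeof(line), fp)) {
              /* Rows look like "MemTotal:       32717924 kB" */
              if (strncmp(line, name, strlen(name)) == 0) {
                  sscanf(line + strlen(name), "%ld", &kib);
                  break;
              }
          }
          fclose(fp);
          return kib;
      }

      int main(void)
      {
          printf("total: %ld KiB\n", meminfo_row("MemTotal:"));
          return 0;
      }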

    What is Free, and what is Used?

    While you could say that the free value is also merely the MemFree row, this is where Linux memory statistics start to get odd. While that value is indeed what is found in the MemFree row and not a calculated field, it can be misleading.

    Most people would assume that Free means free to use, with the implication that only this amount of memory is free to use and nothing more. That would also mean the used value is really used by something and nothing else can use it.

    In the early days of free, and Linux statistics in general, that was how it looked. Used is a calculated field (there is no MemUsed row) and was, initially, Total - Free.

    The problem was, Used also included the Buffers and Cached values. This meant that it looked like Linux was using a lot of memory for… something. If you read old messages from before 2002 that talk about excessive memory use, they are quite likely looking at the values printed by free.

    The thing was, under memory pressure the kernel could release Buffers and Cached for use; not all of that storage, but some of it, so it wasn’t all really used. To counter this, free showed a row between Memory and Swap with Used having Buffers and Cached removed and Free having the same values added:

                 total       used       free     shared    buffers     cached
    Mem:      32717924    6063648   26654276          0     313552    2234436
    -/+ buffers/cache:    3515660   29202264
    Swap:      1000444          0    1000444

    You might notice that this older version of free from around 2001 shows buffers and cached separately and there’s no available column (we’ll get to Available later). Shared appears as zero because the old row was labelled MemShared and not Shmem; that changed in Linux 2.6 and I’m running a system way past that version.

    It’s not ideal: you can say that the amount of free memory is somewhere above 26654276 and below 29202264 KiB, but nothing more accurate. buffers and cached are almost never all-used or all-unused, so the real figure is not either of those numbers but something in between.

    Cached, just not for Caches

    That appeared to be an uneasy truce within the Linux memory statistics world for a while. By 2014 we realised that there was a problem with Cached. This field used to hold the amount of memory used as a cache for files read from storage. While that is still part of its value, it also counts tmpfs storage, and the use of tmpfs went from an interesting idea to being everywhere. Cheaper memory meant larger tmpfs partitions went from a luxury to something everyone was doing.

    The problem is that with large files put into a tmpfs partition, Free would decrease but Cached would increase by the same amount, meaning the free column in the -/+ row would barely change and understate the impact of files in tmpfs.

    Luckily, in Linux 2.6.32 the developers added a Shmem row, which is the amount of memory used for shmem and tmpfs. Subtracting that value from Cached gave you the “real” cached value, which we call main_cache, and very briefly this is what the cached value would show in free.

    However, this caused further problems because not all Shmem can be reclaimed and reused, so it probably swapped one set of problematic values for another. It did, however, prompt the Linux kernel community to have a look at the problem.

    Enter Available

    There was increasing awareness within the kernel community of the issues with working out how much memory a system has free. It wasn’t just the output of free or the percentage values in top; load balancers or workload-placing systems would also have their own view of this value. As memory management and use within the Linux kernel evolved, what was or wasn’t free changed, and all the userland programs were somehow expected to keep up.

    The kernel developers realised the best place to estimate the memory not being used was in the kernel itself, so they created a new memory statistic called Available. That way, if how memory is used changes, or some of it is set to be unreclaimable, they can adjust the estimate in one place and userland programs will go along with it.

    procps has a fallback for this value (for kernels that don’t provide it) and it’s a pretty complicated setup; there is a code sketch after the list:

    1. Find the min_free_kbytes setting (the vm.min_free_kbytes sysctl), which is the minimum amount of free memory the kernel will keep reserved
    2. Add 25% to this value (e.g. if it was 4000, make it 5000); this is the low watermark
    3. To find available, start with MemFree and subtract the low watermark
    4. Take the sum of the Inactive(file) and Active(file) values; if half of that sum is greater than the low watermark, add the half, otherwise add the sum minus the low watermark
    5. If half of SReclaimable (reclaimable slab) is greater than the low watermark, add that half value, otherwise add SReclaimable minus the low watermark
    6. If what you get is less than zero, make available zero
    7. Or, just look at Available in /proc/meminfo
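
    Put together, the fallback looks something like this sketch (my own illustration of the steps above, not the procps source line-for-line; all inputs are KiB values):

      #include <stdio.h>

      /* mem_free, inactive_file, active_file and slab_reclaimable come
       * from /proc/meminfo; min_free is the vm.min_free_kbytes sysctl */
      static long available_fallback(long mem_free, long inactive_file,
                                     long active_file, long slab_reclaimable,
                                     long min_free)
      {
          long low_wmark = min_free + min_free / 4;    /* add 25% */
          long page_cache = inactive_file + active_file;
          long available = mem_free - low_wmark;

          /* Count the page cache, but keep at least half of it in reserve */
          if (page_cache / 2 > low_wmark)
              available += page_cache / 2;
          else
              available += page_cache - low_wmark;

          /* The same idea for the reclaimable slab */
          if (slab_reclaimable / 2 > low_wmark)
              available += slab_reclaimable / 2;
          else
              available += slab_reclaimable - low_wmark;

          return available < 0 ? 0 : available;
      }

      int main(void)
      {
          /* Made-up example numbers, in KiB */
          printf("available: %ld KiB\n",
                 available_fallback(26950016, 1200000, 900000, 180000, 67584));
          return 0;
      }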

    For the free program, we added the Available value and the -/+ line was removed. The main_cache value was Cached + Slab, while Used was calculated as Total - Free - main_cache - Buffers. This was very close to what the Used column in the -/+ line used to show.

    What’s on the Slab?

    The next issue that came up was the use of slabs. At this point, main_cache was Cached + Slab, but Slab consists of reclaimable and unreclaimable components. One part of Slab can be reclaimed for use elsewhere if needed and the other cannot, but the procps tools treated them the same. The Used calculation should not subtract SUnreclaim from the Total, because that memory is actually being used.

    So in 2015 main_cache was changed to be Cached + SReclaimable. This meant that Used memory was calculated as Total - Free - Cached - SReclaimable - Buffers.

    Revenge of tmpfs and the return of Available

    The tmpfs impacting Cached was still an issue. If you added a 10MB file into a tmpfs partition, then Free would reduce by 10MB and Cached would increase by 10MB meaning Used stayed unchanged even though 10MB had gone somewhere.

    It was time to retire the complex calculation of Used. From procps 4.0.1 onwards, Used simply means “not available”: we take the Total memory and subtract the Available memory. This is not a perfect setup, but it is probably the best one we have, and testing is giving us much more sensible results. It’s also easier for people to understand (take the total value you see in free, then subtract the available value).

    What does that mean for main_cache which is part of the buff/cache value you see? As this value is no longer in the used memory calculation, it is less important. Should it also be reverted to simply Cached without the reclaimable Slabs?

    The calculated fields

    In summary, what this means for the calculated fields in procps at least is:

    • Used: Total - Available; if Available is not present, then Total - Free
    • Cached: Cached + Reclaimable Slabs
    • Swap/Low/HighUsed: Corresponding Total - Free (no change here)

    Almost everything else, with the exception of some bounds checking, is what you get out of /proc/meminfo which is straight from the kernel.
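
    As a sketch (a hypothetical helper, not the procps source), those calculated fields amount to the following, where the inputs are KiB values from /proc/meminfo and available < 0 stands in for a missing Available row on older kernels:

      struct mem_display { long used; long cached; };

      static struct mem_display calc_fields(long total, long mem_free,
                                            long available, long cached,
                                            long sreclaimable)
      {
          struct mem_display d;

          /* Used: Total - Available, falling back to Total - Free */
          d.used = (available >= 0) ? total - available : total - mem_free;

          /* The displayed cache adds the reclaimable slabs */
          d.cached = cached + sreclaimable;
          return d;
      }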

  • Percent CPU for processes

    The ps program gives a snapshot of the processes running on your Unix-like system. On most Linux installations, this will be the ps program from the procps project.

    While you can get a lot of information from the tool, a lot of the fields need further explanation or can give “wrong” or confusing information; or, putting it another way, they provide the right information that looks wrong.

    One of these confusing fields is the %CPU or pcpu field. You can see this as the third field with the ps aux command. You only really need the u option to see it, but ps aux is a pretty common invocation.

    More than 100%?

    This post was inspired by procps issue 186, where the submitter said that the sum across processes cannot be more than the number of CPUs times 100%. If you have 1 CPU then the sum of %CPU for all processes should be 100% or less; if you have 16 CPUs then 1600% is your maximum number.

    Some people put the oddity of over 100% CPU down to some rounding thing gone wrong, and at first I did think that; except I know we get a lot of reports about the top header CPU load not lining up with the process load, and that’s because “they’re different”.

    The trick here is: ps is reporting a percentage of what? Or, perhaps to give a better clue, a percentage of when?

    PCPU Calculations

    So to get to the bottom of this, let’s look at the relevant code. In ps/output.c we have a function pr_pcpu that prints the percent CPU. The relevant lines are:

      total_time = pp->utime + pp->stime;
      if(include_dead_children)
          total_time += (pp->cutime + pp->cstime);
      seconds = cook_etime(pp);
      if (seconds)
          pcpu = (total_time * 1000ULL / Hertz) / seconds;

    OK, ignoring the include_dead_children line (you get this from the S option; it means you include the time this process waited for its child processes) and the scaling (process times are in Jiffies, and we keep the CPU value as 0 to 999 for reasons), you can reduce this down to:

    %CPU = (utime + stime) / etime

    So we find the amount of time the CPU(s) have been busy on this process, either in userland or in the system, add those together, then divide the sum by the elapsed time. The utime and stime counters increment like a car’s odometer: if a process uses one Jiffy of CPU time in userland, that counter goes to 1. If it does it again a few seconds later, then that counter goes to 2.

    To give an example, if a process has been running for ten seconds and within those ten seconds the CPU has been busy in userland for that process the whole time, then we get 10/10 = 100%, which makes sense.
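
    For the curious, here is a rough stand-alone sketch of that calculation (my own illustration, not the procps source). utime, stime and starttime are fields 14, 15 and 22 of /proc/<pid>/stat, and the elapsed time comes from /proc/uptime:

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(int argc, char *argv[])
      {
          char path[64], buf[1024], *p;
          double uptime, hertz = sysconf(_SC_CLK_TCK);
          unsigned long long utime, stime, starttime;
          FILE *fp;

          if (argc != 2)
              return 1;

          fp = fopen("/proc/uptime", "r");
          if (!fp || fscanf(fp, "%lf", &uptime) != 1)  /* seconds since boot */
              return 1;
          fclose(fp);

          snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
          fp = fopen(path, "r");
          if (!fp || !fgets(buf, sizeof(buf), fp))
              return 1;
          fclose(fp);

          /* The comm field may contain spaces, so parse after the last ')' */
          p = strrchr(buf, ')');
          if (!p)
              return 1;
          sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u "
                        "%llu %llu %*d %*d %*d %*d %*d %*d %llu",
                 &utime, &stime, &starttime);

          double seconds = uptime - starttime / hertz;  /* elapsed time */
          printf("%%CPU = %.1f\n",
                 100.0 * ((utime + stime) / hertz) / seconds);
          return 0;
      }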

    Not all Start times are the same

    Let’s take another example: a process still consumes ten seconds of CPU time but has been running for twenty seconds, so the answer is 10/20 or 50%. On our single-CPU example system, both of these cannot have been running at the same time, otherwise we would have 150% CPU utilisation, which is not possible.

    However, let’s adjust this slightly. We have assumed uniform utilisation. But take the following scenario:

    • At time T: Process P1 starts and uses 100% CPU
    • At time T+10 seconds: Process P1 stops using CPU but still runs, perhaps waiting for I/O or sleeping.
    • Also at time T+10 seconds: Process P2 starts and uses 100% CPU
    • At time T+20 we run the ps command and look at the %CPU column

    The output for ps -o times,etimes,pcpu,comm would look something like:

        TIME ELAPSED %CPU COMMAND
          10      20   50 P1
          10      10  100 P2

    What we will see is P1 has 10/20 or 50% CPU and P2 has 10/10 or 100% CPU. Add those up, and you have 150% CPU, magic!

    The key here is the ELAPSED column. P1 has given you the CPU utilisation across 20 seconds of system time and P2 the CPU utilisation across only 10 seconds. If you directly add them together, you get the wrong answer.

    What’s the point of %CPU?

    The %CPU column probably gives results that a lot of people are not expecting, so what’s the point of it? Don’t use it to see why the CPU is running hot; you can see above that those two processes were working the CPU hard at different times. What it is useful for is seeing how “busy” a process is, but be warned it’s an average over the whole lifetime of the process. If a process idles or hardly uses CPU for a week and then goes bananas, you won’t see that here.

    The top program, because a lot of its statistics are deltas from the last refresh, is a much better program for this sort of information about what is happening right now.

  • procps-ng 3.3.16

    procps-ng version 3.3.16 was released today. Besides some documentation and library updates, there were a few incremental changes.

    Zombie Hunting with pgrep

    Ever wanted to find zombies? Perhaps processes in other states? pgrep has a shiny new runstate flag to help you, which will match processes against their run state. I’m curious to see the use-cases for this flag; it certainly will get used (e.g. find my zombies), but as some processes bounce in and out of states (think Run to Sleep and back) it might add some confusion.

    Snice plays nice with PIDs

    Top Enhancements

    Top got a bunch of love again in this release. Ever wanted your processes to be shown in fuchsia? Perhaps goldenrod? With some earlier versions of top you could, by directly editing the toprc file, but now everyone can have more than the standard 8 colours!

    If you use the other filters parameter for some fancy process filtering in top, top will now save that configuration.

    Collapsed children (process names are weird) get some help. If you are in tree view, you can collapse or fold the child processes under their parent. Their CPU time is also added to the parent, so there are no “missing” CPU ticks.

    For people who use the One True Editor (which is, of course, VIM) you can use the vim navigation keys to move through the process list.

    Where to find it?

    You’ll find the latest version of procps at our git repository, or you can download a tarball.

  • The sudo tty bug and procps

    There have been recent reports of a security bug in sudo (CVE-2017-1000367) where you can fool sudo about which controlling terminal it is running on and so bypass its security checks. One of the first things I thought of was: is procps vulnerable to the same bug? Sure, it wouldn’t be a security bypass, but it would be a normal sort of bug. A lot of programs in procps have a concept of a controlling terminal, or the TTY field, for either viewing or filtering; could they be fooled into thinking a process had a different controlling terminal?

    Was I going to be in the same pickle as the sudo maintainers? The meat between the stat parsing sandwich? Can I find any more puns related somehow to the XKCD comic?

    TLDR: No.

    (more…)

  • procps 3.3.12

    The procps developers are happy to announce that version 3.3.12 of procps was released today. This version has a mixture of bug fixes and enhancements. This unfortunately means another API bump but we are hoping this will be fixed with the new library API coming soon.

    procps is developed on gitlab and the new version of procps can be found at https://gitlab.com/procps-ng/procps/tree/newlib

    procps 3.3.12 can be found at https://gitlab.com/procps-ng/procps/tags/v3.3.12

    (more…)

  • Displaying Linux Memory

    Memory management is hard, but RAM management may be even harder.

    Most people know the vague overall concept of how memory usage is displayed within Linux. You have your total memory, which is everything inside the box; then there is used and free, which is what the system is or is not using respectively. Some people might know that not all used is really used, and some of it actually is free. It can be very confusing to understand, even for someone who maintains procps (the package that contains top and free, two programs that display memory usage).

    So, how does the memory display work?

    (more…)

  • pidof lost a shell

    pidof is a program that reports the PID of a process that has the given command line. It has an option x which means “scripts too”; the idea behind this is that if you have a shell script, it will find it. Recently an issue was raised saying pidof was not finding a shell script. Trying it out, pidof indeed could not find the sample script, but it found other scripts. What was going on?

    (more…)

  • ps standards and locales

    I looked at two interesting issues today around the ps program in the procps project. One had a solution and the other I’m puzzled about.

    ps User-defined Format

    Issue #9 was quite the puzzle. The output of ps changed depending on whether a different option had a hyphen before it or not.

    First, the expected output

    $ ps p $$ -o pid=pid,comm=comm
     pid comm
    31612 bash

    Next, the unusual output.

    $ ps -p $$ -o pid=pid,comm=comm
    pid,comm=comm
     31612

    (more…)

  • procps 3.3.11

    I have updated NEWS, bumped the API and tagged in git; procps version 3.3.11 is now released!

    This release we have fixed many bugs and made procps more robust for those odd corner cases. See the NEWS file for details.  The most significant new feature in this release is the support for LXC containers in both ps and top.

    The source files can be found at both sourceforge and gitlab.

    My thanks to the procps co-maintainers, bug reporters and merge/patch authors.

    What’s Next?

    There has been a large amount of work on the library API. This is not visible to this release as it is on a different git branch called newlib. The library is getting a complete overhaul and will look completely different to the old libproc/libprocps set. A decision hasn’t been made when newlib branch will merge into master, but we will do it once we’re happy the library and its API have settled. This change will be the largest change to procps’ library in its 20-odd year history but will mean the library will use common modern practices for libraries.

  • Be careful with errno

    I’m getting close to releasing version 3.3.11 of procps. When it gets near that time, I generally browse the Debian Bug Tracker again for procps bugs. Bug number #733758 caught my eye. With the free command, if you used the s option before the c option, the s option failed with “seconds argument ‘N’ failed”, where N was the number you typed in. That error should only appear when you type letters instead of a number of seconds. It seemed reasonably simple to test and simple to fix.

    Take me to the code

    The relevant code looks like this:

       case 's':
                flags |= FREE_REPEAT;
                args.repeat_interval = (1000000 * strtof(optarg, &endptr));
                if (errno || optarg == endptr || (endptr && *endptr))
                    xerrx(EXIT_FAILURE, _("seconds argument `%s' failed"), optarg);
    

    Seems like a pretty stock-standard sort of function: use strtof() to convert the string into a float.

    You need to check both errno AND optarg == endptr because:

    • A valid but large float means errno = ERANGE
    • An invalid float (e.g. “FOO”) means optarg == endptr

    At first I thought the logic was wrong, but tracing through it, it was fine. I then compiled free using the upstream git source; the program worked fine with the s flag and no c flag. Doing a diff between the upstream HEAD and Debian’s 3.3.10 source showed nothing obvious.

    I then shifted the upstream git to 3.3.10 too and re-compiled. The Debian source failed, while the upstream parsed the s flag fine. I ran diff: no change. I ran md5sum: the hashes matched. What was going on here?

    I’ll set when I want

    The man page says that in the case of under/overflow “ERANGE is stored in errno”. What this means is that if there isn’t an under/overflow, errno is NOT set to 0; it is just not touched at all. This is quite useful when you have a chain of functions and you just want to know that something failed, but don’t care what.

    Most of the time you would have a “Have I failed?” test and then check errno for why. A typical example is socket calls, where anything less than 0 means failure: you check the return value first and then errno. strtof() is one of those funny ones where most people check errno directly; it’s simpler than checking for +/- HUGE_VAL. You can see, though, that there are traps.

    What’s the difference?

    OK, so a simple errno=0 above the call fixes it, but why would the Debian source tree have this failure and the upstream not? Even with the same code? The difference is how they are compiled.

    The upstream compiles free like this:

    gcc -std=gnu99 -DHAVE_CONFIG_H -I. -include ./config.h -I./include -DLOCALEDIR=\"/usr/local/share/locale\" -Iproc -g -O2 -MT free.o -MD -MP -MF .deps/free.Tpo -c -o free.o free.c
    mv -f .deps/free.Tpo .deps/free.Po
    /bin/bash ./libtool --tag=CC --mode=link gcc -std=gnu99 -Iproc -g -O2 ./proc/libprocps.la -o free free.o strutils.o fileutils.o -ldl
    libtool: link: gcc -std=gnu99 -Iproc -g -O2 -o .libs/free free.o strutils.o fileutils.o ./proc/.libs/libprocps.so -ldl


    While Debian has some hardening flags:

    gcc -std=gnu99 -DHAVE_CONFIG_H -I. -include ./config.h -I./include -DLOCALEDIR=\"/usr/share/locale\" -D_FORTIFY_SOURCE=2 -Iproc -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -MT free.o -MD -MP -MF .deps/free.Tpo -c -o free.o free.c
    mv -f .deps/free.Tpo .deps/free.Po
    /bin/bash ./libtool --tag=CC --mode=link gcc -std=gnu99 -Iproc -g -O2 -fstack-protector-strong -Wformat -Werror=format-security ./proc/libprocps.la -Wl,-z,relro -o free free.o strutils.o fileutils.o -ldl
    libtool: link: gcc -std=gnu99 -Iproc -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wl,-z -Wl,relro -o .libs/free free.o strutils.o fileutils.o ./proc/.libs/libprocps.so -ldl

    It’s not the compiling of free itself that does it, but the library. Most likely something that is called before the strtof() is setting errno, and this code then falls into the trap. In fact, if you run the upstream free linked against the Debian procps library, it fails.

    The moral of the story is to set errno before the function is called if you are going to depend on it for checking whether the function succeeded.
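
    As a minimal sketch of that pattern (my own illustration, not the eventual procps patch), the cure is a one-line reset before the conversion:

      #include <errno.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char *argv[])
      {
          const char *arg = (argc > 1) ? argv[1] : "2.5"; /* stands in for optarg */
          char *endptr;
          float seconds;

          errno = 0;  /* strtof() may set errno but never clears it */
          seconds = strtof(arg, &endptr);
          if (errno || arg == endptr || *endptr) {
              fprintf(stderr, "seconds argument `%s' failed\n", arg);
              return EXIT_FAILURE;
          }
          printf("repeat interval: %f seconds\n", seconds);
          return EXIT_SUCCESS;
      }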