Info

Userusage's GitHub page can be found here.

The full README, which covers everything from installation to usage, is available on the GitHub page.


Story

This is not the first iteration of Userusage. In fact, this isn't even the first Python iteration of Userusage.

We managed a lot of servers at CU. At the time this script was written, we had 1849 systems in our database. These were everything from web servers to student dev servers, and we couldn't always keep a close eye on what was going on. There are plenty of tools out there to measure the "whats" in a system, but none to keep track of the "whos." What is overheating? What drive is full? What UPS is off? All of those can be answered quickly and efficiently with widely available tools (Nagios, for example). Who's using all of that disk space? That is an equally important question with a much more difficult answer. A disk is going to stay full if nobody knows or cares that it is full, and that can cause even more problems.

This is where the first iteration of Userusage came into play. The first round was a Perl program that was about 5 years old. It took a monstrous 4 hours to run on a standard-size (1TB /home) server, but it did its job. When I started learning Python and joined the CU Unix team in 2014, my boss handed it to me with a simple directive: "make it better." Better can mean a lot of things, but reading through a complicated man page for an hour and then running a 30-minute script is still faster (in a mono-focus sense) than four hours of waiting. If you are doing time-sensitive computations, like many of CU's aerospace research labs need to do, you might not have four hours to spend freeing up space. So the biggest goal was to speed it up.

Storage is a really complicated thing. All I can say is "thank God there are more talented people than me managing that," because physical storage is a difficult problem to solve. Because of this, I decided the best thing to do would be to lean on what was already written. Python is pretty slow compared to C, especially when you are looking at a terabyte of data. So we wanted to use standard command-line tools to do the computations, parse their output with Python, and go home happy.

This is where the first iteration of Python Userusage came about. The first one did exactly what the Perl version did. It ran this command for each user:

find . -user username -print

It then used Python to build the file list for each user and total the space those files took up. Next up, we tried a few pure Python implementations.

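import os
space = 0
for dirpath, _, filenames in os.walk('.'):  # os.walk yields (dirpath, dirnames, filenames)
    for name in filenames:
        space += os.path.getsize(os.path.join(dirpath, name))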

We also rebuilt our find command:

find . -user username -type f -exec du -k {} \;

What we found was about a 25% improvement in run time, a difference of an hour from the original four. This was good, but I really wanted better. The other important thing to note is that each of these attempts was built around one tool or the other.

What I mean by that is that each one used either pure Python or pure C (the compiled command-line utilities), because people rarely mixed the two in the way I wanted to. This meant that no matter what we tried, it would churn out results at a similar speed.

The other issue is one that anyone who uses Python might have raised from the beginning: "subprocesses are slow." Simply put, the time it takes to shell out to bash and back is significant when you are running lots of commands.

The best thing to do in this case, following prior logic, is to run only one subprocess throughout the whole script, get the data back, and parse it using Python. Both sides of the equation do what they are strong at and never interact with each other outside of one-way communication from the command to Python. The question now became, what command?

find was good because we could search by username, but we had to run a command for every username. We have hundreds of students on some systems, so that's not acceptable. That's when I found stat, a built-in utility for file information. A few hours of man pages later, I came up with this.

find / -type f -exec stat -c '%U %s' {} +

This churns out one line per file, with the owner's username and the file's size in bytes on each line. Then we add it up per user in a dict, use sorted() to order it, and produce the results accordingly. The time? 15 minutes. That is a nearly 94% reduction in run time from the original four hours, just from using two things for their strengths.
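The parsing side of that pipeline can be sketched roughly as follows. This is a minimal illustration, not the actual Userusage source: the function names (tally_usage, rank_usage, scan) and the /home default are assumptions made for the example, and it assumes GNU stat, whose %s reports file size in bytes rather than allocated disk blocks.

```python
import subprocess
from collections import defaultdict


def tally_usage(lines):
    """Sum file sizes per owner from lines shaped like '<user> <bytes>'."""
    totals = defaultdict(int)
    for line in lines:
        # Owner name comes first, size second; split on the first space only.
        user, _, size = line.strip().partition(" ")
        if size.isdigit():  # skip blank or malformed lines
            totals[user] += int(size)
    return dict(totals)


def rank_usage(totals):
    """Return (user, bytes) pairs, largest consumer first."""
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


def scan(root="/home"):
    """Run one find/stat pass under root (hypothetical default) and rank owners."""
    cmd = ["find", root, "-type", "f", "-exec", "stat", "-c", "%U %s", "{}", "+"]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
    )
    with proc.stdout:
        # Stream the output line by line instead of holding it all in memory.
        totals = tally_usage(proc.stdout)
    proc.wait()
    return rank_usage(totals)
```

Calling scan("/home") would start the single subprocess and return the ranked list. Keeping the parsing in tally_usage, separate from the subprocess call, means the adding-up logic can be checked against a handful of sample lines without walking a real filesystem.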


The new script went into production on every one of those systems, and was later released as an open-source library for public use.


Outcome

This project was significant to me in many ways. Firstly, it taught me to never take anything for granted when it comes to performance. The original Userusage was a script that had been run for years and years despite its sub-par performance. This wasn't because the technology wasn't available to improve it, but simply because nobody thought it needed improving. On top of that, this was the most used project I had built at the time, with all of our systems at CU running the script. They say one of the scariest things you can do in software engineering is actually ship your software, and this gave me that experience of having something in production.


Userusage is licensed under Beerware R42.

LICENSE