Category Archives for "Open source"

A quick intro to the Unix “find” utility

One of the most powerful Unix command-line utilities is “find” — but it also has a huge number of options, and most of the documentation I’ve read on “find” is hard to follow and understand.  That’s a shame, because once you understand what “find” does and how it works, you can accomplish quite a bit.  I hope that this post will show you some of the basics of “find”, so that you can take advantage of it in your day-to-day work.

The basic idea is that “find” looks through a directory (and all of its subdirectories), applying one or more filters when deciding which files are interesting, and executing one or more actions on matching files.

So, what can you do with “find”?

  • Move any backup log older than 30 days to /tmp/
  • Find all of the MP4 files larger than 100MB
  • Find all of the documents with either “doc” or “docx” extensions anywhere in your home directory
  • In a directory of text files, find those containing the phrase “budget” which have not been touched in the last 30 days

(In these examples, I’m going to use the GNU version of find, which is standard on Linux machines and available for the Mac via Homebrew.  Note that if you use Homebrew on the Mac, then GNU “find” will be installed as “gfind” by default.  Use the --with-default-names option to “brew install” if you want to avoid this prefix.)

Note: There is a big difference between “find” and “locate”, which are often confused for one another:

  • “find” looks for files according to a number of criteria, and performs an action on the files matching those criteria. The search takes place when you run the program.
  • “locate” searches a database (typically created with the “updatedb” command) for filenames matching a pattern, and returns those filenames.

So if you know that you have a file named “important.txt” somewhere on your system, then you probably want to use “locate” — assuming, of course, you have been updating your filename database on a regular basis, typically via “cron”.

If you don’t remember the name of the file, but do remember that you modified it in the last 14 days, and that it contains the phrase “very important”, then you can use “find”.

For example, let’s say that I just want to find all of the files in the current directory and all of its subdirectories.  I can say:

find . -print

This means: Look at all files and directories in the current directory (.) and contained within its subdirectories, and then print them.

Now, in GNU find, both of these arguments are optional; you can just say

find

but I don’t recommend doing so, if only because it’s a bit ambiguous.  Moreover, the longer version emphasizes that “find” looks through a directory, filters through the results (although we don’t have any filters here), and then executes something (in this case, “print”).  The filters and actions are specified using command-line arguments; thus, we say “-print” if we want to print the name of the file.  Note that it’s not “--print” (i.e., with two “-” characters before “print”), which we might expect.

Also notice that the result includes all files, including directories and special Unix files (e.g., device files).  If you want to only look at files, then you can specify the “-type” filter.  For example, the following command shows all files (i.e., not subdirectories, symbolic links, or the like) under the current directory:

find . -type f -print    # find regular files

What if you want to find directories?  Then instead of using “-type f”, specify “-type d”:

find . -type d -print    # find directories

What if I only want to find files that match a certain pattern?  Then I can filter using the “-name” test and the shell’s standard characters.  For example, let’s say I want to find all of the files that end with “.txt”.  I can then say:

find . -type f -name "*.txt" -print

The above applies two tests: only regular files (i.e., not directories or the like) that match the pattern “*.txt” will match and be printed.
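If you want to try the “-type” and “-name” tests without touching your real files, here is a minimal sketch that builds a throwaway directory first. The file and directory names are invented for the demo, and I'm assuming a system with “mktemp” available.

```shell
# Build a small throwaway tree (all names here are made up for the demo).
tmp=$(mktemp -d)
mkdir -p "$tmp/notes/old"
touch "$tmp/readme.txt" "$tmp/notes/todo.txt" "$tmp/notes/old/draft.text" "$tmp/photo.jpg"

# Regular files whose names end in ".txt". The quotes keep the shell from
# expanding *.txt itself before find ever sees the pattern.
txt_files=$(cd "$tmp" && find . -type f -name "*.txt" -print | LC_ALL=C sort)
printf 'text files:\n%s\n' "$txt_files"

# Directories only. The starting point "." is itself a directory, so it is listed too.
dirs=$(cd "$tmp" && find . -type d -print | LC_ALL=C sort)
printf 'directories:\n%s\n' "$dirs"

rm -rf "$tmp"
```

Piping through “sort” is only there to make the output order predictable; “find” itself reports files in whatever order the filesystem returns them.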

What if I want to find files that end with “.txt” or “.text”?  In such cases, it might be easiest to use the “or” option, written as “-o”, that combines two tests.  For example:

find . -type f \( -name '*.txt' -o -name '*.text' \) -print

The “-o” operator (for logical “or”; and yes, there is also an “-a” operator for logical “and”) succeeds if either of the tests on its two sides succeeds. However, “-o” binds less tightly than the implicit “and” that joins adjacent tests, so without grouping, the “-type f” test and the “-print” action would each attach to only one side of the “-o”. That’s why the two “-name” tests must be inside of parentheses.  Since parentheses in the Unix shell have their own uses, we need to preface them with backslashes, so that the shell passes them through to “find” untouched.  But wait: “\(” and “\)” must be separate arguments, so if they are touching the tests inside of them, then you’ll get hard-to-understand errors.  So make sure that “\(” and “\)” are surrounded by whitespace, if you want to avoid trouble.
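To see why the parentheses matter, here is a small sketch that runs the same search with and without them in a scratch directory. The three file names are hypothetical.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b.text" "$tmp/c.md"

# With parentheses: -print applies to whichever -name test matched.
grouped=$(cd "$tmp" && find . -type f \( -name '*.txt' -o -name '*.text' \) -print | LC_ALL=C sort)

# Without parentheses, find reads the expression as
#   ( -type f -a -name '*.txt' )  -o  ( -name '*.text' -a -print )
# so -print is attached only to the second branch.
ungrouped=$(cd "$tmp" && find . -type f -name '*.txt' -o -name '*.text' -print | LC_ALL=C sort)

printf 'grouped:\n%s\nungrouped:\n%s\n' "$grouped" "$ungrouped"
rm -rf "$tmp"
```

The grouped version prints both a.txt and b.text, while the ungrouped one silently drops a.txt. No error, just a wrong answer, which is the worst kind.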

Let’s say that I want to find old files on my system. Unix filesystems keep track of file ages in three different ways:

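  • ctime (change time) — when was the file’s status (its inode: permissions, ownership, and the like) last changed?  Despite the name, this is not the creation time.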
  • mtime (modification time) — when was the file last modified?
  • atime (access time) — when was the file last accessed/read?

Let’s say that I want to find files in the current directory (and below) that were last accessed 7 days ago.  I can say:

find . -atime 7 -print

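The “atime” is measured in 24-hour increments, counting back from the moment you run “find” (or from the start of today, if you add GNU find’s “-daystart” option), with any fractional part discarded. So “-atime 7” means, “last accessed at least 7*24 hours ago, but less than 8*24 hours ago.”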

But wait a second — when was the last time you wanted to find files that were accessed exactly 7 days ago?  It’s far more likely that you want to find files that were last accessed less than 7 days ago. In order to do that, you need to preface the number with a “-” sign:

find . -atime -7 -print

By contrast, if you want to find all of those files that were accessed more than 7 days ago, you’ll want to preface the number with a “+” sign (since fractions of a day are discarded, “+7” really means “at least 8*24 hours ago”):

find . -atime +7 -print
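Here is a sketch that shows all three forms side by side. It relies on GNU touch’s “-d” option to back-date three made-up files; that option, like the file names, is my assumption about your setup rather than anything “find” requires.

```shell
tmp=$(mktemp -d)
# GNU touch's -d accepts relative dates; with neither -a nor -m given,
# it sets both the access time and the modification time.
touch -d '3 days ago'    "$tmp/recent"
touch -d '180 hours ago' "$tmp/about-a-week"   # 7.5 days
touch -d '10 days ago'   "$tmp/stale"

less_than=$(cd "$tmp" && find . -type f -atime -7 -print)   # fewer than 7 whole days
exactly=$(cd "$tmp" && find . -type f -atime 7 -print)      # 7 whole days (7.0 up to 8.0)
more_than=$(cd "$tmp" && find . -type f -atime +7 -print)   # more than 7 whole days

printf -- '-7: %s\n 7: %s\n+7: %s\n' "$less_than" "$exactly" "$more_than"
rm -rf "$tmp"
```

Each file lands in exactly one bucket, and the 7.5-day-old file counts as “7”, because the half day is thrown away before the comparison.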

And of course, if you want to find files that were accessed more than 2 days ago, but less than 9 days ago, you can say:

find . -atime +2 -atime -9 -print

Depending on your needs, it might well be better to use “mtime” rather than “atime”. I’m often interested in finding files I changed recently, rather than those I read recently. The same rules apply; here’s how I would find all of those files that I last modified more than two days ago but less than 9 days ago:

find . -mtime +2 -mtime -9 -print

Notice that I’m able to combine two rules (i.e., two “atime” or “mtime” rules) without using “-a” to join them together with a logical “and”.
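If you would rather spell out the “and”, you can. The following sketch (again using GNU touch’s “-d” and invented file names) runs the search both ways and gets the same answer from each.

```shell
tmp=$(mktemp -d)
touch -d '1 day ago'   "$tmp/yesterday.txt"
touch -d '5 days ago'  "$tmp/last-week.txt"
touch -d '12 days ago' "$tmp/older.txt"

# Adjacent tests are joined by an implicit "and" ...
implicit=$(cd "$tmp" && find . -type f -mtime +2 -mtime -9 -print)
# ... which is the same as writing -a between them.
explicit=$(cd "$tmp" && find . -type f -mtime +2 -a -mtime -9 -print)

printf 'implicit: %s\nexplicit: %s\n' "$implicit" "$explicit"
rm -rf "$tmp"
```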

Another useful thing to look for is big files. What files, for example, are bigger than 2 GB? I can say the following:

find . -size +2G -print

(I believe that this “-size” option only works this way on GNU find. Other versions might well require that you specify the file size in blocks. It has been a while since I used non-GNU versions.)

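Look familiar? That’s right; the “+” means “greater than”, and the “G” suffix means “GB” (strictly speaking, GNU find counts in units of 1,073,741,824 bytes).  You can use a bunch of suffixes to the number, to indicate just how big the file should be.  As you might have guessed, you can say “-2M” to mean “less than 2 MB”, which on a modern computer is just about everything, to be honest.  (One catch: GNU find rounds each file’s size up to a whole number of units before comparing, so “-2M” actually matches only files of 1 MB or less.)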

We can also combine these, just as we did with “atime” and “mtime”: What files are bigger than 500 MB and smaller than 5 GB?

find . -size +500M -size -5G -print
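You probably don’t want to create multi-gigabyte files just to experiment, so here is a scaled-down sketch of the same idea using kilobytes and megabytes. The file names are made up. It also shows a quirk of GNU find worth knowing: sizes are rounded up to whole units before the comparison, so a 1.5 MB file is not “-2M”. If you need exact boundaries, the “c” suffix (bytes) avoids the rounding.

```shell
tmp=$(mktemp -d)
head -c 10240   /dev/zero > "$tmp/small.bin"    # 10 KiB
head -c 614400  /dev/zero > "$tmp/medium.bin"   # 600 KiB
head -c 1572864 /dev/zero > "$tmp/mid.bin"      # 1.5 MiB
head -c 3145728 /dev/zero > "$tmp/large.bin"    # 3 MiB

# Bigger than 500k and smaller than 2M, using find's rounded units.
# mid.bin is rounded up to 2M, so it does NOT count as "less than 2M".
rounded=$(cd "$tmp" && find . -type f -size +500k -size -2M -print | LC_ALL=C sort)

# The same range expressed in bytes, where no rounding takes place.
exact=$(cd "$tmp" && find . -type f -size +512000c -size -2097152c -print | LC_ALL=C sort)

printf 'rounded units:\n%s\nexact bytes:\n%s\n' "$rounded" "$exact"
rm -rf "$tmp"
```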

We can combine these filters with others. What files are bigger than 500 MB and smaller than 5 GB, and were last accessed no more than 30 days ago?

find . -size +500M -size -5G -atime -30 -print

You can imagine using this sort of command, with “-atime +30” in place of “-atime -30”, to find large, unused files, such as old videos that you had forgotten are on your filesystem. Indeed, what if I’m only interested in finding MP4 files that are larger than 500 MB, smaller than 5 GB, and accessed in the last 30 days? I can add another condition:

find . -size +500M -size -5G -atime -30 -name "*.mp4" -print

There are lots of other filters you can apply, and GNU find is especially full of them. There are alternative ways to specify dates. You can search for particular types of special files.  You can search for certain permissions. And so forth.  But the ones I’ve shown you are the ones I’ve used most often.

But the tests are only the first part of using “find”: Once you’ve gotten a list of files, what can you do with them?

So far, we’ve seen a single action, namely “-print”.  There are a few others that you might find useful.

The first is “-ls”, which prints each file in the style of the Unix “ls” command (as if run with a few options that’ll show size and permissions):

find . -size +500M -size -5G -atime -30 -name "*.mp4" -ls

The above will not only print the filename (like “-print”), but will also show lots of other information about the files we’ve found. What if you want to write this list to a file? Then just use the “-fls” option, and give it a filename:

find . -size +500M -size -5G -atime -30 -name "*.mp4" -fls big-movies.txt

It’s pretty common to want to delete files. So you can use the “-delete” option to do so.  Warning: Running a program that automatically deletes files can be very dangerous. I almost never do this, because I’m always so worried that something will go wrong.  Here’s how I can remove, after changing into my Linux /var/log directory, all of the compressed backup files that are more than 21 days old:

find . -name '*.gz' -mtime +21 -delete -print

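Note that you can have more than one action; in this case, my first action was “-delete”, and my second was “-print”.  Order matters, though: “find” evaluates its arguments from left to right, so “-delete” must come after the tests. If you put it first, it will delete everything under the starting directory.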
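Given how nervous deleting makes me, one habit I’d suggest is a dry run: use “-print” with the tests first, look at the list, and only then swap in “-delete”. Here is a sketch of that workflow in a scratch directory, with invented log file names and GNU touch’s “-d” to fake the ages.

```shell
tmp=$(mktemp -d)
touch -d '40 days ago' "$tmp/syslog.3.gz" "$tmp/auth.log.2.gz"
touch -d '40 days ago' "$tmp/old-but-not-compressed.log"
touch -d '2 days ago'  "$tmp/syslog.1.gz"

# Step 1: dry run. Same tests, but only print what would be removed.
would_delete=$(cd "$tmp" && find . -name '*.gz' -mtime +21 -print | LC_ALL=C sort)
printf 'would delete:\n%s\n' "$would_delete"

# Step 2: once the list looks right, keep the tests and change the action.
(cd "$tmp" && find . -name '*.gz' -mtime +21 -delete)

remaining=$(cd "$tmp" && find . -type f -print | LC_ALL=C sort)
printf 'still there:\n%s\n' "$remaining"
rm -rf "$tmp"
```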

It’s pretty common for me to want to search through an entire directory for a file that contains particular text. In other words, I want to run the “grep” utility on each file. I can do that by using the all-purpose “-exec” action.  The basic idea is as follows: You hand “-exec” a command, and the command is then ended with \; (yes, backslash + semicolon). In between, you can write whatever Unix command you want, including options. The current filename can be put into the command with the special formula {} (i.e., empty curly braces).  For example, I can say:

find . -name "*.txt" -exec grep Reuven {} \;

The above will show all lines from all files containing my name. (Of course, a regular expression can be far more complex than this; if you aren’t familiar with grep or regexps, you can take my free “regular expressions crash course.”)  But the output only shows the lines we would get from “grep”, which (by default) doesn’t show the name of the current file if you’re running it one file at a time. For this reason, we would be wise to include the “-H” option:

find . -name "*.txt" -exec grep -H Reuven {} \;
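Here is a sketch of what that looks like on a few made-up files. I’ve also included the “{} +” ending, which is not used elsewhere in this post: instead of running “grep” once per file, it hands “grep” as many filenames as will fit on one command line, which is usually faster.

```shell
tmp=$(mktemp -d)
printf 'Reuven wrote this\nunrelated line\n' > "$tmp/a.txt"
printf 'nothing to see here\n'               > "$tmp/b.txt"
printf 'Reuven again\n'                      > "$tmp/c.md"   # wrong extension, never searched

# One grep process per matching file, ended with \;
per_file=$(cd "$tmp" && find . -name "*.txt" -exec grep -H Reuven {} \;)

# One grep process for the whole batch of files, ended with +
batched=$(cd "$tmp" && find . -name "*.txt" -exec grep -H Reuven {} +)

printf 'per file: %s\nbatched:  %s\n' "$per_file" "$batched"
rm -rf "$tmp"
```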

While “grep” is the most common command that I run via “-exec”, you can use any program you want, including programs that you’ve written.  In this way, you can really make “find” work for you, and execute custom code for each file that fits your criteria. Combine “find” with “cron”, and you have an easy way to identify files that need your attention, or that should be removed, or that you’ve been looking for and otherwise cannot find.
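With tests and “-exec” in hand, two of the tasks from the list at the top of this post can be sketched as follows. Everything here is a stand-in: the directory layout and file names are invented, an “archive” directory plays the role of /tmp/, and I use GNU touch’s “-d” to age the files and “-maxdepth 1” to keep “find” from descending into subdirectories.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/docs" "$tmp/archive"
printf 'Q3 budget draft\n' > "$tmp/docs/old-plan.txt"
printf 'Q4 budget draft\n' > "$tmp/docs/new-plan.txt"
printf 'meeting notes\n'   > "$tmp/docs/old-notes.txt"
touch -d '45 days ago' "$tmp/docs/old-plan.txt" "$tmp/docs/old-notes.txt"
touch -d '45 days ago' "$tmp/backup-old.log"
touch "$tmp/backup-new.log"

# Text files mentioning "budget" that have not been modified in the last 30 days.
# grep -l prints only the names of the files that contain a match.
stale_budget=$(cd "$tmp" && find docs -type f -name '*.txt' -mtime +30 -exec grep -l budget {} +)
printf 'stale budget files: %s\n' "$stale_budget"

# Move backup logs older than 30 days into the stand-in archive directory.
(cd "$tmp" && find . -maxdepth 1 -type f -name 'backup-*.log' -mtime +30 -exec mv {} archive/ \;)
archived=$(ls "$tmp/archive")
printf 'archived: %s\n' "$archived"

rm -rf "$tmp"
```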

If there’s one drawback to “find”, it’s that the search happens in real time. There is no database through which it runs. Which means that if you’re going through a very large directory structure, you might discover that “find” takes quite a while.

And that’s about it! If you’re like me, then you’ll find (no pun intended) that these use cases cover most of what you need with the “find” utility. The documentation is extremely long, but only because “find” has many other tests and actions that you can mix and match in a variety of ways.




My new course, “Understanding and Mastering Git,” is now available

Ah, Git.  It’s one of the best and most important tools I use as a software developer.   Git is everything I want in a version-control system: It’s fast. It lets me collaborate. I can work without an Internet connection.  I can branch and merge easily, using a variety of techniques.  I can take a personal project and turn it into a large, collaborative one with minimal effort. And it’s cross platform, meaning that I know my clients and colleagues will be able to use it.

So, what’s the problem?  Git’s learning curve is extremely steep.  Until you understand what Git is doing, and how it works, you cannot use it effectively.  Moreover, until you understand what Git is doing, you will likely be puzzled and frustrated by its commands, messages, and documentation.

I’m thus delighted to unveil my latest online course: Understanding and mastering Git.  This course, which I have taught to numerous companies all around the world for more than a decade, includes:

  • Nearly 80 video lectures, for a total of more than 7 hours of videos (preview a number of videos here, on the course page)
  • Dozens of exercises, to help you practice and understand working with Git
  • 11 slide decks (in PDF format), the same ones I use when teaching.

If you have been frustrated by Git, or consider the commands you’ve been using to be a form of black magic, then this course is for you.  It walks you through Git’s commands, objects, and methods for collaboration.

This course has been battle-tested for a decade at some of the world’s best-known companies.  If you want to get the most out of Git, I’m sure that my course will help.  And if it doesn’t?  E-mail me, and I’ll give you a 100% refund.

Don’t let Git frustrate you any more.  Understand it.  Master it.  Tame it.  Learn from my “Understanding and mastering Git” course, today.

Not sure if this course is for you?  That’s fine: You can preview a number of the course videos for free from the course sales page.

Also: If you’re a student or pensioner, then you qualify for a discount on the course.  Just e-mail me, and I’ll send you a special discount code.

And finally: If you live in a country outside of the top 30 per-capita GDP countries in the world, then e-mail me and I’ll send you a special discount code to make the course more affordable to you.

You can and should learn Git, and I want to help you to learn it.  Try my course, and discover why so many software engineers won’t even think about using something else.


A very sad day — the end of Linux Journal

[Update, as of August 8, 2019: Since I wrote this post, Linux Journal re-opened, thanks to a generous investment/purchase, and survived for another two years… And then, earlier today, I learned that LJ has closed, once again, and this time for good.  This post is just as accurate today as it was when I originally wrote it, back in 2017, when it shut down for the first time.]

In 1995, just before moving to Israel, I was working for Time Warner in New York City.  One day, I received email from a company called SSC, publishers of numerous “cheat sheets” for Linux users and programmers. They were about to publish a quick-reference guide for GNU Emacs, and since I was the maintainer of the Emacs FAQ, they wanted my input and thoughts.

I gave them my feedback, and the people at SSC were so thankful for my help that they offered to give me, free of charge, any 10 items from their online catalog. In this way, I got a bunch of quick-reference cards, as well as several issues of the then-new publication, Linux Journal.

I had been using Unix since 1988, and had just started to use this new, open-source version.  (I ordered a Linux distribution from a small company in Connecticut known as “Red Hat,” which made for a fairly straightforward installation on a PC. And by “fairly straightforward,” I mean that it only took a few days of configuration to get X Windows to work on my hardware.)  So I was excited to read through Linux Journal.  I loved what I read, and was in awe of the knowledgeable columnists they had.

In one of those issues was an ad, saying that the magazine would soon be starting a spinoff known as “Websmith,” all about a new phenomenon known as the “World Wide Web.”

Now, I loved to write: I had edited the student newspaper while studying at MIT, and I had always thought that it would be fun to write a column. So I e-mailed the editor of Websmith, asking if they would be interested in having me write a column about CGI programming — which was then the latest and greatest way to create what we now call “dynamic Web content.”  After all, I had been doing Web development since early 1993, back when I set up one of the first 100 sites in the world; I figured that my technical and writing experience could be of help.

The editor said “yes,” and my column — which he called “At the Forge,” fitting in with Websmith’s blacksmith theme — was off to the races.

Websmith didn’t survive for very long; it seems that the World Wide Web wasn’t interesting or big enough to support a business.  (Who knows?  Maybe that’ll change some day.)  But some of us who were writing for Websmith discovered that our columns and articles were simply incorporated into Linux Journal’s next issue, continuing from there.

In other words: I was suddenly and unexpectedly a Linux Journal columnist — a position that I have held, continuously and proudly, since 1996.

It’s thus extremely sad for me to know that Linux Journal is no more.  The magazine’s publisher, Carlie Fairchild, announced the magazine’s closure to the staff and writers earlier this week, and is telling subscribers today.

Just to put things in perspective: Before I met my wife, before my children were born, before I started my 11-year-long PhD program, I was writing for Linux Journal.

For as long as we have been married, my wife has heard the following, typically several times each month, each time with increased urgency: “I have to work on the column, which was due a few days ago. My editor is going to kill me.”

Ditto for my children, whose definition of “deadline” has definitely been affected — for the worse — by my monthly worrying, complaining, and late-night writing.

But I’m not complaining. Writing for Linux Journal for so long has been a defining part of me, and my career, for a very, very long time.

Thanks to LJ, I have had a chance to write, which I so love to do. I have learned a huge amount through my writing, because each column required that I spend time learning before I could teach.

Thanks to LJ, I’ve even gotten some clients, who contacted me as a result of my articles.

Thanks to LJ, my wife and I were invited to visit Alaska and the Caribbean, when I was a featured speaker with “Geek Cruises.”

Thanks to LJ, I got many free books and other resources, was invited to speak at conferences, and was (much to my pleasant surprise) recognized at technical conferences, even when I didn’t speak there.

And of course, thanks to LJ, I’ve gotten to work with some amazing people.  I’ve worked with a variety of editors, the most recent being Jill Franklin, all of whom were talented, helpful, and tolerant of my flexible interpretation of deadlines. They all gave me total freedom to write about whatever topic I wanted, whenever I wanted.  I got to explore all sorts of great topics that were of interest to me, and to my clients, and (I believe) to my readers.

I received e-mail from people all around the world who read my columns, which was deeply satisfying and gratifying.  At conferences, people would see my name badge and tell me that they had been reading my columns for many years.  Even now, as I publish a weekly newsletter for programmers, I often get messages from people saying that they read my column in Linux Journal, and are happy to reconnect with me.

The publishing industry is changing, and that reality is hitting publishers big and small. Indeed, earlier this week, Time Inc. (yes, where I worked when SSC first contacted me) sold itself to another company. The combination of publishing economics, along with the plethora of free, online content for open-source enthusiasts, makes it hard to produce a profitable magazine with top-notch technical content.

Consider the world in 1996: Linux was an oddball operating system, supported by no major manufacturers and seen as a hacker’s plaything. Perl and Python were used by lots of individuals, but not for too many serious applications. MySQL was free, but wasn’t open source, and there weren’t any open-source options for people who wanted to use a database.  And of course, Web applications were in their infancy; my “form-mail” program, which was used by countless sites to send e-mail (before it was taken and modified for the worse by Matt’s Script Archive) remained cutting edge for quite some time.

Things have changed a great deal since then. My phone uses Android, a form of Linux. When the flight attendant reboots the in-flight entertainment system, I see that it’s running Linux, as well. My 14-year-old daughter is learning Python. Companies from Apple to IBM, Cisco to PayPal, Ericsson to VMWare, in the US, Europe, Israel, and China, ask me to teach Python and Git to their employees. Who knew that the technologies I learned, and learned to love, so many years ago, would be so dominant in the world today?

And thus, while Linux Journal might have failed as a business, it succeeded in its mission: To spread the word of Linux and open-source software, to help people to understand, implement, and use open-source technology, and to bring these technologies into the mainstream as a legitimate alternative to the then-dominant commercial offerings.

To my many readers: Thanks for your support over the years. (And you can keep getting weekly writing from me via my “Better developers” newsletter.)

To the amazing staff at Linux Journal, over many years and in many iterations: Thanks for putting up with my delays, and for doing such a great job with my text.  I can’t imagine how they edited a magazine with such in-depth technical content, but they did, and did it well.

To my family: Thanks for putting up with my loud, monthly stresses, including the many times in which I wondered, out loud, how they could possibly think to have a columnist who cannot get his software to work.

Oh, and to the three (!) journalists who e-mailed me in the last 10 days, asking to whom they could pitch stories for Linux Journal: Sorry, folks.  You’re a bit too late.

RIP, Linux Journal.  It has been a fun, wild ride.


Aha! Preview this week’s live, online Python courses

In just 48 hours, I’ll be starting my latest round of live, online courses. Wondering what it’s like to take an online course from me? Or perhaps you’re wondering what sorts of topics I’ll discuss in my “Python dictionaries” and “Python functions” courses? Well, wonder no more; here’s a short preview of my teaching style, and the sorts of things I intend to demonstrate in my courses:

Preview: Reuven’s October 2017 live, online courses about Python and Git from Reuven Lerner on Vimeo.

If you are a beginning or intermediate Python developer, then you’ll become a more effective and fluent developer — good not only for your current employer, but for your career — thanks to my courses. And we’ll have lots of fun along the way. There will be plenty of time for exercises, questions, and comments, to ensure that you understand these technologies well. You’ll return to work the next day able to do more, and more quickly, than before.

Any questions? Just send me e-mail, and I’ll be happy to answer.

I look forward to seeing you this week (for my Python courses) and next week (for my two-day “Understanding Git” course)!



Announcing: Three new live courses, to level up your Python and Git skills

  • Confused by Python dicts, or wondering how you can take advantage of them in your programs?
  • Do you wonder how Python functions work, and how you can make them more “Pythonic,” and easier to maintain?
  • Do you wonder why everyone raves about Git, when it seems impossibly hard to understand?  Cloning, pulling, and pushing mostly work… but when they don’t, Git seems like magic, and not the good kind.

If any (or all) of the above is true, then you’ll likely be interested in one or more of the live, online classes I’m teaching later this month:

  1. Python dictionaries, on Wednesday, October 25
  2. Python functions, on Thursday, October 26
  3. Understanding Git, on Tuesday, October 31 and Wednesday, November 1

Each of these classes is live, with tons of live-coding demos, exercises, and time for Q&A. My goal is for you to understand these technologies, how they work, and (most importantly) how you can use them effectively in your work.

Previous classes have been small and highly interactive. These are the same classes I give to some of the best-known companies in the world, such as Apple, Cisco, IBM, PayPal, VMWare, and Western Digital; I’m sure that you’ll enjoy yourself, and come out a better engineer.

Better yet: Buy a ticket by this Friday, and you’ll get a substantial (20%) discount on the ticket price.

This is not a recorded class (although recordings will be available later on).  I’ll be speaking and interacting the entire time, giving you a chance to get your questions answered.  I want to make sure you really understand what’s going on, and will answer any questions you have!

Speaking of which: If you have questions, just e-mail me, and I’ll do my best to answer.

And if you’re a student, ask me for a coupon code that will give you a substantial discount off of the ticket price.

I hope that you can join me for one or more of these classes!

The easiest way to return to the last Git branch

I don’t know about you, but it’s common for me to switch between branches in Git.  After all, that’s one of the main advantages of using Git — the incredible ease with which you can create and merge branches.

Just a few minutes ago, I was in the “adwords” branch of an application I’m working on.  I wanted to go back to “master”, make sure that I hadn’t missed any commits from the central repository, and then go back to “adwords”.  If there were any commits in “master” that I was missing in “adwords”, I figured that I would just rebase from “master” to “adwords”.

So I did a “git checkout master”, and found that I was up to date with the central Git server. For reasons that I can’t explain, I then decided to try something out: Instead of returning to the previous branch with “git checkout adwords”, I instead typed

git checkout -

(That’s a minus sign after the word “checkout.”)

Sure enough, I returned to the “adwords” branch!

Now, there is a fair amount of logic to this: In the Unix shell, you can typically return to the previous directory with “cd -”, which has proven to be quite useful over the years.  In Git, of course, branches are just aliases to commits.  So “git checkout -” is returning you to the previous branch, but it’s really just taking you back to whatever the last commit was that you worked on.

I just checked the Git documentation (“git checkout --help”), and it seems that this is a special case of a more generalizable command structure:

As a special case, the "@{-N}" syntax for the N-th last branch/commit checks out 
branches (instead of detaching). You may also specify - which is synonymous
with "@{-1}".

I can’t imagine wanting to tell Git to return to the 5th-most-recent branch that I worked on, so this generalized formula seems a bit much to me.
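If you want to watch this happen without risking a real project, here is a sketch that builds a scratch repository and hops between two branches. The repository, the committer identity, and the empty commit are all invented for the demo; the branch names are the ones from my story above.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q .
# Make sure the first branch is called "master", whatever this machine's default is.
git -C "$repo" symbolic-ref HEAD refs/heads/master
git -C "$repo" -c user.name='Demo User' -c user.email='demo@example.com' \
    commit -q --allow-empty -m 'initial commit'

git -C "$repo" checkout -q -b adwords   # now on adwords
git -C "$repo" checkout -q master       # now on master

git -C "$repo" checkout -q -            # back to the previous branch
after_dash=$(git -C "$repo" rev-parse --abbrev-ref HEAD)

git -C "$repo" checkout -q '@{-1}'      # the long-hand spelling of the same idea
after_at=$(git -C "$repo" rev-parse --abbrev-ref HEAD)

printf 'after "-": %s\nafter "@{-1}": %s\n' "$after_dash" "$after_at"
rm -rf "$repo"
```

The first hop lands on “adwords”, and the second one flips back to “master”, since by then “adwords” has become the current branch and “master” the previous one.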

I predict that this trick will save me precious seconds every day, all of which I squandered in writing this blog post.  But I do think that this is a super-cool trick and feature, and demonstrates once again how clever and useful Git is.


Control-R, my favorite Unix shell command

If you use a modern, open-source Unix shell — and by that, I basically mean either bash or zsh — then you really should know this shortcut.  Control-R is probably the shell command (or keystroke, to be technical about it) that I use most often, since it lets me search through my command history.

Let’s start with the basics: When you use bash or zsh, your commands are saved into a history file, whose location is typically stored in the shell variable HISTFILE.  I use zsh (thanks to oh-my-zsh), and it sets my HISTFILE to ~/.zsh_history.  How many commands does it store?  That depends on the value of the variable HISTSIZE, which in my case is 10,000.  (Strictly speaking, HISTSIZE caps the history kept in memory; zsh’s SAVEHIST and bash’s HISTFILESIZE cap what is written to the file.)  Yes, I store the 10,000 last commands that I entered into my shell.
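As a rough sketch, the relevant lines in a zsh startup file might look like the following. The values mirror the ones I mentioned above; whether your own setup (or a framework such as oh-my-zsh) already sets them is something you’ll need to check. In bash, the counterpart to SAVEHIST is HISTFILESIZE.

```shell
# Hypothetical ~/.zshrc fragment for a large shell history.
HISTFILE="$HOME/.zsh_history"   # file in which the shell saves the history
HISTSIZE=10000                  # commands kept in memory during a session
SAVEHIST=10000                  # commands written out to $HISTFILE (zsh)
```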

Now, before control-R, there were a bunch of ways to search through and use the history.  Each command has its own number, and thus if you want to replay command 5329, you can do so by typing

!5329

But this requires that you keep track of the numbers, and while I used to do that, I found it to be more annoying than useful.  What I really wanted was just to repeat a command … you know, the last time I ssh’ed into a server, or something.  So yeah, you can do

!ssh

and you’ll get the most recent “ssh” command that you entered.  But what if you have used ssh lots of times, to lots of servers?  You could start to search for the server name, but then things start to get complicated, messy, and annoying.

What control-R does is search backwards through HISTFILE, looking for a match for what you have entered until now.  If you use Emacs, then this will make perfect sense to you, since control-R is the reverse version of control-S in Emacs.  If you don’t know Emacs, then it’s a crying shame — but I’ll still be your friend, don’t worry.

Let’s say you have ssh’ed into five different servers today, and you want to ssh again into the third server of the bunch.  You type control-R, which puts you into bck-i-search (i.e., “backward incremental search”) mode.  Now type “s” (without enter).  The most recent command that you entered, which contains an “s”, will appear.  Now type another “s” (again, without pressing enter).  The most recent command containing two “s” characters in a row will appear.  Depending on your shell and configuration, the matching text might even be highlighted.

Now enter “h”.   In my case, I got to the most recent call to “ssh” that I made in my shell.  But I don’t want this last (fifth) one; I want the third one.  So I enter control-R again, and then again.  Now I’m at the third time (out of five) that I used ssh today, at the command I want.  I press “enter”, and I’ve now executed the command.
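Control-R is interactive, so I can’t really show it in a script. But as a rough, non-interactive analogue, here is a sketch that does the same “third most recent match” lookup with grep. It assumes a plain-text history file with one command per line (real zsh history files may carry timestamps), and the server names are invented.

```shell
histfile=$(mktemp)
cat > "$histfile" <<'EOF'
ssh alpha.example.com
ls -l
ssh beta.example.com
git status
ssh gamma.example.com
ssh delta.example.com
cd /tmp
ssh epsilon.example.com
EOF

# Typing "ssh" after control-R shows the newest match (epsilon).
# Each extra control-R steps one match further back: delta, then gamma.
newest=$(grep 'ssh' "$histfile" | tail -n 1)
third_back=$(grep 'ssh' "$histfile" | tail -n 3 | head -n 1)

printf 'first match shown: %s\nafter two more control-Rs: %s\n' "$newest" "$third_back"
rm -f "$histfile"
```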

While searching backward, if you miss something because you hit control-R one too many times, you can use control-S to search forward.  You can use the “delete” key to remove characters, one at a time, from the search string.  And you can use “enter”, as described above, to end the search.  I should also note that I’ve modified my zsh prompts such that the matched text in control-R is highlighted, which has made it even more useful to me.

So, when was the last time I entered the full “ssh” command into a client’s server? I dunno, but it was a while ago… since the odds are that within the 10,000 most recent commands, I’ve got a mention of that client’s server.  And if I needed to pass specific options to ssh, such as a port number or a certificate file to get into AWS, that’ll be in the history, too.  By combining a huge history with control-R, you can basically write each command once, and then refer back to it many times.

Now the fact is that control-R isn’t really unique to bash or zsh.  In bash, it comes from a GNU library called “readline” that is used in a large number of programs.  (zsh has its own line editor, ZLE, which provides the same keystroke.)  For example, readline or a compatible line editor is used in IPython, Pry, and the psql command-line client for PostgreSQL.  Everywhere I go, I can use control-R, and I do!  Each program saves its own history, so there’s no danger of mixing shell commands with PostgreSQL queries.


If you build it, they will come — but they might hate you

Several months ago, I was teaching an introductory Python course, and I happened to mention the fact that I use Git for all of my version-control needs.  I think that I would have gotten a more positive response if I had told them that my hobby is kicking puppies.

The reactions were roughly — and I’m not exaggerating here — something like, “What?  You use Git?!?  That so-called version control system whose main feature is eating our files?!?”   And I got this not just from one person, but from all 20-something people who were taking my Python course.  The more experience they had with Git, the more violently negative their reactions were.

I managed to calm them down a bit, and tried to tell them that Git is a wonderful system, except for one little problem, namely the fact that its interface is very hard to understand.  But, I promised them, once you understand how Git works, and once you start to work with it within the context of understanding what it’s doing, things start to make sense, and you can really enjoy and appreciate the system.

I should note that since that Python class, I’ve returned to the same company to give two day-long Git classes.  Based on the feedback I received, the Git class was very helpful, and I’m guessing that this is because I concentrated on what Git is really doing, and how the commands map to those actions.  I’m pretty sure that people from that class are starting to appreciate the power and flexibility of Git, rather than focusing only on their frustrations with it.

However, my experience working with and teaching Git has taught me a great deal about designing both software and UIs.  We love to say and think that excellent products with terrible marketing never get anywhere.  And in the commercial world, that might well be true. Everyone loves to quote the movie “Field of Dreams” (which I never really liked anyway), and how the main character builds a baseball field after repeatedly hearing, “If you build it, they will come.” As numerous other people have said, this is not the case for businesses: If you build it, they probably won’t come, unless you’ve invested time and money in marketing your product.

However, in the open-source world,  we expect to invest time in learning a technology, and are generally more technical folks in any event.  Thus, we tend to be more forgiving of bad UIs, focusing on features rather than design. It’s thus possible for something brilliant, efficient, flexible, and profoundly frustrating for new users to become popular. Git is a perfect example of this.

Now, I happen to think that Git is one of the most brilliant pieces of software I’ve ever seen. Really, it’s impressively designed.  However, the commands are counter-intuitive for many people who used other version-control systems, and it’s possible to get yourself into a situation from which an expert can extract himself or herself, but in which a novice is completely befuddled.  Once you understand how Git works (brilliantly described in this video), things start to make sense.  But getting to that point can take a great deal of time, and not everyone has that time.

In open source, then, “If you build it, they will come” might sometimes work.  However, even if they do come, and even if they use the software that you have written, you might end up in a particularly unenviable situation: People will use the software, but will hate you for the way in which you designed it.

The upshot, then, is that it’s worth taking a bit of time to think about your users, and how they will use your system.  It’s worth taking the time to create an interface (including commands) that will make sense for people.  Look at WordPress, for example: It packs in a great deal of functionality, but also pays attention to the UI… and as a result, has become a hugely dominant part of the Web ecosystem.

Sure, Git is famous and popular, and I’m one of its biggest fans, at least in terms of functionality. But if Linus had spent just a bit more time thinking about command names, or behaviors, I think that we would have had an equally powerful tool, but with fewer people in need of courses to understand why their files are getting trampled.

Good intentions, unexpected results: Mailing lists and DMARC

If there’s anything that software people know, it’s that changing one part of a program can result in a change in a seemingly unrelated part of the program.  That’s why automated testing is so powerful; it can show you when you have made a mistake that you not only didn’t intend, but that you didn’t expect.

If unexpected results can happen in a system that you control and supposedly understand, it’s not hard to imagine what happens when the results of your changes involve many pieces of software other than yours, running on computers other than yours, being used by customers who aren’t yours.

This would appear to be the situation with one of the latest anti-spam and security features for e-mail, known as DMARC.

I’m not intimately familiar with this standard, but I’ve seen enough other standards relating to e-mail in the past to know that anything having to do with e-mail will be frustrating for some of the people involved.  E-mail is in use by so many people, on so many computers, and by so many different programs, that you can’t possibly make changes without someone getting upset.  Nevertheless, the DMARC implementation and rollout by a number of large e-mail providers over the last few weeks has been causing trouble.

Let me explain: DMARC promises, to some degree, to reduce the amount of spam that we get by verifying that the sender’s e-mail address (in the “From” field) matches the server from which the e-mail was sent.  So if you get e-mail from me, with my address in the “From” field, DMARC will verify that the e-mail was really sent from my domain’s server.  To anyone who has received spam, or fake messages, or illegal “phishing” messages, this sounds like a great thing: No longer will you get messages that claim to come from your friend’s address, asking for money now that they’re stranded in London.  It really, admirably aims to reduce the number of such messages.

How? Very simply, by checking that the “From” address in the message matches the server from which the message was sent.  If your DMARC-compliant server receives e-mail with my address in the “From” field, but the sending server was some anonymous IP address in Mongolia, your server will refuse to receive the e-mail message.
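Here is a rough Python sketch of the check as I have just described it.  It is deliberately simplified: as I understand it, real DMARC works by evaluating SPF and DKIM results against a policy that the domain owner publishes in DNS, and none of that is modeled here.  The function names and the example domains are my own placeholders.

```python
from email.utils import parseaddr


def from_domain(from_header):
    """Extract the lowercased domain from a "From" header value."""
    _name, address = parseaddr(from_header)
    return address.rpartition("@")[2].lower()


def accept_message(from_header, sending_server_domain):
    """Accept only if the "From" domain matches the sending server's domain.

    This models the simplified description in the text, not the real
    DMARC evaluation of SPF, DKIM, and published DNS policy.
    """
    return from_domain(from_header) == sending_server_domain.lower()


# example.com and example.net are reserved placeholder domains
print(accept_message("Friend <friend@example.com>", "example.com"))
print(accept_message("Friend <friend@example.com>", "mail.example.net"))
```

The first message would be accepted and the second refused, because the domain in its “From” field does not match the server that delivered it.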

So far, so good.  But of course, for every rule, there are exceptions.  Consider, for example, e-mail lists: When someone posts to a list, the “From” address is preserved, so that the message appears to be coming from the sender.  But in fact, the message isn’t coming from the sender.  Rather, it’s coming from the e-mail program running on a server.

For example, if I send e-mail to a mailing list, the e-mail will really be coming from the list’s server.  But it’ll have my address in the “From” field.  So now, if a receiver is using DMARC, they’ll see the discrepancy, and refuse to receive the e-mail message.

If a poster’s e-mail provider is using DMARC in the strictest way possible, then sending to the list will have especially unpleasant consequences: That provider will refuse to receive its own subscriber’s message to the list, because DMARC will show it to be a fake.  These refusals will count as a “bounce” on the mailing list, meaning a message that failed to get to the recipient’s inbox.  Enough such bounces, and everyone at that provider will be unsubscribed.

Yes, this means that if your e-mail provider uses DMARC, and if you subscribe to an e-mail list, then posting to such a list may result (eventually) in every other user of your provider being unsubscribed from the list!
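The chain of events is easier to see in a small simulation.  The following Python sketch is built on assumptions of my own: the bounce limit of 3, the addresses, and the idea that a “strict” provider both publishes and enforces the policy are all invented for illustration, and real list software scores bounces in its own way.

```python
from collections import Counter

BOUNCE_LIMIT = 3  # hypothetical threshold; real list software has its own scoring


def provider_of(address):
    """Return the lowercased domain part of an e-mail address."""
    return address.rpartition("@")[2].lower()


def deliver_post(poster, subscribers, strict_providers, bounces):
    """Simulate one list post and return the subscribers still on the list.

    A copy is rejected when the poster's provider is strict and the
    subscriber's provider is strict too, because the copy arrives from
    the list server rather than from the poster's provider.
    """
    poster_is_strict = provider_of(poster) in strict_providers
    remaining = []
    for subscriber in subscribers:
        if poster_is_strict and provider_of(subscriber) in strict_providers:
            bounces[subscriber] += 1
        if bounces[subscriber] < BOUNCE_LIMIT:
            remaining.append(subscriber)
    return remaining


subscribers = ["ann@strict.example", "bob@strict.example", "cat@other.example"]
bounces = Counter()
for _ in range(BOUNCE_LIMIT):
    subscribers = deliver_post(
        "ann@strict.example", subscribers, {"strict.example"}, bounces
    )
print(subscribers)
```

After three posts by a single user of the strict provider, only the subscriber at the other provider is left.  Bob never posted anything, and he was unsubscribed anyway.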

I’ve witnessed this myself over the last few weeks, as members of a large e-mail list I maintain for residents of my city have slowly but surely been unsubscribed.  Simply put, any time that a Hotmail, Yahoo, or AOL user posts to the list for Modi’in residents, all of these companies (and perhaps more) refuse the message.  This refusal increases the number of bounces attributed to the users, and eventually results in mass automatic unsubscriptions.

As if that weren’t bad enough (and yes, it’s pretty bad), people who have been passively reading the e-mail list for years (i.e., not participating in it) are now getting cryptic messages from the list-management software, saying that they have been unsubscribed because of excessive bounces.  Most people have no idea what this means, which in turn leads to the list managers (such as me) having to explain intricate e-mail policy issues.

There are some solutions to this problem, of course.  But they’re all bad, so far as I can tell, and came without any serious warning or notification.  And when it comes to e-mail, you really don’t want to start rejecting messages en masse without warning.  The potential solutions are:

  1. Subscribers can receive the digest mode of the list, which is always “From” an address on the server.  If you get the digest, this problem won’t happen to you.  If you are a mailing-list subscriber, rather than a list administrator, this is really the only recourse that you have.
  2. The list managers can change the list such that instead of each message being “From” the individual, it’ll come from the list’s address.  I know that there are some people who say that this is the right behavior for e-mail lists, but I have long subscribed (so to speak) to the school of thought that you don’t want to change the “From” address.  (For more on this subject, you can read “reply-to considered harmful” and its associated messages.)
  3. Supposedly, Mailman (the list-management software that I use) now has some support for DMARC that might solve the problem.  But the more I learn about DMARC, the less I’m convinced that Mailman can do anything.
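For anyone curious what option 2 amounts to in practice, here is a hedged Python sketch using the standard library’s e-mail classes.  The list address, the “via list” wording, and the choice to keep the poster’s address in Reply-To are all assumptions made for this example; I’m not claiming that this is what Mailman or any other list manager actually does.

```python
from email.message import EmailMessage
from email.utils import formataddr, parseaddr

LIST_ADDRESS = "residents@lists.example.org"  # hypothetical list address


def rewrite_from(message, list_address=LIST_ADDRESS):
    """Make the message appear to come from the list itself.

    Keeping the poster's address in Reply-To is an assumption of this
    sketch, so that replies can still reach the original author.
    """
    name, address = parseaddr(message["From"])
    display = f"{name or address} via list"
    del message["From"]
    message["From"] = formataddr((display, list_address))
    if "Reply-To" not in message:
        message["Reply-To"] = address
    return message


post = EmailMessage()
post["From"] = "Ann Poster <ann@strict.example>"
post["To"] = LIST_ADDRESS
post["Subject"] = "Street fair on Friday"
post.set_content("See you there!")

rewritten = rewrite_from(post)
print(rewritten["From"])
print(rewritten["Reply-To"])
```

The message now claims to come from the list’s own domain, which matches the server that sends it.  The price is exactly the one I object to above: the “From” field no longer tells you who wrote the message.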

And by the way, it’s not just little guys like me who are suffering.  The IETF, which writes the standards that make the Internet work, recently discovered that their e-mail lists are failing, too.

E-mail lists are incredibly useful tools, used by many millions (and perhaps billions) of people around the world.  You really don’t want to mess with how they work unless there’s a very good reason to do so.  Yes, spam and fraud are big problems, and I welcome the chance to change them.  

But really, would it have been so hard to contact all of the list-management software makers (how many can there be?) and work out some sort of deal?  Or at least get the message out to those of us running lists that this is going to happen?  I have personally spent many hours now researching this problem, and trying to find a solution for my list subscribers, with little or no success.

This all brings me back to my original point: The intentions here were good, and DMARC sounds like a good idea overall.  But it is affecting, in a very negative way, a very large number of people who are now suddenly, and to their surprise, cut off from their friends, colleagues, workplaces, and organizations.  The fact that AOL and other e-mail providers are saying, “Well, you’ll just need to reconfigure your list software,” without considering whether we want to do this, or whether e-mail lists really need to change after more than two decades (!) of working in a certain way, is rather surprising to me.  I’m not sure if there’s any way back, but I certainly hope that this is the last time such a drastic, negative solution is foisted on the public in this way.

Convention over confusion

One of the most celebrated phrases that has emerged from Ruby on Rails is “convention over configuration.” The basic idea is that software can traditionally be used in many different ways, and that we can customize it using configuration files. Over the years, configuration files for many types of software have become huge; installing software might be easy, but configuring it can be difficult. Moreover, given the option, everyone will configure software differently. This means that when you join a new project, you need to learn that project’s specific configuration and quirks.

“Convention over configuration” is the idea that we can make everyone’s lives easier if we agree to restrict our freedom. Ruby on Rails does this by telling you precisely what your directories will be named, and where they will be located. Rails tells you what to call your database tables, your class names, and even your filenames. The Ruby language, while generally quite open and flexible, also enforces certain conventions: Class and module names must begin with capital letters, for example.
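The flavor of this is easy to show in a few lines of Python.  The sketch below derives a table name and a file location from nothing but a class name, which is the kind of thing Rails does for you.  The function names are my own, and the pluralization is deliberately naive (it just appends “s”), so treat it as an illustration of the idea rather than a description of how Rails computes its names.

```python
import re


def snake_case(class_name):
    """Convert a CamelCase class name to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


def table_name_for(class_name):
    """Derive a plural table name; irregular plurals are not handled."""
    return snake_case(class_name) + "s"


def model_path_for(class_name):
    """Derive the conventional file location for a model class."""
    return f"app/models/{snake_case(class_name)}.rb"


print(table_name_for("BlogPost"))
print(model_path_for("BlogPost"))
```

Because every name follows from the class name, there is nothing to write down in a configuration file, and nothing for a new developer on the project to look up.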

It can take some time for developers to accept these conventions. Indeed, I was one of them: When I first started to work with Rails, I was somewhat offended to be told precisely what my database column names would be, especially when those names contradicted advice that I had heard and adopted years ago. (The advice was to prefix every column in a database table with the name of the table, which would make it more easily readable in joins.  Thus the primary key of the “People” table would be person_id, followed by person_first_name, person_last_name, and so forth.)  Over time, I have grown not only to use these Rails conventions, but to enjoy working with them; it turns out that people can change pretty easily, at least when it comes to these arbitrary decisions.

The real benefit of such conventions has nothing to do with my own work. Rather, it reduces the need for communication among people working on the same project. If everyone does it the same way, then there are fewer things to negotiate, and we can all concentrate on the real problems, rather than the ones which are relatively arbitrary.

Back in college, I was the editor of the student newspaper. We, like many newspapers, used the AP Stylebook to determine the style that we would use. The AP Stylebook was our bible; whatever it said, we did.  Of course, we also had our own local style, to cover things that AP didn’t, such as building names and numbers (e.g., we could refer to “Building 54”). In some cases, I personally disagreed with the AP Stylebook, especially when it came to the “Oxford comma.” But by keeping that rule, we were able to download articles from the Washington Post and LA Times, and stick them into our newspaper with minimal editing. Again, I prefer the serial comma, and use it in my personal writing. By adhering to a standard, I was able to ensure consistency in our writing, and reduce the workload of the (already hard-working) newspaper staff.

Twice in the last few weeks, I’ve been reminded of the benefits of convention over configuration — both times, when developers on projects I inherited decided to flout the rules. Their decisions weren’t wrong, but they were so wildly different from the conventions of Rails that they caused trouble, delays, and bugs.

The first case had to do with the Rails “asset pipeline,” a part of Rails which handles static assets such as JavaScript and CSS files. The idea is that you create a file called application.js, and that file then tells Rails about all of the JavaScript files used by your application. Before deploying a new version of your application, Rails combines all of these files into one big file, thus improving site performance (by reducing the number of files to download) and improving caching. The asset pipeline is a great idea, and it even works well — but in many cases, getting it to work correctly can be difficult and painful, particularly if you’re new to Rails.
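Stripped of everything clever, the combining step looks something like the following Python sketch.  The “//= require” lines imitate the directive style used in Rails manifests; the file names, their contents, and the bundle function are hypothetical, and a real pipeline reads files from disk and does a good deal more than concatenate them.

```python
def bundle(manifest_text, sources):
    """Concatenate the files named by "//= require" lines in a manifest.

    sources maps a required name to that file's contents. Any other
    line in the manifest is ignored by this sketch.
    """
    prefix = "//= require "
    parts = []
    for line in manifest_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            name = line[len(prefix):].strip()
            parts.append(sources[name])
    return "\n".join(parts)


# Hypothetical manifest and file contents
manifest = """\
//= require menu
//= require charts
"""
sources = {
    "menu": "function openMenu() {}",
    "charts": "function drawChart() {}",
}
print(bundle(manifest, sources))
```

Notice that the whole mechanism starts from one well-known manifest.  Everything downstream assumes that it knows where to find that file.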

So you can imagine my surprise when I looked for the application.js file, and didn’t find it.  That was bad enough, but the asset pipeline mechanism, as well as the deployment scripts I was developing, got rather confused by the absence of application.js. When I confronted the original developer about this, he told me that actually, he liked to call it something else entirely, reflecting the name of the application and client. Why? He didn’t really have a technical reason; it was all for reasons of aesthetics. The fact is that the rest of the Rails ecosystem expected application.js, though, so his decision meant that the rest of the software needed to be configured in a special, different way.

As a way of justifying his decision, the other developer told me, “Conventions shouldn’t be a boundary when developing.”  No, just the opposite — the idea is that conventions are there to limit you, to tell you to work in a way that everyone else works, so that things will be smoother.  In much of the world, we drive on the right side of the road.  This is utterly random; as numerous countries (e.g., England) have proven, you can drive on the other side of the road just fine — but only so long as everyone is doing it.  The moment everyone decides on their own conventions, big problems can occur.

When Biblical Hebrew wants to describe anarchy, it uses the phrase, “People did whatever was right in their own eyes.”

Something similar occurred with another project where I inherited code from someone else: One of my favorite things about Ruby on Rails is the fact that it runs the application in an “environment.”  The three standard environments are development (which is optimized for developer speed, not for execution speed), production (which is optimized for execution speed), and test (which is meant for testing). The environments aren’t meant to change the application logic, but rather the way in which the application behaves.  For example, I recently changed the way in which e-mail is sent to users of my dissertation software, the Modeling Commons. When I send the e-mail in the “production” environment, the e-mail is actually sent, but when I do so within the “development” environment, the e-mail is opened in a browser, so that I can examine it.  This is standard and expected behavior; all Rails applications have development, production, and test environments, and some even have a “staging” environment, in which we prepare things.
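In Python terms, the e-mail example might look like the sketch below.  The names are hypothetical: APP_ENV is a variable I made up for the example, and the “collect” behavior for the test environment is my own assumption about what a test setup would want.  The point is only that the environment selects how a message is delivered, not what the application decides to send.

```python
import os


def delivery_method(environment):
    """Pick how outgoing e-mail is handled in a given environment."""
    methods = {
        "production": "send",              # really deliver the message
        "development": "open_in_browser",  # preview instead of sending
        "test": "collect",                 # assumption: keep in memory for checks
    }
    return methods[environment]


def deliver(message, environment=None):
    """Return the delivery action and the unchanged message."""
    # APP_ENV is a hypothetical variable name chosen for this sketch
    environment = environment or os.environ.get("APP_ENV", "development")
    return delivery_method(environment), message


print(deliver("Welcome to the site!", "production"))
print(deliver("Welcome to the site!", "development"))
```

The message itself is identical in both calls.  Only the delivery mechanism differs, which is what an environment is for.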

My client’s software, which I inherited from someone else, decided to do something a bit different: The code was meant to be used on several different sites, each with slightly different logic.  The developer decided to use Rails environments in order to distinguish between the logical functions.  Thus, if you run the application under the “xyz” environment, you’ll get one logical path, and if you run the application under the “abc” environment, you’ll get another logical path.

It’s hard to describe the number of surprises and problems that this seemingly small decision has created: It means that we can’t really test the application using the normal Rails tools, because nothing will work correctly in the “test” environment. It means that the Phusion Passenger server that we installed to run the application needs an additional, special configuration parameter (not normally needed in production) to find the right database, and execute with the correct algorithms. It means that when you’re trying to trace through the logic of the application, you need to check the environment.

Basically, all of the things that you can assume about most Rails applications aren’t true in this one.

Now, the point of me writing this isn’t to say that I’m brilliant and that other developers are stupid (although it is true that Reuven’s First Law of Consulting states that a new consultant on a project must call his predecessor a moron).  Rather, it’s to point to the fact that conventions are there for a reason, and that if you insist on ignoring them, you’ll be steepening the learning curve that other developers face before they can work on your application.  Now, if you have oodles of time and money, that’s just fine.  But as a general rule, a developer’s time is a software company’s greatest expense, and anything you can do to increase productivity, and decrease the need for explanations and communication, is worthwhile.

By the way, this is the whole reason why one of the Python mantras is, “There’s only one way to do it,” a direct contrast with the Ruby and Perl mantra, “There’s more than one way to do it.” Having a single, common way to do things makes everyone’s code more similar, more readable, and easier to understand. It doesn’t stop you from doing brilliant and interesting things, but does ask that you demonstrate your brilliance within the context of established practice.

Of course, this doesn’t mean that conventions are written in stone, or that they are unchangeable.  But if and when you ignore them, it should be for good reason.  Even if you’re right, think about whether you’re so right that it’s worth having multiple people learn your way of doing things, instead of the way that they’re used to doing them.

What do you think?  Have you seen these sorts of issues in your work?  Let me know!