Category Archives for "PostgreSQL"

Reminder: My “Intro to SQL” course is this Sunday!

This is just a reminder that on Sunday, I’ll be teaching a live, 4-hour introductory course on databases and SQL. If you haven’t ever worked with databases before, then this will give you the push that you need to understand how they work, and how to work effectively with them.

I’ve been using SQL for 30 years, but it’s far from an outdated skill. Not only do many of the world’s largest companies run on databases using SQL, but the smartphone in your pocket does, too. Moreover, once you understand the basics, you can start to ask serious and interesting questions of public and private datasets.

In this class, I’ll introduce you to SQL via the open-source PostgreSQL database. There will be numerous exercises and many opportunities to ask questions. And after the course is over, you’ll have access to an exclusive forum in which you can ask me questions about SQL.

Sounds good? Sign up here:

I’ve gotten a lot of e-mail with questions from people curious about the course. So without further ado, here’s a FAQ:

How much programming do I need to know for this course?

None; SQL is a query language, not a programming language. It’ll help to be able to think logically, which programming certainly helps to develop. But we won’t be using any language other than SQL itself.

What will I be able to do after this course?

You’ll understand how relational databases work in a general way. But beyond that, you’ll be able to:

  • Retrieve data using queries
  • Insert, update, and delete data
  • Create indexes
  • Assign a “primary key” (and know why that’s important)
  • Understand the value of normalization
  • Perform basic “joins” between tables

I already know some SQL. Is this course a good fit for me?

Probably not. We’ll be doing basic queries — the sort of thing that everyone needs to know when working with databases, but nothing too fancy.

That said, I’ve gotten enough questions like this one that I’ll almost certainly do an “Advanced SQL” course in the coming months.

Is this a course in using Python with SQL?

No, this course is only about SQL. We won’t be using Python! However, just as I’ll be doing an advanced SQL course in the coming months, I’ll probably offer one on using databases from within Python. But you’ll need the knowledge in this course to take that one.

What do I need to install?

I’ll be setting up a virtual machine with PostgreSQL installed, along with the pre-loaded data sets we’ll be using. And I’ll give you a copy of that when we’re done, so that you can practice at home. We’ll likely use a Web-based interface; I’ll give you full details before the course opens. And if you need more help, then I’ll provide it in class and in our forum.

Yeah, what’s with the forum you keep mentioning?

When I did my “pytest” course a few months ago, several people suggested that I provide either office hours or a forum after the course had ended, in case people had follow-up questions. This course will thus have a forum, open to anyone who has registered, where you can ask questions after the course is over. So I don’t see this as a one-and-done class, but rather as a chance to ask questions and keep learning after it’s over.

Why should I take this course, when there are lots of introductory books and tutorials?

Just as my Python courses go beyond the syntax and help you to really understand what’s going on behind the scenes, this course will help you to think in terms of databases and tables. You’ll understand not just the “what” but the “why.” And when the course is over, you’ll have the foundation you need to learn and do more on your own.

But wait, there’s more: Because this is a live course, you’ll have a chance to ask questions and get detailed explanations. My goal is to really help you learn, and if that means explaining something several times until you get it, then that’s what I’ll do. This isn’t a generic course; it’s one that will cater to your specific needs.

Do you offer any discounts?

Absolutely! My standard discount policy applies:

  • Students
  • Seniors/pensioners/retirees
  • People living outside of the world’s 30 richest countries
  • Anyone affected adversely by the coronavirus/covid-19 pandemic

Questions? Need a discount code? Just e-mail me at, or contact me on Twitter as @reuvenmlerner.

Otherwise, learn more and sign up at I look forward to seeing you on Sunday!

Join the data revolution with my “Intro to SQL” course!

Have you heard? Data is “the new oil” — meaning, data is the most valuable and important thing in the modern world. Which means that if you can store, retrieve, and organize your data, then you (and your company) are positioned for greater success.

This usually means working with a database — and frequently, a relational database, with which you communicate using a language called SQL.

In other words: SQL is the key to the modern data revolution. But too often, people are put off from learning SQL. It seems weird, even when compared with a programming language.

Well, I have good news: If you want to join the data revolution and work with databases, I’m offering a new course. On November 15th, I’ll be teaching a live, 4-hour online course, “Intro to SQL.” I’ll teach you the basics of what you need to work with a database.

The course includes:

  • Access to the live, 4-hour online course, including numerous exercises and opportunities for Q&A
  • Access to the course recording, forever
  • Participation in our private forum, where you can ask me (and others) database-related questions

I’ve been using databases since 1995, and have been teaching SQL at companies for more than 20 years. This course is based on that corporate training, and is meant to jump-start you into the world of data and relational databases. We’ll be using PostgreSQL, a powerful open-source database I’ve been using for more than two decades.

Questions? Learn more at (where there’s an extensive FAQ). Or contact me on Twitter (@reuvenmlerner) or via e-mail ( I’ll answer as soon as I can.

I hope to see you there!


Yes, you can master regular expressions!

Announcing: My new book, “Practice Makes Regexp,” with 50 exercises meant to help you learn and master regular expressions. With explanations and code in Python, Ruby, JavaScript, and PostgreSQL.

I spend most of my time nowadays going to high-tech companies and training programmers in new languages and techniques. Actually, many of the things I teach them aren’t really new; rather, they’re new to the participants in my training. Python has been around for 25 years, but for my students, it’s new, and even a bit exciting.

I tell participants that my job is to add tools to their programming toolbox, so that if they encounter a new problem, they’ll have new and more appropriate or elegant ways to attack and solve it. Moreover, I tell them, once you are intimately familiar with a tool or technique, you’ll suddenly discover opportunities to use it.
Earlier this week, I was speaking with one of my consulting clients, who was worried that some potentially sensitive information had been stored in their Web application’s logfiles — and they weren’t sure if they had a good way to search through the logs.


I suggested the first solution that came to mind: Regular expressions.

Regular expressions are a lifesaver for anyone who works with text.  We can use them to search for patterns in files, in network data, and in databases. We can use them to search and replace.  To handle protocols that have changed ever so slightly from version to version. To handle human input, which is always messier than what we get from other computers.

Regular expressions are one of the most critical tools I have in my programming toolbox.  I use them at least a few times each day, and sometimes even dozens of times in a given day.

So, why don’t all developers know and use regular expressions? Quite simply, because the learning curve is so steep. Regexps, as they’re also known, are terse and cryptic. Changing one character can have a profound impact on what text a regexp matches, as well as its performance. Knowing which character to insert where, and how to build up your regexps, is a skill that takes time to learn and hone.
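For example, here is a small Python illustration of how much one character can matter: a.c treats the dot as “any single character,” while a\.c matches only a literal period.

```python
import re

# With "a.c", the dot is a metacharacter meaning "any single character"
print(bool(re.search(r'a.c', 'abc')))    # the "." matches the "b"

# With "a\.c", the backslash turns the dot into a literal period
print(bool(re.search(r'a\.c', 'abc')))
print(bool(re.search(r'a\.c', 'a.c')))
```

One backslash, and the pattern matches a completely different set of strings.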

Many developers say, “If I have a problem that involves regular expressions, I’ll just go to Stack Overflow, where my problem has likely been addressed already.” And in many cases, they’re right.

But by that logic, I shouldn’t learn any French before I go to France, because I can always use a phrasebook.  Sure, I could work that way — but it’s far less efficient, and I’ll miss many opportunities that would come my way if I knew French.

Moreover, relying on Stack Overflow means that you never get a full picture of what you can really do with regular expressions. You get specific answers, but you don’t have a fully formed mental model of what they are and how they work.

But wait, it gets worse: If you’re under the gun, trying to get something done for your manager or a big client, you can’t spend time searching through Stack Overflow. You need to bring your best game to the table, demonstrating fluency in regular expressions.  Without that fluency, you’ll take longer to solve the problem — and possibly, not manage to solve it at all.

Believe me, I understand — my first attempt at learning regular expressions was a complete failure. I read about them in the Emacs manual, and thought to myself, “What could this seemingly random collection of characters really do for me?”  I ignored them for a few more years, until I started to program in Perl — a language that more or less expected you to use regexps.

So I spent some time learning regexp syntax.  The more I used them,  the more opportunities I found to use them.  And the more I found that they made my life easier, better, and more convenient.  I was able to solve problems that others couldn’t — or even if they could, they took much longer than I did.  Suddenly, processing text was a breeze.

I was so excited by what I had learned that when I started to teach advanced programming courses, I added regexps to the syllabus.  I assumed that I could find a way to make regexps understandable in an hour or two.

But boy, was I wrong: If there’s something that’s hard for programmers to learn, it’s regular expressions.  I’ve thus created a two-day course for people who want to learn regular expressions.  I not only introduce the syntax, but I have them practice, practice, and practice some more.  I give them situations and tasks, and their job is to come up with a regexp that will solve the problem I’ve given them.  We discuss different solutions, and the way that different languages might go about solving the problem.

After lots of practice, my students not only know regexp syntax — they know when to use it, and how to use it.  They’re more efficient and valuable employees. They become the person to whom people can turn with tricky text-processing problems.  And when the boss is pressuring them for a solution, they can provide one quickly and confidently.

And so, I’m delighted to announce the launch of my second ebook, “Practice Makes Regexp.”  This book contains 50 tasks for you to accomplish using regular expressions.  Once you have solved the problem, I present the solution, walking you through the general approach that we would use in regexps, and then with greater depth (and code) to solve the problem in Python, Ruby, JavaScript, and PostgreSQL.  My assumption in the book is that you have already learned regexps elsewhere, but that you’re not quite sure when to use them, how to apply them, and when each metacharacter is most appropriate.

After you go through all 50 exercises, I’m sure that you’ll be a master of regular expressions.  It’ll be tough going, but the point is to sweat a bit working on the exercises, so that you can worry a lot less when you’re at work. I call this “controlled frustration” — better to get frustrated working on exercises, than when the boss is demanding that you get something done right away.

Right now, the book is more than 150 pages long, with four complete chapters (including 17 exercises).  Within two weeks, the remaining 33 exercises will be done.  And then I’ll start work on 50 screencasts, one for each of the exercises, in which I walk you through solutions in each of Python, Ruby, JavaScript, and PostgreSQL.  If my previous ebook is any guide, there will be about 5 hours (!) of screencasts when I’m all done.

If you have always shied away from learning regular expressions, or want to harness their power, Practice Makes Regexp is what you have been looking for.  It’s not a tutorial, but it will help you to understand and internalize regexps, helping you to master a technology that frustrates many people.

To celebrate this launch, I’m offering a discount of 10%.  Just use the “regexplaunch” offer code, and take 10% off of any of the packages — the book, the developer package (which includes the solutions in separate program files, as well as the 300+ slides from the two-day regexp course I give at Fortune 100 companies), or the consultant package (which includes the screencasts, as well as what’s in the developer package).

I’m very excited by this book.  I think that it’ll really help a lot of people to understand and use regular expressions.  And I hope that you’ll find it makes you a more valuable programmer, with an especially useful tool in your toolbox.


Using regexps in PostgreSQL

After months of writing, editing, and procrastinating, my new ebook, “Practice Makes Regexp” is almost ready.  The book (similar to my earlier ebook, “Practice Makes Python”) contains 50 exercises to improve your fluency with regular expressions (“regexps”), with solutions in Python, Ruby, JavaScript, and PostgreSQL.

When I tell people this, they often say, “PostgreSQL?  Really?!?”  Many are surprised to hear that PostgreSQL supports regexps at all.  Others, once they take a look, are surprised by how powerful the engine is.  And even more are surprised by the variety of ways in which they can use regexps from within PostgreSQL.

I’m thus presenting an excerpt from the book, providing an overview of  PostgreSQL’s regexp operators and functions. I’ve used these many times over the years, and it’s quite possible that you’ll also find them to be of assistance when writing queries.


PostgreSQL isn’t a language per se, but rather a relational database system. That said, PostgreSQL includes a powerful regexp engine.  It can be used to test which rows match certain criteria, but it can also be used to retrieve selected text from columns inside of a table.  Regexps in PostgreSQL are a hidden gem, one which many people don’t even know exists, but which can be extremely useful.

Defining regexps

Regexps in PostgreSQL are defined using strings.  Thus, you will create a string (using single quotes; in PostgreSQL, double quotes are reserved for identifiers, such as table and column names), and then match it against another string. If there is a match, PostgreSQL returns “true.”

PostgreSQL’s regexp syntax is similar to that of Python and Ruby, in that you use backslashes to neutralize metacharacters. Thus, + is a metacharacter in PostgreSQL, whereas \+ is a plain “plus” character. However, there are some differences between the syntaxes. For example, PostgreSQL’s word-boundary metacharacter is \y, whereas in Python and Ruby, it is \b.  (This was likely done to avoid conflicts with \b as the ASCII backspace character.)
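For comparison, here is Python’s spelling of the same word-boundary anchor:

```python
import re

# Python writes the word-boundary anchor as \b; PostgreSQL writes it as \y
print(bool(re.search(r'\bcat\b', 'the cat sat')))   # "cat" stands alone here
print(bool(re.search(r'\bcat\b', 'concatenate')))   # "cat" is inside a larger word
```

The PostgreSQL equivalents would be 'the cat sat' ~ '\ycat\y' and 'concatenate' ~ '\ycat\y'.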

Where things are truly different in PostgreSQL’s implementation is the set of operators and functions used to work with regexps. PostgreSQL’s operators are generally aimed at finding whether a particular regexp matches text, in order to include or exclude result rows from an SQL query.  By contrast, the regexp functions are meant to retrieve some or all of a string from a column’s text value.

True/false operators

PostgreSQL comes with four regexp operators. In each case, the text string to be matched should be on the left, and the regexp should be on the right.  All of these operators return true or false:

  • ~  case-sensitive match
  • ~*  case-insensitive match
  • !~  case-sensitive non-match
  • !~* case-insensitive non-match

Thus, you can say:

select 'abc' ~ 'a.c';   -- returns "true"
select 'abc' ~ 'A.C';   -- returns "false"
select 'abc' ~* 'A.C';  -- returns "true"

In addition to the standard character classes, we can also use POSIX-style character classes:

select 'abc' ~* '^[[:xdigit:]]$';    -- returns "false"
select 'abc' ~* '^[[:xdigit:]]+$';   -- returns "true"
select 'abcq' ~* '^[[:xdigit:]]+$';  -- returns "false"
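Python’s re module, for comparison, doesn’t support POSIX class names such as [[:xdigit:]]; the explicit class [0-9a-fA-F] is equivalent:

```python
import re

# [0-9a-fA-F] is the explicit equivalent of the POSIX class [[:xdigit:]]
hex_digits = re.compile(r'[0-9a-fA-F]+')

print(bool(hex_digits.fullmatch('abc')))    # every character is a hex digit
print(bool(hex_digits.fullmatch('abcq')))   # "q" is not a hex digit
```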

These operators, as mentioned above, are often used to include or exclude rows in a query’s WHERE clause:

CREATE TABLE Stuff (id SERIAL PRIMARY KEY, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT id, thing FROM Stuff WHERE thing ~* '^[abc]{3}$';

This final query should return three rows, those in which thing is equal to abc, AbC, and ABC.
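For comparison, the same case-insensitive filter can be sketched in Python, where re.fullmatch plays the role of the anchored ^…$ pattern:

```python
import re

things = ['ABC', 'abc', 'AbC', 'Abq', 'ABCq']

# Keep only the values that '[abc]{3}' matches in full, ignoring case;
# these are the same three rows that the SQL query above returns
matches = [t for t in things if re.fullmatch('[abc]{3}', t, re.IGNORECASE)]
print(matches)   # ['ABC', 'abc', 'AbC']
```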

Extracting text

If you’re interested in the text that was actually matched, then you’ll need to use one of the built-in regexp functions that PostgreSQL provides. For example, the regexp_matches function allows us not only to determine whether a regexp matches some text, but also to get the text that was matched.  For each match, regexp_matches returns an array of text (even if that array contains a single element).  For example:

INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$') FROM Stuff;

The above will return a single row, for the one value that matches the case-sensitive regexp:

| regexp_matches |
| {abc}          |

As you can see, the above returned only a single column (from the function) and a single row (i.e., the one matching value).  However, when you invoke regexp_matches, you can provide additional flags that modify the way in which it operates. These flags are similar to those used in Python, Ruby, and JavaScript.

For example, we can use the i flag to make regexp_matches case-insensitive:

INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$', 'i') FROM Stuff;

Now we’ll get three rows back, since we have made the match case-insensitive.  regexp_matches can take several other flags as well, including g (for a global search). For example:

INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT regexp_matches(thing, '.', 'g') FROM Stuff;

Here is the output from regexp_matches:

| regexp_matches |
| {A}            |
| {B}            |
| {C}            |

Notice how regexp_matches, because of the g option, returned three rows, with each row containing a one-element array. This indicates that there were three matches.
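In Python terms, the g flag corresponds roughly to calling re.findall instead of re.search:

```python
import re

# re.search returns only the first match; re.findall returns all of them,
# much as the 'g' flag does for regexp_matches
print(re.search('.', 'ABC').group())   # 'A'
print(re.findall('.', 'ABC'))          # ['A', 'B', 'C']
```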

Why is each returned row an array, rather than a string? Because if we use groups to capture parts of the text, the array will contain the groups:

INSERT INTO Stuff (thing) VALUES ('ABC'), ('AqC');
SELECT regexp_matches(thing, '^(A)(..)$', 'ig') FROM Stuff;

Notice that in the above example, I combined the i and g flags, passing them in a single string.  The result is a set of arrays:

| regexp_matches |
| {A,BC}         |
| {A,qC}         |
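
PostgreSQL’s array of captured groups corresponds to the tuple that a Python match object returns from its groups method:

```python
import re

# Each captured group becomes one element of the result, just as each
# group becomes one element of the array that regexp_matches returns
results = [re.match('^(A)(..)$', thing, re.IGNORECASE).groups()
           for thing in ['ABC', 'AqC']]
print(results)   # [('A', 'BC'), ('A', 'qC')]
```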


A common function in many high-level languages is split, which takes a string and returns an array of items. PostgreSQL offers this with its split_part function, but that splits only on a fixed delimiter string.

However, PostgreSQL also offers two other functions: regexp_split_to_array and regexp_split_to_table. These allow us to split a text string using a regexp, rather than a fixed string.  For example, if we say:

select regexp_split_to_array('abc def   ghi   jkl', '\s+');

The above will treat any run of whitespace, whatever its length, as a delimiter, splitting the string accordingly.  But you can use any regexp you want to split things, getting an array back.

A similar function is regexp_split_to_table, which returns not a single row containing an array, but rather one row for each element. Repeating the above example:

select regexp_split_to_table('abc def   ghi   jkl', '\s+');

The above would return a table of four rows, with each split text string in its own row.
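Python’s re.split behaves like regexp_split_to_array, returning a list of the split values:

```python
import re

# Split on any run of whitespace, regardless of its length
parts = re.split(r'\s+', 'abc def   ghi   jkl')
print(parts)   # ['abc', 'def', 'ghi', 'jkl']
```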

Substituting text

The regexp_replace function allows us to create a new text string based on an old one.  For example:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_');

The above returns:

Th_ quick brown fox jumped over the lazy dog

Why was only the first vowel replaced? Because we didn’t invoke regexp_replace with the g option, making it global:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_', 'g');

Now all occurrences are replaced:

Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g
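Python’s re.sub, for comparison, works the other way around: it replaces all occurrences by default, and takes a count parameter to limit the number of replacements:

```python
import re

text = 'The quick brown fox jumped over the lazy dog'

# Replace only the first vowel, like regexp_replace without the 'g' flag
print(re.sub('[aeiou]', '_', text, count=1))

# Replace every vowel, like regexp_replace with the 'g' flag
print(re.sub('[aeiou]', '_', text))
```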

In PostgreSQL, as in life, don’t wait too long to commit

I recently helped a client with serious PostgreSQL problems. Their database is part of an application that configures and monitors large networks; each individual network node reports its status to the database every five minutes. The client was having problems with one table in particular, containing about 25,000 rows — quite small, by modern database standards, and thus unlikely to cause problems.

However, things weren’t so simple: This table was growing to more than 20 GB every week, even though autovacuum was running. In response, the company established a weekly ritual: Shut down the application, run a VACUUM FULL, and then restart things. This solved the problem in the short term — but by the end of the week, the database had returned to about 20 GB in size.  Clearly, something needed to be done about it.

I’m happy to say that I was able to fix this problem, and that the fix wasn’t so difficult. Indeed, the source of the problem might well be obvious to someone with a great deal of PostgreSQL experience. That’s because the problem mostly stemmed from a failure to understand how PostgreSQL’s transactions work, and how they can influence the functioning of VACUUM, the size of your database, and the reliability of your system. I’ve found that this topic is confusing to many people who are new to PostgreSQL, and I thus thought that it would be useful to walk others through the problem, and then describe how we solved it.

PostgreSQL, like many other modern relational databases, uses a technology known as multiversion concurrency control (“MVCC”).  MVCC reduces the number of exclusive locks placed on a database’s rows, by keeping around multiple versions of each row. If that sounds weird, then I’ll try to explain. Let’s start with a very simple table:

CREATE TABLE Foo (id SERIAL PRIMARY KEY, x INTEGER);


Now let’s add 1 million rows to that table:

INSERT INTO Foo (x) VALUES (generate_series(1,1000000));

Now, how much space does this table take up on disk? We can use the built-in pg_relation_size function, but that’ll return a large integer representing a number of bytes. Fortunately, PostgreSQL also includes pg_size_pretty, which turns a number of bytes into something understandable by computer-literate humans:

SELECT pg_size_pretty(pg_relation_size('foo'));

35 MB

What happens to the size of our table if we UPDATE each row, incrementing x by 1?

UPDATE Foo SET x = x + 1;
SELECT pg_size_pretty(pg_relation_size('foo'));

69 MB

If you’re new to MVCC, then the above is probably quite surprising. After all, you might expect an UPDATE query to change each of the existing 1 million rows, keeping the database size roughly identical.

However, this is not at all the case: Rather than replace or update existing rows, UPDATE in an MVCC system adds new rows. Indeed, in an MVCC-based system, our queries never directly change or remove any rows. INSERT adds new rows, DELETE marks rows as no longer needed, and UPDATE performs both an INSERT and a DELETE.

That’s right: In an MVCC system, UPDATE doesn’t change existing rows, although we might have the illusion of this occurring.  Every time you UPDATE every row in a table, you’re roughly doubling its size.

At this point, many newcomers to PostgreSQL are even more confused: How can it be that deleted rows don’t go away? And why would UPDATE work in such a crazy way? And how can all of this possibly help me?

The key to understanding what’s happening is to realize that once a row has been added to the system, it cannot be changed. However, its space can be reused by new rows, if the old row is no longer accessible by any transaction.

This starts to make a bit more sense if you realize that every transaction in PostgreSQL has a unique ID number, known as the “transaction ID.” This number constantly increases, providing us with a running count of transactions in the system. Every time we INSERT a row, PostgreSQL records the first transaction ID from which the row should now be visible, in a column known as “xmin.” You can think of xmin as the row’s birthday. You cannot normally see xmin, but PostgreSQL will show it to you, if you ask for it explicitly:

SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187665 │  1 │ 2 │
│ 187665 │  2 │ 3 │
│ 187665 │  3 │ 4 │
│ 187665 │  4 │ 5 │

In an MVCC-based system, DELETE doesn’t really delete a row.  Rather, it indicates that a row is no longer available to future transactions. We can see an initial hint of this if, when we SELECT the contents of a table, we ask for not only xmin, but also xmax — the final transaction (or “expiration date,” if you like) for which a row should be visible:

SELECT  xmin, xmax, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ xmax │ id │ x │

│ 187665 │    0 │  1 │ 2 │
│ 187665 │    0 │  2 │ 3 │
│ 187665 │    0 │  3 │ 4 │
│ 187665 │    0 │  4 │ 5 │

We can see that each of these rows was added in the same transaction ID (187665). Moreover, all of these rows are visible, since their xmax value is 0. (Don’t worry; we’ll see other xmax values in a little bit.)  If I update some of these rows, we’ll see that their xmin value will be different:

UPDATE Foo SET x = x + 1 WHERE id IN (1,3);
SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187668 │  1 │ 3 │
│ 187665 │  2 │ 3 │
│ 187668 │  3 │ 5 │
│ 187665 │  4 │ 5 │

The rows with ID 1 and 3 were added in transaction 187668, whereas the rows with ID 2 and 4 were added in an earlier transaction, number 187665.

Since I’ve already told you that rows cannot be changed by a query, this raises an interesting question: What happened to the rows with IDs 1 and 3 that were added in the first transaction, number 187665?

The answer is: Those rows still exist in the system! They have not been modified or removed. But the system did mark them as having an xmax of 187668, the transaction ID in which the new, replacement rows are now available.

Transaction IDs only go up. But you can imagine a hypothetical time-traveler’s database, in which you could move the transaction ID down, and see rows from an earlier era. (Somehow, I doubt that this would be a very popular television series.)

The thing is, we don’t need a hypothetical, time-traveler’s database in order to see such behavior. All we need is to open two separate sessions to our PostgreSQL database at the same time, and have each session open a transaction.

You see, when you BEGIN a new transaction, your actions are hidden from the rest of the world — and similarly, modifications made elsewhere in the database are hidden from you.  You can thus have a situation in which two different sessions see completely different versions of the same row, thanks to their different transaction IDs.  This might sound crazy, but it actually helps the database to run smoothly, with a minimum of conflicts and locks.

So, let’s open two different PostgreSQL sessions, and take a look at what happens.  In order to distinguish between them, I used the \set command in psql to make PROMPT1 either “Session A>” or “Session B>”. Here’s session A:

Session A> BEGIN;
Session A> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

And here is session B:

Session B> BEGIN;
Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

At this point, they are identical; both sessions see precisely the same rows, with the same xmin and xmax. Now let’s modify the values of two rows in session A:

Session A> update foo set x = x + 1 where id in (1,3);

What do we see now in session A?

Session A> select xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187670 │    0 │  1 │ 4 │
│ 187665 │    0 │  2 │ 3 │
│ 187670 │    0 │  3 │ 6 │
│ 187665 │    0 │  4 │ 5 │

Just as we would expect, we have two new rows (IDs 1 and 3), and two old rows (IDs 2 and 4), which we can distinguish via the xmin value.  In all cases, the rows have an xmax value of 0.

Let’s return to session B, and see what things look like there.  (Note that I haven’t committed any transactions here!)

Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │  xmax  │ id │ x │

│ 187668 │ 187670 │  1 │ 3 │
│ 187665 │      0 │  2 │ 3 │
│ 187668 │ 187670 │  3 │ 5 │
│ 187665 │      0 │  4 │ 5 │

That’s right — the rows that session B sees are no longer the rows that session A sees.  Session B continues to see the same rows as were visible when it started the transaction. We can see that the xmax of session B’s old rows is the same as the xmin of session A’s new rows. That’s because any transaction after 187670 will see the new row (i.e., with an xmin of 187670), but any transaction started before 187670 will see the old row (i.e., with an xmax of 187670).

I hope that the picture is now coming together: When you INSERT a new row, its xmin is the current transaction ID, and its xmax is 0.  When you DELETE a row, its xmax is changed to be the current transaction ID.  And when you UPDATE a row, PostgreSQL does both of these actions.  That’s why the UPDATE I did earlier, on all of the rows in the table, doubled its size.  (More on that in a moment.)
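We can sketch this visibility rule as a toy Python model. (This is a drastic simplification of what PostgreSQL actually does, ignoring rollbacks, snapshots, and in-progress transactions, but it captures the basic xmin/xmax logic.)

```python
# A row version is visible to a transaction if it was created at or before
# that transaction (xmin <= txid) and has not yet expired (xmax == 0 means
# "still live"; otherwise it must have been expired by a later transaction).
def visible(xmin, xmax, txid):
    return xmin <= txid and (xmax == 0 or xmax > txid)

# The row with id 1, before and after session A's UPDATE in the example above
old_version = dict(xmin=187668, xmax=187670)   # expired by transaction 187670
new_version = dict(xmin=187670, xmax=0)        # created by transaction 187670

# Session B, which began before the UPDATE, still sees the old version
print(visible(**old_version, txid=187669))   # True
print(visible(**new_version, txid=187669))   # False

# A transaction that starts after the UPDATE commits sees the new version
print(visible(**old_version, txid=187671))   # False
print(visible(**new_version, txid=187671))   # True
```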

This allows different transactions to work independently, without having to wait for locks. Imagine if session B wants to look at the row with ID 3; because of MVCC, it doesn’t need to wait for session A in order to read it.  Heck, session B can even try to change the row with ID 3, although its UPDATE will hang until session A commits or rolls back.

And indeed, another benefit of MVCC is that rollbacks are trivially easy to implement: If we COMMIT session A and roll back session B, then session A’s view of the world will be the current one, for all following transactions.  But if we ROLLBACK session A, then everything is fine; PostgreSQL keeps track of the fact that session A was rolled back, and that we should actually be using (until further notice) the rows with an xmax of 187670.

As you can imagine, this can lead to some trouble. If every UPDATE increases the number of rows, and every DELETE fails to free up any space, every PostgreSQL database will eventually run out of room, no? Well, yes — except for VACUUM, which is the garbage-collecting mechanism for PostgreSQL’s rows. VACUUM can be run manually, but it’s normally run via “autovacuum,” a program that runs in the background, and invokes VACUUM on a regular basis.

VACUUM doesn’t free space from your PostgreSQL database. Rather, it identifies rows that are no longer visible to any transaction (that is, rows whose xmax precedes the ID of the oldest still-open transaction), and marks their space as available for re-use. PostgreSQL thus keeps track of not only the current transaction ID, but also the ID of the earliest still-open transaction. So long as that transaction is open, it needs to be able to see the rows from that point in time.

Don’t confuse VACUUM with VACUUM FULL; the latter does indeed return space from a PostgreSQL database to the operating system, but it locks the table while doing so. If you have a serious 24/7 application, then VACUUM FULL is a very bad idea. The regular autovacuum daemon, albeit perhaps with a bit of configuration, should take care of identifying which rows can and should be marked for re-use.

In the case of my client’s problem, my initial thought was that autovacuum wasn’t running. However, a quick look at the pg_stat_user_tables view showed that autovacuum was running on a regular basis.
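That view is worth querying whenever you suspect vacuum trouble. Something like the following shows, for each of your tables, when it was last vacuumed and how many dead row versions it is carrying:

SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

A large and growing n_dead_tup, despite a recent last_autovacuum, is a hint that something is preventing those dead rows from being marked as reusable.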

However, VACUUM will only mark a row as eligible for reuse if no other transaction can see it. (After all, if an older session remains open, then it still needs to see those old row versions.)  This means that if a transaction remains open from the time the system starts up, VACUUM will never identify any rows as eligible for reuse, because there is always an open transaction whose ID is lower than the xmax of the deleted rows.

And this is precisely what happened to my client: It turns out that their application, at startup, was opening several transactions with BEGIN statements… and then never committing those transactions or rolling them back.  None of those transactions led to a lock in PostgreSQL, because they weren’t executing any query that would lead to a conflict. And autovacuum was both running and reporting that it ran on each table — because it had indeed done so.

But when it came time to mark the old rows as eligible for reuse, VACUUM ignored all of the rows in the entire database, including our 23,000-row table — because of the transactions that the application had opened at startup. It didn’t matter how many times we ran VACUUM, whether it was done manually or automatically, or what else was running at the time: all of the old versions of every row in our table were kept around by PostgreSQL, leading to massive disk usage and database slowness. Stopping the application had the effect of closing those transactions, and thus allowing VACUUM FULL to do its magic, albeit at the cost of system downtime. And because the application opened the transactions at startup, any such fix was bound to be a temporary one.

My client was a bit surprised to discover that the entire database could encounter such problems from an open transaction that was neither committed nor rolled back. However, I hope that you now see why the space wasn’t reclaimed, and why the database acted as it did.

The solution, of course, was to get rid of those three open, empty transactions at startup; once we did that, everything ran smoothly.
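If you suspect a similar problem on your own system, PostgreSQL’s pg_stat_activity view will show you sessions that are sitting inside an open transaction without doing anything. (The “state” column used here exists in PostgreSQL 9.2 and later.)

SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

Any session that stays in this list for more than a few moments is a candidate for exactly the sort of trouble described above.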

Transactions are a brilliant and useful invention, but in PostgreSQL (and other MVCC systems), you shouldn’t hold transactions open for too long. Doing so can have surprisingly unpleasant consequences. Commit or roll back as soon as you can — or you might find yourself with a database accumulating gigabytes of old, unnecessary rows.

Remember: Don’t mix = and NULL

I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows.  The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.

So, if I want to find all of the nodes in group 2, I can say

SELECT * FROM Nodes WHERE group_id = 2;

My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes.  If I want all of the versions for nodes in group 2, I can say:

SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);

Now, let’s say that I want to get all of the versions that are not from group 2.  How would I do that?  Well, one way would be:

SELECT COUNT(*) FROM Versions WHERE node_id 
   NOT IN (SELECT id FROM Nodes WHERE group_id = 2);

But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:

SELECT COUNT(*) FROM Versions WHERE node_id 
  IN (SELECT id FROM Nodes WHERE group_id <> 2);

That, of course, was a bug.  That’s because SQL uses three-valued (sometimes called “ternary”) logic, not the binary true-false logic to which we’re accustomed.  Comparing anything with NULL yields NULL.  For example, consider the following table:

select * from truth_test ;
| val    |
| t      |
| f      |
| [null] |

I can create a truth table pretty easily:

SELECT val, val = true as is_true, val=false as is_false, 
       val=null as equal_null, val is null as is_null 
FROM truth_test;
| val    | is_true | is_false | equal_null | is_null |
| t      | t       | f        | [null]     | f       |
| f      | f       | t        | [null]     | f       |
| [null] | [null]  | [null]   | [null]     | t       |

You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect.  But using the = operator with NULL is always going to result in NULL — even with itself!

If you want to compare with NULLs, then you need to use IS NULL.  (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.)  And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2.  But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group.  Which, it turns out, was the majority of them.  Whoops.
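Another NULL-safe option, which fits this exact situation, is PostgreSQL’s IS DISTINCT FROM operator, which treats NULL as an ordinary, comparable value:

SELECT COUNT(*)
FROM Versions
WHERE node_id IN (SELECT id FROM Nodes WHERE group_id IS DISTINCT FROM 2);

Unlike <>, IS DISTINCT FROM returns true (rather than NULL) when group_id is NULL, so the ungrouped nodes are included in the results.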

My solution was to create a view containing all of the IDs that I want to exclude:

CREATE OR REPLACE VIEW group2_model_ids AS
SELECT id
FROM nodes
WHERE nodes.group_id = 2;

Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:

SELECT COUNT(*)
FROM Versions 
WHERE node_id NOT IN (SELECT id FROM group2_model_ids);

Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and not-NULL values.  But I prefer this way, in part because it gives me a name and an abstraction that I can use in future queries; and since I’m only dealing with a few tens of thousands of rows in this table, the extra layer costs essentially nothing.
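For the record, here is one such rewrite, using an anti-join rather than a subselect; it sidesteps the NULL issue entirely, because the test is on the join condition rather than on group_id directly. I didn’t end up using it, but it’s a reasonable sketch of the alternative:

SELECT COUNT(*)
FROM Versions v
LEFT JOIN Nodes n ON n.id = v.node_id AND n.group_id = 2
WHERE n.id IS NULL;

The LEFT JOIN matches each version against group-2 nodes only; versions whose node is in any other group (or in no group at all) get no match, and are thus the rows that survive the IS NULL test.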


Turning a PostgreSQL array to rows

I’ve spent the last few blog posts in this series talking about PostgreSQL’s arrays — how to create them, query them, and even create them with the ARRAY function.  But there are cases in which I would like to do the opposite — turn a PostgreSQL array to rows.

There are a number of reasons why I might want to do this. For example, perhaps I want to find the intersection between two arrays.  The INTERSECT operator in PostgreSQL can easily do that with two sets of rows — but there isn’t any equivalent for arrays.  Similarly, the UNION operator lets me join two sets of rows, but nothing like that exists for arrays.  And of course, there are times when I might want to take the elements of an array and run a JOIN between them and the rows of an existing table.

The key to all of these is the UNNEST function.  You need to be a bit careful when you’re using UNNEST, because (as always with computers) PostgreSQL will do what you tell it to do, not necessarily what you want it to do.  For example, if I have the following table:

select * from foo;
| id | stuff         |
| 1  | {abc,def,abc} |
| 2  | {abc,xyz}     |

Let’s say that I use UNNEST on our first row:

[local]/reuven=# SELECT UNNEST(stuff) FROM Foo where id = 1;
| unnest |
| abc    |
| def    |
| abc    |

Sure enough, the three elements of “stuff” where id is 1 have now become three rows.  We get the results as a set of rows, which we can then capture in a table, query in a subquery, or turn back into an array, if we want:

SELECT ARRAY(SELECT UNNEST(stuff) FROM Foo where id = 1);
| array         |
| {abc,def,abc} |

This example might seem pointless, but it raises some additional possibilities.  For example, let’s say that I want to ensure that the elements of my array are unique.  I can do that by applying the DISTINCT operation on the result of UNNEST, and then packing those distinct rows back into an array:

SELECT ARRAY(SELECT DISTINCT UNNEST(stuff) FROM Foo where id = 1);
| array     |
| {abc,def} |
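UNNEST also answers the intersection question that I raised at the top of this post: turn both arrays into rows, and let INTERSECT do the rest.  Here it is with two literal arrays:

SELECT UNNEST(ARRAY['abc','def'])
INTERSECT
SELECT UNNEST(ARRAY['def','xyz']);
| unnest |
| def    |

The same pattern works with UNION, or with a JOIN against the rows of an existing table.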

It’s this sort of operation that has led me to really love PostgreSQL arrays.  If I use them as actual arrays, then I feel bad, as though I’ve betrayed the entire world of normalized, relational databases. But if I think of arrays as frozen sets of rows, which I can then unfreeze (with UNNEST), manipulate, join, or otherwise modify, and then re-freeze (with ARRAY) as necessary, then things suddenly get much clearer and easier.

I’m in the last weeks of my PhD research and writing, and I’ve been using no small number of arrays in my research and analysis.  I wouldn’t dream of using arrays to store actual combinations of data that could otherwise be normalized and stored in a reasonable way.  But I am constantly creating views that turn complex queries into arrays, which I can then further query, nest, and unnest as needed.

I’m just about done with this series; next time, I’ll show how I can combine arrays and CTEs to perform the sorts of operations that I described above, without having to pollute my database namespace with lots of unnecessary views.


Turning PostgreSQL rows into arrays

So far in this series, I have described how you can create arrays and retrieve information from them — both the actual data stored in the array, and information about the array, such as its length. But the coolest trick, or set of tricks, that I use in PostgreSQL is the ability to turn arrays into rows, and vice versa.  This is where, in my mind, arrays start to get truly powerful, and allow us to do all sorts of things that would otherwise be quite difficult, or require another language.

For example, let’s say that I want to keep track of my appointments in a database.  I create an Appointments table:

CREATE TABLE Appointments (id SERIAL,
 meeting_at TIMESTAMP,
 description TEXT
);

And then I enter information into this table, such that I have the following appointments:

[local]/reuven=# select * from appointments order by meeting_at;
 | id | meeting_at          | description   |
 | 5 | 2014-05-22 12:00:00  | Dissertation  |
 | 4 | 2014-05-22 14:00:00  | Devel meeting |
 | 1 | 2014-05-23 15:00:00  | Dissertation  |
 | 2 | 2014-05-23 17:00:00  | Dinner        |
 | 3 | 2014-05-23 23:00:00  | Sleep         |
 (5 rows)

If I want to find out what I’m doing on May 23rd, I can simply say:

[local]/reuven=# select * from appointments where meeting_at::date = '2014-may-23';
 | id | meeting_at          | description  |
 | 1 | 2014-05-23 15:00:00  | Dissertation |
 | 2 | 2014-05-23 17:00:00  | Dinner       |
 | 3 | 2014-05-23 23:00:00  | Sleep        |

Notice that in the above example, I’m using PostgreSQL’s casting operator (::) to turn the meeting_at column, which contains a TIMESTAMP value, into a DATE value.  Without that, each TIMESTAMP would have been compared against midnight on May 23rd, and nothing would have matched. So far, this isn’t very exciting or surprising.  But let’s say that I want to get all of the descriptions of today’s appointments in a comma-separated list.

Normally, I would need to retrieve all of the rows, and then use another language to piece them together.  With PostgreSQL’s arrays, though, we can do this all in the database — which executes much more quickly, among other things. The way to accomplish this is with the ARRAY function.  Now, don’t confuse the ARRAY[] constructor with the ARRAY() function, although it’s easy to do so.  PostgreSQL tells them apart by the delimiters: ARRAY[] takes a literal list of values in square brackets, whereas ARRAY() takes a subquery in parentheses.

The ARRAY function is built into PostgreSQL, and turns a set of rows into an array.  That is, ARRAY expects to get rows as its input, and returns a single row, containing a single array, as its output. Let’s start with a set of rows, containing the descriptions of today’s appointments.  I’m going to use PostgreSQL’s concatenation operator (||) to combine the time (using the ::time cast) of the appointment with the appointment itself.  My query looks like:

SELECT meeting_at::time || ' ' || description
FROM Appointments
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at;

The result I get is:

 | ?column?              |
 | 15:00:00 Dissertation |
 | 17:00:00 Dinner       |
 | 23:00:00 Sleep        |

Now that I have my rows, I can turn them into an array.  Now, the ARRAY function needs to be invoked via a SELECT.  Thus, using ARRAY means that we’re using a subselect.  The inner SELECT is what we did above.  The outer one is just our call to ARRAY:

SELECT ARRAY(SELECT meeting_at::time || ' ' || description 
FROM Appointments 
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at);

And sure enough, we get a one-row, one-column result:

 | array                                                        |
 | {"15:00:00 Dissertation","17:00:00 Dinner","23:00:00 Sleep"} |

You can see that our three rows have been turned into three elements of the array.  I do this all of the time when creating reports or summaries of other data: I write a view or CTE that grabs the data from the rows that are of interest to me, and stuffs that data into an array.  I can then check the array’s contents, length, or anything else that might be useful in my analysis.

But earlier, I said that I want to have a comma-separated list of the appointments.  I currently have them in an array.  Fortunately, PostgreSQL’s array_to_string function can save the day, acting similarly to the “join” method that I know and love from such languages as Ruby and Python.  This function, wrapped around a call to ARRAY, wrapped around our subselect, looks like this:

SELECT array_to_string( 
  ARRAY(SELECT meeting_at::time || ' ' || description
        FROM Appointments
        WHERE meeting_at::date = '2014-may-23'
        ORDER BY meeting_at), ', ');

And sure enough, the result is as follows:

 | array_to_string                                        |
 | 15:00:00 Dissertation, 17:00:00 Dinner, 23:00:00 Sleep |

Not bad, for one query; we’ve taken our three rows, and turned them into a string, without having to use anything other than SQL!  There’s just one final problem here that I’d like to fix, namely that when you invoke a function as we did here, the function’s name becomes the name of the column.  I’d like to change that, so the column has a more reasonable name.  This is particularly important if you’re planning to read the column using a high-level language; your code will look awfully funny if it includes statements like:

puts row['array_to_string']

when it could instead read

puts row['appointments_for_today']

Here’s how I can use “AS” to change the name of the column, making it easier to read and work with:

SELECT array_to_string( 
  ARRAY(SELECT meeting_at::time || ' ' || description
        FROM Appointments
        WHERE meeting_at::date = '2014-may-23'
        ORDER BY meeting_at), ', ') 
AS appointments_for_today;
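As an aside: if you’re running PostgreSQL 9.0 or later, the string_agg aggregate function gets you the same comma-separated string in one step, without the intermediate array.  I’m including it here as an alternative sketch, not as part of the array technique we’ve been building up:

SELECT string_agg(meeting_at::time || ' ' || description, ', '
                  ORDER BY meeting_at) AS appointments_for_today
FROM Appointments
WHERE meeting_at::date = '2014-may-23';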

It is in these sorts of uses, to aggregate data and create reports, that I’ve found PostgreSQL’s arrays to be particularly useful and powerful. Next time, we’ll see how we can do the opposite, turning an array back into rows — and then, we’ll see how the combination of these conversions lets us perform all sorts of tricks.


Looking in PostgreSQL arrays with ANY

So far, this series has looked at how to create PostgreSQL arrays, how to retrieve data from them, and how to get the length of an array’s outer and inner dimensions. But one of the most common actions that we’ll want to do with an array is look inside to see if a particular value is in there.  Today, I’ll show you how to do that with the PostgreSQL ANY operator and its cousins. For example, given my table “foo”:

[local]/reuven=# select * from foo;
| id | stuff               |
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
(3 rows)

We’ve already seen (in my last post on this topic) how we can find all of the rows in which the “stuff” array is of a certain length.  But sometimes — more often, perhaps — I want all of the rows in which a particular value is somewhere in the array.  (You can think of this as the SQL equivalent of Ruby’s “select” on an enumerable, or Python’s “filter” on a sequence.)

For this, we need to use PostgreSQL’s ANY operator.  I have always found the syntax of ANY (and its synonym, SOME) to be a bit hard to understand.  The IN operator works by itself, but ANY must follow a comparison operator, such as =.  For example, if I want all of the rows from the above “foo” table in which the “stuff” array contains an element ‘abc’, I can say:

[local]/reuven=# select * from foo where 'abc' = ANY(stuff);
| id | stuff               |
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
(3 rows)

Sure enough, because each of these three arrays contains an element ‘abc’, we get all three rows back.  What if I ask for ‘def’?

[local]/reuven=# select * from foo where 'def' = ANY(stuff);
| id | stuff               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
(2 rows)

Again, not a surprise.  What if I search for ‘ghi’?

[local]/reuven=# select * from foo where 'ghi' = ANY(stuff);
| id | stuff |
(0 rows)

This is important to remember, actually: ANY looks at the elements of the array, and no further.  So the string ‘ghi’ was compared with the elements of “stuff”, which in the case of row ID 10, were the three strings ‘abc’, ‘def’, and ‘ghi jkl’.  PostgreSQL compared ‘ghi’ with each of these, found them to be unequal, and thus returned 0 rows.  Remember this when you’re trying to search through your text[] array column; you can look for specific elements of the array, but no further.

What I personally find confusing, and easy to forget, is the fact that using ANY requires that I put a comparison operator (such as =) before the call to ANY.  Moreover, it’s important to remember that ANY and the array go on the right-hand side of the operator; that is simply how the SQL grammar defines it.  Thus, the following will not work:

[local]/reuven=# select * from foo where ANY(stuff) = 'abc';
ERROR: 42601: syntax error at or near "ANY"
LINE 1: select * from foo where ANY(stuff) = 'abc';
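If what you really want is a partial match against the elements, such as every row whose array contains an element starting with ‘ghi’, one approach is to unnest the array into rows and filter those.  This is a sketch; it relies on the fact that a set-returning function in the FROM clause can refer to earlier FROM items (implicitly LATERAL, in PostgreSQL 9.3 and later):

SELECT DISTINCT f.id, f.stuff
FROM foo AS f, UNNEST(f.stuff) AS elem
WHERE elem LIKE 'ghi%';

On our sample data, this returns just the row with ID 10, whose array contains ‘ghi jkl’.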

I’ve never had any occasion to use it, but you should know that PostgreSQL also offers an ALL operator.  It works the same way as ANY, except that it only returns a TRUE value if all of the elements in the array are equal to the value on the left-hand side.  For example:

[local]/reuven=# select * from foo where 'abc' = ALL(stuff);
| id | stuff |
| 8  | {abc} |
(1 row)

In this example, we only get one row back, because the stipulation is that all of the array elements need to be equal to ‘abc’. I’ve got a few more posts planned in this series about PostgreSQL arrays.  If there is something about this subject that you have always wanted to know, please contact me, and I’ll do what I can to answer.


PostgreSQL array indexes and length

In my last blog post, I introduced the idea of a PostgreSQL array, and showed how we can insert data into a table using either the curly-brace {} syntax, or the ARRAY construction syntax.  In this post, I want to talk about array indexes and lengths — what happens when we retrieve elements at indexes that exist (and don’t), how we can construct multidimensional arrays, and how we can ask PostgreSQL for the length of any dimension of a multidimensional array.

First of all, let’s talk about how you can retrieve data from arrays.  If I still have my three-row table from last time:

[local]/reuven=# select id, stuff from foo;
 │ id │ stuff               │
 │ 8  │ {abc}               │
 │ 9  │ {abc,def}           │
 │ 10 │ {abc,def,"ghi jkl"} │

What if I want to get just the first value back from the “stuff” column?  Well, then I have to take the first element of an array.  Most modern languages start to number arrays with index 0; PostgreSQL, by contrast, starts to count them with 1.  I can thus ask for just the first element of the “stuff” array from each row with:

[local]/reuven=# select id, stuff[1] from foo;
 │ id │ stuff │
 │ 8  │ abc   │
 │ 9  │ abc   │
 │ 10 │ abc   │

If we ask for an index that doesn’t exist, we get a NULL value back:

[local]/reuven=# select id, stuff[3] from foo;
 │ id │ stuff   │
 │ 8  │ [null]  │
 │ 9  │ [null]  │
 │ 10 │ ghi jkl │

Note that I have configured my PostgreSQL client to show NULL values as “[null]”, by putting the following line in my ~/.psqlrc file:

\pset null [null]

Without the above line in your .psqlrc (or running that command in psql yourself manually), you might see blank space for row IDs 8 and 9.
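While we’re looking at indexes, it’s worth knowing that PostgreSQL also supports array slices, written with a lower:upper range.  Unlike a single out-of-range index, which returns NULL, a slice simply truncates at the array’s actual bounds:

select id, stuff[1:2] from foo;
 │ id │ stuff     │
 │ 8  │ {abc}     │
 │ 9  │ {abc,def} │
 │ 10 │ {abc,def} │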

Now, it’s pretty rare for me to pull out a particular value from a PostgreSQL array.  Instead, I’m often finding out the lengths of the arrays on which I’m working.  I can do this with the array_length function:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
 │ id │ stuff               │ array_length │
 │ 8  │ {abc}               │ 1            │
 │ 9  │ {abc,def}           │ 2            │
 │ 10 │ {abc,def,"ghi jkl"} │ 3            │

Notice that array_length is a function that takes two parameters: an array and an integer.  The array is what we want to measure.  The integer describes which of the array’s dimensions should be measured.

If you’re like me, and come from dynamic languages like Ruby and Python, in which arrays (or lists) can be of any length, then you should realize that PostgreSQL arrays can be multidimensional, but each inner array must be of the same length. So, for example, I can create a 2×3 array of integers with

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]];
│ array             │
│ {{1,1,1},{2,2,2}} │

Trying to have a different inner dimension will not work:

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2,2]];
ERROR: 2202E: multidimensional arrays must have array expressions with matching dimensions

Assuming that I have a legitimate array, I can get its length:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 1);
│ array_length │
│ 2            │

Or I can get the length of the inner dimension:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 2);
│ array_length │
│ 3            │
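Two related functions are worth knowing alongside array_length: array_ndims returns how many dimensions an array has, and array_dims describes the bounds of each dimension as text:

select array_ndims(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]]) as ndims,
       array_dims(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]]) as dims;
│ ndims │ dims       │
│ 2     │ [1:2][1:3] │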

So, when retrieving our rows from the “foo” table:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
│ id │ stuff               │ array_length │
│ 8  │ {abc}               │       1      │
│ 9  │ {abc,def}           │       2      │
│ 10 │ {abc,def,"ghi jkl"} │       3      │

I can get the array length in a separate column, as in this example.  Or I can even sort in descending order of the array length:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo order by array_length(stuff, 1) desc;
│ id │ stuff               │ array_length │
│ 10 │ {abc,def,"ghi jkl"} │     3        │
│ 9  │ {abc,def}           │     2        │
│ 8  │ {abc}               │     1        │

Notice that our ORDER BY clause has to repeat the function that we used to create the third column.  Another way to do this is to declare an alias for the output of array_length, and then use the alias in ORDER BY:

select id, stuff, array_length(stuff, 1) len from foo order by len desc;
│ id │ stuff               │ len │
│ 10 │ {abc,def,"ghi jkl"} │ 3   │
│ 9  │ {abc,def}           │ 2   │
│ 8  │ {abc}               │ 1   │

Next time, we’ll look at how we can manipulate arrays using array functions.