Archive

Category Archives for "PostgreSQL"
2

Learning to love PostgreSQL arrays

I’ll admit it: When arrays were added to PostgreSQL a number of years ago, I thought that this was a really bad idea.  I’m a firm believer in normalization when it comes to database design and storage; and the idea of putting multiple values inside of a single column struck me as particularly foolish.  Besides, my impression was that PostgreSQL arrays were clumsy to work with, and didn’t really add much to my data model.

Of course, it turns out that arrays are extremely useful in PostgreSQL.  I still cringe when people want to use them for general-purpose storage, instead of working to normalize their database design.  But over the last few months, as I’ve been doing all sorts of complex PostgreSQL queries for my PhD dissertation, I’ve found that PostgreSQL arrays are extremely useful when it comes to aggregating and reporting data.

I’ve thus decided to dedicate a number of blog posts to PostgreSQL arrays: How to create them, use them, manipulate them, and decide when to use them.

Let’s start with the very basics; over the next few blog posts, I’ll try to show how arrays can be interesting — and even useful, and fit into more complex queries and needs.

You can create an array of just about any data type in PostgreSQL.  As the documentation says, “Arrays of any built-in or user-defined base type, enum type, or composite type can be created. Arrays of domains are not yet supported.” This means that you can create arrays of just about any data type you want: Integers, text, enums, other arrays (for multidimensional arrays), or even user-defined types.  To date, I have generally created arrays of integers and text, but that might not be representative of your use case.

To create a table with a text array in one column, just add square brackets ([]) after the type:

CREATE TABLE Foo (
id SERIAL NOT NULL,
stuff TEXT[],
PRIMARY KEY(id)
);

When I then ask for the definition of my “Foo” table, I see the following:

[local]/reuven=# \d foo
 Table "public.foo"
┌────────┬─────────┬──────────────────────────────────────────────────┐
│ Column │ Type │ Modifiers │
├────────┼─────────┼──────────────────────────────────────────────────┤
│ id     │ integer │ not null default nextval('foo_id_seq'::regclass) │
│ stuff  │ text[]  │                                                  │
└────────┴─────────┴──────────────────────────────────────────────────┘
Indexes:
 "foo_pkey" PRIMARY KEY, btree (id)

Notice that the type of the “stuff” column is indeed recorded as “text[]”, showing that it’s an array.  If we try to insert a plain-text value into that column, PostgreSQL will complain:

[local]/reuven=# insert into foo (stuff) values ('abc');
ERROR: 22P02: array value must start with "{" or dimension information
LINE 1: insert into foo (stuff) values ('abc');
                                        ^

One of the many things that I love about PostgreSQL is the attention to detail in the error messages.  Not only does it tell us that the table is expecting an array value, but that the array must begin with a { character.  It also shows us, using a ^ character, where the parser had problems.  That’s not always a perfect indicator of where the problem lies, but it’s a great start.

If I want to insert an array value into my table, I can thus use the literal array syntax that PostgreSQL provides, with (as indicated above) curly braces:

[local]/reuven=# insert into foo (stuff) values ('{abc}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def,ghi jkl}');
INSERT 0 1

The above commands insert three rows into our table.  In all three cases, we are inserting array values into our table.  Notice that in all cases, the array is inserted as a string, surrounded by single quote marks.  Thus, ‘{abc}’ becomes a one-element array, and ‘{abc,def}’ becomes a two-element array.

What happens when there is a space character inside of the text?  PostgreSQL automatically quotes the value (with double quotes — be careful!).  What happens if you want a comma or single quote as part of the text?  Then things get even uglier.

A nice solution, and a better way (I believe) to insert arrays in any event, is to use the built-in ARRAY constructor syntax.  Then you don’t have to worry about such things.  For example, I can rewrite all of the above INSERT commands in what I believe to be a much nicer way:

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc']);
INSERT 0 1

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def']);
INSERT 0 1

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def', 'ghi jkl']);
INSERT 0 1

[local]/reuven=# select * from foo;
┌────┬─────────────────────┐
│ id │ stuff │
├────┼─────────────────────┤
│ 8  │ {abc}               │
│ 9  │ {abc,def}           │
│ 10 │ {abc,def,"ghi jkl"} │
└────┴─────────────────────┘
(3 rows)

The same data was inserted into the table, but with less hassle than before.

Now, just because we can insert arrays directly into our tables doesn’t necessarily mean that we should do so.  You’ll see, over the course of this series, that I view arrays as a great way to aggregate and analyze existing data, particularly within the context of a view or a CTE.  So please don’t be tempted to start stuffing all of the data you want and need into a single column; normalization is still a good idea, and arrays can be tempting.  However, being familiar with the basics of defining and inserting array data into the database is quite useful, and will serve us well throughout the rest of this series.

 

12

Don’t use double quotes in PostgreSQL

I spend a large proportion of my time teaching classes in a variety of open-source technologies — specifically, Ruby, Python, PostgreSQL, and Git.  One of the questions that invariably arises in these classes has to do with the case sensitivity of the technology in question. That is, is the variable “x” the same as the variable “X”?

In nearly ever case, the technologies with which I work are case sensitive, meaning that “x” and “X” are considered two completely different identifiers.  Indeed, the Ruby language goes so far as to give capitalized identifiers a special status, calling them “constants.”  (They’re not really constants, in that you can always redefine a Ruby constant.  However, you will get a warning when you reassign it.  For this reason, I prefer to call them “stubborns,” so that people don’t get the wrong idea.)

SQL is a completely different story, however: The SQL standard states that SQL queries and identifiers (e.g., table names) aren’t case sensitive.  Thus, there’s no difference between

select id, email from people;

and

SELECT ID, EMAIL FROM PEOPLE;

I find both of these styles to be somewhat unreadable, and over the years have generally followed Joe Celko‘s advice for capitalization in SQL queries:

  1. SQL keywords are in ALL CAPS,
  2. Table names have Initial Caps, and
  3. Column names are all in lowercase.

Given that rule, the above query would look like this:

SELECT id, email FROM People;

Again, this capitalization scheme is completely ignored by PostgreSQL.  It’s all for our benefit, as developers, who want to be able to read our code down the road.

Actually, that’s not entirely true: PostgreSQL doesn’t exactly ignore the case, but rather forces all of these names to be lowercase. So if you say

CREATE TABLE People (
  id SERIAL NOT NULL,
  email TEXT NOT NULL,
  PRIMARY KEY(id)
);

PostgreSQL will create a table named “people”, all in lowercase. But because of the way PostgreSQL works, forcing all names to lowercase, I can still say:

SELECT * FROM People;

And it will work just fine.

Now, there is a way around this, namely by using double quotes.  Whereas single quotes in PostgreSQL are used to create a text string, double quotes are used to name an identifier without changing its case.

Let me say that again, because so many people get this wrong: Single quotes and double quotes in PostgreSQL have completely different jobs, and return completely different data types.  Single quotes return text strings.  Double quotes return (if you can really think of them as “returning” anything) identifiers, but with the case preserved.

Thus, if I were to repeat the above table-creation query, but use double quotes:

CREATE TABLE "People" (
 id SERIAL NOT NULL,
 email TEXT NOT NULL,
 PRIMARY KEY(id)
);

I have now created a table in which the table name has not been forced to lowercase, but which has preserved the capital P.  This means that the following query will now fail:

select * from people;

ERROR: relation "people" does not exist
LINE 1: select * from people;
                      ^

It fails because I have created a table “People”, but I have told PostgreSQL to look for a table “people”.  Confusing?  Absolutely.  If you use double quotes on the name of a table, column, index, or other object when you create it, and if there is even one capital letter in that identifier, you will need to use double quotes every single time you use it.  That’s frustrating for everyone involved — it means that we can’t use the nice capitalization rules that I mentioned earlier, and that various queries will suddenly fail to work.

The bottom line, then, is to avoid using double quotes when creating anything.  Actually, you should avoid double quotes when retrieving things as well — otherwise, you might discover that you’re trying to retrieve a column that PostgreSQL doesn’t believe exists.

Now, let’s say that you like this advice, and you try to take it to heart.  Unfortunately, there are places where you still might get bitten, despite your best efforts.

For example, the GUI tool for PostgreSQL administration, PGAdmin 3, is used by many people.  (I’m an old-school Unix guy, and thus prefer the textual “psql” client.)  I’ve discovered over the years that while PGAdmin might be a useful and friendly way to manage your databases, it also automatically uses double quotes when creating tables.  This means that if you create a table with PGAdmin, you might find yourself struggling to find or query it afterwards.

Another source of frustration is the Active Record ORM (object-relational mapper), most commonly used in Ruby on Rails. Perhaps because Active Record was developed by users of MySQL, whose table and column names are case-sensitive by default, Active Record automatically puts double quotes around all table and column names in queries. This can lead to frustrating incompatibilities — such as if you want to access the column in Ruby using CamelCase, but in a case-insensitive way in the database.

PostgreSQL is a fabulous database, and has all sorts of great capabilities.  Unless you really want your identifiers to be case-sensitive, though, I strongly suggest that you avoid using double quotes.  And if you encounter problems working with columns, check the database logs to see whether the queries are being sent using double quotes.  You might be surprised, and manage to save yourself quite a bit of debugging time.

>