Remember: Don’t mix = and NULL
I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows. The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.
So, if I want to find all of the nodes in group 2, I can say
SELECT * FROM Nodes WHERE group_id = 2;
My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes. If I want all of the versions for nodes in group 2, I can say:
SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);
Now, let’s say that I want to get all of the versions that are not from group 2. How would I do that? Well, one way would be:
SELECT COUNT(*) FROM Versions WHERE node_id NOT IN (SELECT id FROM Nodes WHERE group_id = 2);
But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:
SELECT COUNT(*) FROM Versions WHERE node_id IN (SELECT id FROM Nodes WHERE group_id <> 2);
That, of course, was a bug. That’s because SQL uses trinary (or ternary) logic, not the binary true-false logic to which we’re accustomed. The result of comparing anything with NULL results in a NULL. For example, consider the following table:
select * from truth_test ; +--------+ | val | +--------+ | t | | f | | [null] | +--------+
I can create a truth table pretty easily:
SELECT val, val = true as is_true, val=false as is_false, val=null as equal_null, val is null as is_null FROM truth_test; +--------+---------+----------+------------+---------+ | val | is_true | is_false | equal_null | is_null | +--------+---------+----------+------------+---------+ | t | t | f | [null] | f | | f | f | t | [null] | f | | [null] | [null] | [null] | [null] | t | +--------+---------+----------+------------+---------+
You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect. But using the = operator with NULL is always going to result in NULL — even with itself!
If you want to compare with NULLs, then you need to use IS NULL. (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.) And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2. But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group. Which, it turns out, was the majority of them. Whoops.
My solution was to create a view containing all of the IDs that I want to exclude:
CREATE OR REPLACE VIEW group2_model_ids AS SELECT nodes.id FROM nodes WHERE nodes.group_id = 2;
Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:
SELECT COUNT(*) FROM Versions WHERE node_id NOT IN (SELECT id FROM group2_model_ids);
Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and not-NULL values. But I prefer this way, especially since I’m only dealing with a few tens of thousands of rows of data in this table, in part because it gives me a name and abstraction that I can use in future queries.