Remember: Don’t mix = and NULL

I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows.  The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.

So, if I want to find all of the nodes in group 2, I can say

SELECT * FROM Nodes WHERE group_id = 2;

My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes.  If I want all of the versions for nodes in group 2, I can say:

SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);

Now, let’s say that I want to get all of the versions that are not from group 2.  How would I do that?  Well, one way would be:

SELECT COUNT(*) FROM Versions WHERE node_id 
   NOT IN (SELECT id FROM Nodes WHERE group_id = 2);

But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:

SELECT COUNT(*) FROM Versions WHERE node_id 
  IN (SELECT id FROM Nodes WHERE group_id <> 2);

That, of course, was a bug.  That’s because SQL uses trinary (or ternary) logic, not the binary true-false logic to which we’re accustomed.  The result of comparing anything with NULL results in a NULL.  For example, consider the following table:

select * from truth_test ;
| val    |
| t      |
| f      |
| [null] |

I can create a truth table pretty easily:

SELECT val, val = true as is_true, val=false as is_false, 
       val=null as equal_null, val is null as is_null 
FROM truth_test;
| val    | is_true | is_false | equal_null | is_null |
| t      | t       | f        | [null]     | f       |
| f      | f       | t        | [null]     | f       |
| [null] | [null]  | [null]   | [null]     | t       |

You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect.  But using the = operator with NULL is always going to result in NULL — even with itself!

If you want to compare with NULLs, then you need to use IS NULL.  (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.)  And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2.  But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group.  Which, it turns out, was the majority of them.  Whoops.

My solution was to create a view containing all of the IDs that I want to exclude:

CREATE OR REPLACE VIEW group2_model_ids
FROM nodes
WHERE nodes.group_id = 2;

Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:

FROM Versions 
WHERE node_id NOT IN (SELECT id FROM group2_model_ids);

Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and not-NULL values.  But I prefer this way, especially since I’m only dealing with a few tens of thousands of rows of data in this table, in part because it gives me a name and abstraction that I can use in future queries.