Remember: Don’t mix = and NULL

May 28, 2014 . By Reuven

I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows.  The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.

So, if I want to find all of the nodes in group 2, I can say

SELECT * FROM Nodes WHERE group_id = 2;

My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes.  If I want all of the versions for nodes in group 2, I can say:

SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);

Now, let’s say that I want to get all of the versions that are not from group 2.  How would I do that?  Well, one way would be:

SELECT COUNT(*) FROM Versions WHERE node_id 
   NOT IN (SELECT id FROM Nodes WHERE group_id = 2);

But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:

SELECT COUNT(*) FROM Versions WHERE node_id 
  IN (SELECT id FROM Nodes WHERE group_id <> 2);

That, of course, was a bug.  That’s because SQL uses trinary (or ternary) logic, not the binary true-false logic to which we’re accustomed.  The result of comparing anything with NULL results in a NULL.  For example, consider the following table:

select * from truth_test ;
+--------+
| val    |
+--------+
| t      |
| f      |
| [null] |
+--------+

I can create a truth table pretty easily:

SELECT val, val = true as is_true, val=false as is_false, 
       val=null as equal_null, val is null as is_null 
FROM truth_test;
+--------+---------+----------+------------+---------+
| val    | is_true | is_false | equal_null | is_null |
+--------+---------+----------+------------+---------+
| t      | t       | f        | [null]     | f       |
| f      | f       | t        | [null]     | f       |
| [null] | [null]  | [null]   | [null]     | t       |
+--------+---------+----------+------------+---------+

You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect.  But using the = operator with NULL is always going to result in NULL — even with itself!

If you want to compare with NULLs, then you need to use IS NULL.  (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.)  And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2.  But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group.  Which, it turns out, was the majority of them.  Whoops.

My solution was to create a view containing all of the IDs that I want to exclude:

CREATE OR REPLACE VIEW group2_model_ids
AS
SELECT nodes.id
FROM nodes
WHERE nodes.group_id = 2;

Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:

SELECT COUNT(*) 
FROM Versions 
WHERE node_id NOT IN (SELECT id FROM group2_model_ids);

Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and not-NULL values.  But I prefer this way, especially since I’m only dealing with a few tens of thousands of rows of data in this table, in part because it gives me a name and abstraction that I can use in future queries.

Related Posts

Reminder: My “Intro to SQL” course is this Sunday!

Reminder: My “Intro to SQL” course is this Sunday!

Join the data revolution with my “Intro to SQL” course!

Join the data revolution with my “Intro to SQL” course!

Yes, you can master regular expressions!

Yes, you can master regular expressions!
{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
>