Selecting SQL Data with Non-Duplicate Column Values

I ran into an interesting SQL problem this week. I needed to select “unambiguous” data in a sense — only rows that did not contain duplicate values in one of the columns. Really the only way to explain is with an example. Consider this hypothetical data in a table named tblSectorData:

pid  sector
==========
1111  1
2222  2
2222  3
3333  4
5555  5
5555  6
6666  2
7777  2
8888  2
8888  4

The PIDs represent users and sector is a geo-location. Most users have just one location sector. But some users have two or more sectors. For example user 2222 is in sector 2 and also sector 3. Given a particular sector value, I want to select all the users that are in that sector, but only those users that are associated with just one sector. For example, if my target sector is 2, then a simple SELECT pid FROM tblSectorData WHERE sector=2 would give me users 2222, 6666, 7777, 8888. But I only want users 6666 and 7777 because 2222 and 8888 have multiple sectors.

This is a cousin to, but significantly different from, the well-known problem of identifying and removing duplicate rows.

As usual with SQL, there are about a bazillion different ways to attack this problem. Here is one possibility I finally got working:

select A.pid
from tblSectorData A
join tblSectorData B
on A.pid=B.pid
where B.sector=2
group by A.pid
having COUNT(A.pid) = 1

To be honest, the statement is really ugly to me as a procedural paradigm programmer, but sometimes with SQL you’re satisfied just to get anything that works. I also hate SQL because I know in my heart there must be some super-efficient, uber-cool trick I don’t know about. Curse you SQL!

The key is the self-join. Consider this statement:

select A.pid, A.sector, B.pid, B.sector
from tblSectorData A
join tblSectorData B
on A.pid=B.pid
where B.sector=2

The result is:

pid   sector  pid   sector
--------------------------
2222   2      2222   2
2222   3      2222   2
6666   2      6666   2
7777   2      7777   2
8888   2      8888   2
8888   4      8888   2

I could also have used A.sector=2 in the WHERE clause. Notice that if we group by pid, we can find instances where there are more than just a single associated sector by using COUNT(pid), or in other words, by selecting just rows where COUNT(A.pid) = 1 we get the desired result.

About these ads
This entry was posted in Software Test Automation. Bookmark the permalink.

2 Responses to Selecting SQL Data with Non-Duplicate Column Values

  1. mvaneerde says:

    This technique works for columns that admit for the < operator:

    SELECT
    pid
    FROM
    tblSectorData
    GROUP BY
    pid
    HAVING
    COUNT(*) = 1 AND MAX(sector) = @sector

    This version is slightly more general. Note the inner query does not depend on the outer, so this is not really a JOIN.a

    SELECT
    pid
    FROM
    tblSectorData
    WHERE
    sector = @sector AND
    pid IN (
    SELECT
    pid
    FROM
    tblSectorData
    GROUP BY
    pid
    HAVING
    COUNT(*) = 1
    )

Comments are closed.