SQL: Finding Duplicate Rows

<span class="k">A great way to detect duplicates in PostgreSQL is by using window functions.

SELECT</span> <span class="n">id</span> <span class="k">from</span> <span class="p">(</span><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span><span class="p">(</span>

<span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span><span class="p">,</span> <span class="n">email</span>

<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">id</span>

<span class="p">)</span>

<span class="k">AS</span> <span class="n">user_row_number</span> <span class="k">FROM</span> <span class="n">users</span> <span class="p">)</span> <span class="n">duplicates</span> <span class="k">WHERE</span> <span class="n">duplicates</span><span class="p">.</span><span class="n">user_row_number</span> <span class="o">></span> <span class="mi">1</span>

The Best Postgres Feature You’re Not Using – CTEs Aka WITH Clauses

The feature itself is known as CTEs or common table expressions, you may also here it referred to asWITH clauses. The general idea is that it allows you to create something somewhat equivilant to a view that only exists during that transaction. You can create multiple of these which then allow for clear building blocks and make it simple to follow what you’re doing.

..

--- Created by Craig Kerstiens 11/18/2013
--- Query highlights users that have over 50% of tasks on a given project
--- Gives comma separated list of their tasks and the project


--- Initial query to grab project title and tasks per user
WITH users_tasks AS (
  SELECT 
         users.id as user_id,
         users.email,
         array_agg(tasks.name) as task_list,
         projects.title
  FROM
       users,
       tasks,
       project
  WHERE
        users.id = tasks.user_id
        projects.title = tasks.project_id
  GROUP BY
           users.email,
           projects.title
),

--- Calculates the total tasks per each project
total_tasks_per_project AS (
  SELECT 
         project_id,
         count(*) as task_count
  FROM tasks
  GROUP BY project_id
),

--- Calculates the projects per each user
tasks_per_project_per_user AS (
  SELECT 
         user_id,
         project_id,
         count(*) as task_count
  FROM tasks
  GROUP BY user_id, project_id
),

--- Gets user ids that have over 50% of tasks assigned
overloaded_users AS (
  SELECT tasks_per_project_per_user.user_id,

  FROM tasks_per_project_per_user,
       total_tasks_per_project
  WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2)
)

SELECT 
       email,
       task_list,
       title
FROM 
     users_tasks,
     overloaded_users
WHERE
      users_tasks.user_id = overloaded_users.user_id