INNER JOIN of pagevies, contacts and companies - duplicated entries

Question

INNER JOIN of pagevies, contacts and companies - duplicated entries

91 views Asked by Patryk At 01 October 2020 at 11:22

In short: 3 table inner join duplicates records

I have data in BigQuery in 3 tables:

Pageviews with columns:

timestamp
user_id
title
path

Contacts with columns:

website_user_id
email
company_id

Companies with columns:

id
name

I want to display all recorded pageviews and, if user and/or company is known, display this data next to pageview.

First, I join contact and pageviews data (SQL is generated by Metabase business intelligence tool):

SELECT 
  `analytics.pageviews`.`timestamp` AS `timestamp`, 
  `analytics.pageviews`.`title` AS `title`, 
  `analytics.pageviews`.`path` AS `path`,
  `Contacts`.`email` AS `email`
FROM `analytics.pageviews` 
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC

It works as expected and I can see pageviews attributed to known contacts.

Next, I'd like to show pageviews of contacts with known company and which company is this:

SELECT 
  `analytics.pageviews`.`timestamp` AS `timestamp`,
  `analytics.pageviews`.`title` AS `title`, 
  `analytics.pageviews`.`path` AS `path`, 
  `Contacts`.`email` AS `email`, 
  `Companies`.`name` AS `name`
FROM `analytics.pageviews` 
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id` 
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC

With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).

I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.

How to approach this?

Original Q&A

There are 1 answers

**Gordon Linoff** · Accepted Answer · 2020-10-01T13:06:23+00:00

Your description sounds like you have duplciates in companies. This is easy to test for:

select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;

You can get the details using window functions:

select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
      from `analytics.companies` c
     ) c
where cnt >= 2
order by cnt desc, id;

TechQA.

INNER JOIN of pagevies, contacts and companies - duplicated entries

There are 1 answers

Related Questions in SQL

Related Questions in GOOGLE-BIGQUERY

Related Questions in METABASE

Popular Questions

Popular Tags

Trending Questions