INNER JOIN of pagevies, contacts and companies - duplicated entries

96 views Asked by At

In short: 3 table inner join duplicates records

I have data in BigQuery in 3 tables:

Pageviews with columns:

  • timestamp
  • user_id
  • title
  • path

Contacts with columns:

  • website_user_id
  • email
  • company_id

Companies with columns:

  • id
  • name

I want to display all recorded pageviews and, if user and/or company is known, display this data next to pageview.

First, I join contact and pageviews data (SQL is generated by Metabase business intelligence tool):

SELECT 
  `analytics.pageviews`.`timestamp` AS `timestamp`, 
  `analytics.pageviews`.`title` AS `title`, 
  `analytics.pageviews`.`path` AS `path`,
  `Contacts`.`email` AS `email`
FROM `analytics.pageviews` 
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC

It works as expected and I can see pageviews attributed to known contacts.

Next, I'd like to show pageviews of contacts with known company and which company is this:

SELECT 
  `analytics.pageviews`.`timestamp` AS `timestamp`,
  `analytics.pageviews`.`title` AS `title`, 
  `analytics.pageviews`.`path` AS `path`, 
  `Contacts`.`email` AS `email`, 
  `Companies`.`name` AS `name`
FROM `analytics.pageviews` 
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id` 
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC

With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).

I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.

How to approach this?

1

There are 1 answers

0
Gordon Linoff On BEST ANSWER

Your description sounds like you have duplciates in companies. This is easy to test for:

select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;

You can get the details using window functions:

select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
      from `analytics.companies` c
     ) c
where cnt >= 2
order by cnt desc, id;