I'm facing an issue with creating materialized views in ClickHouse where each row is duplicated in the materialized view, despite no changes to my setup. My data warehouse is populated through an ELT process from a PostgreSQL database using Airbyte. Interestingly, the issue persists even when creating the materialized view directly through the ClickHouse console.
The problem surfaced about 3-4 weeks ago. Previously, the materialized view was created as expected without any duplicates. Running the query outside of a materialized view context returns the correct result set (approximately 8 million rows). However, when I create a materialized view with the same query, the result set unexpectedly doubles in size to approximately 16 million rows, effectively duplicating each row.
Here's the query used for the materialized view:
CREATE MATERIALIZED VIEW transactions
ENGINE = MergeTree
ORDER BY (ContainerLocationsId) POPULATE AS
SELECT
stores.Id AS StoreId,
stores.Type as StoreSubtype,
stores.Name AS StoreName,
stores.City AS City,
containerLocations.Id AS ContainerLocationsId,
containerLocations.OwnerSince AS OwnerSince,
containerLocations.OwnerTill AS OwnerTill,
containerLocations.OwnerId AS OwnerId,
containerLocations.NextOwnerId AS NextOwnerId,
containerLocations.ContainerId AS ContainerId,
containerLocations.TransactionType AS TransactionType,
containerTypes.Name AS ContainerTypeName
FROM containerLocations
LEFT JOIN userLinks ON containerLocations.OwnerId = userLinks.UserId
LEFT JOIN containers ON containers.Id = containerLocations.ContainerId
LEFT JOIN stores ON userLinks.StoreId = stores.Id
LEFT JOIN storeCategories ON storeCategories.Id = stores.CategoryId
LEFT JOIN containerTypes ON containerTypes.Id = containers.TypeId
WHERE containerLocations.IsBillableTransaction = 1
The ContainerLocationsId is designed to be a unique, incrementing ID. I haven't made any modifications to the query or the underlying data structure that could account for this sudden change in behavior.
The data successfully loads into the warehouse without any duplication issues when running the query normally. Despite attempts to resolve the issue by adding more columns to the sorting key, the duplication persists. The integrity of the original tables in the warehouse confirms that the ELT process is operating correctly. Additionally, my search through the ClickHouse documentation has not yielded a solution. As I am relatively new to ClickHouse, I'm uncertain about the cause of this issue.