ScyllaDB webscraper table design issues

18 views Asked by At

As my first project with ScyllaDB Ive decided on a webscraper. The scraper has to not re-download any pages that have been measured in the past 3 minutes(spam protection) also it should redownload pages that have been measured over 2 months ago.

My current table design is this(optimized for final user selection):

CREATE TABLE pages
(
    second_level_domain text,
    domain_measured_at timestamp static,
    page_measured_at timestamp,
    top_level_domain text,
    subdomain text,
    <and some other columns>
    primary key((second_level_domain), top_level_domain, subdomain)
)

The idea behind this was that once downloader downloads a page with for example a second level domain of "google", then it would update the single downloaded page in scylla and by doing so update the static column for the entire partition key thus causing the domain to not be spammed anymore.

My problem is that i now need to tell the downloader what to download. Thus i need a query that meets these needs:

SELECT * FROM pages WHERE domain_measured_at < :threeMinutesAgoTime AND page_measured_at > :olderThanTwoMonthsAgo

What is to proper way to accomplish this without using ALLOW FILTERING and preferably without much duplication of all the other columns?

I tried removing the static from domain_measured_at and instead creating a materalized view, that would contain both domain domain_measured_at and page_measured_at as the primary keys - you cannot create materalized views with two primary keys that were not in the original table

0

There are 0 answers