So, I've started writing some basic compute queries for keen.io over a dataset of about 170,000 events in total, each with maybe 10 properties at most.
Just from testing this out, we've already racked up about 600 million properties queried, even though most of those queries were against my own local data with only a handful of events stored. So I'm guessing that the compute queries are running over all 170k events, not the 50-100 event filtered result set.
Is there a way to run compute queries on a filtered dataset? Should I be (for example) assigning each user their own event_collection or something?
Note: I'm a platform engineer and architect at Keen IO.
The short answer is no: there is no direct way to filter down the events you'll be charged for, other than reducing the number of events in the collection and narrowing the timeframe being queried.
Keen bills based on properties scanned (see Keen's pricing documentation for details). The short version is:

`Properties Scanned = [# of events matching collection and timeframe] * [# of properties needed to evaluate your query]`

For example, a query over a 170,000-event collection that needs 3 properties to evaluate scans roughly 510,000 properties, even if only 50-100 events actually match your filters.

In some cases splitting events into different collections may in fact be a reasonable solution. But be careful, because doing so effectively eliminates the ability to evaluate queries that span multiple/all collections. One option, if you're willing to double-write every event, is to write into both one global collection and one collection-per-ID (or whatever you want to "index" on). This denormalization may create some other complications, as denormalization always does, but it does have the potential to dramatically reduce your compute usage.
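As a rough illustration of the double-write approach, here's a minimal sketch using the official `keen` Python client. The collection names and the `user_id` property are hypothetical; substitute whatever you actually want to "index" on.

```python
import keen  # official Keen IO Python client (pip install keen)

# Assumes keen.project_id and keen.write_key have been configured
# (or the corresponding KEEN_* environment variables are set).

def record_event(user_id, event_body):
    """Double-write: one global collection plus one per-user collection.

    Queries scoped to a single user can hit the small per-user collection
    (fewer events => fewer properties scanned), while cross-user analytics
    still run against the global collection.
    """
    # Global collection: preserves the ability to query across all users.
    keen.add_event("activity", event_body)

    # Per-user collection: hypothetical naming scheme used as the "index".
    keen.add_event("activity_user_{}".format(user_id), event_body)

# Example usage
record_event("user_42", {"action": "page_view", "path": "/pricing"})
```

The trade-off is exactly the one noted above: you pay for the extra writes and storage, and you have to keep the two copies consistent yourself, in exchange for much cheaper per-user queries.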
(For what it's worth, this is a point of friction that we've noticed many customers starting to run into lately. We are actively looking into options for adding some kind of native support for "secondary indexing" to address this directly in the product. But unfortunately I can't commit to any specific functionality/timeline right now.)
One other thing to point out is that in some cases you may be able to get a good solution using Keen's Cached Datasets feature, which effectively pre-computes a query result for every observed value of some indexed property. This is a somewhat more advanced feature, but in the right scenarios it can be a dramatic improvement over other options in terms of cost, speed, and simplicity.
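To make the Cached Datasets idea concrete, here's a sketch of defining one over the REST API with Python's `requests`. The dataset name, collection, and `user_id` index are hypothetical, and the exact endpoint and field names should be verified against Keen's Cached Datasets documentation; treat this as an outline of the shape, not a definitive reference.

```python
import os
import requests

# Hypothetical setup: project ID and master key pulled from the environment.
PROJECT_ID = os.environ["KEEN_PROJECT_ID"]
MASTER_KEY = os.environ["KEEN_MASTER_KEY"]

dataset_name = "daily-activity-by-user"  # hypothetical dataset name
url = "https://api.keen.io/3.0/projects/{}/datasets/{}".format(
    PROJECT_ID, dataset_name
)

# The dataset pre-computes this query for every observed value of the
# indexed property, so later reads don't re-scan the raw events.
definition = {
    "display_name": "Daily activity counts, indexed by user",
    "query": {
        "analysis_type": "count",
        "event_collection": "activity",  # hypothetical collection
        "timeframe": "this_90_days",
        "interval": "daily",
    },
    "index_by": ["user_id"],  # property whose values are pre-computed
}

resp = requests.put(url, json=definition, headers={"Authorization": MASTER_KEY})
resp.raise_for_status()
print(resp.json())
```

Once the dataset is built, results for a single indexed value are retrieved from the dataset endpoint rather than by re-running the query, which is where the cost and latency savings come from.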
Lastly: if you need more detailed help figuring out the best technical approach to your problem, [email protected] is a great resource. Hope that helps!