I want to be able to use MapReduce to process model entities on queries over a datetime property, or perhaps any non-key property.
It looks like the crucial factor for MapReduce is being able to split the range evenly, down to a minimum range "space" (i.e., based not on the number of entities, but on the possible number of entities within the range). The built-in range is a key range, which GAE has designed to be evenly distributed and which is limited to one entity per key.
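For reference, this is roughly what I mean by the built-in key-range splitting: a minimal sketch of starting a mapper with the Python appengine-mapreduce library's DatastoreInputReader (the model and handler names are just placeholders):

```python
# Minimal sketch: the built-in DatastoreInputReader splits the *key* range
# into shard_count pieces. "models.MyModel" and "main.process_entity" are
# placeholder names for this example.
from mapreduce import control

control.start_map(
    name="process_entities",
    handler_spec="main.process_entity",  # def process_entity(entity): ...
    reader_spec="mapreduce.input_readers.DatastoreInputReader",
    mapper_parameters={"entity_kind": "models.MyModel"},
    shard_count=8)
```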
It also looks like creating a range iterator on any other property has two possible problems: (1) the values may not be evenly distributed; and (2) there may be any number of entities at a given value. For issue (2), for example, there may be multiple entities with the same datetime value, which seems to make it hard to determine a batch size when splitting the range.
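To make the problem concrete, a naive even split of a datetime range (a pure-Python sketch; the property and model names are hypothetical) divides the time span equally rather than the entity counts, so some shards can end up with far more work than others, and many entities sharing one value can't be split any further:

```python
from datetime import datetime

def split_datetime_range(start, end, shard_count):
    """Naively split [start, end) into shard_count equal time slices."""
    step = (end - start) / shard_count
    return [(start + i * step, start + (i + 1) * step)
            for i in range(shard_count)]

# Each shard would then run something like:
#   MyModel.all().filter('created >=', lo).filter('created <', hi)
# If most entities cluster in one slice (or share one datetime value),
# that shard ends up doing most of the work.
ranges = split_datetime_range(datetime(2012, 1, 1), datetime(2013, 1, 1), 4)
```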
My question is: is there a practical way to build a MapReduce model iterator whose range iterator is not based on model keys, where the values may be neither evenly distributed nor have a predictable entity count for any given range?
MapReduce will try to split the input as best it can.

For an inequality query (e.g., between timestamp X and Y), it splits the property range (such as the timestamp range) evenly. So if the values are poorly distributed across that range, some shards will get more entities than others. This is somewhat mitigated by the fact that it oversamples: each shard receives multiple non-adjacent ranges.

For equality queries (e.g., where Foo=Bar and Baz=Bat), it does much better. It uses the "__scatter__" property, a value that is randomly applied to 1 in every 512 entities. It queries on that property to get a sample of how entities are distributed through keyspace, and then partitions keyspace accordingly. This obviously does not provide exact partitioning, but because it follows the actual distribution of the data rather than simply assuming a uniform distribution, it does fairly well in practice.
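For the equality case, the scatter-based partitioning looks roughly like the following (a sketch modeled on what the library's DatastoreInputReader does internally; the oversampling factor and exact API calls here are my approximation, not the library's literal code):

```python
# Sketch: sample the "__scatter__" property to find key-range split points
# that follow the actual distribution of entities through keyspace.
from google.appengine.api import datastore

def scatter_split_points(kind, shard_count, oversampling_factor=32):
    query = datastore.Query(kind, keys_only=True)
    query.Order("__scatter__")
    # Fetch more sample keys than shards so the boundaries track the real
    # distribution rather than a handful of random points.
    random_keys = query.Get(shard_count * oversampling_factor)
    if not random_keys:
        return []  # too little data to split; use a single shard
    random_keys.sort()
    # Take every oversampling_factor-th key as a shard boundary.
    return [random_keys[i]
            for i in range(oversampling_factor, len(random_keys),
                           oversampling_factor)]
```

Each pair of adjacent split points then becomes one shard's key range.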