I'm running a query to split my Splunk results into buckets. I want to divide and count files based on the size they take up on disk. This can be achieved using either rangemap or eval case.
From what I read here, using eval is faster than rangemap, but I'm getting different results from the two.
This is the query I'm running -
<source>
| eval size_group = case(SizeInMB < 150, "0-150 MB", SizeInMB < 200 AND SizeInMB >= 150, "150-200 MB", SizeInMB < 300 AND SizeInMB >= 200, "200-300 MB", SizeInMB < 500 AND SizeInMB >= 300, "300-500 MB", SizeInMB < 1000 AND SizeInMB >= 500, "500-1000 MB", SizeInMB > 1000, ">1000 MB")
| stats count by size_group
</source>
and this is the result I'm getting.
Whereas using rangemap, this is the query -
<source>
| rangemap field=SizeInMB "0-150MB"=0-150 "151-200MB"=150-200 "201-300MB"=200-300 "301-500MB"=300-500 "501-999MB"=500-1000 default="1000MB+"
| stats count by range
</source>
I tried this range too -
<source>
| rangemap field=SizeInMB "0-150MB"=0-150 "150-200MB"=150-200 "200-300MB"=200-300 "300-500MB"=300-500 "500-1000MB"=500-1000 default="1000MB+"
</source>
and I get the same result.
There is not a huge difference between the two sets of results, and we could probably live with it - but for the range 150-200 MB I see 445958 vs 445961, for 200-300 MB it is 3676 vs 3677, and for 300-500 MB it is 3346 vs 3348. I want to understand why that difference exists, and which one I should trust more. Speed-wise eval seems better, but data-wise is it less correct?


The problem you're seeing is your rangemap has overlapping values. Whereas with the eval format, you're trimming the ranges "properly" with case.
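
You can see the overlap with a quick test - a sketch using makeresults to fabricate a single event, with your field name:

<source>
| makeresults
| eval SizeInMB=150
| rangemap field=SizeInMB "0-150MB"=0-150 "150-200MB"=150-200 default="other"
</source>

A value sitting exactly on a boundary like 150 matches both ranges, so rangemap puts both range names into the range field, and stats count by range then counts that event in both buckets - enough to throw the two result sets off by a handful of events at each boundary, as you saw.
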
Sidebar - you can make that case simpler thusly:
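
<source>
| eval size_group = case(SizeInMB < 150, "0-150 MB", SizeInMB < 200, "150-200 MB", SizeInMB < 300, "200-300 MB", SizeInMB < 500, "300-500 MB", SizeInMB < 1000, "500-1000 MB", 0=0, ">1000 MB")
</source>
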
Since case expressions stop evaluating as soon as a match is made, there's no need to use AND as you'd had it. And using 0=0 for your last possibility will always evaluate true (think of default in case statements in C or C++) - it also catches a SizeInMB of exactly 1000, which your SizeInMB > 1000 test let fall through with no size_group at all.
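
And if you'd rather keep rangemap, the fix is to make the ranges non-overlapping - a sketch, assuming SizeInMB holds whole numbers (rangemap ranges are inclusive at both ends, so fractional values would fall into the gaps):

<source>
| rangemap field=SizeInMB "0-150MB"=0-149 "150-200MB"=150-199 "200-300MB"=200-299 "300-500MB"=300-499 "500-1000MB"=500-999 default="1000MB+"
| stats count by range
</source>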