Nested loops for large datasets

372 views Asked by At

As part of the project I'm working on I have to build various reports. To build these reports I gather data from very large data sets. As such I'm trying to optomize my loop structures to minimize checks inside the inner loops.

One of the globals (key-value type of data set) I'm often using has roughly the following structure:

dataStatus->inputDate->state->region->street->data

I have to loop through each of the preceding sections to get to the data that I'm after. The problem is that while I always have the state (it is a required field in the input screen), the region and street may be omitted. If the street is left blank I must loop through all the streets in the region. Likewise, if the region is left blank I must loop through all the regions within the state and again through all the streets in each of the regions.

The easy solution would be something like this:

loop through dataStatuses {
    loop through dates {
        if street is set {
            get data
        } else if region is set {
            loop through streets {
                get data
            }
        } else {
            loop through regions {
                loop through streets {
                    get data
                }
            }
        }
    }
}

However, I would really like a way to skip checks inside. The best solution I could come up with so far is something like this:

if street is set {
    loop through dataStatuses {
        loop through dates {
            get data
        }
    }
} else if region is set {
    loop through dataStatuses {
        loop through dates {
            loop through streets {
                get data
            }
        }
    }
} else {
    loop through dataStatuses {
        loop through dates {
            loop through regions {
                loop through streets {
                    get data
                }
            }
        }
    }
}

Is there a more elegant solution than this, maybe something that would cater for n levels before reaching the data?

Any help will be much appreciated.

3

There are 3 answers

0
NotMyName On BEST ANSWER

I have come up with a solution that works well for me. This might not be the perfect I was initially looking for, but it works for me for these reasons:

  1. When I'm done with this work other programmers will periodically work on the reports and I want to make their life as easy as possible in this regard.
  2. A report-generator will be nice, but not the most productive thing to do at the moment.

I believe my solution is fairly easy to read and maintain without sacrificing too much performance:

GatherData
    set command="do LoopThroughRegions(status,date,state,.dataCollection)"
    if ($length(street)>0){
        set command="do LoopThroughData(status,date,state,region,street,.dataCollection)"
    } elseif ($length(region)>0){
        set command="do LoopThroughStreets(status,date,state,region,.dataCollection)"
    }

    set date=fromDate-1,status=""
    for{
        set status=$order(^GLOBAL(status)) quit:status=""
        for{
            set date=$order(^GLOBAL(status,date)) quit:((date>toDate)||(date=""))
            xecute (command)
        }
    }
 quit

LoopThroughRegions(status,date,state,dataCollection)
    set currentRegion="" 
    for{
        set currentRegion=$order(^GLOBAL(status,date,region,currentRegion)) quit:currentRegion=""
        do LoopThroughStreets(status,date,state,currentRegion,.dataCollection)
    }
 quit

LoopThroughStreets(status,date,state,region,dataCollection)
    set currentStreet=""
    for{
        set currentStreet=$order(^GLOBAL(status,date,state,region,currentStreet)) quit:currentStreet=""
        do LoopThroughData(status,date,state,region,currentStreet,.dataCollection)
    }
 quit

LoopThroughData(status,date,state,region,street,dataCollection)
    set dataItem="" 
    for{
        set dataItem=$order(^GLOBAL(status,date,state,region,street,dataItem)) quit:dataItem=""
        // Do stuff
    }
 quit

Unless a better solution is provided I will select my own answer for future reference. Hopefully it might even help someone else as well.

1
psr On

Hmm. Well, first of all,this code is probably so heavily I/O bound that optimizing anything but disk access is probably inconsequential. You might try throwing a pointless loop to 1000 before every if-test just to see how much difference that makes.

It doesn't appear like indexing would help you here, since you must access all the data when a value isn't specified, but if you are computing aggregates then caching those might help. Intersystem's DeepSee product is designed for that.

But it sounds like maybe you are just running a big report printing all the data, and want to optimize what you can. In that case, I think your solution does that. As far as elegance, well, it usually isn't found in conjunction with heavy optimization. The problem with generalizing a solution is that it's usually slower than a tailored solution.

I think possibly using tags and goto statements, while much harder to read, might run a little faster than your solution - if you think it's worth it. The fact that Intersystems does tags and gotos in their compiled SQL routines makes me think it's likely the fastest option. I haven't measured the difference, but I suppose Intersystems probably has. If you really need speed then you should try different approaches and measure the speed. Be sure to test with warm and cold routine and global caches.

As far as a general solution, you could, if you really wanted to, write something that generates code similar to what you hand-coded, but can generate it for a general n level solution. You tell it about the levels and it generates the solution. That would be both general and fast - but I very much doubt that for this example it would be a good use of your time to write that general solution, since hand-coding is pretty trivial and most likely not that common.

1
SSH On

You are going in right direction with your proposed answer, but I would change it slightly to avoid using xecute (xecute and performance rarely play well) and make overall architecture a bit better.

Instead of deciding what to do in main routine and passing in only needed parameters, pass in all parameters and decide what you need to do in subroutine itself.

So your code will become:

GatherData(...)
    set date=fromDate-1,status=""
    for {
        set status=$order(^GLOBAL(status)) quit:status=""
        for {
            set date=$order(^GLOBAL(status,date)) quit:((date>toDate)||(date=""))
            do LoopThroughRegions(status,date,state,region,street,.dataCollection)
        }
    }
 quit

LoopThroughRegions(status,date,state,region,street,dataCollection)
    if region'="" {
        do LoopThroughStreets(status,date,state,region,street,.dataCollection) 
        quit
    }
    set currentRegion="" 
    for {
        set currentRegion=$order(^GLOBAL(status,date,region,currentRegion)) quit:currentRegion=""
        do LoopThroughStreets(status,date,state,currentRegion,street,.dataCollection)
    }
 quit

LoopThroughStreets(status,date,state,region,street,dataCollection)
    if street'="" {
        do LoopThroughData(status,date,state,region,street,dataCollection) 
        quit
    }
    set currentStreet=""
    for{
        set currentStreet=$order(^GLOBAL(status,date,state,region,currentStreet)) quit:currentStreet=""
        do LoopThroughData(status,date,state,region,currentStreet,.dataCollection)
    }
 quit

LoopThroughData(status,date,state,region,street,dataCollection)
    set dataItem="" 
    for{
        set dataItem=$order(^GLOBAL(status,date,state,region,street,dataItem)) quit:dataItem=""
        // Do stuff
    }
 quit

I would also recommend to change these small subroutines to private procedures.