Handling Parallel Jobs/Threads


I'm refactoring my project and researching the best ways to increase the application's performance.

Question 1. SpinLock vs Interlocked

To create a counter, which approach has better performance?

Interlocked.Increment(ref counter);

Or

SpinLock _spinlock = new SpinLock();
bool lockTaken = false;
try
{
    _spinlock.Enter(ref lockTaken);
    counter = counter + 1;
}
finally
{
    if (lockTaken) _spinlock.Exit(false);
}

And if we need to increment another counter, like counter2, should we declare another SpinLock object, or is it enough to use another boolean variable?

Question 2. Handling nested tasks or better replacement

In the current version of my application, I used tasks, adding each new task to an array and then calling Task.WaitAll().

After a lot of research I figured out that Parallel.ForEach has better performance, but how can I control the number of concurrent threads? I know I can specify MaxDegreeOfParallelism in a ParallelOptions parameter, but the problem is that every time the crawl(url) method runs, it creates another limited batch of threads. I mean, if I set MaxDegreeOfParallelism to 10, then every time crawl(url) runs, another 10 can be created. Am I right? So how can I prevent this? Should I use a semaphore and threads instead of Parallel? Or is there a better way?

public void Start()
{
    Parallel.Invoke(() => crawl(url));
}

void crawl(string url)
{
    var response = getresponse(url);
    Parallel.ForEach(response.links, parallelOptions, link =>
    {
        crawl(link);
    });
}

Question 3. Notify when all Jobs (and nested jobs) finished.

And my last question: how can I tell when all of my jobs (including nested jobs) have finished?


There are 2 answers

Enigmativity

I'd suggest looking at Microsoft's Reactive Framework for this. You can write your Crawl function like this:

public IObservable<Response> Crawl(string url)
{
    return
        from r in Observable.Start(() => GetResponse(url))
        from l in r.Links.ToObservable()
        from r2 in Crawl(l).StartWith(r)
        select r2;
}

Then to call it try this:

IObservable<Response> crawls = Crawl("www.microsoft.com");

IDisposable subscription =
    crawls
        .Subscribe(
            r => { /* process each response as it arrives */ },
            () => { /* All crawls complete */ });

Done. It handles all the threading for you. Just NuGet "System.Reactive".

TheGeneral

There are a lot of misconceptions here; I'll point out just a few.

To create a counter, which approach has better performance?

Either can; it depends on your exact situation.
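To make the trade-off concrete, here is a minimal sketch of the two techniques side by side. The field names are illustrative, and the key point (each independently protected counter needs its own SpinLock instance; a fresh bool alone is not enough, since lockTaken is just a per-call out-flag) follows from how SpinLock.Enter is documented to work:

```csharp
using System;
using System.Threading;

class CounterDemo
{
    static int _counter;                       // updated with Interlocked
    static int _counter2;                      // guarded by its own SpinLock
    static SpinLock _lock2 = new SpinLock();   // one lock per protected counter

    static void Main()
    {
        // For a bare counter, Interlocked.Increment compiles down to a
        // single atomic instruction and is usually the cheaper option.
        Interlocked.Increment(ref _counter);

        // SpinLock protects an arbitrary critical section. The lockTaken
        // flag only reports whether *this call* acquired the lock; it is
        // the SpinLock instance itself that provides mutual exclusion.
        bool lockTaken = false;
        try
        {
            _lock2.Enter(ref lockTaken);
            _counter2++;
        }
        finally
        {
            if (lockTaken) _lock2.Exit(false);
        }

        Console.WriteLine($"{_counter} {_counter2}");
    }
}
```

SpinLock only starts to pay off when the critical section covers more than a single increment; for a plain counter, Interlocked is the idiomatic choice.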

After a lot of research I just figured out that using Parallel.ForEach has better performance

This is also very suspect, and often just wrong. Once again, it depends on what you want to do.

I know I can specify a MaxDegreeOfParallelism in a ParallelOptions parameter, but the problem is here, every time crawl(url) method runs, It just create another limited number of threads

Once again this is wrong; that is your own implementation detail, and it depends on how you do it. Also, MaxDegreeOfParallelism in the TPL is only a suggestion: the library will heuristically do what it thinks is best for you.

should I use semaphore and threads instead of Parallel? Or there is a better way?

The answer is a resounding yes.


OK, let's have a look at what you are doing. You say you are making a crawler. A crawler accesses the internet, and each time you access the internet, a network resource, or the file system, you are (said simplistically) waiting around for IO completion port callbacks. This is what's known as an IO-bound workload.

With IO-bound tasks we don't want to tie up the thread pool with threads waiting for IO completion ports. It's inefficient; you are using up valuable resources waiting for callbacks on threads that are effectively paused.

So for IO-bound work, we don't want to spin up new tasks, and we don't want to use Parallel.ForEach to tie up threads waiting for events to happen. The most appropriate modern pattern for IO-bound tasks is async and await.

For CPU-bound work (when you want to use as much CPU as you can), by all means hit the thread pool hard: use TPL Parallel or as many tasks as is effective.

The async and await pattern works well with completion ports because, instead of idly waiting for a callback, it gives the threads back to the pool and allows them to be reused.
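As a sketch of what that looks like for the crawler, here is an async version throttled with SemaphoreSlim. GetResponseAsync is a stand-in for the asker's getresponse(url), the limit of 10 is an assumed target, and the recursion structure is mine, not the asker's:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottledCrawler
{
    // One global gate: at most 10 fetches in flight across the whole
    // crawl, no matter how deep the recursion goes.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(10);

    // Stand-in for the real HTTP fetch; returns the links on the page.
    private async Task<string[]> GetResponseAsync(string url)
    {
        await Task.Delay(10);           // simulate IO latency
        return Array.Empty<string>();   // stand-in for response.links
    }

    public async Task CrawlAsync(string url)
    {
        await _gate.WaitAsync();        // no thread is blocked here
        string[] links;
        try
        {
            links = await GetResponseAsync(url);
        }
        finally
        {
            _gate.Release();
        }

        // Awaiting all children means a page's task completes only when
        // every nested page has completed, which also answers Question 3.
        await Task.WhenAll(links.Select(CrawlAsync));
    }

    static async Task Main()
    {
        await new ThrottledCrawler().CrawlAsync("http://example.com");
        Console.WriteLine("crawl tree complete");
    }
}
```

The SemaphoreSlim replaces MaxDegreeOfParallelism as one crawl-wide limit, and awaiting the root task replaces Task.WaitAll.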

...

However, what I suggest is another approach, where you can take advantage of async and await and also control the degree of parallelism. This lets you be good to your thread pool by not using up resources waiting for callbacks, and it allows IO to be IO. I give you TPL Dataflow's ActionBlock and TransformManyBlock.
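A minimal sketch of that idea, assuming the System.Threading.Tasks.Dataflow NuGet package; the fetch body and the pending-counter completion scheme are illustrative stand-ins, not production code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

class DataflowCrawler
{
    static async Task Main()
    {
        int pending = 1;  // urls posted but not yet finished (root counts as 1)
        var done = new TaskCompletionSource<bool>();
        ActionBlock<string> block = null;

        block = new ActionBlock<string>(async url =>
        {
            await Task.Delay(10);                    // stand-in for the real fetch
            string[] links = Array.Empty<string>();  // stand-in for response.links

            foreach (var link in links)
            {
                Interlocked.Increment(ref pending);
                block.Post(link);                    // feed nested links back in
            }

            // Question 3: when the last in-flight url finishes, we are done.
            if (Interlocked.Decrement(ref pending) == 0)
                done.TrySetResult(true);
        },
        new ExecutionDataflowBlockOptions
        {
            // One limit for the entire crawl, not +10 per nested call:
            // the block owns the concurrency, however deep the links go.
            MaxDegreeOfParallelism = 10
        });

        block.Post("http://example.com");
        await done.Task;
        Console.WriteLine("all crawls finished");
    }
}
```

Because every url (nested or not) flows through the same block, the degree of parallelism is enforced once, globally, and the block's async delegate keeps the IO waits off the thread pool.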


This subject is a little above a simple working example, but I can assure you it's an appropriate path for what you are doing. What I suggest is that you have a look at the following links.

In summary, there are many ways to do what you want to do, and there are many technologies. But the main thing is that you have some very skewed ideas about parallel programming. You need to hit the books and the blogs, and start building some really solid design principles from the ground up, instead of trying to figure this all out by nitpicking small bits of information.