How to read a big excel file using multithreaded programming in Coldfusion/Lucee?

333 views Asked by At

I am trying to read a big excel file with using 10 threads as shown in the code bellow:

<cfset var _Qry = 0 />

<cfloop from="1" to="10" index="idx">
    <cfthread name="Thread#idx#" action="run" src="#arguments._file#">                                      
           <cfspreadsheet 
                 action="read" 
                 src="#src#" 
                 sheet="1" 
                 query="_Qry" 
                 headerrow="1" 
                 excludeHeaderRow="true">
    </cfthread>
</cfloop>


<cfloop from="1" to="10" index="idx">
    <cfthread name="Thread#idx#" action="join" />               
</cfloop>

<cfdump var="#_Qry#">

But when I'm dumping the _Qry variable it is not resulting the expected query read from the file. It dumps 0. I'm suspecting it's a Thread scope or thread result merging issue which I have no idea how to get around with.

Thanks for your kind help :)

1

There are 1 answers

0
isapir On

Multi-Threading is Not the Solution to Every Problem

There are multiple issues here

1) Unless you have a distributed file system like HDFS, for example, this is not a good approach to the problem. Only one thread can read the file from a standard file system at a time, and the bottleneck is the file system and not the CPU.

For Big Data there are solutions like Hadoop, which contains HDFS, and Apache Spark. Distributed file systems like HDFS were designed specifically to address this issue, so that you can read huge files in parallel, where you read a different portion of the file from a separate node with a separate disk.

2) Your code snippet "attempts" to read the whole file in each thread.

That means that if you had a distributed file system you would not gain anything because you are reading the whole file from each node. On a regular file system this is much worse -- you will likely read the file up to 10 times because of the bottleneck mentioned above.

It will probably not be the full 10 times thanks to optimizations and caching that most file systems employ, but it will always be slower than reading the file normally.

3) When you call <cfthread query="_qry" ...> the _qry variable is set to the local scope of the thread and is out of scope when the thread completes its execution.

If you want to be able to access it after the thread completes, you need to set it to the THREAD Scope. Consider the following example:

<cfscript>
thread name="T1" {
  x = getTickCount();         // local scope in thread T1
  thread.y = round(x / 1000); // THREAD scope in thread T1
}

thread action="join";

dump(cfthread.T1.y);          // will show the value of y from thread T1

//     dump(cfthread.T1.x);   // error, does not exist
//     dump(x);               // error, does not exist, this is what you have
</cfscript>

See it on trycf.com

So what can you do

If you do have multiple CPU cores, and if you determine that parsing the Excel file is a slow process, then you might be able to read the file in a single thread, but then split it into multiple parts and parse each of them in multiple threads.

That may or may not be more efficient, since multi-threading adds overhead, and you will have to split the data and then put it back together in the end, so YMMV - you have to do some testing for that.

See also

I recommend reading the related blog posts that I published on Rasia.io :

Easy Parallelism in Lucee

Safe Concurrency with Lucee

Using Java to Read Part of a Large File Efficiently