Is the VBScript RegExp lookahead known to have problems with text files exceeding 5 MB?


I would like to know why the following regular expressions:

\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT\.)

and:

\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT\.|HOW WHO HOW WHO HOW\,\sWHO\sHOW\.)

seem to work perfectly fine on the following test string:

THIS THAT THIS THAT THIS,
THAT
THIS.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

WHAT WHERE WHAT WHERE WHAT,
WHERE
WHAT.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

HOW WHO HOW WHO HOW,
WHO
HOW.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

IF OR IF OR IF.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

TO FOR TO FOR
TO FOR TO FOR.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

IN UNDER IN
UNDER IN UNDER.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

LEFT RIGHT LEFT
RIGHT LEFT.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

UP DOWN UP DOWN UP
DOWN.

CHAPTER 1

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 2

Text text text 2 text text text 3 text text text 4 text text text.

CHAPTER 3

Text text text 2 text text text 3 text text text 4 text text text.

THE END.

However, when I use the same type of expression on files that exceed 5 MB, it fails.

The VBScript that I am using is as follows:

Option Explicit

Dim strPath : strPath = "myFile.txt"

If Instr(1, WScript.FullName, "CScript", vbTextCompare) = 0 Then
    With CreateObject("WScript.Shell")
        .Run "cmd.exe /k cscript //nologo """ & WScript.ScriptFullName & """", 1, False
        WScript.Quit
    End With
Else
    With CreateObject("Scripting.FileSystemObject")
        If .FileExists(strPath) Then 
            Call Main(strPath)
        Else
            WScript.Echo "Input file doesn't exist"
        End If
    End With
End If

Private Sub Main(filePath)
    Dim TempDictionary, Books, Book, b
    Set TempDictionary = CreateObject("Scripting.Dictionary")
    Set Books = RegEx(GetFileContent(filePath),"\b\w{7}\b\s[1]\s[\S\s]+?THE SECOND BOOK OF MOSES")
    If Books.Count > 0 Then 
        For Each Book In Books 
            WScript.Echo Replace(Left(Book.Value,70),vbCrLf," ")
        Next 
    Else 
        WScript.Echo "Document didn't contain any valid books" 
        WScript.Quit 
    End If 
End Sub

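Private Function GetFileContent(filePath)
    ' Read the whole file as ANSI text (1 = ForReading, 0 = TristateFalse).
    Dim objFS, objFile, objTS
    Set objFS = CreateObject("Scripting.FileSystemObject")
    Set objFile = objFS.GetFile(filePath)
    Set objTS = objFile.OpenAsTextStream(1, 0)
    GetFileContent = objTS.Read(objFile.Size)
    objTS.Close ' release the file handle before dropping the reference
    Set objTS = Nothing
End Function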

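Private Function RegEx(str, pattern)
    ' Run the pattern once over the whole string and return the match collection.
    Dim objRE, Matches
    Set objRE = New RegExp
    objRE.Pattern = pattern
    objRE.Global = True
    Set Matches = objRE.Execute(str)
    ' Diagnostic output: True if at least one match was found (avoids a second scan).
    WScript.Echo (Matches.Count > 0)
    Set RegEx = Matches
End Function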

The regex editor that I am using to test the expressions is here: http://www.regexr.com/

Q: What are you trying to do?

A: I want to split any text file into several string chunks, using a regex that captures everything between two delimiter strings. The first delimiter is a fixed term, i.e. "CHAPTER 1". The second delimiter varies, but it is known in advance, so the possible values can be placed into an array and then parsed. The problem is that the lookahead (?=...) seems to either fail to stop the match or get stuck in a loop. I have been experimenting with the "|" operator, as you can see in the second regex at the start of this post. The small test string parses without any problem, but something goes wrong with the larger files that I am working with, and I don't know what.
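
To make the "array of known delimiters" idea concrete, here is a minimal sketch of how the lookahead could be assembled from such an array. The helper names EscapeLiteral and BuildPattern are hypothetical and not part of the script above. One assumption is built in: a single \s matches exactly one whitespace character, so if the real files use CRLF line endings, a heading that wraps across lines needs \s+ rather than \s. The sketch therefore turns every run of whitespace in a delimiter into \s+.

```vbscript
Option Explicit

' Hypothetical helper: escape regex metacharacters in a literal delimiter,
' then let any run of whitespace (spaces, CR, LF) match as \s+.
Function EscapeLiteral(text)
    Dim re, escaped
    Set re = New RegExp
    re.Global = True
    re.Pattern = "([.*+?^${}()|\[\]\\])"
    escaped = re.Replace(text, "\$1")      ' prefix each metacharacter with a backslash
    re.Pattern = "\s+"
    EscapeLiteral = re.Replace(escaped, "\s+")
End Function

' Hypothetical helper: join the escaped delimiters into one lookahead alternation.
Function BuildPattern(terminators)
    Dim i, parts()
    ReDim parts(UBound(terminators))
    For i = 0 To UBound(terminators)
        parts(i) = EscapeLiteral(terminators(i))
    Next
    BuildPattern = "\b\w{7}\b\s[1]\s[\S\s]+?(?=" & Join(parts, "|") & ")"
End Function

' Example delimiters taken from the test string; the line breaks are written as vbCrLf.
Dim terminators
terminators = Array( _
    "WHAT WHERE WHAT WHERE WHAT," & vbCrLf & "WHERE" & vbCrLf & "WHAT.", _
    "HOW WHO HOW WHO HOW," & vbCrLf & "WHO" & vbCrLf & "HOW.")

WScript.Echo BuildPattern(terminators)
```

The resulting string could be passed as the second argument to the RegEx function above in place of the hard-coded pattern. Whether this changes the behaviour on the 5 MB files is untested.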
