I would like to know why the following regular expressions:
\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT.)
and:
\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT.|HOW WHO HOW WHO HOW\,\sWHO\sHOW\.)
seem to work perfectly fine on the following test string:
THIS THAT THIS THAT THIS,
THAT
THIS.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
WHAT WHERE WHAT WHERE WHAT,
WHERE
WHAT.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
HOW WHO HOW WHO HOW,
WHO
HOW.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
IF OR IF OR IF.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
TO FOR TO FOR
TO FOR TO FOR.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
IN UNDER IN
UNDER IN UNDER.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
LEFT RIGHT LEFT
RIGHT LEFT.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
UP DOWN UP DOWN UP
DOWN.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
THE END.
But when I use the same type of expression on files larger than 5 MB, it fails.
The VBScript that I am using is as follows:
Option Explicit
Dim strPath : strPath = "myFile.txt"
If Instr(1, WScript.FullName, "CScript", vbTextCompare) = 0 Then
    With CreateObject("WScript.Shell")
        .Run "cmd.exe /k cscript //nologo """ & WScript.ScriptFullName & """", 1, False
        WScript.Quit
    End With
Else
    With CreateObject("Scripting.FileSystemObject")
        If .FileExists(strPath) Then
            Call Main(strPath)
        Else
            WScript.Echo "Input file doesn't exists"
        End If
    End With
End If
Private Sub Main(filePath)
    Dim TempDictionary, Books, Book, b
    Set TempDictionary = CreateObject("Scripting.Dictionary")
    Set Books = RegEx(GetFileContent(filePath),"\b\w{7}\b\s[1]\s[\S\s]+?THE SECOND BOOK OF MOSES")
    If Books.Count > 0 Then
        For Each Book In Books
            WScript.Echo Replace(Left(Book.Value,70),vbCrLf," ")
        Next
    Else
        WScript.Echo "Document didn't contain any valid books"
        WScript.Quit
    End If
End Sub
Private Function GetFileContent(filePath)
    Dim objFS, objFile, objTS
    Set objFS = CreateObject("Scripting.FileSystemObject")
    Set objFile = objFS.GetFile(filePath)
    Set objTS = objFile.OpenAsTextStream(1, 0)
    GetFileContent = objTS.Read(objFile.Size)
    Set objTS = Nothing
End Function
Private Function RegEx(str,pattern)
    Dim objRE, Match, Matches
    Set objRE = New RegExp
    objRE.Pattern = pattern
    objRE.Global = True
    Set RegEx = objRE.Execute(str)
    WScript.Echo objRE.Test(str)
End Function
The editor that I am using is here: http://www.regexr.com/
Q: What are you trying to do?
A: I want to be able to split any text file into several string chunks, using a regex that captures everything between two strings. The first delimiter is a fixed term, i.e. "CHAPTER 1". The second delimiter is not fixed and changes from section to section, but it is known, so the possible values can be placed into an array and parsed. The problem I am having is that the lookahead (?=) seems either to escape or to get stuck in a loop. I have been experimenting with the "|" operator, as you can see in the second regex at the start of this post. The test string above parses without any problem, but with the larger files that I am working with, something goes wrong.
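To make the "array of known delimiters" idea concrete, here is a minimal sketch of how the lookahead alternation might be assembled from such an array. The function names EscapeForRegEx and BuildPattern are hypothetical, and the two headings are taken from the test string above. One assumption differs from the hand-written patterns: every space in a heading is replaced with \s+ rather than a single space or a single \s, so that a heading split across lines still matches whether the file uses LF or CRLF line endings.

```vbscript
Option Explicit

' Hypothetical helper: escape regex metacharacters so that a heading
' such as "WHAT." is matched literally instead of as "WHAT" + any char.
Function EscapeForRegEx(text)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "([\\\^\$\.\|\?\*\+\(\)\[\]\{\}])"
    EscapeForRegEx = re.Replace(text, "\$1")
End Function

' Hypothetical helper: build the full pattern from an array of known
' closing headings. Each heading becomes one branch of the lookahead.
Function BuildPattern(headings)
    Dim i, parts()
    ReDim parts(UBound(headings))
    For i = 0 To UBound(headings)
        ' Assumption: any run of whitespace (spaces, CR, LF) may separate words.
        parts(i) = Replace(EscapeForRegEx(headings(i)), " ", "\s+")
    Next
    BuildPattern = "\b\w{7}\b\s[1]\s[\S\s]+?(?=" & Join(parts, "|") & ")"
End Function

Dim headings
headings = Array("WHAT WHERE WHAT WHERE WHAT, WHERE WHAT.", _
                 "HOW WHO HOW WHO HOW, WHO HOW.")
WScript.Echo BuildPattern(headings)
```

The resulting string could be passed as the second argument of the RegEx function in the script above. Whether this changes the behaviour on the large files is untested.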