Using Scanner to display entire line if the line contains any part of a string match (Java)

I am working on a project that analyzes hex dumps for specific file signatures. When I try to analyze large dumps (16+ GB), I get an OutOfMemoryError: Java heap space error, so I think I need to redesign the algorithm I am using.

Right now my code looks something like this:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class Test
{
     private static ArrayList<String> JPGHeaders = new ArrayList<String>();
     private static ArrayList<String> JPGTrailers = new ArrayList<String>();
     private static ArrayList<String> entireTextFile = new ArrayList<String>();

     public static void main (String[] args) throws FileNotFoundException
     {
         Scanner scanner = new Scanner(new File("C:\\HexAnalyser\\HexDump\\fileTest.txt"));

         // First pass: every line of the dump is stored in memory
         while (scanner.hasNextLine())
         {
             entireTextFile.add(scanner.nextLine());
         }
         scanner.close();

         // Second pass: search the stored lines for the signatures
         for (String line : entireTextFile)
         {
             if(line.contains(Constants.JPGHEADER))
             {
                JPGHeaders.add(line);
             }

             if(line.contains(Constants.JPGTRAILER))
             {
                JPGTrailers.add(line);
             }
         }

     }
}

So it adds the entire file to the entireTextFile ArrayList and then searches that ArrayList for the specific file headers and trailers.

For those of you who don't know what a typical hex dump looks like, it's similar to:

0012be0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0012bf0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0012c00: ffd8 ffe0 0010 4a46 4946 0001 0201 0050  ......JFIF.....P
0012c10: 0050 0000 ffed 166e 5068 6f74 6f73 686f  .P.....nPhotosho
0012c20: 7020 332e 3000 3842 494d 03ed 0000 0000  p 3.0.8BIM......
0012c30: 0010 0050 0000 0001 0001 0050 0000 0001  ...P.......P....
0012c40: 0001 3842 494d 040d 0000 0000 0004 0000  ..8BIM..........
0012c50: 002d 3842 494d 03f3 0000 0000 0008 0000  .-8BIM..........

Since the header for a JPEG is "ffd8 ffe0", the only line I would want added to my JPGHeaders ArrayList is:

0012c00: ffd8 ffe0 0010 4a46 4946 0001 0201 0050  ......JFIF.....P

I know this is similar to grep in Linux, but I'm doing this for a Java project in Eclipse on Windows. Is there an easier way to search each line of the file while it's being scanned and add the matching lines to the corresponding ArrayList? Or am I stuck reading the entire file into an ArrayList and then searching through that ArrayList for string literals?

RealSkeptic On BEST ANSWER
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class Test
{
     private static ArrayList<String> JPGHeaders = new ArrayList<String>();
     private static ArrayList<String> JPGTrailers = new ArrayList<String>();

     public static void main (String[] args) throws FileNotFoundException
     {
         // try-with-resources closes the Scanner (and the file) when the loop is done
         try (Scanner scanner = new Scanner(new File("C:\\HexAnalyser\\HexDump\\fileTest.txt")))
         {
             while (scanner.hasNextLine())
             {
                 // Check each line as soon as it is read; lines that don't match are never stored
                 String line = scanner.nextLine();
                 if(line.contains(Constants.JPGHEADER))
                 {
                    JPGHeaders.add(line);
                 }

                 if(line.contains(Constants.JPGTRAILER))
                 {
                    JPGTrailers.add(line);
                 }
             }
         }

     }
}

Why keep the whole thing in memory? As soon as you read the line, analyze it. If it is not relevant, discard it.
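
If you want to try the same read-and-discard idea outside of Scanner, it can be sketched with a BufferedReader. This is only a minimal sketch under assumptions: the class name StreamingSignatureScan and the helper matchingLines are made up for illustration, and the signature is passed in as a plain string instead of coming from your Constants class. The helper takes a Reader so it can be tried on a small in-memory sample; for the real dump you would pass something like new FileReader(path) instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class StreamingSignatureScan {

    // Hypothetical helper (not from the question): reads one line at a time
    // and keeps only the lines that contain the given signature.
    static List<String> matchingLines(Reader source, String signature) throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(signature)) {
                    matches.add(line);
                }
                // Non-matching lines are not stored anywhere, so they can be garbage collected.
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Three lines taken from the sample dump in the question.
        String sample =
              "0012bf0: 0000 0000 0000 0000 0000 0000 0000 0000  ................\n"
            + "0012c00: ffd8 ffe0 0010 4a46 4946 0001 0201 0050  ......JFIF.....P\n"
            + "0012c10: 0050 0000 ffed 166e 5068 6f74 6f73 686f  .P.....nPhotosho\n";

        List<String> headers = matchingLines(new StringReader(sample), "ffd8 ffe0");
        for (String header : headers) {
            System.out.println(header); // only the 0012c00 line matches
        }
    }
}
```

Either way, memory use then depends on how many lines match, not on the size of the dump.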