I have the PAF raw data in several files (list of all addresses in the UK).
My goal is to create a PostCode lookup in our software.
I have created a new database but there is no need to understand it for the moment.
Let's take a file, his extension is ".c01" and can be open with a text editor. The data in this file are in the following format :
0000000123A
With (according to the developer guide), 8 char for the KEY, 50 char for the NAME.
This file contains 2,449,652 rows (it's a small one !)
I create a Parsing class for this
private class SerializedBuilding
{
public int Key
{
get; set;
}
public string Name
{
get; set;
}
public bool isValid = false;
public Building ToBuilding()
{
Building b = new Building();
b.BuildingKey = Key;
b.BuildingName = Name;
return b;
}
private readonly int KEYLENGTH = 8;
private readonly int NAMELENGTH = 50;
public SerializedBuilding(String line)
{
string KeyStr = null;
string Name = null;
try
{
KeyStr = line.Substring(0, KEYLENGTH);
}
catch (Exception e)
{
Console.WriteLine("erreur parsing key line " + line);
return;
}
try
{
Name = line.Substring(KEYLENGTH - 1, NAMELENGTH);
}
catch (Exception e)
{
Console.WriteLine("erreur parsing name line " + line);
return;
}
int value;
if (!Int32.TryParse(KeyStr, out value))
return;
if (value == 0 || value == 99999999)
return;
this.Name = Name;
this.Key = value;
this.isValid = true;
}
}
I use this method to read the file
public void start()
{
AddressDataContext d = new AddressDataContext();
Count = 0;
string line;
// Read the file and display it line by line.
System.IO.StreamReader file =
new System.IO.StreamReader(filename);
SerializedBuilding sb = null;
Console.WriteLine("Number of line detected : " + File.ReadLines(filename).Count());
while ((line = file.ReadLine()) != null)
{
sb = new SerializedBuilding(line);
if (sb.isValid)
{
d.Buildings.InsertOnSubmit(sb.ToBuilding());
if (Count % 100 == 0)
d.SubmitChanges();
}
Count++;
}
d.SubmitChanges();
file.Close();
Console.WriteLine("building added");
}
I use Linq to SQL classes to insert data to my database. The connection string is the default one.
This seems to work, I have added 67200 lines. It just crashed but my questions are not about that.
My estimations :
- 33,647,015 rows to parse
Time needed for execution : 13 hours
It's a one-time job (just needs to be done on my sql and on the client server later) so I don't really care about performances but I think it can be interesting to know how it can be improved.
My questions are :
- Is
readline()
andsubstring()
the most powerful ways to read these huge files ? - Can the performance be improved by modifying the connection string ?