Reading complex and uneven data from text using Java


I have to read text from a file whose lines are uneven and a little complex. Basically, each line is in this order:

Index . word / DOC_id : position1 position2 (....and so on), DOC_id : position1 position2 (....and so on),

So a word can appear in any number of documents, and any number of times within a document. As an example, I am copying a small section of the file; I cannot include words that occur too many times because of space constraints.

Example:

13137 . speeding / D85 : 5999  , 
13138 . spell / D53 : 1513  , 
13139 . spelling / D3 : 344 351  , 
13140 . spending / D71 : 398  , 
13141 . spiderman / D60 : 650 733 997 1023 1053 1133 1152 1169  , 
13142 . spiders / D75 : 704  , D91 : 19834  ,
(...and so on)

Could anyone please help me with this? Also, since I generated this file myself, could I format it in a better way? Maybe I can reformat it and generate a better-formatted text file.

Thank You :)

There is 1 answer

Vino (best answer)

Perhaps you should use the newline as the delimiter between entries. Here's what I mean:

13137 . speeding / D85 : 5999
13138 . spell / D53 : 1513 
13139 . spelling / D3 : 344 351
13140 . spending / D71 : 398
13141 . spiderman / D60 : 650 733 997 1023 1053 1133 1152 1169
13142 . spiders / D75 : 704 , D91 : 19834

In other words, a format of the following nature:

Index . word / DOC_id : position1 position2 ... , DOC_id : position1 ...
Index . word / DOC_id : position1 position2 ... , DOC_id : position1 ...
Index . word / DOC_id : position1 position2 ... , DOC_id : position1 ...
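
With one entry per line, reading the file can be as simple as collecting the non-blank lines. The following is only a minimal sketch: the class name IndexLineReader and the method readEntries are made up for illustration, and the sample text stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not from the original post): one index entry per line.
final class IndexLineReader {

    // Returns every non-blank line, trimmed. Each returned string is one
    // "Index . word / DOC_id : positions" entry in the newline-delimited format.
    static List<String> readEntries(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            // Closing the reader can fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: for a real file you would pass something like
        // Files.newBufferedReader(Paths.get("index.txt")); the name is a placeholder.
        String sample = "13139 . spelling / D3 : 344 351\n"
                + "\n"
                + "13142 . spiders / D75 : 704 , D91 : 19834\n";
        for (String entry : readEntries(new StringReader(sample))) {
            System.out.println(entry);
        }
    }
}
```

Taking a Reader rather than a file name keeps the sketch easy to try with an in-memory string before pointing it at the actual file.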

Edit

Now that you can retrieve one line at a time, push each line into a Scanner or StringTokenizer, or use String.split, remembering that whitespace will be used as a delimiter. Parse each token while keeping track of the separators '.', '/', ':' and ','. You already know the format of each line and which separators are used; use that information and proceed.
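
One possible way to do this with String.split is sketched below. It is not the only approach, and it rests on assumptions: the class IndexEntry and its fields are hypothetical names, the word itself is assumed to contain no '/' or '.', and DOC_ids are assumed to contain no ':' or ','. A trailing ',' (as in the original file) is tolerated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for one parsed line; all names are illustrative.
final class IndexEntry {
    final int index;
    final String word;
    // DOC_id -> positions, in the order the documents appear on the line.
    final Map<String, List<Integer>> postings;

    IndexEntry(int index, String word, Map<String, List<Integer>> postings) {
        this.index = index;
        this.word = word;
        this.postings = postings;
    }

    // Parses "Index . word / DOC_id : p1 p2 ... , DOC_id : p1 ..." into an IndexEntry.
    static IndexEntry parse(String line) {
        // Split once on '/': left side is "Index . word", right side is the postings.
        String[] headAndTail = line.split("/", 2);
        if (headAndTail.length != 2) {
            throw new IllegalArgumentException("Missing '/': " + line);
        }
        // Split the left side once on '.' to separate the index from the word.
        String[] indexAndWord = headAndTail[0].split("\\.", 2);
        if (indexAndWord.length != 2) {
            throw new IllegalArgumentException("Missing '.': " + line);
        }
        int index = Integer.parseInt(indexAndWord[0].trim());
        String word = indexAndWord[1].trim();

        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (String group : headAndTail[1].split(",")) {
            if (group.trim().isEmpty()) {
                continue; // left over from a trailing ','
            }
            String[] docAndPositions = group.split(":", 2);
            if (docAndPositions.length != 2) {
                throw new IllegalArgumentException("Missing ':' in: " + group);
            }
            List<Integer> positions = new ArrayList<>();
            // Positions are separated by one or more whitespace characters.
            for (String token : docAndPositions[1].trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    positions.add(Integer.parseInt(token));
                }
            }
            postings.put(docAndPositions[0].trim(), positions);
        }
        return new IndexEntry(index, word, postings);
    }

    public static void main(String[] args) {
        IndexEntry entry = parse("13142 . spiders / D75 : 704  , D91 : 19834  ,");
        // prints: 13142 spiders {D75=[704], D91=[19834]}
        System.out.println(entry.index + " " + entry.word + " " + entry.postings);
    }
}
```

Splitting on the structural separators first ('/', then '.', ',' and ':') and on whitespace last means the uneven number of positions per document needs no special handling.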