I have to read text from a file whose lines are irregular and a little complex. Basically, they are in this order:
Index . word / DOC_id : position1 position2 (....and so on), DOC_id : position1 position2 (....and so on),
So a word could appear in any number of documents, and any number of times within a document. As an example, I am copying a small section of the file; I cannot include words that occur too many times because of space constraints.
Example:
13137 . speeding / D85 : 5999 ,
13138 . spell / D53 : 1513 ,
13139 . spelling / D3 : 344 351 ,
13140 . spending / D71 : 398 ,
13141 . spiderman / D60 : 650 733 997 1023 1053 1133 1152 1169 ,
13142 . spiders / D75 : 704 , D91 : 19834 ,
(...and so on)
Could anyone please help me with this? Also, could I format the file in a better way? I generated this file myself, so maybe I can reformat it and generate a better-formatted text file.
Thank You :)
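Perhaps you should use the newline as the delimiter, so that each word's entry sits on its own line.
In other words, a format of the following nature, with one entry per line:
Index . word / DOC_id : position1 position2 ... , DOC_id : position1 position2 ... ,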
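Assuming the file is written with one entry per line, retrieving the entries could look roughly like the sketch below. The class name `LineReaderSketch` and the helper `readEntries` are made up for illustration, and the sample text is taken from the question; this is only one way to do it, not the definitive one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class LineReaderSketch {
    // Hypothetical helper: returns every non-blank line, trimmed.
    // Assumption: each line holds exactly one index entry.
    static List<String> readEntries(BufferedReader reader) {
        return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample so the sketch runs without a file on disk.
        String sample = "13139 . spelling / D3 : 344 351 ,\n"
                      + "\n"
                      + "13142 . spiders / D75 : 704 , D91 : 19834 ,\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String entry : readEntries(reader)) {
                System.out.println(entry);
            }
        }
    }
}
```

For a real file, the `StringReader` could be swapped for something like `Files.newBufferedReader(Paths.get("index.txt"))`, where `index.txt` is a placeholder name.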
Edit
Now that you can retrieve one line at a time, push them into a `Scanner` or `StringTokenizer`, or even use `String.split`, remembering that whitespace will be used as a delimiter. Parse through each token, keeping track of `.`, `/`, `:` and `,`. You already know the format of each line and what separators are used; use that information and proceed.
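As a rough illustration of the token-by-token approach, here is a minimal sketch using `String.split` on whitespace. It assumes every separator is surrounded by spaces, as in the sample lines from the question, and that a word contains no whitespace. The names `PostingLineParser`, `Entry`, and `parseLine` are invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PostingLineParser {
    /** One parsed line: index number, word, and the positions per document id. */
    static final class Entry {
        final int index;
        final String word;
        final Map<String, List<Integer>> positionsByDoc = new LinkedHashMap<>();

        Entry(int index, String word) {
            this.index = index;
            this.word = word;
        }
    }

    // Assumed layout: index . word / DOC_id : pos pos ... , DOC_id : pos ... ,
    static Entry parseLine(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length < 4 || !t[1].equals(".") || !t[3].equals("/")) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        Entry entry = new Entry(Integer.parseInt(t[0]), t[2]);
        int i = 4;
        while (i < t.length) {
            String docId = t[i++];
            if (i >= t.length || !t[i].equals(":")) {
                throw new IllegalArgumentException("Expected ':' after " + docId);
            }
            i++; // skip the ":" that follows the document id
            List<Integer> positions = new ArrayList<>();
            while (i < t.length && !t[i].equals(",")) {
                positions.add(Integer.parseInt(t[i++]));
            }
            i++; // skip the "," that closes this document's position list
            entry.positionsByDoc.put(docId, positions);
        }
        return entry;
    }

    public static void main(String[] args) {
        Entry e = parseLine("13142 . spiders / D75 : 704 , D91 : 19834 ,");
        // prints: 13142 spiders {D75=[704], D91=[19834]}
        System.out.println(e.index + " " + e.word + " " + e.positionsByDoc);
    }
}
```

A map from document id to a list of positions is only one possible representation; if the same document id could appear twice on one line, the later list would replace the earlier one in this sketch.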