I'm doing my project on text categorization. I've got a text categorization test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line; the DTD file lewis.dtd is included in the distribution. Following the document type declaration line are individual Reuters articles marked up with SGML tags.
I need help writing a Java program to read those 21578 documents or transform them into 21578 separate text files.
Can somebody please help me?
Although this is a very old post, I'm answering for anyone who needs this in the future, because I struggled a lot before getting it to work this way. I can't say it's the most suitable approach or a good solution, but it served the purpose, and it has been running continuously as a batch process for the last 6 months. I wrote some custom code to read and parse the SGML files, and it handled even quite large files successfully. The output format, though, has a different structure, as required in my case. You can have a look, and if it seems useful you can tweak it to suit your needs. Please have a look here
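For anyone who just wants the split described in the question, here is a minimal sketch of one way to do it in Java. It is not the code referred to above, only an illustration under a few assumptions: each article is wrapped in a `<REUTERS ...>` ... `</REUTERS>` element with a numeric `NEWID` attribute and a `<TEXT>` element inside it, the files can be read safely as ISO-8859-1, and each article is written to `<NEWID>.txt`. The class name `ReutersSplitter` and its method names are made up for this example, and the regex matching is a shortcut rather than a real SGML parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: splits Reuters-21578 .sgm files into one text file per article.
public class ReutersSplitter {

    // One <REUTERS ...> ... </REUTERS> element per article; DOTALL lets '.' match newlines.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS\\b([^>]*)>(.*?)</REUTERS>", Pattern.DOTALL);
    // Assumption: NEWID is a quoted numeric attribute on the opening REUTERS tag.
    private static final Pattern NEWID = Pattern.compile("NEWID=\"(\\d+)\"");
    // The TEXT element holds the title, dateline and body markup.
    private static final Pattern TEXT =
            Pattern.compile("<TEXT\\b[^>]*>(.*?)</TEXT>", Pattern.DOTALL);

    /** Maps each article's NEWID to the content of its TEXT element with markup removed. */
    static Map<String, String> extractArticles(String sgml) {
        Map<String, String> articles = new LinkedHashMap<>();
        Matcher article = ARTICLE.matcher(sgml);
        while (article.find()) {
            Matcher id = NEWID.matcher(article.group(1));
            Matcher text = TEXT.matcher(article.group(2));
            if (id.find() && text.find()) {
                articles.put(id.group(1), stripMarkup(text.group(1)));
            }
        }
        return articles;
    }

    /** Removes tags, drops numeric character references, and decodes a few common entities. */
    static String stripMarkup(String s) {
        return s.replaceAll("<[^>]+>", " ")
                .replaceAll("&#\\d+;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&")
                .trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java ReutersSplitter <dir-with-sgm-files> <output-dir>");
            return;
        }
        Path inputDir = Paths.get(args[0]);
        Path outputDir = Paths.get(args[1]);
        Files.createDirectories(outputDir);

        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.sgm")) {
            for (Path file : files) {
                // Assumption: ISO-8859-1 maps every byte to a character, so decoding never fails.
                String sgml = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
                for (Map.Entry<String, String> e : extractArticles(sgml).entrySet()) {
                    Path out = outputDir.resolve(e.getKey() + ".txt");
                    Files.write(out, e.getValue().getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }
}
```

Run it as `java ReutersSplitter path/to/reuters21578 path/to/output` (both paths are placeholders). Each file is read fully into memory, which should be fine at 1000 articles per file. If you only want the article body, matching `<BODY>` instead of `<TEXT>` in the `TEXT` pattern should work, but articles that have no body element would then be skipped.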