Read words/phrases from a file, make distinct, and write out to another file, in Java


I have a text file with a word or phrase on each line. How do I:

  • Read those phrases into memory,
  • Make distinct (eliminate duplicates),
  • Sort alphabetically,
  • Write the results back out to a file?

Stack Overflow has answers to similar questions for other languages such as C, PHP, Python, Prolog, and VB6, but I cannot find one for Java.


There are 2 answers

assylias On BEST ANSWER

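You can leverage the java.nio.file API (Files and Paths, available since Java 7) and the Stream API (available since Java 8) to solve the problem elegantly: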

public static void main(String... args) {
  Path input = Paths.get("/Users/yourUser/yourInputFile.txt");
  Path output = Paths.get("/Users/yourUser/yourOutputFile.txt");

  try {
    List<String> words = getDistinctSortedWords(input);
    Files.write(output, words, UTF_8);
  } catch (IOException e) {
    // Log the error and/or warn the user.
  }
}

private static List<String> getDistinctSortedWords(Path path) throws IOException {
  try(Stream<String> lines = Files.lines(path, UTF_8)) {
    return lines.map(String::trim)
            .filter(s -> !s.isEmpty()) // If keyword is not empty, collect it.
            .distinct()
            .sorted()
            .collect(toList());
  }
}

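Note: requires the following imports (the last two are static):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

import static java.nio.charset.StandardCharsets.UTF_8;
import static java.util.stream.Collectors.toList;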
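
One caveat on "alphabetically": sorted() with no argument uses the natural ordering of String, which compares UTF-16 code units, so every uppercase ASCII letter sorts before every lowercase one. If case should be ignored, a comparator can be passed instead. A minimal sketch, assuming a hypothetical class name (SortOrderSketch) and in-memory sample data standing in for the file:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SortOrderSketch {

    // Natural ordering: compares UTF-16 code units, so "Zebra" sorts before "apple".
    static List<String> naturalOrder(List<String> words) {
        return words.stream()
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Case-insensitive ordering, using the comparator built into String.
    static List<String> ignoringCase(List<String> words) {
        return words.stream()
                .distinct()
                .sorted(String.CASE_INSENSITIVE_ORDER)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data in place of the lines read from the file.
        List<String> words = Arrays.asList("banana", "Zebra", "apple", "banana");
        System.out.println(naturalOrder(words)); // [Zebra, apple, banana]
        System.out.println(ignoringCase(words)); // [apple, banana, Zebra]
    }
}
```

Note that distinct() is still case-sensitive in both variants, so "Apple" and "apple" would both be kept. For language-specific rules such as accented letters, a java.text.Collator can be passed to sorted() in the same way.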
Basil Bourque On

The following Java 8 example code should help. It is neither robust nor well tested, and it assumes all your data fits easily in memory. While not perfect, it should get you going in the right direction.

The Java Collections Framework defines the Set interface for a collection that holds no duplicate values. As an implementation of that interface, we will use HashSet.

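Note that we make an attempt to trim any white space from each line. Unfortunately, the String::trim method does not do a thorough job of this: it removes only leading and trailing characters whose code point is U+0020 or lower, so Unicode spaces such as EM SPACE (U+2003) are left in place. In real work I would replace that call with a call to a better library for trimming, such as Google Guava. On Java 11 or later, String::strip also removes Unicode whitespace as defined by Character.isWhitespace. You probably want to remove all non-printing characters.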
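
To illustrate the limitation of trim, here is a minimal sketch of a broader trim that uses only the JDK and runs on Java 8. The class and method names (TrimSketch, isBlankChar, trimUnicode) are hypothetical, and which characters count as "blank" is an assumption you may want to adjust for your data:

```java
class TrimSketch {

    // Assumption: treat Java whitespace, Unicode space separators
    // (including no-break spaces), and ISO control characters as trimmable.
    static boolean isBlankChar(int codePoint) {
        return Character.isWhitespace(codePoint)
                || Character.isSpaceChar(codePoint)
                || Character.isISOControl(codePoint);
    }

    // Removes leading and trailing blank characters, one code point at a time.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isBlankChar(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isBlankChar(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // EM SPACE in front, NO-BREAK SPACE at the end.
        String padded = "\u2003hello world\u00A0";
        System.out.println(padded.trim().length());       // 13: trim() removed nothing
        System.out.println(trimUnicode(padded).length()); // 11
    }
}
```

In the reading loop, trimUnicode( line ) would take the place of line.trim( ). Format characters such as the zero-width space are not covered by this sketch.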

We test each string with the isEmpty method to filter out any strings without any characters.

The following code uses the Try-With-Resources syntax to automatically close any open file.

// Read file.
// For each line, trim string. 
// Add to Set to make collection of distinct values.
// Convert to List, and sort.
// Write back values to text file.

Set< String > set = new HashSet<>( );

String path = "/Users/yourUser/yourInputFile.txt";
File file = new File( path );
try ( BufferedReader br = new BufferedReader( new FileReader( file ) ) ) {
    String line;
    while ( ( line = br.readLine( ) ) != null ) {
        // Process the line.
        String keyword = line.trim( );
        if ( ! keyword.isEmpty( ) ) {  // If keyword is not empty, collect it.
            set.add( keyword );
        }
    }
} catch ( IOException e ) {
    e.printStackTrace( );
}

Now we have a Set of distinct values. To sort them, we need to convert to a List. The Collections class (notice the "s") offers a static method for sorting a List of items that implement Comparable.

List< String > keywords = new ArrayList<>( set );
Collections.sort( keywords );
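
As an alternative to collecting into a HashSet and then sorting a List, a TreeSet keeps its elements both distinct and sorted as they are added. This swaps in a different technique from the one used above. A minimal sketch, with a hypothetical class name (TreeSetSketch) and sample data standing in for the lines read from the file:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TreeSetSketch {

    // Trims each line, skips empty ones, and returns the distinct values
    // in natural (case-sensitive) order.
    static List<String> distinctSorted(Iterable<String> lines) {
        Set<String> set = new TreeSet<>(); // sorted and duplicate-free by construction
        for (String line : lines) {
            String keyword = line.trim();
            if (!keyword.isEmpty()) {
                set.add(keyword);
            }
        }
        return new ArrayList<>(set); // iteration order of a TreeSet is its sort order
    }

    public static void main(String[] args) {
        // Hypothetical sample data in place of the file contents.
        List<String> lines = Arrays.asList("cherry", " apple ", "", "banana", "apple");
        System.out.println(distinctSorted(lines)); // [apple, banana, cherry]
    }
}
```

The trade-off is that each insertion into a TreeSet costs O(log n), whereas the HashSet approach defers all ordering work to a single sort at the end. For a file that fits easily in memory, either should be fine.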

Let’s write that sorted list to a file in storage.

// Write to file. 
// Use Try-With-Resources to automatically close the file if opened.
try ( 
        FileWriter writer = new FileWriter( "/Users/yourUser/yourOutputFile.txt" ) ;
) {
    for ( String k : keywords ) {
        writer.write( k + "\n" );  // You may want a different newline instead of the Unix-style LINE FEED hard-coded here with "\n".
    }
} catch ( IOException e ) {
    e.printStackTrace( );
}
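
If you would rather not hard-code the newline, Files.write accepts a list of lines and terminates each one with the platform's line separator, and it lets you name the character encoding explicitly. A minimal sketch, assuming UTF-8 output; the class name (WriteLinesSketch) is hypothetical and a temporary file stands in for the real output path:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class WriteLinesSketch {

    // Writes one keyword per line, each terminated by the platform line separator.
    static void writeLines(Path output, List<String> keywords) throws IOException {
        Files.write(output, keywords, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file is used here as a placeholder for the real output path.
        Path output = Files.createTempFile("keywords", ".txt");
        writeLines(output, Arrays.asList("apple", "banana"));
        System.out.println(Files.readAllLines(output, StandardCharsets.UTF_8)); // [apple, banana]
        Files.delete(output);
    }
}
```

With a Writer-based loop like the one above, wrapping the FileWriter in a BufferedWriter and calling newLine( ) after each keyword has the same effect on the line endings.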

See the results on the console, if you wish.

System.out.println( "Size: " + keywords.size( ) );
System.out.println( "keywords: " + keywords );
System.out.println( "Done." );