Java-Debugging und JDK 9 Java-Debugging und JDK 9

https://dzone.com/articles/jhsdb-a-new-tool-for-jdk-9?edition=304143&utm_source=Daily%20Digest&utm_medium=email&utm_campaign=dd%202017-06-07

Code Snippets Code Snippets

How to Read CSV Files in Java

CSV files are common, but String.split() doesn't have the flexibility to cover al the possible uses. Let's create a reader class to start getting your data organized.

· Java Zone

Comment (17)

Save
6,951 Views
What every Java engineer should know about microservices: Reactive Microservices Architecture.  Brought to you in partnership with Lightbend.

CSV files are extensively used in data interchange between applications. They're especially useful when the only structure to the data being exchanged is rows and columns. This format is particularly popular as the data can be imported into Microsoft Excel and used for charts and visualization.

In this article, we present an easy-to-use class for parsing and reading CSV data in Java. The class allows retrieval of each row of the CSV file as an array of columns. This row can then be processed further for filtering, inserting into a database, etc.

You might be thinking – Why not just use String.split() to split row data into fields? It is, after all, readily available and returns the data as an array. Well, the answer is, CSV parsing has a lot more nuances than is possible to handle using just String.split().

Excel Compatibility

The CSV format defines certain conventions which are commonly used in applications that import and export CSV data. One of the most common applications used for visualizing CSV data is Excel. Many applications including follow these conventions and hence a CSV reader must take these into consideration. Some of these are:

Commas are used to separate fields. And a Carriage-Return Line-Feed (CRLF) combination is used to separate rows.

When commas need to be included as a part of a field value, it must be quoted with double-quotes (").

Multi-line fields can be present in the CSV file and these fields must also be quoted.

Double quotes can be included within a field by repeating the double-quote character.

Read and Strip BOM

A CSV file generated from an application on Windows might include a BOM (Byte Order Mark) character at the very beginning of the file. This character, if present, can be used to determine the encoding of the file from among UTF-8, UTF-16BE (Big Ending) or UTF-16LE (Little Endian). The CSV Reader module (presented next) uses a routine to strip the BOM from the CSV file.

CSV Reader Class

Here is the class used to parse and read CSV data. Download it here.

 
package sample;
 

 
 
import java.io.InputStream;
 
import java.io.PushbackInputStream;
 
import java.io.InputStreamReader;
 
import java.io.Reader;
 
import java.io.BufferedReader;
 

 
 
import java.util.List;
 
import java.util.ArrayList;
 

 
 

 
 
public class CSV {
 
    static final private int NUMMARK = 10;
 
    static final private char COMMA = ',';
 
    static final private char DQUOTE = '"';
 
    static final private char CRETURN = '\r';
 
    static final private char LFEED = '\n';
 
    static final private char SQUOTE = '\'';
 
    static final private char COMMENT = '#';
 

 
 

 
 
    private boolean stripMultipleNewlines;
 

 
 

 
 
    private char separator;
 
    private ArrayList < String > fields;
 
    private boolean eofSeen;
 
    private Reader in ;
 

 
 
    static public Reader stripBom(InputStream in ) throws java.io.IOException, java.io.UnsupportedEncodingException {
 
        PushbackInputStream pin = new PushbackInputStream( in , 3);
 
        byte[] b = new byte[3];
 
        int len = pin.read(b, 0, b.length);
 
        if ((b[0] & 0xFF) == 0xEF && len == 3) {
 
            if ((b[1] & 0xFF) == 0xBB &&
 
                (b[2] & 0xFF) == 0xBF) {
 
                return new InputStreamReader(pin, "UTF-8");
 
            } else {
 
                pin.unread(b, 0, len);
 
            }
 
        } else if (len >= 2) {
 
            if ((b[0] & 0xFF) == 0xFE &&
 
                (b[1] & 0xFF) == 0xFF) {
 
                return new InputStreamReader(pin, "UTF-16BE");
 
            } else if ((b[0] & 0xFF) == 0xFF &&
 
                (b[1] & 0xFF) == 0xFE) {
 
                return new InputStreamReader(pin, "UTF-16LE");
 
            } else {
 
                pin.unread(b, 0, len);
 
            }
 
        } else if (len > 0) {
 
            pin.unread(b, 0, len);
 
        }
 
        return new InputStreamReader(pin, "UTF-8");
 
    }
 

 
 
    public CSV(boolean stripMultipleNewlines, char separator, Reader input) {
 
        this.stripMultipleNewlines = stripMultipleNewlines;
 
        this.separator = separator;
 
        this.fields = new ArrayList < String > ();
 
        this.eofSeen = false;
 
        this.in = new BufferedReader(input);
 
    }
 

 
 
    public CSV(boolean stripMultipleNewlines, char separator, InputStream input) throws java.io.IOException, java.io.UnsupportedEncodingException {
 
        this.stripMultipleNewlines = stripMultipleNewlines;
 
        this.separator = separator;
 
        this.fields = new ArrayList < String > ();
 
        this.eofSeen = false;
 
        this.in = new BufferedReader(stripBom(input));
 
    }
 

 
 
    public boolean hasNext() throws java.io.IOException {
 
        if (eofSeen) return false;
 
        fields.clear();
 
        eofSeen = split( in , fields);
 
        if (eofSeen) return !fields.isEmpty();
 
        else return true;
 
    }
 

 
 
    public List < String > next() {
 
        return fields;
 
    }
 

 
 

 
 
    static private boolean discardLinefeed(Reader in , boolean stripMultiple) throws java.io.IOException {
 
        if (stripMultiple) { in .mark(NUMMARK);
 
            int value = in .read();
 
            while (value != -1) {
 
                char c = (char) value;
 
                if (c != CRETURN && c != LFEED) { in .reset();
 
                    return false;
 
                } else { in .mark(NUMMARK);
 
                    value = in .read();
 
                }
 
            }
 
            return true;
 
        } else { in .mark(NUMMARK);
 
            int value = in .read();
 
            if (value == -1) return true;
 
            else if ((char) value != LFEED) in .reset();
 
            return false;
 
        }
 
    }
 

 
 
    private boolean skipComment(Reader in ) throws java.io.IOException {
 

 
 
        int value;
 
        while ((value = in .read()) != -1) {
 
            char c = (char) value;
 
            if (c == CRETURN)
 
                return discardLinefeed( in , stripMultipleNewlines);
 
        }
 
        return true;
 
    }
 

 
 

 
 
    private boolean split(Reader in , ArrayList < String > fields) throws java.io.IOException {
 
        StringBuilder sbuf = new StringBuilder();
 
        int value;
 
        while ((value = in .read()) != -1) {
 
            char c = (char) value;
 
            switch (c) {
 
                case CRETURN:
 
                    if (sbuf.length() > 0) {
 
                        fields.add(sbuf.toString());
 
                        sbuf.delete(0, sbuf.length());
 
                    }
 
                    return discardLinefeed( in , stripMultipleNewlines);
 

 
 
                case LFEED:
 
                    if (sbuf.length() > 0) {
 
                        fields.add(sbuf.toString());
 
                        sbuf.delete(0, sbuf.length());
 
                    }
 
                    if (stripMultipleNewlines)
 
                        return discardLinefeed( in , stripMultipleNewlines);
 
                    else return false;
 

 
 
                case DQUOTE:
 
                    {
 

 
 
                        while ((value = in .read()) != -1) {
 
                            c = (char) value;
 
                            if (c == DQUOTE) {
 

 
 

 
 
                                in .mark(NUMMARK);
 
                                if ((value = in .read()) == -1) {
 

 
 

 
 

 
 
                                    if (sbuf.length() > 0) {
 
                                        fields.add(sbuf.toString());
 
                                        sbuf.delete(0, sbuf.length());
 
                                    }
 
                                    return true;
 
                                } else if ((c = (char) value) == DQUOTE) {
 

 
 

 
 

 
 
                                    sbuf.append(DQUOTE);
 
                                } else if (c == CRETURN) {
 

 
 

 
 
                                    if (sbuf.length() > 0) {
 
                                        fields.add(sbuf.toString());
 
                                        sbuf.delete(0, sbuf.length());
 
                                    }
 

 
 

 
 
                                    return discardLinefeed( in ,
 
                                        stripMultipleNewlines);
 
                                } else if (c == LFEED) {
 

 
 

 
 
                                    if (sbuf.length() > 0) {
 
                                        fields.add(sbuf.toString());
 
                                        sbuf.delete(0, sbuf.length());
 
                                    }
 

 
 

 
 

 
 
                                    if (stripMultipleNewlines)
 
                                        return discardLinefeed( in , stripMultipleNewlines);
 
                                    else return false;
 
                                } else {
 

 
 

 
 

 
 

 
 

 
 

 
 
                                    in .reset();
 
                                    break;
 
                                }
 
                            } else {
 

 
 
                                sbuf.append(c);
 
                            }
 
                        }
 

 
 
                        if (value == -1) {
 

 
 

 
 
                            if (sbuf.length() > 0) {
 
                                fields.add(sbuf.toString());
 
                                sbuf.delete(0, sbuf.length());
 
                            }
 
                            return true;
 
                        }
 
                    }
 
                    break;
 

 
 
                default:
 
                    if (c == separator) {
 
                        fields.add(sbuf.toString());
 
                        sbuf.delete(0, sbuf.length());
 
                    } else {
 

 
 
                        if (c == COMMENT && fields.isEmpty() &&
 
                            sbuf.toString().trim().isEmpty()) {
 
                            boolean eof = skipComment( in );
 
                            if (eof) return eof;
 
                            else sbuf.delete(0, sbuf.length());
 

 
 
                        } else sbuf.append(c);
 
                    }
 
            }
 
        }
 
        if (sbuf.length() > 0) {
 
            fields.add(sbuf.toString());
 
            sbuf.delete(0, sbuf.length());
 
        }
 
        return true;
 
    }
 
}
 

 

Read CSV File Into an Array

The following code illustrates how to read a CSV file and load the rows and columns into a List. The file is opened with an InputStream so the Byte Order Marker can be automatically detected and discarded. Also, the first row of the CSV file is assumed to be column headers and is loaded into a separate array.

 
List < List < String >> rows = new ArrayList < > ();
 
try (InputStream in = new FileInputStream(csvFile);) {
 
    CSV csv = new CSV(true, ',', in );
 
    List < String > colNames = null;
 
    if (csv.hasNext()) colNames = new ArrayList < String > (csv.next());
 
    while (csv.hasNext()) {
 
        List < String > fields = new ArrayList < String > (csv.next());
 
        rows.add(fields);
 
    }
 
}
 

 

Parse CSV Data from a String

You can also use the class to read CSV data from a String by creating a StringReader instead of an InputStream. The following code reads and prints numbered columns from the CSV string.

 
try (Reader in = new StringReader(csvStr);) {
 
    CSV csv = new CSV(true, ',', in );
 
    int nrows = 0;
 
    while (csv.hasNext()) {
 
        List < String > fields = csv.next();
 
        for (int i = 0; i < fields.size(); i++) {
 
            System.out.printf("%-3d: %s%n", (i + 1), fields.get(i));
 
        }
 
        System.out.println();
 
        nrows++;
 
    }
 
}
 

 

Quoted Commas

Commas can be quoted with the double quote character and included in a field.

 
@Test
 
public void quotedCommaTest() throws Exception {
 
    try (Reader in = new StringReader("A,B,\"C,D\",e");) {
 
        CSV csv = new CSV(true, ',', in );
 
        assertTrue(csv.hasNext());
 
        List < String > fields = csv.next();
 
        assertTrue(fields.size() == 4);
 
        List < String > expected = Arrays.asList("A", "B", "C,D", "e");
 
        assertThat(fields, is(expected));
 
    }
 
}
 

 

Dealing with Double Quotes

Single instances of double-quotes are read and stripped away from the field value; they are assumed to quote text containing commas and new-lines.

 
@Test
 
public void quotesStrippedTest() throws Exception {
 
    try (Reader in = new StringReader("A,B,I said \"How are you?\",d");) {
 
        CSV csv = new CSV(true, ',', in );
 
        assertTrue(csv.hasNext());
 
        List < String > fields = csv.next();
 
        assertTrue(fields.size() == 4);
 
        List < String > expected = Arrays.asList("A",
 
            "B",
 
            "I said How are you?",
 
            "d");
 
        assertThat(fields, is(expected));
 
    }
 
}
 

 

To include double quotes as part of the field, it must be repeated and enclosed within double quotes.

 
@Test
 
public void includeQuotesTest() throws Exception {
 
    try (Reader in = new StringReader("A,B,\"I said \"\"How about you?\"\"\",d");) {
 
        CSV csv = new CSV(true, ',', in );
 
        assertTrue(csv.hasNext());
 
        List < String > fields = csv.next();
 
        assertTrue(fields.size() == 4);
 
        List < String > expected = Arrays.asList("A",
 
            "B",
 
            "I said \"How about you?\"",
 
            "d");
 
        assertThat(fields, is(expected));
 
    }
 
}
 

 

Multiline Columns

Text spanning across lines is sometimes included within CSV data. The class handles this case also without hiccups.

 
@Test
 
public void multilineFieldQuoteTest() throws Exception {
 
    try (Reader in = new StringReader("A,B,\"I said\n\"\"How are you?\"\"\",d");) {
 
        CSV csv = new CSV(true, ',', in );
 
        assertTrue(csv.hasNext());
 
        List < String > fields = csv.next();
 

 
 
        assertTrue(fields.size() == 4);
 
        List < String > expected = Arrays
 
            .asList("A",
 
                "B",
 
                "I said\n\"How are you?\"",
 
                "d");
 
        assertThat(fields, is(expected));
 
    }
 
}
 

 

Summary

Reading CSV data from a file is sometimes required in an application. A simplistic view would be to use String.split() to read the data, but this does not cover all the edge cases. These include commas within fields, double quotes, and multi-line text. Download and drop in the included class for a simple solution to parsing CSV files.