CSVReader读取数据缺失

xiaoxiao2026-04-23 12

最近在项目中遇到一个导入CSV文件的程序数据缺失严重.2.4G的报表600多万行,导入数据库实际只有200多万行,最后终于找到了问题的所在,并解决了.记录Mark一下前面的一些曲折过程,怀疑多线程来不及处理直接丢弃,就不讲了. 程序中用了while ((data = csvReader.readNext()) != null)循环进行读取. 查看readNext源码,也是通过BufferedReader的readLine进行一行一行的读取,只是在字符串引用和转义进行了处理.CSV程序默认使用DEFAULT_SEPARATOR = ‘,’逗号作为一列与一列的分割符,DEFAULT_QUOTE_CHARACTER = ‘”’双引号作为字符串引用,就是当一列的内容中出现特殊字符如逗号时,怎么区分这个逗号的是列里面的内容还是列之间的分割,例如,一个文件里面某列内容为a,b,c为了区分这个a b c之间的逗号为本来的内容,所以用”a,b,c”这样表示,DEFAULT_ESCAPE_CHARACTER = ‘\’;反斜杠作为转义. 在字符串引用的处理,发现某列数据以双引号开头,但是在这一行没有发现与之对应的双引号,即是说这一行的双引号为奇数个,会读取下一行进行处理,直到找到与之匹配的双引号.例如,我们的报表在151行在Geometry dash后面出现了特殊字符换行符,在xStep后面也出现了换行符用vim打开,这一行变成了三行,程序会把这三行当成一行处理,这本身没有什么问题. 但是程序中使用反斜杠作为转义,但是csv文件中使用双引号作为转义,这样就会造成\”这样的双引号不做特殊处理,导致双引号不匹配,程序继续读取下一行,造成数据丢失并且数据混乱. 由于CSVReader默认为反斜杠,又不能将转义设置为双引号,这样会和字符串引用的双引号重复,程序处理会混乱,并且程序会抛出异常The separator, quote, and escape characters must be different!,最后重写一个不带转义的CSVReader构造器,重新打个jar包,最后能够读取数据6340034行,解决附readNext关键代码: public String[] readNext() throws IOException {

String[] result = null; do { String nextLine = getNextLine(); if (!hasNext) { return result; // should throw if still pending? } String[] r = parser.parseLineMulti(nextLine); if (r.length > 0) { if (result == null) { result = r; } else { String[] t = new String[result.length+r.length]; System.arraycopy(result, 0, t, 0, result.length); System.arraycopy(r, 0, t, result.length, r.length); result = t; } } } while (parser.isPending()); return result; } private String[] parseLine(String nextLine, boolean multi) throws IOException { if (!multi && pending != null) { pending = null; } if (nextLine == null) { if (pending != null) { String s = pending; pending = null; return new String[]{s}; } else { return null; } } List<String> tokensOnThisLine = new ArrayList<String>(); StringBuilder sb = new StringBuilder(INITIAL_READ_SIZE); boolean inQuotes = false; if (pending != null) { sb.append(pending); pending = null; inQuotes = true; } for (int i = 0; i < nextLine.length(); i++) { char c = nextLine.charAt(i); if ( useEscape && c == this.escape) { if (isNextCharacterEscapable(nextLine, inQuotes || inField, i)) { sb.append(nextLine.charAt(i + 1)); i++; } } else if (c == quotechar) { if (isNextCharacterEscapedQuote(nextLine, inQuotes || inField, i)) { sb.append(nextLine.charAt(i + 1)); i++; } else { //inQuotes = !inQuotes; // the tricky case of an embedded quote in the middle: a,bc"d"ef,g if (!strictQuotes) { if (i > 2 //not on the beginning of the line && nextLine.charAt(i - 1) != this.separator //not at the beginning of an escape sequence && nextLine.length() > (i + 1) && nextLine.charAt(i + 1) != this.separator //not at the end of an escape sequence ) { if (ignoreLeadingWhiteSpace && sb.length() > 0 && isAllWhiteSpace(sb)) { sb.setLength(0); //discard white space leading up to quote } else { sb.append(c); //continue; } } } inQuotes = !inQuotes; } inField = !inField; } else if (c == separator && !inQuotes) { tokensOnThisLine.add(sb.toString()); sb.setLength(0); // start work on next token inField = false; } else { if (!strictQuotes || inQuotes) { sb.append(c); inField = true; } } } // line is done - check status if (inQuotes) { if (multi) { // continuing a quoted section, re-append newline sb.append("\n"); pending = sb.toString(); sb = null; // this partial content is not to be added to field list yet } else { throw new IOException("Un-terminated quoted field at end of CSV line"); } } if (sb != null) { tokensOnThisLine.add(sb.toString()); } return tokensOnThisLine.toArray(new String[tokensOnThisLine.size()]); }

转载请注明原文地址: https://ju.6miu.com/read-1309160.html

最新回复(0)