2020-01-05

poi解析Excel(一) 实现readLine读取

words: 3.1k views: time: 15min

Excel

读取文本时通常会使用BufferedReader，它在InputStreamReader的基础上进行装饰，提供了readLine()来简化文本行的读取。使得行数据的读取可以像流水线一样，每当读取完一行后，它就自动准备好下一行并等待获取，直至全部读取结束。所以这里也希望能够在poi的基础上提供一个类似readLine()的接口来读取Excel，并可以比较方便地获取行数据中的字段。

1. csv文本解析

先定义一个统一的读取接口，以及行数据的结构，后面对于不同的文件，在实现时再进行适配

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/IReader.java

public interface IReader extends Closeable {
    
    /*
     * 读取下一行
     */
    IRow readRow() throws Exception;
    
}

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/IRow.java

public interface IRow {

    /*
     * 获取当前行索引
     */
    int getRowIndex();
    
    /**
     * 行内容是否为空
     */
    boolean isEmpty();
    
    /**
     * 获取行列数据
     */
    List<String> getColumnList();
}

这里以csv文本文件为例，因为它有像Excel一样描述表格数据的能力，一般以\t分割行数据中的字段。由于它是纯文本文件，因此可以很容易的实现上面的接口，直接将行读取操作委托给BufferedReader，然后再用给定的分隔符对行内容进行分割

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/TextReader.java

public class TextReader implements IReader {
    
    private BufferedReader reader;
    
    private int rowIndex;
    
    private String regex;
    
    public TextReader(InputStream inputStream, String regex) throws Exception {
        this.regex = regex;
        reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")); 
    }

    @Override
    public TextRow readRow() throws Exception {
        String line = reader.readLine();
        rowIndex++;
        if(line == null){
            return null;
        }
        
        boolean isEmpty = StringUtils.isBlank(line);
        
        if(regex == null){
            TextRow row = new TextRow(rowIndex - 1, Arrays.asList(line));
            row.setEmpty(isEmpty);
            return row;
        }
        TextRow row = new TextRow(rowIndex - 1, Arrays.asList(line.split(regex)));
        row.setEmpty(isEmpty);
        return row;
    }

    @Override
    public void close() throws IOException {
        IoUtil.close(reader);
    }
}

对于行数据中的字段，简单使用List保存

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/TextRow.java

public class TextRow implements IRow {
    
    private int rowIndex;

    private List<String> columnList;
    
    private boolean isEmpty;
    
    public TextRow(int rowIndex, List<String> rowData){
        this.rowIndex = rowIndex;
        this.columnList = rowData;
    }
    
    @Override
    public int getRowIndex() {
        return rowIndex;
    }

    @Override
    public boolean isEmpty() {
        return isEmpty;
    }
    
    @Override
    public List<String> getColumnList() {
        return columnList;
    }
    
    void setEmpty(boolean isEmpty) {
        this.isEmpty = isEmpty;
    }
}

2. Excel解析

2.1. 要考虑的问题

2.1.1. Excel数据结构

excel由多个sheet构成，所以对于读取的行数据，除了要记录当前行数据及索引，还需要保留当前sheet的名称和索引，以及当前行是否是当前sheet的最后一行，这些信息都可能是使用者识别处理数据时的依据，因此可以将行数据结构定义如下：

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelRow.java

public class ExcelRow implements IRow{

    private int rowIndex;

    private List<String> columnList;

    private int sheetIndex;

    private String sheetName;

    private boolean isLastRow;

    private boolean isEmpty;

    public ExcelRow(int rowIndex, List<String> rowData){
        this.rowIndex = rowIndex;
        this.columnList = rowData;
    }

    @Override
    public int getRowIndex() {
        return rowIndex;
    }

    @Override
    public boolean isEmpty() {
        return isEmpty;
    }

    @Override
    public List<String> getColumnList() {
        return columnList;
    }

    public boolean isLastRow() {
        return isLastRow;
    }

    public int getSheetIndex() {
        return sheetIndex;
    }

    public String getSheetName() {
        return sheetName;
    }
    
    void setEmpty(boolean isEmpty) {
        this.isEmpty = isEmpty;
    }
    
    void setLastRow(boolean isLastRow) {
        this.isLastRow = isLastRow;
    }
    
    void setSheetIndex(int sheetIndex) {
        this.sheetIndex = sheetIndex;
    }

    void setSheetName(String sheetName) {
        this.sheetName = sheetName;
    }
}

2.1.2. 数据过滤

有的场景中，可能只想处理某些sheet，并希望自定义sheet的处理顺序。对于这个问题，可以在读取流程中抽出两个步骤，然后交给使用者继承实现，好一点的方式是利用组合，即定义一个过滤器接口，如果调用者需要，就实现一个过滤器然后交给读取器，至于过滤器的定义可以如下：

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelSheetFilter.java

public interface ExcelSheetFilter {

    /**
     * 根据sheet索引(从1开始)和sheet名称过滤需要处理的sheet
     */
    boolean filter(int sheetIndex, String sheetName);
    
    /**
     * 重新定义需要读取的sheet，以及读取的顺序
     */
    void resetSheetListForRead(List<String> nameList);
}

2.1.3. Excel类型识别

Excel有xls和xlsx两种版本，并且它们的读取方式完全不同，而解析时面向的是文件流，无法判断是哪种版本，只好将Excel类型的识别交给使用者。

2.1.4. 顺序读取

上面说过希望能像流水一样读取数据，所以不管是读取下一行，还是下一个sheet，只要没到最后，就可以一直读取下去，并且读过的数据不应该被再次读取。于是为了避免读取的混乱，规定readSheet()的语义：如果当前sheet已经读取过一些行并且还有剩余，那么直接返回当前sheet剩余的行，否则直接返回下一个sheet的所有行，这样比较符合流水读取的语义。主要是考虑使用者可能穿插着调用readRow()和readSheet()，我们不能对他的行为作任何假设

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/IExcelReader.java

public interface IExcelReader extends IReader {

    String EXCEL_XLS = "xls";

    String EXCEL_XLSX = "xlsx";
    
    /**
     * read next row
     */
    ExcelRow readRow() throws Exception;
    
    /**
     * read next sheet
     * @return if the current sheet has remaining then return the rest, otherwise return the data of next sheet
     */
    List<ExcelRow> readSheet() throws Exception;
    
    /**
     * set sheet filter
     */
    void setSheetFilter(ExcelSheetFilter sheetFilter);
}

2.2. 实现

poi在Workbook初始化的时候，已经把整个Excel都解析结束并保存在内存中，所以这里能做的只是在它的结果基础上提供一些方法，方便数据处理

思路可以概括如下：

在Workbook初始化后维护一个Excel的sheet列表sheetNameList，并保持其顺序，另外维护一个sheet名称与数据的映射sheetMap
初始化一个sheetNameGivenList列表，后面sheet过滤时以及数据读取时都使用sheetNameGivenList，而sheetNameList保持不变，仅作参考。读取时，先从sheetNameGivenList中获取下一个要读取的sheet名称，然后再从sheetNameList中获取它真正的索引sheetIndex。
维护一些表示当前读取位置的索引，使用sheetIndexReading表示当前读取的sheet在sheetNameGivenList中的索引，由此也就知道了当前sheet的sheetName和sheetIndex。使用rowIndex表示当前行索引，以及cellIndex表示当前列索引。
读取流程是从给定的sheetNameGivenList依次获取sheet，当一个sheet的最后一行读完时，就获取下一个sheet，直至最后一个sheet的最后一行读完时，返回null。

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelReader.java

public class ExcelReader implements IExcelReader {

    private static final Logger LOG = LoggerFactory.getLogger(ExcelReader.class);

    private InputStream inputStream;

    private String type;

    private Workbook workbook = null;

    private Map<String, Sheet> sheetMap = new HashMap<>();

    private List<String> sheetNameList = new ArrayList<>();

    private List<String> sheetNameGivenList = new ArrayList<>();

    private int sheetIndexReading = 0;

    private Sheet sheet; 

    private int sheetIndex = 0;

    private String sheetName;

    private int rowIndex = 0;

    private int rowCount;

    private int cellIndex = 0;

    private ExcelSheetFilter sheetFilter;

    private boolean inited = false;

    public ExcelReader(String sourceUri) throws IOException {
        this(new File(sourceUri));
    }

    public ExcelReader(File file) throws IOException {
        if(!file.exists()){
            Resource resource = new ClassPathResource(file.getPath());
            file = resource.getFile();
        }
		
        String name = file.getName();
        int index = name.lastIndexOf('.');
        if(index == -1){
            throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
        }
		
        this.type = name.substring(index + 1);
        if(!IExcelReader.EXCEL_XLS.equalsIgnoreCase(type) && !IExcelReader.EXCEL_XLSX.equalsIgnoreCase(type)){
            throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx.");
        }
		
        this.inputStream = new FileInputStream(file);
        init();
    }

    public ExcelReader(InputStream inputStream, String type) throws IOException {
        this.type = type;
        this.inputStream = inputStream;
        init();
    }

    @Override
    public void setSheetFilter(ExcelSheetFilter sheetFilter) {
        this.sheetFilter = sheetFilter;
    }

    private void init() throws IOException{ 
        if(EXCEL_XLS.equalsIgnoreCase(type)){
            workbook = new HSSFWorkbook(inputStream);
        }else if(EXCEL_XLSX.equalsIgnoreCase(type)){
            workbook = new XSSFWorkbook(inputStream);
        }else{
            throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
        }
        int sheetCount = workbook.getNumberOfSheets();
        for(int i = 0;i < sheetCount;i++){
            Sheet shee = workbook.getSheetAt(i);
            sheetNameList.add(shee.getSheetName());
            sheetMap.put(shee.getSheetName(), shee);
        }
        //cann't let the customer code to directly modify sheetNameList
        sheetNameGivenList.addAll(sheetNameList);
    }

    @Override
    public List<ExcelRow> readSheet() throws Exception {
        List<ExcelRow> list = new ArrayList<>();
        ExcelRow row = null;
        while((row = readRow()) != null){
            if(!row.isLastRow()){
                list.add(row);
            }else{
                return list;
            }
        }
        return null;
    }

    @Override
    public ExcelRow readRow() {
        if(!inited){
            inited = true;
            if(sheetFilter != null){
                sheetFilter.resetSheetListForRead(sheetNameGivenList);
            }
            initSheet();
        }
        while(true){
            if(sheet == null){
                return null;
            }
            if(sheetFilter != null && !sheetFilter.filter(sheetIndex, sheetName)){
                if(++sheetIndexReading >= sheetNameGivenList.size()){
                    return null;
                }
                initSheet();
            }else{
                if(rowIndex >= rowCount){
                    if(sheetIndexReading >= sheetNameGivenList.size() - 1){
                        return null;
                    }else{
                        sheetIndexReading++;
                        initSheet();
                        continue;
                    }
                }else{
                    Row row = sheet.getRow(rowIndex);
                    rowIndex++;

                    //row not exist, don't know why
                    if(row == null){
                        ExcelRow data = new ExcelRow(rowIndex, new ArrayList<String>(0));
                        data.setSheetIndex(sheetIndex); 
                        data.setSheetName(sheetName); 
                        data.setEmpty(true); 
                        data.setLastRow(rowIndex == rowCount); 
                        return data;
                    }

                    int cellCount = row.getLastCellNum();
                    //Illegal Capacity: -1
                    if(cellCount <= 0){
                        ExcelRow data = new ExcelRow(rowIndex, new ArrayList<String>(0));
                        data.setSheetIndex(sheetIndex); 
                        data.setSheetName(sheetName); 
                        data.setEmpty(true); 
                        data.setLastRow(rowIndex == rowCount); 
                        return data;
                    }
                    List<String> list = new ArrayList<>(cellCount);

                    boolean isEmpty = true;
                    for(cellIndex = 0; cellIndex < cellCount; cellIndex++){
                        String value = getCellValue(row.getCell(cellIndex));
                        if(isEmpty && !StringUtils.isBlank(value)){
                            isEmpty = false;
                        }
                        list.add(value);
                    }
                    ExcelRow rowData = new ExcelRow(rowIndex, list);
                    rowData.setSheetIndex(sheetIndex); 
                    rowData.setSheetName(sheetName); 
                    rowData.setEmpty(isEmpty); 
                    rowData.setLastRow(rowIndex == rowCount); 
                    return rowData;
                }
            }
        }
    }

    private void initSheet(){
        rowIndex = 0;
        sheetName = sheetNameGivenList.get(sheetIndexReading); 
        sheetIndex = sheetNameList.indexOf(sheetName) + 1;
        while((sheet = sheetMap.get(sheetName)) == null){
            sheetIndexReading++;
            if(sheetIndexReading >= sheetNameGivenList.size()){
                sheet = null;
                return;
            }else{
                sheetName = sheetNameGivenList.get(sheetIndexReading); 
                sheetIndex = sheetNameList.indexOf(sheetName);
            }
        }
        rowCount = sheet.getLastRowNum() + 1;//poi row num start with 0
    }

    private String getCellValue(Cell cell) {
        if (cell == null) {
            return "";
        }

        switch (cell.getCellType()) {
            case NUMERIC -> {
                double value = cell.getNumericCellValue();
                if (DateUtil.isCellDateFormatted(cell)) {
                    Date date = DateUtil.getJavaDate(value);
                    return String.valueOf(date.getTime());
                } else {
                    return double2String(value);
                }
            }
            case STRING -> {
                return cell.getStringCellValue();
            }
            case BOOLEAN -> {
                return String.valueOf(cell.getBooleanCellValue());
            }
            case FORMULA -> {
                try {
                    return double2String(cell.getNumericCellValue());
                } catch (IllegalStateException e) {
                    try {
                        return cell.getRichStringCellValue().toString();
                    } catch (IllegalStateException e2) {
                        log.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex, e2);
                        return "";
                    }
                } catch (Exception e) {
                    log.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex, e);
                    return "";
                }
            }
            case ERROR -> {
                log.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex);
                return "";
            }
            default -> {
                return "";
            }
        }
    }

    static String double2String(Double d) {
        return formatDouble(d.toString());
    }

    static String formatDouble(String doubleStr) {
        boolean b = doubleStr.contains("E");
        int indexOfPoint = doubleStr.indexOf('.');
        if (b) {
            int indexOfE = doubleStr.indexOf('E');
            BigInteger xs = new BigInteger(doubleStr.substring(indexOfPoint + BigInteger.ONE.intValue(), indexOfE));
            int pow = Integer.parseInt(doubleStr.substring(indexOfE + BigInteger.ONE.intValue()));
            int xsLen = xs.toByteArray().length;
            int scale = xsLen - pow > 0 ? xsLen - pow : 0;
            doubleStr = String.format("%." + scale + "f", doubleStr);
        } else {
            Pattern p = Pattern.compile(".0$");
            Matcher m = p.matcher(doubleStr);
            if (m.find()) {
                doubleStr = doubleStr.replace(".0", "");
            }
        }
        return doubleStr;
    }

    @Override
    public void close() throws IOException {
        IOUtils.closeQuietly(workbook); 
        IOUtils.closeQuietly(inputStream);
    }
}

3. 示例

这样使用readRow()或者readSheet()就可以像读取文本行一样读取Excel了

:https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/test/java/io/github/echo/poi/ExcelTest.java

@Test
public void testExcelReader() throws Exception{
    IExcelReader reader = new ExcelReader("excel/ExcelTest.xlsx");
    ExcelRow row;
    while((row = reader.readRow()) != null){
        System.out.println(row);
    }
    reader.close();
    Assertions.assertTrue(true);
}

4. 存在问题：内存消耗

由于poi在初始化时将整个Excel都解析完成并保存在内存中，为了保持数据的结构顺序使用了TreeMap进行层层嵌套，并且除了数据本身之外，还维护了单元格的各种属性，因此对内存的占用相当大，很容易造成内存溢出。下面以xlsx的实现XSSF为例，看一下数据的保存：

首先在XSSFWorkbook中维护着book信息，并将所有的sheet数据维护在一个List中

/**
 * The underlying XML bean
 */
private CTWorkbook workbook;

/**
 * this holds the XSSFSheet objects attached to this workbook
 */
private List<XSSFSheet> sheets;

然后在XSSFSheet中，将所有的行数据维护在一个TreeMap中

1	private final SortedMap<Integer, XSSFRow> _rows = new TreeMap<>();

接着在XSSFRow中，继续将所有的列数据嵌套在TreeMap中

/**
 * Cells of this row keyed by their column indexes.
 * The TreeMap ensures that the cells are ordered by columnIndex in the ascending order.
 */
private final TreeMap<Integer, XSSFCell> _cells;

最后在XSSFCell中实际保存单元格数据的元素是CTCell，其中保存了单元格的数据和所有属性。

/**
  * the xml bean containing information about the cell's location, value,
  * data type, formatting, and formula
  */
private CTCell _cell;

为了避免内存溢出问题，poi也提供了基于SAX事件的读取方式，即不再一次性将Excel解析到内存中，而是在读取文件流的过程中，每当读取到一个节点就产生一个对应的事件，并交给调用者处理。SAX是java针对xml标准提供的基于事件解析的api，之所以能用来解析Excel，是因为Excel文件的本质上是一个压缩包，其中包含了XML、图片及其他描述等文件，而其数据就是通过Xml来描述的。可以将Excel后缀改为zip，解压便可以看到其内部文件结构

参考：