poi解析Excel(一) 实现readLine读取

words: 3.1k    views:    time: 15min

读取文本时通常会使用BufferedReader,它在InputStreamReader的基础上进行装饰,提供了readLine()来简化文本行的读取。使得行数据的读取可以像流水线一样,每当读取完一行后,它就自动准备好下一行并等待获取,直至全部读取结束。所以这里也希望能够在poi的基础上提供一个类似readLine()的接口来读取Excel,并可以比较方便地获取行数据中的字段。

1. csv文本解析

先定义一个统一的读取接口,以及行数据的结构,后面对于不同的文件,在实现时再进行适配

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/IReader.java
1
2
3
4
5
6
7
8
public interface IReader extends Closeable {

/*
* 读取下一行
*/
IRow readRow() throws Exception;

}
https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/IRow.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public interface IRow {

/*
* 获取当前行索引
*/
int getRowIndex();

/**
* 行内容是否为空
*/
boolean isEmpty();

/**
* 获取行列数据
*/
List<String> getColumnList();
}

这里以csv文本文件为例,因为它有像Excel一样描述表格数据的能力,一般以\t分割行数据中的字段。由于它是纯文本文件,因此可以很容易的实现上面的接口,直接将行读取操作委托给BufferedReader,然后再用给定的分隔符对行内容进行分割

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/TextReader.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
public class TextReader implements IReader {

private BufferedReader reader;

private int rowIndex;

private String regex;

public TextReader(InputStream inputStream, String regex) throws Exception {
this.regex = regex;
reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
}

@Override
public TextRow readRow() throws Exception {
String line = reader.readLine();
rowIndex++;
if(line == null){
return null;
}

boolean isEmpty = StringUtils.isBlank(line);

if(regex == null){
TextRow row = new TextRow(rowIndex - 1, Arrays.asList(line));
row.setEmpty(isEmpty);
return row;
}
TextRow row = new TextRow(rowIndex - 1, Arrays.asList(line.split(regex)));
row.setEmpty(isEmpty);
return row;
}

@Override
public void close() throws IOException {
IoUtil.close(reader);
}
}

对于行数据中的字段,简单使用List保存

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/text/TextRow.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
public class TextRow implements IRow {

private int rowIndex;

private List<String> columnList;

private boolean isEmpty;

public TextRow(int rowIndex, List<String> rowData){
this.rowIndex = rowIndex;
this.columnList = rowData;
}

@Override
public int getRowIndex() {
return rowIndex;
}

@Override
public boolean isEmpty() {
return isEmpty;
}

@Override
public List<String> getColumnList() {
return columnList;
}

void setEmpty(boolean isEmpty) {
this.isEmpty = isEmpty;
}
}

2. Excel解析

2.1. 要考虑的问题

2.1.1. Excel数据结构

excel由多个sheet构成,所以对于读取的行数据,除了要记录当前行数据及索引,还需要保留当前sheet的名称和索引,以及当前行是否是当前sheet的最后一行,这些信息都可能是使用者识别处理数据时的依据,因此可以将行数据结构定义如下:

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelRow.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
public class ExcelRow implements IRow{

private int rowIndex;

private List<String> columnList;

private int sheetIndex;

private String sheetName;

private boolean isLastRow;

private boolean isEmpty;

public ExcelRow(int rowIndex, List<String> rowData){
this.rowIndex = rowIndex;
this.columnList = rowData;
}

@Override
public int getRowIndex() {
return rowIndex;
}

@Override
public boolean isEmpty() {
return isEmpty;
}

@Override
public List<String> getColumnList() {
return columnList;
}

public boolean isLastRow() {
return isLastRow;
}

public int getSheetIndex() {
return sheetIndex;
}

public String getSheetName() {
return sheetName;
}

void setEmpty(boolean isEmpty) {
this.isEmpty = isEmpty;
}

void setLastRow(boolean isLastRow) {
this.isLastRow = isLastRow;
}

void setSheetIndex(int sheetIndex) {
this.sheetIndex = sheetIndex;
}

void setSheetName(String sheetName) {
this.sheetName = sheetName;
}
}
2.1.2. 数据过滤

有的场景中,可能只想处理某些sheet,并希望自定义sheet的处理顺序。对于这个问题,可以在读取流程中抽出两个步骤,然后交给使用者继承实现,好一点的方式是利用组合,即定义一个过滤器接口,如果调用者需要,就实现一个过滤器然后交给读取器,至于过滤器的定义可以如下:

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelSheetFilter.java
1
2
3
4
5
6
7
8
9
10
11
12
public interface ExcelSheetFilter {

/**
* 根据sheet索引(从1开始)和sheet名称过滤需要处理的sheet
*/
boolean filter(int sheetIndex, String sheetName);

/**
* 重新定义需要读取的sheet,以及读取的顺序
*/
void resetSheetListForRead(List<String> nameList);
}
2.1.3. Excel类型识别

Excel有xls和xlsx两种版本,并且它们的读取方式完全不同,而解析时面向的是文件流,无法判断是哪种版本,只好将Excel类型的识别交给使用者

2.1.4. 顺序读取

上面说过希望能像流水一样读取数据,所以不管是读取下一行,还是下一个sheet,只要没到最后,就可以一直读取下去,并且读过的数据不应该被再次读取。于是为了避免读取的混乱,规定readSheet()的语义:如果当前sheet已经读取过一些行并且还有剩余,那么直接返回当前sheet剩余的行,否则直接返回下一个sheet的所有行,这样比较符合流水读取的语义。主要是考虑使用者可能穿插着调用readRow()readSheet(),我们不能对他的行为作任何假设

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/IExcelReader.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
public interface IExcelReader extends IReader {

String EXCEL_XLS = "xls";

String EXCEL_XLSX = "xlsx";

/**
* read next row
*/
ExcelRow readRow() throws Exception;

/**
* read next sheet
* @return if the current sheet has remaining then return the rest, otherwise return the data of next sheet
*/
List<ExcelRow> readSheet() throws Exception;

/**
* set sheet filter
*/
void setSheetFilter(ExcelSheetFilter sheetFilter);
}

2.2. 实现

poi在Workbook初始化的时候,已经把整个Excel都解析结束并保存在内存中,所以这里能做的只是在它的结果基础上提供一些方法,方便数据处理

思路可以概括如下:

  1. 在Workbook初始化后维护一个Excel的sheet列表sheetNameList,并保持其顺序,另外维护一个sheet名称与数据的映射sheetMap

  2. 初始化一个sheetNameGivenList列表,后面sheet过滤时以及数据读取时都使用sheetNameGivenList,而sheetNameList保持不变,仅作参考。读取时,先从sheetNameGivenList中获取下一个要读取的sheet名称,然后再从sheetNameList中获取它真正的索引sheetIndex

  3. 维护一些表示当前读取位置的索引,使用sheetIndexReading表示当前读取的sheet在sheetNameGivenList中的索引,由此也就知道了当前sheet的sheetNamesheetIndex。使用rowIndex表示当前行索引,以及cellIndex表示当前列索引。

  4. 读取流程是从给定的sheetNameGivenList依次获取sheet,当一个sheet的最后一行读完时,就获取下一个sheet,直至最后一个sheet的最后一行读完时,返回null。

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelReader.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
public class ExcelReader implements IExcelReader {

private static final Logger LOG = LoggerFactory.getLogger(ExcelReader.class);

private InputStream inputStream;

private String type;

private Workbook workbook = null;

private Map<String, Sheet> sheetMap = new HashMap<>();

private List<String> sheetNameList = new ArrayList<>();

private List<String> sheetNameGivenList = new ArrayList<>();

private int sheetIndexReading = 0;

private Sheet sheet;

private int sheetIndex = 0;

private String sheetName;

private int rowIndex = 0;

private int rowCount;

private int cellIndex = 0;

private ExcelSheetFilter sheetFilter;

private boolean inited = false;

public ExcelReader(String sourceUri) throws IOException {
this(new File(sourceUri));
}

public ExcelReader(File file) throws IOException {
if(!file.exists()){
Resource resource = new ClassPathResource(file.getPath());
file = resource.getFile();
}

String name = file.getName();
int index = name.lastIndexOf('.');
if(index == -1){
throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
}

this.type = name.substring(index + 1);
if(!IExcelReader.EXCEL_XLS.equalsIgnoreCase(type) && !IExcelReader.EXCEL_XLSX.equalsIgnoreCase(type)){
throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx.");
}

this.inputStream = new FileInputStream(file);
init();
}

public ExcelReader(InputStream inputStream, String type) throws IOException {
this.type = type;
this.inputStream = inputStream;
init();
}

@Override
public void setSheetFilter(ExcelSheetFilter sheetFilter) {
this.sheetFilter = sheetFilter;
}

private void init() throws IOException{
if(EXCEL_XLS.equalsIgnoreCase(type)){
workbook = new HSSFWorkbook(inputStream);
}else if(EXCEL_XLSX.equalsIgnoreCase(type)){
workbook = new XSSFWorkbook(inputStream);
}else{
throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
}
int sheetCount = workbook.getNumberOfSheets();
for(int i = 0;i < sheetCount;i++){
Sheet shee = workbook.getSheetAt(i);
sheetNameList.add(shee.getSheetName());
sheetMap.put(shee.getSheetName(), shee);
}
//cann't let the customer code to directly modify sheetNameList
sheetNameGivenList.addAll(sheetNameList);
}

@Override
public List<ExcelRow> readSheet() throws Exception {
List<ExcelRow> list = new ArrayList<>();
ExcelRow row = null;
while((row = readRow()) != null){
if(!row.isLastRow()){
list.add(row);
}else{
return list;
}
}
return null;
}

@Override
public ExcelRow readRow() {
if(!inited){
inited = true;
if(sheetFilter != null){
sheetFilter.resetSheetListForRead(sheetNameGivenList);
}
initSheet();
}
while(true){
if(sheet == null){
return null;
}
if(sheetFilter != null && !sheetFilter.filter(sheetIndex, sheetName)){
if(++sheetIndexReading >= sheetNameGivenList.size()){
return null;
}
initSheet();
}else{
if(rowIndex >= rowCount){
if(sheetIndexReading >= sheetNameGivenList.size() - 1){
return null;
}else{
sheetIndexReading++;
initSheet();
continue;
}
}else{
Row row = sheet.getRow(rowIndex);
rowIndex++;

//row not exist, don't know why
if(row == null){
ExcelRow data = new ExcelRow(rowIndex, new ArrayList<String>(0));
data.setSheetIndex(sheetIndex);
data.setSheetName(sheetName);
data.setEmpty(true);
data.setLastRow(rowIndex == rowCount);
return data;
}

int cellCount = row.getLastCellNum();
//Illegal Capacity: -1
if(cellCount <= 0){
ExcelRow data = new ExcelRow(rowIndex, new ArrayList<String>(0));
data.setSheetIndex(sheetIndex);
data.setSheetName(sheetName);
data.setEmpty(true);
data.setLastRow(rowIndex == rowCount);
return data;
}
List<String> list = new ArrayList<>(cellCount);

boolean isEmpty = true;
for(cellIndex = 0; cellIndex < cellCount; cellIndex++){
String value = getCellValue(row.getCell(cellIndex));
if(isEmpty && !StringUtils.isBlank(value)){
isEmpty = false;
}
list.add(value);
}
ExcelRow rowData = new ExcelRow(rowIndex, list);
rowData.setSheetIndex(sheetIndex);
rowData.setSheetName(sheetName);
rowData.setEmpty(isEmpty);
rowData.setLastRow(rowIndex == rowCount);
return rowData;
}
}
}
}

private void initSheet(){
rowIndex = 0;
sheetName = sheetNameGivenList.get(sheetIndexReading);
sheetIndex = sheetNameList.indexOf(sheetName) + 1;
while((sheet = sheetMap.get(sheetName)) == null){
sheetIndexReading++;
if(sheetIndexReading >= sheetNameGivenList.size()){
sheet = null;
return;
}else{
sheetName = sheetNameGivenList.get(sheetIndexReading);
sheetIndex = sheetNameList.indexOf(sheetName);
}
}
rowCount = sheet.getLastRowNum() + 1;//poi row num start with 0
}

private String getCellValue(Cell cell) {
if (cell == null) {
return "";
}

switch (cell.getCellType()) {
case NUMERIC -> {
double value = cell.getNumericCellValue();
if (DateUtil.isCellDateFormatted(cell)) {
Date date = DateUtil.getJavaDate(value);
return String.valueOf(date.getTime());
} else {
return double2String(value);
}
}
case STRING -> {
return cell.getStringCellValue();
}
case BOOLEAN -> {
return String.valueOf(cell.getBooleanCellValue());
}
case FORMULA -> {
try {
return double2String(cell.getNumericCellValue());
} catch (IllegalStateException e) {
try {
return cell.getRichStringCellValue().toString();
} catch (IllegalStateException e2) {
log.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex, e2);
return "";
}
} catch (Exception e) {
log.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex, e);
return "";
}
}
case ERROR -> {
log.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex);
return "";
}
default -> {
return "";
}
}
}

static String double2String(Double d) {
return formatDouble(d.toString());
}

static String formatDouble(String doubleStr) {
boolean b = doubleStr.contains("E");
int indexOfPoint = doubleStr.indexOf('.');
if (b) {
int indexOfE = doubleStr.indexOf('E');
BigInteger xs = new BigInteger(doubleStr.substring(indexOfPoint + BigInteger.ONE.intValue(), indexOfE));
int pow = Integer.parseInt(doubleStr.substring(indexOfE + BigInteger.ONE.intValue()));
int xsLen = xs.toByteArray().length;
int scale = xsLen - pow > 0 ? xsLen - pow : 0;
doubleStr = String.format("%." + scale + "f", doubleStr);
} else {
Pattern p = Pattern.compile(".0$");
Matcher m = p.matcher(doubleStr);
if (m.find()) {
doubleStr = doubleStr.replace(".0", "");
}
}
return doubleStr;
}

@Override
public void close() throws IOException {
IOUtils.closeQuietly(workbook);
IOUtils.closeQuietly(inputStream);
}
}

3. 示例

这样使用readRow()或者readSheet()就可以像读取文本行一样读取Excel了

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/test/java/io/github/echo/poi/ExcelTest.java
1
2
3
4
5
6
7
8
9
10
@Test
public void testExcelReader() throws Exception{
IExcelReader reader = new ExcelReader("excel/ExcelTest.xlsx");
ExcelRow row;
while((row = reader.readRow()) != null){
System.out.println(row);
}
reader.close();
Assertions.assertTrue(true);
}

4. 存在问题:内存消耗

由于poi在初始化时将整个Excel都解析完成并保存在内存中,为了保持数据的结构顺序使用了TreeMap进行层层嵌套,并且除了数据本身之外,还维护了单元格的各种属性,因此对内存的占用相当大,很容易造成内存溢出。下面以xlsx的实现XSSF为例,看一下数据的保存:

首先在XSSFWorkbook中维护着book信息,并将所有的sheet数据维护在一个List

1
2
3
4
5
6
7
8
9
/**
* The underlying XML bean
*/
private CTWorkbook workbook;

/**
* this holds the XSSFSheet objects attached to this workbook
*/
private List<XSSFSheet> sheets;

然后在XSSFSheet中,将所有的行数据维护在一个TreeMap

1
private final SortedMap<Integer, XSSFRow> _rows = new TreeMap<>();

接着在XSSFRow中,继续将所有的列数据嵌套在TreeMap

1
2
3
4
5
/**
* Cells of this row keyed by their column indexes.
* The TreeMap ensures that the cells are ordered by columnIndex in the ascending order.
*/
private final TreeMap<Integer, XSSFCell> _cells;

最后在XSSFCell中实际保存单元格数据的元素是CTCell,其中保存了单元格的数据和所有属性。

1
2
3
4
5
/**
* the xml bean containing information about the cell's location, value,
* data type, formatting, and formula
*/
private CTCell _cell;

为了避免内存溢出问题,poi也提供了基于SAX事件的读取方式,即不再一次性将Excel解析到内存中,而是在读取文件流的过程中,每当读取到一个节点就产生一个对应的事件,并交给调用者处理。SAX是java针对xml标准提供的基于事件解析的api,之所以能用来解析Excel,是因为Excel文件的本质上是一个压缩包,其中包含了XML、图片及其他描述等文件,而其数据就是通过Xml来描述的。可以将Excel后缀改为zip,解压便可以看到其内部文件结构


参考: