poi解析Excel(二) SAX解决内存问题

words: 3k    views:    time: 15min

由于Excel的数据是通过xml来描述,所以解析Excel也就是解析xml。对于文本文件,可以通过换行符来确定一行的结束。而对于xml,其描述数据的单位是节点,并且可以嵌套,因此读取时是以节点结束来进行确定,但是除非事先知道xml的节点规则,即节点相互之间的嵌套结构,否则无法正确读取。这个xml规则就定义在Excel的规范中,poi也是基于这个规则来对Excel进行解析的

1. 实现

由于xlsx和xls是两套不同的规范,poi提供了XSSF和HSSF两种不同读取实现,所以在实现IExcelReader时也根据类型进行不同的初始化

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelEventReader.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
public class ExcelEventReader implements IExcelReader {

private static final Logger LOG = Logger.getLogger(ExcelEventReader.class);

private String type;

private OPCPackage pkg;

private ExcelXSSFHandler xssfHandler;

private POIFSFileSystem pfs;

private ExcelHSSFHandler hssfHandler;

private ExcelSheetFilter sheetFilter;

public ExcelEventReader(InputStream inputStream, String type) throws IOException, OpenXML4JException, SAXException, DocumentException {
this.type = type;
if(EXCEL_XLS.equalsIgnoreCase(type)){
pfs = new POIFSFileSystem(inputStream);
initHssf();
}else if(EXCEL_XLSX.equalsIgnoreCase(type)){
pkg = OPCPackage.open(inputStream);
initXssf();
}else{
throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
}
}

private void initHssf() throws IOException {
hssfHandler = new ExcelHSSFHandler();
HSSFRequest hssfRequest = new HSSFRequest();
hssfRequest.addListenerForAllRecords(hssfHandler);

HSSFEventFactory factory = new HSSFEventFactory();
InputStream bookStream = pfs.createDocumentInputStream("Workbook");
factory.processEvents(hssfRequest, bookStream);
}

private void initXssf() throws IOException, OpenXML4JException, SAXException, DocumentException {
XSSFReader reader = new XSSFReader(pkg);
XMLReader parser = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
xssfHandler = new ExcelXSSFHandler(reader, parser);
parser.setContentHandler(xssfHandler);
}

@Override
public List<ExcelRow> readSheet() throws Exception {
if(EXCEL_XLS.equalsIgnoreCase(type)){
return hssfHandler.readSheet();
}else if(EXCEL_XLSX.equalsIgnoreCase(type)){
return xssfHandler.readSheet();
}else{
throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
}
}

@Override
public ExcelRow readRow() throws Exception {
if(EXCEL_XLS.equalsIgnoreCase(type)){
return hssfHandler.readRow();
}else if(EXCEL_XLSX.equalsIgnoreCase(type)){
return xssfHandler.readRow();
}else{
throw new UnsupportedOperationException("Excel file name must end with .xls or .xlsx");
}
}

@Override
public void close() throws IOException {
IoUtil.close(pkg);
IoUtil.close(pfs);
}

@Override
public void setSheetFilter(ExcelSheetFilter sheetFilter) {
this.sheetFilter = sheetFilter;
}

//...
}

1.1. xlsx实现

每个sheet的数据相当于一个独立的xml,然后用一个叫workbook的xml来描述这些sheet。对于xlsx,poi提供了XSSFReader用来分别获取这些xml的流,可以依次将这些流交给SAXParser,SAXParser再将xml流读取拆成一个个事件,进行处理实现

对于workbook流,由于sheet信息比较简单清晰,就直接将其转成dom树,然后再遍历获取和维护这些sheet,大体思路与之前类似,区别在于之前维护的sheet解析好的数据,而这里维护的是sheet对应xml的流,只有在真正获取sheet数据时,才会解析,不过每次读取一个新的sheet时,都是先将这个sheet全部解析完成,然后在按行依次吐给调用者

至于sheet解析的实现,可以大概描述如下:

  1. 每个节点都有一个开始和结束事件,一个开始事件必定会接着一个结束事件,如果有嵌套则依次类推,类似于方法调用栈,有进必有出,SAX也是凭借这一点实现的。在开始事件中,可以拿到节点的各种属性信息,在结束事件中,则意味着节点的信息已经获取完毕,可以进行保存或其它处理
  2. 事件处理过程中节点名称和属性名称的判断,比如c代表单元格cell,t代表cellType等,这些也并没有找对应的规范文档,而是根据Excel中获取的xml总结的一些规律,所以不保证解析处理流程是是完备的,不过目前从测试的结果看没有什么问题
  3. 空行与空格的问题,xml中直接忽略了空行和空格,不过在又数据的具体单元格上保留了其行列的索引信息,行索引是数字,而列索引是英文字母(相当于一个26进制),根据这些索引信息,便可以对空行空格进行补全
  4. 最后,sheet的数据解析完了按照接口的定义将数据返回给调用者,与之前类似,不多赘述
https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelEventReader.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
private class ExcelXSSFHandler extends DefaultHandler {

private XSSFReader xssfReader;

private XMLReader xmlReader;

private SharedStringsTable stringTable;

private StylesTable stylesTable;

private Map<String, String> sheetNameRidMap = new LinkedHashMap<>();

private List<String> sheetNameList = new ArrayList<>();

private List<String> sheetNameGivenList = new ArrayList<>();

private int sheetReadingIndex = 0;

private String sheetName;

private int sheetIndex;

private int rowMax;

private int cellMax;

private int rowIndex = 0;

private int cellIndex = 0;

private CellType cellType;

private String lastContents = "";

private List<String> columnList;

private boolean isEnd = false;

private List<ExcelRow> sheetData;

private int rowReadingIndex = 0;

private String formatString;

private short formatIndex;

public ExcelXSSFHandler(XSSFReader xssfReader, XMLReader xmlReader) throws InvalidFormatException, IOException, SAXException, DocumentException {
this.xmlReader = xmlReader;
this.xssfReader = xssfReader;
this.stringTable = xssfReader.getSharedStringsTable();
this.stylesTable = xssfReader.getStylesTable();
InputStream bookStream = null;
try{
bookStream = xssfReader.getWorkbookData();
SAXReader reader = new SAXReader();
reader.setEncoding("UTF-8");
Document doc = reader.read(bookStream);
Element book = doc.getRootElement();
Element sheets = book.element("sheets");
Iterator<?> it = sheets.elementIterator("sheet");
while(it.hasNext()){
Element sheet = (Element)it.next();
String name = sheet.attributeValue("name");
String rid = sheet.attributeValue("id");
sheetNameList.add(name);
sheetNameRidMap.put(name, rid);
}
}finally{
IoUtil.close(bookStream);
}
//cann't let the customer code to directly modify sheetNameList
sheetNameGivenList.addAll(sheetNameList);
}

public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equals("dimension")) {
String dimension = attributes.getValue("ref");
if(dimension.contains(":")){
dimension = dimension.split(":")[1];
}
rowMax = getRowIndex(dimension);
cellMax = getCellIndex(dimension);
sheetData = new ArrayList<>(rowMax);
}

if (qName.equals("row")) {
cellIndex = 0;
int currentRowIndex = Integer.valueOf(attributes.getValue("r")) - 1;
while(rowIndex < currentRowIndex){
ExcelRow data = new ExcelRow(rowIndex + 1, new ArrayList<String>(0));
data.setSheetIndex(sheetIndex);
data.setSheetName(sheetName);
data.setEmpty(true);
data.setLastRow(rowIndex == rowMax - 1);
sheetData.add(data);
rowIndex++;
}
columnList = new ArrayList<>(cellMax);
}

if(qName.equals("c")) {
lastContents = "";
int currentCellIndex = getCellIndex(attributes.getValue("r"));
while(cellIndex < currentCellIndex){
columnList.add("");
cellIndex++;
}

String type = attributes.getValue("t");
if(type == null){
cellType = CellType.BLANK;
}else{
switch (type) {
case "b" -> cellType = CellType.BOOLEAN;
case "e" -> cellType = CellType.ERROR;
case "inlineStr" -> cellType = CellType._NONE;
case "s" -> cellType = CellType.STRING;
case "str" -> cellType = CellType.FORMULA;
default -> cellType = CellType.BLANK;
}
}

String cellStyle = attributes.getValue("s");
if (cellStyle != null) {
int styleIndex = Integer.parseInt(cellStyle);
XSSFCellStyle style = stylesTable.getStyleAt(styleIndex);
formatIndex = style.getDataFormat();
formatString = style.getDataFormatString();
if (formatString == null) {
formatString = BuiltinFormats.getBuiltinFormat(formatIndex);
}
}
}
}

public void endElement(String uri, String localName, String qName) throws SAXException {
/**
* All fields are uniformly read into string,
* and the date type will be first converted to milliseconds.
*/
if (qName.equals("v")) {
switch (cellType){
case STRING:
int sharedIndex = Integer.valueOf(lastContents);
lastContents = stringTable.getItemAt(sharedIndex).toString();
break;
case _NONE:
lastContents = new XSSFRichTextString(lastContents).toString();
break;
case FORMULA:
try {
lastContents = ExcelReader.formatDouble(lastContents);
} catch (Exception e) {
lastContents = new XSSFRichTextString(lastContents).toString();
LOG.error("Excel format error: sheet=" + sheetName + ",row=" + rowIndex + ",column=" + cellIndex, e);
}
break;
case BOOLEAN:
case ERROR:
default:

}

if (formatString != null) {
try{
if(DateUtil.isADateFormat(formatIndex,formatString)) {
double value = Double.valueOf(lastContents);
if(DateUtil.isValidExcelDate(value)) {
Date date = DateUtil.getJavaDate(value, false);
lastContents = String.valueOf(date.getTime());
}
}else{
lastContents = ExcelReader.formatDouble(lastContents);
}
}catch(Exception e){
LOG.error("Excel format error: sheetName="
+ sheetName + ", rowIndex=" + (rowIndex + 1) + ",column=" + cellIndex, e);
}
formatString = null;
}
}

if (qName.equals("c")) {
columnList.add(lastContents);
cellIndex++;
}

if (qName.equals("row")) {
ExcelRow data = new ExcelRow(rowIndex + 1, columnList);
data.setSheetIndex(sheetIndex);
data.setSheetName(sheetName);
data.setEmpty(false);
data.setLastRow(rowIndex == rowMax - 1);
sheetData.add(data);
rowIndex++;
}
}

public void characters(char[] ch, int start, int length) throws SAXException {
lastContents += new String(ch, start, length);
}

private int getRowIndex(String dimension){
int len = dimension.length();
int index = 0;
for(int i = len - 1; i >= 0; i--){
if(dimension.charAt(i) > 64){//A
index = i;
break;
}
}
return Integer.valueOf(dimension.substring(index + 1));
}

private int getCellIndex(String dimension){
int len = dimension.length();
int index = 0;
for(int i = len - 1; i >= 0; i--){
if(dimension.charAt(i) > 64){//A
index = i;
break;
}
}
String cellstr = dimension.substring(0, index + 1);

int cellIndex = 0;
int indexLen = cellstr.length();
int bitIndex = 0;
for(int i = indexLen - 1; i >= 0; i--){
int base = 1;
int c = cellstr.charAt(i) - 65;
if(bitIndex > 0){
c++;//number of columns used 26 jinzhi
for(int j = 0; j < bitIndex; j++){
base *= 26;
}
}
cellIndex += c * base;
bitIndex++;
}
return cellIndex;
}

public List<ExcelRow> readSheet() throws Exception{
while(true){
List<ExcelRow> list = new ArrayList<>();
if(isEnd){
return null;
}else if(sheetData == null){
if(sheetFilter != null){
sheetFilter.resetSheetListForRead(sheetNameGivenList);
}
read();
continue;
}else if(rowReadingIndex > 0 && rowReadingIndex < sheetData.size()){
while(rowReadingIndex < sheetData.size()){
ExcelRow row = sheetData.get(rowReadingIndex);
rowReadingIndex++;
list.add(row);
}
read();
return list;
}else{
list.addAll(sheetData);
read();
return list;
}
}
}

public ExcelRow readRow() throws Exception {
while(true){
if(isEnd){
return null;
}else if(sheetData == null){
if(sheetFilter != null){
sheetFilter.resetSheetListForRead(sheetNameGivenList);
}
read();
continue;
}

if(rowReadingIndex < sheetData.size()){
ExcelRow row = sheetData.get(rowReadingIndex);
rowReadingIndex++;
return row;
}else{
read();
}
}
}

private void read() throws Exception {
if(sheetReadingIndex < sheetNameGivenList.size()) {
sheetName = sheetNameGivenList.get(sheetReadingIndex);
sheetIndex = sheetNameList.indexOf(sheetName) + 1;
if(sheetIndex == -1){
rowReadingIndex = 0;
sheetReadingIndex++;
return;
}

if(sheetFilter != null && !sheetFilter.filter(sheetIndex, sheetName)){
rowReadingIndex = 0;
sheetReadingIndex++;
return;
}

String relId = sheetNameRidMap.get(sheetName);
rowIndex = 0;
InputStream sheetStream = null;
try{
sheetStream = xssfReader.getSheet(relId);
xmlReader.parse(new InputSource(sheetStream));
}finally{
IoUtil.close(sheetStream);
}
rowReadingIndex = 0;
sheetReadingIndex++;
}else{
isEnd = true;
}
}
}

1.2. xls实现

对于xls,poi并没有提供接口来获取一个sheet的数据流,而是直接封装成了一系列的Record事件,并且没有区分sheet,之所以这样可能也是因为做不到,因为微软2007之前对于Excel有一套自己的规范,本身就不支持。这样就不好选择性的解析某个sheet,如果要实现前面的接口,只好将整个Excel都进行解析。

这里的实现就作为一个了解,有很多问题也没有解决,比如没有获取到单元格对应的格式以及最后一行数据的事件接收不到等,感觉去细究这些问题也没有太大意义,毕竟已经淘汰。而xlsx是Microsoft Excel 2007之后Excel的文件存储格式,其实现是基于open Xml和zip技术,这种存储简单,安全,传输方便,同时处理数据也变得简单。

https://github.com/shanhm1991/Echo/blob/master/echo-poi/src/main/java/io/github/echo/poi/excel/ExcelEventReader.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
private class ExcelHSSFHandler implements HSSFListener {

private SSTRecord stringTable;

private List<String> sheetNameList = new ArrayList<>();

private List<String> sheetNameGivenList = new ArrayList<>();

private int sheetIndex = 0;

private String sheetName;

private int rowMax;

private short cellMax;

private int rowIndex = 0;

private Map<String, List<ExcelRow>> bookData = new HashMap<>();

private List<ExcelRow> sheetData;

private ExcelRow rowData;

@Override
public void processRecord(Record record) {
switch (record.getSid()) {
case SSTRecord.sid:
stringTable = (SSTRecord)record;
break;
case BoundSheetRecord.sid:
BoundSheetRecord boundSheet = (BoundSheetRecord)record;
sheetNameList.add(boundSheet.getSheetname());
break;
case BOFRecord.sid:
BOFRecord bof = (BOFRecord)record;
if (bof.getType() == BOFRecord.TYPE_WORKSHEET) {
sheetName = sheetNameList.get(sheetIndex);
rowIndex = 0;
sheetIndex++;
}
break;
case DimensionsRecord.sid:
DimensionsRecord dimensions = (DimensionsRecord)record;
rowMax = dimensions.getLastRow();
cellMax = dimensions.getLastCol();
sheetData = new ArrayList<>(rowMax);
rowData = new ExcelRow(rowIndex + 1, new ArrayList<>(cellMax));
rowData.setSheetIndex(sheetIndex);
rowData.setSheetName(sheetName);
rowData.setLastRow(rowIndex == rowMax - 1);
bookData.put(sheetName, sheetData);
break;
case NumberRecord.sid:
NumberRecord number = (NumberRecord) record;
prepareRowData(number.getRow());
fillRowData(number.getColumn(), ExcelReader.double2String(number.getValue()));
break;
case LabelSSTRecord.sid:
LabelSSTRecord labelSST = (LabelSSTRecord)record;
prepareRowData(labelSST.getRow());
fillRowData(labelSST.getColumn(), stringTable.getString(labelSST.getSSTIndex()).toString());
break;
case FormulaRecord.sid:
FormulaRecord formula = (FormulaRecord)record;
prepareRowData(formula.getRow());
fillRowData(formula.getColumn(), ExcelReader.double2String(formula.getValue()));
break;
case BlankRecord.sid:
BlankRecord blank = (BlankRecord)record;
prepareRowData(blank.getRow());
fillRowData(blank.getColumn(), "");
break;
case BoolErrRecord.sid:
BoolErrRecord bool = (BoolErrRecord)record;
prepareRowData(bool.getRow());
fillRowData(bool.getColumn(), String.valueOf(bool.getBooleanValue()));
break;
}
}

private void fillRowData(short index, String value){
List<String> columnList = rowData.getColumnList();
while(columnList.size() < index - 1){
columnList.add("");
}
columnList.add(value);
}

private void prepareRowData(int rowNum){
if(rowNum > rowIndex){
rowData.setEmpty(false);
sheetData.add(rowData);
rowIndex++;
fillEmptyRow(rowNum);
rowData = new ExcelRow(rowIndex + 1, new ArrayList<>(cellMax));
rowData.setSheetIndex(sheetIndex);
rowData.setSheetName(sheetName);
rowData.setLastRow(rowIndex == rowMax - 1);
}
}

private void fillEmptyRow(int rowNum){
while(rowNum > rowIndex){
rowData = new ExcelRow(rowIndex + 1, new ArrayList<String>(0));
rowData.setSheetIndex(sheetIndex);
rowData.setSheetName(sheetName);
rowData.setEmpty(true);
rowData.setLastRow(rowIndex == rowMax - 1);
sheetData.add(rowData);
rowIndex++;
}
rowIndex = rowNum;
}

private int sheetReadingIndex = 0;

private int rowReadingIndex = 0;

private List<ExcelRow> sheetReadingData;

private boolean inited = false;

private boolean isEnd = false;

public List<ExcelRow> readSheet() throws Exception{
if(!inited){
init();
}
while(true){
List<ExcelRow> list = new ArrayList<>();
if(isEnd){
return null;
}else if(!inited){
init();
read();
continue;
}else if(rowReadingIndex > 0 && rowReadingIndex < sheetData.size()){
while(rowReadingIndex < sheetData.size()){
ExcelRow row = sheetReadingData.get(rowReadingIndex);
rowReadingIndex++;
list.add(row);
}
read();
return list;
}else{
list.addAll(sheetReadingData);
read();
return list;
}
}
}

public ExcelRow readRow() throws Exception {
while(true){
if(isEnd){
return null;
}else if(!inited){
init();
read();
continue;
}
if(rowReadingIndex < sheetReadingData.size()){
ExcelRow row = sheetReadingData.get(rowReadingIndex);
rowReadingIndex++;
return row;
}else{
read();
}
}
}

private void read() throws Exception {
if(sheetReadingIndex < sheetNameGivenList.size()) {
sheetName = sheetNameGivenList.get(sheetReadingIndex);
sheetIndex = sheetNameList.indexOf(sheetName) + 1;
if(sheetIndex == -1){
rowReadingIndex = 0;
sheetReadingIndex++;
return;
}
if(sheetFilter != null && !sheetFilter.filter(sheetIndex, sheetName)){
rowReadingIndex = 0;
sheetReadingIndex++;
return;
}
sheetReadingData = bookData.get(sheetName);
rowReadingIndex = 0;
sheetReadingIndex++;
}else{
isEnd = true;
}
}

private void init(){
inited = true;
sheetNameGivenList.addAll(sheetNameList);
if(sheetFilter != null){
sheetFilter.resetSheetListForRead(sheetNameGivenList);
}
}
}

2. 存在问题

对于xlsx,虽然提供了按行获取数据的接口,其实也是以sheet作为单位整体解析的,可能Excel的单个sheet本身数据量就非常大,只是这里没有使用Map来提前保存数据,可以降低一些内存消耗。对于xls,也并没有完整实现


参考: