Friday, October 09, 2015

Reading Excel with the Open XML SDK


A while back I posted Look Mum no Office on the subject of the Microsoft Open XML SDK that I had used for writing an Excel file without the need for a copy of Office on the machine running the code. This proved to be very straightforward and there was some excellent samplecode available 

Yesterday I needed to read data from an Excel workbook and decided to write a class that I could use in multiple scenarios to grab data from this source. Using the Open XML SDK looked the way to go as this should work with all Excel files produced by versions after Office 2003.

First use the Package Manager to grab the support libraries

PM> Install-Package DocumentFormat.OpenXml

And then add WindowsBase.dll to the project references (it is in the Framework Assembly list so easy to spot using the “Add Reference” dialog.


My first Excel reading class function was to read the list of Sheet Names in a workbook.
private List<string> getSheetNames() {     List<string> sheetNames = new List<string>();     try     {         var document = SpreadsheetDocument.Open(excelFilePath, false);         var workbookPart = document.WorkbookPart;         var sheets = workbookPart.Workbook.Descendants<Sheet>();         foreach(Sheet sheet in sheets)         {             sheetNames.Add(sheet.Name);         }         document.Close();     }     catch (Exception ex)     {         ...     }     return sheetNames; }

This was a good indicator that this process was going to be pretty steady although there were a couple of twists coming up (with all making sense when thought about).

I wanted my class to have two further main features. One to take a sample of rows from a specified sheet and another to extract defined columns from a specified sheet. As both cover similar ground, let’s look at the development of the sample row extract.


The process is 1. Identify the worksheet, 2. Grab the sheet, 3. Extract the sheet data and then 4 Loop through the Rows in the sheet data. Within each row we can loop through the cells and read the string content.

private List<List<string>> initialGetSampleRows() {     List<List<string>> samples = new List<List<string>>();     try     {         int selCount = 0;         var document = SpreadsheetDocument.Open(excelFilePath, false);         var workbookPart = document.WorkbookPart;         var sheets = workbookPart.Workbook.Descendants<Sheet>();         foreach (Sheet sheet in sheets)         {             if (sheet.Name == sheetName)             {                 var workSheet = ((WorksheetPart)workbookPart.GetPartById(sheet.Id)).Worksheet;                 var sheetData = workSheet.Elements<SheetData>().First();                 List<Row> rows = sheetData.Elements<Row>().ToList();                 foreach (Row row in rows)                 {                     var thisRow = new List<string>();                     foreach (Cell cell in row.Descendants<Cell>())                     {                         var cellContent = cell.CellValue;                         thisRow.Add((cellContent == null) ? cell.InnerText : cellContent.Text);                     }                         samples.Add(thisRow);                     selCount++;                     if (selCount >= sampleRowCount)                     {                         break;                     }                 }                 break;             }         }         document.Close();     }     catch (Exception ex)     {         ...     }     return samples; }

Does it work? Well there are a few snags. Dates are (perhaps not unexpectedly) showing as a number, numbers themselves are fine but cells with a string content are being read as an integer of some sort.


Turns out that strings are held in a shared string table so we need to change the code to read that table when required based on the cell DataType ("s"). 

foreach (Cell cell in row.Descendants<Cell>()) {     var cellContent = cell.CellValue;     var cellText = (cellContent == null) ? cell.InnerText : cellContent.Text;     var dataType = cell.DataType ?? CellValues.Error;     if (dataType == CellValues.SharedString)     {         cellText = workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(Convert.ToInt32(cell.CellValue.Text)).InnerText;     }     thisRow.Add(cellText); }

Excel dates are stored as an integer number of days since 31st December 1899 – so the Excel date for 1st January 1900 is 1. There is a snag inherited for compatibility reasons from Lotus – where the year 1900 was assumed not to be a leap year when in fact (unlike the year 2000) it was. So the numbers are almost always out by 1. Dates are represented as human sensible dates on an Excel spreadsheet by using a “style”. It is possible to detect a style value for a cell although this is normally null.

Something like:

uint styleIndex = cell.StyleIndex ?? uint.MinValue; if (styleIndex == 3) {     Double serialDate;     if (Double.TryParse(cellText, out serialDate))     {         DateTime cellDate = DateTime.FromOADate(serialDate);         cellText = cellDate.ToString(dateFormat);     } }

Where we can use the inbuilt DateTime.FromOADate() function to handle the conversion, applying string formatting to suite.

There is another snag. It would not make any sense for an XML based Excel file to include empty cells. This means that when a row of cells is read the row only contains cells with content. This would present a problem where empty cells appear (perhaps now and again) before columns of cells of interest. Fortunately each cell object contains a CellReference that can be used to extract the Column+Row identity we are familiar with on Excel sheets. This can be used to detect a cell being presented earlier than expected in a row.

One solution (now deprecated - see below) is to create an Enumerator that can manage the detection of missing cells and supply empty ones as required. The class code was re-structured a bit at this point in time as well – partly to make the code units available to functionality to be added to the class later.


The Enumerator looks like this:

private IEnumerator<Cell> excelCellEnumerator(Row row) {     int currentCount = 0;     foreach (Cell cell in row.Descendants<Cell>())     {         string columnName = columnNameFromReference(cell.CellReference);         int thisColIndex = convertColumnNameToNumber(columnName);         while(currentCount < thisColIndex)         {             var emptyCell = new Cell()             {                 DataType = null,                 CellValue = new CellValue(string.Empty)             };             yield return emptyCell;             currentCount++;         }         yield return cell;         currentCount++;     } }

With support from:

private string columnNameFromReference(string cellReference) {     var regex = new Regex("[A-Za-z]+");     var match = regex.Match(cellReference);     return match.Value; } private int convertColumnNameToNumber(string columnName) {     char[] colAlphas = columnName.ToCharArray();     Array.Reverse(colAlphas);     int colNumber = 0;     for (int i = 0; i < colAlphas.Length; i++)     {         char alpha = colAlphas[i];         int tAlpha = i == 0 ? alpha - 65 : alpha - 64;         colNumber += tAlpha * (int)Math.Pow(26, i);     }     return colNumber; }

And with the nitty gritty of the sample extract function being changed to use the Enumerator thus:

foreach(Row row in rows) {     var thisRow = new List<string>();     var cellEnumerator = excelCellEnumerator(row);     while (cellEnumerator.MoveNext())     {         var cell = cellEnumerator.Current;         var contentString = readCellContent(cell, workbookPart);         thisRow.Add(contentString);     }     samples.Add(thisRow); }

And the reading of the cell content (all code earlier in the post) being moved to a separate function.

Addendum


A little experimentation indicated that it might also be a good idea to make provision for Boolean values in an Excel sheet. Where Booleans exist they are presented in Excel as TRUE and FALSE but the underlying values are a (reasonable) 1 and 0 respectively. You can detect Boolean values using the cell DataType which (in this instance) would be set to CellValues.Boolean.

There is also a CellValues enum CellValues.Date  but strangely this is not set in any of my sample data so the original technique of inspecting the StyleIndex still stands.


You might however wish to add the following code snippet to deal with Booleans:

if(dataType == CellValues.Boolean) {     int bVal;     if(int.TryParse(cellText, out bVal))     {         if(bVal == 0)         {             cellText = "false";         } else         {             cellText = "true";         }     } }

Parsing the value rather than just casting it may just be paranoia but…

Further Thought

Of course Yield return is sheer flummery in this context – although the Iterator was lots of fun. 

Yield should be reserved for computationally expensive processes where you want to spread the load a bit. In this instance most or all cells are going to be available in the Row object and so we would be better writing a function to return the whole list of cells in one go.

private List<Cell> getExcelRowCells(Row row) {      List<Cell> theCells = new List<Cell>();      int currentCount = 0;      foreach (Cell cell in row.Descendants<Cell>())      {          string columnName = columnNameFromReference(cell.CellReference);          int thisColIndex = convertColumnNameToNumber(columnName);          while (currentCount < thisColIndex)          {              var emptyCell = new Cell()              {                  DataType = null,                  CellValue = new CellValue(string.Empty)              };              theCells.Add(emptyCell);              currentCount++;          }          theCells.Add(cell);          currentCount++;      }      return theCells; }

And then consume the list as before but with conventional List<T> iteration:

foreach (Row row in rows) {     var thisRow = new List<string>();     List<Cell> rowCells = getExcelRowCells(row);     foreach(Cell cell in rowCells)     {         thisRow.Add(readCellContent(cell, workbookPart));     }     samples.Add(thisRow);     selCount++;     if (selCount >= sampleRowCount)     {         break;     } }

No comments: