A while back I posted
Look Mum no Office on the subject of the Microsoft Open XML SDK that I had used for writing an
Excel file without the need for a copy of Office on the machine running the
code. This proved to be very straightforward and there was some
excellent samplecode available
Yesterday I needed to read data from an Excel workbook and
decided to write a class that I could use in multiple scenarios to grab data
from this source. Using the Open XML SDK looked the way to go as this should
work with all Excel files produced by versions after Office 2003.
First use the Package Manager to grab the support libraries
PM> Install-Package
DocumentFormat.OpenXml
And then add WindowsBase.dll to the project references (it
is in the Framework Assembly list so easy to spot using the “Add Reference”
dialog.
My first Excel reading class function was to read the list
of Sheet Names in a workbook.
private List<string> getSheetNames()
{
List<string> sheetNames = new List<string>();
try
{
var document = SpreadsheetDocument.Open(excelFilePath, false);
var workbookPart = document.WorkbookPart;
var sheets = workbookPart.Workbook.Descendants<Sheet>();
foreach(Sheet sheet in sheets)
{
sheetNames.Add(sheet.Name);
}
document.Close();
}
catch (Exception ex)
{
...
}
return sheetNames;
}
This was a good
indicator that this process was going to be pretty steady although there were a
couple of twists coming up (with all making sense when thought about).
I wanted my class to
have two further main features. One to take a sample of rows from a specified
sheet and another to extract defined columns from a specified sheet. As both
cover similar ground, let’s look at the development of the sample row extract.
The process is 1.
Identify the worksheet, 2. Grab the sheet, 3. Extract the sheet data and then 4
Loop through the Rows in the sheet data. Within each row we can loop through
the cells and read the string content.
private List<List<string>> initialGetSampleRows()
{
List<List<string>> samples = new List<List<string>>();
try
{
int selCount = 0;
var document = SpreadsheetDocument.Open(excelFilePath, false);
var workbookPart = document.WorkbookPart;
var sheets = workbookPart.Workbook.Descendants<Sheet>();
foreach (Sheet sheet in sheets)
{
if (sheet.Name == sheetName)
{
var workSheet = ((WorksheetPart)workbookPart.GetPartById(sheet.Id)).Worksheet;
var sheetData = workSheet.Elements<SheetData>().First();
List<Row> rows = sheetData.Elements<Row>().ToList();
foreach (Row row in rows)
{
var thisRow = new List<string>();
foreach (Cell cell in row.Descendants<Cell>())
{
var cellContent = cell.CellValue;
thisRow.Add((cellContent == null) ? cell.InnerText : cellContent.Text);
}
samples.Add(thisRow);
selCount++;
if (selCount >= sampleRowCount)
{
break;
}
}
break;
}
}
document.Close();
}
catch (Exception ex)
{
...
}
return samples;
}
Does it work? Well there
are a few snags. Dates are (perhaps not unexpectedly) showing as a number,
numbers themselves are fine but cells with a string content are being read as
an integer of some sort.
Turns out that strings
are held in a shared string table so we need to change the code to read that
table when required based on the cell DataType ("s").
foreach (Cell cell in row.Descendants<Cell>())
{
var cellContent = cell.CellValue;
var cellText = (cellContent == null) ? cell.InnerText : cellContent.Text;
var dataType = cell.DataType ?? CellValues.Error;
if (dataType == CellValues.SharedString)
{
cellText = workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(Convert.ToInt32(cell.CellValue.Text)).InnerText;
}
thisRow.Add(cellText);
}
Excel dates are stored
as an integer number of days since 31st December 1899 – so the Excel
date for 1st January 1900 is 1. There is a snag inherited for compatibility
reasons from Lotus – where the year 1900 was assumed not to be a leap year when in
fact (unlike the year 2000) it was. So the numbers are almost always out by
1. Dates are represented as human sensible dates on an Excel spreadsheet by
using a “style”. It is possible to detect a style value for a cell although
this is normally null.
Something like:
uint styleIndex = cell.StyleIndex ?? uint.MinValue;
if (styleIndex == 3)
{
Double serialDate;
if (Double.TryParse(cellText, out serialDate))
{
DateTime cellDate = DateTime.FromOADate(serialDate);
cellText = cellDate.ToString(dateFormat);
}
}
Where we can use the
inbuilt DateTime.FromOADate() function to handle the
conversion, applying string formatting to suite.
There is another snag. It would not make any sense for an
XML based Excel file to include empty cells. This means that when a row of
cells is read the row only contains cells with content. This would present a
problem where empty cells appear (perhaps now and again) before columns of
cells of interest. Fortunately each cell object contains a CellReference that can
be used to extract the Column+Row identity we are familiar with on Excel sheets.
This can be used to detect a cell being presented earlier than expected in a
row.
One solution (now deprecated - see below) is to create an Enumerator that can
manage the detection of missing cells and supply empty ones as required. The class code was re-structured a bit at this point in time as well – partly to make the
code units available to functionality to be added to the class later.
The Enumerator looks like this:
private IEnumerator<Cell> excelCellEnumerator(Row row)
{
int currentCount = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
string columnName = columnNameFromReference(cell.CellReference);
int thisColIndex = convertColumnNameToNumber(columnName);
while(currentCount < thisColIndex)
{
var emptyCell = new Cell()
{
DataType = null,
CellValue = new CellValue(string.Empty)
};
yield return emptyCell;
currentCount++;
}
yield return cell;
currentCount++;
}
}
With support from:
private string columnNameFromReference(string cellReference)
{
var regex = new Regex("[A-Za-z]+");
var match = regex.Match(cellReference);
return match.Value;
}
private int convertColumnNameToNumber(string columnName)
{
char[] colAlphas = columnName.ToCharArray();
Array.Reverse(colAlphas);
int colNumber = 0;
for (int i = 0; i < colAlphas.Length; i++)
{
char alpha = colAlphas[i];
int tAlpha = i == 0 ? alpha - 65 : alpha - 64;
colNumber += tAlpha * (int)Math.Pow(26, i);
}
return colNumber;
}
And with the nitty
gritty of the sample extract function being changed to use the Enumerator thus:
foreach(Row row in rows)
{
var thisRow = new List<string>();
var cellEnumerator = excelCellEnumerator(row);
while (cellEnumerator.MoveNext())
{
var cell = cellEnumerator.Current;
var contentString = readCellContent(cell, workbookPart);
thisRow.Add(contentString);
}
samples.Add(thisRow);
}
And
the reading of the cell content (all code earlier in the post) being moved to a
separate function.
Addendum
A little experimentation indicated that it might also be a
good idea to make provision for Boolean values in an Excel sheet. Where Booleans
exist they are presented in Excel as TRUE and FALSE but the underlying values
are a (reasonable) 1 and 0 respectively. You can detect Boolean values using
the cell DataType which (in this instance) would be set to CellValues.Boolean.
There
is also a CellValues enum CellValues.Date but strangely this is not set in any of my
sample data so the original technique of inspecting the StyleIndex still
stands.
You
might however wish to add the following code snippet to deal with Booleans:
if(dataType == CellValues.Boolean)
{
int bVal;
if(int.TryParse(cellText, out bVal))
{
if(bVal == 0)
{
cellText = "false";
} else
{
cellText = "true";
}
}
}
Parsing the value rather than just casting it may just be
paranoia but…
Further Thought
Of course Yield
return is sheer flummery in this context – although the Iterator was lots
of fun.
Yield should be reserved for computationally expensive processes where
you want to spread the load a bit. In this instance most or all cells are going
to be available in the Row object and so we would be better writing a function
to return the whole list of cells in one go.
private List<Cell> getExcelRowCells(Row row)
{
List<Cell> theCells = new List<Cell>();
int currentCount = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
string columnName = columnNameFromReference(cell.CellReference);
int thisColIndex = convertColumnNameToNumber(columnName);
while (currentCount < thisColIndex)
{
var emptyCell = new Cell()
{
DataType = null,
CellValue = new CellValue(string.Empty)
};
theCells.Add(emptyCell);
currentCount++;
}
theCells.Add(cell);
currentCount++;
}
return theCells;
}
And then consume the list as before but with conventional
List<T> iteration:
foreach (Row row in rows)
{
var thisRow = new List<string>();
List<Cell> rowCells = getExcelRowCells(row);
foreach(Cell cell in rowCells)
{
thisRow.Add(readCellContent(cell, workbookPart));
}
samples.Add(thisRow);
selCount++;
if (selCount >= sampleRowCount)
{
break;
}
}