Monday, November 16, 2015

And This Week’s Language is F#

What? Why? Are you going to endlessly post indecipherable and incomprehensible snippets and rants on functional programming like those Haskell types? (Just joking guys – honest.) Well at least I can’t bang on about Monads as F# does not have them (‘cos the close equivalent are called “Computation Expressions”)

Continuing on my “meta” travels I started to look at what was available to the .NET developer in the general field of “numerics”. Linear Algebra, Regression, Optimization, Time Series Analysis and beyond – hopefully with a liberal license. There are a lot out there but an interesting subset were clearly developed in F# - time to clone the odd GitHub repository (and also see what Nuget could provide as packages) and see if I could make sense of anything.

You might be interested by Microsoft Research’s “Tabular”

Or indeed

Of course R.NET to “wrap” the R statistical package

And lots more…

Please don’t get the impression these are all F# projects but it does look like F# has a distinct appeal to the people that build them and those that use them.

You may already have F# as an optional programming language available within your copy of Visual Studio but VS2015 installed the option “on the fly” for me after I cloned “Tabular” from GitHub. After that I did a quick search on my C: drive for fsi.exe (the F# interactive console), located it among the Microsoft SDKs and then created a desktop shortcut. Once double clicked, this allowed me to try out some of the tutorial stuff on the web – just to get familiar with the basics of the language. Yes, in that respect at least it does feel a bit like Python – but with the comfortable option to compile code and include it in a C# app. Did I say comfortable? F# is distinctive – say that for it.

At first glance the variable assignment statements above look like those of many other languages – the double semi-colons by the way are a feature of the interactive interpreter as they indicate that a statement is complete – which is a marker for some of what is to come.

First off, variables are not variables but immutable values and are thus referred to as values. Well except when they are not (because sometimes you might need to switch to an imperative style or do stuff like populate and save data records) and that is when you use the mutable keyword:

> Let mutable x = 8;;

With a special mutable assignment operator:

> X <- 23;;
> x;;
Val it : int = 10  \\the fsi output

But I am wandering from the point – stick with values are immutable.

Functions are simply defined:

> let addfunction x y = x + y;;

Which defines a function that takes two integers and returns their integer sum. Did I mention that F# is (very) strongly typed? Probably not. I did not define any types in the code statements above but F# inferred a type – that defaults to integer. I could have written

> let addfunction (x : float) (y : float) = x + y;;

But what I could not have done (unlike most languages) is write

> let addfunction (x : float) ( y : int) = x + y;;
error the type ‘int’ does not match the type ‘float’

F# does not support implicit casts even safe ones like int to int64.

> let addfunction (x : float) ( y : int) = x + (float y);;   // works though, with the explicit cast.

Here is a longer function – our old friend Fibonacci – and no I have not yet figured out how to memoize it:

let rec fib = function
                | 0 -> 0
                | 1 -> 1
                | n -> fib (n – 1) + fib (n – 2)

The “rec” keyword indicates the function is recursive (despite that being obvious) and you can clearly see something similar to a switch statement and “arrow notation” indicating the values to be returned.

 Going back to a simpler form – say:

let addString x y = (x + y).ToString();;

the ToString() bit looks .NET normal shall we call it but if you type it into fsi.exe you get the following response:

val addString : x:int -> y:int -> string

and this needs exploring. The implication is that the second parameter is passed to a function that returns a string and the first integer parameter is passed to a function that returns an integer. This would (broadly) be correct as F# functions only take a single argument and only return a single type. So how is that? F# automatically curries functions. See for a quick dip into currying if you have not played with the technique. Then pause and thing about how that might impact upon a language where functions are not just first class citizens but rather the whole point.

To quote the Wikipedia article “F# is a strongly typed, multi-paradigm programming language that encompasses functional, imperative, and object-oriented programming techniques.” Which maybe understates things a bit as there does not seem much at the cutting edge of language development left out – and all encapsulated within a functional programming paradigm.

For the moment I will be content when I have taken on board enough of the language syntax to be able to follow the relevant math library code but I will be coming back to try and write something non trivial in F#. The syntax is superficially daunting but starts to make sense as the underlying functionality becomes a little clearer. F# feels like it could give you (like Oddball's Sherman) a very nice “edge” (in some of life's more complex coding challenges).


If you are a user of Visual Studio it is easier to run Fsi.exe withing that environment by pressing <ctrl><alt><f> with an F# project open.. The text is clearer as well.

F# is more that somewhat absorbing - although it does suffer from the "this is the only way" crowd banging on a bit. I can see my next side project might be C# for the UI and F# for all the functionality which would be a neat solution solving some of the issues raised by Python - and hey guess what? F# borrowed the concept of "significant white-space" from that very source (plus a few other bits I think).

Friday, October 16, 2015

Constructing and compiling C# using Roslyn

The optimal path for my current project requires the ability to create and use new classes at runtime. The alternatives all look rather clumsy while this approach looked elegant.

.NET has had some capacity to compile arbitrary code since the beginning but a lot of effort has recently gone in to a new compiler – known as Roslyn. On the face of it there are two ways towards my goal – use the CodeDom which is language agnostic or try out the newer (Microsoft) open source Roslyn technology that is designed to compile C# and VB.NET.

CodeDom and Roslyn are only partly related and after some research it looked like Roslyn was the way ahead as it provides a feature complete compiler for my language of choice C#. The other major plus is that Roslyn runs “in process” while CodeDom runs compiles through another executable with (perhaps) a rather clunky API.

If you fancy going the well tried CodeDom way then MSDN has a good code example towards the bottom of this article.

To add the Roslyn tooling to a project use the NuGet Package Manager option on the Tools menu and Choose “Manage NuGet Packages for Solution…” Select as the package source and then search for “Microsoft.CodeAnalysis”. Click on Microsoft.CodeAnalysis.CSharp , click the “Install” button and when offered, accept the license.

The actual development task turned out to be pretty steady.

I started with a class to define the properties of the class to be constructed

internal class ClassProperty {     public string PropertyName { get; set; } = "";     public string DataType { get; set; } = "";     public ClassProperty() { }     public ClassProperty(string propertyName, string dataType)     {         PropertyName = propertyName;         DataType = dataType;     } }
And supplied a populated List<ClassProperty> to a new class that I was going to use to construct and compile the code. In fact I created a little method to loop through the fields of a database table and populate the list (as that was the direction I was going in and I am lazy) but a List can be quickly populated with just a few lines of code.

The new class also had properties exposed for a namespace, class name, using directives, inheritance and interfaces. (The latter two are part implemented in the following code in that they are added to the class declaration but any related methods remain unimplemented at this stage.)

I backed up the property creation by creating a Dictionary of default values and you can see how it is applied after this.

private Dictionary<string, string> defaultValues = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {     ["string"] = " = String.Empty;",     ["Int64"] = " = 0;",     ["Int32"] = " = 0;",     ["Int16"] = " = 0;",     ["DateTime"] = " = new DateTime(1970,1,1);",     ["double"] = " = 0;",     ["single"] = " = 0;",     ["Decimal"] = " = 0;",     ["bool"] = " = false;",     ["Boolean"] = " = false;",     ["int"] = " = 0;",     ["long"] = " = 0;" };

The method to create the code is pretty short.
private string buildClassCode() {     StringBuilder sb = new StringBuilder();     foreach(string uses in usings)     {         sb.AppendLine("using " + uses + ";");     }     sb.AppendLine("namespace " + nameSpace);     sb.AppendLine("{"); // start namespace     string classInherits = (inherits.Length > 0 || interfaces.Count > 0) ? " : " + inherits : "";     foreach(string inface in interfaces)     {         classInherits += (classInherits.Length > 3) ? ", " + inface : inface;     }     sb.AppendLine($"public class {tableName}{classInherits}" );     sb.AppendLine("{"); // start class     sb.AppendLine($"public {tableName}()"); // default constructor     sb.AppendLine("{}");     foreach (ClassProperty newProperty in classProperties)     {         sb.AppendLine($"public {newProperty.DataType} {newProperty.PropertyName} {"{ get; set;}"}{defaultValues[newProperty.DataType]}");     }     sb.AppendLine("}"); // end class     sb.AppendLine("}"); // end namespace     return sb.ToString(); }
I tested this by popping the output into a multi-line textbox. while the code lacks indentation it is perfectly clear as the image below shows

Now to get that code compiled and run an initial test using VS Debug mode to step through the code and watch the results.

private void buildinstance() {     string instance = buildClassCode();     SyntaxTree syntaxTree = CSharpSyntaxTree.ParseText(instance);     // inspect the tree     //SyntaxNode root = syntaxTree.GetRoot();     //foreach (var node in root.DescendantNodes())     //{     //    var x = node.GetText();     //}     string assemblyName = tableName; //Path.GetRandomFileName();     MetadataReference[] references = new MetadataReference[]     {         MetadataReference.CreateFromFile(typeof(object).Assembly.Location),         MetadataReference.CreateFromFile(typeof(Enumerable).Assembly.Location)     };     CSharpCompilation compilation = CSharpCompilation.Create(         assemblyName,         syntaxTrees: new[] { syntaxTree },         references: references,         options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary)); // you can also build as a console app, windows.exe etc     Assembly assembly = null;     var ms = new MemoryStream();     EmitResult eResult = compilation.Emit(ms);     if (eResult.Success)     {                  ms.Seek(0, SeekOrigin.Begin);         assembly = Assembly.Load(ms.ToArray());         Type type = assembly.GetType(nameSpace + "." + tableName);         object newObject = Activator.CreateInstance(type);         // now we can prove that worked by stepping through the code while iterating over the properties         foreach (PropertyInfo propertyInfo in newObject.GetType().GetProperties())         {             string pName = propertyInfo.Name;             string dataType = propertyInfo.PropertyType.Name;             // and test we can assign a value             if (dataType.Equals("String", StringComparison.OrdinalIgnoreCase))             {                 propertyInfo.SetValue(newObject, "foo", null);             }         }     }     else     {         IEnumerable<Diagnostic> failures = eResult.Diagnostics.Where(diagnostic =>             diagnostic.IsWarningAsError || diagnostic.Severity == DiagnosticSeverity.Error);         string msg = "";         foreach (Diagnostic diagnostic in failures)         {             msg+= $"{diagnostic.Id}: {diagnostic.GetMessage()}" + "\r\n";         }         // do something useful with the message     }     ms.Close();     ms.Dispose(); }
Ran first time, although I went back and added the diagnostics you can see just to step throght things as they happened.

The SyntaxTree built a structure for the code elements from the string. The CSharpCompilation object took that syntax tree and compiled the code as C#. The compiled assembly was saved into a memory stream and the compilation results checked for errors. Assuming no errors, the code was loaded into an Assembly object and a new instance of the compiled object (class in this instance) created and inspected.

Next I checked that I could execute a method on the compiled class. I added the following line to the class code creation method:

sb.AppendLine("public string getBar(string foo) {return foo + \" bar\";}");

and a code line to call the method after the object is created in the buildinstance() method tested above

string res = (string)type.InvokeMember("getBar", BindingFlags.Default | BindingFlags.InvokeMethod,     null, newObject, new object[] { "foo" });

which returned the string “foo bar” to the variable res.

Job done.


Thought I would check that the default class constructor was executed when the compiled object instance was created. Thought it must be but…

Adjusted the class constructor to set one of the properties and then read that back when looping over the properties. Worked as expected.

The Activator.CreateInstance() method can be called with a parameter list to actuate alternate constructors with their parameters thus:

public static object CreateInstance(Type type, params object[] args)

which probably covers that issue as an Interface can’t contain a constructor but can contain any other methods I want to expose. Clearly, there is a wee bit more code to write to get a practical implementation but it is also clear that the overall approach is very practical.

And Then The Real World

In the real world your compiled class will probably reference one or more System/.NET dll, a custom library or your new class may inherit from another class already defined in the main program assembly.

You can add such references but you need to supply the path to the relevant dll or exe file. Apparently this changed from early Roslyn releases so (like me) you might find many obsolete methods if you Google around the subject. You also need to include the relevant "using" statements in the code itself of course (see above).

In my first stab at this the CSharpCompilation constructor was supplied with a default array of MetadataReference objects. This array can be extended to include whatever is required.

The default location for the main System dlls can be found with this line of code:

var assemblyPath = Path.GetDirectoryName(typeof(object).Assembly.Location);

You can then construct a path to (say) System.dll and Syatem.Data.dll like so:

string sysTm = Path.Combine(assemblyPath, "System.dll"); string sysData = Path.Combine(assemblyPath, "System.Data.dll");

You can similarly construct the path to the classes in your own program as well (assuming they are public):

string mePath = Path.Combine(Application.StartupPath, "MyProg.exe");

The paths can then be used to add the references like so:

MetadataReference[] references = new MetadataReference[] {     MetadataReference.CreateFromFile(typeof(object).Assembly.Location),     MetadataReference.CreateFromFile(typeof(Enumerable).Assembly.Location),     MetadataReference.CreateFromFile(sysTm),     MetadataReference.CreateFromFile(sysData),     MetadataReference.CreateFromFile(mePath) };

And that is how to add those vital references to your Roslyn compiled assembly.

Tuesday, October 13, 2015

Interfaces and Unknown Classes

I wanted to make use of the ExcelReader class I created as a source of data to create class instances.

Generally the classes in question inherit from a base class that also knows how to save itself to a database. I could, of course maintain a list of such classes within my Excel reading code and call an appropriate routine as required. However I had in mind to create something that was generic enough to handle a whole raft of classes – including some created and compiled “on the fly” and thus unknowable to the source code.

The normal method for dealing with a range of classes through a common function would be to specify an interface common to all and then to interact through that interface.

Interfaces (if this is an unfamiliar term)

An interface is a list of public properties and methods (also indexers and events) that should must be exposed by a class that implements the interface. Interfaces themselves can be inherited but contain no actual code – just the signatures (if you like) of the components being specified.

One of the simplest ways of creating an interface is to (VS2015) right click on the class name in the code window and select “Extract Interface” from the “Quick Actions” context menu. Earlier versions of Visual Studio offer a “Refactor” menu option that includes extracting an interface.

Requiring a specific Interface is an excellent way of ensuring that objects calling the functionality of a class you provide supply full support. You can rely upon a method or event handler being available – even if it just a “stub”. IComparable would be a good example of just such an interface.

I have used interfaces in the past for testing alternate algorithms. An interface allowed me to construct a standardised software socket into which I could plug different classes at will. I could thus test performance and function of alternate approaches without having to change the code “consuming” the service/process.

If I come across a great in depth web page on Interfaces I will post a link here but for a couple of helpful examples you can do a lot worse than this C# Station post 

Just about the only thing my target classes have in common is an Update method. This limits the scope of any Interface a bit. The properties could be addressed through reflection and it was a reasonable bet that a class Clone() method could be used to create new class instances. So that was beginning to sound like a workable strategy.

So my interface ended up being defined like this:

interface ObjectClone {      object CloneAsObject();      void UpdateClass(); }

Each class implementing the interface needs then to add a CloneAsObject method alongside an UpdateClass method. The CloneAsObject() method is very simple:

public object CloneAsObject() {     return (MyClassName)this.MemberwiseClone(); }

There is an assumption here that all such classes expose public properties for all of the key data elements.

I defined an ExcelDataLoader class that will use an instance of the ExcelReaderclass to gather data from selected columns in a spreadsheet and use this data to create instances of a class that implements that very simple ObjectClone Interface.

The ExcelDataLoader class needs a local definition and a public property:

private ObjectClone targetClass = null; public ObjectClone TargetClass {     set { targetClass = value; } }
To pass in a new instance of the required class.

Plus I also needed a way of passing the required instructions to the ExcelDataLoader. A simple class to define the required columns and the associated data name and type seemed the thing and that could be as simple as:

public class ExcelColumnData : IComparable {     public string ColumnName { get; set; } = "";     public string DataName { get; set; } = "";     public ExcelColumnData()     {     }     public ExcelColumnData(string ColumnName, string DataName)     {         this.ColumnName = ColumnName;         this.DataName = DataName;     }     public int CompareTo(object objx)     {         ExcelColumnData x = (ExcelColumnData)objx;         int res = this.ColumnName.Length.CompareTo(x.ColumnName.Length);         return (res == 0) ? this.ColumnName.CompareTo(x.ColumnName) : res; // AA after Z     } }

Although you will note that this class implements an interface – Icomparable so that List<T>Sort() can be used to ensure the columns are presented in order.

These column definitions could then be passed as a List<ExcelColumndData> to the ExcelDataLoader class.

The ExcelDataLoader class loadData() method uses an instance of the ExcelReader class to extract some columns of data from a specified sheet and then creates an instance of the class to be created. It then uses reflection to loop through the class properties and if one of these matches a specified data name associated with an Excel data column then the expected data type is checked and an attempt made to cast (or parse) the data extracted from Excel to the required type. If the parse works then the property is set to the excel value.

private int loadData() {     int recCount = 0;     excelReader = new ExcelReader(excelFilepath, sheetName);     List<string> colNames = new List<string>();     if(excelColumns.Count > 1) { excelColumns.Sort(); }     foreach (ExcelColumnData eColData in excelColumns)     {         colNames.Add(eColData.ColumnName);     }     List<List<string>> selCols = excelReader.GetSelectedColumns(colNames, skipFirstRow);     foreach(List<string> excelRow in selCols)     {         ObjectClone newObject = (ObjectClone)targetClass.CloneAsObject();         foreach (PropertyInfo propertyInfo in newObject.GetType().GetProperties())         {             if (propertyInfo.CanWrite)             {                 string propertyName = propertyInfo.Name;                 string dataType = propertyInfo.PropertyType.Name;                 for(var eci=0; eci< excelColumns.Count; eci++)                 {                     if (excelColumns[eci].DataName.Equals(propertyName, StringComparison.OrdinalIgnoreCase))                     {                         // the strategy is to set the Property value if the incoming data is valid for the Property type                         // otherwise the value is left as the class default setting                         if (dataType.Equals("String", StringComparison.OrdinalIgnoreCase))                         {                             propertyInfo.SetValue(newObject, excelRow[eci], null);                         } else if (dataType.Equals("Int64", StringComparison.OrdinalIgnoreCase))                         {                             long intVal = 0;                             if (long.TryParse(excelRow[eci], out intVal))                             {                                 propertyInfo.SetValue(newObject, intVal, null);                             }                         } else if (dataType.Equals("Int32", StringComparison.OrdinalIgnoreCase))                         {                             int intVal = 0;                             if (int.TryParse(excelRow[eci], out intVal))                             {                                 propertyInfo.SetValue(newObject, intVal, null);                             }                         } else if (dataType.Equals("Int16", StringComparison.OrdinalIgnoreCase))                         {                             Int16 intVal = 0;                             if (Int16.TryParse(excelRow[eci], out intVal))                             {                                 propertyInfo.SetValue(newObject, intVal, null);                             }                         } ... etc                          '''                          ...                 }             }         }         newObject.UpdateClass();         recCount++;     }     return recCount; }

And you can pump data from Excel (or anywhere else) to create classes that are anonymous as far as the process is concerned and (in this instance) call a method to retain the data in a database for future use.

Usual caveats as this is prototype (although working) code and my production version will inevitably grow "horns" over time as I deal with edge cases and extra features.

Friday, October 09, 2015

Reading Excel with the Open XML SDK

A while back I posted Look Mum no Office on the subject of the Microsoft Open XML SDK that I had used for writing an Excel file without the need for a copy of Office on the machine running the code. This proved to be very straightforward and there was some excellent samplecode available 

Yesterday I needed to read data from an Excel workbook and decided to write a class that I could use in multiple scenarios to grab data from this source. Using the Open XML SDK looked the way to go as this should work with all Excel files produced by versions after Office 2003.

First use the Package Manager to grab the support libraries

PM> Install-Package DocumentFormat.OpenXml

And then add WindowsBase.dll to the project references (it is in the Framework Assembly list so easy to spot using the “Add Reference” dialog.

My first Excel reading class function was to read the list of Sheet Names in a workbook.
private List<string> getSheetNames() {     List<string> sheetNames = new List<string>();     try     {         var document = SpreadsheetDocument.Open(excelFilePath, false);         var workbookPart = document.WorkbookPart;         var sheets = workbookPart.Workbook.Descendants<Sheet>();         foreach(Sheet sheet in sheets)         {             sheetNames.Add(sheet.Name);         }         document.Close();     }     catch (Exception ex)     {         ...     }     return sheetNames; }

This was a good indicator that this process was going to be pretty steady although there were a couple of twists coming up (with all making sense when thought about).

I wanted my class to have two further main features. One to take a sample of rows from a specified sheet and another to extract defined columns from a specified sheet. As both cover similar ground, let’s look at the development of the sample row extract.

The process is 1. Identify the worksheet, 2. Grab the sheet, 3. Extract the sheet data and then 4 Loop through the Rows in the sheet data. Within each row we can loop through the cells and read the string content.

private List<List<string>> initialGetSampleRows() {     List<List<string>> samples = new List<List<string>>();     try     {         int selCount = 0;         var document = SpreadsheetDocument.Open(excelFilePath, false);         var workbookPart = document.WorkbookPart;         var sheets = workbookPart.Workbook.Descendants<Sheet>();         foreach (Sheet sheet in sheets)         {             if (sheet.Name == sheetName)             {                 var workSheet = ((WorksheetPart)workbookPart.GetPartById(sheet.Id)).Worksheet;                 var sheetData = workSheet.Elements<SheetData>().First();                 List<Row> rows = sheetData.Elements<Row>().ToList();                 foreach (Row row in rows)                 {                     var thisRow = new List<string>();                     foreach (Cell cell in row.Descendants<Cell>())                     {                         var cellContent = cell.CellValue;                         thisRow.Add((cellContent == null) ? cell.InnerText : cellContent.Text);                     }                         samples.Add(thisRow);                     selCount++;                     if (selCount >= sampleRowCount)                     {                         break;                     }                 }                 break;             }         }         document.Close();     }     catch (Exception ex)     {         ...     }     return samples; }

Does it work? Well there are a few snags. Dates are (perhaps not unexpectedly) showing as a number, numbers themselves are fine but cells with a string content are being read as an integer of some sort.

Turns out that strings are held in a shared string table so we need to change the code to read that table when required based on the cell DataType ("s"). 

foreach (Cell cell in row.Descendants<Cell>()) {     var cellContent = cell.CellValue;     var cellText = (cellContent == null) ? cell.InnerText : cellContent.Text;     var dataType = cell.DataType ?? CellValues.Error;     if (dataType == CellValues.SharedString)     {         cellText = workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(Convert.ToInt32(cell.CellValue.Text)).InnerText;     }     thisRow.Add(cellText); }

Excel dates are stored as an integer number of days since 31st December 1899 – so the Excel date for 1st January 1900 is 1. There is a snag inherited for compatibility reasons from Lotus – where the year 1900 was assumed not to be a leap year when in fact (unlike the year 2000) it was. So the numbers are almost always out by 1. Dates are represented as human sensible dates on an Excel spreadsheet by using a “style”. It is possible to detect a style value for a cell although this is normally null.

Something like:

uint styleIndex = cell.StyleIndex ?? uint.MinValue; if (styleIndex == 3) {     Double serialDate;     if (Double.TryParse(cellText, out serialDate))     {         DateTime cellDate = DateTime.FromOADate(serialDate);         cellText = cellDate.ToString(dateFormat);     } }

Where we can use the inbuilt DateTime.FromOADate() function to handle the conversion, applying string formatting to suite.

There is another snag. It would not make any sense for an XML based Excel file to include empty cells. This means that when a row of cells is read the row only contains cells with content. This would present a problem where empty cells appear (perhaps now and again) before columns of cells of interest. Fortunately each cell object contains a CellReference that can be used to extract the Column+Row identity we are familiar with on Excel sheets. This can be used to detect a cell being presented earlier than expected in a row.

One solution (now deprecated - see below) is to create an Enumerator that can manage the detection of missing cells and supply empty ones as required. The class code was re-structured a bit at this point in time as well – partly to make the code units available to functionality to be added to the class later.

The Enumerator looks like this:

private IEnumerator<Cell> excelCellEnumerator(Row row) {     int currentCount = 0;     foreach (Cell cell in row.Descendants<Cell>())     {         string columnName = columnNameFromReference(cell.CellReference);         int thisColIndex = convertColumnNameToNumber(columnName);         while(currentCount < thisColIndex)         {             var emptyCell = new Cell()             {                 DataType = null,                 CellValue = new CellValue(string.Empty)             };             yield return emptyCell;             currentCount++;         }         yield return cell;         currentCount++;     } }

With support from:

private string columnNameFromReference(string cellReference) {     var regex = new Regex("[A-Za-z]+");     var match = regex.Match(cellReference);     return match.Value; } private int convertColumnNameToNumber(string columnName) {     char[] colAlphas = columnName.ToCharArray();     Array.Reverse(colAlphas);     int colNumber = 0;     for (int i = 0; i < colAlphas.Length; i++)     {         char alpha = colAlphas[i];         int tAlpha = i == 0 ? alpha - 65 : alpha - 64;         colNumber += tAlpha * (int)Math.Pow(26, i);     }     return colNumber; }

And with the nitty gritty of the sample extract function being changed to use the Enumerator thus:

foreach(Row row in rows) {     var thisRow = new List<string>();     var cellEnumerator = excelCellEnumerator(row);     while (cellEnumerator.MoveNext())     {         var cell = cellEnumerator.Current;         var contentString = readCellContent(cell, workbookPart);         thisRow.Add(contentString);     }     samples.Add(thisRow); }

And the reading of the cell content (all code earlier in the post) being moved to a separate function.


A little experimentation indicated that it might also be a good idea to make provision for Boolean values in an Excel sheet. Where Booleans exist they are presented in Excel as TRUE and FALSE but the underlying values are a (reasonable) 1 and 0 respectively. You can detect Boolean values using the cell DataType which (in this instance) would be set to CellValues.Boolean.

There is also a CellValues enum CellValues.Date  but strangely this is not set in any of my sample data so the original technique of inspecting the StyleIndex still stands.

You might however wish to add the following code snippet to deal with Booleans:

if(dataType == CellValues.Boolean) {     int bVal;     if(int.TryParse(cellText, out bVal))     {         if(bVal == 0)         {             cellText = "false";         } else         {             cellText = "true";         }     } }

Parsing the value rather than just casting it may just be paranoia but…

Further Thought

Of course Yield return is sheer flummery in this context – although the Iterator was lots of fun. 

Yield should be reserved for computationally expensive processes where you want to spread the load a bit. In this instance most or all cells are going to be available in the Row object and so we would be better writing a function to return the whole list of cells in one go.

private List<Cell> getExcelRowCells(Row row) {      List<Cell> theCells = new List<Cell>();      int currentCount = 0;      foreach (Cell cell in row.Descendants<Cell>())      {          string columnName = columnNameFromReference(cell.CellReference);          int thisColIndex = convertColumnNameToNumber(columnName);          while (currentCount < thisColIndex)          {              var emptyCell = new Cell()              {                  DataType = null,                  CellValue = new CellValue(string.Empty)              };              theCells.Add(emptyCell);              currentCount++;          }          theCells.Add(cell);          currentCount++;      }      return theCells; }

And then consume the list as before but with conventional List<T> iteration:

foreach (Row row in rows) {     var thisRow = new List<string>();     List<Cell> rowCells = getExcelRowCells(row);     foreach(Cell cell in rowCells)     {         thisRow.Add(readCellContent(cell, workbookPart));     }     samples.Add(thisRow);     selCount++;     if (selCount >= sampleRowCount)     {         break;     } }