Name:
Location: United Kingdom

I am a software developer and consultant with more than a quarter of a century of technology change and challenges to draw experience from. While I maintain and exercise some skills from the dark ages of computing I also enjoy taming the new technologies as they turn up – always looking for ways to deliver truly effective software systems to my customers.

Tuesday, July 31, 2007

Data Entry from the Clipboard

One of my customers has a large retail arm with that retail group being managed within a multi-tear hierarchy. Acquisitions and disposals result in regular re-organisations of the management hierarchy and these can be a headache for some of the systems operated in the relevant division. Part of the problem is to do with timing – in the end the company’s central SAP system will reflect the changes but some systems need to be updated before the central accounting and reporting systems. They were therefore looking for a simple method to update an application that could be driven from a primary source such as an Excel spreadsheet.

It would be trivial to load Excel spreadsheet content into an SQL database as a table and then to run the relevant updates but as the source document structure would inevitably change (in a haphazard manner?) this would require the intervention of some technically able resources. Much better to go for a really simple approach – paste some selected columns of data into a grid, validate the content and then update the system from there.

I realised that it was a long time since I had written code to support pasting data and was happy to take a look and see what the .NET framework had to offer that advanced what was available using the Win32 API (this being an area where classic VB was never at the top of it’s game).

As this blog is supposed to be about code – let’s write some.

I started with the easy stuff – a filter to inspect the content of the clipboard to see if the data types available there were generically suitable for pasting into a table – in this instance a ListView control set to act like a grid. The paste option itself was a menu item that was a sub-item within an Edit menu item group so I put some code under the Select (not Click) event for the Edit menu item:

Private Sub mnuEdit_Select(ByVal sender As Object, ByVal e As System.EventArgs) Handles mnuEdit.Select
Dim DataObject As IDataObject = Clipboard.GetDataObject
Dim IsHTML As Boolean = DataObject.GetDataPresent(DataFormats.Html)
Dim IsCSV As Boolean = DataObject.GetDataPresent(DataFormats.CommaSeparatedValue)
Dim IsText As Boolean = DataObject.GetDataPresent(DataFormats.Text)
mnuPaste.Enabled = IsHTML Or IsText Or IsCSV
End Sub

I had decided to support multiple data formats as I could not be sure of the source of the data the user might want to paste – not even sure that it would be from an MS Office document. There is a good range of native data formats available by default – perhaps oddly missing XML but including HTML, CSV and plain old vanilla TEXT). You can copy and paste custom data objects of course (see) but that is outside of this immediate context. Thus the code above enables the paste menu item when any or all of three data types are available on the clipboard. The HTML and Text data types are essentially strings but the detail of the content will change depending upon the data source so any code managing the actual paste event has to be flexible. The Comma Separated Variable (CSV) data type is a stream and needs slightly different treatment but boils down to the simplest data source when available – even if you have to allow for commas within delimited strings (which are a pain).

As you might expect, the content of the HTML data type posted to the clipboard by MS Office programs is full of layout and style information that needs to be ignored but fortunately the tags help you find the essentials – the “cell” values you are looking to paste. I experimented with what turned up when selecting and copying from both MS Excel and MS Word. The Excel contribution is fairly straightforward – if you ignore the crud and stick to the content of the <TR> and <TD> tags.

If you copy columns of data just typed into an MS Word page you get something like:
<P>data value<some whitespace>data value<some whitespace>data value</P>
for each “row” of data.

If you copy some data from an MS Word table then you get an HTML table but with the data cell content also wrapped in <P> tags so something like:

<TR><TD><P>data value</P></TD><TD><P>data value</P></TD></TR>

The plain text equivalents are pretty straightforward. For both Excel and MS Word Tables you get cell content separated by “tab” characters with a row being completed with carriage return and line feed characters. Text from a non-formatted area of an MS Word document is just as you would expect from the equivalent typed into NotePad.

Armed with all this choice, you have to decide on your personal “hierarchy” of data types to accept. I should sound a small note of caution here. Not all programs correctly clear the clipboard when adding data to it. This can mean that a user program could load (say) text data into the clipboard alongside existing data – perhaps in HTML format. This can cause unexpected results and a deal of user frustration. Apart from resolving to always copy data onto the clipboard properly yourself there is not much you can do about this problem although one answer might be to think carefully about your data hierarchy – perhaps taking the simplest data format in preference to those that might “persist” on the clipboard from unintended sources.

Pasting the clipboard content into your grid is a pretty straightforward process that extracts each row of cells in turn from the input string and then adds the rows to the grid. However this is where you also have to consider the “shape” of the data being pasted and perhaps decide how you are going to handle differences between the number of cells in each row being presented and the number of cells in your target grid. In my simple instance I am only looking for two cells in each row and will accept any number of rows – adding then to the grid. In other circumstances you might be writing code that pasts blocks of data into a specified portion of a grid or inserting data into a grid with (notionally) infinite dimensions. You might have to parse all of the rows represented in the paste string to determine the dimensions (in cells and rows) of the data being pasted before inserting the actual cells values – you might need the user to make some choices before proceeding.

For the purposes of this post I have encapsulated the paste functionality within a class that, to be honest, is ridiculously over-engineered for the task in hand but would serve as a platform for further development and exploration. I am sure that you have spotted that what we need to implement is a simple parser – simple because there is no need to support sub-expressions – all we need to do is to detect columns of data within rows. Classically parsers use stacks to parse a given statement but in this instance we are only parsing left to right so we can use a Queue object rather than a Stack object to store the resulting “grid” of data.

The HTML table parser is the most complex of the three parsing routines because its main task is to ignore most of the HTML content as irrelevant to the requirement. You might have thought that you could use the XMLReader class for this task as HTML is a mark-up language but unfortunately HTML is not a “well formed” mark-up language and will quickly cause errors. You could experiment with the SQMLReader available for download from the (still available) GotDotNet user samples site as an alternative. The code shown here for parsing the HTML is just about acceptable but is very close to the edge – probably needs a properly formed recursive parser to handle the task.

VB .NET class to manage pasting data from the clipboard into a grid on a form:

Imports System.IO
Imports System.Text
Public Class GridPaste
#Region " Public Enum"
Public Enum PDataType
CSV = 1
HTML = 2
Text = 4
End Enum
#End Region
#Region " Private Declarations"
Private HTMLString As String
Private CSVString As String = ""
Private TextString As String = ""
Private gotType As Int16 = 0
Private MaxCols As Integer = -1
Private RowCount As Integer = -1
Private GridRows As Queue

#End Region
#Region " Public Constructor Methods and Properties"
Public Sub New()
Dim DataObject As IDataObject = Clipboard.GetDataObject
If DataObject.GetDataPresent(DataFormats.CommaSeparatedValue) Then
Dim SReader As New StreamReader(CType(DataObject.GetData(DataFormats.CommaSeparatedValue), Stream))
CSVString = SReader.ReadToEnd
SReader.Close()
gotType += PDataType.CSV
End If
If DataObject.GetDataPresent(DataFormats.Html) Then
Dim HTMLPaste As New StringBuilder(DataObject.GetData(DataFormats.Html).ToString())
'Upcase selected tags
HTMLPaste.Replace("<td", "<TD")
HTMLPaste.Replace("</td", "</TD")
HTMLPaste.Replace("<tr", "<TR")
HTMLPaste.Replace("</tr", "</TR")
HTMLPaste.Replace("<p", "<P")
HTMLPaste.Replace("</p", "</P")
HTMLPaste.Replace("<table", "<TABLE")
HTMLString = HTMLPaste.ToString
gotType += PDataType.HTML
End If
If DataObject.GetDataPresent(DataFormats.Text) Then
TextString = DataObject.GetData(DataFormats.Text).ToString().Trim
gotType += PDataType.Text
End If
End Sub
Public Function IsFormat(ByVal datatype As PDataType) As Boolean
If gotType And datatype Then
Return True
Else
Return False
End If
End Function
Public Function GetRowCount() As Integer
If RowCount = -1 Then
BestParse()
End If
Return RowCount
End Function
Public Function GetColumnCount() As Integer
If MaxCols = -1 Then
BestParse()
End If
Return MaxCols
End Function
Public Sub PasteToGrid(ByRef aListView As ListView, ByVal datatype As PDataType)
Select Case datatype
Case PDataType.CSV
ParseCSV()
Case PDataType.HTML
If HTMLString.IndexOf("<TABLE") > -1 Then
ParseHTMLTable()
Else
ParseText() ' as that's all you are getting wraped up in <P> tags
End If
Case PDataType.Text
ParseText()
End Select
AppendToGrid(aListView)
End Sub
Public Sub PasteToGrid(ByRef aListView As ListView)
'alternate call that looks for best match that results in columns and rows being pasted
BestParse()
AppendToGrid(aListView)
End Sub
#End Region
#Region " Private Methods"
Private Sub BestParse()
If gotType And PDataType.HTML Then
If HTMLString.IndexOf("<TABLE") > -1 Then
ParseHTMLTable()
End If
End If
CountCells()
If RowCount <= 0 Or MaxCols <= 0 Then
'try next format
If gotType And PDataType.CSV Then
ParseCSV()
CountCells()
End If
If RowCount <= 0 Or MaxCols <= 0 Then
If gotType And PDataType.Text Then
ParseText()
CountCells()
End If
End If
End If

End Sub
Private Sub CountCells()
RowCount = -1
MaxCols = -1
If Not IsNothing(GridRows) Then
Dim ThisRow As Queue
Dim RowEnumerator As Collections.IEnumerator = GridRows.GetEnumerator
RowCount = GridRows.Count
While RowEnumerator.MoveNext
ThisRow = RowEnumerator.Current
If ThisRow.Count > MaxCols Then
MaxCols = ThisRow.Count
End If
End While
End If
End Sub
Private Sub AppendToGrid(ByVal aListView As ListView)
If Not IsNothing(GridRows) Then
Dim ThisRow As Queue
Dim NewRow As ListViewItem
Dim ColNumber As Int16
Dim RowEnumerator As Collections.IEnumerator = GridRows.GetEnumerator
Dim ColEnumerator As Collections.IEnumerator
While RowEnumerator.MoveNext
ThisRow = RowEnumerator.Current
ColEnumerator = ThisRow.GetEnumerator
ColNumber = 0
While ColEnumerator.MoveNext
ColNumber += 1
Select Case ColNumber
Case 1
NewRow = aListView.Items.Add(ColEnumerator.Current)
Case Is <= aListView.Columns.Count
NewRow.SubItems.Add(ColEnumerator.Current)
End Select
End While
End While
End If
End Sub

Private Sub ParseText()
PreProcessText()
GridRows = New Queue
Dim ThisRow As New Queue
Dim NextChar As String, CellContent As String
For CharPos As Integer = 0 To TextString.Length - 1
NextChar = TextString.Substring(CharPos, 1)
Select Case NextChar
Case vbTab
ThisRow.Enqueue(CellContent)
CellContent = ""
Case vbLf
ThisRow.Enqueue(CellContent)
CellContent = ""
GridRows.Enqueue(ThisRow)
ThisRow = New Queue
Case Else
CellContent &= NextChar
End Select
Next
End Sub
Private Sub ParseCSV()
GridRows = New Queue
Dim ThisRow As New Queue
Dim LineEnd As Boolean = False
Dim NextChar As String, CellContent As String = ""
Dim CharPos As Integer = 0, RowEnd As Integer, DelimEnd As Integer
While CharPos < CSVString.Length
NextChar = CSVString.Substring(CharPos, 1)
CharPos += 1
Select Case NextChar
Case Chr(34)
'should be a string enclosed in quotes
RowEnd = CSVString.IndexOf(vbCrLf, CharPos)
If RowEnd = -1 Then
RowEnd = CSVString.Length
End If
DelimEnd = CSVString.IndexOf(Chr(34), CharPos, RowEnd - CharPos)
If DelimEnd > -1 Then
CellContent = CSVString.Substring(CharPos, DelimEnd - CharPos)
DelimEnd += 1
CharPos += DelimEnd - CharPos
Else
CellContent &= NextChar
End If
LineEnd = False
Case ","
ThisRow.Enqueue(CellContent)
CellContent = ""
Case vbCr, vbLf
If Not LineEnd Then
LineEnd = True
ThisRow.Enqueue(CellContent)
CellContent = ""
GridRows.Enqueue(ThisRow)
ThisRow = New Queue
End If
Case Else
CellContent &= NextChar
LineEnd = False
End Select
End While
End Sub
Private Sub ParseHTMLTable()
GridRows = New Queue
Dim ThisRow As Queue
Dim RowStart As Integer, RowEnd As Integer, NextCell As Integer
Dim CellStart As Integer, CellEnd As Integer, PStart As Integer, PEnd As Integer
Dim CellContent As String = "", TagCell As String
Dim GoodRow As Boolean = False
Dim NextRow As Integer = HTMLString.IndexOf("<TR")
While NextRow > -1
If GoodRow Then
GridRows.Enqueue(ThisRow)
End If
ThisRow = New Queue
RowEnd = HTMLString.IndexOf("</TR", NextRow)
GoodRow = RowEnd > -1
NextCell = HTMLString.IndexOf("<TD", NextRow, RowEnd - NextRow)
While NextCell > -1
'we need to find the start of the actual cell content as well as it's end
CellStart = HTMLString.IndexOf(">", NextCell, RowEnd - NextCell)
If CellStart > -1 Then
CellStart += 1
CellEnd = HTMLString.IndexOf("</TD", CellStart, RowEnd - CellStart)
If CellEnd > -1 Then
'we have a cell so
CellContent = HTMLString.Substring(CellStart, CellEnd - CellStart)
'but it might still be wrapped in a <P> tag set or contain even more junk tags yet so
PStart = CellContent.IndexOf("<P")
If PStart > -1 Then
PStart = CellContent.IndexOf(">", PStart)
PEnd = CellContent.IndexOf("</P", PStart)
If PEnd > -1 Then
PStart += 1
CellContent = CellContent.Substring(PStart, PEnd - PStart)
End If
End If
PStart = CellContent.IndexOf("<")
If PStart > -1 Then
PEnd = CellContent.IndexOf(">", PStart)
If PEnd > -1 Then
PEnd = CellContent.IndexOf(">", PEnd + 1)
If PEnd > -1 Then
If PStart > 0 Then
TagCell = CellContent.Substring(0, PStart)
Else
TagCell = ""
End If
If PEnd < CellContent.Length - 2 Then
PEnd += 1
TagCell &= CellContent.Substring(PEnd, CellContent.Length - PEnd)
End If
CellContent = TagCell
End If
End If

End If
ThisRow.Enqueue(CellContent)
CellContent = ""
Else
CellEnd = RowEnd
End If
End If
NextCell = HTMLString.IndexOf("<TD", CellEnd, RowEnd - CellEnd)
End While
NextRow = HTMLString.IndexOf("<TR", RowEnd)
End While
If GoodRow Then
GridRows.Enqueue(ThisRow)
End If
End Sub
Private Sub PreProcessText()
Dim PasteText As New StringBuilder
Dim IsWhite As Boolean = False, IsTerminal As Boolean = False
Dim NextChar As String
For Charloop As Integer = 0 To TextString.Length - 1
NextChar = TextString.Substring(Charloop, 1)
Select Case NextChar
Case " ", vbTab
If Not IsWhite Then
IsWhite = True
PasteText.Append(vbTab)
End If
Case vbCr, vbLf
If Not IsTerminal Then
IsTerminal = True
PasteText.Append(vbLf)
End If
IsWhite = False
Case Else
IsWhite = False
IsTerminal = False
PasteText.Append(NextChar)
End Select
Next
If PasteText.Chars(PasteText.Length - 1) <> vbLf Then
PasteText.Append(vbLf) 'ensure last row is completed
End If
TextString = PasteText.ToString()
End Sub
#End Region
End Class

Implementing the class functionality is pretty trivial – it could be as simple as
   Dim GridPaster As New GridPaste
GridPaster.PasteToGrid(MyListView)
in the code block for the edit menu paste event – although you would probably want to make use of the row and column counting functionality or add additional content validation before proceeding.

Useful extensions – The class could be usefully extended to support DataGrids. Another feature you might like to consider would be “Drag and Drop” support for files. For more general applications, you might very well want to support the overwriting of user selected data in the target grid.

Links:
If you end up processing very large CSV encoded data files you might like to check out Sebastian Lorion’s “Fast CSV Reader” at The Code Project
Parsing HTML using the SQMLReader at GotDotNet

0 Comments:

Post a Comment

Links to this post:

Create a Link

<< Home