Loading the Heritage Health Prize Data

April 5, 2011
As I have significant interests in Public Health, Informatics, Data Integration and the like, the Heritage Health Prize is a great opportunity to participate in a competition that leverages all of those skills.  The $3 million prize isn't too shabby either.  I don't have much confidence that I could win it; let's face it, the world's best will be plugging away at this one, but that's not what matters here.  I'm using this competition as an opportunity to test out my custom-built ETL libraries and integrate some new statistics libraries I'm evaluating.  It will be fun.
In fact, it's been fun already.  In about 20 minutes I was able to develop a data import utility and complete a full load of the first release of the Heritage sample data.  I even had time to make it look nice.
A few people have expressed interest in my new ETL library, and while it's not ready for release yet, I can give you a glimpse of it here in the context of this data import utility.  The data released for this challenge will eventually include multiple files containing sample medical claims data.  Here is the code I used to load them:
// load claims data
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };

string[] claimFiles = Directory.GetFiles(textSourceFileDirectory.Text, "Claims*.csv");
foreach (string filename in claimFiles)
{
	GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
		new TextFileSource(filename, 1),
		new DelimitedTextParser(',')
	);

	SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");

	Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
	DatapipeResults claimResults = claimPipe.Process();
	Log("Claim " + claimResults.Summary);
}
So then... let's break it down.
I start off by creating all the transformations involved in processing this file.  I do it outside the loading loop for performance reasons; there's no reason to recreate them for each file since the files all share the same layout (in fact, right now there is only one claims file, but that will change in the future).
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };
You'll notice two of the transforms I have available in this example.  The first is the InsertColumn transform, which does what you would expect: it inserts a column into the record at the specified index.  Here I'm inserting a column at position 0 to create a system identifier in my database.
The second transform I'm using here is the ManipulateColumns transform.  It started out as a pile of separate transforms for data type conversions, but I quickly realized my transform library would end up with far too many one-off converters and still never cover every really specific custom data manipulation.  That led me to create the ManipulateColumns class.  It has a few overloads, but basically it lets you specify a set of columns and a Func<object, object> delegate.  That way you can create whatever transformation is necessary and pass it in to be applied to those columns of the record.  I then pack the transformations into a List (any IEnumerable would work) to pass into my Datapipe, which we'll get to later.
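SafeIntConvert and SafeBigIntConvert are just small helpers matching the Func<object, object> signature that ManipulateColumns expects; something along these lines (a simplified sketch rather than the exact code):
// Simplified sketches of safe conversion helpers compatible with the
// Func<object, object> signature ManipulateColumns expects.  In this sketch,
// values that can't be parsed fall back to DBNull rather than throwing.
private object SafeIntConvert(object value)
{
	int result;
	string text = value == null ? string.Empty : value.ToString().Trim();
	if (int.TryParse(text, out result))
		return result;
	return DBNull.Value;
}

private object SafeBigIntConvert(object value)
{
	long result;
	string text = value == null ? string.Empty : value.ToString().Trim();
	if (long.TryParse(text, out result))
		return result;
	return DBNull.Value;
}
Anything matching that signature can be handed to AddManipulations, so the same mechanism covers trimming, date parsing, lookups or whatever else a column needs.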
The foreach loop is just grabbing all the files matching the filter, so that's enough about that.  Let's get to what is inside the loop.
GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
	new TextFileSource(filename, 1),
	new DelimitedTextParser(',')
);
So then, a bit of background.  When I first wrote my ETL library, it was awful; I coupled the source and the parsing of the source way too tightly.  In my current version I have raw sources and parsers.  The raw sources are the raw input to the ETL process; right now they can come from tables, files, even URLs.  They implement a generic interface that lets the library use just about anything as a data source, but that's a topic for another day.  The parsers are responsible for taking raw input and converting it into a Record.  Records are a very simple class I use to standardize data moving through the pipeline.  Put a raw source and a parser together and you get a RecordSource.  I have a few different implementations of raw sources and parsers; in this example I'm using a TextFileSource, which takes a URI to the file and, as a second parameter, the number of rows to skip before processing begins.  The parser in this case is a DelimitedTextParser, which breaks the raw input into columns using a delimiter.  This file is comma delimited, so I pass in the appropriate character.
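To make that composition a little more concrete, here is a stripped-down sketch of the shape of those pieces; the class, interface and member names below are illustrative stand-ins rather than the real ones from the library.
using System.Collections.Generic;

// Illustrative shapes only; the real classes differ in the details.
public class Record
{
	// A standardized row moving through the pipeline: an ordered set of column values.
	public List<object> Columns = new List<object>();
}

public interface IRawSource<T>
{
	IEnumerable<T> ReadRaw();		// raw lines, table rows, web responses, etc.
}

public interface IParser<T>
{
	Record Parse(T raw);			// turn one raw item into a Record
}

public class SimpleRecordSource<T>
{
	private readonly IRawSource<T> source;
	private readonly IParser<T> parser;

	public SimpleRecordSource(IRawSource<T> source, IParser<T> parser)
	{
		this.source = source;
		this.parser = parser;
	}

	public IEnumerable<Record> Records()
	{
		// Stream records one at a time; nothing is buffered beyond the current row.
		foreach (T raw in source.ReadRaw())
			yield return parser.Parse(raw);
	}
}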
SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");
This destination is very straightforward: it's a table in a SQL Server database.  The parameters are the connection string and the target table to load into.
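A destination like this doesn't need much machinery; one way to build it is to point SqlBulkCopy at the target table.  Here is a rough sketch of that approach (my illustration, not the actual class):
using System.Data;
using System.Data.SqlClient;

// Rough sketch of a SQL Server destination; a real implementation would also
// handle schema mapping, batching and error reporting.
public class SimpleSqlServerTable
{
	private readonly string connectionString;
	private readonly string tableName;

	public SimpleSqlServerTable(string connectionString, string tableName)
	{
		this.connectionString = connectionString;
		this.tableName = tableName;
	}

	public void Write(DataTable rows)
	{
		using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
		{
			bulkCopy.DestinationTableName = tableName;
			bulkCopy.WriteToServer(rows);
		}
	}
}
And finally...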
Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
DatapipeResults claimResults = claimPipe.Process();
The Datapipe is where the work happens.  Here I initialize my Datapipe with the source, destination and transforms I just detailed.  Then I call Process, capturing the results in a DatapipeResults, which provides information on the number of rows processed, time elapsed, and so on.
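Conceptually, Process just walks the source, runs each record through the transforms in order, hands the result to the destination and keeps count along the way.  Something along these lines, using simplified stand-ins for the real interfaces:
using System.Collections.Generic;
using System.Diagnostics;

// Minimal stand-ins, just to illustrate the flow through the pipe.
public interface IRecordSource { IEnumerable<List<object>> Records(); }
public interface IRecordDestination { void Write(List<object> record); }
public interface ITransformation { List<object> Apply(List<object> record); }

public class SimpleDatapipe
{
	private readonly IRecordSource source;
	private readonly IRecordDestination destination;
	private readonly IEnumerable<ITransformation> transformations;

	public SimpleDatapipe(IRecordSource source, IRecordDestination destination,
		IEnumerable<ITransformation> transformations)
	{
		this.source = source;
		this.destination = destination;
		this.transformations = transformations;
	}

	public string Process()
	{
		Stopwatch timer = Stopwatch.StartNew();
		int count = 0;

		foreach (List<object> record in source.Records())
		{
			List<object> current = record;
			foreach (ITransformation transform in transformations)
				current = transform.Apply(current);		// apply each transform in order

			destination.Write(current);
			count++;
		}

		timer.Stop();
		return count + " rows in " + timer.Elapsed;		// summary along the lines of DatapipeResults.Summary
	}
}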
As usual, hit me up via email or in the comments if you have any questions.

Chainsaw.Net with Style

March 29, 2011
I've been working pretty diligently during my nights at finishing version 2.0 of my custom ETL engine.  It's coming along nicely and I'm pretty happy with how this revision is shaping up.  Look for more on my custom ETL engine in the near future, but today I'm revisiting my .NET implementation of Chainsaw, a log viewer for log4net.  As some of you might remember, I finished an initial version of the viewer a few weeks back.  I've been using it pretty heavily during the development of my ETL engine and I got tired of it looking crappy.  Tonight I took the night off from ETL development to "beautify" the log4net reporting viewer and I have to say, it's easier on the eyes.
If you compare the two screenshots, you'll notice that I moved all my controls up into a single bar at the top of the page and incorporated the jQuery UI button components to replace the basic checkboxes, radio buttons and drop-downs.  I've also removed the logger tree as a permanent fixture on the page; it now toggles via a button in the top left that gives you access to the tree to make your logger selections.
That screenshot is full resolution if you open it in a new tab, which makes it easier to check out the new look and feel.
I found that it can be a bit tricky to pull state out of these modified controls, but it only required a few changes to the JavaScript.  Notably, pulling the value out of the converted checkboxes and radio buttons required a slightly more sophisticated selector, which I'll share here in case you find yourself in a similar situation.
var sort_order = $('input:radio[name=radio-message-order]:checked').val(); 
var per_page = $('input:radio[name=option-perpage]:checked').val();
I played with swapping around some of the themes available from jQuery UI, but they don't seem to want to play all that nicely, which likely means I'm doing something wrong or the browsers are caching things. Either way, it will have to wait for another day.
Oh yeah, I also discovered that the comment system wasn't exactly working.  I had CAPTCHA enabled but not the CAPTCHA tool itself.  Whoops; it's working now, so comment away.  Now, back to the real stuff.

I couldn't find a chainsaw, so...

March 2, 2011

I recently started working on a web-based reporting console for log4net, since I couldn't find a .NET version of Chainsaw, the Java-based app that serves a similar purpose.

For those who aren't familiar, log4net is a .NET port of the log4j logging framework.  You can find out more about the framework and download the bits from the Apache site.  Essentially, it's a logging library that you can use instead of rewriting the way you log for each project you develop.  What's really great about it is that it is completely abstracted from the storage layer, meaning you can log your messages to the local file system, just about any relational database, the Windows event log, even email via SMTP, all without changing your code.
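
If you haven't used it, the day-to-day API is tiny: you grab a logger for your class and call level-specific methods on it, and the configuration decides where those messages actually end up.  A quick illustration (the ClaimsLoader class is just for show):

using System;
using log4net;

public class ClaimsLoader
{
	// The usual pattern: one static logger per class, named after the class.
	private static readonly ILog log = LogManager.GetLogger(typeof(ClaimsLoader));

	public void Load(string filename)
	{
		log.Info("Starting load of " + filename);
		try
		{
			// ... do the actual work ...
			log.Debug("Finished loading " + filename);
		}
		catch (Exception ex)
		{
			// Where this ends up (file, database, event log, email) is purely a configuration decision.
			log.Error("Load failed for " + filename, ex);
		}
	}
}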

My reporting console isn't nearly as flexible, but it works for my purposes.  I've set up an ASP.NET web application that uses Entity Framework to access the SQL Server-based logging data created by log4net, which I obviously have configured to use SQL Server for storage.  The data is exposed through a web service and accessed via ajax from the front end.
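
For the curious, the service sitting behind that ajax endpoint is shaped roughly like the sketch below; the parameter names mirror the ajax call further down, but treat the details as illustrative rather than exact.

using System.ServiceModel;
using System.ServiceModel.Activation;

// Roughly the shape of the service behind LoggerDetails.svc; names and details
// here are illustrative.  The endpoint is exposed as an ASP.NET AJAX-style JSON
// service, which is why the javascript reads the response out of data.d.
[ServiceContract(Namespace = "")]
[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class LoggerDetails
{
	[OperationContract]
	public string GetMessages(string ids, int sort, int perpage, string mask)
	{
		// Query the log4net table (via Entity Framework), apply the sort, the
		// per-page limit and the column mask, then return the rendered message
		// table markup for the page to drop into the report area.
		return BuildMessageTable(ids, sort, perpage, mask);
	}

	private string BuildMessageTable(string ids, int sort, int perpage, string mask)
	{
		// ... Entity Framework query and HTML generation elided ...
		return string.Empty;
	}
}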

I've integrated a jQuery extension called jstree to provide a hierarchical way to filter which loggers' messages to display in the report, which is actually somewhat similar to what the Java-based log viewer Chainsaw does in its reporting.  Here is a screenshot of the reporting interface:

The top left shows the treeview courtesy of jstree.  To the right of the tree are a few simple reporting options I have implemented so far including a sort order for the log messages, a limit on the number of messages for the interface to download and filters on which columns to include in the report.  

I'm pulling the state of the options using jQuery and then putting them all together in order to make the ajax call to the web service.  Gotta give a shout out to the folks over at ElegantCode for having a nice quick tutorial on reading the state of select objects, since I always forget how to do that. Check out the basics of the data update routine below:

var sort_order = $('.option-sortorder').val();
var per_page = $('.option-perpage').val();

var bit_id = ($(".option-mask-id")[0].checked) ? '1' : '0';
var bit_date = ($(".option-mask-date")[0].checked) ? '1' : '0';
var bit_level = ($(".option-mask-level")[0].checked) ? '1' : '0';
var bit_logger = ($(".option-mask-logger")[0].checked) ? '1' : '0';
var bit_message = ($(".option-mask-message")[0].checked) ? '1' : '0';

var mask = bit_id + bit_date + bit_level + bit_logger + bit_message;

$.ajax({
	type: "POST",
	contentType: "application/json; charset=utf-8",
	url: "LoggerDetails.svc/GetMessages",
	data: "{\"ids\" : \"" + id_list.toString() + "\"," +
		"\"sort\" : " + sort_order + "," +
		"\"perpage\" : " + per_page + "," +
		"\"mask\" : \"" + mask + "\"," +
		"}",
	success: function (data, status) {
		$(".messages").html(data.d);
		$(".message-table").tablesorter({ widgets: ['zebra'] });
	},
	error: function (err) {
		alert(err.statusText);
	},
	dataType: "json"
});

I've been working a bit on timeline visualization controls, and I might integrate one of them into this interface; at this point it looks like it would be useful.  I'm also considering adding a date filter, so you can pull back entries from specific time periods, or some aggregate statistics, so you can get a higher-level view of the volume and category of entries in your log.

Source code available upon request, just hit up the comments below.

Git Rhino-ETL Going

March 1, 2011

For one of my projects I'm trying to identify a decent, open source, .NET-based ETL component.  Yes, those are some picky requirements, but there appears to be at least one answer, called Rhino ETL.  Getting it built isn't the most straightforward task.  Here is what it takes:

First off, grab the source from Git, which might first require that you have Git utilities installed; I chose Git Extensions from SourceForge.  After installing and launching it, I get to look at a cow.  I'm not sure why.  I'm sure there is a reason.  I don't think I want to know what it is, though, so I'm going to skip googling "Git Extensions" and "cow" and just get on with it, but for your own viewing pleasure I've included a screenshot.

Once you have Git Extensions installed, you can use the UI to download the current source for Rhino-ETL.  Currently, the URL for read-only access to the Rhino-ETL source is:

https://github.com/hibernating-rhinos/rhino-etl.git

To grab a copy of the source, click 'Clone repository' in the UI and then fill out the clone screen something like this:

Hit clone and now you'll have the source in the destination you specified.  
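
If you'd rather skip the UI entirely, the command-line equivalent (assuming you have the git command-line client available) is a single clone against that same URL:

git clone https://github.com/hibernating-rhinos/rhino-etl.git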


About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996.  He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.

 

Reach me via email at frank (at) thedefalcos (dot) com.

 

Follow me on twitter @fjdefalco