As I've mentioned some of my pet projects require access to statistics and data mining algorithms. I have played with a few different approaches but none of them really had everything I was looking for in terms of methods. I eventually decided to give R a try. For those who are not familiar R is a free software environment for statistical computing and graphics that is available for a wide variety of UNIX platforms, Windows and MacOS. I had used R in the past but only on a very limited basis and never against a relational data source so I was quite suprised with what I found.
R has a massive following with thousands of contributors to an online library of algorithms. Need access to a bayesian algorithm for classification, no problem. Want to draw familial diagram with genetic markers, piece of cake. It's all in there and there are quite a few nice resources online to help you learn the platform. One of the best I have found is
http://www.statmethods.net/ but there are plenty of others. Even if what you are looking for doesn't exist in R (unlikely) you can engage with the community to get where you need to go.
Once I determined that R has all the methods I was looking for the next step was accessing my data from within R. Sure, you can load flat files directly into R but my data is all stored in a relational database. As it turns out, R can handle that too. Let's get started.
R has a terrific package management system that allows you to select a package repository called a CRAN mirror (there are many) and install additional packages.
To connect to a relational database you'll want to grab the RODBC package by selecting 'Install package(s)...' and finding it in the list after selecting a nearby CRAN mirror. Once it has been installed you can load into your current sessions as follows:
Now for the fun part. In the code below I connect to the database and establish a channel that can then be used for running your queries. Once I have a channel I get a listing of tables in the Sales schema of the AdventureWorks sample database. Finally I run a query to return some data.
channel <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 10.0};Server=localhost;Database=AdventureWorks;Trusted_Connection=yes;")
sqlTables(channel, schema="Sales")
sqlQuery(channel,"select top 10 * from Sales.SalesOrderHeader")
Of course you might not be using SQL Server 2008 or you might be using a different security model. Regardless just take a trip over to
www.connectionstrings.com and find the ODBC connection string that fits your scenario and plug it into the odbcDriverConnect function. Now that you can connect R to your relational environment you can take advantage of the statistical and data mining algorithms within, free of charge.