First Impressions of VoltDB

VoltDB is an intriguing competitor in the quickly developing field of data store alternatives. Answering the charge of NoSQL supporters that “OldSQL” is simply too slow to scale effectively to meet the demands of high-transaction applications, VoltDB seeks to demonstrate that it is neither SQL nor the Relational Model that is at fault, but the antiquated disk-centric architecture of the venerable relational database management systems (RDBMS).

VoltDB is a natively partitioned (sharded), single-threaded, memory-resident data store. Other RDBMS implement sharding to allow for distributed data storage across commodity hardware, but VoltDB was written with this principle as the basis of its architecture. Sharding in other RDBMS must often be implemented as part of the application stack and presents onerous maintenance hazards (such as managing backups and restores). In recent versions, MySQL has added a partitioning abstraction to the product, and many organizations have had success with MySQL sharding. I can’t speak to how VoltDB compares, but MySQL certainly was not designed with partitioning as a fundamental principle. Regardless, the VoltDB implementation effectively and cleanly abstracts the details of partitioning from the application tiers.
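To make this concrete, here is a sketch of what declaring a partitioned table in VoltDB can look like. The exact mechanism varies by version (early releases declared partitions in a project configuration file rather than in DDL), and the table and column names here are invented for illustration, so treat this as a sketch rather than a definitive reference:

```sql
-- Hypothetical schema: partitioning the table on its key column means each
-- row lives on exactly one node, invisibly to the application tier.
CREATE TABLE email_address
(
    address_id    INTEGER      NOT NULL,
    email_address VARCHAR(100) NOT NULL
);
PARTITION TABLE email_address ON COLUMN address_id;
```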

But the most important differentiator is that VoltDB is entirely memory resident. All of the mechanisms that traditional RDBMS rely on to function in a disk-centric architecture have been discarded in the name of performance. This simple and instructive graphic demonstrates how VoltDB rethinks storing relational data.

In addition, VoltDB rejects concurrency in favor of single-threaded transactions, thereby eliminating costly locking. As a result, VoltDB can measure throughput in hundreds of thousands of transactions per second, a truly astonishing accomplishment.

Even though VoltDB addresses common performance issues of relational data stores, NoSQL solutions can still rightly claim that it does nothing to solve the impedance mismatch that software developers have spent so much time trying to work around. Certainly, in time, ORMs will be developed that greatly ease the pain developers experience. Until someone is able to make the translation as seamless as, say, working with the excellent MongoDB, this will remain a legitimate complaint and an impediment to adoption. However, VoltDB’s response to this should be, “yes, some NoSQL stores speak your language better than VoltDB, but look at what you are giving up!” Losing ad-hoc querying, transactional consistency (ACID), a standardized data access convention (ANSI SQL) and the data integrity promoted by the relational model is a huge price to pay. And as Dr. David DeWitt rightly asked in his presentation to the PASS Summit 2010, “why would you want to write your own query optimizer?” An on-demand recording of the talk is available (signup required).

This brings us to what I see as the greatest drawbacks of VoltDB. Currently, only a very limited subset of SQL is actually supported, which blunts the ad-hoc query argument. No doubt this will improve over time, but it remains a challenge at this point. Additionally, due to architectural constraints, VoltDB can only loosely call itself ‘relational.’ The lack of foreign key and check constraints is worrisome: responsibility for maintaining data integrity is pushed to the application, just as with NoSQL solutions. I see no reason why check constraints, at least in limited form, could not be supported, but enforcing foreign key constraints across partitions is a hard architectural problem.
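For reference, the sort of declarative integrity at issue looks like this in a conventional RDBMS (standard SQL Server syntax; the table, column and constraint names are invented for illustration). Constraints like these are exactly what an application must re-implement by hand when the data store cannot enforce them:

```sql
-- A check constraint and a foreign key, enforced by the engine itself
CREATE TABLE EmailAddress
(
     AddressID    INT          NOT NULL,
     PersonID     INT          NOT NULL,
     EmailAddress VARCHAR(100) NOT NULL,
     CONSTRAINT CK_EmailAddress_Format
          CHECK (EmailAddress LIKE '%_@_%'),
     CONSTRAINT FK_EmailAddress_Person
          FOREIGN KEY (PersonID) REFERENCES Person (PersonID)
) ;
```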

However, the most disconcerting problem is the lack of global uniqueness constraints. Although VoltDB enforces uniqueness for keys, it can do so only on a per-partition basis. That is, for any given relation the key is guaranteed to be unique only within the partition in which it resides; if a table is partitioned across n nodes, a key may exist in n partitions. To my mind, this calls into question VoltDB’s claim to be a relational data store, though this is perhaps debatable. Nevertheless, uniqueness constraints, along with foreign keys and check constraints, are essential to the data integrity promises of the relational model. ACID is fantastic and certainly something that NoSQL advocates are too quick to dismiss, and it seems that VoltDB is mounting a powerful challenge to the CAP theorem. My conclusion is that VoltDB is not a truly relational data store, yet it presents a vital corrective to NoSQL: transactional integrity need not be thrown out in the name of performance. For this reason, if for nothing else, it deserves to be taken very seriously.

There is much more that can be said about VoltDB and I encourage you to read their excellent whitepaper. VoltDB, and others with a similar approach, are extraordinarily promising and welcome rewrites of relational data storage. The established players in the relational market are going to be forced to take notice, as they certainly have of NoSQL. And when the NoSQL crowd realizes that they are now dealing with the same issues that Codd was trying to solve forty years ago, VoltDB will be there with part of the solution. But, much more work must be done to find creative solutions to the lack of some essential relational features before we can truly say that it can carry both the data integrity and performance torches. In the meantime, VoltDB can rightfully claim to be one of the best options out there for what it claims to do well: high throughput OLTP workloads.

A special thanks to Seb Coursol of VoltDB for his openness and enthusiasm for his product.

Moving Data Between SQL Servers with PowerShell

Scripting data is a pain, especially if you do not want the entire contents of a table. OPENROWSET is verbose and clunky. And working with linked servers may not be the seventh layer of hell, but you’re going to smell some smoke. What I really want is to pass some parameters to a function and be done with it. Hello, PowerShell. Here’s how it works:
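For comparison, here is roughly what the OPENROWSET approach looks like for the same job (assuming a trusted connection and that ad hoc distributed queries are enabled on the server; server and database names are placeholders):

```sql
-- Verbose and clunky: provider and connection details embedded in every query
INSERT INTO MyDB.dbo.EmailAddress (AddressID, EmailAddress)
SELECT AddressID, EmailAddress
FROM OPENROWSET('SQLNCLI',
     'Server=server1;Trusted_Connection=yes;',
     'SELECT AddressID, EmailAddress FROM MyDB.dbo.EmailAddress WHERE AddressID = 1') ;
```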

Create a test table on two different servers:

CREATE TABLE EmailAddress
(
     AddressID INT NOT NULL,
     EmailAddress VARCHAR(100) NOT NULL
) ;

Create some test data on one server:

WITH email AS
(
     SELECT 1 AS id, 'bgates@hotmail.com' AS EmailAddress
     UNION ALL
     SELECT 2, 'mzuckerberg@facebook.com'
)
INSERT INTO EmailAddress (AddressID, EmailAddress)
SELECT id, EmailAddress FROM email ;

Here’s the function:

function Move-Data
{
    Param
    (
        $sourceServerName ,
        $sourceDatabaseName ,
        $destinationServerName ,
        $destinationDatabaseName ,
        $destinationTableName ,
        $query
    )

    # Invoke-Sqlcmd ships with the SQL Server client tools (e.g. the sqlps snap-ins)
    $dataTable = Invoke-Sqlcmd -Query $query -Database $sourceDatabaseName -ServerInstance $sourceServerName ;

    # Bulk load the rows into the destination table
    $connectionString = "Data Source=$destinationServerName;Integrated Security=true;Initial Catalog=$destinationDatabaseName;" ;
    $bulkCopy = New-Object ("Data.SqlClient.SqlBulkCopy") $connectionString ;
    $bulkCopy.DestinationTableName = $destinationTableName ;
    $bulkCopy.WriteToServer($dataTable) ;
    $bulkCopy.Close() ;
}

Call the function:

$query = "SELECT AddressID,EmailAddress FROM EmailAddress WHERE AddressID = 1" ;
Move-Data server1 MyDB server2 MyDB EmailAddress $query
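Once the function is in place, extending it is trivial. Here is a sketch of copying several tables in one pass (the table names are hypothetical, and it assumes each table already exists on the destination with a compatible schema):

```powershell
# Copy the full contents of several tables using the Move-Data function above
$tables = 'EmailAddress', 'Person'   # hypothetical table list
foreach ($table in $tables)
{
    $query = "SELECT * FROM $table" ;
    Move-Data server1 MyDB server2 MyDB $table $query
}
```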

Dynamically Importing Data from Excel Using PowerShell

Since I spend a lot of time with ETL (Extraction, Transform, Load) processes, one of the most common tasks I face is importing data from an Excel spreadsheet into SQL Server. For many reasons, I’ve never been particularly happy with bcp, BULK INSERT or SSIS for this task. Recently I started experimenting with PowerShell for ETL and have been happy with the results.

Using PowerShell, I came across a simple way to accomplish a task for which I’ve never found a satisfactory solution: dynamically creating a table based on the columns in an Excel spreadsheet. Utilizing the powerful SMO assemblies, and the .NET DataTable data structure, the following script reads the structure of an Excel spreadsheet, generates a table in SQL Server and imports the data into the table.

# SqlDataType and the other SMO types all live in the Smo assembly
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.Smo') | Out-Null

# You can very easily convert this into a function / shell script by using parameters or $args[i] for these variables
$serverName = 'testserver';
$query = 'select * from [Sheet1$]';
$databaseName = "ExcelImport" ;
$tableName = "ExcelSheet" ;
$filepath = "c:\workbook.xlsx";

# ACE works well for Excel 2007 or greater.  You need to use JET 4.0 for Excel 2003 (.xls).  Use ACE whenever possible.
$connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=`"$filepath`";Extended Properties=`"Excel 12.0 Xml;HDR=YES`";"
# For Excel 2003 (.xls) files, the JET equivalent would be:
# $connectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=`"$filepath`";Extended Properties=`"Excel 8.0;HDR=YES`";"

# Instantiate some objects which will be needed
$serverSMO = New-Object Microsoft.SqlServer.Management.Smo.Server($serverName)
$db = $serverSMO.Databases[$databaseName];

$newTable = New-Object Microsoft.SqlServer.Management.Smo.Table ;
$newTable.Parent = $db ;
$newTable.Name = $tableName ;

$conn = New-Object System.Data.OleDb.OleDbConnection($connectionString)
$conn.open()

$cmd = New-Object System.Data.OleDb.OleDbCommand($query,$conn) 
$dataAdapter = New-Object System.Data.OleDb.OleDbDataAdapter($cmd) 
$dataTable = New-Object System.Data.DataTable 

$dataAdapter.Fill($dataTable) | Out-Null   # Fill returns the row count; suppress it
$conn.Close()

# Drop the table if it exists
if ($db.Tables.Contains($tableName))
{
	($db.Tables[$tableName]).Drop()
}

# Iterate the columns in the DataTable object and add dynamically named columns to the SqlServer Table object.	
foreach($col in $dataTable.Columns)
{
	$sqlDataType = [Microsoft.SqlServer.Management.Smo.SqlDataType]::VarChar
	$dataType = New-Object Microsoft.SqlServer.Management.Smo.DataType($sqlDataType);
	$dataType.MaximumLength = 8000;
	$newColumn = New-Object Microsoft.SqlServer.Management.Smo.Column($newTable,$col.ColumnName,$dataType);
	$newColumn.DataType = $dataType;
	$newTable.Columns.Add($newColumn);
}
$newTable.Create();

$connectionString = "Data Source=$serverName;Integrated Security=true;Initial Catalog=$databaseName;"
$bc = New-Object ("Data.SqlClient.SqlBulkCopy") $connectionString
$bc.DestinationTableName = "$tableName"
$bc.WriteToServer($dataTable)
$bc.Close()

Resources

ACE – Microsoft Access Database Engine: required for the ACE OLE DB provider if you do not have Office installed.