HUGE SQL Azure Features in Preview Today

Earlier this month at the Build conference the news broke of major changes to Microsoft Azure SQL Database (a.k.a. the platform formerly known as Windows Azure SQL Database a.k.a. the platform formerly known as SQL Azure). This morning those changes were released as preview features. I’m going to go ahead and take credit for the two big ones, based on my comment in this Brent Ozar blog post. You’re welcome, #sqlfamily. These changes come on top of the introduction of a Premium tier last year, which significantly expanded the scale of workloads that Azure can support.

So, what’s new? First to my mind, and to many DBAs out there who are cosmically empathic to the health of their recovery strategy, is point-in-time restore. Gone are the BACPACs and database copies (for which there was no ‘Azure’ way of scheduling). Now you can simply submit a restore request, specify the time to restore, and you’re on your way. Huge. HUGE! There’s even a PowerShell cmdlet: Start-AzureSqlDatabaseRestore. Read more here and here.
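For anyone who wants to try it, here’s roughly what a restore request looks like from PowerShell. Treat this as a hedged sketch: the server and database names are placeholders and the parameter names reflect my reading of the preview-era cmdlet, so check Get-Help Start-AzureSqlDatabaseRestore before relying on them.

# Hedged sketch: request a point-in-time restore of an Azure SQL Database.
# 'myserver' and the database names are placeholders.
Start-AzureSqlDatabaseRestore `
    -SourceServerName   'myserver' `
    -SourceDatabaseName 'MyAppDb' `
    -TargetDatabaseName 'MyAppDb_Restored' `
    -PointInTime        (Get-Date).AddHours(-2)   # restore to the state as of two hours ago

# The restore runs asynchronously; the cmdlet returns a request object you can
# use to track progress from the portal or from PowerShell.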

Second, the maximum size of a database has been increased from 100 GB to 500 GB. Even the newly introduced ‘Standard’ tier has a 250 GB ceiling. For the vast majority of applications out there this should be plenty. Sure, before you could have used federated databases, but there are significant architectural and maintenance challenges to sharding patterns like that, so this is extremely welcome news.

For the past 8 months or so I’ve been working on a project for which the data and service tiers will be Azure-only. So far the experience has been mostly positive, but the entire time I’ve had nagging concerns about these limitations which, given the resources of my team, posed significant challenges. These changes essentially eliminate those concerns. Consider: if this is happening to me, the same effect is rippling through countless projects and IT divisions. Remember this as the day SQL Azure made The Leap (but, can we please have XEvents, Microsoft?).

The Case of the Orphan Transactions

I usually have trouble making it through long blog posts and this isn’t going to be short. But, this is quite a story and can’t be told quickly.

It all started with an open transaction; well, almost two THOUSAND open transactions. The database maintenance job was blocked and throwing alerts. It was not blocking anything else, but of course that’s because the developers, as a matter of pride, always use NOLOCK on every query they write. Who cares about accurate queries anyway?

Here’s the partial output from sp_WhoIsActive:

[Screenshot: partial sp_WhoIsActive output]
Any experienced DBA knows immediately what’s going on here. Some developer wrote code that opens transactions without properly committing or rolling them back. This results in a cascade of nested transactions and, eventually, a killed SPID and lost work. So what to do? Being an experienced DBA I knew the answer to that, too: I blame the developers. Of course, they SWEAR they aren’t opening any transactions. There used to be a problem where connections were not always being closed, but they fixed that months ago. Being an experienced DBA I don’t believe them, so I acquire a copy of the code myself. Shockingly, I find the developers are actually correct; the code (although a spaghetti mess of static methods, nested internal and partial classes, and SQL injection vulnerabilities) is (more or less) properly managing its connections and is not opening transactions. It was time to investigate (what else am I going to do on a Friday night?).

Taking an inventory of what I knew: the connection opened on July 19th, but had only been sleeping for 23 seconds. Obviously the connection was still being utilized. And the same stored procedure was being executed on the connection over and over again. But, open_tran_count was not increasing. The connection was over ODBC using a DSN that is shared by numerous processes with the same domain user account. This service ran on 8 servers, but only one of them was causing a problem. I needed more information, so I headed to the DMVs.

sys.dm_tran_active_transactions
[Screenshot: sys.dm_tran_active_transactions output]

This is interesting. There are 1,926 open transactions, but there is only one row in this DMV. So which one of those 1,926 transactions does this DMV report on? The eminent Adam Machanic, author of sp_WhoIsActive, reveals that he has to go to the deprecated sysprocesses view to find out how many transactions are actually open. Books Online is no help. A bit of testing shows that transaction_begin_time refers to the time the FIRST transaction was started. Let’s check another:

sys.dm_tran_database_transactions
[Screenshot: sys.dm_tran_database_transactions output]

Again I had the same problem: only one transaction record. More testing shows that database_transaction_begin_time is the time the first transaction in the series touched the database with a modification. So there were 30 hours between the time the first transaction started and the time one actually touched (attempted to modify) the database. This was all interesting, but not particularly useful since there wasn’t a clear explanation of this behavior in the code.
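For reference, this is roughly the cross-check I was running at this point. It’s a hedged sketch: Invoke-Sqlcmd (from the SQLPS/SqlServer module), the server name, and the spid are placeholders, but it shows sysprocesses reporting the true open transaction count while the transaction DMVs return only a single row for the same session.

# Hedged sketch: compare the per-session open transaction count in sysprocesses
# with the single row the transaction DMVs expose for that session.
$spid = 71   # placeholder: the sleeping spid from the sp_WhoIsActive output

Invoke-Sqlcmd -ServerInstance 'MyServer' -Query @"
SELECT p.spid,
       p.open_tran,                    -- the real count of open transactions
       s.transaction_id,
       a.transaction_begin_time        -- when the FIRST transaction started
FROM sys.sysprocesses AS p
LEFT JOIN sys.dm_tran_session_transactions AS s ON s.session_id = p.spid
LEFT JOIN sys.dm_tran_active_transactions  AS a ON a.transaction_id = s.transaction_id
WHERE p.spid = $spid;
"@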

The question remained: what was opening all of those transactions? The stored procedure was solid. It used TRY...CATCH with appropriate commit and rollback. It wrapped an explicit BEGIN TRANSACTION around an INSERT, but, while extraneous, that was not of great concern. The code followed the try...catch...finally pattern with the call to the OdbcConnection.Close() method in the finally block. Everything looked good there. Was it possible that, contrary to the documentation, Close() did not actually roll back open transactions? I opened up Visual Studio and in a few minutes had hammered out a quick app to test. And, in fact, the documentation is incorrect or misleading. Close() does NOT roll back open transactions when connection pooling is enabled. The transactions are only rolled back when another process acquires the same connection from the pool and sp_reset_connection is executed. However, this means that those transactions should have been rolling back, because the code was properly closing the connection. The mystery thickened. At this point I should have become more suspicious, but I was too deep into the weeds looking for the zebra. I was grasping at straws.
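If you want to reproduce that test without firing up Visual Studio, the same experiment can be sketched in PowerShell against the ODBC classes. The DSN, server, and table names are hypothetical, Invoke-Sqlcmd (SQLPS module) handles the monitoring query, and pooling has to already be enabled at the ODBC driver level for the behavior to show up.

# Hedged sketch: does OdbcConnection.Close() roll back an open transaction?
$conn = New-Object System.Data.Odbc.OdbcConnection('DSN=TestDSN;')
$conn.Open()

# Remember the spid so we can watch the session from a second connection.
$spid = (New-Object System.Data.Odbc.OdbcCommand('SELECT @@SPID;', $conn)).ExecuteScalar()

# Open an explicit transaction in raw SQL, then Close() WITHOUT commit/rollback.
$cmd = New-Object System.Data.Odbc.OdbcCommand('BEGIN TRAN; INSERT dbo.CloseTest DEFAULT VALUES;', $conn)
[void]$cmd.ExecuteNonQuery()
$conn.Close()   # the connection returns to the pool; sp_reset_connection has NOT run yet

# From an independent connection: open_tran is still 1 for that spid until the
# pooled connection is handed out again and sp_reset_connection fires.
Invoke-Sqlcmd -ServerInstance 'MyServer' -Database 'TestDB' `
    -Query "SELECT spid, status, open_tran FROM sys.sysprocesses WHERE spid = $spid;"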

What ELSE could be opening transactions and leaving them open? The only thing I could think of was implicit transactions, but NO ONE uses implicit transactions except Oracle, right? Still, it would offer a partial explanation: every time the stored procedure was called, two transactions would be opened but only one committed, leaving the second open. The prospect was intriguing. However, based on the results of my earlier experiment, I should expect perhaps one transaction to be held open after each call to the proc, not 1,926. But maybe there was some strange confluence of events, since now we had to consider read-only transactions in the mix as well. Some quick investigation should put it to bed; if there are no connections using implicit transactions, it’s a dead end. Since that particular connection setting doesn’t seem to show up in the DMVs (maybe it’s hidden somewhere I couldn’t find), I quickly grabbed the ExistingConnection trace from Profiler. Of course, of all the apps in the world to use this esoteric connection setting, it would have to be this one. Hidden somewhere in compiled vendor code to which we didn’t have access, it was doing this:

[Screenshot: Profiler ExistingConnection trace showing SET IMPLICIT_TRANSACTIONS ON and REPEATABLE READ]
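For anyone who has never touched this setting, here’s roughly the buildup pattern it produces. This is a hedged demo against a scratch table (Invoke-Sqlcmd, the server, and the table are placeholders), and it assumes the documented behavior that an explicit BEGIN TRAN issued while IMPLICIT_TRANSACTIONS is ON and @@TRANCOUNT is 0 first opens an implicit transaction and then nests the explicit one.

# Hedged demo: one orphaned transaction per round trip under implicit transactions.
Invoke-Sqlcmd -ServerInstance 'MyServer' -Database 'TestDB' -Query @"
SET IMPLICIT_TRANSACTIONS ON;

BEGIN TRANSACTION;                 -- implicit tran opens first, then this nests: @@TRANCOUNT = 2
INSERT dbo.CloseTest DEFAULT VALUES;
COMMIT;                            -- pops only one level: @@TRANCOUNT = 1

SELECT @@TRANCOUNT AS left_open;   -- one transaction left behind

IF @@TRANCOUNT > 0 ROLLBACK;       -- clean up after the demo
SET IMPLICIT_TRANSACTIONS OFF;
"@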

Implicit transactions and repeatable read is an interesting combination. Was some race condition the culprit, where the vendor app opened the connection using the same DSN and set implicit transactions, but the setting wasn’t changed when the connection was grabbed from the pool by this custom-written process? Perhaps sp_reset_connection had some loophole or quirk? It turns out that sp_reset_connection is not well documented at all. It is considered an API procedure by Microsoft (you can’t execute it without access to the API) and they have chosen not to document it. In fact, this Stack Overflow post seems to be the best source of documentation out there and is also referenced by John Huang, among others. If it doesn’t change the isolation level, perhaps it doesn’t change implicit transactions, either. More code in my test app proved that this, too, was a dead end. Not only does the setting return to implicit_transactions off, but the transactions would have rolled back anyway.

I was well and truly stumped. This was a big problem, not so much because of user or system impact, but because problems like this eat at me until I find an explanation. I was in deep and scrambling for a way out. It was consuming my thoughts so much that I subjected my lunch companion to a thorough overview of my efforts. Just as I was about to leave, it struck me: what if the code I was reading was not actually the code that was running? Not that the version of the code I had and the version on the server were different; I knew they were the same. What if the code on the server’s hard drive was not the code in the server’s memory? This code ran as a Windows service. What if the .exe and .dll were replaced, but the service was never restarted? A quick test proved that you can, in fact, replace the files without stopping the service. The fix the developers put in months ago to properly close the connection was never actually in effect. In fact, when they restarted the service after I presented my theory, it failed, because they had also changed the service account at the same time they changed the code and the new account did not have the permissions required to run the service.
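Here’s a sketch of the check I’d run first next time: compare when the service’s process started against when its executable was last written. The service name is a placeholder, Get-CimInstance assumes PowerShell 3+ (swap in Get-WmiObject on older hosts), and the path handling assumes the service has no command-line arguments.

# Hedged sketch: is a Windows service running code older than the files on disk?
$serviceName = 'MyVendorService'   # hypothetical

$svc  = Get-CimInstance Win32_Service -Filter "Name = '$serviceName'"
$proc = Get-CimInstance Win32_Process -Filter "ProcessId = $($svc.ProcessId)"

$exePath     = $svc.PathName.Trim('"')          # assumes no arguments in PathName
$processUp   = $proc.CreationDate               # CIM converts this to a [datetime]
$binaryWrite = (Get-Item $exePath).LastWriteTime

if ($binaryWrite -gt $processUp) {
    "'$serviceName' started $processUp but its binary was written $binaryWrite; the running code is stale."
}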

So, returning to the simple stored procedure: if the single insert timed out because it was blocked by, say, an index rebuild operation, the COMMIT would never execute, leaving the explicit transaction open forever. The app would continue to time out and run the procedure again and again, allowing the transactions to build up over time. When whatever was blocking and causing the timeouts cleared, the transactions would start to commit again, causing the watch to stop, so to speak, at 1,926 open transactions. The procedure did not SET XACT_ABORT ON, so SQL Server never rolled back the open transaction when the client timed out.
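For completeness, here’s a hedged sketch of how the proc could be hardened (the proc and table names are placeholders, and THROW assumes SQL Server 2012 or later). A client timeout sends an attention event that aborts the batch before the CATCH block ever runs, so SET XACT_ABORT ON is what actually guarantees the rollback.

# Hedged sketch: a hardened version of the pattern described above.
Invoke-Sqlcmd -ServerInstance 'MyServer' -Database 'AppDb' -Query @"
CREATE PROCEDURE dbo.InsertWorkItem_Hardened
    @Payload varchar(100)
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;   -- a timeout (attention) now rolls back the open transaction
    BEGIN TRY
        BEGIN TRANSACTION;
        INSERT dbo.WorkItems (Payload) VALUES (@Payload);
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END;
"@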

In retrospect, there were a few other things I could have tried to get more information. I could have used fn_dblog() to inspect what the process was doing. I could have queried WMI to see how long the process had been running to validate my theory (I got lucky with the service account). And I still have outstanding questions (why was there a 30-hour gap between the transaction start times in the DMVs?). But, all in all, it was a fascinating troubleshooting experience and I learned quite a bit.

  • OdbcConnection.Close() does NOT roll back open transactions when connection pooling is enabled. They are rolled back by sp_reset_connection when another request pulls the connection from the pool.
  • sp_reset_connection DOES reset the implicit_transactions setting, but not the isolation level.
  • You can overwrite dependent files for a Windows service without stopping the service.
  • We could use a bit more information about transactions from the DMVs.
  • ODBC connection pooling is controlled at the driver level, so you can’t specify different settings for different DSNs. I think you can extend the connection string, however, but I didn’t test or confirm that.

PowerShell Studio Review

Over the years I’ve noticed certain characteristics in the IT professionals I respect the most. These range from the rather obvious intellectual curiosity that goes beyond all matters IT to the more subtle religious devotion to particular code formatting styles. The attribute germane to this post is that they also recognize and adore good tools. They sense when their current toolset is deficient and spend time looking for better ones. When a good one is found, they become passionate about it and share it with other people they respect.

When I first started out in PowerShell I quickly determined the ISE to be less than ideal. I rarely spend time in the shell itself, but I write quite a lot of scripts, functions, and modules. A couple of years ago there wasn’t much in the way of alternatives. I used PowerGUI for a while, which wasn’t (and isn’t) bad. It provided Intellisense and a heap explorer (for lack of a better term; it’s a variable list), but at the time there weren’t all that many cmdlets to keep track of anyway. When I began exploring multithreading with PowerShell I ran into the big problem that the console PowerGUI wraps executes on an STA thread. That’s death for a PowerShell ISE for me now. The next tool I tried was Idera’s PowerShell Plus. It was a step ahead of PowerGUI, but a bit ‘clunky’. Idera makes some great tools and provides awesome support for the PowerShell community (they run powershell.com). Also, their former VP of Marketing gave a copy of PowerShell Plus to every attendee in one of my SQL Saturday sessions. I probably would have stuck with it if it hadn’t been for discovering Sapien’s PrimalScript.

PrimalScript is, so far as I can tell, the gold standard of scripting tools. It’s got more bells and whistles than I could ever explore, and that was really my only problem with the tool. When it comes to scripting I’m a PowerShell-only kind of guy, and due to PrimalScript’s long history PowerShell really became a ‘tacked-on’ addition to the tool. It works remarkably well, but there is an awful lot of clutter if all you do is PoSH. A while back I discovered that they had made dramatic upgrades to the tool they once called PrimalForms, which I had tried and discarded as a one-trick pony for the rather esoteric task of writing PowerShell-driven GUI forms. Renamed PowerShell Studio, it looked like it took all the good elements of PrimalScript and left out the non-PowerShell stuff I really didn’t care about. For a couple of weeks I’ve been working with it exclusively and it’s quite good. There are quite a few features that separate it from the other ISEs out there:

Swap v2 and v3 engines

This is an inspired feature. Especially with some of the known compatibility issues experienced running v2 scripts in v3, having this easy switch is a great help.

Layouts

There are so many features and toolbars that you can go a little nuts trying to navigate them all. And when I want to code I just want a blank window to hammer out the text. Layouts give you the opportunity to configure the interface to your heart’s desire *for any given scenario.* When I’m just pounding the keyboard I want a particular set of toolbars, but when I’m debugging I want something different. Likewise, if I’m doing code modifications and regression testing, I have yet another set of needs. It’s quite easy to configure the interface and save the configuration for instant recall later. Here’s what I look at most of the time.

[Screenshot: my everyday PowerShell Studio layout]

Object Browser

This is a killer feature. The Object Browser is a toolbar that presents all of the components in PowerShell, the .NET Framework, WMI, and your filesystem in one (searchable) interface. I find myself spending a lot of time in BOL trying to figure out various esoteric SMO issues, find overloaded constructor definitions, and chase umpteen other time chewers that an only-kind-of-developer like me has to deal with. Being able to quickly browse through what is available is a great help, and if I need more information, the right-click -> MSDN Help shortcut is just awesome.

[Screenshot: the Object Browser]

Contextual help

As you type cmdlets, the Get-Help content opens automagically in a docked window. This is particularly helpful now that v3 has added all sorts of cmdlets and there’s no way you can be familiar with them all. And there’s always that ‘what was that parameter name’ moment, which I have shamefully often. It’s very handy and shows that the developers of the tool really understand PowerShell scripting.

[Screenshot: the contextual help pane]

Script packager

Bundle your script as an .exe? I didn’t get it at first. But I tried it out and I’ll definitely have use for it. I’m constantly wanting to share something cool with people, but I can’t teach everyone PowerShell. Now I can just package it up and send them an .exe. As an added benefit, they can’t edit or see the code, so it prevents changes that might break your code, as well as perhaps protecting some secret sauce you don’t want to share (but don’t be that guy!).

Compare files

I work in a scenario where I have to go through very clearly defined steps of testing and elevating PowerShell code between environments. Being able to click a button and view the differences between two files is a big time saver.

There really are too many great features to list here and many of them I have yet to really dig into. Here’s a quick (but certainly not exhaustive) look at a few more.

  • Remoting – Set up your quick access list and open a remote PowerShell console without leaving the tool.
  • Embedded console that actually works – I don’t have to have a separate PowerShell window open. I can do everything within the tool.
  • Snippets – Other tools have snippets, but I find PowerShell Studio to have the superior implementation. They are easy to use with the shortcut + tab pattern and it’s simple to create your own.
  • Auto-module import – Try to use a function from a module you haven’t loaded and PowerShell Studio will insert the Import line.
  • Projects – Being able to create a project is very handy for working with modules or multiple scripts/functions that are generally modified together.
  • Function Explorer – When you’re working with modules it’s very helpful to be able to see a list of all the functions within and navigate to their place in the module file.

Wish List

  • Better configuration – PowerShell Studio is reasonably configurable, and most of the things that would annoy me I’ve been able to turn off or alter. But, as you’ve probably guessed, I’m rather picky about my development environments. One of the first things I do when I try out a new IDE is set the colors. I think I’ve come up with the absolute best, objectively speaking, color scheme of all time, one I’ve spent hours perfecting, and I’m UNABLE TO USE it in PowerShell Studio because you can’t enter your own RGB values. Travesty! Also, I can’t configure line number coloring at all! The horror! As you can see here, what I came up with isn’t too bad, but it’s not nearly as beautiful as my PrimalScript settings.

    [Screenshot: my color scheme in PowerShell Studio]

  • Disable tabs grabbing focus – I don’t care to know every time an auto-save occurs. I don’t need the output tab to pop up every time.
  • Code formatting – The more configurable the better!
  • License and embed Sublime Text – I greatly admire text editors. It’s an area most people ignore, but there are so many awesome things you can do with a good one. Sublime Text is my current fave (thanks to one of those guys I respect who shared it with me). Of course, a vi style editor would also be fantastic. Free me from the mouse!
  • Better integration with source control – I understand Sapien has their own source control product, but I can’t use it. I have TFS and I simply wasn’t able to get them to play well together.

Conclusion

I really wish I could return PrimalScript and go with PowerShell Studio. I’m a PrimalScript fan and will happily continue to use it, but when it comes time to upgrade I’m going with its PowerShell-specific sister. If you spend a lot of time writing PowerShell code, there’s just no better ISE out there.

psasync Module: Multithreaded PowerShell

The series I did on Concurrency in PowerShell contains by far the most popular posts on my blog (getting tens of views every day!), particularly the one on using runspaces. The code demonstrated in that post was an early attempt at using the method. Not long after, I refactored the functions I was using into something more modular and flexible. Thus, the psasync module was born. It has proved to be a very simple way for me to kick off background work in runspaces in scripts or from the shell. It still has room for improvement (the functions don’t process pipeline input, for instance), but it’s a start. Since I hope to be updating it as suggestions come in or as I find ways to make it more robust, I’ve started a CodePlex project (my first, so be gentle). If you would like to be a contributor, feel free to send me something and, if it’s good, I’ll give you credit on the project.

The module contains the following functions:

Get-RunspacePool

This should have, perhaps, been named Create-RunspacePool. As its name suggests, it returns a pool of runspaces that can be used (and reused).

Invoke-Async

This is a much improved version of the function introduced in the initial post. The big change was the addition of the AsyncPipeline class definition, a simple object that keeps track of both the pipeline and the AsyncResult handle returned by BeginInvoke(). That makes it much simpler to check the status of running work and to consume results. The function also allows passing an array of parameters for script blocks with multiple arguments.

Receive-AsyncResults

This function wraps the code for pulling results (or errors, as the case may be) off the pipelines in the runspace pool, using the AsyncPipeline objects output by Invoke-Async.

Receive-AsyncStatus

A handy function for working with runspaces from the shell, Receive-AsyncStatus simply returns information about the status of the operations running in the pipelines you have invoked. Since Receive-AsyncResults is synchronous, this allows you to continue to work until your last process completes or selectively use Receive-AsyncResults on those that have completed.

Example Code

To demonstrate the use of the module, consider the same scenario presented in the Concurrency series: You have a series of Excel documents that you need to load into an SQL Server database. As before, set up the script block that will execute the real work.

Import-Module psasync

$AsyncPipelines = @()

$ScriptBlock = `
{
    Param($File)
    
    . <your_path>\Import-ExcelToSQLServer.ps1
    
    Import-ExcelToSQLServer -ServerName 'localhost' -DatabaseName 'SQLSaturday' -SheetName "SQLSaturday_1" `
        -TableName $($File.BaseName) -FilePath $($File.FullName)
}

# Create a pool of 3 runspaces
$pool = Get-RunspacePool 3

$files = Get-ChildItem <path-to-files> 

foreach($file in $files)
{
    $AsyncPipelines += Invoke-Async -RunspacePool $pool -ScriptBlock $ScriptBlock -Parameters $file
}

Receive-AsyncStatus -Pipelines $AsyncPipelines

Receive-AsyncResults -Pipelines $AsyncPipelines -ShowProgress

You’ll notice there is nothing particularly complex here in the code. But, all the warnings from the runspace post apply. Multithreading is awesome and powerful, but use it with care.

Innovation from Microsoft?

I bit the bullet and installed Windows 8. The verdict? It’s the best first-generation ‘next generation’ OS Microsoft has ever released. It’s better than Windows 95 and better than Vista (relative to what they were at the time, of course). What I thought they would fail to implement (the split-personality interface-formerly-known-as-Metro vs. classic desktop experience) they somehow managed to make work. Even though the cartoonish Start screen makes this old beige-tower IT professional a little squeamish, it still works. And the press is correct: Windows 8 is a reinvention of the Microsoft OS. What the press has largely missed, however, is that Microsoft has, mostly unnoticed and unpublicized, already reinvented itself; Windows 8 is one of the last steps, not the first. And Microsoft has not been particularly vocal about what it is doing.

Even those who sense something more conclude that Microsoft is now about being a social platform or about being hardware independent. It is much more than that. Some have also made the mistake of thinking that Microsoft wants to become like Apple, but I don’t believe that is the case. Apple’s ignorance of business concerns is a major flaw. It understands the consumer market, but there is only so much money there, and that market is fundamentally fickle and prone to exhaustion. Its nature tends toward commoditization. Apple is already falling into the litigiousness that companies tend toward when they see their competitive advantages slipping away. And people notice. As it turns out, the hipster cultural inclinations that made Apple a success are serving it poorly now.

Microsoft wants to make money whenever *anyone* uses technology services. If that happens to be socially focused services, it’ll provide a way for you to interact with them in an easier and more consolidated fashion (you can integrate your major social media accounts in the native Windows 8 ‘People’ app, for instance). Microsoft is no longer a software company; it is a *service platform* company. That is, its core products are no longer shrink-wrapped boxes sold off the shelves or through OEM channels. The revenue numbers, still heavily weighted towards the traditional product lines, do not tell the full story, which is why I think the shift has not been captured by the business press at large. They haven’t picked up that Office is becoming Office 365, which is integrated with outlook.com, which has the same look and feel as Windows 8, which integrates back to your Office 365 calendar, which connects with your company’s CRM (customer relationship management) solution. And I don’t think most understand how truly different Azure is. Amazon has a well-developed cloud platform (which is in some ways better than Azure). You can even deploy SQL Server and SharePoint in Amazon’s cloud in a show of Microsoft’s new openness. But Amazon is simply exposing platforms whose primary benefits are hardware abstraction and scalability. It has not made it easier to work with these platforms as services (at least not to the same extent).

From a solution point of view Microsoft now offers not individual services, but a *pattern.* Just like, say, MVVM or MVC gives a developer a broad protective boundary in which to work, the Microsoft ecosystem gives corporate IT a path to bring its services under control. Microsoft knows that the biggest challenge to enterprise IT is making all of those systems that have invaded the company actually work effectively together. Just about every major IT shop in the world has a mish-mash of products and services gathered from this or that vendor whose marketing VP went to Harvard with the CIO, cobbled together with poorly documented integration efforts. The holy grail of corporate IT is to make it easier, faster, and less expensive to deploy and maintain business-facing services and applications. Microsoft has long provided products that businesses rely on. And they have always generally worked *better together* than competitors’ products. Microsoft’s products were not always best of breed individually, but as a collective they had a clear differentiation. Until now, that integration has been internal. Now, Microsoft wants to allow you to bring non-Microsoft products into the fold. Want Hadoop? Sure, here’s an SQL Server connector, and we’ll also make it fundamentally better. And we’ll give you a way to deploy it in Azure. Want Linux? Deploy it to Azure IaaS. And it’s FLUID. If your data architects decide it’s better to have your data store in Azure rather than in on-premises SQL Server, right-click and select ‘Deploy Database to SQL Azure.’ If you want to keep writing Java, go ahead; they will provide source control for you and give you a platform to run your app servers. If that technology is developed and marketed by Microsoft, even better. But, if not, they’re going to make it easier than anyone else out there to work it into your existing plans. And they’re going to present AppFabric to let you tie it all together. They’ve been laying the groundwork on this for years.

Windows 8 is simply the way people are going to experience all the machinations behind the scene. Microsoft needed a way that it could expose these services to end users in a way that made sense and was consistent no matter which device they decided to use. Windows 8 does that. But, the important advancement Microsoft has made is that it will be easier for companies to pull themselves together into the ecosystem and start presenting themselves more effectively to customers.

It’s still too early to determine how successful this is going to be. Windows 8 isn’t exactly flying off the shelves and there are still major hurdles to be overcome before we see widespread adoption of Microsoft’s model. Some of these challenges are technical and some reach right down into the chaotic nature of corporate IT. Even with my admiration for what Microsoft is trying to do, I have serious reservations about its ability to execute. That’s a subject for another post. But the fact that Microsoft has been able to make this fantastically difficult cultural shift is extraordinary. Steering a huge ship that had become as calcified as Microsoft was for so many years is a feat certainly worth noticing.

Speaking at #sqlsat145 and #sqlsat167

Continuing my torrid pace of speaking at SQL Saturdays once in a while, I’ve been fortunate enough to be selected to speak at #167 in Columbus, GA (which I also have a small part in organizing) and at Nashville #145 in October. I’m particularly excited because I’m delivering one of my favorite talks: T-SQL Beyond the Basics. This presentation digs into what happens after you submit a query for execution. Many SQL developers spend an awful lot of time writing queries, but never understand the beautiful marvel of computer science that goes on just on the other side of their SSMS window or application. So I get to spend an hour talking about SQL Server architecture and execution plans. It’s one of those types of presentations where the ‘Ah ha!’ moments I love abound.

One of my own ‘Ah ha!’ moments is another reason why I enjoy this presentation so much. At my first PASS Summit I went to Maciej Pilecki’s (T) session on execution plans (plan caching, specifically), and during those 90 minutes all sorts of bits of learning I had been gathering over the years fell into place. Digging into the internals of SQL Server query execution suddenly made connections that were big revelations to me, and my excitement about my profession was through the roof.

I’m extraordinarily fortunate to do what I do. And I’m looking forward to sharing some of the excitement I have about the amazing world of data and the tools we use with the sql family.

Concurrency in PowerShell: Multi-threading with Runspaces

<This is part 2 in a series.>

Update: The psasync module described in this post is a better version of the code implemented below.

Download the code for this post.

In my last post I looked at using background jobs to execute PowerShell code concurrently, concluding that for many tasks the large amount of overhead makes this method counterproductive. Fortunately, there is a better way. What I present here is an expansion on the work of Oisin Grehan (B | T), who deserves the credit for introducing this method. This blog post introduced me to the concepts upon which I expound here.

A quick note to begin: I am presenting something that is not well documented and outside the ‘normal’ operations of PowerShell. I don’t think PowerShell was designed to be used in this way, as evidenced by the lack of thread safety in cmdlets and the absence of native synchronization mechanisms (that I can find). I’d love it if someone reading this blog could provide more color around PowerShell’s philosophy of multi-threading, but judging by the built-in mechanisms (jobs), the designers wanted to avoid the issue, and for good reasons! Use this at your own risk and only after testing EXTENSIVELY.

To review: the test scenario for this series involves a series of Excel documents that must be loaded into an SQL Server database. The goal is to speed up the process by loading more than one file at a time. So, I need to gather a collection of the file objects and run a PowerShell script block that executes the ETL (Extract, Transform, Load) code against each file. As you can see below, this is very simple code, but it must be executed many times, which is the ideal (though not the only) use case for this pattern.

$ScriptBlock = `
{
    Param($File)
    
    . <your_path>\Import-ExcelToSQLServer.ps1
    
    Import-ExcelToSQLServer -ServerName 'localhost' -DatabaseName 'SQLSaturday' -SheetName "SQLSaturday_1" `
        -TableName $($File.BaseName) -FilePath $($File.FullName)
}

What I need is some way for PowerShell to act as a dispatcher to generate other threads on which these import processes can operate. The key elements here are the RunspacePool and PowerShell classes in the System.Management.Automation namespace. These classes are meant to enable applications to utilize PowerShell, but I am using them for a different purpose. Yep, it’s about to get very developery on this blog. But have no fear, non-developers (like my fellow DBAs): I’m working on making this easier for you.

Every PowerShell pipeline, defined by the PowerShell class, must have an environment of resources and session constructs (environment variables, loaded cmdlets, etc.) in which to run. In other words, every pipeline needs a runspace. A pipeline can only exist on one runspace. However, pipelines can also be queued onto runspace pools. It is this ability to create runspace pools that allows for (albeit clumsy) multi-threading capabilities.

RunspacePools are created through the CreateRunspacePool static method of the RunspaceFactory class. This method has 9 overloads, so there are plenty of options to explore. The simplest is:

$pool = [RunspaceFactory]::CreateRunspacePool(1, 3)

This simple line of code creates a pool of 3 runspaces upon which pipelines can run. You can do a lot more with runspace pools, such as establishing session state configurations that can be shared by all the runspaces in the pool. This is handy for, say, loading specific modules or establishing shared variables. But, it is beyond the scope of this post. Choosing the size of your runspace pool is very important. Too many and you will find diminishing (or worse) performance. Too few and you will not reap the full benefits. This is a decision that must be made per computer and per workload. More on this later.
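One reasonable starting point (a sketch of my own rule of thumb, not guidance from the post) is to tie the maximum pool size to the machine’s logical core count and adjust from there:

# Sketch: size the pool from the logical processor count, leaving a little
# headroom for the dispatching thread and the OS, then tune per workload.
$maxRunspaces = [Math]::Max(1, [Environment]::ProcessorCount - 1)
$pool = [RunspaceFactory]::CreateRunspacePool(1, $maxRunspaces)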

Part of the configuration of the runspace pool is the apartment state value. With this code, I specify that all of the runspaces will be in single-threaded apartments and then open the pool for use.

$pool.ApartmentState = "STA"
$pool.Open()

Apartment state is a complicated topic and I’m not going to attempt to describe it here. I will only say that this is an attempt to force thread synchronization of COM objects. You also need to be aware of it because certain code will only work in a multi-threaded or single-threaded apartment. You should also be aware of what your IDE uses. For instance, the ISE uses STA, while the shell itself (in v2) is MTA. This can be confusing! Since this is a COM mechanism that doesn’t really ‘exist’ in Windows per se, it is not sufficient to solve your thread safety concerns. But it is my attempt to provide what automatic synchronization I can. With that, a quick word on thread safety.

If you compare the script block above with the script block from my post on background jobs, you will see that the properties of the file objects are quite different. This is because the RunspacePool method does *not* serialize / deserialize objects, but passes the objects to the runspaces by reference. This means that an object on thread B that was created by thread A points to precisely the same memory location. So, if thread B calls a method of the object at the same time thread A is executing code in the same method, thread B’s call could be making modifications to local variables within the method’s memory space that change the outcome of, or break, thread A’s execution and vice versa. This is generally considered to be a bad thing. Be careful with this. You should take care in your code to ensure that the same object cannot be passed to more than one thread. Again, use at your own risk.

At this point, I can begin creating pipelines and assigning them to the runspace pool. In the code download you will see that I run this in a loop to add a script block for every file to the pool, but I’m keeping it simple here. There are a few other bits in the sample code that I don’t expound on in this post, too.

$pipeline  = [System.Management.Automation.PowerShell]::create()
$pipeline.RunspacePool = $pool 
$pipeline.AddScript($ScriptBlock).AddArgument($File)

Here the PowerShell pipeline object is captured and then assigned to the previously created runspace pool. The script is then added to the pipeline and a file object is passed as an argument. (Note that you can pass n parameters to a script block by appending additional AddArgument() calls. You can also queue n scripts or commands on a pipeline and they will be executed sequentially within the runspace.) The script is not executed immediately. Rather, two methods exist that cause the pipeline to begin executing. The Invoke() method is the synchronous version, which causes the dispatching thread to wait for the pipeline contents to process and return. BeginInvoke() is the asynchronous method that allows the pipeline to be started and control returned to the dispatching thread.

$AsyncHandle = $pipeline.BeginInvoke()

BeginInvoke() returns an asynchronous handle object whose properties include useful information such as the execution state of the pipeline. It’s also how you are able to hook back into the pipeline at the appropriate time. To do so, the EndInvoke() method is used. EndInvoke() accepts the handle as its argument and will wait for the pipeline to complete before returning whatever contents (errors, objects, etc.) were generated. In this code sample, the results are simply returned to the host pipeline. Also note that since the PowerShell class holds unmanaged resources, calling Dispose() is wise. Otherwise, garbage collection will not release that memory and your powershell.exe process will stay bloated until the object is disposed or the process is closed (just for fun, you can test this using [GC]::Collect()). Closing the RunspacePool is also good practice.

$pipeline.EndInvoke($AsyncHandle)
$pipeline.Dispose()
$pool.Close()
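If you would rather not block in EndInvoke() right away, the async handle can be polled first. This is a sketch using the same $pipeline and $AsyncHandle from above:

# Sketch: poll the handle so the dispatching thread can report progress (or do
# other work) instead of blocking immediately in EndInvoke().
while (-not $AsyncHandle.IsCompleted) {
    Write-Verbose "Pipeline state: $($pipeline.InvocationStateInfo.State)"
    Start-Sleep -Milliseconds 250
}

$results = $pipeline.EndInvoke($AsyncHandle)
$pipeline.Dispose()
$pool.Close()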

Notes on usage

You shouldn’t use this method for every task, and when you do use it, every decision should be carefully considered. Take the size of your runspace pool, for instance. Think carefully about how and where your code will be executed. Where are the bottlenecks? Where is the resource usage occurring? And, of course, how many CPU cores are on the machine where the code will be executed (both host machine and remote)?

For example, I have used this method to perform index maintenance on SQL Servers. But consider all of the pieces. If you didn’t know that index rebuilds (but not reorgs!) can be multi-threaded by SQL Server, you could get into some trouble. I came across a database tool that professes to multi-thread index rebuilds, but its method is simply to calculate the number of cores available to the server and kick off that number of rebuilds. Ignoring for a moment that you have not left any processors for Windows to use, you’ve also not considered the operations of the index rebuilds themselves. If the max degree of parallelism setting is 0 on the index definition (or any number other than 1), you could be looking at serious resource contention. Imagine an 8-core server: that’s potentially 64 simultaneous threads! It will work, but the scheduler yields, CPU cache thrashing, context switches, and (cross-NUMA node access costs?) may have a serious impact on your system.

So, be careful and think through the impact of the decisions you make when using this method.
