Concurrency in PowerShell: Multi-threading with Runspaces

<This is part 2 in a series.>

Update: The psasync module described in this post is a better version of the code implemented below.

Download the code for this post.

In my last post I looked at using background jobs to execute PowerShell code concurrently, concluding that for many tasks the large amount of overhead makes this method counterproductive. Fortunately, there is a better way. What I present here is an expansion on the work of Oisin Grehan, who deserves the credit for introducing this method. His blog post introduced me to the concepts upon which I expound in this post.

A quick note to begin. I am presenting something that is not well documented and outside the ‘normal’ operations of PowerShell. I don’t think PowerShell was designed to be used in this way, as evidenced by the lack of thread safety in cmdlets and the absence of native synchronization mechanisms (that I can find). I’d love it if someone reading this blog could provide more color around PowerShell’s philosophy of multi-threading, but judging by the built-in mechanisms (jobs) the designers wanted to avoid the issue, and for good reasons! Use this at your own risk and only after testing EXTENSIVELY.

To review: the test scenario for this series involves a set of Excel documents that must be loaded into a SQL Server database. The goal is to speed up the process by loading more than one file at a time. So, I need to gather a collection of the file objects and run a PowerShell script block that executes the ETL (Extract, Transform, Load) code against each file. As you can see, this is very simple code, but it must be executed many times … the ideal (but not only) use case for this pattern.

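# Script block that each runspace executes against a single file object
$ScriptBlock = {
    Param($File)

    # Dot-source the import function so it is available inside the runspace
    . <your_path>\Import-ExcelToSQLServer.ps1

    Import-ExcelToSQLServer -ServerName 'localhost' -DatabaseName 'SQLSaturday' -SheetName 'SQLSaturday_1' `
        -TableName $File.BaseName -FilePath $File.FullName
}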

What I need is some way for PowerShell to act as a dispatcher to generate other threads on which these import processes can operate. The key elements to this are the RunspacePool and PowerShell classes in the System.Management.Automation namespace. These classes are meant to enable applications to host PowerShell, but I am using them for a different purpose. Yep, it’s about to get very developery on this blog. But have no fear, non-developers (like my fellow DBAs): I’m working on making this easier for you.

Every PowerShell pipeline, defined by the PowerShell class, must have an environment of resources and session constructs (environment variables, loaded cmdlets, etc.) in which to run. In other words, every pipeline needs a runspace. A pipeline can only exist on one runspace. However, pipelines can also be queued onto runspace pools. It is this ability to create runspace pools that allows for (albeit clumsy) multi-threading capabilities.

RunspacePools are created through the CreateRunspacePool static method of the RunspaceFactory class. This method has 9 overloads, so there are plenty of options to explore. The simplest form that sets the pool size is:

$pool = [RunspaceFactory]::CreateRunspacePool(1, 3)

This simple line of code creates a pool that holds at least 1 and at most 3 runspaces upon which pipelines can run. You can do a lot more with runspace pools, such as establishing session state configurations that can be shared by all the runspaces in the pool. This is handy for, say, loading specific modules or establishing shared variables. But, it is beyond the scope of this post. Choosing the size of your runspace pool is very important. Too many runspaces and you will find diminishing (or worse) performance. Too few and you will not reap the full benefits. This is a decision that must be made per computer and per workload. More on this later.
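For readers curious about the session state option mentioned above, here is a minimal sketch rather than a recommendation. It assumes the CreateRunspacePool overload that accepts an InitialSessionState and a host; the variable name BatchLabel and its value are made up purely for illustration.

```powershell
# Sketch: every runspace in this pool starts with a predefined variable.
# 'BatchLabel' and 'nightly-load' are illustrative assumptions.
$iss = [System.Management.Automation.Runspaces.InitialSessionState]::CreateDefault()
$entry = New-Object System.Management.Automation.Runspaces.SessionStateVariableEntry `
    -ArgumentList 'BatchLabel', 'nightly-load', 'Label visible in every runspace'
$iss.Variables.Add($entry)

# Overload used: (minRunspaces, maxRunspaces, InitialSessionState, PSHost)
$sharedPool = [RunspaceFactory]::CreateRunspacePool(1, 3, $iss, $Host)
$sharedPool.Open()

# A pipeline queued onto the pool can read the variable without it being passed in.
$ps = [PowerShell]::Create()
$ps.RunspacePool = $sharedPool
[void]$ps.AddScript({ $BatchLabel })
$labelSeen = $ps.Invoke()

$ps.Dispose()
$sharedPool.Close()
```

Note that each runspace gets its own variable entry pointing at the same value, so the thread safety cautions in this post still apply if that value is a mutable object.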

Part of the configuration of the runspace pool is the apartment state value. With this code, I specify that all of the runspaces will be in single-threaded apartments and then open the pool for use.

$pool.ApartmentState = "STA"
$pool.Open()

Apartment states are very complicated topics and I’m not going to attempt to describe them here. I will only say that this is an attempt to force thread synchronization of COM objects. You also need to be aware of these since certain code will only work in a multi-threaded or single-threaded apartment. You should also be aware of what your IDE uses. For instance, the ISE uses STA, while the shell itself (in v2) is MTA. This can be confusing! Since this is a COM mechanism that doesn’t really ‘exist’ in Windows per se, it is not sufficient to solve your thread safety concerns. But, it is my attempt to provide what automatic synchronization I can. With that, a quick word on thread safety.

If you compare the script block above with the script block from my post on background jobs, you will see that the properties of the file objects are quite different. This is because the RunspacePool method does *not* serialize / deserialize objects, but passes the objects to the runspaces by reference. This means that an object on thread B that was created by thread A points to precisely the same memory location. So, if thread B calls a method of the object at the same time thread A is executing code in the same method, thread B’s call could be making modifications to local variables within the method’s memory space that change the outcome of, or break, thread A’s execution and vice versa. This is generally considered to be a bad thing. Be careful with this. You should take care in your code to ensure that the same object cannot be passed to more than one thread. Again, use at your own risk.
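The by-reference behavior is easy to see for yourself. The following is a small sketch with a made-up hashtable standing in for a real object: the script block runs in a separate runspace, yet the change it makes is visible to the caller because both sides hold the same object.

```powershell
# Illustrative object only; any reference type behaves the same way.
$shared = @{ Touched = $false }

$ps = [PowerShell]::Create()
[void]$ps.AddScript({
    Param($Table)
    # Runs in a different runspace, but $Table is the caller's hashtable itself.
    $Table.Touched = $true
}).AddArgument($shared)
[void]$ps.Invoke()
$ps.Dispose()

# The caller now sees the modification made inside the other runspace.
$touchedAfter = $shared.Touched
```

With background jobs the same experiment would leave the original hashtable unchanged, since the job receives a deserialized copy.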

At this point, I can begin creating pipelines and assigning them to the runspace pool. In the code download you will see that I run this in a loop to add a script block for every file to the pool, but I’m keeping it simple here. There are a few other bits in the sample code that I don’t expound on in this post, too.

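$pipeline = [System.Management.Automation.PowerShell]::Create()
$pipeline.RunspacePool = $pool
# AddScript() and AddArgument() return the pipeline object; [void] keeps it out of the output
[void]$pipeline.AddScript($ScriptBlock).AddArgument($File)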

Here the PowerShell pipeline object is captured and then assigned to the previously created runspace pool. The script is then added to the pipeline and a file object is passed as an argument. (Note that you can pass n parameters to a script block by appending additional AddArgument() calls. You can also queue n scripts or commands to a pipeline and they will be executed synchronously within the runspace.) The script is not executed immediately. Rather, two methods exist that cause the pipeline to begin executing. The Invoke() method is the synchronous version, which causes the dispatching thread to wait for the pipeline contents to process and return. BeginInvoke() is the asynchronous method, which starts the pipeline and returns control to the dispatching thread immediately.

$AsyncHandle = $pipeline.BeginInvoke()
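To contrast the two invocation styles and the argument chaining described above, here is a short sketch. The script block and its parameter names are hypothetical; AddParameter() is shown as a named alternative to positional AddArgument() calls.

```powershell
# Hypothetical script block used only for illustration.
$joinBlock = {
    Param($Left, $Right)
    '{0}-{1}' -f $Left, $Right
}

# Synchronous: Invoke() blocks the calling thread until the pipeline finishes.
$syncPipeline = [PowerShell]::Create()
[void]$syncPipeline.AddScript($joinBlock).AddArgument('a').AddArgument('b')
$syncResult = $syncPipeline.Invoke()
$syncPipeline.Dispose()

# Asynchronous: BeginInvoke() returns at once; EndInvoke() collects the output.
$asyncPipeline = [PowerShell]::Create()
[void]$asyncPipeline.AddScript($joinBlock).AddParameter('Right', 'b').AddParameter('Left', 'a')
$handle = $asyncPipeline.BeginInvoke()
# ... the dispatching thread is free to do other work here ...
$asyncResult = $asyncPipeline.EndInvoke($handle)
$asyncPipeline.Dispose()
```

Positional arguments bind in the order they are added, while named parameters can be added in any order.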

BeginInvoke() returns an asynchronous handle object, the properties of which include useful information such as the execution state of the pipeline. It’s also how you are able to hook back into the pipeline at the appropriate time. To do so, the EndInvoke() method is used. EndInvoke() accepts the handle as its argument and will wait for the pipeline to complete before returning whatever contents (errors, objects, etc.) were generated. In this code sample, the results are simply returned to the host pipeline. Also note that since the PowerShell class implements IDisposable, calling Dispose() is wise. Otherwise, the resources it holds will not be released promptly and your powershell.exe process will be bloated until such time as the object is disposed or the process is closed (just for fun you can test this using [GC]::Collect()). Closing the RunspacePool is also good practice.

$pipeline.EndInvoke($AsyncHandle)
$pipeline.Dispose()
$pool.Close()
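Putting the pieces together, the loop mentioned earlier might look something like the sketch below. It is not the code from the download: the work items and script block are stand-ins (numbers and a squaring function instead of files and the import), and the variable names are my own.

```powershell
# Stand-ins for the file collection and the import script block.
$items = 1..5
$work = {
    Param($n)
    Start-Sleep -Milliseconds 200   # simulate some work
    $n * $n
}

$workPool = [RunspaceFactory]::CreateRunspacePool(1, 3)
$workPool.Open()

# Queue one pipeline per item, keeping each pipeline together with its handle.
$jobs = foreach ($item in $items) {
    $ps = [PowerShell]::Create()
    $ps.RunspacePool = $workPool
    [void]$ps.AddScript($work).AddArgument($item)
    New-Object PSObject -Property @{ Pipeline = $ps; Handle = $ps.BeginInvoke() }
}

# Collect output in submission order; EndInvoke() blocks until that pipeline is done.
$results = foreach ($job in $jobs) {
    $job.Pipeline.EndInvoke($job.Handle)
    $job.Pipeline.Dispose()
}

$workPool.Close()
```

Because the pool allows at most 3 runspaces, the fourth and fifth pipelines wait in the queue until a runspace frees up.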

Notes on usage

You shouldn’t use this method for every task, and when you do, every decision should be carefully considered. Take the size of your runspace pool, for instance. Think carefully about how and where your code will be executed. Where are the bottlenecks? Where is the resource usage occurring? And, of course, how many CPU cores are on the machine where the code will be executed (both host machine and remote)?

For example, I have used this method to perform index maintenance on SQL Servers. But, consider all of the pieces. If you didn’t know that index rebuilds (but not reorgs!) could be multi-threaded by SQL Server, you could get into some trouble. I came across a database tool that professes to multi-thread index rebuilds, but its method is to simply calculate the number of cores available to the server and kick off that number of rebuilds. Ignoring for a moment that you have not left any processors for Windows to use, you’ve also not considered the operations of the index rebuilds themselves. If the max degree of parallelism setting is 0 on the index definition (or any number other than 1), you could be looking at serious resource conflict. Imagine an 8 core server. That’s potentially 64 simultaneous threads (8 rebuilds, each running on up to 8 cores)! It will work, but the scheduler yields, CPU cache thrashing, context switches, and perhaps cross-NUMA node access costs may have a serious impact on your system.

So, be careful and think through the impact of the decisions you make when using this method.
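As one possible starting point for sizing, and only that, the sketch below derives the pool maximum from the local core count and leaves one core free. The "cores minus one" rule is an assumption of mine for illustration; an I/O-bound or remote workload may call for a very different number, so measure before settling on it.

```powershell
# Assumed heuristic: leave one core for the OS, never drop below one runspace.
$cores = [Environment]::ProcessorCount
$maxRunspaces = [Math]::Max(1, $cores - 1)

$sizedPool = [RunspaceFactory]::CreateRunspacePool(1, $maxRunspaces)
$sizedPool.Open()

# Confirm what the pool will actually allow before queuing work onto it.
$reportedMax = $sizedPool.GetMaxRunspaces()

$sizedPool.Close()
```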

16 Responses to Concurrency in PowerShell: Multi-threading with Runspaces

  1. Tim Green says:

    I really liked this article and found that the technique described here works much better than PowerShell’s built-in job functionality for testing programs with multiple threads. Testing with Invoke-Command -AsJob I’ve found one can’t be sure that all the threads will be running simultaneously, where the Invoke-Async function works well.

    I have run into one problem though. On a couple servers – connecting to one via Citrix and the other via RDP – I’ve found Invoke-Async fails repeatedly with the simplest of tests at this line:

    $pipelines[$i].EndInvoke($jobs[$i])

    The error shown is:
    WARNING: error: Method invocation failed because [System.Threading.ManualResetEvent] doesn’t contain a method named ‘EndInvoke’.

    Followed by:
    Method invocation failed because [System.Threading.ManualResetEvent] doesn’t contain a method named ‘Dispose’.
    + $pipelines[$i].Dispose <<<< ();
    + CategoryInfo : InvalidOperation: (Dispose:String) [], RuntimeException
    + FullyQualifiedErrorId : MethodNotFound

    I'm quite puzzled about why this occurs, especially when it works fine on my PC.

  2. jboulineau says:

    I haven’t come across that problem. If you post more of your code I’ll take a look.

    • Tim Green says:

      I believe this is my error. I put the Invoke-Async in a module and have found when I use it from there, I get this error. If I import the function using dot sourcing, it doesn’t occur. I don’t really understand why that is exactly – a scope problem maybe?

      While trying to figure out what was going on though I did notice a bug in the function. $ObjArray is indexed using $i, but $i starts at 1 instead of 0 and with 0-based arrays in PowerShell, the first element in $ObjArray is never processed and the last thread gets a null argument. As $jobs, $pipelines and $waithandles are all hashes instead of arrays, I believe only the following line needs to change:
      [void]$pipelines[$i].AddScript($ScriptBlock).AddArgument($ObjArray[$i-1])

      I think there are a few ways Invoke-Async could be enhanced. For one, I think one could use WaitAny() instead of WaitOne() and then get notified immediately when the first thread finishes. If I understand the way it works now correctly, it’s going to wait for the 1st to finish, and then the 2nd, etc. A single long-running thread may block, and the progress bar wouldn’t update even though other threads are finishing. At least, I think so if I understand how this is working correctly, which I may not :)

      Also, it would be nice if the output were returned to the console when the thread finished, like receive-job works with background jobs. I believe it’s doable and if I get some spare time, I hope to investigate it.

      Anyway, thanks again. I’ve learned a lot

      • jboulineau says:

        The problem with WaitAny is that it has a limit of 64 handles.

        Thanks for the code review. I’ll check on the arrays and try to figure out what I was thinking when I wrote the code. I’ve actually got a full module that I hope to post on soon that will provide some encapsulated multi-threading. It takes a different approach than this post, which was really just to demonstrate the concepts.

        Thanks for reading!

  4. pamkkkkk says:

    I searched for a way to report progress from a non-pooled runspace, and I ended up with [HashTable]::Synchronized(@{}).

    Example:
    $sharedData = [HashTable]::Synchronized(@{})
    $sharedData.Counter = 0

    $PS = [PowerShell]::Create()
    $PS.Runspace.SessionStateProxy.SetVariable("sharedData", $sharedData)
    [void]$PS.AddScript({1..1000 | % {$sharedData.Counter = $_}})
    $Handle = $PS.BeginInvoke()
    while ($Handle.IsCompleted -eq $false) {
        Write-Host $([string]$sharedData.Counter + ',') -NoNewline
    }
    $Result = $PS.EndInvoke($Handle)

    P.S. A very good webcast about PowerShell speed-up, memory usage, and multithreading, from PowerShell MVP Dr. Tobias Weltner, can be found at http://www.idera.com/Events/RegisterWC.aspx?EventID=297

  5. Pingback: Parallel PowerShell | rambling cookie monster

  6. Pingback: psasync Module: Multithreaded PowerShell

  7. Martin9700 says:

    Really interesting stuff, thanks for posting!

    Would it be fair to say that using this Runspace code has essentially been replaced with the workflow process in Powershell 3.0?

    I understand it would still work, and for those still on 2.0 or lower (majority of PowerShell users?) it’s a better option, but it seems like workflows are the way to go, going forward?

    • jboulineau says:

      I need to do a post on workflows. There is no doubt that workflows are an improvement over the *-Job pattern. However, there are still some aspects that may not be what you want. For instance, in a workflow, objects are serialized when passed to spawned threads (runspaces, I assume). Another issue is apparently that parallel workflow execution isn’t particularly fast (http://shellyourexperience.com/2012/07/31/powershell-3-0-workflows-and-sql-server-dont-think-wrong-like-i-did/). I haven’t done all the testing yet, so I can’t confirm. Also, you don’t have the same granularity of control.

      That being said, there are features of workflows that the runspace method cannot easily match. One example is that workflows can be restarted at a checkpoint if interrupted. So I think ultimately workflows and runspace pools are two different use cases, both with legitimate spots in your toolbox. For me, the psasync module makes using runspaces so simple I think I will continue to use that method primarily instead of workflows simply because it is more powerful. And the path to 3.0 is probably not a short one for me due to the environment in my workplace. However, I’m sure I’ll find use for workflows from time to time in the future.

      • Martin9700 says:

        That makes sense, since they’re mapping PS to workflow items. Has to be a lot of overhead with that. I have a script that uses Jobs quite extensively so I will have to adapt it to Runspaces and compare the two. Always love comparing techniques and seeing which ones are best!

  8. jjohn says:

    Hi,
    Awesome post. Just what I wanted. I was sick of background jobs as they were gnawing at my performance.
    I had a ps script that did some work serially and it just took very long. I modified it using runspaces to spawn 10 threads each time and results are pretty fast. I’m satisfied with that. But for the purposes of logging, I would like to know the thread ID of each such thread. So my question is: Is it possible to get the native thread id from inside powershell? I need this because I’m trying to correlate certain administrative events on a HyperV host(which have a thread id) with the actual thread that caused the event.

    I have tried [System.Threading.Thread]::CurrentThread, which does not give native thread IDs.
    I cannot get AppDomain.GetCurrentThreadId() to work, though.

    Any pointers?
    Thanks

    • jboulineau says:

      I’m afraid your question has exceeded my expertise. I don’t have a deep enough understanding of the Windows threading model to give you an adequate answer. My guess would be to look at the powershell.exe process itself since the managed threads run in that context. But, that’s just a guess. If you find out, post another comment to let me (and my tens of readers) know!

  9. Micah says:

    Can you elaborate more on Sharing State Configurations?

    I am trying to share loaded assemblies, functions and variables between all the runspaces in the pool. Unfortunately, I feel like I am missing something.

    • jboulineau says:

      I’ve passed variables between runspaces before, but always ran into thread safety issues. For assemblies and functions I simply load them up into each runspace separately.

      If you do find the secret to success, blog about it!

  10. Pingback: Multi Threaded Powershell | Ron's Space
