The Case of the Orphan Transactions
July 31, 2013 2 Comments
I usually have trouble making it through long blog posts and this isn’t going to be short. But, this is quite a story and can’t be told quickly.
It all started with an open transaction; well, almost two THOUSAND open transactions. The database maintenance job was blocked and throwing alerts. It was not blocking anything else, but of course that’s because the developers, as a matter of pride, always use NOLOCK on every query they write. Who cares about accurate queries anyway?
Here’s the partial output from sp_WhoIsActive:
Any experienced DBA knows immediately what’s going on here. Some developer wrote code that opens transactions without properly committing or rolling them back. This results in a cascade of nested transactions and, eventually, a killed SPID and lost work. So what to do? Being an experienced DBA I knew the answer to that, too: I blame the developers. Of course, they SWEAR they aren’t opening any transactions. There used to be a problem where connections were not always being closed, but they fixed that months ago. Being an experienced DBA I don’t believe them, so I acquire a copy of the code myself. Shockingly I find the developers are actually correct; the code (although a spaghetti mess of static methods, nested internal and partial classes, and SQL injection vulnerabilities) is (more or less) properly managing it’s connections and is not opening transactions. It was time to investigate (what else am I going to do on a Friday night?).
Taking an inventory of what I knew: the connection opened on July 19th, but had only been sleeping for 23 seconds. Obviously the connection was still being utilized. And the same stored procedure was being executed on the connection over and over again. But, open_tran_count was not increasing. The connection was over ODBC using a DSN that is shared by numerous processes with the same domain user account. This service ran on 8 servers, but only one of them was causing a problem. I needed more information, so I headed to the DMVs.
This is interesting. There are 1,926 open transactions but there is only one in this DMV. So, for which one of those 1,926 transactions does this DMV report? The eminent Adam Machanic, author of sp_WhoIsActive, reveals that he has to go to the deprecated sysprocesses view in order to find out how many transactions are actually open. Books online is no help. A bit of testing shows that transaction_begin_time refers to the time the FIRST transaction was started. Let’s check another:
Again I had the same problem: only one transaction record. More testing shows that database_transaction_begin_time is the time the first transaction in the series touched the database with a modification. So there were 30 hours between the time the first transaction started and the time one actually touched (attempted to modify) the database. This was all interesting, but not particularly useful since there wasn’t a clear explanation of this behavior in the code.
The question remained, what was opening all of those transactions? The stored procedure was solid. It used TRY … CATCH with appropriate commit and rollback. It wrapped an explicit BEGIN TRANSACTION around an INSERT, but, while extraneous, that was not of great concern. The code followed the try..catch…finally pattern with the call to the OdbcConnection.Close() method in the finally block. Everything looked good there. Was it possible that, contrary to the documentation, Close() did not actually roll back open transactions? I opened up Visual Studio and in a few minutes had hammered out a quick app to test. And, in fact, the documentation is incorrect or misleading. Close() does NOT roll back open transactions when connection pooling is enabled. The transactions are only rolled back when another process acquires the same connection from the pool and sp_reset_connection is executed. However, this means that those transactions should have been rolling back because the code was properly closing the connection. The mystery thickened. At this point I should have become more suspicious, but I was too deep into the weeds looking for the zebra. I was grasping at straws.
What ELSE could be opening transactions and leaving them open? The only thing I could think of was
implicit transactions, but NO ONE uses implicit transactions except Oracle, right? But, it would offer a partial explanation. Every time the stored procedure was called two transactions would be opened, but only one committed, leaving the second open. The prospect was intriguing. However, based on my test results on the previous experiment, I should expect perhaps one to be held open after each call to the proc, but not 1,926. But, maybe there was some strange confluence of events since now we had to consider read-only transactions in the mix, as well. Some quick investigation should put it to bed; if there are no connections using implicit transactions it’s a dead end. Since that particular connection setting doesn’t seem to show up in the DMVs (maybe it’s hidden somewhere I couldn’t find) I quickly grabbed the ExistingConnection trace from Profiler. Of course, of all apps in the world to use this esoteric connection setting it would have to be this one. Hidden somewhere in compiled vendor code to which we didn’t have access it was doing this:
Implicit transactions and repeatable read is an interesting combination. Is it possible that some race condition was the culprit where the vendor app opened the connection using the same DSN and set implicit transactions, but the setting wasn’t changed when the connection was grabbed from the pool by this custom written process? Perhaps sp_reset_connection had some loophole or quirk? It turns out that sp_reset_connection is not well documented at all. It is considered an API procedure by Microsoft (you can’t execute it without access to the API) and they have chosen not to document it. In fact, this Stackoverflow post seems to be the best source of documentation out there and is also referenced by John Huang among others. So, it doesn’t change the isolation level, perhaps it doesn’t change implicit transaction, as well. More code in my test app would prove that this, also, was a dead end. Not only does the setting return to implicit_transactions off, but the transactions would have rolled back anyway.
I was well and truly stumped. This was a big problem, not so much because of user or system impact, but because problems like this eat at me until I find an explanation. I was in deep and was scrambling for a way out. It was so consuming my thoughts that I subjected my lunch companion to a thorough overview of my efforts. Just as I was about to leave, it struck me; what if the code I was reading was not actually the code that was running? Not that the version of the code I had and the version of the code that was on the server were different. I knew they were the same. What if the code on the server’s hard drive was not the code that was in the server’s memory? This code ran as a Windows service. What if the .exe and .dll were replaced, but the service was never restarted? A quick test proved that you can, in fact, replace the files without stopping the service. The fix the developers put in months ago to properly close the connection was never actually implemented. In fact, when they restarted the service after I presented my theory it failed because they had also changed the service account at the same time they changed the code and it did not have the permissions required to run the service.
So, returning to the simple stored procedure, if the single insert timed out because it was blocked by, say, an index rebuild operation the COMMIT would never execute, leaving the explicit transaction open forever. The app would continue to time out and run the procedure again and again, allowing the transactions to build up over the course of time. When whatever process that was blocking and causing timeouts was cleared, the transactions would start to commit again, causing the watch to stop, so to speak, at 1,926 open transactions. The procedure did not have the XACT_ABORT ON statement, so the condition would never be properly handled by SQL Server.
In retrospect, there were a few other things I could have tried to get more information. I could have used Fn_dblog() to inspect what the process was doing. I could have queried WMI to see how long the process had been running to validate my theory (I got lucky with the service account). And I still have outstanding questions (why was there a 30 hour gap in the transaction start times in the DMVs?). But, all-in-all, it was a fascinating troubleshooting experience and I learned quite a bit.
- OdbcConnection.Close() does NOT roll back open transactions when connection pooling is enabled. They are rolled back by sp_reset_connection when another request pulls the connection from the pool.
- Sp_reset_connection DOES reset the implicit_transaction setting, but not isolation level.
- You can overwrite dependent files for a Windows service without stopping the service.
- We could use a bit more information about transactions from the DMVs.
- ODBC connection pooling is controlled at the driver level, so you can’t specify different setting for different DSNs. I think you can extend the connection string, however, but I didn’t test or confirm.