How to Hang a SharePoint Backup Job Without a Rope
Here’s a scenario that many seasoned consultants will be familiar with. A customer has a major problem – usually self-induced – that they have either been trying to solve themselves for some period of time or have hired some junior consultant to elevate from a minor annoyance to a complete catastrophe. So they call me in a panic to come bail them out (which, if they’d done in the first place, would have prevented the problem altogether, but they never seem to learn). So I hop on a plane, parachute in (figuratively if not literally), and start pounding keys until the mess is cleaned up.
Sometimes, when you do this sort of thing, you learn something valuable along the way; mostly, you just discover that someone, somewhere, did something really stupid, which was compounded by gross incompetence and exacerbated by complete hysteria. But every so often you come across a situation that was really nobody’s fault and doesn’t involve a malicious moron going out of their way to make your life more difficult. These are rare but they do happen.
In point of fact, I spent Valentine’s Day Weekend in just such a situation. A customer for whom I built a 2003 portal years ago had been trying for some time to migrate to 2007 using internal staff and making limited progress. They finally had it mostly ready to go but they were struggling to move it from development to production. They had migrated most of their old content to a development server (or so they told me), designed their new master pages and navigation structure, installed a medium server farm, configured a nice new SSP, and then hit the wall.
Naturally, this is the part where I swoop in to save the day. What I found when I got there was a development environment that had never had a single patch applied to it, content that was partially migrated five months ago, a production farm on SP1 with a mixture of x86 and x64 servers, and a customer who didn’t see any problem with having legacy OWA web parts on the home page of an Intranet that serves thousands of users. Oh, the joys of being a consultant.
The first step, of course, was to get all the servers straightened out. While I couldn’t convince them to go with a pure x64 environment (which they are going to regret in, oh, I dunno, the first five minutes after it goes live), I did get free rein to patch everything up to current revs. So off I went on the never-ending merry-go-round of download/install/run config wizard/repeat. Since the migration plan called for backing up the dev content databases and restoring them to production, this meant we had to bring the dev servers up to the same level as production, lest we spend eternity in a purgatory of mismatched content databases. Since this ate up the remainder of the first day, the plan (which I honestly believe is a set of utopian ideals we humans delude ourselves with just to give us something to cling to when everything goes completely down the toilet) was to come in the next day, migrate the remaining content, do the backup/restore, and get out early enough to salvage something of the frustrated expectations of everyone’s significant other, it being Valentine’s Day and all (which was fine for them but didn’t do me much good being two thousand miles from home).
Heh. What’s that saying about the best laid plans of mice and men?
First, we discovered, after much mucking around in the 2003 database that I don’t recommend anyone ever try even at home and that brought back very painful memories from the dark days of SP development, that there were, jeez, only like 800 document libraries and 200 lists that had been updated since the original content migration (a total of nearly 10,000 list items). Who would have thought that users might actually upload stuff in the last four months? There went lunch. After abandoning all hope of doing anything about it until after the cutover (did I mention the go live date was only 72 hours away?), we set about trying to back up and restore what content we did have, so at least the design team could finish their changes and get the thing up and running in production; maybe, with any luck, we’d make dinner before the crush of couples descended upon every halfway decent restaurant in the greater Washington, DC area.
Yeah, right…as if. You see, my friends, the dev server, which had behaved itself so admirably by completing backup after backup in the months gone by, suddenly decided that no content database backup, however big or small, was to be performed that day. Configure the parameters, submit the job, it creates the basic set of .bak files and .xml goodies, backs up the config database, then – WHAM! Full stop. No errors, no explanation, no nothing. The timer job just hangs, with the last entry in the backup log consisting of a very helpful notice that SQL would check back with us in exactly 4.22 hours to see if we had received satisfactory customer service. WTF?
After wasting the next couple of hours searching for ghosts in the machine and trying every timer job trick I’ve ever come across, it finally dawned on me to check the SQL server itself. You see, I got to thinking about what had changed that would stop the backups from working, which got me to double-checking that all the patches I ran the previous day had, indeed, installed correctly, right up to the December cumulative update. That’s when it hit me – was it possible, just maybe, that something in one of the SharePoint patches was dependent upon a certain patch level of SQL 2005? After all, every patch we’ve received so far has managed to break something important (SPD workflows after SP1, anyone? How ’bout the lovely API changes in the Infrastructure Update? Oh, you wanted AAMs to work with reverse proxies after the August patches? How ridiculous.) so it stood to reason that someone had never bothered to tell us that some line of code they "fixed" needed another "fix" that someone on the SQL team thought was really, really important.
And that’s when I found it. The development SQL database hadn’t been patched at all – not ever. Not one tiny hotfix. And here we are at SQL 2008 with three service packs under our belt for the previous version. Naturally, this meant a big fat download and a ton of time watching a crawling status bar would be the highlight of the rest of my afternoon as we all know how quick and efficient SQL service pack installations are. So much for an early dinner.
After the patch, I was finally able to complete the backup set. Luckily, someone did have the presence of mind to patch the production database in advance, so I didn’t have to repeat that little adventure. But, of course, the tale couldn’t end there, as the restore process had its own set of hurdles to put in my path. First, of course, is the fabulous requirement to use UNC mappings for backup and restore locations. This is because, whether you realize it or not, a good portion of the backup/restore work is actually done on the SQL server itself, which needs to be able to access the directory the backup files are in (which is why a local mapping on the WFE or App server would never work). But that inevitably leads to all sorts of permission issues, especially if some GENIUS decides to run the SQL service under the LOCAL SYSTEM account. Does this security principal have permissions to access network resources like the lovely UNC share we just created? Of course not! That would just be silly. So now I have to store the backup files in a local directory on my production SQL box (and share them for good measure so the WFEs can see them) which goes against just about every good security practice I can possibly think of.
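For anyone who would rather catch these two gotchas before the restore instead of during it, here is a rough pre-flight sketch in Python. It is an illustration only and not part of any SharePoint or SQL tooling: the function names, the share path, and the account names are all made up, and it assumes you have already looked up the account the SQL service runs under (in the Services console, for instance) and are passing it in as a plain string.

```python
# Hypothetical pre-flight check for the two restore hurdles described above.
# Nothing here talks to SharePoint or SQL Server; it only inspects strings.

# Account names treated as LOCAL SYSTEM (an assumption about how the
# account is reported, compared case-insensitively).
LOCAL_SYSTEM_NAMES = {"localsystem", "local system", "nt authority\\system"}


def is_unc_path(path: str) -> bool:
    """Return True for \\\\server\\share style paths, False otherwise."""
    if not path.startswith("\\\\"):
        return False
    # A usable UNC path needs at least a server name and a share name.
    parts = [p for p in path[2:].split("\\") if p]
    return len(parts) >= 2


def preflight(backup_path: str, sql_service_account: str) -> list:
    """Return a list of human-readable problems (empty list means no flags)."""
    problems = []
    if not is_unc_path(backup_path):
        problems.append(
            "Backup location is not a UNC path; the SQL server will not "
            "see a directory that is local to the WFE or App server."
        )
    if sql_service_account.strip().lower() in LOCAL_SYSTEM_NAMES:
        problems.append(
            "SQL service runs as LOCAL SYSTEM; expect permission trouble "
            "reaching the backup share."
        )
    return problems


if __name__ == "__main__":
    # Made-up server, share, and account names for illustration.
    for problem in preflight(r"D:\backups", "LocalSystem"):
        print("WARNING:", problem)
```

All it does is tell you ahead of time that you are about to have the afternoon I had; fixing the service account is still up to you.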
I cringe when I have to remote into a production SQL box for anything but I especially hate it when a half-dozen people are standing around watching. Not because I’m going to break anything (although that’s always a possibility) but because they will then think they can do it without breaking anything. That’s just asking for bad things to happen. But be that as it may, I was finally able to complete the restore process and get the production machines up and going. The internal team still had thousands of list items to migrate but that was their job – my work was done and I could finally have a well-deserved drink (good thing the hotel bar has a decent selection of Scotch).
The moral of this story, of course, is keep those servers patched! You may not ever think about a dependency between patch levels on different pieces of your architecture but it’s a very real issue, as I hope I’ve demonstrated adequately. Your SQL servers must be on at least SP2 (if you’re running SQL 2005) if your MOSS level is SP1 or higher for everything to work right. And don’t assume the network or server people are on top of it – chances are they think SharePoint is your problem and you should be responsible for every piece of that puzzle. And please follow best practices for installing everything – don’t let the DBAs convince you that it’s just peachy to run SQL as a local account (I guarantee it will bite you in the arse at some point, as I also discovered this week when I encountered a customer who installed a single-box solution and failed to document the ‘sa’ password for a SQL Express instance that was also running under LOCAL SYSTEM).
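If you want to turn that SP2 rule into something you can actually run, here is a minimal Python sketch. It assumes you have already pulled the version string from the SQL box yourself, for example with SELECT SERVERPROPERTY('ProductVersion'), and it assumes 9.00.3042 as the first SQL 2005 SP2 build. The function names are mine, and the check only makes sense for SQL 2005 servers.

```python
# Rough check of the "SQL 2005 must be at SP2 or later" rule from the post.
# Assumption: 9.00.3042 is the first SQL Server 2005 SP2 build number.
SQL2005_SP2_BUILD = (9, 0, 3042)


def parse_version(product_version: str) -> tuple:
    """Turn a string such as '9.00.3042.00' into the tuple (9, 0, 3042)."""
    parts = product_version.strip().split(".")
    return tuple(int(p) for p in parts[:3])


def meets_sp2(product_version: str) -> bool:
    """True if the build is at or above the assumed SQL 2005 SP2 build."""
    return parse_version(product_version) >= SQL2005_SP2_BUILD


if __name__ == "__main__":
    # Example version strings: an unpatched RTM-era build and an SP2 build.
    for version in ("9.00.1399.06", "9.00.3042.00"):
        status = "OK" if meets_sp2(version) else "PATCH ME"
        print(version, status)
```

Run it against every SQL server in both dev and production before you touch a single SharePoint patch, because a version string copied out of someone's memory does not count.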
So ends my sordid tale for this fine holiday. And a Happy Valentine’s Day to you. Now where’s that Cupid guy with my heart-shaped box containing a dozen Padron Serie 1926’s?