At my day job, we purchased three Sun (er, Oracle) 7210s (part of their ‘Unified Storage’ platform; the machines are X4540s running a specialized Unified Storage OS) on the recommendation of one of our vendors. The models we purchased contained 48 250GB disks and 32GB of memory, with no SSDs. Our intent was to use these for our VM image storage (our VM infrastructure is Xen, using LVM-over-iSCSI to the 7210s). We planned to install two units in production and one in our DR environment, and to use ZFS replication to have VM disaster recovery bliss.
It’s been a disaster.
Update [2010/02/26]: Sun has really stepped up to the plate and gotten the right people on this issue. I’m not sure if it’s what I posted here, the numerous ex-Sun contacts that I had who pinged people they knew, or the president of our VAR hammering at some of his contacts within Sun/Oracle. I will keep updating as we work towards RCA’s and a resolution.
Update [2010/08/27]: Thanks to some hard work from our reseller and people at Oracle, we were able to return these units a few months ago. I wish we had been able to work through the issues with Oracle, but we needed to get something that we could trust online ASAP. For the record, I do think that we probably did receive a “batch” of bad units; I still have not heard from anyone else who has had multiple independent units fail simultaneously as we did. I will also keep comments open here, and encourage anyone who has had great deployments to post a followup. I do believe that there are people out there who have had great success with these units, or else they wouldn’t be selling so well! :) And again, thank you to our reseller for their hard work, and to all the people from Oracle (and those who used to work for Sun but did not move to Oracle) who did their best to help us!
Here are the pre-sales issues we had with the units:
- ZFS does not use traditional RAID controllers; it uses your system memory as cache. This is great for reads (massive cache!), but for writes, it means that it’s extremely dangerous to turn write caching on. The memory used for write caching (which could be up to 32GB in our system, or more in one with more memory) is not battery-backed, so if the system fails in any way (power supply failure, crash, power failure, etc.), any data in that cache is gone. On a “normal” ZFS configuration with SSDs for the ZIL, there wouldn’t be a need for write caching in memory: data is cached on the write-optimized SSDs, and if your system crashes, it will still be there. However, the model we ended up with does not contain an SSD, and the Sun Oracle-supported SSD for this unit is around $6000 (the 18GB ‘Logzilla’). YIKES.
- Without SSD or write cache (we need the write cache disabled for the reasons described above), the maximum streaming synchronous write speed I have been able to get over iSCSI is ~35MB/s – that’s with triple-mirrored disks. With this configuration, if I have multiple virtual machines running and start a write-heavy process, the other VMs will slow to a crawl with high I/O wait. Not acceptable in production.
- For the reasons above, I firmly believe that Sun Oracle should NOT sell any ZFS-based storage appliances without SSDs, at least not without a big warning sticker that says “Make sure you can afford to either lose any data in cache (in some environments this would not be a huge issue), or work with the speeds described above.”
- Sun Oracle no longer offers Try-and-Buy programs. We’ve been told that once you open the unit, it’s yours. Not a strong sign that they are confident in their technology. Oh, and we were told this after we had purchased and installed the units.
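To make the write-cache trade-off described above a bit more concrete, here is a minimal sketch of how one might compare synchronous and buffered streaming writes from a client. This is not the test behind the ~35MB/s figure; the function name `write_throughput`, the block size, and the file path are all assumptions made for illustration. With `sync_each_block=True`, every block has to reach stable storage before the next one is written, which is roughly the situation a synchronous workload is in when there is no write cache to absorb it.

```python
import os
import tempfile
import time


def write_throughput(path, total_bytes, block_size, sync_each_block):
    """Write whole blocks to `path` until at least total_bytes are written.

    Returns the observed throughput in MB/s (1 MB = 1e6 bytes).

    sync_each_block=True  -> os.fsync() after every block, so each block must
                             be on stable storage before the next is written.
    sync_each_block=False -> no fsync at all; the OS may hold the data in
                             memory, so it is fast but could be lost on a crash.
    """
    block = os.urandom(block_size)  # random data, so compression can't skew it
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += block_size
            if sync_each_block:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to push it to the device
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical test location: point this at a filesystem that lives on
    # the storage you actually want to measure.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "throughput-test.bin")
        size = 8 * 1024 * 1024
        for sync in (True, False):
            rate = write_throughput(target, size, 128 * 1024, sync)
            print(f"sync_each_block={sync}: {rate:.1f} MB/s")
```

On most storage the buffered number will look far better than the synchronous one, which is exactly the trap: the fast number is the one that loses data when the box dies.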
Now, for the technical problems we’ve been having:
- The DR unit has had problems since a few days after we installed it. It’s been throwing fan errors, which Sun Oracle has told us is a ‘firmware bug’ with no fix (even though it only happens on one of three nodes); the workaround is to restart the service processor. It has also had a bad CPU, and worst of all, it started rebooting every night. Sun Oracle support had us jump through a zillion hoops for over a month to try to debug the nightly reboot issue, replaced various components (including the motherboard), and never came up with a workable solution. Then, a few weeks ago, after 4-5 days of not touching the system (I had been assigned to other tasks and the coworker who was taking it over was on vacation, plus a weekend), the reboots suddenly stopped. We were never able to get the reboots to either alert Sun Oracle via the phone-home mechanism or trigger the alerts we had set up via the web GUI, even though it was rebooting unexpectedly. In fact, there was no easy way to figure out that it had rebooted; I ended up having to go to the shell (naughty!) and run ‘last’ to see the reboot times. If we had been relying on the GUI and didn’t have our own monitoring of the system set up (which is what alerted us to the reboots), we wouldn’t have known until we put a production load on it, which, since this is a DR unit, would only have been when we failed over to our DR site. Sun Oracle support has not been able to offer an RCA, except to say that whenever it rebooted the mpt driver threw errors. Since it ‘fixed itself’, Sun Oracle considered the case closed. The machine being flaky from the start, combined with the fact that Sun Oracle couldn’t debug the issue, has made us push for a replacement unit, but it seems nobody in Sun Oracle support has the authority to replace a chunk of hardware that can’t be trusted.
- A few weeks into debugging the issues with the DR unit, both of the production units (which were taking a test workload) developed problems at the same time. One of them stopped answering on the VLAN that we were running iSCSI on (which was trunked on top of the management VLAN; the management VLAN continued to work fine), and the other turned itself off. The best part is that these failures happened within 10 minutes of each other. Again, Sun Oracle support is unable to give us a root cause for either issue. The systems are completely isolated from each other physically (separate circuits, separate racks in different areas of the data center, etc.); the only shared component is the network (both are dual-homed to two switches). Nothing else in the data center exhibited any issues during this period: the switches didn’t show any oddities, other systems and PDUs on the same UPS branch did not show power issues, etc. Again, Sun Oracle’s response was “if it happens again let us know,” even though we would have taken a full production outage of _everything_ because of this. And again, neither of the units triggered the phone-home or self-alert functionality when these issues occurred.
- When troubleshooting with Sun Oracle, their first step is to reboot the system into a debug mode. These are production storage appliances, and having to reboot to get an RCA for a previous issue is *not* acceptable! Again, this makes it feel more like “this is a server that should be in a redundant zero-shared config” than a storage appliance.
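Since the appliances never alerted on their own reboots, outside monitoring was the only thing that caught them. Here is a minimal sketch of that kind of check; it is not the monitoring we actually ran. It assumes your monitoring system can poll the box’s uptime somehow and record it along with the poll time, and the name `detect_reboots` and the 30-second tolerance are made up for illustration. The idea is simply that the estimated boot time (poll time minus uptime) should stay constant, and a jump forward means the machine restarted.

```python
def detect_reboots(samples, tolerance=30):
    """Return the estimated boot time of every reboot seen in `samples`.

    samples:   (poll_time, uptime_seconds) pairs in chronological order,
               both in seconds. How they are collected is up to your
               monitoring system; this function only does the comparison.
    tolerance: seconds of slack for clock skew and polling jitter
               (an assumed default, tune it to your polling setup).
    """
    reboots = []
    last_boot = None
    for poll_time, uptime in samples:
        boot = poll_time - uptime  # when the box must have come up
        # A boot time that jumps forward means a restart since the last poll.
        if last_boot is not None and boot > last_boot + tolerance:
            reboots.append(boot)
        last_boot = boot
    return reboots


if __name__ == "__main__":
    # Made-up samples: polled every 60s, with a reboot between polls 2 and 3.
    polls = [(1000, 500), (1060, 560), (1120, 20), (1180, 80)]
    print(detect_reboots(polls))  # prints [1100]
```

A check like this is cheap to run from any existing monitoring host, and it would have flagged the nightly reboots on the first night instead of leaving them to be found by hand in ‘last’.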
Really, I can understand technical issues; they happen with every product. However, it appears that the way the specialized version of Solaris is set up on these, they are incapable of getting useful logs. On top of that, the support has been worse than useless. They have chewed through a ton of our time (we’ve probably spent at least 4 man-weeks working with them on this), and have not come back with anything useful. After escalating the tickets, nobody has the authority to either replace a box or allow us to return it. The escalation process is slow and doesn’t really seem to do anything; after requesting escalations, we don’t get higher-level contacts (the only way I’ve managed to do that so far is by blind-calling the service center when our regular tech wasn’t available, and requesting a callback), and the people it’s escalated to still have no power to resolve our issues. Every time we request escalation it also seems to slow the process by at least a few days. It’s crazy. The support people all seem to want to help, but the system is stacked against them.
So, right now I’m stuck with three 7210s that have all had issues, that I don’t trust in production, and that I can’t return. Thanks a lot, Sun Oracle! What happened to the old-fashioned Sun, where the gear was way too expensive (and yes, these 7210s were expensive), but the support made up for it?
My advice to you if you are looking at a Unified Storage appliance is to run far away. The ones with SSDs included would in all likelihood perform as expected, but the quality of the support and lack of customer service will scare me away from ever making a Sun Oracle purchase again.
Update [2010/02/24] – After chatting with many contacts that I used to have within Sun, I have changed most of the relevant ‘Sun’ names to ‘Oracle’. It seems that when Oracle purchased Sun, many, many good people were let go. Bummer.