Sun’s Unified Storage 7210 – designed to disappoint?

by nc on February 23, 2010 · 55 comments

Sun 7210

At my day job, we purchased 3 Sun (er, Oracle) 7210’s (part of their ‘Unified Storage’ platform; the machines are X4540’s with their specialized Unified Storage OS) on the recommendation of one of our vendors. The models we purchased contained 48 250GB disks and 32GB of memory – no SSD’s. Our intent was to use these for our VM image storage (VM infrastructure is Xen, using LVM-over-iSCSI to the 7210’s.) We planned to install two units in production and one in our DR environment, and to use ZFS replication to have VM disaster recovery bliss.

It’s been a disaster.

Update [2010/02/26]: Sun has really stepped up to the plate and gotten the right people on this issue. I’m not sure if it’s what I posted here, the numerous ex-Sun contacts that I had who pinged people they knew, or the president of our VAR hammering at some of his contacts within Sun/Oracle. I will keep updating as we work towards RCA’s and a resolution.

Update [2010/08/27]: Thanks to some hard work from our reseller and people at Oracle, we were able to return these units a few months ago. I wish we had been able to work through the issues with Oracle, but needed to get something that we could trust online ASAP. For the record, I do think that we probably did receive a “batch” of bad units; I still have not heard from anyone else who has had multiple independent units fail simultaneously as we did. I will also keep comments open here, and encourage anyone who has had great deployments to post a followup – I do believe that there are people that have had great success with these units out there, or else they wouldn’t be selling so well! :) And again, thank you to the hard work from our reseller, and for all the people from Oracle (and those who used to work for Sun but did not move to Oracle) who did their best to help us!

Here are the pre-sales issues we had with the unit..

  1. ZFS does not use traditional RAID controllers, and uses your system memory as cache. This is great for reads (massive cache!), but for writes, it means that it’s extremely dangerous to turn write caching on. The memory used for write caching (which could be up to 32GB in our system, or more in one with more memory) is not battery-backed, so if the system fails in any way (power supply failure, crash, power failure, etc.), any data in that cache is gone. On a “normal” ZFS configuration with SSD’s for the ZIL, there wouldn’t be a need for write caching in memory – data is cached on the write-optimized SSD’s, and if your system crashes, it will still be there. However, the model we ended up with does not contain an SSD, and the Sun/Oracle-supported SSD for this unit is around $6000 (18GB ‘Logzilla’.) YIKES.
  2. Without SSD or write cache (we need the write cache disabled for the reasons described above), the maximum streaming synchronous write speed I have been able to get over iSCSI is ~35MB/s – that’s with triple-mirrored disks. With this configuration, if I have multiple virtual machines running and start a write-heavy process, the other VMs will slow to a crawl with high I/O wait. Not acceptable in production.
  3. For the reasons above, I firmly believe that Sun/Oracle should NOT sell any ZFS-based storage appliances without SSDs – at least not without a big warning sticker that says “Make sure you can afford to either lose any data in cache (in some environments this would not be a huge issue), or work with the speeds described above.”
  4. Sun/Oracle no longer offers Try-and-Buy programs. We’ve been told that once you open the unit, it’s yours. Not a strong sign that they are confident in their technology. Oh, and we were told this after we had purchased and installed the units.
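For anyone who wants to reproduce the kind of synchronous write figure quoted in item 2, here is a minimal sketch in Python. It is not the tool used for the ~35MB/s number above; the function name, block size, and file path are all hypothetical, and it simply forces every block to stable storage with fsync, which is roughly what a synchronous workload demands of the storage behind it.

```python
import os
import time


def sync_write_throughput(path, total_bytes=8 * 1024 * 1024, block_size=128 * 1024):
    """Write total_bytes to path, fsync-ing after every block; return MB/s.

    Hypothetical helper: the sizes are illustrative defaults, not tuned values.
    """
    # Random data so that compression on the storage side cannot flatter the result.
    block = os.urandom(block_size)
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.perf_counter()
    try:
        while written < total_bytes:
            written += os.write(fd, block)
            os.fsync(fd)  # do not continue until this block is on stable storage
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / (1024 * 1024) / elapsed
```

Point `path` at a filesystem that lives on the LUN under test (for example a scratch file inside one of the VMs) and compare the result with and without other write-heavy guests running. A single short run is noisy, so treat the number as a rough indication only.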

Now, for the technical problems we’ve been having..

  1. The DR unit has had problems since a few days after we installed it. It’s been throwing fan errors, which Sun/Oracle has told us is a ‘firmware bug’ with no fix (even though it only happens on one of three nodes) – the workaround is to restart the service processor. It also has had a bad CPU, and worst of all, it started rebooting every night. Sun/Oracle support had us jump through a zillion hoops for over a month to try to debug the nightly reboot issue, replaced various components (including the motherboard), and never came up with a workable solution. Then, a few weeks ago, after 4-5 days of not touching the system (I had been assigned to other tasks and the coworker who was taking it over was on vacation, plus a weekend), the reboots suddenly stopped. We were never able to get the reboots to either alert Sun/Oracle via the phone home mechanism, or trigger the alerts we had set up via the web GUI – even though it was rebooting unexpectedly. There was no easy way to figure out that it had rebooted, in fact – I ended up having to go to the shell (naughty!) and run ‘last’ to see the reboot times. If we were relying on the GUI, and didn’t have monitoring set up of the system (which is what alerted us to the reboots), we wouldn’t have known until we put a production load on it – which, since this is a DR unit, would only have been when we failed to our DR site. Sun/Oracle support has not been able to offer an RCA, except to say that whenever it rebooted the mpt driver threw errors. Since it ‘fixed itself’, Sun/Oracle considered the case closed. The machine being flaky from the start, combined with the fact that Sun/Oracle couldn’t debug the issue, has made us push for a replacement unit, but it seems nobody in Sun/Oracle support has the authority to replace a chunk of hardware that can’t be trusted.
  2. A few weeks into debugging the issues with the DR unit, both of the production units (which were taking a test workload) developed problems at the same time. One of them stopped answering on the VLAN that we were running iSCSI on (which was trunked on top of the management VLAN; that VLAN continued to work fine), and the other turned itself off. The best part is that these failures happened within 10 minutes of each other. Again, Sun/Oracle support is unable to give us a root cause for either issue. The systems are completely isolated from each other physically (separate circuits, separate racks in different areas of the data center, etc) – the only shared component is the network (both are dual-homed to two switches.) Nothing else in the data center exhibited any issues during this period – the switches didn’t show any oddities, other systems and PDU’s on the same UPS branch did not show power issues, etc. Again, Sun/Oracle’s response was “if it happens again let us know.” Even though we would have taken a full production outage of _everything_ because of this. Again, neither of the units triggered the phone-home or self-alert functionality when these issues occurred.
  3. When troubleshooting with Sun/Oracle, their first step is to reboot the system into a debug mode. These are production storage appliances – having to reboot to get an RCA for a previous issue is *not* acceptable! Again, this makes it feel more like “this is a server that should be in a redundant zero-shared config” than a storage appliance.
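Since external monitoring was the only thing that caught the nightly reboots in item 1, here is a rough sketch of that kind of check. It assumes you already collect (wall clock, uptime) samples from the appliance by some means, such as SNMP or an SSH poll; the collection method and the function name are hypothetical and not anything the 7210 provides. A reboot shows up as the uptime counter going backwards between two samples.

```python
from typing import Iterable, List, Tuple


def find_reboots(samples: Iterable[Tuple[float, float]]) -> List[float]:
    """Return estimated boot times for every reboot seen in the samples.

    samples: (wall_clock_seconds, uptime_seconds) pairs in chronological order.
    """
    boot_times = []
    previous_uptime = None
    for wall_clock, uptime in samples:
        # Uptime only ever grows on a running machine, so a drop means a reboot.
        if previous_uptime is not None and uptime < previous_uptime:
            boot_times.append(wall_clock - uptime)
        previous_uptime = uptime
    return boot_times


# The third sample shows uptime falling from 800s to 100s, so the machine
# came back up about 100 seconds before that sample was taken.
print(find_reboots([(1000, 500), (1300, 800), (1600, 100), (1900, 400)]))
# prints [1500]
```

The polling interval sets the resolution: a machine that reboots and stays up longer than the interval before the next poll is still caught, because its uptime will be lower than the previous sample unless the interval is very long.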

Really, I can understand technical issues; they happen with every product. However, it appears that the way the specialized version of Solaris is set up on these, they are incapable of getting useful logs. On top of that, the support has been worse than useless. They have chewed through a ton of our time (we’ve probably spent at least 4 man-weeks working with them on this), and have not come back with anything useful. After escalating the tickets, nobody has the authority to either replace a box or allow us to return it. The escalation process is slow and doesn’t actually really seem to do anything; after requesting escalations, we don’t get higher-level contacts (the only way I’ve managed to do that so far is by blind-calling the service center when our regular tech wasn’t available, and requesting a callback), and the people it’s escalated to still have no power to resolve our issues. Every time we request escalation it also seems to slow the process by at least a few days – it’s crazy. The support people all seem to want to help, but the system is stacked against them.

So, right now I’m stuck with 3 7210’s that have all had issues, that I don’t trust in production, and that I can’t return. Thanks a lot Sun/Oracle! What happened to the old-fashioned Sun, where the gear was way too expensive (and yes, these 7210’s were expensive), but the support made up for it?

My advice to you if you are looking at a Unified Storage appliance is to run far away – the ones with SSDs included would in all likelihood perform as expected, but the quality of the support and lack of customer service will scare me away from ever making a Sun/Oracle purchase again.

Update [2010/02/24] – After chatting with many contacts that I used to have within Sun, I have changed most of the relevant ‘Sun’ names to ‘Oracle’. It seems that when Oracle purchased Sun, many, many good people were let go. Bummer.

Jim February 23, 2010 at 5:24 pm

We’ve had some problems with our new 7210, primarily revolving around data corruption under heavy load. They’ve been less than stellar about responding to our queries on the issue, and seem clueless (or disinterested, hard to tell) as to how to effectively troubleshoot the issue. I fully expected some dtrace wizardry, but they’ve had us doing packet traces on the network instead, with weeks going by without any progress or indication they are working on it.

I’m really disappointed, I had pretty high hopes for the unified storage line, but unfortunately I definitely wouldn’t recommend it to anyone right now.

nc February 23, 2010 at 6:32 pm

Yikes! Really not glad to hear about data corruption. Does ZFS surface the corruption at least, or is it silent?

We’ve reached a few helpful support people, but it seems that the unit just doesn’t give them the info that they need to be able to do anything. Also disappointed that dtrace does not seem to be nearly as useful as was implied..

Bryan Cantrill February 24, 2010 at 12:53 am

Jim,

This is an extraordinarily serious issue. I am aware of no open case of data corruption, and I have many follow-up questions for you. Could you e-mail me at my first name dot my last name at sun dot com at your earliest convenience? (Or if you’d rather, reply with your case number and I’ll get the details that way.)

Jim February 24, 2010 at 9:14 am

Actually, I should have been much more specific, although the intent of my post was more to commiserate about the support than to hash out technical problems. Obviously the “c” word is a pretty scary one, so I apologize for not being more specific.

I have no reason to think that our problems are related to ZFS. I realize now that my comments probably implied that, and I apologize.

Our current issue seems likely to revolve around the CIFS server, although we aren’t 100% sure it’s not an interaction between the client implementation (OS X 10.6) and the Sun CIFS server implementation. We aren’t seeing it from any other OS, but we also aren’t seeing this when using a 10.6 client to connect to a Windows-based or Linux-based server.

We found this because we deal with a lot of moving around of filesets that range from ~750MB to 5GB, so we do a lot of automated checksums, and started to see repeatable failures.
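For readers who want a similar safety net, the automated checksum step can be sketched in a few lines of Python. This is not our actual tooling; the function names are made up for illustration and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte filesets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_path, copied_path):
    """Return True when the copy is byte-for-byte identical to the source."""
    return file_sha256(source_path) == file_sha256(copied_path)
```

One caveat: reading a file back right after writing it may be answered from the client's own cache rather than from the server, so a check like this is more convincing when it runs from a different client or after remounting the share.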

Bryan, I’ll email you about this. The very fact that you offered is much appreciated.

nc February 24, 2010 at 9:18 am

Thanks for the clarification Jim! Yeah, CIFS scares the snot out of me, especially after hearing from our local Sun guy (who’s been trying to help us figure out how to make these things perform without SSD’s) that there is no way to disable the write caching to memory when using CIFS.

If your clients are OS X, any reason you can’t use NFS? We have a team using an OpenSolaris box as a bulk storage server for their HD video editing needs on OS X via NFS, and it’s worked out quite well.

Jim February 24, 2010 at 9:39 am

We’ve talked about using NFS and have had two things that have made us reluctant. First, up until last year the only storage server we had was a Windows box; while it could serve NFS, it was serving AFP, which worked just fine for us, other than the fact that the server was slow and management was a pain. With the new Sun we started using Samba/CIFS because we had some experience with doing our Mac work via Samba mounts, and we initially were really impressed with the Sun CIFS implementation; its handling of resource forks, for example, was done really well, especially as compared with the Linux CIFS servers we had messed with.

And while resource forks have been factored out of our main workflow, knowing that they can be handled well if we need them is very comforting.

So while it is a poor excuse, to some degree, it’s momentum. We’re in a bit of a holding pattern right now trying to decide what we want to do, depending on what Sun says on this issue.

We’d also feel a lot more comfortable if we could just know precisely what is causing this. While I suspect it’s this interaction, and heck, there’s a chance it’s not Sun’s fault at all (perhaps the problem is in the Snow Leopard client), this is the kind of thing you really want to know for sure if you are going to trust it with your data, ya know?

nc February 24, 2010 at 11:36 am

Makes a lot of sense. I’ve never been a huge SMB fan, or a huge NFS fan.. network filesystems suck. ;P Makes sense on your issue – hopefully Sun will be able to help you out.. I’d love to hear how it goes.

Ann E. Mouse February 23, 2010 at 7:39 pm

This is total bullshit. Oracle should put their support where their mouth is and get some of these people who make those snappy “watch while I use dtrace to solve in 5 minutes a previously unsolvable problem!!” videos to solve the damn problems or take the damn things back. Large hardware and software makers get away with selling products at phenomenal prices that don’t really work and we have everyone else supporting an empire full of unclothed kings. More companies should be like W.L. Gore and demand that software actually works as promised. You should be able to return this non-working crap and get your money back since they (and most tech companies) use support to fatigue users into giving up on trying to get help rather than solving problems.

I saw some Oracle apologist saying “Don’t get mad” on the zfs list, screw that. It is way past time that these hardware/software makers be held to some standards on their products and support, and being nice and begging has gotten us nowhere but further behind.

Please let us know how this pans out and don’t let the reseller (if there was one) get off scot free either – they make money off you and then leave you hanging? Post their info so we can avoid them, too.

nc February 23, 2010 at 7:48 pm

I’m with you. I wish that more research had been done up front on the purchase (I inherited this), and I also wish that we had demanded a trial period instead of just buying three. I’m not going to stop screaming until Sun can at least tell us why the systems failed, and if not, take ’em back.

I will be sure to post a follow-up on what happens.. thanks for the comment!

Bryan Cantrill February 24, 2010 at 12:47 am

It needs to be said that this is an odd problem — you were seeing mysterious resets roughly every 24 hours (with amazing consistency, by the way). And having reviewed the logs from your system, there is little — remarkably little — to go on, and not for our lack of logging. Indeed, this is in many ways the worst kind of failure mode: a machine resets mysteriously without a crash dump or any other indication of failure. Given the extraordinary nature of the failure mode, you should have been instructed to open a console to the appliance and log it — the only hope would be that some piece of software is emitting some valuable morsel of information to the console before the machine resets. Failing this (that is, having done this and having the machine reset without so much as a peep to the console), you should have been asked to boot under the kernel debugger, which it sounds like you were in fact asked to do. But given your view that this is “unacceptable”, it sounds like there was a failure here in not educating you as to the nature of the failure mode and the reason why booting into the kernel debugger was an entirely reasonable course of action. (Your assertion that “DTrace wasn’t nearly as useful as implied” also shows how poorly you were educated as to the nature of the problem and/or the nature of DTrace — DTrace is of little assistance when the machine is spontaneously resetting.) The fact that your problem has, after many days of regularity, disappeared poses an additional challenge: what can be done about a problem for which we have no information that is never seen again? I’m not trying to excuse your support experience or minimize the seriousness of the failure — but I think it is important to educate you as to what is knowable and what is not knowable about failure modes of this nature.

nc February 24, 2010 at 8:52 am

Hi Bryan!

First of all, thanks for bringing some Sun perspective to this page. I was hoping to hear from some people “in the know” with these units.

Odd problem/rebooting consistency: tell me about it! :) That’s why we worked to eliminate any possible environmental issues. If we could come to an acceptable root cause (which there appears to be some progress on finally; I will post an update after a conference call this afternoon), it’ll be quite helpful. It is indeed the worst kind of failure mode; the disappointing part is how long it took to get to good troubleshooting steps. For example, the suggestion to leave a console connected to the machine when it resets was eventually made, but only a few weeks into the troubleshooting process – if that is useful, I would have hoped it would have been suggested right away. I agree that I was poorly educated on the issue; it seemed that the technician working the issue was also grasping at straws. Regarding the problem disappearing after many days of regularity and never being seen again – yes, that makes it hard to troubleshoot, but I don’t accept that as an excuse in this case, since it was happening for a significant period of time after support was engaged.

My biggest disappointment is the lack of support’s ability to authorize an exchange or return of the machine(s) in question. I have never had the experience with any other vendor where they were unwilling to take hardware back that exhibits extremely odd behavior from the initial installation with no immediate root cause.

Curious – do you have any comments on the two production units developing issues at the same time? For me, that is almost worse than the DR unit rebooting, as it would cause a complete production outage for us (we bought two units for production specifically to ensure that a failure could only affect 50% of our VMs.) The two machines in production had no relationship to each other besides being on the same physical VLAN, and both having replication relationships set up with the DR filer (but not to each other.)

Again, thanks for the comment!

-Nate

Charlie May 22, 2010 at 5:20 pm

This is your response after outing yourself as a Sun employee? Hah, wow. Hardware replacement is in order, who the hell cares if the problem isn’t consistent or if the customer doesn’t want to debug YOUR kernel.

Stick to Linux, get better support — for free.

nc May 23, 2010 at 8:51 am

Charlie –

Just a note to the comment above.. I was very appreciative of getting a response from someone at Sun, and was happy to have Bryan trying to help out – especially when publicly ack’ing who he is. Not sure if you are aware of it, but he is actually a fairly senior geek when it comes to the Unified Storage arena at Sun, so it was nice to have direct communication with him.

I’ll leave it at that..

-Nate

Bryan Cantrill February 24, 2010 at 5:15 pm

Nate,

It’s not clear (at all) that you need new hardware here — we need to figure out why the machine is resetting, not throw hardware at it. As for why the two units failed simultaneously, the only thing I can tell you is that I don’t know. You said that one of the nodes “turned itself off”; assuming that I’m looking at the right data (MD5 hash of the hostname is 3df2a064148a0dc90237d936ec52c8b7) on the right day and time (1/19, 15:32 GMT), it appears that this was the result of an explicit power off. (Or rather, it is indistinguishable from an explicit power off.) So at this point, we would need to see the entire audit log for the SP. Unfortunately, there is not a way to get that via IPMI, so you would need to log into the SP, “cd /SP/logs/event/list” and then type “show”. This should at least tell us what was going on from the SP’s perspective, and why/how it decided to power off the appliance.

nc February 24, 2010 at 5:57 pm

The one that we’re wanting new hardware on is the DR unit that was rebooting itself every night – that’s had multiple issues flagged by the system as definitive hardware issues, and no root cause in sight. I just had a conference call with the escalation team, and it appears that the DR unit was having LBA issues on a drive that caused itself to reboot, in ways that cannot yet be explained, and are unlike any other issue that’s been seen before. To me, that says two things – 1) there is something unique about this system that is unlike any other one you have in the field and 2) if it’s going to production it needs to be a new one. ;) (I should also note that these logs were in each system dump that was sent to Sun, but apparently there’s no system in place to automatically identify anomalies in the logs to the support engineer, so they were missed by each tech that’s looked at them up until a few days ago.)

Regarding the system you mentioned – the escalation team also said that it appeared to be a SP-level problem, but not that it was an explicit poweroff command – thanks for that info! They mentioned that they’d be looking at SP logs, but didn’t ask me to provide them (I thought they already had them as part of the system dump?), but I’d be happy to grab those logs for you and anyone else that is interested in looking at it. Want me to fire them to the email address you’ve posted from?

Bryan Cantrill February 24, 2010 at 7:11 pm

The audit logs from the SP (which is what are needed here) are not available via IPMI; they have to be retrieved manually (i.e., by you). If you want to send them to me, you certainly may, though I am not optimistic that they will contain a smoking gun. (Unless, of course, someone did a “stop /SYS” on the SP — which would be contained in the audit log.)

As for the other machine: you may indeed have a bad HBA — which would necessitate replacing the HBA, not the machine. But we would need a console log to know what happened, and given that we don’t seem to have that, it’s going to be hard to know definitively. I’m not sure what you mean about “no system in place to automatically identify anomalies”; I went through your logs and there’s very little anomalous about them — that’s part of what makes this mysterious. The log that we need — that of the SP console at the time of machine reset — is not something that the appliance can gather for itself…

nc February 25, 2010 at 5:36 pm

I’ve sent them to the support team, who is reviewing them.. here is the relevant section though:

7901 Tue Jan 19 20:29:56 2010 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
7900 Tue Jan 19 15:32:52 2010 System Log critical
upgrade to version unknown failed
7899 Tue Jan 19 15:32:46 2010 IPMI Log critical
ID = 15c : pre-init timestamp : System ACPI Power State : sys.acpi : S5/ G2: soft-off
7898 Tue Jan 19 15:32:41 2010 Audit Log minor
KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
7897 Tue Jan 19 15:32:41 2010 Audit Log minor
KCS Command : Chassis Control : action = power down : success
7896 Tue Jan 19 15:04:52 2010 Audit Log minor
KCS Command : Set SEL Time : time value = 0x4B55CA14 : success

The item at 15:04 was the last “normal” message (seems those occur about once per hour?); the next few messages almost make it sound like the system believes it was powered down for an update which failed? The shell session is when I logged in to boot it back up.
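As a side note on reading these entries: the hex value in the “Set SEL Time” line can be decoded by hand. Assuming it is a count of seconds since the Unix epoch (an assumption on my part, but one that matches the entry's own timestamp), a few lines of Python do the conversion. The helper name is made up for this example.

```python
from datetime import datetime, timezone


def decode_sel_time(hex_value):
    """Interpret a hex 'time value' field as seconds since 1970-01-01 UTC."""
    seconds = int(hex_value, 16)  # int() accepts the 0x prefix when base is 16
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(decode_sel_time("0x4B55CA14"))
# prints 2010-01-19 15:04:52+00:00
```

That lines up exactly with the 15:04:52 stamp on entry 7896, which suggests the SP log timestamps are in UTC and that this entry is just a routine clock sync from the host.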

Other machine — I asked if it’s possible that it’s a bad HBA or a midplane issue (as support had previously suspected but never replaced), but the senior support reps on the phones indicated that it was definitively a drive, and not in another part of the system. No console on the system – not used to needing console logging on modern computers. ;)

One thing worth noting – I had no idea that the ILOM logs wouldn’t be bundled with the support pack (I’d assume you have a way to retrieve them from within the OS), so I’ve never really looked at them beyond a glancing view, and only on the one that was rebooting itself daily. One interesting thing now that I am looking, is that the system that was rebooting has -far- more events in the SP log than the other two. It’s logging things similar to the following on a seemingly regular basis – appears to be 15 minutes after the hour, every hour on a quick lookover:

27537 Thu Feb 25 17:15:38 2010 Audit Log minor
KCS Command : OEM Set LED Mode : device address = 0x18 : LED = 0x3 : controller’s address = 0x20 : HW info = 0x3 : mode = 0x0 : force = false : role = 0x41 : success
27536 Thu Feb 25 17:15:38 2010 Audit Log minor
KCS Command : OEM Set LED Mode : device address = 0x18 : LED = 0x5 : controller’s address = 0x20 : HW info = 0x5 : mode = 0x0 : force = false : role = 0x41 : success
27535 Thu Feb 25 17:15:38 2010 Audit Log minor
KCS Command : OEM Set LED Mode : device address = 0x2C : LED = 0x2 : controller’s address = 0x20 : HW info = 0x2 : mode = 0x0 : force = false : role = 0x41 : success

I see events like this on the other two systems, but extremely infrequently (ie – days between them.)

Regarding my comment on log analysis – that question came up because it was taking the engineers so long to get through the support packs.. I asked if the support reps had any tools to strip out extraneous information from the logs for them, and just leave the “unusual” stuff – the answer from the support manager was no, which is what surprised me. There’s a /ton/ of data there, and having to strip out the important parts by hand is time consuming! As I mentioned, the drive errors were in each log that was sent, but were missed by the engineers reviewing them – a proper log analysis program would certainly help with that.
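To make the idea concrete, here is a rough sketch of the sort of filter I had in mind, written against the SP event list format pasted above. The format is inferred purely from those excerpts (a header line with ID, timestamp, log type and severity, followed by a message line), so treat the regular expression and the function names as assumptions rather than anything Sun ships.

```python
import re
from collections import Counter

# Header lines look like: "7897 Tue Jan 19 15:32:41 2010 Audit Log minor"
HEADER = re.compile(
    r"^(?P<id>\d+) (?P<when>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) "
    r"(?P<kind>\w+) Log (?P<severity>\w+)$"
)


def parse_events(text):
    """Turn the pasted event list into dicts with id, when, kind, severity, message."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            events.append(dict(match.groupdict(), message=""))
        elif events:
            # Any non-header line belongs to the most recent event.
            last = events[-1]
            last["message"] = (last["message"] + " " + line).strip()
    return events


def summarize(events):
    """Return (events above 'minor', counts of minor events by their first two fields)."""
    unusual = [e for e in events if e["severity"] != "minor"]
    counts = Counter(
        " : ".join(e["message"].split(" : ")[:2])
        for e in events
        if e["severity"] == "minor"
    )
    return unusual, counts
```

Severity alone is not enough: in the excerpt from the unit that powered off, the “Chassis Control : action = power down” command itself is logged as minor. That is why the sketch also counts the minor messages, so that a command seen once among thousands of hourly “Set SEL Time” entries stands out.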

Bryan Cantrill February 25, 2010 at 11:11 pm

Nate,

I am sympathetic to your plight, but you are suffering from a bad case of hindsight bias: there is no reason to believe that a disk failure would induce system failure, and trust me that there is nothing in the logs that they were missing. The missing piece of data — and I feel I have said this several times over now — was the console log that you didn’t collect (not your fault, of course — you weren’t told to collect it). Yes, “modern computers” need to have their console logged; how else is one supposed to debug failures in which no operating system dump is taken? (And yes, such failures exist — viz. yours.) The best guess at the moment (and it’s always going to remain a guess for the moment) is that the drive failure was inducing an HBA logic failure (that is, a failure in the HBA firmware itself), and that the HBA failure was in turn inducing both an operating system panic (which is — or was — our defined failure mode in such a case) and then (we hypothesize) a dump abort. (On the 7210 — unlike the 7310 and 7410 — we rely on the same HBA for both system disks and data disks.) We’re guessing here, but your drive is making its way back to our team to see if we can reproduce this in house. One final note: because these HBA logic failures have been a real difficulty for us, we have added logic in our Q1 release (due out in the next few weeks) that will reset the IOC on failure. So when we get your drive, we may well see the system recover now in a way it wouldn’t previously; we’ll let you know.

Jes February 25, 2010 at 4:58 pm

You raise two classes of issue: support and technical and I can sympathise with you on both counts as I’ve had very similar experiences.

Sun support has been absolutely abysmal because 1) the first two levels of support know nothing about the product (not necessarily their fault) but consequently they give incorrect diagnosis, bad advice, and send out engineers to fix things that aren’t broken because they don’t understand what has gone wrong; 2) there’s only 3 (three) people in back line support for the whole of Europe, Middle-East and Africa (yes, just three!). They are so busy that there’s no chance they will get time to work on your problem. And they are being unfairly given work that would be better done by the development team. Something that the developers could fix in an hour will take two weeks to investigate by someone not familiar with the source code, and as you can see they simply don’t have that time to spend on each problem.

The product itself has been almost as abysmal, and I am sad to say this because I’ve been a fan of Sun technology. Not that the underlying components are bad: Solaris is great and ZFS is getting there. But the appliance software has so many bugs, even 14 months after release, that it’s still only beta quality. Yet there’s been no provision within Sun for customers to feed back a holistic view of their system and the support structure is totally inadequate for a new class of product such as the 7000 series. We’ve been so frustrated it’s unbelievable, if only we could talk to someone technical that was willing to listen to us and had the ability to influence the development team. Our list of problems is so large that just the titles fill an A4 page; most of which are serious or very serious. People like Bryan prove that the engineers do care, but …

For example we’ve had assurances that Sun will debug the ZFS corruption that occurs when a disk is replaced during a resilver but, 10 months later, the problem still exists and is easily repeatable. Would you trust any data to a system like that, or to a company which on the surface doesn’t appear to take it seriously?

Not only that but when you suffer corruption the system almost completely fails to inform you. If you are looking hard you will find a one-line message in the BUI but there’s no alert, no SNMP, no email, nothing, not even a permanent record in a log file of what’s been corrupted. And it gets worse, the system then forgets that any corruption has happened at all, leaving you with no message in the BUI, just a bunch of files which appear to be fine but when you access them you get corrupt data. In Sun terminology that’s both data “corruption” and data “loss”.

Another example of a problem: the Q3 update provided the missing functionality to backup and restore the appliance config, yet when you perform a restore it breaks the box so badly you have to boot into a previous firmware revision.

Yet another example: services can fail (including the appliance kit daemon itself) and the head won’t fail-over to the other head (if you have a clustered pair of heads).

I could go on for hours about other serious problems but don’t want to bore you ;-)

I was going to be charitable and say that your title is misleading (“designed to disappoint”), because on the whole the collection of technology inside the appliance is really good. Yet there are definitely aspects where the design simply hasn’t been thought through properly (like the clustering and the reporting of corruption) so you’re right.

Reply

nc February 25, 2010 at 5:25 pm

First of all, thanks for taking the time to comment!

Oh, you won’t bore me at all. ;) I’m sorry to hear you are having issues too – I was really hoping we were the exception to the rule. I’m also hoping that some people who have had great success will post.

I’m also finding that it seems the support techs are over-busy — seems like “squeaky wheel gets the grease” in my case as I’m now receiving quite prompt attention. I’m bummed to hear that EMEA doesn’t send you directly to US-based techs when needed.. it makes sense to have local high-level support, but send things back to the people who are most familiar in extreme cases.

I’m also feeling like the product is still “beta”, but it’s not being sold that way. ;( As far as not getting influence on the dev team, have you posted your issues online somewhere? It seemed to help me. ;)

ZFS corruption – that is extremely frightening. I’d love to hear details on what corruption you’re seeing and exactly how you trigger it. As far as the logging aspect goes, that is *really* frightening – and unfortunately on par with what I’m finding.

Failover – the issue of not failing over because of service-level issues is something I’ve found to be very common on active/passive configurations like this.. it’s a hard thing to get right, but I’d hope that would be something heavily tested and properly engineered before public release.

My title is meant to be tongue-in-cheek.. obviously the intentions were not to build a product that is disappointing, but that sure does seem to be the way that it’s ended up.

Reply

Bryan Cantrill February 25, 2010 at 10:51 pm

Jes,

First, sorry to hear that your experience has been rocky. And yes, all of the teams on the project — support, development, test — have been enormously overtaxed. That, I’m afraid, is the curse of hypergrowth, which is exactly what we have experienced over the past year. And we have experienced that hypergrowth because we were largely right about one critical point in the enterprise storage industry: customers are sick of paying high rents to a couple of companies. What we attempted to do to address this was very ambitious: build enterprise-grade storage out of commodity parts. Fourteen months later, I think we’re finally approaching that goal — but as you have seen, it’s been a very rocky path (much rockier than we naively thought it would be). All of which isn’t meant to excuse any problems you’ve had with the product, but rather to give you an honest assessment of how we got here.

Now, that said: I am quite troubled by the corruption that you believe that you are seeing. Can you please send me a case number (my first name dot my last name at sun dot com) so I can unravel what happened here? (Jim can attest that my offer is in earnest.)

Finally, there actually is a mechanism to provide holistic feedback about the product, but you could be forgiven for not noticing it: in the BUI, there is a feedback link at the bottom of every page. Mail sent via that link goes straight to the development team, and no one else — and we have responded to everyone who has gotten in touch with us that way. (And we’ve gotten some very useful feedback that way.) We provided that exactly so people like you could unload: we’re technical and we’ll listen to you — so please don’t hesitate to send us that A4 worth of feedback, with our thanks in advance for doing us the service.

Reply

Henkis February 26, 2010 at 8:37 am

Which Fishworks releases have you had these problems with? Do you have all the issues on the latest (2009.Q3.4.1) release?

Reply

nc February 26, 2010 at 9:47 am

Releases were current at the time of the issues.

For the rebooting box (DR node), it was running ak-2009.09.01.3.0 when we started working on it, and during the rebooting issue ak-2009.09.01.4.0 was released, and we upgraded to it. It didn’t coincide with the reboots stopping.

The prod boxes were running ak-2009.09.01.3.0, and still are – we’re basically not touching them until we get somewhere on the cases.

Thanks!

Reply

Henkis March 24, 2010 at 11:58 am

Hello again!

Did you end up upgrading to 2010.Q1 with the improved clustering and HBA failure recovery? Any other updates? Please share. We’re thinking of buying a couple of boxes, and even if not everybody has had the problems you are seeing, it would be very interesting to know if they were solved in a good way in the end.

Reply

nc March 24, 2010 at 12:36 pm

We have not yet.. however, the HBA failure recovery patches are supposed to deal with the nightly reboot issue that we’ve seen.

Reply

Gerry April 21, 2010 at 1:40 pm

We are having a nightmare installing 2 Unified Storage units (7110/7210) in an Oracle environment: creating a simple 18GB database takes 6 hours to complete, compared to 60 minutes on an existing nStor unit.

Comparison tests using a SCSI server, nStor SAN and Sun 7110.

Creating an Oracle database
System   Size    Time
SCSI     12 GB   30 minutes
nStor    26 GB   60 minutes
Sun      26 GB   6 hours 30 minutes

Importing an Oracle .DMP file
System   Size    Time
SCSI     18 GB   3 hours
nStor    18 GB   2 hours 30 minutes
Sun      18 GB   over 14 hours

Anyone got any advice – we could definitely use some help figuring this out. We have had 2 Sun Unified Storage technical resources involved in this installation and are at the point that the customer wants to throw it all out.
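To put the figures above on a common scale, here is a minimal sketch that converts each run into an average data rate. It assumes 1 GB = 1000 MB and treats each job as if it were one continuous stream, which a database create or import is not, so the results are only a rough comparison. The function name `throughput_mb_s` and the run labels are made up for this illustration.

```python
def throughput_mb_s(size_gb, hours):
    """Average rate in MB/s, taking 1 GB = 1000 MB (an assumption)."""
    return size_gb * 1000 / (hours * 3600)


# (label, size in GB, elapsed hours), copied from the benchmark figures above.
# The Sun import is listed as "over 14 hours", so its rate is an upper bound.
runs = [
    ("create / SCSI", 12, 0.5),
    ("create / nStor", 26, 1.0),
    ("create / Sun", 26, 6.5),
    ("import / SCSI", 18, 3.0),
    ("import / nStor", 18, 2.5),
    ("import / Sun", 18, 14.0),
]

if __name__ == "__main__":
    for label, size_gb, hours in runs:
        print(f"{label:15s} {throughput_mb_s(size_gb, hours):6.2f} MB/s")
```

Even as a rough average, every Sun run works out to well under 2 MB/s, which is far below what the disks can stream and points at per-write latency rather than raw bandwidth.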

Reply

nc April 21, 2010 at 1:43 pm

Do you have ZILs? If you want sync writes, that’s the only way I know of to get reasonable performance.

Reply

Gerry April 21, 2010 at 2:03 pm

The USS 7210/7110 were ordered without SSDs. Do you not require SSDs in order to realize the performance benefits associated with ZILs?

Reply

nc April 21, 2010 at 2:12 pm

Yeah, you’d need SSDs for ZIL. In my experience, on a 7210 without SSD’s, I was only getting ~30MB/s writes over iSCSI unless sync mode was disabled, in which case I could max out at least a gig-e link.

If you don’t require high data integrity in case of a system crash, etc, you could always turn off sync mode.. that helps quite a bit.
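The gap between sync and non-sync writes can be seen on any machine with a small probe like the one below. This is only a sketch against a local temporary file, not the iSCSI test described above: it writes the same amount of data twice, once calling `os.fsync()` after every block and once only at the end. The function name `write_throughput` and the sizes are arbitrary choices for the example, and the numbers it prints depend entirely on the disk and filesystem underneath.

```python
import os
import tempfile
import time


def write_throughput(path, total_mb=16, block_kb=64, sync=False):
    """Write total_mb of zeros in block_kb chunks and return MB/s.

    With sync=True every block is forced to stable storage with
    os.fsync() before the next one is written, which is roughly what
    a synchronous-write workload demands of the storage.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
            if sync:
                f.flush()
                os.fsync(f.fileno())
        # Flush once at the end so both modes finish with data on disk.
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "probe.bin")
        print("async: %.1f MB/s" % write_throughput(target, sync=False))
        print("sync:  %.1f MB/s" % write_throughput(target, sync=True))
```

On spinning disks without a fast log device, the per-block flush is usually what dominates, which is the same reason a ZIL on SSD makes such a difference.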

Reply

Gerry April 21, 2010 at 4:01 pm

Can anyone send me the contact info for the individual that posted the “designed to disappoint” article on the Sun Unified Storage? I would appreciate knowing how he resolved his performance problems.

Reply

Andy Paton April 28, 2010 at 5:31 pm

As an Oracle (Sun) Reseller, we had some major support issues with our customers over the last 18 months. The support from the Sun guys in the early days was poor. I think this was because the storage team wasn’t able to cope with the S7000 Kernel, ZFS and Network issues, which are completely different disciplines to the normal SAN and SCSI issues they usually handled.

But now in the UK we see some major changes in the Oracle EMEA support team; multi-disciplined, direct communication, engineering available and fixes arrive.

Every product will have issues, but the real test is how the Reseller and Oracle address these. I do believe that all parties are now more than ready!

Reply

nc April 28, 2010 at 5:40 pm

Thanks for the note.. glad to hear that you’ve seen things improving! I have no doubt that these products will be great given time.. I have a hunch that the timing of the merger combined with a brand-new product just kind of threw everything up in the air. ;)

Reply

Kyle Tucker June 2, 2010 at 8:01 am

Just chiming in on this thread. I have a 7410 cluster in one location and a solo 7410 in another, both serving VMware Virtual Center/ESX environments on NFS. We moved off expensive NetApp and EMC gear for these environments. While performance has been excellent, we’ve had several issues. On 3 occasions, supposedly failed disks have hung up access to the BUI and CLI on both nodes, with NFS service interruptions, leaving the cluster unusable. We still have issues with NDMP over fiber channel to CommVault. Some weeks backups are flawless, some weeks every one fails with write errors. Oracle support says the units weren’t certified with the Quantum (ADIC) i500 library. BS, we’re writing directly to the LTO4! Twice the cluster has failed to fail back, leaving the cluster in a hung state, with NFS shares not being served. Both times required calls to support to sync up the clusters and hours of downtime. Worst of all, since the 2010 Q1 release, we have NFS service interruptions on 50% of days at the point my scheduled snapshots are taken. I’m being told it’s a bug in dedupe, but it’s occurred on the single 7410 where I don’t even have any shares with dedupe enabled. I was asked to change my scheduled snapshots so they are staggered, but since changing a snapshot schedule blows away the existing snapshots (leaving me vulnerable since backups fail so often), I’ve refused. I am very disappointed in the units and support, and I had very high hopes for them. I have recently asked for a quote for a NetApp FAS3140 to put our production VMs on. :(

Reply

nc June 2, 2010 at 10:20 am

So sorry to hear that. ;( Bummed that 2010 Q1 didn’t help you out.. it really did sound like a majority of the bugs had at least been addressed on there.

Have you tried Nexenta perchance? I am really curious how many of the issues are the kernel and ZFS itself, and how many are related to the management stack.

Reply

Jim Nickel June 8, 2010 at 11:39 pm

I have customers with Sun/Oracle 7110’s. I have been able to get them to perform decently – you have to use RAID1 – RAID 5 performance is only acceptable with NFS.

However, I have one customer that has 2 of these and wants to use the 2nd one for replication and DR.

A major problem I am having is this: Starting or Stopping replication on the source 7110 makes the production 7110 completely unresponsive! This essentially brings down the entire environment! Completely unacceptable! After waiting 15 minutes for it to come back – (even if it did, I am not sure Windows VM’s running on vSphere 4 would survive that long without their disk) – my only option was to power down the 7110 and power it back up. It did come back, but I had one VM that was corrupted – good thing I had good backups.

These units have so much potential! I just wish it all worked as advertised.

Jim

Reply

Paul Sorgiovanni July 28, 2010 at 8:22 am

Answer here is easy. You get what you pay for; go & buy NetApp. Guaranteed it will work.

Paul.

Reply

nc July 28, 2010 at 9:01 am

That’s actually a bit ironic – we’re looking at replacing our current Netapp setup with Nexenta.. Netapp has just been too expensive to build out in a fully-redundant configuration when you need a ton of space. I don’t mind paying a semi-large chunk of change for the heads, but the disk price is crazy. ;(

We’ve also had stability issues with Netapp, but it’s mostly been because the systems aren’t set up in an ideal configuration, i.e. aggregates are way over-provisioned, no redundant heads, etc.. can’t blame Netapp for that. ;)

Reply

Anonymous August 3, 2010 at 7:49 am

Don’t waste money and effort on Nexenta.

Reply

nc August 3, 2010 at 11:41 am

It’d be great if you were willing to post more information about why you feel that way; everything I’ve heard from everyone else on it has been quite positive.

Reply

Niels Jensen August 27, 2010 at 3:49 pm

I feel for the original writer and while I may be a little late in the game here I just wanted to let you all know my story…

We were allegedly one of the first companies in the UK to deploy a 7410 system and from Day 1 we’ve had a ridiculous number of problems with it. In fact, of all the server-related issues we have in our company, the 7410 accounts for at least 80% of them, and we have over 70 servers (including virtual).

Initially the device couldn’t connect to our Windows 2008 AD domain as a NAS. The domain was a fresh, untouched vanilla domain and the CIFS service just couldn’t authenticate with the DC’s. It took months to resolve and in the end we had one of Sun’s senior Unified Storage engineers come out to try and bugfix. A few weeks later and we got connected… Hooray I thought, all problems solved…. Heh!

Many hardware failures later, inexplicable fail-overs to our other active head, and other buggy problems we still have one huge outstanding issue…

If we create a security group in our AD, add a few members of staff to the group and give that group access to an area on the NAS CIFS share, it doesn’t actually apply… As far as Windows is concerned it should, but it doesn’t! The only fix…. Reboot the damn controller – and this is a production environment… We now have to explain to users that they can’t access that folder until the next day as we can’t reboot the system during business hours.

It’s been months since this call was first logged. Oracle support always ask us to apply the patches and see if that fixes the bug; when it doesn’t, they ask for a fresh support bundle and spend another few weeks before they ask me to test another change (that never works) before asking for another support bundle… The last support response I got was to enable more in-depth logging using a workflow, replicate the issue and, you guessed it…. send them a support bundle! I’ve not done this yet as I have other things to manage and Oracle clearly have no idea what the hell is going on.

I honestly wish that we’d never bought the hardware and spent our money on an established storage provider (not that Sun wasn’t an established brand).

BTW Hi Andy P. – He’s my reseller support guy and in all fairness, his company have gone above and beyond in trying to help us sort this out. I can’t fault his company, but Sun(Oracle) are really being unhelpful.

Niels

Reply

nc August 27, 2010 at 5:04 pm

Hi Niels,

Thanks for your story, and sorry to hear that you have also had issues! :(

I have also heard about a lot of issues regarding head failover, etc, via private email.. at least that appears to be better with the newest release from what I’ve heard.

Also bummed to hear about the security group issues, but at least that is not “service-affecting” — annoying to have to reboot the head, but better than having things go entirely offline for any reason. Are you able to just fail over to the other head for that? (Not that you can do that during the business day anyways..)

Reply

nc August 27, 2010 at 5:09 pm

Added this to the post already, but adding to the comment stream to ensure that it gets sent to anyone who is subscribed to comments by email or rss:

Update [2010/08/27]: Thanks to some hard work from our reseller and people at Oracle, we were able to return these units a few months ago. I wish we had been able to work through the issues with Oracle, but needed to get something that we could trust online ASAP. For the record, I do think that we probably did receive a “batch” of bad units; I still have not heard from anyone else who has had multiple independent units fail simultaneously as we did. I will also keep comments open here, and encourage anyone who has had great deployments to post a followup – I do believe that there are people that have had great success with these units out there, or else they wouldn’t be selling so well! :) And again, thank you to the hard work from our reseller, and for all the people from Oracle (and those who used to work for Sun but did not move to Oracle) who did their best to help us!

Reply

nitin December 11, 2010 at 5:24 am

Dear Gurus,

I have a Sun Storage 7210 running the Oracle Sun Solaris OS. We can reach it remotely through ILOM and are able to power the storage off and on, but it is not booting up. It hangs after “GRUB loading stage 2”: the screen is completely black with a blinking cursor, and pings to its IP address time out. We then connected a monitor directly, but the problem is still the same. When we connect to the ILOM’s IP with PuTTY (through SSH), we get the 7210 > Login prompt. We logged in and typed the command “start /SP/console”, and after that, with other commands, we found the interface information showing the storage OS IP starting up. We then tried to ping that OS IP, but it does not respond. Please tell me whether this is a GRUB problem or something else. Please help.

Reply

Salman July 15, 2011 at 3:26 pm

Hi Nitin, just wondering if you were able to fix this issue and if you have a solution for it. I am actually experiencing the same with a Sun Storage 7210… THX in ADV

Reply

nc July 15, 2011 at 3:34 pm

Hi,

Your best bet is to check with the support staff at Sun.. we no longer have this hardware, but I have heard that much of it is fixed.

-Nate

Reply

SUNSAIN!!! January 6, 2011 at 11:54 pm

Ahh the MOON product, cause it is the biggest pain in my @55!

I tell anyone who is considering any SUN product or Solaris…. DON’T DO IT!!! They offer the worst service I have ever had!!! We have a range of servers, tape backup and SUN 7210 storage. The storage has been giving us nothing but grief for the past year and a half!!! I have only been here a couple of months and already have a new SAN product to replace it. We don’t use it and have moved the VMs and data to temp HP drives. The unit every now and then decides to just power down! Sun keep telling us someone is turning it off. We have provided them with the electronic key logs that show no one has been in the server room where this unit sits for 3 days prior! We have a P1 ticket and SUN just closes the job because they are as dense as their crappy product! If they reviewed their logs on all our calls, they would see a pattern: every 8 weeks or so the unit dies! They have put in a new hard drive and it’s not showing up. Their techs have been in here a couple of times in the past two weeks (very fast workers!), and the tech has a printout from another tech with instructions to do something…. but the instructions don’t match the BIOS GUI and he has no idea. Great for a global corporate solution. Fortunately we have other systems in place to manage this. On Christmas Eve the SAN kept powering off, rebooting, powering off again and rebooting. Our monitoring solutions kept showing servers going up and down, but SUN keep saying nothing is wrong. I want to go into their office and really cause some serious pain! They have no incentive, no nothing. It has been the most frustrating, un-customer-focused organisation I have ever dealt with!!! A month ago they suggested that we need to get the backplane, hard drive, controller and power module replaced, but they keep sending me a guy who has no idea how to navigate the BIOS GUI! He looks at the serial number on his paper and tells me that his techs told him that this is the old drive serial and that the new one isn’t working, but he doesn’t know. My guy tells him to go look at the unit. We have it sitting there now doing nothing! My point: IF YOU WANT TO BUY A SUN OR ANYTHING SOLARIS AFTER THIS!!!! THEN YOU’RE AN IDIOT! GIVE ME A THOUSAND LASHES ON THE BACK EVERY HOUR! ’Cause it’s less painful than dealing with SUN! The worst service and product support ever!!!

Reply

nc January 7, 2011 at 10:11 am

Interesting.. that’s almost exactly the issue we had with one of our units that was never solved. ;( I really hope they can get this resolved for you.

Reply

FERKA February 2, 2011 at 12:55 pm

I understand you, SUN was never like that!! Oracle just screwed up everything. Now they want to take over the support directly from the partners…..with so much incompetency…. I really wonder what they are thinking!! The way they resolve HW issues is similar to the software…..which simply cannot be the case (two different environments)….. SUN PRODUCTS ARE GOOD, SOLARIS IS IMPRESSIVE, INNOVATION IS THERE…. they have things that give you zest; unfortunately, bad service/support makes you think twice. …. I really feel sad for such a good product to fade….

Reply

SUNSAIN!!! January 7, 2011 at 12:00 am

Oh Yeah the tech has left and again no firmware update nothing…. I’m betting he’ll come back again in a couple of weeks wondering if the hard drive serial is displayed on the SUN GUI as he has the past two weeks.

Sorry for my frustrated expression, but this thing really boils my blood! I really wanted to share my love for SUN

Reply

Mark January 27, 2011 at 11:18 pm

I agree, these devices should not be sold without the SSD’s for Logzilla.
Sadly, we won’t spend the $14k for a mirrored pair of Logzillas either (you need a mirror in case one happens to fail – flash can go bad too)

I experimented with some OpenSolaris and 6 SATA disks using motherboard RAID, and with a SSD acting as a Logzilla there was a night and day difference, even with losing a disk to put the ssd in.

The new firmware is a lot better. I run about 30 VM’s across 3 hosts and use NFS mounts. It works, but it’s way too slow for production use. MS Patches can take hours.

Reply

NetSyphon March 9, 2011 at 9:48 pm

1. Sun never really backed you updating the ILOM, and the little “updates” never updated it.
2. Forget making good products, this is Oracle. They want to make money.
3. The good news is that if you have a 7000 series you WILL talk to an OG Sun guy. They are helpless to resolve your issues, but will agree that they exist.
4. I miss Dave Hitz vs. Jonathan Schwartz!

Sun was idealist and the customers took advantage of it; the problem is they didn’t make money… This was destined to end badly for everyone! I wish that IBM would have bought them but Larry screwed them out of that proposition too. Good news is the competition sees ZFS and has gotten competitive on features and pricing.

Reply

mark240 March 11, 2011 at 10:33 am

Can anyone comment on whether the latest hardware versions (7x20s) have the same issues?

Configuration of these seems a little clunky, i.e. you either get 4x 512 GB ZILs in a cluster for $48k or nothing. I read an authoritative source (Richard Elling or someone like that) who said any RAIDZ implementation would limit disk throughput to one disk’s worth and explained the reasons behind it, but no matter what I google, I can’t find that page again. Think that explains the 15-30 MB/s people are seeing? So it’s really RAID10 (pool of mirrors) or nothing for good throughput (which someone else posted)… which really means not using ZILs + read cache SSDs is not an option. Off topic: does anyone know if/when ZILs will be dedupe-aware like NetApp PAMs, or if it’s fundamentally impossible since reads and writes are going to different devices (I don’t think it should be, but throwing it out there)?

This pushes my solution away from the unified storage and toward a ‘regular’ oracle HA/ZFS/NFS cluster for a virtual infrastructure backend but I really want the dtrace gui and other cool stuff like that. Does anyone know if any/all features can be manually added to a traditional solaris 10/11 build/cluster? Anyone know if the unified storage just uses preconfigured sun cluster 3.2 or 3.3 and ha storage plus for NFS HA or some other secret sauce?

Im a n00b at oracle storage but love the idea of it all coming from a netapp background (like most people here) so sorry if the questions are elementary.

Reply

mark240 April 13, 2011 at 1:56 pm

Found it: http://blogs.sun.com/roch/entry/when_to_and_not_to

……….Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group. This is the price to
pay to achieve proper data protection without the 2X block overhead
associated with mirroring.

With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring. However, for
Inputs each side of a mirror can service read calls independently from
one another since each side holds the full information. Given a
proper software implementation that balances the inputs between sides
of a mirror, the FS blocks delivered by a mirrored group is actually
no less than what a simple non-protected RAID-0 stripe would give.

So looking at random access input load, the number of FS blocks per
second (FSBPS), Given N devices to be grouped either in RAID-Z, 2-way
mirrored or simply striped (a.k.a RAID-0, no data protection !), the
equation would be (where dev represents the capacity in terms of
blocks of IOPS of a single device):

          Blocks Available    Random FS Blocks / sec
          ----------------    ----------------------
RAID-Z    (N - 1) * dev       1 * dev
Mirror    (N / 2) * dev       N * dev
Stripe    N * dev             N * dev

Now let’s take 100 disks of 100 GB, each capable of 200 IOPS and
look at different possible configurations; In the table below the
configuration labeled:

“Z 5 x (19+1)”

refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disk + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.

Config          Blocks Available    Random FS Blocks / sec
------------    ----------------    ----------------------
Z 1 x (99+1)    9900 GB             200
Z 2 x (49+1)    9800 GB             400
Z 5 x (19+1)    9500 GB             1000
Z 10 x (9+1)    9000 GB             2000
Z 20 x (4+1)    8000 GB             4000
Z 33 x (2+1)    6600 GB             6600

M 2 x (50)      5000 GB             20000
S 1 x (100)     10000 GB            20000

So RAID-Z gives you at most 2X the number of blocks that mirroring
provides but hits you with much fewer delivered IOPS. That means
that, as the number of devices in a group N increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (cost in the range of [N/2, N]
fewer IOPS).
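The figures in the quoted table follow directly from the first-approximation formulas, and a short sketch makes that easy to check for other layouts. The function names `raidz`, `mirror` and `stripe` are invented for this example, and the defaults (100 GB, 200 IOPS per disk) are the ones used in the quote. This models only the random-read approximation described there, not real-world ZFS behaviour.

```python
def raidz(groups, disks_per_group, disk_gb=100, disk_iops=200):
    """Dynamic stripe of `groups` single-parity RAID-Z groups.

    Each group loses one disk to parity and, as a first approximation,
    delivers the random-read IOPS of a single disk.
    """
    capacity_gb = groups * (disks_per_group - 1) * disk_gb
    iops = groups * disk_iops
    return capacity_gb, iops


def mirror(total_disks, disk_gb=100, disk_iops=200):
    """2-way mirrors: half the capacity, every disk can serve reads."""
    return (total_disks // 2) * disk_gb, total_disks * disk_iops


def stripe(total_disks, disk_gb=100, disk_iops=200):
    """Plain stripe (no protection): all capacity, all disks serve reads."""
    return total_disks * disk_gb, total_disks * disk_iops


if __name__ == "__main__":
    # (groups, disks per group) for the RAID-Z rows of the quoted table.
    for groups, per_group in [(1, 100), (2, 50), (5, 20), (10, 10), (20, 5), (33, 3)]:
        gb, iops = raidz(groups, per_group)
        print(f"Z {groups} x ({per_group - 1}+1): {gb} GB, {iops} random reads/s")
    print("M 2 x (50):  %d GB, %d random reads/s" % mirror(100))
    print("S 1 x (100): %d GB, %d random reads/s" % stripe(100))
```

Under this model a 48-disk box laid out as a few wide RAID-Z groups would deliver only a few disks’ worth of random reads, which is consistent with the advice elsewhere in the thread to use a pool of mirrors for VM storage.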

Reply

Henry April 6, 2013 at 2:05 am

> “When troubleshooting with Sun Oracle, their first step is to reboot ”

WTF?!
# export LART=”oracle”

Thanks for the story. Sounds a little bit like ESD-damaged parts?
> receive a “batch” of bad units
> a ‘firmware bug’ with no fix
> Oracle no longer offers Try-and-Buy programs. (may say something about the assembly?)
> whenever it rebooted the mpt driver threw errors. (?)

I have wanted to experiment with Solaris for some years, but haven’t gotten around to actually doing it. The knowledge could be useful in a job, and if I enjoyed it I would work more with Solaris. Never underrate the power of undirected knowledge acquisition!

The sad thing is that Oracle has killed some unofficial support and fan pages, pages with information on how to use Solaris, find bugs, and such things.
Maybe they don’t want clever people fixing their own problems?
(=no need to call for support, e.g. more downtime and support payment. )
Maybe it was a good thing that I was too slow? :)

Reply
