back end randomly transitioned to failed

Bug #485976 reported by ChrisW
This bug affects 1 person
Affects: gocept.zeoraid
Status: Confirmed
Importance: Critical
Assigned to: Christian Theune

Bug Description

I just got an alert from our monitoring because zeo1 for the packed storage on zeoraid2 has transitioned to "failed".

On zeoraid1, I get "Could not connect to ZEO server at 0.0.0.0:6001" when trying to check the status of the packed storage.

The unpacked storage on both servers is still optimal, which is odd.
Nothing in zeoraid2's event or debug logs.

In zeoraid1's event log I see:

2009-11-20T17:14:34 INFO ZEO.zrpc.Connection(S) (127.0.0.1:55446) lastTransaction() raised exception: RAID is inconsistent and was closed.
RuntimeError: RAID is inconsistent and was closed.

There's nothing more informative in the zeoraid1 debug log and nothing in zeo1's event or debug log.

So, I'm a little confused :-S

Christian Theune (ctheune) wrote:

Ok, so *something* made zeoraid2 think that zeo1 is failed.

This causes zeoraid2 to stop writing to zeo1. If a transaction goes through zeoraid1 and ends up in both zeo1 and zeo2, then zeo1 will quickly appear as inconsistent, which is correct.

I think the major problem here is that you can't see why zeoraid2 is degrading zeo1, which is an issue already described in another bug.
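
For illustration, here is a minimal sketch (not the actual gocept.zeoraid code; the class and method names are made up) of the behaviour described above: once one front end degrades a backend it stops writing to it, and any front end that later compares lastTransaction() across its backends and sees them disagree has to declare the whole RAID inconsistent and close it.

    # Hypothetical sketch, not gocept.zeoraid source code.
    class RaidFrontendSketch:
        def __init__(self, backends):
            # backends: dict name -> storage exposing lastTransaction()
            self.backends = backends
            self.degraded = set()

        def degrade(self, name, reason):
            # After degrading, this front end no longer writes to `name`,
            # so it falls behind backends other front ends keep writing to.
            print("CRITICAL gocept.zeoraid Storage %s degraded. Reason: %s"
                  % (name, reason))
            self.degraded.add(name)

        def lastTransaction(self):
            # Compare the last committed tid of all optimal backends.
            tids = set(storage.lastTransaction()
                       for name, storage in self.backends.items()
                       if name not in self.degraded)
            if len(tids) > 1:
                # This is the error seen in zeoraid1's event log.
                raise RuntimeError("RAID is inconsistent and was closed.")
            return tids.pop()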

ChrisW (chris-simplistix) wrote:

Could you un-mark this as a duplicate? This isn't a duplicate of anything; the logging issue was related, but not *the* issue.

Anyway, it happened again. It was the packed storage again.

On zeoraid2:
2009-11-23T08:22:31 CRITICAL gocept.zeoraid Storage zeo2 degraded. Reason: inconsistent OIDs

On zeoraid1:
2009-11-23T08:24:33 ERROR ZEO.zrpc.Connection(S) (127.0.0.1:56657) lastTransaction() raised exception: RAID is inconsistent and was closed.

That's about all the useful logging I could find. If you want any more specifics, I have debug logs from both zeoraids and both zeos. Let me know what you want :-)

ChrisW (chris-simplistix) wrote:

Okay, this happened again...

On zeoraid2:

10:41 - CRITICAL: Storage zeo1 degraded. Reason: inconsistent OIDs

On zeoraid1:

11:22 - ERROR: lastTransaction() raised exception: RAID is inconsistent and was closed.

This is the packed storage once more.

Now, thinking about this more, I leave the clients connected to zeoraid1, and yet the initial error here happens on zeoraid2, which should be idle...

So, here's the event log from zeoraid1:
------
2009-11-25T10:40:33 INFO ZEO.StorageServer (13221) new connection ('client', 57615): <ManagedServerConnection ('client', 57615)>
------
2009-11-25T10:40:33 INFO ZEO.StorageServer (13221) disconnected
------
2009-11-25T10:40:38 INFO ZEO.StorageServer (13221) new connection ('client', 57617): <ManagedServerConnection ('client', 57617)>
------
2009-11-25T10:40:38 INFO ZEO.StorageServer (13221) disconnected
------
2009-11-25T11:22:32 INFO ZEO.StorageServer (13221) new connection ('client', 58926): <ManagedServerConnection ('client', 58926)>
------
2009-11-25T11:22:32 INFO ZEO.zrpc.Connection(S) (client:58926) received handshake 'Z303'
------
2009-11-25T11:22:32 INFO ZEO.StorageServer (13221/client:48809) disconnected
------
2009-11-25T11:22:32 INFO ZEO.ClientStorage [('zeo1', 6011)] Disconnected from storage: "('zeo1', 6011)"
------
2009-11-25T11:22:32 INFO ZEO.ClientStorage [('zeo2', 6011)] Disconnected from storage: "('zeo2', 6011)"
------
2009-11-25T11:22:33 ERROR ZEO.zrpc.Connection(S) (client:58926) lastTransaction() raised exception: RAID is inconsistent and was closed.

Likewise from zeoraid2:
------
2009-11-25T10:40:33 INFO ZEO.StorageServer (1670) new connection ('client', 57616): <ManagedServerConnection ('client', 57616)>
------
2009-11-25T10:40:33 INFO ZEO.zrpc.Connection(S) (client:57616) received handshake 'Z303'
------
2009-11-25T10:40:38 INFO ZEO.StorageServer (1670) new connection ('client', 57618): <ManagedServerConnection ('client', 57618)>
------
2009-11-25T10:40:38 INFO ZEO.zrpc.Connection(S) (client:57618) received handshake 'Z303'
------
2009-11-25T10:41:36 CRITICAL gocept.zeoraid Storage zeo1 degraded. Reason: inconsistent OIDs
------
2009-11-25T10:41:36 INFO ZEO.ClientStorage [('zeo1', 6011)] Disconnected from storage: "('zeo1', 6011)"
------
2009-11-25T11:22:32 INFO ZEO.StorageServer (1670) new connection ('client', 58927): <ManagedServerConnection ('client', 58927)>
------
2009-11-25T11:22:33 INFO ZEO.zrpc.Connection(S) (client:58927) received handshake 'Z303'
------
2009-11-25T11:22:33 INFO ZEO.StorageServer (1670) tid to old for invq 252843237928668467 < 252847970190423022
------

So, the questions I have:

- Why is the client connecting to both zeoraid1 and zeoraid2 at the same time?

- What causes the disconnections on zeoraid1 at 10:40:33 and 10:40:38?

- Which connection is the client using, based on this logging? (if you need debug logging stuff, lemme know which timeframe for)

- What triggers the inconsistent OIDs on zeoraid2 at 10:41, and how does this lead to inconsistent raid on zeoraid2 at 11:22?

Request:

- please can you un-mark this bug as a duplicate

- can you add some logging to show what connection is in use, or let me know what I should grep for to find...


Changed in gocept.zeoraid:
milestone: none → 1.0b8
importance: Undecided → Critical
Changed in gocept.zeoraid:
assignee: nobody → Christian Theune (ct-gocept)
Christian Theune (ctheune) wrote:

Alright, I looked at the timeline, and in order to answer your questions I need some trace logs.

For each of zeo1, zeo2, zr1 and zr2, I'd like trace logs for the time ranges:

10:40:00 to 10:42:00

and

11:22:00 to 11:24:00

Christian Theune (ctheune) wrote:

Note: Chris sent me the log files via email to keep them confidential.

> - Why is the client connecting to both zeoraid1 and zeoraid2 at the same time?

I think that's the management client you see connecting. This is another reason why I want to move to out-of-band communication for status retrieval and control.

> - What causes the disconnections on zeoraid1 at 10:40:33 and 10:40:38?

Again, that's probably the management client.

> - Which connection is the client using, based on this logging? (if you need debug logging stuff, lemme know which timeframe for)

I'd say that in the first time frame (10:40-10:42) ZR1 is being connected to by the Zope server but not used and ZR2 is being connected to by the batch processor and actually doing some serious business.

> - What triggers the inconsistent OIDs on zeoraid2 at 10:41, and how does this lead to inconsistent raid on zeoraid2 at 11:22?

You're asking about zeoraid1 at 11:22, right?

I'm not yet sure what makes the OIDs inconsistent. I have the feeling that it's the packing. However, I wonder what happened before that transaction in ZR2 that came out with inconsistent OIDs.

> - can you add some logging to show what connection is in use, or let me know what I should grep for to find...

Not quite sure what you mean. The connection is displayed: at some point a client connects and gets a storage server id assigned. All later calls have that id in them.
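
If it helps, a throwaway filter along those lines (the file name and the connection id are just placeholders taken from the excerpts above):

    # Throwaway helper: collect all log lines mentioning one connection id.
    import re

    def lines_for_connection(path, conn_id):
        pattern = re.compile(re.escape(conn_id))
        with open(path) as f:
            return [line.rstrip("\n") for line in f if pattern.search(line)]

    for line in lines_for_connection("zeoraid1-event.log", "client:58926"):
        print(line)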

Christian Theune (ctheune) wrote:

Almost forgot about that event at 11:22.

I think that's a genuine problem of two ZRs running next to each other: the check for lastTransaction() isn't protected by a global lock and thus suffers from a race condition.
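
Roughly like this (hypothetical code, just to illustrate the race, not the real implementation): each front end reads lastTransaction() from every backend without any lock shared between the two processes, so a commit by the other front end can land between the reads and the comparison then sees different tids.

    # Hypothetical illustration of the race, not gocept.zeoraid source code.
    def check_consistency_unlocked(backends):
        tids = []
        for storage in backends:
            # The other ZEORaid instance may commit to some backends
            # between these reads, because nothing serializes the two.
            tids.append(storage.lastTransaction())
        if len(set(tids)) > 1:
            raise RuntimeError("RAID is inconsistent and was closed.")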

ChrisW (chris-simplistix) wrote: Re: [Bug 485976] Re: back end randomly transitioned to failed

Christian Theune wrote:
>> - Which connection is the client using, based on this logging? (if you
>> need debug logging stuff, lemme know which timeframe for)
>
> I'd say that in the first time frame (10:40-10:42) ZR1 is being
> connected to by the Zope server but not used and ZR2 is being connected
> to by the batch processor and actually doing some serious business.

That's odd, since ZR1 is listed first on all clients...
Why would ZR1 not be used?

>> - What triggers the inconsistent OIDs on zeoraid2 at 10:41, and how
>> does this lead to inconsistent raid on zeoraid2 at 11:22?
>
> You're asking about zeoraid1 at 11:22, right?

Nope, zeoraid2, at 10:41, as I put in a previous message:
2009-11-25T10:41:36 CRITICAL gocept.zeoraid Storage zeo1 degraded.
Reason: inconsistent OIDs

> I'm not yet sure what makes the OIDs inconsistent. I have the feeling
> that it's the packing.

...but packing only happens once a week, and not on a Wednesday.

> However, I wonder what happened before that
> transaction in ZR2 that came out with inconsistent OIDs.

Anything I can do to help you find out?

>> - can you add some logging to show what connection is in use, or let
>> me know what I should grep for to find...
>
> Not quite sure what you mean. The connection is displayed: at some point
> a client connects and gets a storage server id assigned. all later
> calls have that id in them.

That would be the "what I should grep for" answer then ;-)

Chris

Christian Theune (ctheune) wrote:

Clients try connecting to all servers at once and will take whichever answers quickest. I'm not sure about all of the details, but I can imagine that network burps or even busy ZEO servers could cause the second server to be preferred for the second client.
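
For reference, this is roughly how a client ends up with both front ends as candidates (hosts, port and storage name here are placeholders, not taken from your actual configuration): the client is handed a list of addresses and keeps whichever connection succeeds first.

    # Sketch of a ZEO client configured against both ZEORaid front ends.
    from ZEO.ClientStorage import ClientStorage

    addresses = [('zeoraid1', 8100), ('zeoraid2', 8100)]
    storage = ClientStorage(addresses, storage='1', read_only=0)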

The second question: you asked "What triggers the inconsistent OIDs on zeoraid2 at 10:41, and how does this lead to inconsistent raid on zeoraid2 at 11:22?"

At 11:22 there's interesting stuff happening only on zeoraid1. I was checking whether that "zeoraid2" was a typo or whether you did see something interesting on zeoraid2 at that time.

In general, though, the most interesting question now is why we lose OID consistency at 10:41, agreed. So, could you get me the detailed logs of the 10 minutes leading up to 10:41?

Changed in gocept.zeoraid:
status: New → Confirmed
ChrisW (chris-simplistix) wrote:

Okay, I'll email you the debug logs from zeoraid2 for 10:40-10:50.

As for the connections: I guess the two zeoraids need to talk to each other and only accept connections when they're sure the other zeoraid is dead. I know we talked about this before, but have you made any progress on this issue? I remember there being thorny edge cases...
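
As a rough illustration of that idea (nothing like this exists in gocept.zeoraid today; the names and the timeout are made up): a front end would only serve clients if it holds the active role, or if the peer front end has been silent for long enough to be considered dead.

    # Rough illustration only; not an existing gocept.zeoraid feature.
    import time

    HEARTBEAT_TIMEOUT = 30  # seconds, made-up value

    def may_accept_clients(i_am_active, last_peer_heartbeat):
        # Active front end always serves; the passive one only takes over
        # once the peer has missed its heartbeats for HEARTBEAT_TIMEOUT.
        if i_am_active:
            return True
        peer_alive = (time.time() - last_peer_heartbeat) < HEARTBEAT_TIMEOUT
        return not peer_alive

The thorny part is presumably deciding when "silent long enough" really means dead rather than just a network partition.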
