when repeatedly running command line clients they sometimes (once in 200 tries) don't connect

Bug #541275 reported by Jeff Hill
This bug affects 1 person
Affects: EPICS Base
Status: Fix Released
Importance: Critical
Assigned to: Andrew Johnson

Bug Description

Bob Soliday has reported that when repeatedly running
command line clients with circuits to several IOCs,
they sometimes (about once in 200 tries) fail to connect.

Additional information:

More information (from Andrew) ...

I changed the timeout on the final ca_pend_io() call to 10 minutes, and
ran the test code while I had my caClientMonitor program monitoring all
the CA searches on the network. After a while all searches from this
workstation (hydroxide) stopped while the program was still active.
This is the output from caClientMonitor over that transition:

2006-02-13 17:07:16.280346070 Clients=66 Searches=904 Rate=90.12
     Client's UDP Port Age Req's Names
     hydroxide:59941 0.25 41 35
     hydroxide:59943 0.08 39 35
     hydroxide:59937 0.37 71 35
     hydroxide:59939 0.82 62 35
     hydroxide:59909 0.61 12 35
     hydroxide:59911 0.05 39 35
     hydroxide:59917 0.66 42 35
     hydroxide:59919 0.67 74 35
     hydroxide:59913 1.46 54 35
     hydroxide:59915 0.05 39 35
     hydroxide:59925 0.22 62 35
     hydroxide:59927 0.05 39 35
     hydroxide:59921 0.08 39 35
     hydroxide:59923 0.62 42 35
     hydroxide:59933 0.25 41 35
     hydroxide:59935 0.05 39 35
     hydroxide:59929 0.35 83 35
     hydroxide:59931 0.74 56 35
     gateway437:41495 39.57 1 1
     gateway437:41485 12.28 20 10
     gateway431:40510 45.31 9 2

2006-02-13 17:07:26.340245867 Clients=61 Searches=675 Rate=67.10
     Client's UDP Port Age Req's Names
     hydroxide:59949 0.30 56 35
     hydroxide:59951 0.05 39 35
     hydroxide:59945 0.34 69 35
     hydroxide:59947 0.74 62 35
     hydroxide:59957 0.26 41 35
     hydroxide:59959 0.05 39 35
     hydroxide:59953 0.33 68 35
     hydroxide:59955 0.72 56 35
     hydroxide:59965 0.62 42 35
     hydroxide:59967 0.10 71 33
     hydroxide:59961 0.32 90 35
     hydroxide:59963 0.23 39 35
     gateway431:40510 57.33 3 2

2006-02-13 17:07:36.358291235 Clients=52 Searches=1049 Rate=104.71
     Client's UDP Port Age Req's Names
     cyclops:46721 0.00 1 1
     cyclops:46723 0.00 8 8
     ctlapps4l:39942 0.00 6 6
     landau:33048 58.90 700 20
     cyclops:46711 0.00 1 1
     cyclops:46717 0.11 142 120
     cyclops:46719 0.04 126 120
     cyclops:46713 0.00 1 1
     cyclops:46715 0.00 3 3
     gateway437:41495 63.16 1 1
     gateway436:32971 1.91 60 10

2006-02-13 17:07:46.370780548 Clients=36 Searches=3507 Rate=350.26
     Client's UDP Port Age Req's Names
     landau:33048 68.15 3475 20
     gateway437:41485 36.34 10 10
     gateway436:32971 6.07 20 10
     gateway431:40510 73.71 2 2

2006-02-13 17:07:56.412832476 Clients=17 Searches=3198 Rate=318.46
     Client's UDP Port Age Req's Names
     landau:33048 78.96 3175 20
     gateway436:32971 16.68 10 10
     echo:38210 0.21 4 1
     echo:38212 0.22 9 7

At this point the program is sitting in ca_pend_io() but not sending any
search requests at all. The relevant parts of a netstat output are:

hydroxide% netstat -a

UDP: IPv4
    Local Address Remote Address State
-------------------- -------------------- -------
       *.5065 Idle

TCP: IPv4
  Local Address Remote Address Swind Send-Q Rwind Recv-Q State
---------------- ----------------- ----- ------ ----- ------ -------
hydroxide.36974 iocmtime.5064 8192 0 49640 0 ESTABLISHED
hydroxide.36975 iocbramp.5064 8192 0 49640 0 ESTABLISHED
hydroxide.36976 iocinjtime.5064 8192 0 49248 0 ESTABLISHED
hydroxide.36977 iocrfsource.5064 8192 0 49640 0 ESTABLISHED
hydroxide.36978 iocbbpm3.5064 8192 0 49304 0 ESTABLISHED
hydroxide.36979 iocbbpm1.5064 8192 0 49024 0 ESTABLISHED
hydroxide.36980 iocbbpm2.5064 8192 0 49304 0 ESTABLISHED

There were no TCP sockets in TIME_WAIT state.

I took the opportunity to kill -QUIT the program and examine the
resulting core dump:

(dbx) threads
o> t@1 a l@1 ?() sleep on 0x12e410 in __lwp_park()
     t@2 a l@2 start_routine() sleep on 0x12eb00 in __lwp_park()
     t@3 a l@3 start_routine() sleep on 0x12ed48 in __lwp_park()
     t@4 a l@4 start_routine() LWP suspended in _so_recvfrom()
     t@5 a l@5 start_routine() LWP suspended in _so_recv()
     t@6 a l@6 start_routine() sleep on 0x167240 in __lwp_park()
     t@7 a l@7 start_routine() LWP suspended in _so_recv()
     t@8 a l@8 start_routine() sleep on 0x1f1008 in __lwp_park()
     t@9 a l@9 start_routine() LWP suspended in _so_recv()
    t@10 a l@10 start_routine() sleep on 0x1f5290 in __lwp_park()
    t@11 a l@11 start_routine() LWP suspended in _so_recv()
    t@12 a l@12 start_routine() sleep on 0x1f9518 in __lwp_park()
    t@13 a l@13 start_routine() LWP suspended in _so_recv()
    t@14 a l@14 start_routine() sleep on 0x1fa0e0 in __lwp_park()
    t@15 a l@15 start_routine() LWP suspended in _so_recv()
    t@16 a l@16 start_routine() sleep on 0x203008 in __lwp_park()
    t@17 a l@17 start_routine() LWP suspended in _so_recv()
    t@18 a l@18 start_routine() sleep on 0x207290 in __lwp_park()

The main thread is in ca_pend_io():

(dbx) where t@1
current thread: t@1
   [1] __lwp_park(0x4, 0xffbfbfd8, 0x0, 0x1, 0x0, 0x0), at 0xff065998
   [2] cond_wait_queue(0x12e410, 0xff078b48, 0x0, 0x0, 0xff350000,
0xff078000), at 0xff062b9c
   [3] cond_wait_common(0x0, 0x12e3f8, 0xffbfbfd8, 0x0, 0x0,
0x43f1112a), at 0xff063114
   [4] _ti_cond_timedwait(0x12e410, 0x12e3f8, 0xffbfc184, 0xa259c, 0x34,
0xffbfeed6), at 0xff0635a4
   [5] _cond_timedwait_cancel(0x12e410, 0x12e3f8, 0xffbfc184, 0xde400,
0x0, 0x0), at 0xff0635d8
   [6] _ti_pthread_cond_timedwait(0x12e410, 0x12e3f8, 0xffbfc184, 0x2,
0xffbfeb9b, 0xffbfc184), at 0xff063618
=>[7] condTimedwait(condId = 0x12e410, mutexId = 0x12e3f8, time =
0xffbfc184), line 65 in "osdEvent.c"
   [8] epicsEventWaitWithTimeout(pevent = 0x12e3f8, timeout = 600.0),
line 157 in "osdEvent.c"
   [9] epicsEvent::wait(this = 0x12e3b8, timeOut = 600.0), line 72 in
"epicsEvent.cpp"
   [10] ca_client_context::blockForEventAndEnableCallbacks(this =
0x12e340, event = CLASS, timeout = 600.0), line 607 in
"ca_client_context.cpp"
   [11] ca_client_context::pendIO(this = 0x12e340, timeout = 600.0),
line 506 in "ca_client_context.cpp"
   [12] ca_pend_io(timeout = 600.0), line 481 in "access.cpp"
   [13] main(argc = 2, argv = 0xffbff364), line 218 in "debugTest.c"

This is the ipAddrToAsciiEngine worker thread:

(dbx) where t@2
current thread: t@2
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x1, 0xff078000, 0x0), at 0xff065998
   [2] cond_wait_queue(0x12eb00, 0xff078b48, 0x0, 0x0, 0xff350200,
0xff078000), at 0xff062b9c
   [3] _cond_wait_cancel(0x12eb00, 0x12eae8, 0xca9e8, 0xfeeafbd8,
0x1f9590, 0x920), at 0xff063358
   [4] _ti_pthread_cond_wait(0x12eb00, 0x12eae8, 0xca9e8, 0xfeeafbd8,
0x81010100, 0xff0000), at 0xff063394
   [5] condWait(condId = 0x12eb00, mutexId = 0x12eae8), line 75 in
"osdEvent.c"
   [6] epicsEventWait(pevent = 0x12eae8), line 137 in "osdEvent.c"
   [7] epicsEvent::wait(this = 0x12eaac), line 63 in "epicsEvent.cpp"
   [8] ipAddrToAsciiEnginePrivate::run(this = 0x12e688), line 244 in
"ipAddrToAsciiAsynchronous.cpp"
   [9] epicsThreadCallEntryPoint(pPvt = 0x12eab4), line 59 in
"epicsThread.cpp"
   [10] start_routine(arg = 0x12ebc8), line 320 in "osdThread.c"

Here's the timerQueue thread:

(dbx) where t@3
current thread: t@3
=>[1] __lwp_park(0x4, 0xfee7fa60, 0x0, 0x1, 0x0, 0x0), at 0xff065998
   [2] cond_wait_queue(0x12ed48, 0xff078b48, 0x0, 0x0, 0xff350400,
0xff078000), at 0xff062b9c
   [3] cond_wait_common(0x0, 0x12ed30, 0xfee7fa60, 0x0, 0x0,
0x43f1116b), at 0xff063114
   [4] _ti_cond_timedwait(0x12ed48, 0x12ed30, 0xfee7fc0c, 0x0, 0x0,
0x0), at 0xff0635a4
   [5] _cond_timedwait_cancel(0x12ed48, 0x12ed30, 0xfee7fc0c, 0xa26c4,
0xfee7fa74, 0x10), at 0xff0635d8
   [6] _ti_pthread_cond_timedwait(0x12ed48, 0x12ed30, 0xfee7fc0c, 0x3,
0x1e5273eb, 0xfee7fc0c), at 0xff063618
   [7] condTimedwait(condId = 0x12ed48, mutexId = 0x12ed30, time =
0xfee7fc0c), line 65 in "osdEvent.c"
   [8] epicsEventWaitWithTimeout(pevent = 0x12ed30, timeout =
65.534625872), line 157 in "osdEvent.c"
   [9] epicsEvent::wait(this = 0x12ecb4, timeOut = 65.534625872), line
72 in "epicsEvent.cpp"
   [10] timerQueueActive::run(this = 0x12ec60), line 70 in
"timerQueueActive.cpp"
   [11] epicsThreadCallEntryPoint(pPvt = 0x12ecbc), line 59 in
"epicsThread.cpp"
   [12] start_routine(arg = 0x12ee10), line 320 in "osdThread.c"

At this point I imagine you might be able to extract additional
information about the state of the client library, so I have kept the
core file and executable from this.

On a later run, I ran caClientMonitor set up to monitor just searches
from this workstation, and found that during the 10 minute period there
appeared to be no searches being sent out at all (or at least none that
the CAS tool was being presented with).

Original Mantis Bug: mantis-240
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=240

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

>
> Andrew: Do you have preemptive POSIX scheduling enabled?

% cd configure
% grep PRIORITY * os/*
CONFIG_SITE:USE_POSIX_THREAD_PRIORITY_SCHEDULING = NO

Thus we have preemptive scheduling, but it's not priority-based.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> Is Bob's app thrashing TCP circuits as defined by the definition
> below? Presumably he would need to run his application several times
> per minute and/or connect to many IOCs.

If this was the cause of the problem, we'd have seen it on R3.14.6 as
well as the later versions. So far we haven't seen the channel
connection failure on that version of base at all.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

It looks like these changes should resolve this issue

cvs diff -u -wb -i -- searchTimer.cpp udpiiu.cpp udpiiu.h (in directory D:\users\hill\R3.14.dll_hell_fix\epics\base\src\ca\)
Index: searchTimer.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/searchTimer.cpp,v
retrieving revision 1.33.2.14
diff -c -u -w -b -i -r1.33.2.14 searchTimer.cpp
cvs diff: conflicting specifications of output style
--- searchTimer.cpp 6 Jun 2006 22:53:04 -0000 1.33.2.14
+++ searchTimer.cpp 6 Jun 2006 23:12:42 -0000
@@ -44,6 +44,7 @@
         const unsigned indexIn,
         epicsMutex & mutexIn,
         bool boostPossibleIn ) :
+ timeAtLastSend ( epicsTime::getCurrent () ),
     timer ( queueIn.createTimer () ),
     iiu ( iiuIn ),
     mutex ( mutexIn ),
@@ -133,6 +134,8 @@
             guard, *pChan, this->index );
     }

+ this->timeAtLastSend = currentTime;
+
     // boost search period for channels not recently
     // searched for if there was some success
     if ( this->searchResponses && this->boostPossible ) {
@@ -273,8 +276,6 @@
     this->dgSeqNoAtTimerExpireEnd =
         this->iiu.datagramSeqNumber ( guard ) - 1u;

- this->timeAtLastSend = currentTime;
-
 # ifdef DEBUG
         if ( this->searchAttempts ) {
             char buf[64];
Index: udpiiu.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/udpiiu.cpp,v
retrieving revision 1.103.2.28
diff -c -u -w -b -i -r1.103.2.28 udpiiu.cpp
cvs diff: conflicting specifications of output style
--- udpiiu.cpp 22 Nov 2005 00:11:16 -0000 1.103.2.28
+++ udpiiu.cpp 6 Jun 2006 19:35:49 -0000
@@ -1115,13 +1115,19 @@

 void udpiiu::updateRTTE ( double measured )
 {
+ if ( measured > maxRoundTripEstimate ) {
+ measured = maxRoundTripEstimate;
+ }
+ if ( measured < minRoundTripEstimate ) {
+ measured = minRoundTripEstimate;
+ }
     double error = measured - this->rtteMean;
     this->rtteMean += 0.25 * error;
 }

 double udpiiu::getRTTE () const
 {
- return epicsMax ( this->rtteMean, minRoundTripEstimate );
+ return this->rtteMean;
 }

 unsigned udpiiu::getHostName (
Index: udpiiu.h
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/udpiiu.h,v
retrieving revision 1.24.2.15
diff -c -u -w -b -i -r1.24.2.15 udpiiu.h
cvs diff: conflicting specifications of output style
--- udpiiu.h 22 Nov 2005 00:11:16 -0000 1.24.2.15
+++ udpiiu.h 6 Jun 2006 19:35:51 -0000
@@ -80,6 +80,7 @@
 };

 static const double minRoundTripEstimate = 32e-3; // seconds
+static const double maxRoundTripEstimate = 10; // seconds
 static const double maxSearchPeriodDefault = 5.0 * 60.0; // seconds
 static const double maxSearchPeriodLowerLimit = 60.0; // seconds
 static const double beaconAnomalySearchPeriod = 5.0; // seconds

***** CVS exited normally with code 1 *****
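
The udpiiu change above bounds the round-trip-time sample *before* it enters the exponentially weighted moving average, instead of clamping only on read in getRTTE(). A standalone sketch of the patched estimator (constants copied from udpiiu.h; the class name and the initial-value choice are assumptions made here for illustration):

```cpp
#include <algorithm>

// Sketch of the patched RTT estimator from udpiiu.cpp: each measured
// sample is clamped into [min, max] before the moving-average update,
// so one huge outlier (e.g. a response long delayed in the WAN) can no
// longer drag the mean -- and hence the search retry interval -- far
// out of range.
static const double minRoundTripEstimate = 32e-3; // seconds
static const double maxRoundTripEstimate = 10.0;  // seconds

class RTTEstimator {
public:
    void update(double measured) {
        // Clamp first (the fix), then apply the same EWMA with
        // gain 0.25 that udpiiu::updateRTTE() uses.
        measured = std::min(std::max(measured, minRoundTripEstimate),
                            maxRoundTripEstimate);
        rtteMean_ += 0.25 * (measured - rtteMean_);
    }
    double get() const { return rtteMean_; }
private:
    // Starting at the minimum is an assumption for this sketch.
    double rtteMean_ = minRoundTripEstimate;
};
```

With the old code a single pathological sample (say, hundreds of seconds) would be averaged in unclamped, inflating rtteMean; with the clamp, even update(600.0) can move the mean no further than toward the 10-second ceiling.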

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I am sorry, but I have misplaced your test code. How do you want to proceed? Would you like to run tests there to confirm that my patch works, or send the test code here in the hope that I can reproduce the problem in my situation? (I fear that this one may depend on the speed and/or version of the server that responds to the first search request.)

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Bob Soliday:

Sorry it took me so long to get back to you. I have been pretty busy. I got a chance today to test your changes with the latest version of the R3-14-2_branch. This solved the problem I was seeing.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Looking at this in more detail, I am mystified as to how this bug is occurring in practice. The original locking design appears to protect against problems.

I am starting to suspect that the UDP system is delivering a response solicited during a previous client's lifetime, but I don't see how that can happen: the OS would presumably not reuse the same ephemeral UDP port assignment quickly enough for stale responses to still be queued up somewhere in transit across the WAN.

edited on: 2006-06-21 17:23

Revision history for this message
Andrew Johnson (anj) wrote :

We're not getting complaints about this any more, I'm going to assume it's fixed.

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.10 released.
