[Solved] remote srcp server reboots


Postby rjversluis » 02.08.2013, 06:41

You can give 5642 a try.
Best Regards, Rob.
:!: PS: Do not forget to attach the usual files.
[ macOS - Linux] - [ N: CBus - CAN-GCA ] - [ 0: RocNetNode - GCA-Pi ]
rjversluis
Site Admin
 

Postby Richard-TX » 02.08.2013, 06:47

Having given this some more thought, I realize the same issue occurs with telnet.

Telnet to a remote machine. If the remote machine suddenly dies, the telnet session stays up and does not disconnect until the user types something. Only then does the telnet session error out and die.

In short, this issue is a byproduct of the design of srcpd. Since the traffic on the info channel is one-way only, an info channel connection that suddenly dies (kernel panic, etc.) is never detected.
Richard-TX
 

Postby rjversluis » 02.08.2013, 08:01

Rocrail detects it after issuing a command. If it detects that the connection is broken, both channels are disconnected and closed, and a reconnect is triggered.
Best Regards, Rob.
rjversluis
Site Admin
 

Postby Richard-TX » 02.08.2013, 12:30

rjversluis wrote: Rocrail detects it after issuing a command. If it detects that the connection is broken, both channels are disconnected and closed, and a reconnect is triggered.


I agree, but for some reason the info channel is not reconnecting, likely because there is no real test that can detect its failure. In a case like this, when the cmd channel goes down, Rocrail should just assume that the info channel is down too.

I will be poking around the code and trying a few things. So far I have made several changes, but none of them has produced a fix.
Richard-TX
 

Postby rjversluis » 02.08.2013, 13:49

Hey,

I committed a fix for it.
I tested this morning with two systems.
Just test 5642...

http://bazaar.launchpad.net/~rocrail-pr ... il/changes
Best Regards, Rob.
rjversluis
Site Admin
 

Postby Richard-TX » 02.08.2013, 16:55

Believe it or not but that is exactly the first thing I tried. It did not fix it. I will be working on it later today.
Richard-TX
 

Postby Richard-TX » 02.08.2013, 20:17

I am having issues with trace messages that I add.

For example:

In srcp/tcpip.c I changed line 210 and added two "!" to the trace message:

---------------------------------
  if (! SocketOp.readln(o->cmdSocket,inbuf) ) {
    TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "SendCommand: could not read response !!");


I never see "!!" appended to any of the messages such as the one below.

20130802.140609.711 r9999E cmdr5C00 OSRCP 0209 SendCommand: could not read response

What am I missing?

I have tried the following to build rocrail.

make
make fromtar
make server

I must be doing something wrong.
Richard-TX
 

Postby rjversluis » 02.08.2013, 22:10

Drop your issue.

I could reproduce your issue and fixed it.
Best Regards, Rob.
rjversluis
Site Admin
 

Postby Richard-TX » 03.08.2013, 03:10

rjversluis wrote:Hey,

I committed a fix for it.
I tested this morning with two systems.
Just test 5642...

http://bazaar.launchpad.net/~rocrail-pr ... il/changes


I did try 5642. It is still failing to reconnect.

Don't work on it, Rob. I am working on it. I will let you know what I come up with. This gives me something to do while I am waiting for some pain relief. I have some bad vertebrae in my neck and they are causing me much arm pain. Learning about the inner workings of Rocrail gives me a much needed distraction.
Richard-TX
 

Postby Richard-TX » 05.08.2013, 13:56

Rob,

I have a fix.

In file rocdigs/impl/srcp/tcpip.c


Here is the addition.


  /* Read server response: */
  if (! SocketOp.readln(o->cmdSocket,inbuf) ) {
    TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "SendCommand: could not read response");

    if (!SocketOp.write( o->infoSocket, szCommand, StrOp.len(szCommand))) {
      TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "Could not write: %s", szCommand );
    }

    SocketOp.disConnect(o->cmdSocket);
    SocketOp.base.del(o->cmdSocket);
    o->cmdSocket = NULL;

    /* rc = SRCPCONNECT_ERROR; */
    return -1;
  }


The idea is that if there is a read error on the CMD channel, then most likely the info channel is lost too. Since the TCP socket on the info channel will never report an error on its own if the srcpd server suddenly dies, a simple write to the info channel will cause the socket to error and close. The Rocrail socket library will take care of the rest.
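
Outside of Rocrail, the same "test the socket with a write" idea can be sketched with plain POSIX calls. This is only an illustration: probe_peer is a hypothetical helper, not something in the Rocrail socket library, and it assumes a platform that supports MSG_NOSIGNAL (Linux does). Note also that on a real TCP connection to a host that has lost power, the first send() usually succeeds locally and the error surfaces only later, after the rebooted host answers with a reset or the retransmissions give up, so a 0 result does not prove the peer is alive.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper, not part of Rocrail: send one byte on fd and
 * report whether the local stack already knows the connection is dead.
 * Returns 0 if the byte was accepted, -1 (errno set) otherwise. */
int probe_peer(int fd)
{
    const char probe = '\n';   /* the payload is arbitrary */
    ssize_t n;

    assert(fd >= 0);
    do {
        /* MSG_NOSIGNAL: report EPIPE via errno instead of raising SIGPIPE. */
        n = send(fd, &probe, 1, MSG_NOSIGNAL);
    } while (n < 0 && errno == EINTR);

    return (n == 1) ? 0 : -1;
}
```

A caller would treat -1 as "drop this socket and reconnect", which is the role the failed SocketOp.write plays in the snippet above.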

This is the only way that I can find to tell the inforeader that the info socket is dead.



What do you think?
Richard-TX
 

Postby rjversluis » 05.08.2013, 14:05

This is the current implementation:
  /* Read server response: */
  if (! SocketOp.readln(o->cmdSocket,inbuf) ) {
    TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "SendCommand: could not read response");
    SocketOp.disConnect(o->cmdSocket);
    SocketOp.base.del(o->cmdSocket);
    o->cmdSocket = NULL;

    if( o->infoSocket != NULL ) {
      SocketOp.disConnect(o->infoSocket);
      SocketOp.base.del(o->infoSocket);
      o->infoSocket = NULL;
    }
    return -1;
  }


The infoSocket is closed too on cmdSocket read error. (Worked OK with my two system test.)
I do not see those lines in your code and I do not see any logic or effect of writing to the infoSocket...
Best Regards, Rob.
rjversluis
Site Admin
 

Postby Richard-TX » 05.08.2013, 15:25

I went through this at great length over the weekend. I know it seems odd, but I can see no other way to inform the inforeader that the info socket it is using is dead and to reestablish communications.

The lines you added did not take care of the issue in its entirety.

There is something about calling SocketOp.disConnect that does not cause inforeader to see the disconnection of the socket. It is my belief that inforeader is sitting on that info socket in a blocking read. Until the OS throws an exception (it never will if the remote server loses power and no write is performed), the inforeader thread will happily sit there listening to a dead socket, quite possibly forever.
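
To picture what a thread stuck in a blocking read looks like, and one conventional way to wake it from another thread, here is a small sketch with plain POSIX sockets and pthreads. It is not Rocrail code: blocked_reader and wake_reader_by_shutdown are made-up names, and what SocketOp.disConnect does internally is not shown in this thread. The sketch relies on Linux behavior, where shutdown() on a socket makes a recv() pending on that same socket return 0.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical stand-in for the inforeader thread: blocks in recv(). */
static void *blocked_reader(void *arg)
{
    int fd = *(const int *)arg;
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof buf, 0);  /* waits for data, EOF or error */
    return (void *)(intptr_t)n;
}

/* Start the reader on fd, then shut the socket down from this thread.
 * Returns what recv() returned in the reader, or -2 if thread setup failed. */
long wake_reader_by_shutdown(int fd)
{
    pthread_t tid;
    void *result = NULL;

    assert(fd >= 0);
    if (pthread_create(&tid, NULL, blocked_reader, &fd) != 0)
        return -2;

    poll(NULL, 0, 100);         /* sleep 100 ms so the reader is likely blocked */
    shutdown(fd, SHUT_RDWR);    /* on Linux the pending recv() now returns 0 */

    if (pthread_join(tid, &result) != 0)
        return -2;
    return (long)(intptr_t)result;
}
```

If the reader never received such a wake-up, it would stay in recv() for as long as the peer stays silent, which matches the behavior described above.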

The same thing can happen to a common telnet session. The remote server can go away (lose power) and telnet just sits there waiting for something to happen. I have seen telnet sessions stay alive for days after the remote server had crashed. It isn't until someone types something within telnet that telnet actually errors and dies. The same sort of thing applies here.

So given the circumstances, there are a few ways to deal with the issue. They are:

1 - change to a non-blocking read
2 - establish a short timeout.
3 - test the socket with a write.

Neither #1 nor #2 is a solution I would consider acceptable. So the next best thing is to perform a socket write, which will cause the OS to throw an error that is handled quite nicely by Rocrail and inforeader.

If you would like a better explanation of this shortcoming of TCP/IP, I can likely find it.

Alternatively, if you can find a way for cmd-recv-error to throw an exception within inforeader, then that might also resolve the issue. It depends on whether inforeader will drop the socket it is currently using and establish a new one when that exception occurs.

My solution is far from elegant but it has no downside that I can find either.

If you need more data, I can provide it.

Rich
Last edited by Richard-TX on 05.08.2013, 15:33, edited 1 time in total.
Richard-TX
 

Postby Richard-TX » 05.08.2013, 15:27

Here is a larger snippet.

------------------------------------------------------------

  /* Read server response: */
  if (! SocketOp.readln(o->cmdSocket,inbuf) ) {
    TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "SendCommand: could not read response");
    if (!SocketOp.write( o->infoSocket, szCommand, StrOp.len(szCommand))) {
      TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "Could not write: %s", szCommand );
    }
    SocketOp.disConnect(o->cmdSocket);
    SocketOp.base.del(o->cmdSocket);
    o->cmdSocket = NULL;

    if( o->infoSocket != NULL ) {
      SocketOp.disConnect(o->infoSocket);
      SocketOp.base.del(o->infoSocket);
      o->infoSocket = NULL;
    }
    return -1;
  }

The actual data written to the info socket could be anything.....a single character would suffice.
Richard-TX
 

Postby rjversluis » 05.08.2013, 15:34

OK, I will add your line in a slightly safer way:
  /* Read server response: */
  if (! SocketOp.readln(o->cmdSocket,inbuf) ) {
    TraceOp.trc( name, TRCLEVEL_EXCEPTION, __LINE__, 9999, "SendCommand: could not read response");
    SocketOp.disConnect(o->cmdSocket);
    SocketOp.base.del(o->cmdSocket);
    o->cmdSocket = NULL;

    if( o->infoSocket != NULL ) {
      /* Trigger the socket to generate an exception. */
      SocketOp.write( o->infoSocket, szCommand, StrOp.len(szCommand));
      SocketOp.disConnect(o->infoSocket);
      SocketOp.base.del(o->infoSocket);
      o->infoSocket = NULL;
    }
    return -1;
  }


I tested the disconnect under Mac OS X, where it did the trick.
Windows will surely behave differently...
Best Regards, Rob.
rjversluis
Site Admin
 

Postby Richard-TX » 05.08.2013, 15:54

Mac, Windows, and Unix/Linux are indeed all a little different.

Having said all that, my testing was on Linux. To come up with a Windows build, I would have to invest a fair amount of time in configuring and installing the cross compiler. I decided to forgo that since I was able to reproduce the issue under Linux (my gold standard, so to speak).

In addition, the Rocpro issue I am intermittently experiencing only occurs under Windows. The Linux version of Rocpro is 100% solid.

For those who don't know how different things can be between operating systems, Google "Mac-Vs-Unix-Vs-Vista".

I will test your version of the fix and let you know how it works.

Rich
Richard-TX
 
