Saturday, March 31, 2012

Three Urgent Problems in isupd

Problem 1:

Mar 26 02:21:03 v3 sangoma_isupd[15421]: F:sb_cmm.c:get_new_overall_rte_status:illegal status, should be DAVA:tg/sg0-stat/sg1-stat/overall-stat:0:2:1:1

We are sure this one is fixed in isupd 2.6.1 patch 8.

Patch 8 (2012-03-30 Fr)
  * when heartbeat is lost to SG in simplex or both SG in duplex, M3UA route
    is set to DUNA; patch also sets individual SG route status to DUNA which
    prevents failure of a status check elsewhere in the code

Problem 2:

*** glibc detected *** /usr/local/ss7box/sangoma_isupd: double free or corruption (out): 0x08dd90a8 ***
======= Backtrace: =========
/lib/libc.so.6[0x83a6c5]
/lib/libc.so.6(cfree+0x59)[0x83ab09]
/usr/local/ss7box/sangoma_isupd[0x804d7e1]
/usr/local/ss7box/sangoma_isupd[0x804992d]
/lib/libc.so.6(__libc_start_main+0xdc)[0x7e6e9c]
/usr/local/ss7box/sangoma_isupd[0x80494b1]

We searched the code for a while, but the paths through the code were long, so we decided to add a usage semaphore to each timer. It lets us detect double freeing of timer memory and dump info to the log to help us track down the problem. The timer semaphore is in isupd 2.6.1 patch 9.

Patch 9 (2012-03-31 Sa)
  * added a semaphore to timers to test for double freeing of memory for timers
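
The idea behind the timer semaphore can be sketched as follows. This is a minimal illustration, not the isupd code: the names (demo_timer, demo_timer_alloc, demo_timer_release) are hypothetical, and it assumes timers come from a fixed pool so the usage flag is still readable after a release. A second release of the same timer is logged and refused instead of corrupting memory.

```c
#include <stdio.h>
#include <stddef.h>

#define DEMO_MAX_TIMERS 8       /* hypothetical pool size */

typedef struct {
    int in_use;                 /* usage flag: 1 while allocated, 0 once released */
    unsigned id;                /* slot number, useful in log output */
} demo_timer;

static demo_timer demo_pool[DEMO_MAX_TIMERS];

/* Hand out the first free slot, or NULL when the pool is exhausted. */
demo_timer *demo_timer_alloc(void)
{
    for (unsigned i = 0; i < DEMO_MAX_TIMERS; i++) {
        if (!demo_pool[i].in_use) {
            demo_pool[i].in_use = 1;
            demo_pool[i].id = i;
            return &demo_pool[i];
        }
    }
    return NULL;
}

/* Release a timer. Returns 0 on a normal release and -1 when the timer
 * was already released (or is NULL); the caller tag goes to the log so
 * the offending code path can be identified. */
int demo_timer_release(demo_timer *t, const char *caller)
{
    if (t == NULL)
        return -1;
    if (!t->in_use) {
        fprintf(stderr, "timer %u: double release detected, caller=%s\n",
                t->id, caller);
        return -1;
    }
    t->in_use = 0;
    return 0;
}
```

Passing a caller tag (for example the function name) to every release is what makes the log entry useful: the second release identifies itself.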

Problem 3:

CQMs (Circuit Query Messages) that cross T1 span boundaries cause improper responses, which in turn cause loss of circuits. A restart of isupd is required to recover the lost circuits. The CQM arrives nightly at several locations. We are devising a multi-step solution. The first step is a work-around to prevent circuit loss as quickly as possible: create a response, based on the incoming CQM, that always reports the indicated circuits as being in working order.
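
A rough sketch of that first step, under stated assumptions: the function name demo_build_cqr_states is hypothetical, the range field is taken to cover range + 1 circuits, and the "working order" octet 0x0C (idle, no blocking) is our reading of the ITU-T Q.763 circuit state indicator encoding; the ANSI encoding may differ. It only fills the per-circuit state octets and does not build a complete response message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: 0x0C means "idle, no maintenance or hardware blocking"
 * in the circuit state indicator; verify against the variant in use. */
#define DEMO_STATE_IDLE_UNBLOCKED 0x0Cu

/* Write one state octet per circuit named by the incoming CQM.
 * range_field is the range value from the CQM, assumed to cover
 * range_field + 1 circuits starting at the CIC in the message.
 * Returns the number of octets written, or 0 if out is too small. */
size_t demo_build_cqr_states(uint8_t range_field, uint8_t *out, size_t out_len)
{
    size_t count = (size_t)range_field + 1;

    if (out == NULL || out_len < count)
        return 0;

    /* Every circuit is reported the same way, regardless of which
     * T1 span it sits on, so a span boundary no longer matters. */
    memset(out, DEMO_STATE_IDLE_UNBLOCKED, count);
    return count;
}
```

Because the answer is derived only from the range in the incoming message, it is a stopgap: circuits that are actually busy or blocked are still reported as idle.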

Friday, March 23, 2012

How To Read the CDR Log File



Here is an example with fake phone numbers:

1001, in, 1, 2012, 03, 23, 16, 02, 22, 1332543742, 499660, 0, 0, 15, 0, , 0, , 0
128, out, 1, 2012, 03, 23, 16, 02, 22, 1332543742, 499692, 0, 0, 15, 10, 1112223333, 10, 4445556666, 0
129, in, 1, 1332543742, 758203, 0, 0, 15
1006, out, 1, 1332543742, 758220, 0, 0, 15
80, unrecognized
1044, unrecognized
132, in, 1, 1332543742, 758427, 0, 0, 15
1009, out, 1, 1332543742, 758441, 0, 0, 15
1012, in, 1, 1332543789, 456939, 0, 0, 15, 16
133, out, 1, 1332543789, 456956, 0, 0, 15
134, in, 1, 1332543789, 464574, 0, 0, 15
1016, out, 1, 1332543789, 464594, 0, 0, 15

Note the 2 unrecognized codes (80 and 1044) - we still have to figure these out.


The code is the first number on each line; 3-digit codes are MGD messages; 4-digit codes are SS7 messages. The layout is as follows:

Call start events (IAM 1001, callstart 128):

        sprintf (s_cdr, "%u, %s, %u, %s, %lu, %lu, %u, %u, %u, %u, %s, %u, %s, %u\n",
                        p_cdr->code,
                        p_cdr->msg_direction ? "in" : "out",
                        p_cdr->sysid,
                        p_ds,
                        p_cdr->timestamp.tv_sec,
                        p_cdr->timestamp.tv_usec,
                        p_cdr->call_setup_id,
                        p_cdr->span,
                        p_cdr->chan,
                        p_cdr->called_number_dig_count,
                        p_cdr->called_number_digits,
                        p_cdr->calling_number_dig_count,
                        p_cdr->calling_number_digits,
                        p_cdr->calling_number_presentation_indicator
                        );

Call stop events (REL 1012):

        sprintf (s_cdr, "%u, %s, %u, %lu, %lu, %u, %u, %u, %u\n",
                        p_cdr->code,
                        p_cdr->msg_direction ? "in" : "out",
                        p_cdr->sysid,
                        p_cdr->timestamp.tv_sec,
                        p_cdr->timestamp.tv_usec,
                        p_cdr->call_setup_id,
                        p_cdr->span,
                        p_cdr->chan,
                        p_cdr->release_cause);

Simple events - most of the entries (ACM 1006, ANS 1009, RLC 1016):

        sprintf (s_cdr, "%u, %s, %u, %lu, %lu, %u, %u, %u\n",
                        p_cdr->code,
                        p_cdr->msg_direction ? "in" : "out",
                        p_cdr->sysid,
                        p_cdr->timestamp.tv_sec,
                        p_cdr->timestamp.tv_usec,
                        p_cdr->call_setup_id,
                        p_cdr->span,
                        p_cdr->chan);
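
Going the other way, here is a minimal sketch of reading these records back. It is not part of ss7box; the names (demo_cdr_event, demo_parse_cdr_event and friends) are hypothetical. It handles only the simple and call stop layouts, telling them apart from call start records by counting commas, and it assumes the 3-digit/4-digit rule above to separate MGD codes from SS7 codes.

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned code;
    char direction[8];          /* "in" or "out" */
    unsigned sysid;
    unsigned long sec;          /* timestamp.tv_sec */
    unsigned long usec;         /* timestamp.tv_usec */
    unsigned call_setup_id;
    unsigned span;
    unsigned chan;
    int has_release_cause;      /* 1 for call stop records */
    unsigned release_cause;
} demo_cdr_event;

static int demo_count_commas(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (*s == ',')
            n++;
    return n;
}

/* Parse a simple (8 fields) or call stop (9 fields) record.
 * Returns 0 on success, -1 for anything else, including call start
 * records and "unrecognized" lines. */
int demo_parse_cdr_event(const char *line, demo_cdr_event *ev)
{
    int commas = demo_count_commas(line);
    int n;

    if (commas != 7 && commas != 8)
        return -1;

    memset(ev, 0, sizeof *ev);
    n = sscanf(line, "%u, %7[^,], %u, %lu, %lu, %u, %u, %u, %u",
               &ev->code, ev->direction, &ev->sysid, &ev->sec, &ev->usec,
               &ev->call_setup_id, &ev->span, &ev->chan, &ev->release_cause);
    if (n != commas + 1)
        return -1;

    ev->has_release_cause = (n == 9);
    return 0;
}

/* 4-digit codes are SS7 messages, 3-digit codes are MGD messages. */
int demo_is_ss7_code(unsigned code)
{
    return code >= 1000;
}

/* Elapsed time in seconds from event a to event b. */
double demo_seconds_between(const demo_cdr_event *a, const demo_cdr_event *b)
{
    return ((double)b->sec - (double)a->sec)
         + ((double)b->usec - (double)a->usec) / 1e6;
}
```

Applied to the example log, parsing the ANS record (1009) and the REL record (1012) and passing them to demo_seconds_between gives the answered duration of the call, a little under 47 seconds.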

Thursday, March 15, 2012

Redundant ss7box Refinements

The redundant ss7box is being constructed in the lab once again, with a new twist: simulating an IP network with long and diverse links. We need to simulate the loss of an IP link between a remote Asterisk box and ss7box - something that will be more likely in a network that spans a very large region. Using two distinctly different IP carriers for the IP connections to each ss7box ensures link diversity.


In this new configuration we'll put a small IP switch between the two mated ss7boxes and an Asterisk box. To test IP link loss we'll pull the IP cable between one of the ss7boxes and the IP switch. The ss7box that lost the IP link should redirect traffic bound for that link over the crosslink to its mate ss7box, and the Asterisk box that lost the IP link should remap SLS to use its remaining in-service IP link to the other ss7box. There will be a small window where signals could get lost during the transition. Calls with lost signals will time out and be cleared, and the call parties will experience abrupt call termination indications. Detecting loss of an IP link is not as prompt as detecting loss of an SS7 link - IP is weak in this area. We'll help the situation by using a better ping-pong protocol on the IP links to indicate link loss.
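
Since none of this exists yet, the following is only a sketch of the two pieces on the Asterisk side: declaring a link down after a number of missed pongs, and remapping SLS onto the surviving link. All names, the threshold of 3 missed pongs, and the even/odd SLS split across the two links are assumptions made for illustration.

```c
#include <stddef.h>

#define DEMO_MAX_MISSED 3       /* hypothetical: missed pongs before link is down */

typedef struct {
    int up;                     /* 1 while the IP link is considered in service */
    unsigned missed;            /* consecutive ping intervals without a pong */
} demo_ip_link;

/* Call once per ping interval. got_pong is nonzero when a pong arrived
 * since the previous call. Returns 1 when the link changed state. */
int demo_link_tick(demo_ip_link *l, int got_pong)
{
    int was_up = l->up;

    if (got_pong) {
        l->missed = 0;
        l->up = 1;
    } else {
        if (l->missed < DEMO_MAX_MISSED)
            l->missed++;
        if (l->missed >= DEMO_MAX_MISSED)
            l->up = 0;
    }
    return l->up != was_up;
}

/* Choose the IP link for a given SLS. Even SLS values prefer link 0 and
 * odd values prefer link 1; if the preferred link is down, the traffic
 * is remapped to the other one. Returns the link index, or -1 when
 * both links are down. */
int demo_pick_link(const demo_ip_link links[2], unsigned sls)
{
    int preferred = (int)(sls & 1u);

    if (links[preferred].up)
        return preferred;
    if (links[1 - preferred].up)
        return 1 - preferred;
    return -1;
}
```

A shorter ping interval or a lower threshold narrows the window in which signals are lost, at the cost of more false link-down indications on a congested network.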


The functionality described above does not exist yet, so the plan is: set up the test network; make test calls before making improvements and confirm that half of all calls do not complete when an IP link is lost; make the appropriate code changes; then rerun the call tests under IP link loss and restore conditions. We'll release this functionality in a major revision release - probably 2.7.


Here's a scan of the drawing we are using to build the lab network:




We use Google Docs spreadsheets for the configuration.


Progress:
  1. The SS7 link between 1002 and the new mated 159 ss7box at 192.168.1.62 is up.
  2. The SIP client on the Asterisk box on the 159 cluster has to be rebuilt. The laptop it was running on lost its HDD. The HDD was replaced and Mint 12 was installed last week; it was using Blink and XP previously. We need to find and test a suitable SIP client that works on Mint 12 and a Dell Vostro 1000.
    - linphone is the first candidate, and it works - tested on the 1003 node
  3. Write up the linphone and Asterisk configuration.

This is the link report from the 1002 ss7boxd that shows 3 SS7 links up:


Mar 15 11:34:27 ana156 ss7boxd[7056]: R:link util:ls 0:link 0:msu oc 26:tot oc 161840:util 0

Mar 15 11:34:27 ana156 ss7boxd[7056]: R:link util:ls 1:link 0:msu oc 34:tot oc 161840:util 0

Mar 15 11:34:27 ana156 ss7boxd[7056]: R:link util:ls 1:link 1:msu oc 34:tot oc 161920:util 0





Wednesday, March 14, 2012

ss7boxd 2.6.0.13 Released

This minor ss7boxd release makes explicit the need to use the 0.9 revision of the ss7box.conf configuration file, which is created by the smgcfg09.py program. Some 2.6.0.12 ss7boxd releases used 0.8 versions of the config file, while later ones used the 0.9 revision. Sorry for the confusion.

You can download them from here:

http://www.ss7box.com/tmp/ss7boxd-2.6.0.13-ANSI
http://www.ss7box.com/tmp/ss7boxd-2.6.0.13-ITU

The difference between the 0.8 and 0.9 revision conf files is described in the Change History inside the smgcfg09.py file. The only difference for ss7box.conf is the revision number in the conf file. ss7boxd 2.6.0.13 looks for rev 0.9 in the ss7box.conf file.