Saturday, March 31, 2012

Three Urgent Problems in isupd

Problem 1:

Mar 26 02:21:03 v3 sangoma_isupd[15421]: F:sb_cmm.c:get_new_overall_rte_status:illegal status, should be DAVA:tg/sg0-stat/sg1-stat/overall-stat:0:2:1:1

We are sure this one is fixed in the isupd 2.6.1 patch 8.

Patch 8 (2012-03-30 Fr)
  * when heartbeat is lost to SG in simplex or both SG in duplex, M3UA route
    is set to DUNA; patch also sets individual SG route status to DUNA which
    prevents failure of a status check elsewhere in the code

Problem 2:

*** glibc detected *** /usr/local/ss7box/sangoma_isupd: double free or corruption (out): 0x08dd90a8 ***
======= Backtrace: =========
/lib/libc.so.6[0x83a6c5]
/lib/libc.so.6(cfree+0x59)[0x83ab09]
/usr/local/ss7box/sangoma_isupd[0x804d7e1]
/usr/local/ss7box/sangoma_isupd[0x804992d]
/lib/libc.so.6(__libc_start_main+0xdc)[0x7e6e9c]
/usr/local/ss7box/sangoma_isupd[0x80494b1]

We searched the code for a while, but the code paths were long, so we decided to add a usage semaphore to each timer. It lets us detect double freeing of timer memory and dump information to the log to help us track down the problem. The timer semaphore is in isupd 2.6.1 patch 9.

Patch 9 (2012-03-31 Sa)
  * added a semaphore to timers to test for double freeing of memory for timers
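
A minimal sketch of the usage-guard idea follows. It is not the isupd code: the structure, function names, and pool size are hypothetical. It also swaps malloc/free for a static pool, because checking a flag inside heap memory that has already been freed would itself be unsafe.

```c
#include <stdio.h>

#define TIMER_POOL_SIZE 64   /* hypothetical pool size */

/* Hypothetical timer record; the real isupd timer structure is not shown. */
struct guarded_timer {
    int in_use;          /* usage guard: 1 while allocated, 0 while free */
    unsigned owner_tag;  /* caller-supplied id, logged to help find the culprit */
};

static struct guarded_timer timer_pool[TIMER_POOL_SIZE];

/* Take a free slot from the pool; returns NULL when the pool is exhausted. */
struct guarded_timer *timer_acquire(unsigned owner_tag)
{
    for (int i = 0; i < TIMER_POOL_SIZE; i++) {
        if (timer_pool[i].in_use == 0) {
            timer_pool[i].in_use = 1;
            timer_pool[i].owner_tag = owner_tag;
            return &timer_pool[i];
        }
    }
    return NULL;
}

/* Release a timer. Returns 0 on success, or -1 if the pointer is NULL or the
 * timer was already free (a double release). A double release is logged
 * rather than acted on. */
int timer_release(struct guarded_timer *t, const char *caller)
{
    if (t == NULL)
        return -1;
    if (t->in_use != 1) {
        fprintf(stderr, "timer double release: slot %ld, last owner %u, caller %s\n",
                (long)(t - timer_pool), t->owner_tag, caller);
        return -1;
    }
    t->in_use = 0;
    return 0;
}
```

The second release of the same timer returns -1 and leaves a log line naming the caller, which is the kind of evidence we are after.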

Problem 3:

CQMs that cross T1 span boundaries cause improper responses that result in loss of circuits. A restart of isupd is required to recover the lost circuits. The CQM arrives nightly at several locations. We are devising a multi-step solution. The first step is a work-around to prevent circuit loss as quickly as possible: create a response based on the incoming CQM that always reports the indicated circuits as being in working order.
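
The work-around could be sketched roughly as below. This is not the isupd implementation: the function name is hypothetical, and the state value 0x0C (idle, no blocking) and the "range N covers N + 1 circuits" rule are our reading of the ITU-T Q.763 circuit state indicator and range encoding, which should be checked against the ANSI variant in use.

```c
#include <stddef.h>
#include <string.h>

/* Assumed value: circuit state indicator for "idle, no blocking"
 * (call processing state bits DC = 11, all blocking bits 0). */
#define CSI_IDLE_UNBLOCKED 0x0C

/* Fill `states` with one circuit state indicator octet per circuit named by
 * the CQM range field. A range value of N is assumed to cover N + 1 circuits.
 * Returns the number of octets written, or -1 if `states` is NULL or too small. */
int build_cqr_all_ok(unsigned char range, unsigned char *states, size_t cap)
{
    size_t count = (size_t)range + 1;

    if (states == NULL || cap < count)
        return -1;
    memset(states, CSI_IDLE_UNBLOCKED, count);
    return (int)count;
}
```

Because the reply is built only from the range in the incoming CQM, it does not matter whether the queried circuits sit in one T1 span or straddle two.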

Friday, March 23, 2012

How To Read the CDR Log File



Here is an example with fake phone numbers:

1001, in, 1, 2012, 03, 23, 16, 02, 22, 1332543742, 499660, 0, 0, 15, 0, , 0, , 0
128, out, 1, 2012, 03, 23, 16, 02, 22, 1332543742, 499692, 0, 0, 15, 10, 1112223333, 10, 4445556666, 0
129, in, 1, 1332543742, 758203, 0, 0, 15
1006, out, 1, 1332543742, 758220, 0, 0, 15
80, unrecognized
1044, unrecognized
132, in, 1, 1332543742, 758427, 0, 0, 15
1009, out, 1, 1332543742, 758441, 0, 0, 15
1012, in, 1, 1332543789, 456939, 0, 0, 15, 16
133, out, 1, 1332543789, 456956, 0, 0, 15
134, in, 1, 1332543789, 464574, 0, 0, 15
1016, out, 1, 1332543789, 464594, 0, 0, 15

Note the two unrecognized codes (80 and 1044). We still have to figure these out.


The code is the first number on each line: 3-digit codes are MGD messages and 4-digit codes are SS7 messages. The layout is as follows:

Call start events (IAM 1001, callstart 128):

        sprintf (s_cdr, "%u, %s, %u, %s, %lu, %lu, %u, %u, %u, %u, %s, %u, %s, %u\n",
                        p_cdr->code,
                        p_cdr->msg_direction ? "in" : "out",
                        p_cdr->sysid,
                        p_ds,
                        p_cdr->timestamp.tv_sec,
                        p_cdr->timestamp.tv_usec,
                        p_cdr->call_setup_id,
                        p_cdr->span,
                        p_cdr->chan,
                        p_cdr->called_number_dig_count,
                        p_cdr->called_number_digits,
                        p_cdr->calling_number_dig_count,
                        p_cdr->calling_number_digits,
                        p_cdr->calling_number_presentation_indicator
                        );

Call stop events (REL 1012):

        sprintf (s_cdr, "%u, %s, %u, %lu, %lu, %u, %u, %u, %u\n",
                        p_cdr->code,
                        p_cdr->msg_direction ? "in" : "out",
                        p_cdr->sysid,
                        p_cdr->timestamp.tv_sec,
                        p_cdr->timestamp.tv_usec,
                        p_cdr->call_setup_id,
                        p_cdr->span,
                        p_cdr->chan,
                        p_cdr->release_cause);

Simple events - most of the entries (ACM 1006, ANS 1009, RLC 1016):

        sprintf (s_cdr, "%u, %s, %u, %lu, %lu, %u, %u, %u\n",
                        p_cdr->code,
                        p_cdr->msg_direction ? "in" : "out",
                        p_cdr->sysid,
                        p_cdr->timestamp.tv_sec,
                        p_cdr->timestamp.tv_usec,
                        p_cdr->call_setup_id,
                        p_cdr->span,
                        p_cdr->chan);
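
Going the other way, a log reader could pull the fields of a simple event back out with sscanf. The sketch below is illustrative only: the struct and function names are hypothetical, and it covers only the simple-event layout. Call start lines (1001, 128) carry a date string after sysid and need their own format, and the release cause on call stop lines is not read here, so a caller should dispatch on the code first.

```c
#include <stdio.h>

/* Hypothetical holder for the fields of a "simple event" CDR line. */
struct cdr_simple {
    unsigned code;
    char direction[8];       /* "in" or "out" */
    unsigned sysid;
    unsigned long sec;       /* timestamp.tv_sec */
    unsigned long usec;      /* timestamp.tv_usec */
    unsigned call_setup_id;
    unsigned span;
    unsigned chan;
};

/* Parse one simple-event line. Returns 0 on success, -1 if the line does not
 * yield all eight fields (for example the "unrecognized" lines). */
int cdr_parse_simple(const char *line, struct cdr_simple *out)
{
    int n = sscanf(line, "%u, %7[^,], %u, %lu, %lu, %u, %u, %u",
                   &out->code, out->direction, &out->sysid,
                   &out->sec, &out->usec,
                   &out->call_setup_id, &out->span, &out->chan);
    return n == 8 ? 0 : -1;
}

/* Per the layout above: 3-digit codes are MGD messages, 4-digit codes are
 * SS7 messages. Anything else is reported as unknown. */
const char *cdr_code_family(unsigned code)
{
    if (code >= 100 && code <= 999)
        return "MGD";
    if (code >= 1000 && code <= 9999)
        return "SS7";
    return "unknown";
}
```

For example, the ACM line `1006, out, 1, 1332543742, 758220, 0, 0, 15` parses into code 1006, direction "out", and channel 15, while `80, unrecognized` is rejected.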

Thursday, March 15, 2012

Redundant ss7box Refinements

The redundant ss7box is being constructed in the lab once again, with a new twist: simulating an IP network with long and diverse links. We need to simulate the loss of an IP link between a remote Asterisk box and ss7box, something that will be more likely in a network that spans a very large region. Using two distinctly different IP carriers for the IP connections to each ss7box ensures link diversity.


In this new configuration we'll put a small IP switch between the two mated ss7boxes and an Asterisk box. To test IP link loss we'll pull the IP cable between one of the ss7boxes and the IP switch. The ss7box that lost the IP link should redirect the traffic for that IP link over the crosslink to its mate ss7box, and the Asterisk box that lost the IP link should remap SLS to use its remaining in-service IP link to the other ss7box. There will be a small window where signals could get lost during the transition. Calls with lost signals will time out and be cleared. The call parties will experience abrupt call termination indications. Detecting loss of an IP link is not as prompt as detecting loss of an SS7 link; IP is weak in this area. We'll help the situation by using a better ping-pong protocol on the IP links to indicate link loss.
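
One way the ping-pong link monitor might look is sketched below. None of this is taken from ss7boxd: the names, the per-interval tick model, and the miss threshold are assumptions made for illustration.

```c
/* Hypothetical per-link heartbeat state; names and thresholds are
 * illustrative only. */
struct hb_link {
    unsigned missed;      /* consecutive intervals with no pong */
    unsigned max_missed;  /* declare the link down at this many misses (>= 1) */
    int pong_seen;        /* 1 if a pong arrived since the last tick */
    int up;               /* 1 = in service, 0 = declared down */
};

void hb_init(struct hb_link *l, unsigned max_missed)
{
    l->missed = 0;
    l->max_missed = max_missed;
    l->pong_seen = 1;     /* grace for the first interval, before any ping */
    l->up = 1;
}

/* Call when a pong arrives on the link. */
void hb_on_pong(struct hb_link *l)
{
    l->pong_seen = 1;
}

/* Call once per ping interval, just before sending the next ping.
 * Returns 1 on the tick where the link goes down, -1 on the tick where it
 * comes back, and 0 otherwise, so the caller can reroute or restore traffic
 * exactly once per transition. */
int hb_on_tick(struct hb_link *l)
{
    int was_up = l->up;

    if (l->pong_seen) {
        l->missed = 0;
        l->up = 1;
    } else if (l->missed < l->max_missed) {
        l->missed++;
    }
    l->pong_seen = 0;

    if (l->missed >= l->max_missed)
        l->up = 0;

    if (was_up && !l->up)
        return 1;
    if (!was_up && l->up)
        return -1;
    return 0;
}
```

With a short ping interval and a small threshold, the detection window is roughly the interval times the threshold. That is the knob that trades false alarms against the number of signals lost during the transition.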


The functionality described above does not exist yet. The plan is to set up the test network; make test calls before making improvements and confirm that half of all calls do not complete when an IP link is lost; make the appropriate code changes; then rerun the call tests under IP link loss and restore conditions. We'll release this functionality in a major revision release, probably 2.7.


Here's a scan of the drawing we are using to build the lab network:




We use Google Doc spreadsheets for the configuration.


Progress:
  1. The ss7 link between 1002 and the new mated 159 ss7box at 192.168.1.62 is up.
  2. The SIP client on the Asterisk box on the 159 cluster has to be rebuilt. The laptop it was running on lost its HDD. The HDD was replaced and Mint 12 was installed last week. Was using Blink and XP previously.  Will need to find and test a suitable SIP client that works on Mint 12 and a Dell Vostro 1000.
    - looks like linphone is the first candidate; and it works too - tested on the 1003 node
  3. Write up the linphone and Asterisk configuration.

This is the link report from the 1002 ss7boxd that shows 3 ss7 links up:


Mar 15 11:34:27 ana156 ss7boxd[7056]: R:link util:ls 0:link 0:msu oc 26:tot oc 161840:util 0
Mar 15 11:34:27 ana156 ss7boxd[7056]: R:link util:ls 1:link 0:msu oc 34:tot oc 161840:util 0
Mar 15 11:34:27 ana156 ss7boxd[7056]: R:link util:ls 1:link 1:msu oc 34:tot oc 161920:util 0





Wednesday, March 14, 2012

ss7boxd 2.6.0.13 Released

Announcing a minor ss7boxd release that makes explicit the need to use the 0.9 revision of the ss7box.conf configuration file, which is created by the smgcfg09.py program. Some 2.6.0.12 releases of ss7boxd used 0.8 versions of the config file, while later releases used the 0.9 revision. Sorry for the confusion.

You can download them from here:

http://www.ss7box.com/tmp/ss7boxd-2.6.0.13-ANSI
http://www.ss7box.com/tmp/ss7boxd-2.6.0.13-ITU

The difference between 0.8 and 0.9 revision conf files is described in the Change History inside the smgcfg09.py file.  The only difference for ss7box.conf is the change to the revision number in the conf file.  ss7boxd 2.6.0.13 is looking for rev 0.9 in the ss7box conf file.

Thursday, December 15, 2011

Wanpipe Install Problem Fixed


Got this problem. Fixed it. Don't think it's important. It took a lot of time....wasted time....to figure all of this out.

Compiling WANPIPE API Development Utilities ...Failed!

        ERROR: Failed to compile WANPIPE API Tools !!!
        Please contact support at Sangoma Technologies
        email: techdesk@sangoma.com
        Please include the file setup_drv_compile.log


Let's see if we can get some detail:

[root@ana64 api]# cd /usr/src/Sangoma/wanpipe/api


[root@ana64 api]# make
make -C tdm_api
make[1]: Entering directory `/usr/src/Sangoma/wanpipe-3.5.12/api/tdm_api'
Ok.
make[1]: Leaving directory `/usr/src/Sangoma/wanpipe-3.5.12/api/tdm_api'
make -C legacy
make[1]: Entering directory `/usr/src/Sangoma/wanpipe-3.5.12/api/legacy'
make -C x25 all  APIINC=/usr/include/wanpipe
make[2]: Entering directory `/usr/src/Sangoma/wanpipe-3.5.12/api/legacy/x25'
Ok.
make[2]: Leaving directory `/usr/src/Sangoma/wanpipe-3.5.12/api/legacy/x25'
make -C chdlc all  APIINC=/usr/include/wanpipe
make[2]: Entering directory `/usr/src/Sangoma/wanpipe-3.5.12/api/legacy/chdlc'
cc -Wall -O2 -D__LINUX__ -D_DEBUG_=2 -D_GNUC_ -I../lib -I/usr/include/wanpipe -o chdlc_modem_cmd chdlc_modem_cmd.c ../lib/lib_api.c
chdlc_modem_cmd.c: In function 'handle_socket':
chdlc_modem_cmd.c:412: error: 'wp_api_hdr_t' has no member named 'error_flag'
make[2]: *** [chdlc_modem_cmd] Error 1
make[2]: Leaving directory `/usr/src/Sangoma/wanpipe-3.5.12/api/legacy/chdlc'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/Sangoma/wanpipe-3.5.12/api/legacy'
make: *** [all] Error 2


The problem is with legacy chdlc, which we are not using, so it can most likely be ignored. Nevertheless, the fix follows. First, we look for a replacement for the offending "error_flag" field, which is not defined.

[root@ana64 api]# grep -r "wp_api_hdr_t\;" ../* | grep "\.h\:"
grep: warning: ../patches/kdrivers/include/linux: recursive directory loop

../patches/kdrivers/include/wanpipe_api_hdr.h:} wp_api_hdr_t;
grep: warning: ../patches/kdrivers/wanec/linux: recursive directory loop

[root@ana64 api]# vi ../patches/kdrivers/include/wanpipe_api_hdr.h
[root@ana64 api]# vim ../patches/kdrivers/include/wanpipe_api_hdr.h

This looks promising:

/* CHDLC Old backdward comptabile */
#define wp_api_rx_hdr_chdlc_error_flag                  wp_api_rx_hdr_error_flag

Let's apply a change:


[root@ana64 api]# cd /usr/src/Sangoma/wanpipe/api/legacy/chdlc/
[root@ana64 chdlc]#

Create a file called "patch" and fill it with the following:



--- chdlc_modem_cmd.c   2011-12-15 17:05:20.000000000 -0500
+++ chdlc_modem_cmd.c.chg       2011-12-15 17:16:06.000000000 -0500
@@ -409,7 +409,7 @@
                                                return;
                                        }

-                                       switch (api_rx_el->api_rx_hdr.error_flag){
+                                       switch (api_rx_el->api_rx_hdr.wp_api_rx_hdr_error_flag){

                                        case 0:
                                                /* Rx packet is good */

Apply the patch:


[root@ana64 chdlc]# patch --ignore-whitespace < patch
patching file chdlc_modem_cmd.c

Now compile the api:

[root@ana64 chdlc]# cd /usr/src/Sangoma/wanpipe/api
[root@ana64 api]# make

Problem should be gone.  There will be tons of warnings depending on the gcc version you are using. As long as you don't see "error" in the output it should be fine.

Tuesday, December 13, 2011

Cluster Configuration Needs To Improve

Adding a cluster node to the lab this morning.  The lab is currently working using older versions of ss7box/SMG.  This configuration needs to remain intact.  The new cluster being added will be the development tip alpha test site.

It takes a lot of coordinated data to make it work because that's how the protocol works.  The smgcfg tool attempted to simplify the task and it did to some extent but plenty of feedback says we can do better.

Using a spreadsheet helps to make things visual and colorful, but downloading .csv and running smgcfgXX against it is kludgy. Two tools and manually pushing files is not good.  I'm getting reintroduced to the problem this morning.

A better approach would be to use a single tool that is aware of a group of nodes and a library of configurations. The user interface needs to be efficient for users that want to use vi to edit a source file, a command line for lean systems with no X11 stuff loaded, and a command line interface that accommodates a web interface. A diagrammatic interface showing nodes and connections that allows click-to-query-or-modify would be helpful.

All of these interfaces should be supported interchangeably. For example, if one person wants to use vi on a source file and another wants to use the CLI or web interface (not at the same time), then it should be possible, because the CLI is a specialized source file editor and the web interface uses the CLI. Of course, using vi to edit the source file could screw things up if the format is disturbed. Ideally the CLI will not have this problem. Furthermore, the CLI would have prompts like: add a link to a linkset, or add a trunk to a trunkgroup, or a powerful add a node. These prompts would lead the user through the collection of information needed. Maybe the tool could generate a graphical representation from the input data as a precursor to using a graphical representation as an input.

Here's something interesting.  Whatever gets built, most of it is general purpose for all SS7 networks regardless of what equipment or protocol is being used.  Distinctions about specific equipment like ss7boxd, isupd, and sccpd are made in the final steps where specific conf files are created and pushed or pulled from specific nodes.  Sounds like an open source project.

Saturday, December 10, 2011

Lab Expansion Problem and Fix



The lab is expanding to support cluster configuration testing again. We use old versions of DAHDI, Asterisk and Linux because we only like change in our own code. So when we upgraded a CentOS 5 system, we ran into the following problem:

  CC [M]  /usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp/card_bri.o
In file included from /usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp/xpd.h:31,
                 from /usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp/card_bri.c:29:
include/linux/device.h:407: error: expected identifier or ‘(’ before ‘const’
make[3]: *** [/usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp/card_bri.o] Error 1
make[2]: *** [/usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp] Error 2
make[1]: *** [_module_/usr/src/dahdi-linux-2.2.0.1/drivers/dahdi] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.18-274.12.1.el5-PAE-i686'
make: *** [modules] Error 2


After trying to solve the problem as though something was missing, we had the insight that maybe something was being redefined. CentOS backports lots of stuff into its 2.6.18 kernel and, at the same time, DAHDI does some of its own backporting. We think we found duplicate backporting of the same item. The backported item in the DAHDI package was eliminated and the problem went away. The patch is as follows:

--- /usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp/xdefs-orig.h 2011-12-10 11:41:12.000000000 -0500
+++ /usr/src/dahdi-linux-2.2.0.1/drivers/dahdi/xpp/xdefs.h      2011-12-10 11:23:25.000000000 -0500
@@ -139,7 +139,7 @@
                ssize_t name(struct device_driver *drv, char * buf)

 #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,26)
-#define dev_name(dev)          (dev)->bus_id
+//#define dev_name(dev)                (dev)->bus_id
 #define dev_set_name(dev, format, ...) \
        snprintf((dev)->bus_id, BUS_ID_SIZE, format, ## __VA_ARGS__);
 #endif

Quite a relief. Now on to installing the media and signal gateway applications, drivers, and patches. We are creating a four-box cluster configuration with duplex signal gateways and two media gateways. We'll start with a two-box configuration composed of a signal gateway and a media gateway. We'll suspend system growth so that we can use this new system to establish the current functionality of the call detail recording system in isupd. After this assessment we'll plan to fix any deficiencies in the CDR system. Then we'll add another media gateway to the system. After that, we'll detail the steps needed to convert an in-service simplex signal system cluster to a duplex signal system cluster. Finally, we'll repeat the process at a live commercial 14-box operation.