Today was one of those days. First, the two NSMXpress appliances (version 2008.2r2) failed yesterday: the client GUI could no longer connect, although the web interface and SSH connections worked fine. I picked one up for examination, and since I had some *cough*good*cough* experiences a while back, I assumed the latest software had some undocumented bug.
A reset to factory defaults (version 2007.3r1) worked fine, but certain hardware requires the 2008 version. So I upgraded the appliance (again) and found (while waiting) that the security certificate used between the NSM server and the client GUI had expired on July 20th, 2009. So someone forgot to update the certificates in the 2008.2r2 software.
After fixing that, the client GUI worked like a charm.
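An expired certificate like this one is easy to spot once you think to look at the expiry date. As a minimal sketch (not anything NSM ships), the Python below computes how many days are left on a certificate from its notAfter timestamp. The function name and the exact timestamp string are my own assumptions; only the date (July 20th, 2009) comes from the story above.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days from `now` until the certificate expires.

    `not_after` uses the notAfter format found in certificates,
    e.g. "Jul 20 00:00:00 2009 GMT". A negative result means expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


# Hypothetical notAfter value: the time of day is assumed, the date is from the post.
not_after = "Jul 20 00:00:00 2009 GMT"
remaining = days_until_expiry(not_after, datetime.now(timezone.utc))
print("expired" if remaining < 0 else f"{remaining:.0f} days left")
```

Running something like this against the certificate dates before an upgrade would at least turn "the client GUI is dead" into "the certificate ran out".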
Feature numero 2: When I first installed the two NSMXpress appliances, I documented each step to make sure the appliances were exactly the same. One of the appliances will be used in production, while the other will be placed in a lab (prod == lab).
When I wanted to add an ISG2000 cluster to the lab NSMXpress, everything went fine until the part where the ISG is supposed to 'talk' to the NSM (on port 7800). This took forever, and crashed the client GUI in the end.
The system log on the ISG had the following messages:
NSM: Cannot connect to NSM server at <IP ADDRESS> Reason: 6, disconnected by peer (read == 0) (16 connect attempt(s))
The cluster is directly connected to the NSM, so there are no firewalls that can interfere. After a while we decided to see if the production NSM/cluster combo had the same problem. Guess what? That config worked like a charm...
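The "disconnected by peer (read == 0)" part of that message suggests the TCP connection itself is accepted and then closed from the NSM side, which is a different failure from a refused or filtered port. A rough way to tell these cases apart from any machine with Python is sketched below. The function name and the result labels are my own, and it knows nothing about the actual NSM protocol on port 7800: it only opens a TCP connection and watches what the other side does.

```python
import socket


def probe(host: str, port: int = 7800, timeout: float = 5.0) -> str:
    """Open a TCP connection and report what the peer does with it.

    Returns one of: "refused", "unreachable", "closed_by_peer",
    "reset_by_peer", "peer_sent_data" or "open_and_idle".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"          # nothing is listening on that port
    except OSError:
        return "unreachable"      # timeout, no route, name lookup failure, ...

    with sock:
        try:
            data = sock.recv(1)   # an empty read means the peer closed the connection
        except socket.timeout:
            return "open_and_idle"    # peer keeps the connection open and waits for us
        except ConnectionResetError:
            return "reset_by_peer"
        return "closed_by_peer" if data == b"" else "peer_sent_data"
```

Calling `probe("<IP ADDRESS>")` against both the lab and the production NSM and comparing the answers would show whether the lab box really behaves differently at the TCP level. A "closed_by_peer" on the lab appliance only would point at a service on the NSM side rather than at the network.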
Now I have two options:
- reinstall the software (clean install and hope for the best), and be up and running in a couple of hours, or
- create a case and wait to hear what JTAC has to say about this.
Anyway, that's a customer decision if you ask me.
So finally I ask you: does this look like a stable Juniper management environment?
The 2008 software release is a lot more stable than the 2007 releases I encountered. The biggest issue with the NSM(Xpress) software is still its unpredictability.
And a final thought for the developers:
Create a decent install/upgrade script that does things automatically. Every step in the install/upgrade manual can be scripted into the installer. This eliminates (a lot of potential) errors while installing, maintaining and upgrading the device.
Anyway, no need to get bored tomorrow.
UPDATE: Well it seems that the problem is solved. The devSvrDataCollector process was down. The 'fun' part is that we didn't pick this up on the CLI. We discovered this by using the NSMXpress client GUI (Administer-> Server Manager-> Server Monitor-> DevSvr status).
After a restart of the Device Services (sh devSvr.sh restart) the import went fine.
What annoys (understatement) me the most about this is that (several) reboots of the system didn't fix it. A service that should come up when the system boots was down, and only a manual restart of the service fixed it.
After a reboot of the system everything still seems to function as advertised.....
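Since the dead devSvrDataCollector process only showed up in the client GUI, a small check on the process list would have saved us a few hours. The sketch below is just that, a sketch: it assumes the process name appears somewhere in the `ps` command lines (I have not verified how the process shows up on the appliance), and the function names are made up for illustration. The matching works on plain text, so the `ps` output can also be captured over SSH and checked on another machine.

```python
import subprocess


def missing_processes(ps_output: str, expected: list[str]) -> list[str]:
    """Return the expected names that appear in no line of the ps output."""
    lines = ps_output.splitlines()
    return [name for name in expected
            if not any(name in line for line in lines)]


def running_commands() -> str:
    """Return the full command line of every process, one per line."""
    result = subprocess.run(["ps", "-eo", "args"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Something like `missing_processes(running_commands(), ["devSvrDataCollector"])` returns an empty list when the process is up, and a list containing the name when it is down. Run after every reboot, it would have flagged the service that never came back.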