PHP uniqid() not always a unique ID

For quite some time modern versions of JFFNMS have had a problem. In large installations hosts would randomly appear as down, with the reachability interface going red. All other interface types worked; only this one failed.

Reachability interfaces are odd, because they call fping or fping6 to do the work. The reason is that to run a ping program you need root access to a socket, and doing that is far too difficult and scary in PHP, which is what JFFNMS is written in.

To capture the output of fping, the program is executed and its output written to a temporary file. For my tiny setup this worked fine, and for a lot of small setups it was also fine. For larger setups, it was not fine at all. There were random failed interfaces and, most bizarrely of all, a file that seemed to disappear. The program checked that the file existed and then ran stat in a loop to see if data was there. The file-exists check worked, but the stat said file not found.

At first I thought it was some odd load-related problem, perhaps the filesystem not being happy and having a file there but not really there. That was, until someone said “Are these numbers supposed to be the same?”

The numbers he was referring to were the filename IDs of the temporary files. They were most DEFINITELY not supposed to be the same. They were supposed to be unique. Why were they always unique for me and not for large setups?

The problem is with the uniqid() function. It is basically a hex representation of the current time, down to the microsecond. Large setups often have large numbers of child processes for polling devices. As the number of poller children increases, so does the chance that two child processes start the reachability poll at the same time and get the same uniqid. That’s why the problem happened, but not all the time.
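To make the "hex representation of the time" point concrete, here is a minimal sketch that splits a uniqid() value back into its two parts. The decode_uniqid() helper is hypothetical and not part of JFFNMS; it assumes the default 13-character form, with no prefix and no extra entropy requested.

```php
<?php
// Hypothetical helper, for illustration only (not JFFNMS code).
// Assumes the default uniqid() form: 8 hex digits of seconds
// followed by 5 hex digits of microseconds.
function decode_uniqid(string $id): array
{
    $seconds = hexdec(substr($id, 0, 8));      // Unix timestamp
    $microseconds = hexdec(substr($id, 8, 5)); // microsecond part
    return ['seconds' => $seconds, 'microseconds' => $microseconds];
}

$id = uniqid();
$parts = decode_uniqid($id);
printf(
    "%s => %s.%06d\n",
    $id,
    date('Y-m-d H:i:s', $parts['seconds']),
    $parts['microseconds']
);
```

Nothing in the value identifies the process that asked for it, so two children calling uniqid() in the same microsecond can end up with the same string.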

The stat error was another symptom of this bug. What would happen was:

  • Child 1 starts the poll, temp filename abc123
  • Child 2 starts the poll in the same microsecond, temp filename is also abc123
  • The wait poller for child 1 and child 2 starts, sees that the temp file exists and goes into a loop of stat and wait until there is a result
  • Child 1 finishes, grabs the details, deletes the temporary file
  • Child 2 loops, tries to run stat but finds no file

Who finishes first is entirely dependent on how quickly fping returns, and that is dependent on how quickly the remote host responds to pings, so it’s kind of random.
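The sequence above can be replayed in a single process to show why the existence check passes but the later stat fails. This is only a sketch: the filename is made up, and the two "children" are simulated by ordinary sequential code rather than real JFFNMS poller processes.

```php
<?php
// Both "children" hold the same temp filename, as if uniqid() had
// returned the same value to each. The name below is invented.
$shared = sys_get_temp_dir() . '/jffnms_demo_' . getmypid() . '_abc123';
file_put_contents($shared, "fping output\n");

// Child 2's existence check succeeds while the file is still there.
$child2SawFile = file_exists($shared);

// Child 1 finishes first: it reads the result and deletes the file.
$result = file_get_contents($shared);
unlink($shared);

// Child 2 loops again and calls stat() on a file that is now gone.
clearstatcache();
$stat = @stat($shared); // @ hides the warning; stat() returns false

var_dump($child2SawFile, $stat);
```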

The fix was a minor patch to use tempnam() instead of uniqid(), adding the interface ID into the mix for good measure (no two children will poll the same interface; the parent’s scheduler makes sure of that). The initial response is that it is looking good.
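The idea behind the patch can be sketched roughly as follows. This is not the actual JFFNMS change: the function name make_poll_tempfile() and the "fping" prefix are assumptions made for the example. tempnam() creates the file itself and returns its name, so two calls do not hand back the same path even with an identical prefix.

```php
<?php
// Hypothetical helper, not the real JFFNMS patch: create a temp file
// for one interface's fping output, with the interface ID in the name.
function make_poll_tempfile(int $interfaceId): string
{
    // tempnam() creates a new, uniquely named file and returns its path.
    $path = tempnam(sys_get_temp_dir(), "fping{$interfaceId}_");
    if ($path === false) {
        throw new RuntimeException('could not create temporary file');
    }
    return $path;
}

$a = make_poll_tempfile(42);
$b = make_poll_tempfile(42);
var_dump($a !== $b); // same interface ID, still two different files

// Clean up the demo files.
unlink($a);
unlink($b);
```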

 

JFFNMS 0.9.3

JFFNMS version 0.9.3 has been released today. This is a vast improvement over the earlier 0.9.x releases and anyone using that train is strongly recommended to upgrade. So what changed? What didn’t change! A nice summary would be fixing a lot of things that were broken or needed some tweaking. A really, really big thanks to Marek for all the testing and bug reports, and also for the patient “just run this and tell me what it says” tests he did. If something wasn’t right before and works now, it is quite likely working because Marek told me how it broke.

A brief overview of what has changed:

  • TFTP transfers work again
  • A lot of the weird polling effects due to caching fixed
  • Lots of the selects in sub-tables now work
  • The PHP string-to-float brokenness in SLAs worked around
  • Even more SNMP library cruft removed or escaped
  • HostMIB apps match properly
  • Interface autodiscovery delete and update fields back working

You can download the file from SourceForge at

https://sourceforge.net/projects/jffnms/files/JFFNMS%20Releases/


JFFNMS 0.9.3 1st release candidate

I have been putting a lot of testing into JFFNMS lately.  I have been very lucky to have had someone with the time and patience to try out various sub versions and give me access to their results.

The end result of all this testing is a much, much less buggy JFFNMS. There have been a stack of problems with caching results, for example, where status would not be updated or, even worse, the status of one device affected another.

The poller parent scheduler had a problem too: it would almost always sit in the first child, starving the others of work, which slowed things down. The scheduler is now a lot fairer across the children, which gives a speed-up. I’ve heard of speed-ups of 15x from this one change alone.

I also had a curious bug where a device set to not gather state still did so, and created events but not alerts. This meant your event table was spammed with down-interface events, even for interfaces you know are down and have turned state checking off for. 0.9.3 now does it the right way.

The first RC is now uploaded and can be found at https://sourceforge.net/projects/jffnms/files/jffnms%20RC/ to try out.

I’m a little worried that the pollers now run too fast and could overwhelm the usually crummy control stack found in network devices for parsing SNMP.  I’m interested to hear how people find it.


JFFNMS 0.9.0 Released

After 3 release candidates JFFNMS is now at version 0.9.0. Both the web frontend and backend (engines) have had extensive rework done to clean up and tighten up the code. There should be far fewer warnings and errors from PHP when you set a higher error level.

  • Fixed error in syslog consolidator which fails for postgresql
  • All webpage input passes through Sanitizer
  • register_globals no longer needs to be turned on, so turn it off!
  • rrdtool v1.0 support dropped
  • lots of cleanup
  • poller rewritten with parent/child code
  • autodiscovery the same as poller
  • Removed php rrd module support
  • Interface auto-discovery will check sysobjid before trying discovery
  • IPv6 reachability support
  • Separated interface selector code
  • SNMP interfaces can have High Speed and optionally no Cisco proxy ping