Using Cacti (Debian 1.2.24+ds1-1+deb12u1) to graph data from a Nokia OLT (7360 FX).
In rough detail: I'm graphing the TX and RX levels of both the OLT and the ONT, and I recently added a new CDEF to the graph template to show the attenuation in each direction. It's when I made this change that things seem to have broken.
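The new CDEFs just take the difference between the transmit level on one side and the receive level on the other; in plain rrdtool terms they boil down to something like this (data source names here are illustrative, not copied verbatim from my templates):

# downstream attenuation = OLT transmit minus ONT receive
CDEF:att_down=olttx,ontrx,-
# upstream attenuation = ONT transmit minus OLT receive
CDEF:att_up=onttx,oltrx,-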
I have some custom scripts which collect the traffic counters and light levels and then expose an API for a (likewise custom) Cacti script server script to query. As you can see from the debug output below, it's pretty straightforward and appears to be querying properly.
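For context, the Cacti side is just a thin script server wrapper around that API. Stripped right down it has this shape (a simplified sketch only, with a made-up API URL, not the actual script):

<?php
// Simplified sketch of the script server function (not the real script).
// Cacti's script server calls it once per data source item, passing the
// parameters shown in the PHPSVR debug lines below.
function ss_davevm_api($host = '', $method = '', $qtype = '', $index = '') {
    // $qtype is one of oltrx/olttx/ontrx/onttx; $index is the ONT index
    $url = 'http://127.0.0.1:8080/api?ip=' . urlencode($host)
         . '&method=' . urlencode($method)
         . '&type='   . urlencode($qtype)
         . '&index='  . urlencode($index);

    // The API returns a bare number such as -24.00, which is exactly what
    // the script server hands back to the poller.
    return trim(file_get_contents($url));
}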
But in the last bit pasted below, when the poller calls rrdtool (Debian 1.7.2-4+b8) to do the update, it provides the DS values in the wrong order.
Output from the Cacti script server debug showing the calls and the responses:
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[1] INC: 'ss_davevm_api.php' FUNC: 'ss_davevm_api' PARMS: '10.100.105.96', 'get', 'oltrx', '26'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[1] RESPONSE:'-24.00'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[2] INC: 'ss_davevm_api.php' FUNC: 'ss_davevm_api' PARMS: '10.100.105.96', 'get', 'olttx', '26'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[2] RESPONSE:'6.00'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[3] INC: 'ss_davevm_api.php' FUNC: 'ss_davevm_api' PARMS: '10.100.105.96', 'get', 'ontrx', '26'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[3] RESPONSE:'-14.57'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[4] INC: 'ss_davevm_api.php' FUNC: 'ss_davevm_api' PARMS: '10.100.105.96', 'get', 'onttx', '26'
2024-01-04 18:20:01 - PHPSVR DEBUG: PID[82910] CTR[4] RESPONSE:'6.41'
Debug from the API script directly, confirming it's returning the correct values:
[2024-01-04 18:20:01] method get ip 10.100.105.96 qtype oltrx index 26 return -24.00
[2024-01-04 18:20:01] method get ip 10.100.105.96 qtype olttx index 26 return 6.00
[2024-01-04 18:20:01] method get ip 10.100.105.96 qtype ontrx index 26 return -14.57
[2024-01-04 18:20:01] method get ip 10.100.105.96 qtype onttx index 26 return 6.41
But then the corresponding rrdtool update clearly shows it's doing it wrong:
2024-01-04 18:20:02 - POLLER: Poller[1] PID[82899] CACTI2RRD: /usr/bin/rrdtool update
/usr/share/cacti/site/rra/device-01_rxb_235.rrd --template onttx:oltrx:olttx:ontrx 1704392401:-14.57:U:-24.00:6.00
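Since --template maps each supplied value to the DS named in the matching position, a scrambled template string means the values land in the wrong data sources even though rrdtool happily accepts the update; in the update above, for example, onttx ends up storing ontrx's -14.57. For anyone wanting to compare, the DS names actually defined in the .rrd can be listed with rrdtool info:

# list the data sources defined in the RRD (name, type, heartbeat, min/max)
/usr/bin/rrdtool info /usr/share/cacti/site/rra/device-01_rxb_235.rrd | grep '^ds\['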
The big caveat here: there are dozens of graphs that are working properly; it appears to be only a few of the newly created ones that are affected. I've not made any changes to the data source template. I have changed the graph template, adding another CDEF for some extra "legend" information, and on all of the working graphs this added data renders correctly. Comparing the graph debug and data source debug of a working graph against a non-working one, they appear to be identical.
It appears to be the poller messing up the DS index order. Any pointers on further debugging, or has anybody else seen this type of behavior before?
Thanks for any ideas!
UPDATE:
Note that subsequent updates list the DS names in a different order each time, but the value paired with each name stays the same (so it's not randomly throwing the data in there; at some point it seems to have gotten confused about which key goes with which value, and it's sticking with that mapping):
2024-01-04 21:05:03 - POLLER: Poller[1] PID[3112] CACTI2RRD: /usr/bin/rrdtool update /usr/share/cacti/site/rra/device-01_oltrx_241.rrd --skip-past-updates --template onttx:ontrx:olttx:oltrx 1704402302:-14.57:6.00:-23.90:5.58
2024-01-04 21:10:02 - POLLER: Poller[1] PID[3245] CACTI2RRD: /usr/bin/rrdtool update /usr/share/cacti/site/rra/device-01_oltrx_241.rrd --skip-past-updates --template ontrx:olttx:oltrx:onttx 1704402602:6.00:-24.20:5.40:-14.56
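A quick way to see which DS each value actually ended up in (rather than eyeballing the template string) is rrdtool lastupdate:

# prints the DS names and the most recently stored value for each
/usr/bin/rrdtool lastupdate /usr/share/cacti/site/rra/device-01_oltrx_241.rrd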
UPDATE:
With further digging, I found a few things.
There were two script server instances running, and all of the failing graphs/data sources were coming from the second instance. My settings are configured for only one script server, so I have no idea what's going on. I wondered if some configuration difference between devices was "forcing" a second instance, and I did notice that the site for the bad device was set to "none." I didn't think that should break anything, but I changed it to "edge" to match the other devices, and lo and behold my data started coming in properly, with just massive spikes at the cut-over points.
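The extra script server instance is easy to spot while a polling cycle is running; something along these lines (assuming the stock Debian install paths) shows one process per instance:

# each script server instance shows up as its own script_server.php process
ps -ef | grep '[s]cript_server.php'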
With that new knowledge, I deleted and recreated the device with site "edge" to match.
No joy! Still a second instance of the script server, and still bad data. If I change the site to anything else (I set it to "core"), the data corrects itself.
I will continue to make changes like this and see what difference they make, but if anybody has seen this type of behavior I'd be very interested.