*Disclaimer: I am a Windows guy and not too familiar with Linux.
I also have (or had) a virtual 4.2.3 Core install that had 3 of these errors popping up under the localhost events. I didn't think much of it until I started using maintenance windows on several groups of servers that have monthly reboots (patching). I happened to be awake when one of the maintenance windows was supposed to be active on a group of our servers (80+ devices set from Production to Maintenance). An hour into the maintenance window I noticed none of the devices in that group were set to maintenance. I verified the time/date was correct and that the active maintenance window informational event was showing in the localhost events. I decided to change the status to maintenance in one swoop by manually selecting all the devices in the particular group. Lo and behold, it failed to do so with no error. I chose a single device in the group and successfully had the state move to maintenance. I upped it a bit and attempted half of the devices; it failed. I started doing 10 at a time: success, success, then a failure on the third set of 10.
After testing each of those 10 individually I found the culprit. I wasn't able to update ANYTHING on this device: groups, status, details, etc. I even attempted to delete it. Every attempt to save/modify the device yielded a fleeting yellow error banner at the top of the browser: "POSKeyError: 0x1cd6a1". Sure enough, when I went back and viewed the localhost events, the code matched.
With 3 total events like this on my localhost, I correctly guessed I had 2 other corrupted devices. This was easily tested by creating a temporary group and doing a drag/drop of a smaller number of devices into that group. Working 50 at a time, I eventually narrowed it down and found the 2 other corrupt devices.
Message:
Unhandled exception in zenhub service Products.ZenHub.services.PingPerformanceConfig.PingPerformanceConfig: 0x1cd6a1
Traceback (most recent call last):
File "/opt/zenoss/Products/ZenCollector/services/config.py", line 108, in _wrapFunction return functor(*args, **kwargs) File "/opt/zenoss/Products/ZenCollector/services/config.py", line 227, in _createDeviceProxies proxy = self._createDeviceProxy(device) File "/opt/zenoss/Products/ZenHub/services/PingPerformanceConfig.py", line 165, in _createDeviceProxy self._getComponentConfig(iface, perfServer, proxy.monitoredIps) File "/opt/zenoss/Products/ZenHub/services/PingPerformanceConfig.py", line 102, in _getComponentConfig for ipAddress in iface.ipaddresses(): File "/opt/zenoss/Products/ZenRelations/ToManyRelationship.py", line 71, in __call__ return self.objectValuesAll() File "/opt/zenoss/Products/ZenRelations/ToManyRelationship.py", line 174, in objectValuesAll return list(self.objectValuesGen()) File "/opt/zenoss/Products/ZenRelations/ToManyRelationship.py", line 181, in objectValuesGen for obj in self._objects: File "/opt/zenoss/lib/python2.7/_abcoll.py", line 532, in __iter__ v = self[i] File "/opt/zenoss/lib/python2.7/UserList.py", line 31, in __getitem__ def __getitem__(self, i): return self.data[i] File "/opt/zenoss/lib/python/ZODB/Connection.py", line 860, in setstate self._setstate(obj) File "/opt/zenoss/lib/python/ZODB/Connection.py", line 901, in _setstate p, serial = self._storage.load(obj._p_oid, '') File "/opt/zenoss/lib/python/relstorage/storage.py", line 476, in load raise POSKeyError(oid) POSKeyError: 0x1cd6a1
Google yielded no useful details, and I came across several threads on here from other users with the same issue, then eventually this post by jmp242:
- this is a corruption caused by MySQL exiting mid transaction, thereby corrupting objects in the relstorage ZopeDB. This is an issue if, say, MySQL is killed by the OOM killer - so sizing the Zenoss server appropriately is critical. The corruption can linger unnoticed for some time if you don't access the corrupted object, but you will see errors in some logs. Repair is an involved manual process, requiring a Zenoss Guru.
http://community.zenoss.org/thread/19880
And since this can "linger unnoticed", you can only hope your backups go far enough back, to before the event initially occurred. Transferring/restoring to a new Zenoss install just carries the issue along with the DB.
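For what it's worth, you can sanity-check whether the hex code in the banner/event really is a missing object (versus some other error) by converting it to a decimal OID and looking for it in the relstorage tables. Rough sketch below; I'm assuming the stock history-free relstorage layout where objects live in an object_state table keyed by zoid, and the host/user/password/database are just placeholders for whatever your ZODB connection settings are (global.conf in 4.2.x, I believe):

# Rough check: does the OID from the POSKeyError exist in relstorage?
# Assumes object_state keyed by zoid; connection values are placeholders.
import MySQLdb

oid = int("0x1cd6a1", 16)  # hex code straight from the POSKeyError event

conn = MySQLdb.connect(host="localhost", user="zenoss",
                       passwd="zenoss", db="zodb")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM object_state WHERE zoid = %s", (oid,))
(count,) = cur.fetchone()
if count:
    print "OID 0x%x (%d) exists - the error is something else" % (oid, oid)
else:
    print "OID 0x%x (%d) is missing - dangling reference" % (oid, oid)
conn.close()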
So, if you see these, you have a corrupt device for each event. The way I had documented our deployment, it was easier and faster to just rebuild. Plus it didn't seem like anyone who had these issues ever got them resolved knowing exactly how to fix it, versus the lucky 1 or 2 people with "I ran this script and it fixed my Zenoss". For someone like me who isn't too familiar with what's under the Zenoss hood, the rebuild was faster. Plus 4.2.4 came out the week prior, 2 birds with one stone and all that jazz.
Ultimately it looks like the fix is to properly size the server (RAM, disk) and possibly schedule a regular reboot of Zenoss to keep memory use down, aka the memory creep/leak.
Off topic/unrelated to this issue: when I say memory leak (I wish I had bookmarked the thread here with another person's rant about why it doesn't release memory back...), the memory utilization on the localhost just slowly creeps up and up and up. After "Used" tops out at 100%, "Cached" slowly marches down until it finally gets into the 10-15% range. At that point swap begins to creep up (eventually spiking in our system). If swap pegs out, your system is most likely to tip over soon: emails not sent, the collector's event queue hits max, etc. All sorts of fun. Hopefully it behaves after a reboot at that point.
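Since I can't watch that graph all day, something like the sketch below dropped into cron on the Zenoss master is what I have in mind for catching the creep before swap pegs out. The 15% cached / 50% swap thresholds are just numbers I picked from watching our box, nothing official:

# Rough cron-able sketch: read /proc/meminfo and warn when cached memory
# gets low while swap use gets high, i.e. the pattern described above.
def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.strip().split()[0])  # values in kB
    return info

m = read_meminfo()
cached_pct = 100.0 * m["Cached"] / m["MemTotal"]
if m.get("SwapTotal", 0):
    swap_pct = 100.0 * (m["SwapTotal"] - m["SwapFree"]) / m["SwapTotal"]
else:
    swap_pct = 0.0

print "cached: %.1f%%  swap used: %.1f%%" % (cached_pct, swap_pct)
if cached_pct < 15 or swap_pct > 50:
    # Arbitrary thresholds - tune to taste before acting on them.
    print "WARNING: memory creep - plan a Zenoss restart before it tips over"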
The picture is the memory utilization graph of one of our Zenoss Core 4.2.3 servers. The red stars indicate a manual reboot of the system to "reset" it back to lower memory use. The blue star indicates a bump from 24GB of RAM to 36GB; even after that, the memory use creeps up again. The two spikes next to the orange star I have no explanation for, most likely some type of panic. The gaps are Zenoss being unable to reach SNMP on the localhost.