An Example

The following example is based on a “simple” problem that was difficult to fix.

One of our web servers was crashing at random times. Each time, the memory was increased to "fix" the problem, and a script was added that rebooted the server every once in a while.

I do wonder how many servers out there are being rebooted every day rather than having the root cause fixed.

What we know	How we know it
Website crashed at 13:27 on 01/01/21	Zabbix reported “off line”
Website ran out of memory	Reported in /var/logfile at 13:26
Website rebooted with twice the memory	System admin confirmed (knee-jerk fix, you know who you are)
Software versions: CentOS 7, WordPress 5.x, httpd 2.4, Java, PHP 5.3	Verified using yum list installed

What we don’t know	How we can know it
What was the load on the website at the time of the crash	See Zabbix or examine /var/httpd/logfiles
Is the memory allocation a sensible value	See monitoring
Is the software up to date	Check supplier's website
Any known problems with this version	Check using yum
Has the increased memory fixed the problem	Monitor closely using Linux tools & Zabbix

These two tables are invaluable when troubleshooting. They stop people going down “rabbit holes” and retesting the same thing more than once.

From the two tables above we can now form a hypothesis and a plan.

What we are going to do	How we do it & who does it	Answer
Get load statistics for website & collate	Download log files, sort by time in Excel, and plot graphs	Load normal
Check recommended memory for expected load	Read docs	Memory now too much
Confirm software versions	Sysadmin to run yum commands	Older (back-level) version running
Check software bug list for version	Sysadmin to read docs	Some memory-related bugs, not specific to this problem (possibly PHP)
Monitor memory usage vs load	Add Zabbix & scripts to gather data for the next month	Memory usage still growing
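
As an alternative to sorting the logs by hand in a spreadsheet, the load statistics can be collated with a few lines of Python. This is only a sketch: it assumes the web server writes the standard Apache common/combined log format, the function name requests_per_minute is made up for illustration, and the sample lines are invented rather than taken from the real server.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp field of an Apache common/combined log line,
# e.g. [01/Jan/2021:13:26:45 +0000]. Assumes English month abbreviations.
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def requests_per_minute(lines):
    """Count requests per minute, returned in chronological order."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip anything that is not an access-log entry
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(sorted(counts.items()))


# Invented sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [01/Jan/2021:13:25:10 +0000] "GET / HTTP/1.1" 200 512',
    '192.0.2.2 - - [01/Jan/2021:13:26:01 +0000] "GET /a HTTP/1.1" 200 128',
    '192.0.2.1 - - [01/Jan/2021:13:26:45 +0000] "GET /b HTTP/1.1" 200 256',
    "not a log line",
]
for minute, hits in requests_per_minute(sample).items():
    print(minute, hits)
```

The per-minute counts can then be plotted, or compared directly with the minutes leading up to the crash, to decide whether the load was normal.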

Update the “What we know” table.

What we know	How we know it
Website crashed at 13:27 on 01/01/21	Zabbix reported “off line”
Website ran out of memory	Reported in /var/logfile at 13:26
Website rebooted with twice the memory	System admin confirmed (knee-jerk fix, you know who you are)
Software is an older version with some related memory bugs	Yum & documentation – sysadmin
Memory is increasing under normal load conditions	Load graphs and resource graphs
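
“Memory is increasing” is easier to defend with a number than with a glance at a graph. One possible approach, sketched below, is to fit a straight line through the memory samples and look at its slope. The function name growth_per_hour and the sample figures are invented for illustration; a steady positive slope while the load stays flat points towards a leak, but it is not proof on its own, since caches also grow after a restart.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory use, in MB per hour.

    samples: list of (hours_since_start, used_mb) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Spread of the time values around their mean.
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one point in time")
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return covariance / spread


# Invented readings: used memory climbing by 50 MB every hour.
readings = [(0, 1000), (1, 1050), (2, 1100), (3, 1150)]
print(growth_per_hour(readings))  # prints 50.0
```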

New hypothesis and a new plan based on new information.

It looks like the software has some known bugs at this version, so we need to eliminate these from the equation.

What we are going to do	How we do it & who does it
Update WordPress & PHP to latest	Snapshot server & follow upgrade docs
Monitor memory usage vs load	Add Zabbix & scripts to gather data for the next month
Return memory to recommended size if OK	VMware change
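
The “scripts to gather data” can be very small. As a rough sketch, the function below works out used memory from the text of /proc/meminfo. It assumes a Linux kernel that reports MemAvailable, and the function name used_memory_mb and the sample text are invented for illustration.

```python
def used_memory_mb(meminfo_text):
    """Return used memory in MB (MemTotal minus MemAvailable)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # memory values are in kB
    return (fields["MemTotal"] - fields["MemAvailable"]) / 1024


# Invented sample in the /proc/meminfo layout, for illustration only.
sample_meminfo = """MemTotal:        2048000 kB
MemFree:          100000 kB
MemAvailable:    1024000 kB
"""
print(used_memory_mb(sample_meminfo))  # prints 1000.0

# On the server itself, the same function could be fed the live file:
#   with open("/proc/meminfo") as handle:
#       print(used_memory_mb(handle.read()))
```

Run from cron every few minutes and appended to a file with a timestamp, readings like this give a month of data to set against the load figures.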

What now?

Now you know how to solve problems like an expert. If you want to learn more about problem solving, or want to get some advice, get in touch today!
