When PHP crashes: how to collect meaningful information and what to do with it

Difficulty: Piece of Cake

Introduction

PHP is a mess (not to say something worse). If you work with PHP and haven't realized it yet, read the article "PHP: a fractal of bad design". In my 15 years as a developer (luckily I now develop for fun, but still spend time coding) I've never come across such an unstable, unpredictable and obscure environment, at least not in the last 10 years. But... that's the fun.

The market is what it is, and open source has some benefits after all. So if you are into PHP, sooner or later you will come across some broken piece in the stack (besides the thoroughly broken language itself), either in PHP, in one of its extensions or even at the web server or OS level. And when that happens, you'd better know what to do and how to collect meaningful information. This information is not for yourself, but for the maintainers of the different pieces of the stack (cross your fingers to get attention in a timely manner; after all, you got all the way here without paying for a single license). And if you expect to pay someone to solve those low level issues then be ready to cash out big bucks, and I mean really big, like probably more than the sum of all the license fees you were trying to skip all along.

And why would you be willing to collect this information? These bugs may or may not be critical to your business, but even if they are not, it is morally correct to give feedback to the original developers that gave you everything for free.

I've heard so many times that the benefit of open source is that you have access to all the source code of the components in the stack. Really? Do you plan and/or have the time and/or skills to solve a bug in PHP, an extension or even Apache? And some of these have very "amateur" roots, so do not expect the code or overall design of the software to have a decent level of maintainability. (Don't mistake my words: many of these pieces are taken care of by extremely skilled developers, but don't compare a single developer doing something in their free time to a legion of well trained professionals in a corporate environment. The output is simply different.)

In this article I will show you the best tools to collect meaningful dumps with the least possible effort, dumps that will help developers figure out what was going on in the event of a low level crash and how to solve it. I will also propose a strategy to integrate automatic dump collection in your production environments so that you will have everything you need in the event of failure, because sometimes these bugs are nearly impossible to reproduce, and if you didn't get the dump when the crash happened, you'll just have to wait for it to happen again.

How to set up WER

A dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.

Let's start with the basics. A dump is like a picture of what was going on at the moment of the crash. A developer can use this (along with the original source code and debugging symbols) to analyze, understand and fix the broken code that led to the crash.

In a server environment, dumps are not collected by default; the user needs to set something up to get them. There are two options available: Windows Error Reporting and Debug Diag.

Windows Error Reporting (WER) is a set of Windows technologies that capture software crash and hang data from end users. This data is analyzed to create a list of top user-mode (software) and kernel-mode (operating system) failures associated with a company’s mapped products. Through the Windows Dev Center hardware dashboard website, software and hardware vendors can access these reports and use them to analyze, fix and respond to these failures.

The Debug Diagnostic Tool (DebugDiag) is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or fragmentation, and crashes in any user-mode process. The tool includes built-in analysis rules focused on Internet Information Services (IIS) applications, web data access components, COM+, SharePoint and related Microsoft technologies.

Even though they might sound like they do the same thing, they are radically different tools. WER was designed as an automatic way (it is in fact a Windows service) for MS to collect failure data on a global scale and have it sent back over the internet, so that problems in their software (or that of 3rd party vendors) can be solved proactively. DebugDiag, on the other hand, is a more advanced and customizable tool aimed at developers analyzing their own software.

We are looking for a simple strategy that will work in our production environment, and since DebugDiag is a complex tool we'll be sticking to WER. Besides, using DebugDiag requires installing an additional piece of software (sometimes not easy due to IT policies), while WER is already part of the operating system: you just have to set it up properly to serve your needs.

WER can be configured from many places, but it all ends up in a set of Registry Keys found at one of these locations:

  • HKEY_CURRENT_USER\Software\Microsoft\Windows\Windows Error Reporting
  • HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting

You can take a look at all the possible settings on this page.

The important thing to notice is that dump settings can be set on the application level and on the global level. That means that you can configure different WER behavior for different applications, or only enable WER for specific applications.

You will see that some of the settings include the [Application Name] label in their registry path. That should be replaced by the name of the binary (executable file) that you want those settings to be applied to.
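As an illustration of the per-application form, here is a minimal Python sketch that builds the text of a .reg file scoped to a single executable. The function name per_app_reg, the php-cgi.exe target and the dump folder are example values chosen for this sketch, not settings taken from a real server. It assumes the per-application key is a subkey of LocalDumps named after the executable.

```python
# Sketch: build .reg file text that enables local dumps for ONE executable.
# per_app_reg, "php-cgi.exe" and the folder below are illustrative names only.

BASE_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows"
    r"\Windows Error Reporting\LocalDumps"
)


def per_app_reg(exe_name, dump_folder, dump_count=10, dump_type=2):
    """Return .reg file text with LocalDumps settings scoped to exe_name."""
    # Backslashes inside .reg string values must be doubled.
    folder = dump_folder.replace("\\", "\\\\")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        # The executable name replaces the [Application Name] placeholder.
        f"[{BASE_KEY}\\{exe_name}]",
        f'"DumpFolder"="{folder}"',
        # DWORD values in .reg files are written in hexadecimal.
        f'"DumpCount"=dword:{dump_count:08x}',
        f'"DumpType"=dword:{dump_type:08x}',
    ]
    return "\n".join(lines) + "\n"
```

Calling per_app_reg("php-cgi.exe", "d:\\werdumps\\php", dump_count=100) returns text you can save as a *.reg file and review before importing it.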

In this article we will be using global settings because we want to know what is happening in our production environment, even if PHP or one of its friends is not responsible for the crash.

NOTICE: Full User Dumps are big. For a normal php-cgi.exe process we are talking about 300 MB (once compressed in ZIP or RAR it's just around 35 MB). If you are dumping automatically in a production environment, make sure you enforce the necessary constraints to prevent the dumps from eating up too much space.
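One way to enforce such a constraint is a small cleanup script run on a schedule. The following Python sketch deletes the oldest dump files until the folder fits within a size budget. The function name prune_dumps and the budget are hypothetical, and it assumes dumps are written with a .dmp extension.

```python
from pathlib import Path


def prune_dumps(folder, max_total_bytes):
    """Delete the oldest *.dmp files until the folder fits the budget.

    Returns the list of deleted paths.
    """
    # Oldest first, based on the file modification time.
    dumps = sorted(Path(folder).glob("*.dmp"), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in dumps)
    deleted = []
    for path in dumps:
        if total <= max_total_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        deleted.append(path)
    return deleted
```

For example, prune_dumps(r"d:\werdumps", 5 * 1024**3) would keep at most roughly 5 GB of dumps. If you want to keep old dumps, compress or move them elsewhere instead of deleting them.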

This is what we have in our production environment (you can copy and paste it into a *.reg file, double click it, and it will be automatically added to your registry):

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
"DumpFolder"="d:\\werdumps\\"
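"DumpCount"=dword:00000064
"DumpType"=dword:00000002

Here we are telling WER to save the user dumps in the DumpFolder location, to keep at most 100 dumps (to prevent disk space shortage issues) and to perform a full user dump (DumpType=2). Note that DWORD values in a .reg file are hexadecimal, so 100 is written as 00000064 (00000100 would mean 256). Once the limit is reached, WER replaces the oldest dump in the folder with the new one. Full dumps are big, but have all the possible information that can be extracted at the moment of the crash.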

After configuring these settings, we need to turn on the WER service. Open an elevated prompt and type this:

net start wersvc

Whenever you want to stop the service, do the opposite:

net stop wersvc

You can also do this from the Services Management UI, type "services.msc" in your command prompt to open it.

From now on Windows will start collecting full user dumps in the event of a crash.

You should also know that for every crash the OS creates an entry in the system's event log, so for every dump file you will have one (or more) entries with the same timestamp in the log.

You can open the event viewer by typing "eventvwr" in the command line.

The information in the event viewer is critical to determine which component is being affected by the crash. You need to get this information in order to know who you must report the crash to. In the screenshot, the offending extension is php_xdebug, so we'll have to report this issue to the maintainers of the XDebug extension. Note that sometimes this information is not reliable: the faulting module might not be the true root of the problem.
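If you prefer the command line to the event viewer UI, the same entries can be pulled with the wevtutil tool that ships with Windows. The Python sketch below builds and runs such a query. It assumes crashes are logged in the Application log by the "Application Error" source with Event ID 1000, which is the usual case but worth checking against what you see in your own event viewer. The function names are made up for this example.

```python
import subprocess


def crash_event_query(max_events=5):
    """Build a wevtutil command listing the newest Application Error events."""
    # Assumption: crashes show up as Event ID 1000 from "Application Error".
    xpath = "*[System[Provider[@Name='Application Error'] and (EventID=1000)]]"
    return [
        "wevtutil", "qe", "Application",
        f"/q:{xpath}",
        f"/c:{max_events}",  # limit the number of events returned
        "/rd:true",          # newest events first
        "/f:text",           # human-readable output instead of XML
    ]


def recent_crash_events(max_events=5):
    """Run the query (Windows only) and return its text output."""
    result = subprocess.run(
        crash_event_query(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The text output includes the faulting application and faulting module names, which is the information you need to decide where to report the crash.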

How to investigate and report an issue

Now that you are collecting meaningful data, you need to know when to look for it and what to do with it.

When

- If the crash is directly impacting your application (users complaining, downtimes, etc.).
- Every once in a while look into the dumps folder, and then use the timestamps to locate the issue details in the event viewer. Your application may be having issues randomly, and you might not be noticing. Be proactive and do not wait for customers to complain.
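For the periodic check, a short script can list the dumps with their timestamps so that you can look up the matching entries in the event viewer. This is only a sketch: list_dumps is a hypothetical helper, and it assumes the dumps have a .dmp extension and that the file modification time is close to the time of the crash.

```python
from datetime import datetime
from pathlib import Path


def list_dumps(folder):
    """Return (timestamp, file name, size in MB) tuples, newest first."""
    rows = []
    for path in Path(folder).glob("*.dmp"):
        info = path.stat()
        rows.append((
            datetime.fromtimestamp(info.st_mtime),  # local time, like the event viewer
            path.name,
            round(info.st_size / 1024**2, 1),
        ))
    return sorted(rows, reverse=True)
```

Printing each row of list_dumps(r"d:\werdumps") gives you a quick overview of when crashes happened and which executable produced each dump, since the file name normally starts with the name of the crashed executable.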

What

Once you have a dump with the event details, send this information to the issue queues for the corresponding projects, for example:

If the fault belongs to an extension that is bundled with PHP, you should use the PHP Bug tracker. Sometimes companies are collaborating with PHP core, so you could for example report a bug in PHP for the Wincache extension, and it will get to the right people.

It is also important to tell the developer WHAT software (version, binaries and where you got them) was running at the moment of the crash.
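Part of that information can be gathered automatically. The sketch below collects the OS description plus the output of php -v (version and build details) and php -m (loaded modules). The helper names are invented for this example, and it assumes the php executable you point it at is the same build as the one that crashed. Where you got the binaries from still has to be written down by hand.

```python
import platform
import subprocess


def run_or_note(cmd):
    """Return the command's output, or a note if it could not be run."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(could not run {cmd[0]}: {exc})"
    return done.stdout.strip()


def environment_report(php_exe="php"):
    """Collect basic facts a maintainer will ask for in a crash report."""
    return "\n".join([
        f"OS: {platform.platform()}",
        "PHP version:",
        run_or_note([php_exe, "-v"]),
        "Loaded modules:",
        run_or_note([php_exe, "-m"]),
    ])
```

Attach the text returned by environment_report() to the bug report together with the dump and the event log details.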