This blog was authored by ExtraHop Reveal(x) customer Mitch Roberson. Visit his blog for more great tech content here.
I have worked with Exchange for years. It's always important to gain visibility into things that affect end users, but Exchange has a lot of different pieces and can be very confusing. It has some fantastic built-in monitoring, and third-party tools are available for it, but recently I have been using ExtraHop Reveal(x) to monitor the SMTP traffic to and from the server.
When you have a large environment, it is often hard to know if you are having issues, especially when your users are busy and do not generally call in about intermittent problems. By monitoring SMTP as it passes along the wire, I have learned of problems that my end users were having but not calling in about.
Scanning to email is very common. One of the issues I recently found was that we were getting a "451 4.7.0 Temporary server error. Please try again later. PRX5." What is interesting is that we had probably been getting this error randomly off and on for a while, but none of the other monitoring tools I had were capturing it. Reveal(x) was.
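To make that error string a little less opaque, here is a minimal sketch of how an SMTP reply line breaks down. It is not how Reveal(x) does it; the `SmtpReply` type and `parse_reply` function are names I made up for illustration. The only facts it leans on are that SMTP replies start with a three-digit code, that 4xx means a transient failure and 5xx a permanent one, and that an optional enhanced status code such as `4.7.0` may follow.

```python
import re
from typing import NamedTuple, Optional


class SmtpReply(NamedTuple):
    """One parsed SMTP reply line (hypothetical helper type)."""
    code: int                 # three-digit reply code, e.g. 451
    enhanced: Optional[str]   # enhanced status code, e.g. "4.7.0", if present
    text: str                 # remaining human-readable text

    @property
    def kind(self) -> str:
        # The first digit of the reply code gives the broad outcome.
        classes = {
            2: "success",
            3: "intermediate",
            4: "transient failure",
            5: "permanent failure",
        }
        return classes.get(self.code // 100, "unknown")


# Reply code, a space or hyphen, an optional enhanced status code, then text.
_REPLY_RE = re.compile(r"^(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3})\s+)?(.*)$")


def parse_reply(line: str) -> SmtpReply:
    """Split a single SMTP reply line into code, enhanced code, and text."""
    match = _REPLY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an SMTP reply line: {line!r}")
    return SmtpReply(int(match.group(1)), match.group(2), match.group(3))


reply = parse_reply("451 4.7.0 Temporary server error. Please try again later. PRX5.")
print(reply.code, reply.enhanced, reply.kind)
```

The point of the classification is that a 4xx reply like this one tells the sender to try again later, which is exactly why it goes unnoticed when the sender is a scanner that simply gives up.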
We had not received a single user complaint, but I knew this would cause the scanner to error out and the end user would have to rescan the document. For me, this was unacceptable. I have long held the belief that the reason I have a job is to make sure the end users can do their jobs with the technology that we provide. This is a core belief of mine. At this point I knew it was affecting end users, so I needed to dig in. But that is for another blog post.
What I'm getting at here is that by monitoring the SMTP stream, I can see when the service is stopped or unavailable. As you can see below, we can tell when someone is trying to relay, even when there is an invalid address. Or, lo and behold, that a message exceeds the fixed maximum message size.
For years I would often get a screenshot from an end user telling me this was happening. In most cases, I had no idea how many people it was affecting. But as you can see below, we saw 202 instances of inactive service (we were doing some testing). And the cool thing is that most email systems are built so they will retry, but not all of them. So I could easily dig in to see which clients had failed and determine if something needed to be resent.
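Separately from the Reveal(x) view itself, the "who failed and never got through" question can be sketched in a few lines. This is a rough illustration under my own assumptions: the records are made-up `(client_ip, reply_code)` pairs in time order, the addresses are invented, and `unresolved_failures` is a hypothetical helper. It flags clients whose most recent reply was still an error, on the theory that a client that retried successfully needs no follow-up.

```python
from collections import Counter

# Hypothetical, time-ordered (client_ip, reply_code) pairs. Not real data.
records = [
    ("10.0.0.21", 451),  # failed once...
    ("10.0.0.21", 250),  # ...then a retry succeeded
    ("10.0.0.35", 451),
    ("10.0.0.35", 451),  # still failing on its last attempt
    ("10.0.0.40", 552),  # message too large, never delivered
]


def unresolved_failures(records):
    """Return {client: failure_count} for clients whose last reply was an error."""
    failure_counts = Counter()
    last_code = {}
    for client, code in records:
        if code >= 400:            # 4xx transient or 5xx permanent failure
            failure_counts[client] += 1
        last_code[client] = code   # remember the most recent reply per client
    return {
        client: failure_counts[client]
        for client, code in last_code.items()
        if code >= 400
    }


needs_follow_up = unresolved_failures(records)
print(needs_follow_up)
```

Grouping by client is deliberately coarse. A more careful check would key on the individual message or on the sender and recipient pair, since one client can have several messages in flight at once.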
This is just the TIP of the iceberg. The amount of data and information you can get out of your wire data with Reveal(x) is incredible. Things like:
Longest processing time and round trip time:
Number of requests to number of responses:
And so much more. Check out another recent use case involving a strange spike in DNS transactions here.