Warehouse workers at a large ecommerce retailer began experiencing intermittent delays of up to 45 seconds on their RFID scanners. Lines of trucks backed up in the yard, leading to missed shipments and SLA penalties. The operations team analyzed the network, application, and database logs but could not find the root cause. They checked resource utilization on every system in the delivery chain of the inventory and shipping application, and all were operating at normal capacity. The problem kept recurring, and costs had already exceeded $150,000 in lost productivity, delayed shipments, and SLA penalties.
- Quickly identify the cause of the slow RFID scan times without resorting to packet capture and offline analysis.
- Improve troubleshooting by investigating issues that occurred in the past.
- Easily identify and address intermittent issues.
- Reduce expenses associated with downtime and SLA penalties.
The retailer was already using the ExtraHop platform for its external website, so the team began directing a copy of the warehouse system and application traffic to the appliance as well. Within a few hours, the intermittent problem started again. The ExtraHop platform classified the traffic and showed the RFID scanner response time climbing to 45 seconds. At the same time, the team saw a massive throughput spike from eight backend servers on the same VLAN, all connecting to the storage array.
In the ExtraHop web UI, the operations team watched the RFID clients attempt to communicate with the application servers while retransmissions and slow starts spiked. Several switches in the application path showed pegged L2 throughput, which dropped off after the traffic from the backend servers subsided. Working with other internal teams, the operations team identified the backend servers as a Hadoop cluster. The ASICs in the pegged switches had been completely overrun by intermittent and variable Hadoop batch jobs.
The RFID scanner application and the Hadoop cluster had shared the same network for more than 18 months. What changed? It turned out that the data analytics team had recently upgraded their Hadoop servers from 1G to 10G. Hadoop jobs consume as much bandwidth as is available, so the upgraded cluster could now overwhelm the shared switches. The IT team quickly rescheduled the Hadoop jobs for nonpeak times until a separate Big Data network could be created.
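To put the upgrade in perspective, the following back-of-the-envelope sketch compares the peak load the eight backend servers could offer before and after the change. It assumes every server can saturate its link at the same moment and ignores protocol overhead; the function name is illustrative, and the capacity of the shared switches is not given in this account, so it is left out.

```python
def aggregate_gbps(servers: int, link_gbps: int) -> int:
    """Peak offered load if every server saturates its link simultaneously."""
    return servers * link_gbps


SERVERS = 8  # the eight backend servers seen in the throughput spike

before = aggregate_gbps(SERVERS, 1)   # 1G links
after = aggregate_gbps(SERVERS, 10)   # 10G links
increase = after / before

print(f"Before upgrade: {before} Gbps, after upgrade: {after} Gbps ({increase:.0f}x)")
```

Under these assumptions the potential burst grows tenfold, which is consistent with switches that coped for 18 months suddenly being overrun.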
After the IT team rescheduled the Hadoop jobs, warehouse operations returned to normal, and the retailer was able to comply with the SLAs.
Day to day, the ExtraHop platform continued to provide the IT team with real-time information about the status of their warehouse operations. They could now see promptly whether any transaction was taking longer than it had during the same time period in the past. While the ExtraHop platform was critical in resolving the Hadoop issue, the time-over-time comparisons provided ongoing confidence in the health of warehouse communications and operations.
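The idea behind a time-over-time comparison can be illustrated with a minimal sketch. This is not the ExtraHop platform's implementation or API; the function name, the threshold factor, and the sample response times are all hypothetical, chosen only to show how current timings for a transaction type might be checked against the same period in the past.

```python
from statistics import median


def slow_transactions(current, baseline, factor=2.0):
    """Return transaction names whose current median response time
    exceeds `factor` times the median for the same period in the past."""
    flagged = []
    for name, samples in current.items():
        past = baseline.get(name)
        if not past or not samples:
            continue  # nothing to compare against
        if median(samples) > factor * median(past):
            flagged.append(name)
    return flagged


# Hypothetical response times in seconds, not measurements from this case.
baseline = {"rfid_scan": [0.4, 0.5, 0.6], "label_print": [1.0, 1.1, 1.2]}
current = {"rfid_scan": [0.5, 45.0, 44.0], "label_print": [1.0, 1.2, 1.1]}

flagged = slow_transactions(current, baseline)
print(flagged)
```

Using the median rather than the mean keeps a single outlier from triggering a flag, while a sustained slowdown still shows up.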