Su,
I/O performance does not appear to be a root cause. At 3 ms latency max, it is likely that either all I/Os are being satisfied from cache, or you've got some great SSD, or both.
CPU utilization does spike to 100% at times, though, so that might indeed be the cause of the slowdowns if the timing matches up.
Another possible issue is the requirement that the VM container in which the Delphix Engine guest resides have full vCPU and full vRAM reservations in ESX, which is documented online here.
If the ESX cluster in which the Delphix Engine resides is a busy one with several VM guests, and particularly if either (or both) of the source and target servers also reside within the same ESX cluster as VM guests, then resource stealing (particularly vCPU and vRAM) might occur.
From the standpoint of most VMware ESX administrators, this "stealing" of resources is expected and natural. The problem is that the Delphix Engine VM is not like other VMs in that the Delphix Engine VM is providing I/O services to other VMs. So, when resources are stolen from the Delphix Engine VM, it will potentially affect applications on other VMs, potentially many other VMs.
Of even greater concern, if the source Oracle database resides on a VM within the same cluster as the Delphix Engine VM, is the phenomenon of the "perfect storm", where multiple factors converge to exacerbate a problem. In a "perfect storm" involving Delphix, sharply increased workloads on VMs containing related Oracle databases steal resources from the VM containing the Delphix Engine, amplifying the effects of the increased workload on both sides.
What is really insidious about a "perfect storm" is that it is difficult to detect merely by viewing metrics gathered within the VMs; resource theft is largely invisible to the guests, so their metrics do not record its effects. Such a situation can be detected only at the ESX level.
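Contention of this kind usually shows up in the hypervisor's CPU "ready" metric rather than in guest-level tools. As a rough sketch (the VM name is a placeholder), it can be sampled like this:

```shell
# On the ESXi host: capture performance counters in batch mode,
# six samples at 10-second intervals, for offline inspection of %RDY
# (the percentage of time a vCPU was runnable but waiting for a pCPU).
esxtop -b -d 10 -n 6 > esxtop-sample.csv

# Alternatively, from a PowerCLI session connected to vCenter
# (the VM name "delphix-engine" is a placeholder):
#   Get-Stat -Entity (Get-VM "delphix-engine") -Stat cpu.ready.summation -Realtime
```

A sustained ready time above roughly 5% per vCPU is a common rule of thumb for CPU contention.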
However, it is highly unlikely that two VMs within the same ESX cluster would have a 1 GbE network between them. Most ESX admins would ensure that a 10 GbE virtual adapter is used, unless a 1 GbE vNIC was deliberately configured for some reason, such as testing real-world conditions. If that is the case, hopefully a "vmxnet3" vNIC is in use, and not an older or less capable vNIC.
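From inside a Linux guest, the vNIC type and reported link speed can be checked quickly (the interface name eth0 is a placeholder):

```shell
# "driver: vmxnet3" in the output confirms the paravirtual adapter
# rather than an emulated e1000/e1000e NIC.
ethtool -i eth0

# The reported speed for a vmxnet3 vNIC is normally 10000Mb/s.
ethtool eth0 | grep -i speed
```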
Another issue might be that Ethernet "jumbo frames" are not in use. Jumbo frames do not so much increase network throughput as decrease CPU utilization on the servers at both ends of the network connection. Given the 100% vCPU spikes seen, implementing jumbo frames might be a useful step, as documented online here.
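As a quick sanity check, jumbo frames can be verified end to end with a don't-fragment ping. With a 9000-byte MTU, the largest ICMP payload is 9000 minus the 20-byte IP header and 8-byte ICMP header (the target hostname below is a placeholder):

```shell
# Largest ICMP payload that fits in a 9000-byte MTU:
echo $((9000 - 20 - 8))    # prints 8972

# Send don't-fragment pings of that size; success means every hop
# in the path passes jumbo frames (hostname is a placeholder):
#   ping -M do -s 8972 -c 3 target-host.example.com
```

If the ping fails with "message too long", some hop in the path (vSwitch, physical switch, or vNIC) is not configured for the larger MTU.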
More general network optimization guidelines are documented online here.
To summarize, I don't think that I/O is a cause of the performance issues you are experiencing. I think that the 1 GbE network link is a strong candidate as the root cause and I'm also concerned about the extreme spikes in CPU utilization shown.
I hope this helps!
Thanks!