Very slow initial db_sync

  • Problem
  • Updated 3 years ago
Good day.

Please help me solve a problem.
I am performing an initial db_sync, and the process is going very slowly (<= 30 Mbit/s).
There are 8 network and 8 RMAN connections.
The network is 1 Gbit.
99% of the Oracle wait events are "Backup: MML write backup piece".

--
Product Delphix Engine - Demo
Version Delphix Engine - Demo 4.1.6.0
Build Date Tue, 3 Mar 2015 08:33:12 GMT

Su

Posted 3 years ago


Tim Gorman, Field Services

Su,

Performance issues in Delphix are almost always related to the storage underlying the Delphix Engine, or the network between the two hosts/servers (i.e. source server and Delphix Engine).

To discover more about whether the 1 GbE network is a problem, the Delphix Engine v4.1 and above comes pre-loaded with the open source "iperf" network performance measurement tool.

To use "iperf", both sides of the network connection need to be instrumented.  You can access "iperf" within the Delphix Engine appliance using the Delphix CLI, as described in the Delphix documentation online here.  For the other side of the network connection, you can download an already-compiled executable for "iperf" to your source host/server from websites such as this, then run it from the OS command-line using the syntax "./iperf -s" (i.e. server mode).

You can run tests for network latency and for network throughput.  I am thinking that the network throughput test would be most relevant to the situation you describe.  Please feel free to share your "iperf" findings, so that we can determine whether the network is a contributor to the performance problem or not.

With regard to the performance of the storage underlying your Delphix Engine 4.1 appliance, one way is to measure the average I/O latency and overall throughput through the "Performance Analytics" graphs, as documented online here.  Another way is to corroborate the readings taken from within the Delphix "Performance Analytics" with the VMware ESX reports, particularly reports on I/O performance.

Please let us know what you find, and whether this helps you better understand how to improve the performance of these operations.

Thanks!

Su

Tim, good day.

cft-dev (10.18.11.20) - OS AIX 7.1

EN1:

LO0:

Tim Gorman, Field Services

Su,

This network shows throughput of about 340 Mb/s (~42 MB/s), which is normal for 1 GbE, though we could wish for better.
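As a quick sanity check on those units, here is a minimal Python sketch of the conversion. The helper name is hypothetical, and it assumes decimal megabits and 8 bits per byte with no protocol overhead; the 30 Mbit/s figure is the db_sync rate you reported in your first post.

```python
# Minimal sketch: convert network rates from megabits/s to megabytes/s.
# Assumption: 8 bits per byte, no protocol overhead considered.

def megabits_to_megabytes(mbit_per_s):
    """Convert a rate in Mb/s to MB/s."""
    return mbit_per_s / 8

iperf_rate = megabits_to_megabytes(340)    # rate measured by iperf
db_sync_rate = megabits_to_megabytes(30)   # rate reported for db_sync

print(f"iperf:   {iperf_rate:.1f} MB/s")
print(f"db_sync: {db_sync_rate:.2f} MB/s")
print(f"db_sync uses about {30 / 340:.0%} of the measured network rate")
```

So the db_sync is running at well under a tenth of what the network delivered in the test, which is why I want to look beyond the network as well.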

"iperf" makes network latency and throughput easy to measure, but the biggest factor in the performance of a Delphix Engine is that of the storage underlying it.  In many respects, Delphix is technically a network-attached storage appliance, so if the storage is sub-par, then the performance will be sub-par even if the network is blazing fast and wide as the Holland Tunnel.

In your Delphix Engine console, please go to Resources > Performance Analytics and show us a graph of your write I/O over the past several days.  To do this, click on the buttons indicated with the red circles and arrows in the screenshot below...

The graph has three sections, the top showing latency, the middle showing IOPS, and the bottom showing throughput.

Write I/O performance is particularly relevant for the problem you're examining because an initial DB_SYNC is primarily writing into the Delphix Engine's storage.  Don't neglect to examine the read I/O graph as well, because read I/Os are also part of DB_SYNC, and a few extremely slow read I/Os can queue up and slow processing.

It would also be useful to see similar graphs from the underlying OS or storage array.  If this is the Delphix free "Landshark" demo running on VMware Workstation or Fusion, then I'm not aware of any reporting capability, but perhaps there is something available at the OS level.

The other thing to consider is the fact that you seem to be running eight (8) RMAN channels on a demo Delphix Engine.  Considering that the default Delphix Engine in the "Landshark" demo is configured with one (1) vCPU, eight (8) RMAN channels might cause the VM to become CPU-bound.  So you will also want to examine the Performance Analytics for CPU, to see if you're hitting 100% utilization for prolonged periods of time.

Please let us know what you find?

Thanks!

Su

VMware:

Tim Gorman, Field Services

Su,

I/O performance does not appear to be a root cause.  At 3 ms latency max, it is likely that either all I/Os are being satisfied from cache, or you've got some great SSD, or both.

The CPU utilization does show spiking to 100% at times, so that might indeed be the cause of the slowdowns if the time matches up.

Another possible issue might have to do with the need for the VM container in which the Delphix Engine guest resides to have full vCPU and full vRAM reservations in ESX, which is documented online here.

If the ESX cluster in which the Delphix Engine resides is a busy one with several VM guests, and particularly if either (or both) of the source and target servers also reside within the same ESX cluster as VM guests, then resource stealing (particularly vCPU and vRAM) might occur.

From the standpoint of most VMware ESX administrators, this "stealing" of resources is expected and natural.  The problem is that the Delphix Engine VM is not like other VMs in that the Delphix Engine VM is providing I/O services to other VMs.  So, when resources are stolen from the Delphix Engine VM, it will potentially affect applications on other VMs, potentially many other VMs.

Of even more concern, if the source Oracle database resides on a VM within the same cluster as the Delphix Engine VM, is the phenomenon of the "perfect storm", where multiple factors converge to exacerbate a problem.  In a "perfect storm" involving Delphix, highly increased workloads on VMs containing related Oracle databases will steal resources from the VM containing the Delphix Engine, thus exaggerating the effects of the increased workload on both sides.

What is really insidious about a "perfect storm" is that it is difficult to detect merely by viewing metrics gathered within the VMs, which do not record the effects of resource theft, as it is largely invisible to them.  Such a situation can be detected at the ESX level only.

However, it is highly unlikely that two VMs within the same ESX cluster would have a 1 GbE network between them.  Most ESX admins would ensure that a 10 GbE virtual adapter is used, unless a 1 GbE vNIC was configured for some reason, such as testing for real-world conditions.  If this is the case, hopefully a "vmxnet3" vNIC is in use, and not an older or less capable vNIC.

Another issue might be that Ethernet "jumbo frames" are not in use.  Jumbo frames don't necessarily increase network throughput so much as decrease CPU utilization by the servers on both sides of the network connection.  Given the 100% vCPU spikes seen, implementing jumbo frames might be a useful step, as documented online here.
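To illustrate why jumbo frames mainly save CPU, here is a rough Python sketch that counts how many frames per second each side must handle at the throughput measured earlier. The MTU values of 1500 and 9000 bytes are the commonly used standard and jumbo sizes and are assumptions about your environment; the function name is hypothetical, and the calculation ignores header overhead and treats every frame as full.

```python
# Rough sketch: frames per second needed to carry a given throughput.
# Assumptions: every frame is full-sized and header overhead is ignored.

def frames_per_second(throughput_mbit_per_s, mtu_bytes):
    """Approximate number of frames/s needed to carry the given rate."""
    bytes_per_s = throughput_mbit_per_s * 1_000_000 / 8
    return bytes_per_s / mtu_bytes

standard = frames_per_second(340, 1500)  # assumed standard MTU
jumbo = frames_per_second(340, 9000)     # assumed jumbo-frame MTU

print(f"standard frames: {standard:,.0f} per second")
print(f"jumbo frames:    {jumbo:,.0f} per second")
print(f"reduction factor: {standard / jumbo:.0f}x")
```

Fewer frames for the same data means less per-frame processing work on both hosts, which is the CPU saving I am referring to.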

More general network optimization guidelines are documented online here.

To summarize, I don't think that I/O is a cause of the performance issues you are experiencing.  I think that the 1 GbE network link is a strong candidate as the root cause and I'm also concerned about the extreme spikes in CPU utilization shown.

I hope this helps?

Thanks!

Su

Good day, Tim.

The source Oracle database cft_1502 resides on LPAR1 (IBM PowerVM, a 780 machine in Datacenter1),
and
the source Oracle database cft_tmp resides on LPAR2 (another 780 machine, in Datacenter2).

The Delphix Engine VM is in Datacenter1.


db_sync for cft_tmp (in parallel with db_sync for cft_1502):

db_sync for cft_1502 (~84% done, 34+ hours elapsed):

db_sync for cft_1502: 87% (at 24.05.2015 00:05)

The job is completed (2 TB in ~37 hours).
(at 24.04.2015 00:15)

Last hour:

22-24 May: Scant!

ESX stats for the Delphix Engine VM.
CPU:

Disk:

RAM:

Net:

Tim Gorman, Field Services

Su,

Going back to the "iperf" test, we saw network throughput averaging 339 Mb/s, or about 42 MB/s.  At that raw network speed, simply transferring 2 TB should theoretically take about 14 hours, which is a little over one-third of the 37 hours you are actually observing.  Your effective rate works out to about 16 MB/s.
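For anyone who wants to check the arithmetic, here is a small Python sketch of it. It assumes binary units (1 TB = 1024 * 1024 MB), and the function names are hypothetical.

```python
# Sketch of the transfer-time arithmetic for the initial db_sync.
# Assumption: binary units, i.e. 1 TB = 1024 * 1024 MB.

MB_PER_TB = 1024 * 1024

def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb at a sustained rate in MB/s."""
    return size_tb * MB_PER_TB / rate_mb_per_s / 3600

def effective_rate_mb_per_s(size_tb, elapsed_hours):
    """Average MB/s achieved when size_tb took elapsed_hours."""
    return size_tb * MB_PER_TB / (elapsed_hours * 3600)

ideal_hours = transfer_hours(2, 42)              # 2 TB at the iperf rate
observed_rate = effective_rate_mb_per_s(2, 37)   # 2 TB in 37 hours

print(f"ideal transfer time: {ideal_hours:.1f} hours")
print(f"observed rate:       {observed_rate:.1f} MB/s")
```

This gives roughly 13.9 hours and 15.7 MB/s, which I have rounded in the figures above.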

On 10 GbE networks providing 3-4 Gbps in throughput tests, I usually observe 7 TB databases taking 7-8 hours for an initial load, which is about 1 TB/hr, or about 270 MB/s.

One more thing to consider:  how much storage has been assigned to your Delphix Engine VM?  Have you been receiving "faults" about exceeding the thresholds for 78% or 85% of total storage?  Once used storage exceeds 85%, the performance of writes (and reads) is affected, as free space for new blocks in the "copy-on-write" file system becomes scarce.  To protect against this, the Delphix Engine begins emitting warning faults at the 78% threshold, and some automatic services are disabled after the 85% threshold.
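If it helps to see where you stand against those two thresholds, here is a minimal Python sketch. The function name, the status labels, and the example capacities are hypothetical; only the 78% and 85% figures come from the description above, and this is not how the Delphix Engine itself evaluates them.

```python
# Minimal sketch: classify storage usage against the 78% and 85% thresholds.
# Assumption: the labels and example sizes below are illustrative only.

WARNING_PCT = 78   # warning faults begin at this level
CRITICAL_PCT = 85  # performance suffers once usage exceeds this level

def storage_status(used_tb, total_tb):
    """Return 'ok', 'warning', or 'critical' for the given usage."""
    used_pct = 100 * used_tb / total_tb
    if used_pct > CRITICAL_PCT:
        return "critical"
    if used_pct >= WARNING_PCT:
        return "warning"
    return "ok"

# Hypothetical example: a 3 TB pool at three different fill levels.
for used in (2.0, 2.4, 2.7):
    print(f"{used} TB of 3.0 TB used: {storage_status(used, 3.0)}")
```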

At any rate, I think the performance you are observing warrants opening a case at Delphix Support.  Can you do so?

Hope this helps...

Thanks!

Su

"..how much storage has been assigned to your Delphix Engine VM?.."

3 TB of SSDs (see the previous screenshots :)

After I created a second vDB, I launched a batch job in that vDB.
The batch job inserted about 100,000 rows, and the size of the vDB increased by ~10 GB.

But during this time the outgoing traffic (from the "Target" to the Delphix Engine VM) was much, much better than during the initial db_sync.
See the screenshot.




The Source and the Target are the same host: cft-dev.

--
"..At any rate, I think the performance you are observing warrants opening a case at Delphix Support.  Can you do so?"

Could you please tell me how to do that?
We are running a Delphix demo.
Thanks!

Tim Gorman, Field Services

Su,

So the "source" and "target" are on the same server, and therefore the same network is used for DB_SYNC as well as for VDB provisioning and operation.  It is interesting, then, that they should display different performance characteristics.

I mentioned opening a case at Delphix Support because this case and its details seem to be moving rather slowly within the sequential Q&A format of a community forum like this.  As you evidently don't have an account with Delphix Support, that doesn't seem to be a straightforward option at this time.

Would you mind taking this discussion offline by sending me email and we'll see how best we can proceed?

Thanks!