In Delphix data masking jobs, what is the difference between "streams" and "threads"?

  • Question
  • Updated 1 year ago
  • Answered
When running Delphix data masking jobs, there are fields to specify the number of "streams" and the number of "threads". Obviously, both control how much of the job is processed concurrently, but what is the difference? For example, suppose we have a job that updates four tables, and we specify 1 stream and 2 threads. How is that processed? How is that different from 2 streams and 1 thread? Or 2 and 2?

And when we specify how much memory (min and max) is to be used, how is that allocated?  If we specify 8GB, is it 8GB of memory divided amongst the streams and/or threads?  Or is it 8GB per stream, per thread?

Tim Gorman, Field Services

Posted 2 years ago

Hims, Employee

Official Response
Hi,
A data masking job runs on a collection of tables with algorithms applied to columns; this collection is called a rule set.
A rule set is given streams and threads. Streams are the number of tables processed in parallel, while threads are the number of concurrent updates within each individual stream.

E.g.
2 streams and 1 thread will process 2 of the tables in parallel, with 1 update thread each.
1 stream and 2 threads will process 1 table at a time, with 2 update threads.
2 streams and 2 threads will process 2 of the tables in parallel, with 2 update threads each (4 concurrent updates in total).
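
To sanity-check the arithmetic in these examples, here is a minimal sketch. It only restates the rule described above (streams = tables in parallel, threads = updates per table); the function name `concurrency` and its result keys are made up for illustration and are not part of any Delphix API.

```python
from typing import Dict


def concurrency(streams: int, threads: int, tables: int) -> Dict[str, int]:
    """Summarise the parallelism of one masking job (illustrative only)."""
    if streams < 1 or threads < 1 or tables < 0:
        raise ValueError("streams and threads must be >= 1, tables >= 0")
    # A job cannot work on more tables at once than it actually has.
    tables_in_parallel = min(streams, tables)
    return {
        "tables_in_parallel": tables_in_parallel,
        "threads_per_table": threads,
        "max_concurrent_updates": tables_in_parallel * threads,
    }


# The three combinations from the question, for a job with four tables.
for streams, threads in [(2, 1), (1, 2), (2, 2)]:
    print(streams, "stream(s),", threads, "thread(s):",
          concurrency(streams, threads, tables=4))
```

For the four-table job in the question, this gives 2, 2, and 4 concurrent updates respectively.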

Notes:
 - Not all databases support multi-threading.
 - Multiple streams are usually better than multiple threads.
 - Beyond a certain point, adding streams or threads gives diminishing returns.
 - Too many threads may cause deadlocks on a table, especially in MSSQL.
 - Table allocation to streams is currently decided upfront, not dynamically, so one stream may finish sooner than another.
 - The longest-running table determines the total time; splitting the job into sub-jobs might help.
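
The last two notes can be illustrated with a small simulation. This is a sketch under stated assumptions, not how the masking engine works internally: the round-robin assignment, the table names, and the per-table minutes are all hypothetical, since the only thing stated above is that tables are assigned to streams upfront.

```python
from typing import Dict, List


def allocate_upfront(table_minutes: Dict[str, float], streams: int) -> List[List[str]]:
    """Assign tables to streams before the job starts.

    Round-robin is an ASSUMPTION made for this sketch only.
    """
    buckets: List[List[str]] = [[] for _ in range(streams)]
    for i, name in enumerate(table_minutes):
        buckets[i % streams].append(name)
    return buckets


def stream_times(table_minutes: Dict[str, float], streams: int) -> List[float]:
    """Each stream works through its own tables one after another."""
    return [
        sum(table_minutes[name] for name in bucket)
        for bucket in allocate_upfront(table_minutes, streams)
    ]


# Hypothetical masking times in minutes; one table dominates.
tables = {"orders": 60, "customers": 10, "payments": 15, "addresses": 5}

times = stream_times(tables, streams=2)
print("per-stream minutes:", times)       # one stream finishes far earlier
print("job minutes:", max(times))         # the job waits for the slowest stream
print("lower bound:", max(tables.values()))  # never faster than the longest table
```

With these made-up numbers, one stream is busy for 75 minutes while the other is idle after 15, which is the imbalance that splitting the job into sub-jobs is meant to address.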

Memory allocation is dynamic and is handled by Java; we do not have explicit control over how much of the 8GB heap goes to which table. Kettle is well optimized to handle this, and splitting the job is the way to manage it manually if needed.

Tim Gorman, Field Services

Official Response
Thanks for the great response, Hims!

Getting more specific on the memory allocation inputs, are the values we enter passed as parameters to the Java process? For example, does entering "8192" for min and "16384" for max memory get passed along to the Java process as the parameter values "-Xms8192m -Xmx16384m"?

In other words, is each stream and each thread a separate Java process?  Or is each stream a separate Java process, and each thread a Java thread within the Java process?

For example, if we specify 8192 for min memory and 16384 for max memory, then specify 4 streams each with 4 threads, will we have 4 Java processes with the "-Xms" and "-Xmx" parameters as specified, or will we have 16 Java processes?

By the way, these questions are coming from customers for whom I'm presenting a virtualization class...