Hi,
A data masking job runs on a collection of tables, with masking algorithms applied to specific columns; this collection is called a rule-set.
A rule-set is configured with streams and threads: streams are the number of tables processed in parallel, while threads are the number of concurrent update threads within each individual stream.
For example:
- 2 streams and 1 thread: 2 tables are processed in parallel, with 1 update thread each.
- 1 stream and 2 threads: 1 table is processed at a time, with 2 update threads.
- 2 streams and 2 threads: 2 tables are processed in parallel, with 2 update threads each (4 concurrent updates in total).
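The streams/threads model above can be sketched with nested thread pools. This is only an illustration of the concurrency shape, not the engine's actual code: the table names, batches, and mask_batch() helper are made up, and the real work is handled internally by Kettle.

```python
from concurrent.futures import ThreadPoolExecutor

STREAMS = 2   # tables processed in parallel
THREADS = 2   # concurrent update threads per table

# Hypothetical tables, each split into row batches to be masked.
tables = {
    "customers": ["batch1", "batch2", "batch3", "batch4"],
    "orders":    ["batch1", "batch2"],
    "invoices":  ["batch1", "batch2", "batch3"],
}

def mask_batch(table, batch):
    # Stand-in for applying the masking algorithms and issuing UPDATEs.
    return f"{table}:{batch} masked"

def process_table(table, batches):
    # Each stream gets its own pool of THREADS update workers.
    with ThreadPoolExecutor(max_workers=THREADS) as updates:
        return list(updates.map(lambda b: mask_batch(table, b), batches))

# The outer pool is the streams: STREAMS tables in flight at once,
# so up to STREAMS * THREADS updates run concurrently (4 here).
with ThreadPoolExecutor(max_workers=STREAMS) as streams:
    for table_result in streams.map(lambda kv: process_table(*kv), tables.items()):
        print(table_result)
```

The outer pool bounds how many tables run at once; the inner pools bound the concurrent updates per table, which is why total concurrency is the product of the two settings.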
Notes:
- Not all databases support multi-threading.
- Multiple streams usually perform better than multiple threads.
- Past a certain point there are diminishing returns.
- Too many threads may deadlock a table, especially in MSSQL.
- Table allocation to streams is currently decided upfront, not dynamically, so one stream may finish much earlier than another.
- The longest-running table determines the total run time; splitting the job into sub-jobs might help.
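A back-of-the-envelope sketch of why upfront allocation can hurt. The per-table times and the round-robin scheme are made-up assumptions (the real allocator's ordering may differ); the point is that a fixed assignment can leave one stream badly loaded, and no split can beat the single longest table.

```python
# Hypothetical per-table masking times in minutes (invented for illustration).
table_minutes = {"customers": 60, "orders": 10, "audit_log": 90, "notes": 5}

STREAMS = 2

# Assumed upfront round-robin allocation: fixed before the job starts.
streams = [[] for _ in range(STREAMS)]
for i, (table, mins) in enumerate(table_minutes.items()):
    streams[i % STREAMS].append((table, mins))

# Each stream's wall time is the sum of its tables; the job finishes
# only when the slowest stream does.
stream_times = [sum(m for _, m in s) for s in streams]
total = max(stream_times)

print(stream_times)  # [150, 15] -> one stream idles for most of the job
print(total)         # 150; even a perfect split cannot go below 90 (audit_log)
```

This is also why splitting a job into sub-jobs (or separating the one huge table into its own job) can shorten the overall wall time.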
Memory allocation is dynamic and handled by Java; we have no explicit control over how much of the 8 GB heap goes to which table. Kettle is well optimized to handle this, and splitting the job is the way to manage it manually if needed.
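For context, assuming the engine runs as a standard JVM process, the only knob we have is the total heap via the usual JVM flags; how that heap is shared between tables is decided by the JVM and Kettle at runtime. The environment variable name here is an assumption for illustration; only the -Xms/-Xmx flags themselves are standard.

```shell
# Cap the total heap at 8 GB (start at 2 GB); there is no per-table setting.
# JAVA_OPTS is a common convention, not necessarily what this launcher reads.
export JAVA_OPTS="-Xms2g -Xmx8g"
```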