best practice to create ruleset

  • Question
  • Updated 1 year ago
  • Answered

Hello,

I have to profile about 444 tables by applying 7 algorithms/regular expressions.

I created a single ruleset that includes all of these tables, and a profiling job that runs against this ruleset and applies all of the algorithms. The job ran for a long time, so I forced it to stop. It ended in a failure status, and I was not able to open the PDF log because the job has remained suspended.

So now I am thinking about splitting the same activity into multiple lighter ones.

I could profile multiple times, each time applying only a subset of the algorithms to a subset of the tables.

I would have to create as many rulesets as there are subsets of tables, and as many profiling jobs as the number of rulesets multiplied by the number of subsets into which I have split the algorithms/regular expressions.

Right?

So, the question:

Can I use the same ruleset and apply different profiling algorithms each time? And what happens if, during the same profiling job, one piece of data is matched by more than one regular expression, each associated with a different masking technique?

Luigi



luigidep

Posted 1 year ago

Robert Patten, Employee

Official Response
Profiling 444 tables in one job –

This is not a huge number of tables; one job/ruleset to profile them is OK for column level expressions. If using column level profile expressions, the job should complete in less than half an hour. As a best practice, you should evaluate the column names in the schema to make sure they are indicative of the data contained in the column. For example, a column containing first name data named FIRST_NAME meets this requirement. You should always use column level expressions if possible; they are much faster.
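To illustrate why column level expressions are cheap, here is a minimal Python sketch of the idea: only the column name is tested, no data is read. The domain names and regular expressions are hypothetical examples of mine, not the product's default expressions.

```python
import re

# Hypothetical column level expressions: domain -> regex applied to the
# column NAME only. These are illustrative, not the product's defaults.
COLUMN_EXPRESSIONS = {
    "FIRST_NAME": re.compile(r"first_?name", re.IGNORECASE),
    "LAST_NAME": re.compile(r"last_?name|surname", re.IGNORECASE),
    "EMAIL": re.compile(r"e_?mail", re.IGNORECASE),
}


def match_column_name(column_name):
    """Return every domain whose expression matches the column name."""
    return [
        domain
        for domain, pattern in COLUMN_EXPRESSIONS.items()
        if pattern.search(column_name)
    ]


if __name__ == "__main__":
    for name in ("FIRST_NAME", "CUST_EMAIL", "ABC_XXX"):
        print(name, match_column_name(name))
```

A descriptive name such as FIRST_NAME is picked up immediately, while a generic name such as ABC_XXX matches nothing and would need a data level expression.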

If the columns have generic names such as ABC_XXX, you will need to use data level expressions, which read the first hundred values of the column by default and match them against the regular expression text of the profiler expression being used. In the PROFILER tab of the masking UI, the expression level (column or data) appears in the “Level” column. If using many data level expressions, the profile job will run considerably longer, possibly several hours.
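As a rough sketch of what a data level check involves, the Python below samples the first hundred values of a column and tests each one against a regular expression. The sample size comes from the description above; the `min_ratio` threshold and the email pattern are assumptions of mine, since the exact rule the profiler uses to declare a match is not stated here.

```python
import re
from itertools import islice

SAMPLE_SIZE = 100  # "first hundred values of the column by default"

# Hypothetical data level expression for email-like values.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE)


def data_level_match(values, pattern, min_ratio=0.8):
    """Sample the first SAMPLE_SIZE values and test them against pattern.

    min_ratio is an assumed threshold, not documented product behaviour.
    """
    sample = [v for v in islice(values, SAMPLE_SIZE) if v is not None]
    if not sample:
        return False
    hits = sum(1 for v in sample if pattern.fullmatch(str(v)))
    return hits / len(sample) >= min_ratio


if __name__ == "__main__":
    column = ["a@b.com"] * 90 + ["n/a"] * 10 + ["x@y.org"] * 50
    print(data_level_match(column, EMAIL_PATTERN))
```

Every data level expression repeats this read-and-match step for every column, which is why such jobs take so much longer than name-only matching.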

For very large schemas (using data level expressions) it may be beneficial to break up the rulesets used for profiling into smaller chunks. Since you’ll probably want to subset a large schema for masking (to increase parallel execution) you can save a step by splitting the rulesets for your initial profiling run. Each ruleset you create will have its own inventory which will be empty (no domains or algorithms assigned) prior to running the profiler or assigning domains manually.
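If you do decide to split, the bookkeeping is simple. The Python sketch below chunks a table list into ruleset-sized groups and counts the resulting profile jobs using the arithmetic from the question (rulesets multiplied by profile sets). The table names, the chunk size of 100 and the two profile sets are placeholders of mine, not recommendations.

```python
def split_into_rulesets(tables, chunk_size):
    """Split a list of table names into consecutive chunks, one per ruleset."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [tables[i:i + chunk_size] for i in range(0, len(tables), chunk_size)]


def profile_job_count(ruleset_count, profile_set_count):
    """One profile job per (ruleset, profile set) pair."""
    return ruleset_count * profile_set_count


# Placeholder table names standing in for the 444 tables in the question.
tables = [f"TABLE_{n:03d}" for n in range(1, 445)]
rulesets = split_into_rulesets(tables, 100)  # chunk size is an arbitrary example

if __name__ == "__main__":
    print([len(r) for r in rulesets])
    print(profile_job_count(len(rulesets), 2))
```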

You can run multiple profile jobs (each with a different profile set) using the same ruleset. The profiler will overlay the domain and algorithm with each execution unless you change the ID method to “USER” in the inventory. If you are manually updating a column’s domain and/or algorithm in the profiler, you should always change the ID method to “USER” so that subsequent profile executions will not overlay the domain and algorithm you have manually selected. It is also possible that multiple expressions will be matched during profile execution. Care should be taken to ensure that the correct expression is matched and the correct domain/algorithm is applied. Default expressions/profile sets will usually not have multiple matches. It is recommended that the default expressions not be modified; instead, create your own version and test it with a regex test tool.
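The overlay behaviour can be pictured with a small Python model. This is only a sketch of the rule described above, not the product's implementation: the inventory layout, the "AUTO" label for profiler-assigned rows, and all domain and algorithm names are assumptions of mine. The second function shows one way to check your own expressions for overlapping matches before relying on them.

```python
import re


def apply_profile_run(inventory, findings):
    """Overlay profiler findings onto the inventory, skipping USER rows.

    inventory: column -> {"domain", "algorithm", "id_method"}
    findings:  column -> (domain, algorithm) proposed by one profile run
    """
    for column, (domain, algorithm) in findings.items():
        entry = inventory.setdefault(column, {"id_method": "AUTO"})
        if entry["id_method"] == "USER":
            continue  # manual choice is kept, as described above
        entry["domain"] = domain
        entry["algorithm"] = algorithm
    return inventory


def ambiguous_matches(expressions, sample_value):
    """Return the names of all expressions that fully match sample_value."""
    return [
        name for name, pattern in expressions.items()
        if pattern.fullmatch(sample_value)
    ]


if __name__ == "__main__":
    # Hypothetical custom expressions that overlap on 9-digit values.
    custom = {
        "SSN": re.compile(r"\d{3}-?\d{2}-?\d{4}"),
        "ACCOUNT_NO": re.compile(r"\d{9,12}"),
    }
    print(ambiguous_matches(custom, "123456789"))
```

If a test value comes back with more than one name, tighten the expressions until only the intended one matches.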

If your job is suspended you may want to open a support ticket to have the issue cleared. It’s possible that the ruleset may have been corrupted (rare). If the problem is not resolvable via support ticket you may have to delete the ruleset and rebuild it.
(Edited)