Excluding Nulls from Profiling

  • Question
  • Updated 7 months ago
  • Answered
Is there a way to exclude null values from being profiled? I know we can set the sample size (number of rows), and we can update the property file to exclude certain data types from profiling, but is there an option to exclude null values from the profiler's sample set?

One way of doing it is specifying the condition in the rule set filter, but since I have 5,000+ tables, each with multiple columns containing a lot of null values, it is difficult to add this filter for each one.

So, is there an easier way to exclude nulls from my profiling sample?
Mayank Ahluwalia

Posted 7 months ago

Mouhssine SAIDI
Hi Mayank,

As the profiler acts on rulesets, you can just add a filter to your rule set to exclude null values. However, this can become complicated and require more rulesets and profile jobs if you have a large number of columns.
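Since maintaining an IS NOT NULL filter by hand across 5,000+ tables is impractical, one pragmatic approach is to generate the filter clauses programmatically from the column list. A minimal sketch of the idea, using Python with sqlite3 as a stand-in database (this is illustrative only, not Delphix ruleset code; the table and column names are made up):

```python
import sqlite3

def not_null_filter(columns):
    """Build a rule-set-style filter that excludes rows where the
    profiled columns are null."""
    return " AND ".join(f"{c} IS NOT NULL" for c in columns)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s_contact (first_name TEXT, phone_num TEXT)")
con.executemany("INSERT INTO s_contact VALUES (?, ?)",
                [(None, None), (None, "555-0100"), ("Ada", "555-0199")])

# Sample only rows that actually carry a value in the profiled column.
flt = not_null_filter(["phone_num"])
rows = con.execute(f"SELECT phone_num FROM s_contact WHERE {flt}").fetchall()
print(rows)  # only the two non-null phone numbers survive the filter
```

The same generated clause could then be attached to each ruleset rather than typed in by hand.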

Regards,

Mouhssine
Mayank Ahluwalia
Thanks Mouhssine. Since we have a Siebel application, this is exactly the issue: a huge number of tables and columns.
Gary Hallam, Official Rep
Hi Mayank,

We have opened a conversation with the engineering team, as this is clearly functionality that would improve the profiling of Siebel.  We also encountered this issue, and whilst there is a work-around, it creates additional manual effort, which is undesirable.  Hopefully we can get this feature added to a future release.

It stands to reason that there are always a number of manual checks needed to find sensitive data as well as to check for profiler false positives and false negatives.  The objective is to reduce these to a minimum by creating a bespoke profiler, as I imagine you are doing.  There are a number of improvements that the engineering team are working on as part of the ongoing product development.

One thing that we did with the Siebel subject matter experts was to pre-process the data to identify potentially sensitive tables and columns, and to remove tables that were unused or not needed.

For instance, the following check of USER_TAB_COLUMNS for columns with a high number of distinct values was useful in narrowing down the search for sensitive data:

SELECT TABLE_NAME, COLUMN_NAME FROM USER_TAB_COLUMNS WHERE NUM_DISTINCT > 1000;

This could be further refined by searching for specific column names:

SELECT TABLE_NAME, COLUMN_NAME FROM USER_TAB_COLUMNS WHERE NUM_DISTINCT > 1000 AND (COLUMN_NAME LIKE '%NUM' OR COLUMN_NAME LIKE '%NAME');
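Note that the parentheses around the two LIKE conditions matter: in SQL, AND binds more tightly than OR, so without them any column ending in NAME would match regardless of the NUM_DISTINCT condition. A quick illustration of the precedence rule, using Python with sqlite3 (the same precedence applies in Oracle):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 0 stands in for "NUM_DISTINCT > 1000 is false" and the trailing 1
# for "COLUMN_NAME LIKE '%NAME' is true".
unparenthesized = con.execute("SELECT 0 AND 0 OR 1").fetchone()[0]
parenthesized = con.execute("SELECT 0 AND (0 OR 1)").fetchone()[0]

print(unparenthesized)  # 1 -> the OR branch matches on its own
print(parenthesized)    # 0 -> the distinct-count condition is enforced
```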

A further refinement might be to use regular expressions:

SELECT TABLE_NAME, COLUMN_NAME FROM USER_TAB_COLUMNS WHERE NUM_DISTINCT > 1000 AND REGEXP_LIKE(COLUMN_NAME, '^NUM');

Obviously it would be ideal if the profiler did these things automatically.  It is useful, however, that we operate on virtual databases, as this allows us to make modifications as needed to improve masking and profiling performance, and to test, rewind, and keep branched versions for test purposes.

There is some internal chatter about this feature, so hopefully we can announce some improvements for you in the near future.

Thanks for highlighting this issue and bringing it to our attention.

Regards,
Gary
Mayank Ahluwalia
Thanks Gary.
The way Siebel is configured for us, the column names unfortunately don't give an accurate representation of the data inside, so we have to do quite a bit of profiling at the data level.

Also, since the system has been through a lot of migrations, in many cases the initial few thousand rows are all nulls and the real values only start appearing later. This becomes a real problem while profiling, because the profiler tends to pick up the null values and hence doesn't match.
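The effect described here is easy to reproduce: if a profiler samples only the first N rows and the leading rows are all null (as after a migration), the sample contains nothing to match, whereas a null-excluding filter surfaces the real values. A small sketch in Python with sqlite3 (purely illustrative; not Delphix profiler code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE migrated (email TEXT)")
# After several migrations, the first few thousand rows are null...
con.executemany("INSERT INTO migrated VALUES (?)", [(None,)] * 3000)
# ...and real values only appear later in the table.
con.executemany("INSERT INTO migrated VALUES (?)",
                [("a@example.com",), ("b@example.com",)])

naive = con.execute("SELECT email FROM migrated LIMIT 100").fetchall()
filtered = con.execute(
    "SELECT email FROM migrated WHERE email IS NOT NULL LIMIT 100").fetchall()

print(all(v is None for (v,) in naive))  # True: the naive sample is useless
print(len(filtered))                     # 2: real values reach the sample
```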

I was thinking of null exclusion from a multi-fold perspective.  First, it would definitely help profile accurately.  Second, if we come across columns that are completely null but were marked sensitive due to a column name match, and we could somehow flag such columns, then during the masking run we could check that flag and skip those columns entirely, instead of the current scenario in which we check each value during algorithm execution.  This would save time on the masking run as well.
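The second idea, flagging columns that are entirely null so the masking run can skip them, could be sketched as a pre-processing pass that counts non-null values per sensitive column and emits a skip list. A hedged illustration in Python with sqlite3 (table and column names are hypothetical, not from the actual Siebel schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s_user (ssn TEXT, alias TEXT)")
con.executemany("INSERT INTO s_user VALUES (?, ?)",
                [(None, "ally"), (None, None), (None, "bob")])

# Columns flagged sensitive by a name match; some may hold no data at all.
sensitive = [("s_user", "ssn"), ("s_user", "alias")]

skip = []
for table, column in sensitive:
    (non_null,) = con.execute(
        f"SELECT COUNT({column}) FROM {table}").fetchone()
    if non_null == 0:  # COUNT(col) ignores nulls, so 0 means all-null
        skip.append((table, column))

print(skip)  # the all-null column(s) the masking run could skip
```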
Gary Hallam, Official Rep
Hi Mayank,
Yes, these are good ideas, and we have highlighted them to product management as enhancement requests for the profiler.  At the moment, pre-processing a VDB is the best way forward, and/or excluding unused tables from the owner used to connect to the Siebel database.  I'd suggest raising these issues with your Delphix representatives.
Regards,
Gary
Mayank Ahluwalia
Thanks Gary. I am consolidating all such ideas, enhancements, and problems to share with our success manager on the Delphix side.