
Run Data Validation rules in D3S Analyze

Comments

21 comments

  • Adrian Williams

    Here is one way that mirrors your example:

    The lower Create Data node represents the imported data from your InputData.csv file and the upper Create Data node is equivalent to the imported test cases from the TestCasesForInputData.csv file.

    A new field is added to both data sets to indicate the type of record ('Test' or 'Data'), and then the records are concatenated with the 'Test' records at the top of the combined data set.

    A Transform node then extracts the details from the 'Test' records and puts them in a list.

    The set of tests is then performed on each 'Data' record. If a test fails, the node outputs a record with the details of the failed test and the original input data.
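    A minimal sketch of that per-record test logic, pieced together from the code fragments later in this thread (the 'NotNull' handling and the 'tests' list name are assumptions based on those fragments):

    # ProcessRecords script sketch: 'tests' is the list built from the 'Test' records;
    # each entry is [TestNumber, TestFieldName, TestType, TestTypeValue]
    for test in tests:
        if test[2] == 'NotNull':
            passed = in1[test[1]] is not Null   # Null is the Analyze Python binding's null value
        else:
            passed = True                       # other test types omitted for brevity
        if not passed:
            out1 += in1                         # copy the failing input record
            out1.TestNumber = test[0]           # and append the details of the failed test
            out1.TestFieldName = test[1]
            out1.TestType = test[2]
            out1.TestTypeValue = test[3]
            node.write(0, out1)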

    See the attached example data flow (requires Analyze v.3.6.x).

    Attached files

    DQ_Validation - 29 Apr 2021.lna

  • Scott Perry

    Hi Adrian, this is excellent, thank you! Much appreciated.

    I have a related query. It seems a very similar use case to the above, but as I am very new to D3S I would appreciate your help again.

    Before applying the TestCasesForInputData.csv file I first need to filter my input file using a similar config file approach, e.g. a config file called FilterConfig.csv containing rules for when to include records from InputData.csv. For example, the rules below would keep the two records from the InputData.csv example file above.

    FilterNumber,FilterFieldName,FilterType,FilterTypeValue
    Filter_01,TradeId,NotNull,n/a
    Filter_02,BuySellFlag,CheckInEnum,"B, S, X"
    Filter_03,TradeDate,DateGreaterThan,20-01-2021

    Please could you provide a similar example D3S dataflow?

    Much appreciated

    Scott

  • Adrian Williams

    I'm not sure I entirely understand the details of what you are trying to achieve, but it can probably be implemented using a Join node that matches the required fields in the Filter data with the corresponding fields in the Test data:

    The Join node is configured with the default JoinMode (Inner) and the required MatchKeys:

    The node's FieldList property is configured to exclude all fields in the Left data set from the output. In this case the fields were excluded en bloc (click on the green '+' button to set the include/exclude mode), but you could have individually unchecked each field.

    In this example the *TypeValue fields were used as matches. However, as the values in your Test data and Filter data do not match, these tests would not have been output by the Join node. In the attached data flow I tweaked the filter data so the tests would be output, but you may decide to only use the *Number and *FieldName fields in your Join MatchKeys criteria. The remainder of the data flow is as before.

    Please note that the example is only intended as a proof of concept and there is no validation of the preconditions for the tests being performed, e.g. whether the data type of the selected field is appropriate for the test being performed, etc. The values for the test criteria and the input data (derived from the .csv files) are all assumed to be strings, and the logic that performs the tests makes that assumption.

    Attached files

    DQ_Validation_w_Filter - 1 May 2021.lna

  • Scott Perry

    Hi Adrian

    Thanks for this filter example. 

    Going back to the test cases, I am trying to add a 'Script' test case where the test config file can contain a script. So in your "Perform DQ Validation Tests" node I added the following code, which works:

    elif test[2] == 'Script':
        if eval("""in1[test[1]] is not Null and len(in1[test[1]]) > 0"""):
            pass
        else:
            out1 += in1
            out1.TestNumber = test[0]
            out1.TestFieldName = test[1]
            out1.TestType = test[2]
            out1.TestTypeValue = test[3]
            node.write(0, out1)

    but when I replace it with the following code it does not work. There is no error; it just does not evaluate.

    elif test[2] == 'Script':
        if eval(test[3]):
            pass
        else:
            out1 += in1
            out1.TestNumber = test[0]
            out1.TestFieldName = test[1]
            out1.TestType = test[2]
            out1.TestTypeValue = test[3]
            node.write(0, out1)

     

    test[3] is the TestTypeValue from your test cases input data "TestCasesForInputData" (which I switched to a CSV delimited input file to allow for escaping of double quotes). For some reason it does not evaluate the string from TestTypeValue, i.e. test[3].

    TestNumber   TestFieldName   TestType   TestTypeValue
    TEST_05      LEI             Script     """in1[test[1]] is not Null and len(in1[test[1]]) > 0"""

    Please can you advise how to get this to work?

    Much appreciated!

    Scott

  • Adrian Williams

    After a lot of head scratching, it seems to be a quirk of the eval() function when presented with a string that would normally be considered a multi-line string literal - i.e. text surrounded by three sets of double quotes. If the string value for the script only has a single set of double quotes in the Create Data node (i.e. there are no quote characters in the string when the value is viewed in the data viewer) then it works!
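    A quick way to see the quirk in plain Python (a sketch, separate from the Analyze node code): when the value still contains the triple quotes, eval() sees a string literal rather than an expression, so it just returns the inner text - and a non-empty string is truthy, so the test never actually runs:

    x = ''
    expr = "len(x) > 0"            # no embedded quotes: the comparison is evaluated
    quoted = '"""len(x) > 0"""'    # embedded triple quotes: just a string literal
    print(eval(expr))              # False
    print(eval(quoted))            # prints 'len(x) > 0' - the inner string, which is truthy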

     

     

    Attached files

    DQ_Validation_w_Script - 26 May 2021.lna

  • Scott Perry

    Thanks Adrian, you're a legend! I've been down lots of rabbit holes trying to fix this, so I'm glad it turned out to be doable. Thanks again! :-)

  • Scott Perry

    Hi Adrian

    Apologies, one more hopefully small question from a D3S newbie. I am trying to assign the output field name for my field called 'LEI'. out1.LEI works, but how do I pull the field name from my test input data so it's dynamic based on the test being executed? I tried out1.test[1] but that doesn't work. Please can you advise?

    out1 += in1
    out1.LEI = eval(test[5])       # This works
    out1.test[1] = eval(test[5])   # This doesn't work

    Thanks in advance

    Scott

  • Adrian Williams

    Unfortunately that will not be possible. You can dynamically name a field using a property value (this could be a property on the node itself, a parent Composite node, a Data Flow property or a Run property), or based on something derived from the node's input metadata. However, you cannot define the name of an output field based on a value derived from something in an input data record. This is because the output metadata is determined when the Transform node's ConfigureFields script is run. This script runs when the node starts execution and this happens before the first record is read - so no values from the input record are available at this time.

    Here is an example of using a property value:

    A custom property has been defined on the node ('OutputFieldName'). The ConfigureFields script uses a textual property substitution to obtain the value set on the OutputFieldName property and assigns the value to the 'newFieldName' variable. If no value has been set on the property then the default value is used instead ("_DefaultName"). The out1[<string>] method is used to specify the name of the field and the value of the newFieldName variable will be used as the name of the output field. The newFieldName variable's value is also used in the ProcessRecords script when assigning a value to the required output field.
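    Here is a sketch of the two scripts, assuming a custom node property named 'OutputFieldName'; the '{{^OutputFieldName^}}' textual substitution syntax is an assumption, so check the exact form for your version:

    # ConfigureFields script
    newFieldName = '{{^OutputFieldName^}}' or '_DefaultName'   # fall back to the default if the property is unset
    out1 += in1
    out1[newFieldName] = str            # declare the dynamically named output field

    # ProcessRecords script
    out1 += in1
    out1[newFieldName] = 'some value'   # newFieldName is still in scope here
    node.write(0, out1)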

    The following example shows how an aspect of the input metadata can be used to dynamically set an output field name:

    The ConfigureFields script (starting at line 10) checks whether there is a field called 'color' in the input metadata. If it is present, the 'dynamicName' variable is set to the specified value and used to create the new output field. If there is no 'color' field in the input metadata, a different value is assigned to the dynamicName variable and, again, this is used to create the new output field. The dynamicName variable is available when the ProcessRecords script runs, so it can be used when assigning a value to the output field when processing records (as described in the previous example).
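    A sketch of that ConfigureFields logic (the membership test on in1 and the two name values are illustrative assumptions):

    # ConfigureFields script
    if 'color' in in1:                  # is there a 'color' field in the input metadata?
        dynamicName = 'colorDetails'
    else:
        dynamicName = 'otherDetails'
    out1 += in1
    out1[dynamicName] = str             # declare the dynamically named output field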

    Please see here for further information on the Transform node:

    https://support.infogix.com/hc/en-us/articles/360018962334-Using-the-Python-based-Transform-Node

    And the Python scripting > API and script bindings topic in your Help documentation, or the online documentation for the latest version of the product, here:

    https://d3sa-preview.infogixsaas.com/docs/dist/help/Content/f-script/python/python-api.htm

     

    You can use the Execute Data Flow node to pass values calculated in one data flow into a 'child' data flow. The values that are passed in to the child data flow are part of the state of the child data flow and are treated in a similar way to other Run properties. See the 'Running with data driven properties' section of the Execute Data Flow node's Help:

    https://d3sa-preview.infogixsaas.com/docs/dist/help/Default.htm#e-node-help/Interfaces_and_Adapters/execute-data-flow.htm

  • Scott Perry

    Hi Adrian

    Thanks for the reply. Thankfully my output field name list will always exist beforehand, so the fields will always be on the input and therefore the output file, and I don't need to create a new field name dynamically (apologies, I should have been clearer on that). I had a play around just now and seem to have a solution using the exec function, as follows. Please let me know if there is a better way of doing it?

    out1.LEI = '123ABC'        # This works fine (obviously)

    out1.test[1] = '123ABC'    # As we saw, this doesn't work even though test[1] == "LEI"

    Using the exec function:

    tmp = "out1." + test[1] + " = " + "'123ABC'"   # note the inner quotes so the value is a valid string literal

    exec(tmp)                  # this seems to work fine

    Kind regards

    Scott

  • Scott Perry

    Hi Adrian

    I am just expanding the types of tests I will be able to execute from my test configuration file to now include an if/else script. I can get an if statement to work directly in an exec statement in the code, but not an if/else statement.

    For example

    This works:

    exec("if in1[test[1]] is not Null and len(in1[test[1]]) > 0:Tmp = 'BUYBUY' ") 

    This doesn't work:

    exec("if in1[test[1]] is not Null and len(in1[test[1]]) > 0:Tmp = 'BUYBUY' else: Tmp = 'SELLSELL' ")

    Please can you help me out with how to do that?

    Thanks in advance

    Scott 

  • Adrian Williams

    You may want to try restructuring the if/else statement as a ternary operator assignment:

    https://www.webucator.com/how-to/how-do-ternary-operator-assignment-python.cfm
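    For example, the failing statement above could be rewritten as a single conditional expression (a sketch reusing the names from your exec string):

    exec("Tmp = 'BUYBUY' if in1[test[1]] is not Null and len(in1[test[1]]) > 0 else 'SELLSELL'")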

     

  • Scott Perry

    Sounds good, thanks. I'll give it a try.

  • Scott Perry

    Hey Adrian, just to let you know that the ternary operator assignment worked. Thanks again!

  • Scott Perry

    Hi Adrian

    Hope you had a nice long weekend!

    Another query for you... after the "Perform DQ Validation Tests" node I need to split the output into multiple outputs at runtime, split by TestFieldName. Any suggestions? The split would need to be dynamic at runtime, i.e. not hardcoding the test field names in a split node.

    I should also add that once the split is done by TestFieldName, I need to be able to filter each split output using a config file (which I can see is possible by feeding into the Change Metadata node).

    Thanks in advance!

    Scott 

    P.S. I just found this reference from some years ago whereby it seems a solution was made, but I can't open the example file, perhaps as it's an old .brg file from six years ago.

    https://support.infogix.com/hc/en-us/community/posts/360051778573-Dynamic-Outputs-Creating-multiple-dynamic-output-files-from-one-data-source-

     

  • Adrian Williams

    There is no means to have a truly dynamic split because you need to manually create the output pins of the node. If you know the number of unique tests (i.e. test name values) then you can create a Transform node with N + 1 outputs and steer records to one of the outputs based on the value of the TestFieldName in the input records:
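    As a sketch, the steering logic in the ProcessRecords script might look like this, assuming three known TestFieldName values and a fourth catch-all output (the field values are illustrative):

    # ProcessRecords script: out1..out4 correspond to outputs 0..3
    if in1.TestFieldName == 'TradeId':
        out1 += in1
        node.write(0, out1)
    elif in1.TestFieldName == 'BuySellFlag':
        out2 += in1
        node.write(1, out2)
    elif in1.TestFieldName == 'TradeDate':
        out3 += in1
        node.write(2, out3)
    else:
        out4 += in1                     # unrecognised records go to the final output
        node.write(3, out4)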

    If the value did not match, the record would be steered to the final output and the number of records counted in an Aggregate node:

    If there were unrecognised records, this could be used to generate an error.

    Attached files

    DQ_Validation_Results_Split - 2 Jun 2021.lna

  • Scott Perry

    Thanks. I also found this previous post:

    https://support.infogix.com/hc/en-us/community/posts/360051778573-Dynamic-Outputs-Creating-multiple-dynamic-output-files-from-one-data-source-

    There is a .brg file at the link that has a "Node 2 (Output To Multiple Excel Files By Group)" that sounds like it does what I need, but I can't open the .brg file as I don't have Lavastorm. Was "Output To Multiple Excel Files By Group" a Lavastorm node that is no longer available?

  • Scott Perry

    Hi Adrian

    Just to let you know I was able to import the .brg file, but first had to open the file in a text editor and change BRGVersion:5 to BRGVersion:6. Once I updated the "RootFileName" path location to my C drive it works! Just need to see if I can enhance it for my use case now.

    Scott

  • Adrian Williams

    The .brg file is for the legacy Lavastorm LAE product. As you note, only version 6 .brg files can be imported into Analyze. However, version 5 files are not guaranteed to work when imported, so your mileage may vary.

    I'm not sure I can recommend its use unless you are familiar with the Lavastorm product, as the nodes use the legacy scripting language.

    It wasn't clear from your earlier messages, but if your use case is to segment a data set so it can be written to multiple files, you can achieve this in a simpler manner - provided the output file format is a CSV/delimited file. The Output CSV/Delimited node can be configured to obtain the output file name from an input field by using the 'From field' source option on the Filename property:
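    For illustration, each record would carry a field holding its target file name, along these lines (the field name and paths are made up):

    TestFieldName,Filename
    TradeId,C:/output/TradeId_failures.csv
    BuySellFlag,C:/output/BuySellFlag_failures.csv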

    See the attached example.

    Attached files

    Split_Data_Into_Multiple_CSV_Files - 3 Jun 2021.lna

  • Scott Perry

    Hi Adrian

    Just exploring the suggestion you made in the graph above for splitting data into multiple CSV files. I can see this works well in your example, but I have an additional requirement to also specify particular fields in each output file, which I am able to do in the do-while loop of the other graph using a Set Metadata node before creating the output file.

    Is there any way to use the Filename 'From field' approach but also drive the output fields per file based on data in the input file?

    Thanks

    Scott

  • Adrian Williams

    Hi Scott, 

    No, you cannot subset the fields that are output in the CSV files on a per-file basis; all of the fields will be output to each of the CSV files.

  • Scott Perry

    Okay, thanks.

