We recommend switching to the latest versions of Edge, Firefox, Chrome or Safari. Using Internet Explorer will result in a loss of website functionality.
Our Support systems migrated on Saturday, May 21. We'll automatically forward you to the new location for this content.

Variable length string lookup

Comments

4 comments

  • Avatar
    Infogix Support
    Originally posted by: stonysmith

    Lookup is the wrong node for that. Use Join.

    As you discovered, Lookup, with a key of "1" and "DuplicateKeyBehavior" set to "Ignore" will only process the last record of the LookupKey input.

    Join or Xref will process all records. However.....
    The method works well enough for small record sets. But, if you try this on large recordsets, then it's going to blow up because Join is producing a many-to-many match (cartesian Join) and it will end up trying to match too many records at once and will blow up with a memory error.

    Look at the system parameter ls.brain.node.join.maxbuffersize to find out what your size limit is. If you're on the edge, you can increase the limit, but it's a bad idea to try to increase that size too large - i think it's top absolute limit is 4gb.

    You can mitigate this if there is some other column that you can add to the Key. Example: If you were matching addresses, then adding ZipCode to the match keys would provide some filtering on how many records that Join tries to match at one time.

    In your case, If you are only matching a small number of groups (under 1000), then you may never see the memory problem.
    0
    Comment actions Permalink
  • Avatar
    Infogix Support
    Originally posted by: linnoinen

    Ok. Data volumes aren't that big of an issue, there are only a handful of groups that need to be matched with the data. The data sets that are going to be processes have some thousands of records but are pretty small still.
    I'll try duplicating this with Xref or Join and see how that goes.

    Thanks for you help!
    0
    Comment actions Permalink
  • Avatar
    Infogix Support
    Originally posted by: linnoinen

    Hi,

    I've tried the normal JOIN node but that doesn't seem to work. The Xref works better, however it doesn't seem to want to pick up all the records.
    Does it make a difference if the incoming data has other columns too, not just the ID that I'm trying to join on?
    The Xref result seems to output the incoming data into the left orphan, nothing in the middle and the group lookup IDs into the right orphan.

    Should the Xref node be able to handle the join by default even though the IDs are different lengths?
    0
    Comment actions Permalink
  • Avatar
    Infogix Support
    Originally posted by: stonysmith

    The default Xref does not handle the type of join you want.

    Copy/paste this code into your graph:
    node:Cat
    bretype:core::Cat
    editor:sortkey=59c2723468861deb
    input:@40fd2c7476b11c42/=Duplicate_Detection.412b51604cbd4212
    input:59c272a57f733d68/out2=Find_UnMatched_Records.59c0ffe65b9646ee
    editor:bend=1=480|140
    editor:bend=1=650|140
    output:@40fd2c74676e03c3/=
    prop:ConcatenationMode=union
    editor:XY=680,100
    end:Cat
    
    node:Duplicate_Detection
    bretype:core::Duplicate Detection
    editor:sortkey=59c2711a33747151
    input:@412b51602ae908b4/=Find_UnMatched_Records.40fd2c7445835585
    output:@412b51604cbd4212/=
    output:@45770db33b314718/=
    prop:InputExpr=<<EOX
    IDs
    EOX
    editor:XY=570,100
    node:Error_if_dups_found
    bretype:::Error if dups found
    editor:shadow=45770da20c26095b
    input:@40fd2c74167f1ca2/=
    output:@40fd2c7420761db6/=
    end:Error_if_dups_found
    
    node:sort
    bretype:::Sort
    editor:shadow=45770da245f22c38
    input:@40fd2c743ebf4304/=
    output:@40fd2c746a2a3b47/=
    end:sort
    
    node:Join
    bretype:::Join
    editor:shadow=45771849435f233f
    input:@40fd2c745b6d7704/=
    input:@40fd2c74504921cd/=
    output:@40fd2c7430f76546/=
    output:@457720e45ed701a9/=
    end:Join
    
    node:Agg
    bretype:::Agg
    editor:shadow=45772d9e1f7c1166
    input:@40fd2c7427456e5b/=
    output:@40fd2c744c862db0/=
    end:Agg
    
    end:Duplicate_Detection
    
    node:Find_UnMatched_Records
    bretype:core::Lookup
    editor:Label=Find UnMatched Records
    editor:sortkey=4c98c69e673908b3
    input:@40fd2c746abc6dc7/=Dataset.40fe6c55598828e5
    input:@40fd2c74486e4494/=Fuzzy_Match.40fd2c7430f76546
    output:@40fd2c7445835585/=
    output:59c0ffe65b9646ee/out2=
    prop:DuplicateKeyBehavior=Ignore
    prop:InputKey=<<EOX
    IDs
    EOX
    prop:LookupKey=<<EOX
    IDs
    EOX
    prop:Script=<<EOX
    output 1 {
    emit 1:*
    default emit 2:*
    where matchIsFound
    }
    output 2 {
    emit 1:*
    where not matchIsFound
    }
    
    
    EOX
    editor:XY=450,100
    end:Find_UnMatched_Records
    
    node:Problem_Keys
    bretype:core::Join
    editor:Label=Problem Keys
    editor:sortkey=59c266891dda5dbe_2
    input:@40fd2c745b6d7704/=Lookup_2.40fe6c55598828e5
    input:@40fd2c74504921cd/=Lookup_2.40fe6c55598828e5
    output:@40fd2c7430f76546/=
    prop:JoinType=i
    prop:LeftInputKey=<<EOX
    1
    EOX
    prop:RightInputKey=<<EOX
    1
    EOX
    prop:Script=<<EOX
    Key1='1:IDs'
    Key2='2:IDs'
    emit Key1
    emit Key2
    emit 1:Group
    where join.match
    and Key1.left(len(Key2))==Key2 and Key1<>Key2
    
    EOX
    editor:XY=310,260
    end:Problem_Keys
    
    node:Fuzzy_Match
    bretype:core::Join
    editor:Label=Fuzzy Match
    editor:sortkey=59c266891dda5dbe
    input:@40fd2c745b6d7704/=Dataset.40fe6c55598828e5
    input:@40fd2c74504921cd/=Lookup_2.40fe6c55598828e5
    output:@40fd2c7430f76546/=
    prop:JoinType=i
    prop:LeftInputKey=<<EOX
    1
    EOX
    prop:RightInputKey=<<EOX
    1
    EOX
    prop:Script=<<EOX
    Key1='1:IDs'.left(len('2:IDs'))
    Key2='2:IDs'
    emit 1:*
    emit Key2
    emit '2:Group'
    where join.match
    and Key1==Key2
    
    EOX
    editor:XY=340,150
    end:Fuzzy_Match
    
    node:Dataset
    bretype:core::Static Data
    editor:Label=Dataset
    editor:sortkey=4c9834ee2af84d9f
    output:@40fe6c55598828e5/=
    prop:StaticData=<<EOX
    IDs:string
    162538129317
    182718476234
    148712492626
    125959262622
    162536647772
    162532342326
    162532347447
    163272727272
    189330235233
    200030040050
    300500300303
    EOX
    editor:XY=220,100
    end:Dataset
    
    node:Lookup_2
    bretype:core::Static Data
    editor:Label=Lookup
    editor:sortkey=4c98352228d53ee3
    output:@40fe6c55598828e5/=
    prop:StaticData=<<EOX
    IDs:string,Group:string
    162538129317,Group1
    162536647775,Group1
    1625323423,Group1
    16253234,Group1
    182718476,Group2
    14871249,Group3
    125959,Group4
    163,Group5
    189330,Group6
    EOX
    editor:XY=220,260
    end:Lookup_2
    0
    Comment actions Permalink

Please sign in to leave a comment.