From time to time, Join nodes fail with an Out Of Memory error.
What is a Cartesian Join / Product?
A Cartesian join, also known as a Cartesian product, is a join of every row of one data set to every row of another data set. For example, if data set 1 has a table with 100 rows and is joined with data set 2, which has 1,000 rows, the Cartesian join will produce a data set with 100,000 rows.
It is the multiplication of two sets to form a third, larger data set that comprises all possible pairs of the two sets. For example, take the following two data sets and see how they form the third...
Set A = {a, b} and B = {1, 2, 3}. The Cartesian product / join is Set C which is A×B and B×A.
Set A = {a, b} Set B = {1, 2, 3}
A and B multiplied together gives us...
Set C: A×B = {(a,1), (a,2), (a,3), (b,1), (b,2), (b,3)} B×A = {(1,a), (1,b), (2,a),(2,b), (3,a), (3,b)}
To Create a Cartesian Product:
Cartesian joins can be useful when you need to produce a data set that contains every combination of two or more tables. The attached data flow (developed in 3.5.0) example uses the Join node to create the Cartesian join. By default, the Join node errors when it encounters a Cartesian join. To allow the joins to occur, you must set the "NoJoinKeyBehaviour" property to "Ignore", per the below screenshot:
The data flow takes these as two columns as inputs / data sets:
Then it uses the Join node to create the Cartesian join, which results in 25 records:
To Avoid a Cartesian Product:
Depending on the complexity of the data sets, the Cartesian joins can cause Out Of Memory errors similar to this: "buffer size cannot exceed xxxxxxxxx bytes" on Join nodes. This occurs when the expression used to join the two inputs is not unique and has large input data sets. This results in a Cartesian product of matches, for example 200 rows with a certain key value on both inputs will result in 40,000 matching output rows. Due to the way joins are implemented in Data360 Analyze, the 40,000 rows all reside in a memory buffer which is limited by a configuration property and when this is exceeded the node will error.
There are a couple of ways to avoid this limitation:
- Before joining, remove the duplicates in one or both of the data sets
- Modify the join key expression to be unique, this will lower the amount of Cartesian join matches
- Increase the size of the join memory buffer by following the steps in the Help section, under: System administration > Performance tuning > Java heap space. However, the the buffer size is limited by how much available memory there is on your computer.
Comments
0 comments
Please sign in to leave a comment.