DB Execute running too long; need to find a new way to solve this!
Hi Lavastorm Community,
Here is my situation:
An input CSV file containing 700k records (2 columns: SalesAgent ID and SalesOrder ID) was HashSplit into 10 outputs, each automatically containing around 70k records. I then linked each of the 10 HashSplit outputs to its own DB Execute node, each running a sqlSelect script command against the same Oracle database table "SALES_ORDER" (which contains 500 million records).
DB Execute Node
Script: sqlSelect(1,"{{^Query^}}",SalesAgent_ID,SalesOrder_ID)
Query: select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where NN_Agent_No = :1 and NN_Order_Item_No = :2
Problem Statement
Each of the 10 DB Execute nodes runs far too long (almost 20 hours) just to determine whether each ID pair in the CSV list exists in the database. The sqlSelect apparently executes once per input record, so that is roughly 70k individual queries per node.
I need to find a better way to do this. If I try a While Loop node, each 70k batch still takes just as long, since it appears to bulk-feed all 70k records into the DB Execute node.
However, when I feed, say, a single record from the CSV list into the DB Execute node, it returns quickly, and I can't find a way to do this with the While Loop node.
Any advice or ideas to resolve this issue would be appreciated.
-
You could try changing your query from = to IN. Oracle allows you up to 1,000 entries in an IN condition; I believe SQL Server has no limit, and other databases will vary in what they allow. This way you would perform only 1 query per thousand entries (if it's Oracle), as opposed to 1,000 individual queries. Afterward you could filter on the NN_Order_Item_No field to limit your dataset to exactly what you want, or perhaps you could incorporate it into the query; I'm not sure how 2 IN conditions in a query would behave.
e.g. select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where NN_Agent_No in (1,2,3,4,5)
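My guess is that two IN conditions would match any combination of the listed values, not just your original pairs. Something like this (hypothetical values):
select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where NN_Agent_No in (101, 102, 103) and NN_Order_Item_No in (9001, 9002, 9003)
This would also return unwanted combinations such as (101, 9002), so you would still need to filter for exact pairs afterward.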
-
Hi Gerry,
I am aware of the 1,000-entry limit per query. The problem is that if I change from "=" to "IN", the query in the DB Execute node will still run too long because the table has more than 500 million records. The only way seems to be to loop, feeding 1 record at a time to the DB Execute node to produce the output. So far I have not found any loop that can feed records in one by one... There must be a way to process each record going into the DB Execute node. -
Sorry, yes, you are correct! I forgot the scenario I described would require a loop for DB Execute. I ran into this issue before and replaced it with a JDBC Query. Here's a library node I built for this purpose; I believe it will execute quicker than your current setup.
Also, has a DBA added any indexes to those columns?
Attached files
JDBC IN Condition Query - Library Node.lna
-
I would recommend you consider upgrading, but in the meantime I recreated the solution in 3.4.3.
Attached files
example.png
JDBC In Condition - v3.4.3.lna
-
Hi Mr. Gerry,
Appreciate your assistance. Currently the JDBC class is not available, since I can't simply upload Oracle JARs or related files to the Linux server.
I'm looking for a way to use a loop, but I'm not familiar with how it works when looping based on the records fed into it. Say the source contains 70k records: I would need to loop 70k times, feeding one record into the DB Execute node per iteration, which I believe would return its result very quickly even while looping 70k times.
The While Loop sample is not much help with this. -
I don't understand why you can't use JDBC, as the Oracle JDBC driver is shipped with Analyze. Maybe you experienced a version compatibility or other issue? In any case, if you want to continue with OCI and DB Execute, then rather than using a loop (which will work, but will take more time), you can use the same principle in the data flow and simply replace the JDBC Query node with a DB Execute node. I have added an example to the data flow.
Attached files
example_dbexecute.png
JDBC and DB Ececute In Condition - 3.4.3.lna
-
Hi Mr. Gerry,
My aim in handling the 70k records is for the input to feed only 1 record (containing the 2 columns) each time the DB Execute node runs, then loop again for the remaining 69,999 records.
e.g. select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where (NN_Agent_No, NN_Order_Item_No) in ((:1, :2))
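If Oracle accepts several pairs in one IN list, I imagine it would look something like this (hypothetical values):
select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where (NN_Agent_No, NN_Order_Item_No) in ((101, 9001), (102, 9002), (103, 9003))
-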
There is no looping in the example, which is why it should be quicker than using a Do While node in the process. The data flow I provided only allows one SqlQueryFieldBinding value, not 2 like you want, but I don't see that as a major issue for your situation. It will group the 1 column into groups of 1,000, so in the case of 70k rows it will execute 70 queries. You can then use a Lookup or Merge node to filter out the results you don't need, using the 2 columns you want in your query. Your query using one column in the data flow I attached would look like this:
select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where NN_Agent_No in {{^InList^}}
The output of this query will contain more data than you want, so you would follow it directly with a Lookup node and add a second pin to split matches from non-matches. The matches found are then the dataset you actually want. From past experience doing a similar thing, I found the total execution time was brought down dramatically.
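To make that concrete, the substituted query for one batch would look something like this (hypothetical values; {{^InList^}} expands to up to 1,000 entries):
select NN_Agent_No, NN_Order_Item_No from SALES_ORDER where NN_Agent_No in (101, 102, 103)
The Lookup node that follows then keeps only the rows whose (NN_Agent_No, NN_Order_Item_No) pair appears in your CSV list.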
-
Hi Mr. Gerry,
I performed HashSplit 3 times, splitting the full set of 900k records down through 90k >> 9k >> 900.
Each DB Execute node now runs its query with fewer than 1k records on average.
This cut the processing time from 15 hours down to 4 hours. Appreciate your advice on the 1,000 records per query and on changing the SQL condition from "=" to "IN".
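If my math is right, that works out to roughly 900,000 / ~900 = about 1,000 IN-list queries in total, instead of one query per record as before.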
At this point, the issue has been resolved.
Thank you very much.