The Excel File node allows you to read in data from an Excel worksheet and defaults to reading the first worksheet from cell A1.
However, there are times when we want to read in data from an Excel workbook that differs from this - for example, only include up to a particular row or perhaps it is a formatted report and we want to exclude the header rows.
This article is intended to help you get started by showing you three examples on how you can use the WorkbookSpec property in the Excel File node to take control of the data that you are reading in from an Excel workbook.
More detailed information can be found within the help under: Node help > Input Connectors > Excel File
Download the Example Files
The attachment at the bottom of this article contains the examples referenced including the data:
- WorkbookSpec Examples - 1 Sep 2020.lna
- CPI 2015_data.xlsx
- Gov UK Recorded Crime 1989 to 2002.xls
- obes-phys-acti-diet-eng-2016-tab.xlsx
Download the file and extract to the location of your choosing. Once the files have been extracted, continue with the instructions below.
Importing the Example File
Import the example data flow 'WorkbookSpec Examples - 1 Sep 2020.lna' by following the instructions in the section titled "Importing and exporting data flows and library nodes" within the "Importing and exporting documents" help section.
Example 1
File: | CPI 2015_data.xlsx |
Name: | Transparency International - Corruption Perceptions Index 2015 |
Source: |
Data originally pubished by Transparency International |
In the workbook 'CPI 2015_data.xlsx', there is one worksheet called ‘CPI 2015’ containing data that starts in cell A2:
To successfully read in this table of data into Data360 Analyze, we need to specify that the column headers are on row 2. We don't need to specify that the data starts on row 3 as the tool will default to looking for data on the next row from the header row.
- Add an Excel file node to your canvas and name it “Transparency International - CPI 2015”
- In the Properties panel, populate the File property with the location of the file CPI 2015_data.xlsx. You can use the file picker to select the file.
- Further down in the Properties panel, populate the WorkbookSpec property with the following:
<workbook>
<sheet headerRow="2" />
</workbook>
It will look like this in Data360 Analyze:
This indicates that the column headers of the data should be taken from row 2 of the worksheet. We don’t need to specify the worksheet as there is only one and thus no ambiguity. We also don’t need to specify that the data starts from row 3 as it defaults to taking the row following the header row.
Note: We always need to start the WorkbookSpec property with <workbook> and end with </workbook>.
Example 2
File: | Gov UK Recorded Crime 1989 to 2002.xls |
Name: | UK Historical Crime data 1898 - 2001/02 |
Source: | https://data.gov.uk/dataset/recorded-crime-data-1898-2001-02 |
In the workbook 'Gov UK Recorded Crime 1989 to 2002.xls', there is a single worksheet containing a formatted data table with column headers on row 6 and the data starting on row 8. Further, the data starts in the 2nd column i.e. column B. The last year of data is on row 113 and there are table notes below that row. We only want to read in the column headers on row 6 and the yearly data on rows 8 through to 113.
- Add an Excel file node to your canvas and name it “UK Historical Crime Data”
- In the Properties panel, populate the File property with the location of the file 'Gov UK Recorded Crime 1989 to 2002.xls'
- In the Properties panel, populate the WorkbookSpec property with the following:
<workbook>
<sheet headerRow="6" dataStartRow="8" dataEndRow="113" dataStartColumn="2" /></workbook>
It will look like this in Data360 Analyze:
This indicates that the column headers of the data should be taken from row 6 of the worksheet using the headerRow tag and that the actual data should be taken from row 8 and up to row 113 using the dataStartRow and dataEndRow tags. We also specify that the column to start is column 2 using the dataStartColumn tag.
Note that the row and column values are specified with double-quotes.
Example 3
File: | obes-phys-acti-diet-eng-2016-tab.xlsx |
Name: | UK Statistics on Obesity, Physical Activity and Diet - England 2014/15 |
Source: | https://data.gov.uk/dataset/statistics_on_obesity_physical_activity_and_diet_england |
In the workbook 'obes-phys-acti-diet-eng-2016-tab.xlsx', there are 14 different tables across 14 different worksheets tabs.
Let’s say we want to read in the data for Table 1 and Table 2 but keeping the data as separate tables. We can do this by adding an additional output pin to the Excel File node.
- Add an Excel file node to your canvas and name it “UK Statistics on Obesity, Physical Activity and Diet”
- In the Properties panel, populate the File property with the location of the file obes-phys-acti-diet-eng-2016-tab.xlsx
- In the Define tab of the Properties panel, scroll down to the 'Outputs' section of the properties and change the name of “out1” to “Table 1” then enter a new output by typing “Table 2” on a new line.
This adds an additional output pin to the Excel file node and if you hover over the output pins you will see them labelled accordingly:
- Navigate back to the Configure tab of the Properties panel. Populate the WorkbookSpec property with the following:
<workbook>
<sheet name="Table 1" headerRow="4" dataStartRow="6" dataEndRow="16" outputIndex="1"/>
<sheet name="Table 2" headerRow="4" dataStartRow="6" dataEndRow="16" outputIndex="2"/>
</workbook>
It will look like this in Data360 Analyze:
Line 2 indicates that from the worksheet named 'Table 1', referenced by the name tag, we want to read in column headers from row 4 using the headerRow tag and data from rows 6 through to 16 using the dataStartRow and dataEndRow tags. Further, we specify that we want this data to be output to the first output pin of the Excel file node using the outputIndex tag. Line 3 is similar but instead we are referencing the worksheet named 'Table 2'. - When you run the Excel file node and view the output, you will see that the second column of data is all null values due to the formatting of the Excel worksheet and that the Excel File node has auto-named the empty column as “Column2”. As this column contains no useful data, we can simply exclude ‘Column2’ using a Transform node.
Add a Modify Fields node to the top output of the Excel file node and un-check 'Column2' to exclude it. It will look like this in Data360 Analyze:
Do the same for the second output of the Excel node by simply copy and pasting the Modify Fields node.
Comments
0 comments
Please sign in to leave a comment.