When preparing data for analysis, it is best practice to profile your data to identify any outliers - but what is an outlier? An outlier is an observation whose value is markedly different from the other values in the sample data.
There is a range of methods for detecting possible outliers. One method that is sometimes used is the Z-Score, which indicates how far a data point lies from the sample's mean, measured in standard deviations.
That is, the Z-score of an observation is:
Zi = (Yi - Ymean) / s
where Yi is the value of the ith observation, Ymean is the sample mean and s is the sample standard deviation.
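As a rough illustration, the Z-Score formula above can be sketched in Python using only the standard library. The function name `z_scores` and the sample values are made up for this example, and the sketch assumes s is the sample standard deviation (n - 1 denominator).

```python
import statistics

def z_scores(values):
    """Return the Z-Score of each observation (hypothetical helper)."""
    mean = statistics.mean(values)
    # Assumption: s is the sample standard deviation (n - 1 denominator).
    s = statistics.stdev(values)
    # Note: if every value is identical, s is 0 and this division fails.
    return [(y - mean) / s for y in values]

# Made-up sample data for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(sample)

for y, z in zip(sample, scores):
    print(f"value={y:5.1f}  z={z:6.3f}")
```

Observations equal to the mean score 0, and the scores always sum to (approximately) zero.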
However, extreme values in the data can distort both the sample mean and the standard deviation, so the Z-Score can give misleading results when considering what constitutes a possible outlier.
A more robust measure of 'central tendency' (that is, what constitutes the "middle" value of the data) is the median, as the presence of extreme values has less effect on the median than on the mean.
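A small sketch with made-up numbers shows the difference in sensitivity: replacing one value with an extreme one shifts the mean considerably while the median does not move.

```python
import statistics

# Made-up data: the second list differs only in its last value.
clean = [10, 11, 12, 13, 14]
with_extreme = [10, 11, 12, 13, 140]

mean_clean = statistics.mean(clean)
mean_extreme = statistics.mean(with_extreme)
median_clean = statistics.median(clean)
median_extreme = statistics.median(with_extreme)

# The mean moves from 12 to 37.2; the median stays at 12.
print("mean:  ", mean_clean, "->", mean_extreme)
print("median:", median_clean, "->", median_extreme)
```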
The median value is used in the Modified Z-Score outlier detection method. The Modified Z-Score is defined as:
Mi = 0.6745*(Yi - Ymedian) / MAD
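where Ymedian is the sample's median value and MAD is the median absolute deviation. The constant 0.6745 is approximately the 0.75 quantile of the standard normal distribution; it scales the MAD so that, for normally distributed data, the Modified Z-Score is comparable to the ordinary Z-Score.
The median absolute deviation is defined as the median of the absolute differences between each observation and the sample median, i.e.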
MAD = median(|Yi - Ymedian|)
Iglewicz and Hoaglin recommend treating observations whose Modified Z-Score has an absolute value greater than 3.5 as possible outliers.
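The Modified Z-Score calculation and the 3.5 cut-off can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any particular tool: the function name `modified_z_scores`, the `threshold` parameter and the sample data are all hypothetical. The sketch flags on the absolute value of the score so that unusually low values are caught as well as unusually high ones.

```python
import statistics

def modified_z_scores(values, threshold=3.5):
    """Return (scores, flags) for a list of numbers (hypothetical helper).

    scores[i] is 0.6745 * (Yi - Ymedian) / MAD and flags[i] is True
    when the absolute score exceeds the threshold.
    """
    med = statistics.median(values)
    # MAD: median of the absolute deviations from the sample median.
    mad = statistics.median([abs(y - med) for y in values])
    if mad == 0:
        # Assumption: treat a zero MAD as an error rather than dividing by zero.
        raise ValueError("MAD is zero; the Modified Z-Score is undefined")
    scores = [0.6745 * (y - med) / mad for y in values]
    flags = [abs(m) > threshold for m in scores]
    return scores, flags

# Made-up sample data: six ordinary values and one extreme value.
sample = [1, 2, 3, 4, 5, 6, 100]
scores, flags = modified_z_scores(sample)

for y, m, flagged in zip(sample, scores, flags):
    print(f"value={y:4d}  modified_z={m:8.3f}  potential_outlier={flagged}")
```

For this sample the median is 4 and the MAD is 2, so only the value 100 is flagged. When more than half of the values are identical the MAD is zero and the score cannot be computed; the sketch raises an error in that case, which is a design choice made for this example.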
The attached example data flow contains a community custom node that calculates the Modified Z-Score for a selected numeric input field. The node outputs each observation's Modified Z-Score and a boolean field indicating whether the observation is a potential outlier. Any Null values are stripped out of the data before calculating the scores.
You will need Dataverse release 3.2.0 or above to import the data flow.