Published September 15, 2025 | Version 1.0.0
Workflow Open

Workflow for Ensemble Outlier detection and removal

  • 1. BOKU University
  • 2. University of Twente
  • 3. Leibniz Institute of Freshwater Ecology and Inland Fisheries (IGB): Berlin, DE

Description

The workflow consists of four steps: data retrieval, data merging, precleaning (including extraction of environmental predictors), and outlier detection and removal. We illustrate it here with biogeographical datasets.

  1. Step 1: Data retrieval: Species occurrence data are obtained from open access databases such as the Global Biodiversity Information Facility. Retrieval can be limited by species or by the area of concern. The environmental variables are also retrieved from databases such as Hydrography90m, WorldClim, or Copernicus.
  2. Step 2: Data merging: If the user has locally available datasets, then these can also be merged with the open access datasets.
  3. Step 3: Precleaning and extraction of environmental predictors: The species names can be harmonized at this step to remove synonyms that may inflate the number of species in the datasets. Thereafter, values of the environmental predictors retrieved in step 1 are extracted at the coordinates where the species were recorded.
  4. Step 4: Outlier detection and obtaining quality-controlled datasets: The precleaned dataset forms the reference dataset on which outlier detection is conducted. Three groups of outlier detection methods are supported:
    1. Univariate methods, namely, Z-score, Hampel method, distributed boxplot, interquartile range, adjusted boxplot, sequential fences, reverse jackknifing, mixed interquartile range, median rule, and semi-interquartile range.
    2. Multivariate methods include the Mahalanobis method, isolation forest, k-means clustering, one-class support vector machines, local outlier factor, k-nearest neighbor, and global-local outlier score from hierarchies.
    3. Species ecological ranges, which flag records that fall outside the known ecological limits of the species.
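
The ensemble idea in step 4 can be sketched as follows. This is a minimal illustration in Python, not the workflow's own implementation: the function names, the thresholds (3 for the Z-score and Hampel methods, 1.5 for the interquartile range), the example ecological range of 0 to 30, and the rule "flag a record when at least half of the methods agree" are all assumptions made for the example. Only three of the univariate methods and the ecological range check are shown.

```python
from statistics import mean, median, quantiles, stdev


def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold (assumed default: 3)."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mu) / sd > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed default: k = 1.5)."""
    q1, _, q3 = quantiles(values, n=4)
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < lower or v > upper for v in values]


def hampel_flags(values, threshold=3.0):
    """Flag values more than `threshold` scaled MADs away from the median."""
    med = median(values)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    mad = 1.4826 * median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > threshold for v in values]


def range_flags(values, lower, upper):
    """Flag values outside a known ecological range supplied by the user."""
    return [v < lower or v > upper for v in values]


def ensemble_outliers(flag_sets, min_share=0.5):
    """Flag a record when at least `min_share` of the methods flag it."""
    n_methods = len(flag_sets)
    votes = [sum(column) for column in zip(*flag_sets)]
    return [v / n_methods >= min_share for v in votes]


# Hypothetical predictor values extracted at ten occurrence records.
temps = [10.1, 10.4, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7, 35.0]

flags = ensemble_outliers([
    zscore_flags(temps),
    iqr_flags(temps),
    hampel_flags(temps),
    range_flags(temps, lower=0.0, upper=30.0),  # assumed species limits
])

# Quality-controlled data: keep only records that the ensemble did not flag.
clean = [v for v, is_outlier in zip(temps, flags) if not is_outlier]
print(clean)
```

In this example the Z-score method alone does not flag the value 35.0, because in a sample of ten records the largest possible absolute Z-score is about 2.85, which is below the threshold of 3. The other three methods do flag it, so the ensemble vote still removes the record. This is one reason to combine several methods rather than rely on a single one.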

Please note that:

  1. If the user has already extracted environmental predictors for the species, then steps 1 to 3 are not required.
  2. If the workflow is used for general data analysis (not biogeographical data), steps 1 to 3 are also not required.

Files

Files (16.9 kB)

Checksum: md5:16d46ac114f9c0ade690a78a2297122b

PID: http://hdl.handle.net/11304/61f75c10-7244-4125-920c-af1deb2cc97c

Additional details

Related works

Is identical to
10.5281/zenodo.17119852 (DOI)
Is supplement to
10.5281/zenodo.17076781 (DOI)
Is version of
10.5281/zenodo.17119851 (DOI)

Funding

European Commission
101094434
European Commission
101093985