
%% Submissions for peer-review must enable line-numbering
%% using the lineno option in the \documentclass command.
%%
%% Preprints and camera-ready submissions do not need
%% line numbers, and should have this option removed.
%%
%% Please note that the line numbering option requires
%% version 1.1 or newer of the wlpeerj.cls file, and
%% the corresponding author info requires v1.2

\documentclass[fleqn,10pt,lineno]{wlpeerj} % for journal submissions

% ZNK -- Adding headers for pandoc

\setlength{\emergencystretch}{3em}
\providecommand{\tightlist}{
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\usepackage{lipsum}
\usepackage[unicode=true]{hyperref}
\usepackage{longtable}


% Pandoc syntax highlighting
% See https://github.com/rstudio/rticles/issues/182


% Pandoc Header
\usepackage[export]{adjustbox}
\usepackage{float}
\floatplacement{figure}{H}
\usepackage{booktabs}
\usepackage{array}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage{threeparttablex}
\usepackage[normalem]{ulem}
\usepackage{makecell}
\usepackage{xcolor}

\title{No one-size-fits-all solution to clean GBIF}

\author[1,2]{Alexander Zizka}

\corrauthor[1,2]{Alexander Zizka}{\href{mailto:alexander.zizka@idiv.de}{\nolinkurl{alexander.zizka@idiv.de}}}
\author[3]{Fernanda Antunes Carvalho}

\author[4]{Alice Calvente}

\author[5]{Mabel Rocio Baez-Lizarazo}

\author[6]{Andressa Cabral}

\author[4]{Jéssica Fernanda Ramos Coelho}

\author[6]{Matheus Colli-Silva}

\author[4]{Mariana Ramos Fantinati}

\author[7]{Moabe Ferreira Fernandes}

\author[4]{Thais Ferreira-Araújo}

\author[4]{Fernanda Gondim Lambert Moreira}

\author[4]{Nathália Michelly da Cunha Santos}

\author[7]{Tiago Andrade Borges Santos}

\author[4]{Renata Clicia dos Santos-Costa}

\author[8]{Filipe Cabreirinha Serrano}

\author[4]{Ana Paula Alves da Silva}

\author[4]{Arthur de Souza Soares}

\author[4]{Paolla Gabryelle Cavalcante de Souza}

\author[4]{Eduardo Calisto Tomaz}

\author[4]{Valéria Fonseca Vale}

\author[7]{Tiago Luiz Vieira}

\author[9,10,11]{Alexandre Antonelli}


\affil[1]{sDiv, German Center for Integrative Biodiversity Research Halle-Jena-Leipzig (iDiv), Leipzig, Germany}
\affil[2]{Naturalis Biodiversity Center, Leiden, The Netherlands}
\affil[3]{Departamento de Genética, Ecologia e Evolução, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil}
\affil[4]{Departamento de Botânica e Zoologia, Universidade Federal do Rio Grande do Norte, Natal, Brazil}
\affil[5]{Departamento de Botânica, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil}
\affil[6]{Departamento de Botânica, Universidade de São Paulo, São Paulo, Brazil}
\affil[7]{Departamento de Ciências Biológicas, Universidade Estadual de Feira de Santana, Feira de Santana, Brazil}
\affil[8]{Departamento de Ecologia, Universidade de São Paulo, São Paulo, Brazil}
\affil[9]{Gothenburg Global Biodiversity Centre, University of Gothenburg, Gothenburg, Sweden}
\affil[10]{Department for Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden}
\affil[11]{Royal Botanic Gardens Kew, Richmond, United Kingdom}


%
% \author[1]{First Author}
% \author[2]{Second Author}
% \affil[1]{Address of first author}
% \affil[2]{Address of second author}
% \corrauthor[1]{First Author}{f.author@email.com}

% 

\begin{abstract}
Species occurrence records provide the basis for many biodiversity studies. They derive from georeferenced specimens deposited in natural history collections and visual observations, such as those obtained through various mobile applications. Given the rapid increase in availability of such data, the control of quality and accuracy constitutes a particular concern. Automatic filtering is a scalable and reproducible means to identify potentially problematic records and tailor datasets from public databases such as the Global Biodiversity Information Facility (GBIF; www.gbif.org), for biodiversity analyses. However, it is unclear how much data may be lost by filtering, whether the same filters should be applied across all taxonomic groups, and what the effect of filtering is on common downstream analyses. Here, we evaluate the effect of 13 recently proposed filters on the inference of species richness patterns and automated conservation assessments for 18 Neotropical taxa, including terrestrial and marine animals, fungi, and plants downloaded from GBIF. We find that a total of 44.3\% of the records are potentially problematic, with large variation across taxonomic groups (25 - 90\%). A small fraction of records was identified as erroneous in the strict sense (4.2\%), and a much larger proportion as unfit for most downstream analyses (41.7\%). Filters of duplicated information, collection year, and basis of record, as well as coordinates in urban areas, or for terrestrial taxa in the sea or marine taxa on land, have the greatest effect. Automated filtering can help in identifying problematic records, but requires customization of which tests and thresholds should be applied to the taxonomic group and geographic area under focus. Our results stress the importance of thorough recording and exploration of the meta-data associated with species records for biodiversity research.
\end{abstract}

\begin{document}

\flushbottom
\maketitle
\thispagestyle{empty}

\hypertarget{introduction}{%
\section*{Introduction}\label{introduction}}
\addcontentsline{toc}{section}{Introduction}

Publicly available species distribution data have become a crucial resource in biodiversity research, including studies in ecology, biogeography, systematics and conservation biology. In particular, the availability of digitized collections from museums and herbaria and of citizen science observations has increased drastically over the last few years. As of today, the largest public aggregator for geo-referenced species occurrence data, the Global Biodiversity Information Facility (www.gbif.org), provides access to more than 1.5 billion geo-referenced occurrence records for species from across the globe and the tree of life.

A central challenge to the use of these publicly available species occurrence data in research is problematic geographic coordinates, which are either erroneous or unfit for downstream analyses (for instance because they are overly imprecise; Anderson et al. 2016). Problems mostly arise because data aggregators such as GBIF integrate records collected with different methodologies in different places at different times, often without centralized curation and with only rudimentary meta-data. For instance, problematic coordinates caused by data-entry errors or automated geo-referencing from vague locality descriptions are common (Maldonado et al. 2015; Yesson et al. 2007) and cause recurrent problems such as records of terrestrial species in the sea, records with coordinates assigned to the centroids of political entities, or records of species in cultivation or captivity (Zizka et al. 2019).

Manual data cleaning based on expert knowledge can detect these issues, but it is only feasible on small taxonomic or geographic scales, and it is time-consuming and difficult to reproduce. As an alternative, automated filtering methods to identify potentially problematic records have been proposed as a scalable option, as they are able to deal with datasets containing up to millions of records and many different taxa. Those methods are usually based on geographic gazetteers (e.g., Chamberlain 2016; Zizka et al. 2019; Jin and Yang 2020) or on additional data, such as environmental variables (Robertson, Visser, and Hui 2016). Additionally, filtering procedures based on record meta-data, such as collection year, record type, and coordinate precisions, have been proposed to improve the suitability of publicly available occurrence records for biodiversity research (Zizka et al. 2019).

Problematic records are especially critical in conservation, where stakes are high. Recently proposed methods for automated conservation assessments could support the formal assessment procedures for the global Red List of the International Union for Conservation of Nature (IUCN) (Dauby et al. 2017; Bachman et al. 2011; Pelletier et al. 2018). These methods approximate species' range size, namely the Extent of Occurrence (EOO, which is the area of a convex hull polygon comprising all records of a species), the Area of Occupancy (AOO, which is the sum of the area actually occupied by a species, calculated based on a small-scale regular grid), and the number of locations for a preliminary conservation assessment following IUCN Criterion B (``Geographic range''). These methods have been used to propose preliminary global (Stévart et al. 2019; Zizka et al. 2020) and regional (Schmidt et al. 2017; Cosiaux et al. 2018) Red List assessments. However, all metrics, and especially EOO, are sensitive to individual records with problematic coordinates. Automated conservation assessments may therefore be biased, particularly if the number of records is low, as is the case for many tropical species.
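
As a worked illustration of this sensitivity (a hypothetical example, not taken from our data), consider a species known from three records at planar positions \((0,0)\), \((10,0)\) and \((0,10)\), measured in km. The convex hull is a right triangle with
\[
\mathrm{EOO}_{3} = \frac{1}{2} \cdot 10 \cdot 10 = 50~\mathrm{km}^2 .
\]
Adding a single erroneous record at \((100,100)\) turns the hull into the quadrilateral \((0,0)\), \((10,0)\), \((100,100)\), \((0,10)\), whose area follows from the shoelace formula:
\[
\mathrm{EOO}_{4} = \frac{1}{2} \left| (0 \cdot 0 - 10 \cdot 0) + (10 \cdot 100 - 100 \cdot 0) + (100 \cdot 10 - 0 \cdot 100) + (0 \cdot 0 - 0 \cdot 10) \right| = 1{,}000~\mathrm{km}^2 .
\]
Considering only the EOO thresholds of Criterion B1 as we understand them (below 100 km\(^2\) for CR, below 5,000 km\(^2\) for EN, below 20,000 km\(^2\) for VU), this one record would shift the EOO-based category from CR to EN. Assuming the commonly used 2 \(\times\) 2 km grid and records in distinct cells, the AOO would change only from \(3 \times 4 = 12\) km\(^2\) to \(4 \times 4 = 16\) km\(^2\), which illustrates why EOO is the more sensitive of the two metrics.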

While automated filters hold great promise for biodiversity research, their use across taxonomic groups and datasets remains poorly explored. Here, we test the effect of automated filtering of species geographic occurrence records on the number of records available in different groups of animals, fungi, and plants. Furthermore, we test the impact of automated filtering procedures on the accuracy of preliminary automated conservation assessments compared to full IUCN assessments. Specifically, we evaluate a pipeline of 13 automated filters to flag possibly problematic records by using record meta-data and geographic gazetteers in two categories: 1) erroneous (coordinates that are likely wrong, irrespective of the downstream analyses, for instance due to data entry errors) and 2) unfit for purpose (coordinates that are not wrong \emph{per se}, but likely unfit for the planned downstream analyses, for instance because they are overly imprecise). We address three questions:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
  Which filters lead to the biggest loss of data when applied?
\item
  Does the importance of individual filters differ among taxonomic groups?
\item
  Does automated filtering improve the accuracy of automated conservation assessments?
\end{enumerate}

\hypertarget{material-and-methods}{%
\section*{Material and Methods}\label{material-and-methods}}
\addcontentsline{toc}{section}{Material and Methods}

\hypertarget{choice-of-study-taxa}{%
\subsection*{Choice of study taxa}\label{choice-of-study-taxa}}
\addcontentsline{toc}{subsection}{Choice of study taxa}

This study is the outcome of a workshop held at the Federal University of Rio Grande do Norte in Natal, Brazil, in October 2018, which gathered students and researchers working with different taxonomic groups of animals, fungi, and plants across the Neotropics (Fig. \ref{fig:species}). Each participant analysed geographic occurrence data from their taxonomic group of interest and commented on the results for their group. Hence, we include groups based on the expertise of the participants rather than following an arbitrary choice of taxa and taxonomic ranks. We acknowledge a varying degree of documented expertise and number of years working on each group. We obtained public occurrence records for 18 taxa, including one plant family, nine plant genera, one genus of fungi, three families and one genus of terrestrial arthropods, one family of snakes, one family of skates, and one genus of bony fish (Table 1).

\hypertarget{species-occurrence-data}{%
\subsection*{Species occurrence data}\label{species-occurrence-data}}
\addcontentsline{toc}{subsection}{Species occurrence data}

We downloaded occurrence information for all study groups from www.gbif.org using the \texttt{rgbif} v1.4.0 package (Chamberlain 2017) in R (GBIF.org, 2019a-p, 2020a,b). We downloaded GBIF-interpreted data including only records with geographic coordinates and limited the study area to a rectangle between 90\(^\circ\) S - 33\(^\circ\) N and 35\(^\circ\) W - 120\(^\circ\) W reflecting the Neotropics (Morrone 2014), our main area of expertise. The natural distributions of all included taxa are confined to the Neotropics except for Arhynchobatidae, Diogenidae, Dipsadidae, Entomobryidae, \emph{Gaylussacia}, Iridaceae, Neanuridae, and \emph{Tillandsia}, for which we only obtained the Neotropical occurrences. We consider GBIF data generally of high quality and use them as a case study because GBIF is the largest, most widely used, and taxonomically most comprehensive data source for species occurrence records; however, many more exist (e.g., \url{https://bien.nceas.ucsb.edu/bien/}, www.fishbase.de or Guedes et al. 2018). GBIF provides information on the internal consistency of records, including information on decimal rounding of coordinates, geographic projection, date validity, and geospatial issues (including the zero coordinates test used in this study). Since we specifically aimed to test the effect of user-level filtering, we included records flagged with issues by GBIF (this was also the default option). Geospatial issues flagged by GBIF only concerned 0.4\% of the records used in this study, and including them had the added benefit of making our results directly comparable to those from other databases, which may use different internal consistency checks or none at all.

\hypertarget{automated-cleaning}{%
\subsection*{Automated cleaning}\label{automated-cleaning}}
\addcontentsline{toc}{subsection}{Automated cleaning}

We followed the cleaning pipeline outlined by Zizka et al. (2019) and first filtered the data as downloaded from GBIF (``raw'', hereafter) using meta-data for those records for which they were available (although meta-data were often missing; Peterson et al. 2018), removing: 1) records with coordinates less precise than 100 km (as this represents the grain size of many macro-ecological analyses); 2) fossil records and records of unknown source; 3) records collected before 1945 (before the end of the Second World War, since coordinates of old records are often imprecise); and 4) records with an individual count of less than one or more than 99. Furthermore, we rounded the geographic coordinates to four decimal places and retained only one record per species per location (i.e., test for duplicated records). In a second step, we used the \texttt{clean\_coordinates} function of the \texttt{CoordinateCleaner\ v2.0-11} package (Zizka et al. 2019) with default options to flag errors that are common to biological data sets (``filtered'', hereafter). These include: coordinates in the sea for terrestrial taxa and on land for marine taxa, coordinates containing only zeros, coordinates assigned to country and province centroids, coordinates within urban areas, and coordinates assigned to biodiversity institutions. See Table 2 for a summary of all filters we used and their classification into ``erroneous'' and ``unfit''.
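
To make the duplicate criterion concrete, and assuming roughly 111 km per degree of latitude, rounding coordinates to four decimal places groups records at a north-south resolution of about
\[
10^{-4} \times 111~\mathrm{km} \approx 11~\mathrm{m} ,
\]
with the east-west extent shrinking with the cosine of latitude. The meta-data step can likewise be summarized as a retention rule for a record \(r\). The notation is introduced here for illustration only (\(u_r\): coordinate uncertainty, \(b_r\): basis of record, \(y_r\): collection year, \(n_r\): individual count); each condition is applied only where the respective field is available, and the handling of values exactly at the thresholds is our assumption:
\[
\mathrm{keep}(r) = (u_r \le 100~\mathrm{km}) \land (b_r \notin \{\mathrm{fossil}, \mathrm{unknown}\}) \land (y_r \ge 1945) \land (1 \le n_r \le 99) .
\]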

\hypertarget{downstream-analyses}{%
\subsection*{Downstream analyses}\label{downstream-analyses}}
\addcontentsline{toc}{subsection}{Downstream analyses}

We first generated species richness maps using 100 \(\times\) 100 km grid cells for the raw and filtered datasets, respectively, using the package \texttt{speciesgeocodeR\ v2.0-10} (Töpel et al. 2017). We then performed an automated conservation assessment for all study groups based on both datasets using the \texttt{ConR\ v1.2.4} package (Dauby et al. 2017). \texttt{ConR} estimates the EOO, AOO, and the number of locations, and then suggests a preliminary conservation status based on Criterion B of the global IUCN Red List. While these assessments are preliminary (see IUCN Standards and Petitions Subcommittee 2017), they can serve as a proxy that helps the IUCN to speed up full assessments. We then benchmarked the preliminary conservation assessments against the global IUCN Red List assessments for the same taxa (where available), which we obtained from www.iucn.org via the \texttt{rredlist\ v0.5.0} package (Chamberlain 2018).
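
In notation introduced here for illustration, the species richness of a grid cell \(c\) is the number of distinct species with at least one record inside the cell,
\[
S_c = \left| \{ s : \mathrm{at~least~one~record~of~species}~s~\mathrm{falls~within~cell}~c \} \right| .
\]
Under this definition, duplicated records of a species within a cell leave \(S_c\) unchanged, whereas a single misplaced record can add a species to a cell where it does not occur. This is a sketch of the general approach; the internal implementation in \texttt{speciesgeocodeR} may differ in detail.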

\hypertarget{evaluation-of-results}{%
\subsection*{Evaluation of results}\label{evaluation-of-results}}
\addcontentsline{toc}{subsection}{Evaluation of results}

Each author provided an informed comment on the performance of the raw and cleaned datasets, concerning the number of removed records and the accuracy of the overall species richness maps. We then compared the agreement between automated conservation assessments based on raw and filtered occurrences with the global IUCN Red List for those taxa where IUCN assessments were available (www.iucn.org).
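
One way to express this agreement, as a sketch in notation introduced here, is the fraction of the \(N\) species with an IUCN assessment for which the automated and the IUCN assessment fall into the same binary class (threatened: CR, EN or VU; not threatened: all other evaluated categories):
\[
A = \frac{1}{N} \left| \{ s : t_{\mathrm{auto}}(s) = t_{\mathrm{IUCN}}(s) \} \right| , \qquad t(s) \in \{ \mathrm{threatened}, \mathrm{not~threatened} \} .
\]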

We carried out all analyses in the R computing environment (R Core Team 2019), using standard libraries for data handling and visualization (Wickham 2018; Garnier 2018; Ooms 2014, 2019; Hijmans 2019). All scripts are available from a Zenodo repository (\url{doi:10.5281/zenodo.3695102}).

\hypertarget{results}{%
\section*{Results}\label{results}}
\addcontentsline{toc}{section}{Results}

We retrieved a total of 218,899 species occurrence records, with a median of 2,844 records per study group and 10 records per species (Table 3, Appendix 1). We obtained most records for Dipsadidae (64,249) and fewest for \emph{Thozetella} (51). The species with most records was \emph{Harengula jaguana} (19,878).

Our automated tests filtered a total of 97,004 records (Fig. \ref{fig:total}, erroneous: 9,254, unfit: 91,298), with a median of 45\% per group (erroneous: 0.3\%, unfit: 37.4\%). Overall, the most important test was for duplicated records (on average 35.5\% per taxonomic group). The filtering steps based on record meta-data that filtered the largest number of records were the basis of records (5.9\%) and the collection year (3.4\%). The most important automated tests were for urban area (8.6\%) and the occurrence of records of terrestrial taxa in the sea and of marine taxa on land (4.3\%, see Table 3 and Appendix 1 in the electronic supplement for further details and the absolute numbers). Only a few records were filtered by the coordinate precision, zero coordinates and biodiversity institution tests (Fig. \ref{fig:split}).
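
Relative to the 218,899 records retrieved, these totals correspond to
\[
\frac{97{,}004}{218{,}899} \approx 44.3\% , \qquad \frac{9{,}254}{218{,}899} \approx 4.2\% , \qquad \frac{91{,}298}{218{,}899} \approx 41.7\% .
\]
The two categories sum to more than the overall total. If the overall total counts each flagged record once (our reading of the numbers), then
\[
9{,}254 + 91{,}298 - 97{,}004 = 3{,}548
\]
records were flagged as both ``erroneous'' and ``unfit''.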

Entomobryidae, Diogenidae, and Neanuridae had the highest fraction of filtered records (Table 3). In general, the different filters we tested were of similar importance for different study groups. There were a few notable exceptions, including the particularly high proportions of records filtered by the ``basis of record test'' for \emph{Tityus} (7.0\%), Dipsadidae (5.6\%), \emph{Prosthechea} (5.0\%) and \emph{Tillandsia} (4.9\%), by the collection year for Dipsadidae (11.3\%), by the taxonomic identification level for \emph{Tityus} (1.6\%), by the capital coordinates for \emph{Oocephalus} (6.1\%) and \emph{Gaylussacia} (3.2\%), by the sea/land test for Diogenidae and \emph{Thozetella}, and by the urban areas test for \emph{Oocephalus} (13.3\%) and Iridaceae (12.3\%). Furthermore, Entomobryidae differed considerably from all other study taxa with exceptionally high numbers of records filtered by the ``basis of record'', ``level of identification'' and ``urban areas'' tests.

Geographically, the records filtered by the ``basis of record'' and ``individual count'' tests were concentrated in Central America and southern North America, and a relatively high number of the records filtered due to their proximity to the centroids of political entities were located on Caribbean islands (Fig. \ref{fig:split}). See Appendix 2 for species richness maps using the raw and cleaned data for all study groups.

We found IUCN assessments for 579 species that were also included in our distribution data from 11 of our study groups (Table 4, Appendix 3). The fraction of species evaluated varied among the study groups, with a maximum of 100\% for \emph{Harengula} and \emph{Lepismium} and a minimum of 2.3\% for Iridaceae (note that the number of total species varied considerably among groups). The median percentage of species per study group with an IUCN assessment was 15\%. A total of 102 species were listed as \emph{Threatened} by the IUCN global Red List (CR = 19, EN = 40, VU = 43) and 477 as \emph{Not Threatened}.

We obtained automated conservation assessments for 2,181 species in the filtered dataset. Based on the filtered data, the automated conservation assessment evaluated 1,382 species as possibly threatened (63.4\%, CR = 495, EN = 577, VU = 310, see Appendix 3 for assessments of all species). The automated assessment based on the filtered dataset agreed with the IUCN assessment for identifying species as possibly threatened (CR, EN, VU) for 358 species (64\%; Table 4). Filtering reduced the EOO by a median of 18.4\% and the AOO by a median of 9.9\% per group. For the raw dataset the agreement with IUCN was higher at 381 species (65.7\%).
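
The share of species assessed as possibly threatened follows directly from the category counts reported above:
\[
495 + 577 + 310 = 1{,}382 , \qquad \frac{1{,}382}{2{,}181} \approx 63.4\% .
\]
For comparison, the corresponding share among the 579 species with an IUCN assessment is
\[
19 + 40 + 43 = 102 , \qquad \frac{102}{579} \approx 17.6\% ,
\]
although the two species sets differ and the shares are therefore not directly comparable.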

\hypertarget{discussion}{%
\section*{Discussion}\label{discussion}}
\addcontentsline{toc}{section}{Discussion}

Automated flagging based on meta-data and automatic tests filtered a median of 45\% of the records per taxonomic group; 25.9\%-90.3\% as ``unfit'' and 0\%-44.3\% as ``erroneous''. The filters for basis of record, duplicates, collection year, and urban areas flagged the highest fraction of records (\textbf{Question 1}). The importance of different tests was similar across taxonomic groups, with particular exceptions for the tests on basis of record, collection year, capital coordinates, and urban areas (\textbf{Question 2}). The results for species richness were similar between the raw and filtered data with some improvements using the filters. We found little impact of filtering on the accuracy of the automated conservation assessments (\textbf{Question 3}).

\hypertarget{the-relevance-of-individual-filters}{%
\subsection*{The relevance of individual filters}\label{the-relevance-of-individual-filters}}
\addcontentsline{toc}{subsection}{The relevance of individual filters}

The aim of automated filtering is to identify possibly problematic records that are unsuitable for particular downstream analyses. While those records filtered as ``erroneous'' will likely cause problems for most biodiversity research, those filtered as ``unfit'' might have varying impact, depending on the type and spatial resolution of the downstream analyses. Unwanted effects include an unnecessary computational burden, which can be a bottleneck for large-scale analyses (e.g., due to duplicates; Antonelli et al. 2018), increased uncertainty (due to low precision), or completely compromised results. For instance, records assigned to country centroids might be acceptable for inter-continental comparisons, but are likely to be erroneous for species distribution modelling on a local scale. The importance of each test and the linked thresholds must be judged based on the specific downstream analyses. As our results show, it may be advisable to adapt automated tests to the geographic study area or the taxonomic study group. For instance, the high number of records flagged for centroids on the Lesser Antilles (Fig. \ref{fig:split}) suggests that the test might be overly strict there (\url{https://data-blog.gbif.org/post/country-centroids/}), although we chose a conservative distance for the political centroid test (1 km).

Several factors may explain the high proportion of records flagged as duplicates. First, the deposition of duplicates from the same specimen at different institutions is common practice, especially for plants, where collecting duplicate specimens is entirely feasible. Second, independent collections at similar localities may occur, in particular for local endemics. Third, low coordinate precision, for instance based on automated geo-referencing from locality descriptions, may lump records from nearby localities. Fourth, different data contributors might add the same record to GBIF if their sources overlap, as can for instance be the case for the Barcode of Life and Plazi databases.

\hypertarget{similarities-and-differences-among-taxa}{%
\subsection*{Similarities and differences among taxa}\label{similarities-and-differences-among-taxa}}
\addcontentsline{toc}{subsection}{Similarities and differences among taxa}

The number of records flagged by individual tests was similar across study groups, suggesting that similar problems might be relevant for collections of plants and animals. Therefore, the same filters can generally be used across taxonomic groups. However, some notable exceptions stress the need to adapt each filter to the taxonomic study group to balance data quality and data availability. The high fraction of records filtered by the ``basis of record'' filter for \emph{Tityus}, Dipsadidae, \emph{Prosthechea} and \emph{Tillandsia} was mostly caused by a high number of records in these groups based on unknown collection methods, which might be caused by the contribution of specific datasets lacking this information for these groups. The high fraction of records flagged by the ``collection year'' filter for Dipsadidae was caused by a high collection effort in the late 1880s and early 1900s, as can be expected for a charismatic group of reptiles, but also by 500 records dated to the year 1700. The latter records likely represent a data entry error: they were all contributed to GBIF by the same institution, and the institution's code for unavailable collection dates is 1700-01-01 - 2014-01-01, which has likely erroneously been converted to 1700. The high number of records flagged at capital coordinates and within urban areas for the plant groups Iridaceae and \emph{Oocephalus} might be related to horticulture, since at least some species in those groups are commonly cultivated as ornamentals. This was supported by the detailed examination of the data for Iridaceae, which showed that after filtering 1,605 records from 69 exotic species remained in the dataset, stressing the importance of addressing these species in certain taxonomic groups.

The general agreement between the species richness maps based on raw and filtered data was encouraging in terms of the use of these data for large-scale biogeographic research (Fig. \ref{fig:speciesrichness}, Appendix 2). The filter based on political centroids had an important impact on species richness patterns, which is congruent with the results from a previous study in the coffee family (Maldonado et al. 2015). Records assigned to country or province centroids are often old records, which are geo-referenced at a later point based on vague locality descriptions. These records are at the same time more likely to represent dubious species names, since they might be old synonyms or type specimens of species that have only been collected and described once, which erroneously inflate species numbers.

Overall, we consider the effect of the automated filters as positive since they identified the above-mentioned issues, increased data precision, and reduced the computational burden (Table 3, Appendix 2). However, in some cases filters failed to remove major issues, often due to incomplete meta-data. For instance, for Diogenidae we found at least two records of a species known only from Eocene fossils (\emph{Paguristes mexicanus}) which slipped through the ``basis of record'' test because they were marked as ``preserved specimen'' rather than ``fossil specimen''. Furthermore, for Entomobryidae we found that for 1,996 records the meta-data on taxonomic rank was ``UNRANKED'' despite all of them being identified to species level, leading to a high fraction of records removed by the ``Identification level'' filter. Additionally, automated filters might be overly strict or unsuitable for certain taxa. For instance, in Entomobryidae, 2,004 samples were marked as material samples and therefore removed by our global filter retaining only specimen and observation data, which in this case was overly strict.

The filters we included in this study address a set of important but relatively easy to identify problems. In fact, the internal quality control of GBIF does flag some of the problems we tested for (i.e., zero coordinates, equal lat/lon) while others might be implemented in the near future (country centroids, \url{https://data-blog.gbif.org/post/country-centroids/}). While this internal quality control is very helpful, we see a huge potential to overcome issues with data quality in a user-feedback system that allows users to provide expert assessments, i.e.~a meta-annotation of which records are being challenged (and why). Such a system would not need to change the original data and could include multiple levels to account for differing opinions.

As next steps for automated filtering, tests for intrinsic consistency and support by external data (if available) can help to detect additional problematic records. For instance, testing if records' coordinates fall within the state or province of collection noted for a record (intrinsic) or agree with external species distribution information, for example from www.iucn.org (vertebrates) or \url{https://wcsp.science.kew.org/} (selected seed plant families; extrinsic) can further corroborate the accuracy of a record's geographic referencing. If such tests are included it is essential to account for the sampling year, in particular for older records, since the names of provinces may change and the ranges of species may shift. Furthermore, while in this study we focused on meta-data and geographic filtering, taxonomic cleaning---the resolution of synonymies and identification of accepted names---is another important part of data curation, but depends on taxon-specific taxonomic backbones and synonymy lists which are not readily available for many groups and often are contradictory within individual taxa.

\hypertarget{the-impact-of-filtering-on-the-accuracy-of-automated-conservation-assessments}{%
\subsection*{The impact of filtering on the accuracy of automated conservation assessments}\label{the-impact-of-filtering-on-the-accuracy-of-automated-conservation-assessments}}
\addcontentsline{toc}{subsection}{The impact of filtering on the accuracy of automated conservation assessments}

The accuracy of the automated conservation assessment was in the same range as found by previous studies (Nic Lughadha et al. 2019; Zizka et al. 2020). The similar accuracy of the raw and filtered dataset for the automated conservation assessment was surprising, in particular given the EOO and AOO reduction observed in the filtered dataset (Table 4) and the impact of errors on spatial analyses observed in previous studies (Gueta and Carmel 2016). The robustness of the automated assessment was likely due to the fact that the EOO for most species was large, even after the considerable reduction caused by filtering. This might be caused by the structure of our comparison, which only included species that were evaluated by the IUCN Red List (and not considered as \emph{Data Deficient}) and at the same time had occurrences recorded in GBIF. Those inclusion criteria are likely to have biased the datasets towards species with large ranges, since generally more data are available for them. The robustness of automated conservation assessments to data quality is encouraging, although these methods are only an approximation of (and not a replacement for) full IUCN Red List assessments, especially for species with few collection records (Rivers et al. 2011).

\hypertarget{conclusions}{%
\section*{Conclusions}\label{conclusions}}
\addcontentsline{toc}{section}{Conclusions}

Our results suggest that between one quarter and one half of the occurrence records obtained from GBIF might be unsuitable for downstream biodiversity analyses. While the majority of these records might not be erroneous \emph{per se}, they are overly imprecise and thereby increase the uncertainty of downstream results or add computational burden to big data analyses.

While our results suggest that large-scale species richness patterns and automated conservation assessments are largely resilient to the effects of problematic occurrence records, they also stress the importance of (meta-)data exploration prior to most biodiversity analyses. Automated filtering can help to identify problematic records, but our results also highlight the necessity to customize tests and thresholds to the specific taxonomic group and geographic area of interest. The putative problems we encountered point to the importance of training researchers and students to curate species occurrence datasets, and of visibly associating user feedback with individual records on aggregator platforms such as GBIF, so that it can contribute to the overall accuracy and precision of public biodiversity databases.

\hypertarget{acknowledgements}{%
\section*{Acknowledgements}\label{acknowledgements}}
\addcontentsline{toc}{section}{Acknowledgements}

We thank GBIF and all data collectors and contributors for their excellent work. We thank Town Peterson, Roderic Page and one anonymous reviewer for helpful comments on an earlier version of this manuscript. This study enrolled participants of the workshop ``Biodiversity data: from field to yield'' led by Alice Calvente, Fernanda Carvalho, Alexander Zizka, and Alexandre Antonelli through the Programa de Pós Graduação em Sistemática e Evolução of the Universidade Federal do Rio Grande do Norte (UFRN) and promoted by the 6th Conference on Comparative Biology of Monocotyledons - Monocots VI. We thank the Pró-reitoria de Pesquisa and the Pró-reitoria de Pós-graduação of UFRN for financial support (edital 02/2016 - internacionalização). AZ is funded by iDiv via the German Research Foundation (DFG FZT 118), specifically through sDiv, the Synthesis Centre of iDiv. AA is supported by the Swedish Research Council, the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research and the Royal Botanic Gardens, Kew. FS was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, process 2015/20215-7).

\hypertarget{supplementary-material}{%
\section*{Supplementary material}\label{supplementary-material}}
\addcontentsline{toc}{section}{Supplementary material}

\begin{itemize}
\item
  Appendix 1 - Absolute number of flagged records per taxonomic group and test
\item
  Appendix 2 - Taxon-specific richness maps and comments
\item
  Appendix 3 - Full results of the conservation assessment
\end{itemize}

\newpage{}

\hypertarget{tables}{%
\section*{Tables}\label{tables}}
\addcontentsline{toc}{section}{Tables}

\newpage{}

\begin{table}

\caption{\label{tab:tabletaxa}The study groups and their taxonomy. This study includes three marine and 15 terrestrial taxa, seven of them animals, one group of fungi and ten plants, belonging to 16 different orders. The outlines illustrate the broad taxonomic group. Pictures from www.phylopic.org if available under a Public Domain Dedication license, otherwise generated from photographs available in the public domain.}
\centering
\fontsize{11}{13}\selectfont
\begin{tabular}[t]{>{\raggedright\arraybackslash}p{1cm}>{\raggedright\arraybackslash}p{2.5cm}>{\raggedright\arraybackslash}p{1.5cm}>{\raggedright\arraybackslash}p{2cm}>{\raggedright\arraybackslash}p{2cm}>{\raggedright\arraybackslash}p{2cm}>{\raggedright\arraybackslash}p{3cm}}
\toprule
 & Taxon & Taxon rank & Realm & Common name & 'Phylum' & Family\\
\midrule
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/diogenidae.png} & Diogenidae & Family & Marine & Hermit crabs & Arthropoda & Diogenidae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/entomobryidae.png} & Entomobryidae & Family & Terrestrial & Springtails & Arthropoda & Entomobryidae\\
\includegraphics[valign=T, scale=0.2, raise= 2mm]{input/phylopics/converted/neanuridae.png} & Neanuridae & Family & Terrestrial & Springtails & Arthropoda & Neanuridae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/tityus.png} & \em{Tityus} & Genus & Terrestrial & Scorpions & Arthropoda & Buthidae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/arhynchobatidae.png} & Arhynchobatidae & Family & Marine & Skates & Chordata & Arhynchobatidae\\
\addlinespace
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/dipsadidae.png} & Dipsadidae & Family & Terrestrial & Snakes & Chordata & Dipsadidae\\
\includegraphics[valign=T, scale=0.18, raise= 2mm]{input/phylopics/converted/harengula.png} & \em{Harengula} & Genus & Marine & Herrings & Chordata & Clupeidae\\
\hline
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/thozetella.png} & \em{Thozetella} & Genus & Terrestrial & Sac fungi & Ascomycota & Chaetosphaeriaceae\\
\hline
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/conchocarpus.png} & \em{Conchocarpus} & Genus & Terrestrial & NA & Angiosperms & Rutaceae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/gaylussacia.png} & \em{Gaylussacia} & Genus & Terrestrial & Huckleberries & Angiosperms & Ericaceae\\
\addlinespace
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/harpalyce.png} & \em{Harpalyce} & Genus & Terrestrial & NA & Angiosperms & Fabaceae\\
\includegraphics[valign=T, scale=0.075, raise= 2mm]{input/phylopics/converted/iridaceae.png} & Iridaceae & Family & Terrestrial & NA & Angiosperms & Iridaceae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/lepismium.png} & \em{Lepismium} & Genus & Terrestrial & Cacti & Angiosperms & Cactaceae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/oocephalus.png} & \em{Oocephalus} & Genus & Terrestrial & NA & Angiosperms & Lamiaceae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/pilosocereus.png} & \em{Pilosocereus} & Genus & Terrestrial & Cacti & Angiosperms & Cactaceae\\
\addlinespace
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/prosthechea.png} & \em{Prosthechea} & Genus & Terrestrial & Orchids & Angiosperms & Orchidaceae\\
\includegraphics[valign=T, scale=0.25, raise= 2mm]{input/phylopics/converted/tillandsia.png} & \em{Tillandsia} & Genus & Terrestrial & Bromeliads & Angiosperms & Bromeliaceae\\
\includegraphics[valign=T, scale=0.2, raise= 2mm]{input/phylopics/converted/tocoyena.png} & \em{Tocoyena} & Genus & Terrestrial & NA & Angiosperms & Rubiaceae\\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[!h]

\caption{\label{tab:tabletests}The automated filters used in this study.}
\centering
\fontsize{9}{11}\selectfont
\begin{tabular}[t]{>{\raggedright\arraybackslash}p{2cm}>{\raggedright\arraybackslash}p{1cm}>{\raggedright\arraybackslash}p{2.5cm}>{\raggedright\arraybackslash}p{9cm}}
\toprule
Test & Type & Basis & Rationale\\
\midrule
\rowcolor{gray!6}  Biodiversity institutions & Error & Gazetteer-based & Records may have coordinates at the location of biodiversity institutions, e.g. because they were erroneously entered with the physical location of the specimen or because they represent individuals from captivity or horticulture, which are not clearly labeled as such\\
Equal lat/lon & Error & Gazetteer-based & Coordinates with equal latitude and longitude are usually indicative of data entry errors\\
\rowcolor{gray!6}  Sea & Error & Gazetteer-based & Coordinates from terrestrial organisms in the sea are usually indicative of data entry errors, e.g. swapped latitude and longitude\\
Zeros & Error & Gazetteer-based & Coordinates with plain zeros are often indicative of data entry errors\\
\rowcolor{gray!6}  Capitals & Unfit & Gazetteer-based & Records may be assigned to the coordinates of country capitals based on a vague locality description\\
\addlinespace
Duplicates & Unfit & Gazetteer-based & Duplicated records may add unnecessary computational burden, in particular for large scale biodiversity analyses and distribution modelling for many species\\
\rowcolor{gray!6}  Political centroids & Unfit & Gazetteer-based & Records may be assigned to the coordinates of the centroids of political entities based on a vague locality description\\
Urban areas & Unfit & Gazetteer-based & Records from urban areas are not necessarily errors, but often represent imprecise records automatically geo-referenced from vague locality descriptions or old records from different land-use types\\
\rowcolor{gray!6}  Basis of record & Unfit & Meta-data & Records might be unsuitable or unreliable for certain analyses dependent on their source, e.g. 'fossil' or 'unknown'\\
Collection year & Unfit & Meta-data & Coordinates of old records are more likely to be imprecise or erroneous, since they are derived from geo-referencing based on the locality description. This is more problematic for older records, since names or borders of places may change\\
\addlinespace
\rowcolor{gray!6}  Coordinate precision & Unfit & Meta-data & Records may be unsuitable for a study if their precision is lower than the study analysis scale\\
Identification level & Unfit & Meta-data & Records may be unsuitable if they are not identified to species level.\\
\rowcolor{gray!6}  Individual count & Unfit & Meta-data & Records may be unsuitable if the number of recorded individuals is 0 (record of absence) or if the count is too high, as this is often related to records from barcoding or indicative of data entry problems.\\
\bottomrule
\end{tabular}
\end{table}

\begin{landscape}\begin{table}

\caption{\label{tab:tablecoords}The impact of automated filtering on occurrence records for 18 Neotropical taxa downloaded from www.gbif.org. From column four onwards the numbers show the percentage of records flagged by the respective test. Only tests that flagged at least 0.1\% of the records in any group are shown. Individual records can be flagged by multiple tests, therefore the sum of percentages from all tests can exceed the total percentage.}
\centering
\fontsize{8}{10}\selectfont
\begin{tabular}[t]{>{\raggedright\arraybackslash}p{1.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{1.1cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{1.1cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}>{\raggedleft\arraybackslash}p{0.9cm}}
\toprule
\multicolumn{1}{c}{ } & \multicolumn{4}{c}{Summary} & \multicolumn{3}{c}{Errors} & \multicolumn{9}{c}{Unfit} \\
\cmidrule(l{3pt}r{3pt}){2-5} \cmidrule(l{3pt}r{3pt}){6-8} \cmidrule(l{3pt}r{3pt}){9-17}
Taxon & Total records & Fraction flagged [\%] & Fraction error [\%] & Fraction unfit [\%] & Biodiversity Institutions [\%] & Sea/land area [\%] & Zeros [\%] & Capitals [\%] & Duplicates [\%] & Political centroids [\%] & Urban areas [\%] & Basis of record [\%] & Collection year [\%] & Coordinate precision [\%] & Id-level [\%] & Individual count [\%]\\
\midrule
Diogenidae & 13,840 & 68.7 & 44.3 & 38.2 & 0.0 & 44.3 & 0.0 & 0.7 & 33.8 & 0.2 & 1.3 & 1.7 & 2.5 & 0.0 & 0.0 & 0.0\\
Entomobryidae & 2,767 & 90.3 & 0.1 & 90.3 & 0.1 & 0.0 & 0.0 & 0.1 & 85.5 & 0.0 & 70.1 & 72.9 & 2.0 & 0.0 & 72.1 & 0.0\\
Neanuridae & 689 & 66.9 & 0.0 & 66.9 & 0.0 & 0.0 & 0.0 & 0.0 & 62.4 & 0.0 & 2.0 & 2.9 & 1.3 & 0.0 & 0.0 & 0.0\\
\em{Tityus} & 1,018 & 55.2 & 0.5 & 54.9 & 0.5 & 0.0 & 0.0 & 1.2 & 43.5 & 0.1 & 6.9 & 7.0 & 0.4 & 1.8 & 1.6 & 0.0\\
Arhynchobatidae & 14,633 & 38.5 & 3.8 & 37.4 & 0.0 & 3.8 & 0.0 & 0.0 & 35.4 & 0.0 & 1.9 & 1.7 & 1.3 & 0.0 & 0.9 & 0.0\\
\addlinespace
Dipsadidae & 64,249 & 57.7 & 0.3 & 57.6 & 0.3 & 0.0 & 0.0 & 1.8 & 46.3 & 0.4 & 8.5 & 5.6 & 11.3 & 0.8 & 0.0 & 0.1\\
\em{Harengula} & 36,697 & 31.0 & 5.5 & 27.8 & 0.0 & 5.5 & 0.0 & 0.2 & 27.0 & 0.1 & 0.2 & 1.0 & 0.4 & 0.0 & 0.3 & 0.0\\
\hline
\em{Thozetella} & 51 & 35.3 & 23.5 & 29.4 & 0.0 & 23.5 & 0.0 & 0.0 & 27.5 & 0.0 & 2.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0\\
\hline
\em{Conchocarpus} & 1,551 & 43.2 & 0.5 & 42.9 & 0.1 & 0.4 & 0.0 & 0.0 & 39.6 & 0.9 & 2.3 & 0.5 & 1.9 & 0.1 & 0.0 & 0.0\\
\em{Gaylussacia} & 3,998 & 47.2 & 0.1 & 47.1 & 0.1 & 0.1 & 0.0 & 3.2 & 41.8 & 1.1 & 5.2 & 0.7 & 4.4 & 0.6 & 0.0 & 0.0\\
\addlinespace
\em{Harpalyce} & 870 & 33.1 & 0.0 & 33.1 & 0.0 & 0.0 & 0.0 & 1.0 & 26.0 & 1.3 & 3.8 & 0.5 & 5.5 & 0.7 & 0.0 & 0.9\\
Iridaceae & 23,127 & 33.6 & 0.5 & 33.5 & 0.4 & 0.1 & 0.0 & 1.0 & 17.1 & 0.4 & 12.3 & 0.9 & 4.7 & 0.1 & 0.0 & 1.3\\
\em{Lepismium} & 825 & 29.7 & 0.0 & 29.7 & 0.0 & 0.0 & 0.0 & 0.1 & 21.9 & 0.1 & 7.8 & 0.0 & 2.1 & 0.0 & 0.0 & 0.0\\
\em{Oocephalus} & 883 & 49.3 & 0.0 & 49.3 & 0.0 & 0.0 & 0.0 & 6.1 & 41.9 & 0.8 & 13.3 & 0.0 & 0.7 & 0.3 & 0.0 & 0.1\\
\em{Pilosocereus} & 1,940 & 25.9 & 0.2 & 25.9 & 0.2 & 0.0 & 0.0 & 0.5 & 16.8 & 0.5 & 2.1 & 1.8 & 7.0 & 0.0 & 0.0 & 0.9\\
\addlinespace
\em{Prosthechea} & 6,617 & 31.5 & 0.1 & 31.5 & 0.0 & 0.0 & 0.1 & 0.4 & 19.6 & 1.7 & 0.9 & 5.0 & 8.3 & 0.1 & 0.0 & 0.2\\
\em{Tillandsia} & 42,222 & 35.3 & 0.4 & 35.2 & 0.3 & 0.0 & 0.0 & 0.7 & 19.8 & 0.7 & 9.2 & 4.9 & 5.1 & 0.1 & 0.0 & 1.0\\
\em{Tocoyena} & 2,922 & 37.6 & 0.3 & 37.4 & 0.0 & 0.2 & 0.0 & 0.8 & 32.3 & 0.8 & 5.0 & 0.1 & 1.9 & 0.2 & 0.0 & 0.5\\
\hline
Total & 218,899 & 44.3 & 4.2 & 41.7 & 0.2 & 4.0 & 0.0 & 1.0 & 32.3 & 0.4 & 7.1 & 4.2 & 5.6 & 0.3 & 1.0 & 0.4\\
\bottomrule
\end{tabular}
\end{table}
\end{landscape}

\begin{landscape}\begin{table}

\caption{\label{tab:unnamed-chunk-3}Conservation assessment for 11 Neotropical taxa of plants and animals based on three datasets. IUCN: global red list assessment obtained from www.iucn.org; GBIF Raw: preliminary conservation assessment based on IUCN Criterion B using ConR and the raw dataset from GBIF; GBIF filtered: preliminary conservation assessment based on IUCN Criterion B using ConR and the filtered dataset. Only taxa with at least one species evaluated by IUCN are shown.}
\centering
\fontsize{9}{11}\selectfont
\begin{tabular}[t]{l>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.2cm}>{\raggedleft\arraybackslash}p{1.5cm}>{\raggedleft\arraybackslash}p{1.5cm}}
\toprule
\multicolumn{1}{c}{ } & \multicolumn{3}{c}{IUCN} & \multicolumn{3}{c}{GBIF Raw} & \multicolumn{5}{c}{GBIF Filtered} \\
\cmidrule(l{3pt}r{3pt}){2-4} \cmidrule(l{3pt}r{3pt}){5-7} \cmidrule(l{3pt}r{3pt}){8-12}
Taxon & n taxa & Evaluated [\%] & Threatened [\%] & n taxa & Threatened [\%] & Match with IUCN [\%] & n taxa & Threatened [\%] & Match with IUCN [\%] & EOO change compared to raw [\%] & AOO change compared to raw [\%]\\
\midrule
Arhynchobatidae & 37 & 51.3 & 17.9 & 39 & 35.9 & 45.0 & 39 & 41.0 & 40.0 & -32.7 & -18.5\\
Dipsadidae & 520 & 68.0 & 8.8 & 638 & 58.3 & 63.0 & 598 & 59.9 & 61.2 & -2.3 & -15.6\\
\em{Harengula} & 4 & 100.0 & 0.0 & 4 & 0.0 & 100.0 & 4 & 0.0 & 100.0 & -38.0 & -36.9\\
\hline
\em{Conchocarpus} & 4 & 8.7 & 0.0 & 46 & 63.0 & 100.0 & 45 & 62.2 & 100.0 & -15.3 & -7.1\\
\em{Gaylussacia} & 2 & 3.3 & 0.0 & 61 & 59.0 & 50.0 & 58 & 60.3 & 50.0 & -22.5 & -8.6\\
\addlinespace
\em{Harpalyce} & 3 & 15.0 & 5.0 & 20 & 65.0 & 66.7 & 17 & 58.8 & 50.0 & -18.4 & -16.5\\
Iridaceae & 13 & 2.3 & 0.2 & 531 & 64.4 & 50.0 & 466 & 62.9 & 62.5 & -18.2 & -12.3\\
\em{Lepismium} & 6 & 100.0 & 0.0 & 6 & 16.7 & 83.3 & 6 & 16.7 & 83.3 & -33.9 & -7.9\\
\em{Pilosocereus} & 41 & 80.9 & 19.1 & 47 & 55.3 & 73.7 & 46 & 56.5 & 71.1 & -8.5 & -5.8\\
\em{Tillandsia} & 54 & 11.6 & 6.0 & 464 & 61.4 & 85.2 & 453 & 62.7 & 83.3 & -13.7 & -9.9\\
\addlinespace
\em{Tocoyena} & 3 & 13.6 & 4.5 & 22 & 31.8 & 66.7 & 21 & 38.1 & 66.7 & -23.0 & -9.5\\
\bottomrule
\end{tabular}
\end{table}
\end{landscape}

\hypertarget{figures}{%
\section*{Figures}\label{figures}}
\addcontentsline{toc}{section}{Figures}

\begin{figure}
\includegraphics[width=0.9\linewidth]{./figures_tables/Fig1_study_groups} \caption{Examples of taxa included in this study. \textbf{A)} \textit{Pilosocereus pusillibaccatus} (\textit{Pilosocereus}); \textbf{B)} \textit{Conchocarpus macrocarpus} (\textit{Conchocarpus}); \textbf{C)} \textit{Tillandsia recurva} (\textit{Tillandsia}); \textbf{D)} \textit{Oxyrhopus guibei} (Dipsadidae); \textbf{E)} \textit{Aethiopella ricardoi} (Neanuridae); \textbf{F)} \textit{Tocoyena formosa} (\textit{Tocoyena}); \textbf{G)} \textit{Harengula jaguana} (\textit{Harengula}); \textbf{H)} \textit{Gaylussacia decipiens} (\textit{Gaylussacia}); \textbf{I)} \textit{Oocephalus foliosus} (\textit{Oocephalus}); \textbf{J)} \textit{Tityus carvalhoi} (\textit{Tityus}); \textbf{K)} \textit{Prosthechea vespa} (\textit{Prosthechea}). Image credits: A) Pamela Lavor, B) Juliana El-Ottra, C) Eduardo Calisto Tomaz, D) Filipe C Serrano, E) Raiane Vital da Paz, F) Fernanda GL Moreira, G) Thais Ferreira-Araujo, H) Luiz Menini Neto, I) Arthur de Souza Soares, J) Renata C Santos-Costa, K) Tiago Vieira.}\label{fig:species}
\end{figure}

\begin{figure}
\includegraphics[width=\linewidth]{./figures_tables/Fig2_number_of_records} \caption{The absolute number of records flagged as erroneous or unfit by automated geographic filters in a dataset of 18 Neotropical taxa including animals, fungi, and plants, plotted in a 100 x 100 km grid across the Neotropics (Behrmann projection).}\label{fig:total}
\end{figure}

\begin{figure}
\includegraphics[width=\linewidth]{./figures_tables/Fig3_number_of_records_split} \caption{Geographic location of the occurrence records flagged by the automated tests applied in this study. Only filters that flagged at least 0.1\% of records in any taxon are shown.}\label{fig:split}
\end{figure}

\begin{figure}
\includegraphics[width=\linewidth]{./figures_tables/Fig4_number_species_richness_difference} \caption{Illustrative examples of the difference in species richness between the raw and filtered datasets (raw - filtered) for four of the study taxa. Total species number in the raw datasets: Dipsadidae: 637, \textit{Harengula}: 4, \textit{Thozetella}: 9, \textit{Tillandsia}: 464. Photo credit for C): Tiago Andrade Borges Santos, otherwise as in Figure 1.}\label{fig:speciesrichness}
\end{figure}

\newpage{}

\hypertarget{references}{%
\section*{References}\label{references}}
\addcontentsline{toc}{section}{References}

\hypertarget{refs}{}
\leavevmode\hypertarget{ref-Anderson2016}{}%
Anderson, Robert P, Miguel Araújo, Antoine Guisan, Jorge M Lobo, Enrique Martínez-Meyer, Townsend Peterson, and Jorge Soberón. 2016. ``Final Report of the Task Group on GBIF Data Fitness for Use in Distribution Modelling - Are species occurrence data in global online repositories fit for modeling species distributions? The case of the Global Biodiversity Information Facility (GBIF).'' Copenhagen, Denmark: GBIF.

\leavevmode\hypertarget{ref-Antonelli2018}{}%
Antonelli, Alexandre, Alexander Zizka, Fernanda Antunes Carvalho, Ruud Scharn, Christine D. Bacon, Daniele Silvestro, and Fabien L Condamine. 2018. ``Amazonia is the primary source of Neotropical biodiversity.'' \emph{Proceedings of the National Academy of Sciences} 115 (23): 6034--9. \url{https://doi.org/10.1073/pnas.1713819115}.

\leavevmode\hypertarget{ref-Bachman2011}{}%
Bachman, Steven P., Justin Moat, Andrew Hill, Javier de la Torre, and Ben Scott. 2011. ``Supporting Red List threat assessments with GeoCAT: geospatial conservation assessment tool.'' \emph{ZooKeys} 150 (November): 117--26. \url{https://doi.org/10.3897/zookeys.150.2109}.

\leavevmode\hypertarget{ref-Chamberlain2016}{}%
Chamberlain, Scott. 2016. ``scrubr: Clean Biological Occurrence Records.'' \url{https://cran.r-project.org/package=scrubr}.

\leavevmode\hypertarget{ref-Chamberlain2018}{}%
---------. 2018. \emph{rredlist: 'IUCN' Red List Client}. \url{https://cran.r-project.org/package=rredlist}.

\leavevmode\hypertarget{ref-Chamberlain2017}{}%
Chamberlain, Scott A. 2017. ``rgbif: Interface to the Global Biodiversity Information Facility API. R package version 0.9.9.'' \url{https://github.com/ropensci/rgbif}.

\leavevmode\hypertarget{ref-Cosiaux2018}{}%
Cosiaux, Ariane, Lauren M. Gardiner, Fred W. Stauffer, Steven P. Bachman, Bonaventure Sonké, William J. Baker, and Thomas L. P. Couvreur. 2018. ``Low extinction risk for an important plant resource: Conservation assessments of continental African palms (Arecaceae/Palmae).'' \emph{Biological Conservation} 221 (May): 323--33. \url{https://doi.org/10.1016/j.biocon.2018.02.025}.

\leavevmode\hypertarget{ref-Dauby2017}{}%
Dauby, Gilles, Tariq Stévart, Vincent Droissart, Ariane Cosiaux, Vincent Deblauwe, Murielle Simo-Droissart, Marc S. M. Sosef, et al. 2017. ``ConR: An R package to assist large-scale multispecies preliminary conservation assessments using distribution data.'' \emph{Ecology and Evolution} 7 (24): 11292--11303. \url{https://doi.org/10.1002/ece3.3704}.

\leavevmode\hypertarget{ref-Garnier2018}{}%
Garnier, Simon. 2018. \emph{viridis: Default color maps from 'matplotlib'}. \url{https://cran.r-project.org/package=viridis}.

\leavevmode\hypertarget{ref-GBIForg2019c}{}%
GBIF.org. 2019a. ``Arhynchobatidae (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.uutyb6}.

\leavevmode\hypertarget{ref-GBIForg2019f}{}%
---------. 2019b. ``Conchocarpus (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.zjjpmh}.

\leavevmode\hypertarget{ref-GBIForg2019}{}%
---------. 2019c. ``Diogenidae (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.sojrfp}.

\leavevmode\hypertarget{ref-GBIForg2019d}{}%
---------. 2019d. ``Dipsadidae (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.8hnzfo}.

\leavevmode\hypertarget{ref-GBIForg2019g}{}%
---------. 2019e. ``Gaylussacia (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.4srw8a}.

\leavevmode\hypertarget{ref-GBIForg2019e}{}%
---------. 2019f. ``Harengula (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.zznjbv}.

\leavevmode\hypertarget{ref-GBIForg2019j}{}%
---------. 2019g. ``Iridaceae (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.nmzgi9}.

\leavevmode\hypertarget{ref-GBIForg2019i}{}%
---------. 2019h. ``Lepismium (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.762543}.

\leavevmode\hypertarget{ref-GBIForg2019a}{}%
---------. 2019i. ``Neanuridae (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.bx0jjw}.

\leavevmode\hypertarget{ref-GBIForg2019n}{}%
---------. 2019j. ``Oocephalus (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.wkwque}.

\leavevmode\hypertarget{ref-GBIForg2019k}{}%
---------. 2019k. ``Pilosocereus (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.scmkx5}.

\leavevmode\hypertarget{ref-GBIForg2019m}{}%
---------. 2019l. ``Prosthechea (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.6bzfz4}.

\leavevmode\hypertarget{ref-GBIForg2019h}{}%
---------. 2019m. ``Thozetella (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.rpkjsh}.

\leavevmode\hypertarget{ref-GBIForg2019o}{}%
---------. 2019n. ``Tillandsia (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.zj2cyj}.

\leavevmode\hypertarget{ref-GBIForg2019b}{}%
---------. 2019o. ``Tityus (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.zv6kuq}.

\leavevmode\hypertarget{ref-GBIForg2019l}{}%
---------. 2019p. ``Tocoyena (29 December 2019) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.d34gos}.

\leavevmode\hypertarget{ref-GBIForg2020}{}%
---------. 2020a. ``Diogenidae (25 February 2020) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.qazjh4}.

\leavevmode\hypertarget{ref-GBIForg2020a}{}%
---------. 2020b. ``Entomobryidae (25 February 2020) GBIF Occurrence Download.'' \url{https://doi.org/10.15468/dl.ixq7wh}.

\leavevmode\hypertarget{ref-Guedes2018}{}%
Guedes, Thaís B., Ricardo J. Sawaya, Alexander Zizka, Shawn Laffan, Søren Faurby, R. Alexander Pyron, Renato S. Bérnils, et al. 2018. ``Patterns, biases and prospects in the distribution and diversity of Neotropical snakes.'' \emph{Global Ecology and Biogeography} 27 (1): 14--21. \url{https://doi.org/10.1111/geb.12679}.

\leavevmode\hypertarget{ref-Gueta2016}{}%
Gueta, Tomer, and Yohay Carmel. 2016. ``Quantifying the value of user-level data cleaning for big data: A case study using mammal distribution models.'' \emph{Ecological Informatics} 34: 139--45. \url{https://doi.org/10.1016/j.ecoinf.2016.06.001}.

\leavevmode\hypertarget{ref-Hijmans2019}{}%
Hijmans, Robert J. 2019. ``raster: Geographic data analysis and modeling.'' \url{https://cran.r-project.org/package=raster}.

\leavevmode\hypertarget{ref-IUCN2017}{}%
IUCN Standards and Petitions Subcommittee. 2017. ``Guidelines for Using the IUCN Red List - Categories and Criteria. Version 13. Prepared by the Standards and Petitions Subcommittee. Downloadable from http://www.iucnredlist.org/documents/RedListGuidelines.pdf.''

\leavevmode\hypertarget{ref-Jin2020}{}%
Jin, Jing, and Jun Yang. 2020. ``BDcleaner: A workflow for cleaning taxonomic and geographic errors in occurrence data archived in biodiversity databases.'' \emph{Global Ecology and Conservation} 21 (March): e00852. \url{https://doi.org/10.1016/j.gecco.2019.e00852}.

\leavevmode\hypertarget{ref-Maldonado2015}{}%
Maldonado, Carla, Carlos I. Molina, Alexander Zizka, Claes Persson, Charlotte M. Taylor, Joaquina Albán, Eder Chilquillo, Nina Rønsted, and Alexandre Antonelli. 2015. ``Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases?'' \emph{Global Ecology and Biogeography} 24 (8): 973--84. \url{https://doi.org/10.1111/geb.12326}.

\leavevmode\hypertarget{ref-Morrone2014}{}%
Morrone, Juan J. 2014. ``Biogeographical regionalisation of the Neotropical region.'' \emph{Zootaxa} 3782 (1): 1. \url{https://doi.org/10.11646/zootaxa.3782.1.1}.

\leavevmode\hypertarget{ref-NicLughadha2019}{}%
Nic Lughadha, Eimear, Barnaby E. Walker, Cátia Canteiro, Helen Chadburn, Aaron P. Davis, Serene Hargreaves, Eve J. Lucas, et al. 2019. ``The use and misuse of herbarium specimens in evaluating plant extinction risks.'' \emph{Philosophical Transactions of the Royal Society B: Biological Sciences} 374 (1763): 20170402. \url{https://doi.org/10.1098/rstb.2017.0402}.

\leavevmode\hypertarget{ref-Ooms2014}{}%
Ooms, Jeroen. 2014. ``The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.'' \emph{arXiv}. \url{https://arxiv.org/abs/1403.2805}.

\leavevmode\hypertarget{ref-Ooms2019}{}%
---------. 2019. \emph{writexl: Export Data Frames to Excel 'xlsx' Format}. \url{https://cran.r-project.org/package=writexl}.

\leavevmode\hypertarget{ref-Pelletier2018}{}%
Pelletier, Tara A., Bryan C. Carstens, David C. Tank, Jack Sullivan, and Anahí Espíndola. 2018. ``Predicting plant conservation priorities on a global scale.'' \emph{Proceedings of the National Academy of Sciences} 115 (51): 13027--32. \url{https://doi.org/10.1073/pnas.1804098115}.

\leavevmode\hypertarget{ref-Peterson2018}{}%
Peterson, A. Townsend, Alex Asase, Dora Canhos, Sidnei de Souza, and John Wieczorek. 2018. ``Data Leakage and Loss in Biodiversity Informatics.'' \emph{Biodiversity Data Journal} 6 (November): e26826. \url{https://doi.org/10.3897/BDJ.6.e26826}.

\leavevmode\hypertarget{ref-rcoreteam2019}{}%
R Core Team. 2019. ``R: A language and environment for statistical computing.'' Vienna, Austria: R Foundation for Statistical Computing. \url{https://www.r-project.org/}.

\leavevmode\hypertarget{ref-Rivers2011}{}%
Rivers, Malin C., Lin Taylor, Neil A. Brummitt, Thomas R. Meagher, David L. Roberts, and Eimear Nic Lughadha. 2011. ``How many herbarium specimens are needed to detect threatened species?'' \emph{Biological Conservation} 144 (10): 2541--7. \url{https://doi.org/10.1016/j.biocon.2011.07.014}.

\leavevmode\hypertarget{ref-Robertson2016}{}%
Robertson, Mark P, Vernon Visser, and Cang Hui. 2016. ``Biogeo: an R package for assessing and improving data quality of occurrence record datasets.'' \emph{Ecography} 39: 394--401. \url{https://doi.org/10.1111/ecog.02118}.

\leavevmode\hypertarget{ref-Schmidt2017}{}%
Schmidt, Marco, Alexander Zizka, Salifou Traoré, Mandingo Ataholo, Cyrille Chatelain, Philippe Daget, Stefan Dressler, et al. 2017. ``Diversity, distribution and preliminary conservation status of the flora of Burkina Faso.'' \emph{Phytotaxa Monographs} 304 (1): 1--215.

\leavevmode\hypertarget{ref-Stevart2019}{}%
Stévart, T., G. Dauby, P. P. Lowry, A. Blach-Overgaard, V. Droissart, D. J. Harris, B. A. Mackinder, et al. 2019. ``A third of the tropical African flora is potentially threatened with extinction.'' \emph{Science Advances} 5 (11): eaax9444. \url{https://doi.org/10.1126/sciadv.aax9444}.

\leavevmode\hypertarget{ref-Topel2017}{}%
Töpel, Mats, Alexander Zizka, Maria Fernanda Calió, Ruud Scharn, Daniele Silvestro, and Alexandre Antonelli. 2017. ``SpeciesGeoCoder: Fast categorization of species occurrences for analyses of biodiversity, biogeography, ecology, and evolution.'' \emph{Systematic Biology} 66 (2): 145--51. \url{https://doi.org/10.1093/sysbio/syw064}.

\leavevmode\hypertarget{ref-Wickham2018}{}%
Wickham, Hadley. 2018. ``tidyverse: Easily install and load the 'Tidyverse'.'' \url{https://cran.r-project.org/package=tidyverse}.

\leavevmode\hypertarget{ref-Yesson2007}{}%
Yesson, Chris, Peter W Brewer, Tim Sutton, Neil Caithness, Jaspreet S Pahwa, Mikhaila Burgess, W Alec Gray, et al. 2007. ``How Global Is the Global Biodiversity Information Facility?'' Edited by James Beach. \emph{PLoS ONE} 2 (11): e1124. \url{https://doi.org/10.1371/journal.pone.0001124}.

\leavevmode\hypertarget{ref-Zizka2020b}{}%
Zizka, Alexander, Josue Azevedo, Elton Leme, Beatriz Neves, Andrea Ferreira, Daniel Caceres, and Georg Zizka. 2020. ``Biogeography and conservation status of the pineapple family (Bromeliaceae).'' \emph{Diversity and Distributions} 26 (2): 183--95. \url{https://doi.org/10.1111/ddi.13004}.

\leavevmode\hypertarget{ref-Zizka2019}{}%
Zizka, Alexander, Daniele Silvestro, Tobias Andermann, Josué Azevedo, Camila Duarte Ritter, Daniel Edler, Harith Farooq, et al. 2019. ``CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases.'' Edited by Tiago Quental. \emph{Methods in Ecology and Evolution} 10 (5): 744--51. \url{https://doi.org/10.1111/2041-210X.13152}.


\end{document}