LifeCLEF 2019 Geo
Location-based species prediction
Note: Do not forget to read the Rules section on this page
Update 28/04/2019: All run results are available at: https://www.imageclef.org/GeoLifeCLEF2019
Update 28/04/2019: Submission deadline extended to 05/05/2019 at 23:00 (UTC).
Update 11/04/2019: SUBMISSION IS NOW OPEN! Use the “Create submission” button and /!\ be careful /!\ to fill in the required information BEFORE choosing the file.
Update 25/04/2019: Added a dataset merging all train occurrences (with a geographic filter for noPlant occurrences).
Update 08/04/2019: Added information about runs format on this page.
Update 05/04/2019: Added the identification of TestSet species in the Table of species IDs and names. See the Dataset tab.
Update 26/03/2019: The TEST SET is now available; download it from the Dataset tab. The Protocol Note is also available for detailed information about the construction of the train and test datasets.
Motivation
Automatically predicting the list of species that are the most likely to be observed at a given location is useful for many scenarios in biodiversity informatics. First of all, it could improve species identification processes and tools (be they automated, semi-automated or based on classical field guides or floras) by reducing the list of candidate species that are observable at a given location. More generally, it could facilitate biodiversity inventories through the development of location-based recommendation services (typically on mobile phones) as well as the involvement of non-expert nature observers. Last but not least, it might serve educational purposes thanks to biodiversity discovery applications providing functionalities such as contextualized educational pathways.
Challenge description
The aim of the challenge is to predict the list of species that are the most likely to be observed at a given location. To this end, we provide a large training set of species occurrences, each occurrence being associated with a multi-channel image characterizing the local environment. Indeed, it is usually not possible to learn a species distribution model directly from spatial positions because of the limited number of occurrences and the sampling bias. What is usually done in ecology is to predict the distribution on the basis of a representation in the environmental space, typically a feature vector composed of climatic variables (average temperature at that location, precipitation, etc.) and other variables such as soil type, land cover, distance to water, etc. The originality of GeoLifeCLEF is to generalize this niche modeling approach to the use of an image-based environmental representation space. Instead of learning a model from environmental feature vectors, the goal of the task is to learn a model from k-dimensional image patches, each patch representing the value of an environmental variable in the neighborhood of the occurrence (see figure below for an illustration). From a machine learning point of view, the challenge can thus be treated as an image classification task. Participants will learn their models on a train set made up of valid and uncertain citizen science occurrences. They will then try to predict the most likely species on an independent test set made up of expert plant occurrences with accurate identification and spatial location over very diverse biotic areas in France: the Mediterranean and Alpine regions.
Data
Train and test data are downloadable from the “Dataset” tab.
Check out the Protocol Note for detailed information about the dataset construction, and the Python scripts at https://github.com/maximiliense/GLC19 to simplify formatting of the dataset for the learning process.
This year, the train dataset is augmented compared to the 2018 edition. In a nutshell, it first includes the 280,945 georeferenced train and test occurrences of plant species from last year (file GLC_2018.csv). In addition, 2,367,145 plant species occurrences with uncertain identifications are added (file PL_complete.csv). They come from automatic species identification of pictures produced in 2017-2018 by the smartphone application Pl@ntNet, whose users are mainly amateur botanists. A trusted extraction of this dataset is also provided (file PL_trusted.csv), ensuring a reasonable level of identification certainty. Finally, 10,618,839 species occurrences from other kingdoms (such as mammals, birds, amphibians, insects, fungi, etc.) were selected from the GBIF database (file noPlant.csv). 33 environmental rasters (directory rasters GLC19/) covering the French territory are made available this year, so that each occurrence may be linked to an environmental tensor via participant-customizable Python code. These environmental rasters were constructed from various open datasets including CHELSA Climate [1], ESDB soil pedology data [2,3,4], Corine Land Cover 2012 soil occupation data, CGIAR-CSI evapotranspiration data [5,6], USGS Elevation data (data available from the U.S. Geological Survey) and BD Carthage hydrologic data.
The test occurrence data come from independent datasets of the French National Botanical Conservatories. This TestSet includes 844 plant species, all of which are also found in the train set. Those species are indicated in the column “test” of the Table of species IDs and names and identification of TestSet species, downloadable on the Dataset tab. A detailed description of the protocol used to build the datasets is available in the Protocol_Note, downloadable from the “Dataset” tab.
Submission instructions
Submission is open!
Each team is allowed to submit a maximum of 20 runs. A run is a .csv file with 4 columns separated by “;”, containing in this order: glc19TestOccId ; glc19SpId ; Rank ; Probability
Here is an example of the first 5 lines of a run file:
1 ; 10 ; 1 ; 0.5
1 ; 25 ; 2 ; 0.3
1 ; 301 ; 3 ; 0.2
2 ; 34 ; 1 ; 0.9
2 ; 41 ; 2 ; 0.1
Please check the format of your runs. The 1st, 2nd and 3rd columns (respectively glc19TestOccId, glc19SpId and Rank) must be integers, while the last column (Probability) is a float. Up to 50 species (glc19SpId) may be given for each occurrence ID (glc19TestOccId); these species must be distinct and their ranks must be strictly consecutive, starting from 1. Each occurrence ID in the submitted run must exist in the testSet file (glc19TestOccId). Each species ID must be one of the species (glc19SpId) marked as TRUE in the column “test” of the Table of species IDs and names and identification of TestSet species.
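The format rules above can be checked locally before submitting. The following is a minimal sketch, not the organizers' validator: the function name validate_run is hypothetical, and the two ID sets are assumed to have been loaded beforehand from the testSet file and from the species table (rows with “test” set to TRUE). It also assumes that “distinct” means a species may appear only once per occurrence.

```python
import csv

MAX_SPECIES_PER_OCCURRENCE = 50  # limit stated in the rules above


def validate_run(lines, test_occ_ids, test_species_ids):
    """Return a list of error messages for a run given as an iterable of text lines.

    test_occ_ids and test_species_ids are sets of integers that the caller is
    assumed to have loaded from the testSet file and the species table.
    """
    errors = []
    ranks = {}    # occurrence ID -> ranks in file order
    species = {}  # occurrence ID -> species IDs already seen
    for n, row in enumerate(csv.reader(lines, delimiter=";"), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 4:
            errors.append(f"line {n}: expected 4 columns, got {len(row)}")
            continue
        try:
            occ_id, sp_id, rank = (int(field.strip()) for field in row[:3])
            float(row[3].strip())
        except ValueError:
            errors.append(f"line {n}: columns 1-3 must be integers, column 4 a float")
            continue
        if occ_id not in test_occ_ids:
            errors.append(f"line {n}: unknown occurrence ID {occ_id}")
        if sp_id not in test_species_ids:
            errors.append(f"line {n}: species {sp_id} is not a test set species")
        if sp_id in species.setdefault(occ_id, set()):
            errors.append(f"line {n}: species {sp_id} repeated for occurrence {occ_id}")
        species[occ_id].add(sp_id)
        ranks.setdefault(occ_id, []).append(rank)
    for occ_id, occ_ranks in ranks.items():
        if len(occ_ranks) > MAX_SPECIES_PER_OCCURRENCE:
            errors.append(f"occurrence {occ_id}: more than 50 species")
        if sorted(occ_ranks) != list(range(1, len(occ_ranks) + 1)):
            errors.append(f"occurrence {occ_id}: ranks are not consecutive from 1")
    return errors


example = """1 ; 10 ; 1 ; 0.5
1 ; 25 ; 2 ; 0.3
1 ; 301 ; 3 ; 0.2
2 ; 34 ; 1 ; 0.9
2 ; 41 ; 2 ; 0.1"""
print(validate_run(example.splitlines(), {1, 2}, {10, 25, 301, 34, 41}))  # prints []
```

An empty list means no format problem was detected by this sketch; the platform may apply further checks.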
WARNING: A run that induces an error does NOT count towards the limit of 20 runs, except beyond 30 faulty runs, where each additional faulty run counts as a valid run. Examples:
- if a participant submits 30 faulty runs, they can still submit 20 more runs
- if a participant submits 32 faulty runs, they can submit 18 more runs
- if a participant submits 10 successful runs, they can submit 10 more runs
- if a participant submits 7 faulty runs and 20 successful ones, they can submit 0 more runs
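The counting rule illustrated by these examples can be written as a small function. This is only a sketch of the arithmetic as we read it from the examples; the names remaining_runs, MAX_RUNS and FREE_FAULTY_RUNS are hypothetical and not part of the platform.

```python
MAX_RUNS = 20          # maximum number of counted runs per team
FREE_FAULTY_RUNS = 30  # faulty runs that do not count towards the limit


def remaining_runs(successful, faulty):
    """Number of runs a participant can still submit."""
    # Only faulty runs beyond the first 30 are counted like valid runs.
    counted = successful + max(0, faulty - FREE_FAULTY_RUNS)
    return max(0, MAX_RUNS - counted)


print(remaining_runs(0, 30))   # 20
print(remaining_runs(0, 32))   # 18
print(remaining_runs(10, 0))   # 10
print(remaining_runs(20, 7))   # 0
```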
WARNING: There is no leaderboard while submission is open for this task. We removed it to maximize the independence between submitted algorithms and test data, and thus significance of results for future research purposes.
Citations
Information will be posted after the challenge ends.
Evaluation criteria
The main evaluation criterion is the accuracy based on the first 30 answers, also called Top30. It is the mean, over all test set occurrences, of the function scoring 1 when the correct species is among the first 30 answers and 0 otherwise. This metric was chosen for this challenge because it accounts for the fact that some tens of plant species usually coexist within the perimeter of the geolocation uncertainty of the occurrences. The Mean Reciprocal Rank was chosen as a secondary metric to enable comparison with the 2018 edition.
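For local experiments on a held-out part of the train set, both metrics can be computed along these lines. This is a minimal sketch and not the official evaluation script: the function names and the dictionary layout are assumptions, and an occurrence whose correct species is absent from the ranked list is assumed to contribute 0 to the Mean Reciprocal Rank.

```python
def top_k_accuracy(predictions, truth, k=30):
    """Mean of 1 if the correct species is among the first k answers, else 0.

    predictions: occurrence ID -> list of species IDs ordered by rank
    truth:       occurrence ID -> correct species ID
    """
    hits = sum(
        1 for occ_id, sp_id in truth.items()
        if sp_id in predictions.get(occ_id, [])[:k]
    )
    return hits / len(truth)


def mean_reciprocal_rank(predictions, truth):
    """Mean of 1 / rank of the correct species (0 when it is not listed)."""
    total = 0.0
    for occ_id, sp_id in truth.items():
        ranked = predictions.get(occ_id, [])
        if sp_id in ranked:
            total += 1.0 / (ranked.index(sp_id) + 1)
    return total / len(truth)


predictions = {1: [10, 25, 301], 2: [34, 41]}
truth = {1: 25, 2: 7}  # hypothetical ground truth for illustration
print(top_k_accuracy(predictions, truth))        # 0.5
print(mean_reciprocal_rank(predictions, truth))  # 0.25
```

In this toy example the correct species of occurrence 1 is at rank 2 and that of occurrence 2 is missing, hence a Top30 of 0.5 and a Mean Reciprocal Rank of (1/2 + 0) / 2 = 0.25.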
Resources
Contact us
- Discussion Forum : https://www.crowdai.org/challenges/lifeclef-2019-geo/topics
We strongly encourage you to use the public channels mentioned above for communications between the participants and the organizers. In extreme cases, if there are any queries or comments that you would like to make using a private communication channel, then you can send us an email at :
@Organisers:
Maximilien SERVAJEAN, Maximilien.Servajean@lirmm.fr
Christophe BOTELLA, christophe.botella@inria.fr
Alexis JOLY, alexis.joly@inria.fr
More information
You can find additional information on the challenge here: https://www.imageclef.org/GeoLifeCLEF2019
Prizes
LifeCLEF 2019 is an evaluation campaign organized as part of the CLEF initiative labs. The campaign offers several research tasks that welcome participation from teams around the world. The results of the campaign appear in the working notes proceedings, published by CEUR Workshop Proceedings (CEUR-WS.org). Selected contributions among the participants will be invited for publication the following year in the Springer Lecture Notes in Computer Science (LNCS), together with the annual lab overviews.