International Journal of Computer Vision (2026) 134:183
https://doi.org/10.1007/s11263-026-02739-w

WildIng: A Wildlife Image Invariant Representation Model for Geographical Domain Shift

Julian D. Santamaria (1, 2) · Claudia Isaza (1) · Jhony H. Giraldo (2)

Received: 21 March 2025 / Accepted: 1 January 2026
© The Author(s) 2026

Abstract

Wildlife monitoring is crucial for studying biodiversity loss and climate change. Camera trap images provide a non-intrusive method for analyzing animal populations and identifying ecological patterns over time. However, manual analysis is time-consuming and resource-intensive. Deep learning, particularly foundation models, has been applied to automate wildlife identification, achieving strong performance when tested on data from the same geographical locations as their training sets. Yet, despite their promise, these models struggle to generalize to new geographical areas, leading to significant performance drops. For example, training an advanced vision-language model, such as CLIP with an adapter, on an African dataset achieves an accuracy of 84.77%. However, this performance drops significantly to 16.17% when the model is tested on an American dataset. This limitation partly arises because existing models rely predominantly on image-based representations, making them sensitive to geographical data distribution shifts, such as variation in background, lighting, and environmental conditions. To address this, we introduce WildIng, a Wildlife image Invariant representation model for geographical domain shift. WildIng integrates text descriptions with image features, creating a more robust representation to geographical domain shifts. By leveraging textual descriptions, our approach captures consistent semantic information, such as detailed descriptions of the appearance of the species, improving generalization across different geographical locations.
Experiments show that WildIng enhances the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift conditions. We evaluate WildIng on two datasets collected from different regions, namely America and Africa. The code and models are publicly available at https://github.com/Julian075/CATALOG/tree/WildIng.

Keywords Wildlife monitoring · Camera trap images · Geographical domain shift · Foundation models

Communicated by B Banerjee.

Julian D. Santamaria, julian.santamaria@udea.edu.co
Claudia Isaza, victoria.isaza@udea.edu.co
Jhony H. Giraldo, jhony.giraldo@telecom-paris.fr

1 SISTEMIC, Faculty of Engineering, Universidad de Antioquia-UdeA, Medellín, Colombia
2 LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

1 Introduction

Camera trap images are one of the most valuable data sources for wildlife monitoring, playing a crucial role in biodiversity conservation and climate change research (Reynolds et al., 2024; Gadot et al., 2024; Giraldo et al., 2019). These images provide a non-intrusive and scalable way to study animal populations, track endangered species, and understand ecological patterns over time (Pollock et al., 2025; Li et al., 2022; Santamaria et al., 2024). By capturing images in remote locations, camera traps allow researchers to collect extensive datasets without direct human intervention, making them an essential tool for ecological studies. Considering the vast volume of collected images, it is imperative to implement automatic techniques for the identification of animal species present within the images.

With the rise of large-scale deep learning models, researchers have started exploring the use of Foundation Models (FMs) in wildlife monitoring (Yang et al., 2025b; Gabeff et al., 2024; Fabian et al., 2023).
FMs are trained on vast and diverse datasets, sometimes containing billions of data samples, allowing them to learn rich and transferable representations (Tang et al., 2025; Yang et al., 2025a; Wu et al., 2024). These models have demonstrated remarkable performance across various computer vision tasks, including image classification, object detection, and semantic segmentation, proving their flexibility and adaptability to different applications (Luo et al., 2024; Riz et al., 2024; Zang et al., 2024).

Recently, researchers have begun adapting FMs for camera trap image recognition. Instead of training models from scratch, current approaches aim to fine-tune (Yang et al., 2025b) or adjust pre-trained FMs to incorporate domain-specific knowledge (Fabian et al., 2023). Some methods introduce adapters, which allow models to specialize in camera trap images without losing their general knowledge (Pantazis et al., 2022). Other models apply learning without forgetting strategies, ensuring that models retain their broad capabilities while improving performance on wildlife images (Gabeff et al., 2024). Additionally, some approaches leverage external knowledge sources, such as internet databases, to refine the model's understanding of specific animal species and their attributes (Fabian et al., 2023). These strategies aim to bridge the gap between the general-purpose knowledge of FMs and the specialized needs of camera trap image recognition.
Despite their impressive performance with in-domain geographical data (data coming from the same geographical locations), these FM-based approaches often struggle when tested on out-of-domain geographical data (training and test data come from different geographical locations) (Hogeweg et al., 2024; Norman et al., 2023; Tuia et al., 2022). This limitation is particularly problematic for camera trap applications, where the geographical locations differ substantially from those seen during the training phase (Schneider et al., 2020; Beery et al., 2018; Gomez-Villa et al., 2017).

We observe that incorporating text into the input representation for camera trap images helps extract stronger features, alleviating the geographical domain shift issue. In contrast, current models depend only on visual features, which are highly sensitive to changes in data distribution (Fang et al., 2025; Yu et al., 2023). Furthermore, many of these models are built on CLIP (Radford et al., 2021), which has shown a tendency to lose its generalization ability (Wang & Kang, 2025; Li et al., 2025a) and become more susceptible to spurious correlations (Wang et al., 2024; Kempf et al., 2025) when the model is fine-tuned. As a result, CLIP-based models (e.g., WildCLIP (Gabeff et al., 2024) and BioCLIP (Stevens et al., 2024)) that rely solely on visual features often struggle to recognize images correctly in new geographical locations. Figure 1 provides an example of these observations, showing how such geographical variations cause the model's learned features to fail in generalizing effectively, leading to misclassifications (Liang et al., 2023; Wald et al., 2021).

Fig. 1 Comparison of WildIng and WildCLIP (Gabeff et al., 2024) under geographical domain shift. Both models are trained on the Snapshot Serengeti dataset from Africa (Swanson et al., 2015) and evaluated on the Terra Incognita dataset from the United States (Beery et al., 2018). WildIng demonstrates superior performance

In this paper, we introduce the Wildlife image Invariant representation model for geographical domain shift (WildIng). Our approach introduces a simple yet effective new representation for wildlife monitoring data to address geographical domain shifts. This representation consists of using visual features in addition to text descriptions about the input images. This approach allows the model to capture geographical domain-invariant features by leveraging text descriptions, which remain consistent across different geographical regions. WildIng consists of three main components: a text encoder, which includes a Large Language Model (LLM); an image encoder; and an image-text encoder, which incorporates a Vision-Language Model (VLM) and a Multi-Layer Perceptron (MLP). The MLP is used to address the domain shift between encoders, caused by the different feature spaces introduced by the VLM and LLM components (Duan et al., 2022). An overview of the architecture is provided in Section 3.2 and illustrated in Figure 2.

To evaluate WildIng, we train our model on one dataset and test it on another dataset from a different geographical region. This setup allows us to analyze how well the model adapts to new environments where differences in background, lighting, and species composition create geographical domain shifts. Our results show that WildIng either outperforms or achieves competitive performance compared to general-purpose and domain-specific FMs in camera trap image recognition, particularly when the training and testing distributions differ due to geographical variations.

In this work, we build upon and improve our preliminary study (Santamaria et al., 2025). To achieve this, we introduce modifications to the model architecture and perform additional experiments.
More specifically, we replace the backbone of our model by changing the combination of CLIP (Contrastive Language-Image Pre-Training) (Radford et al., 2021) and BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) with Long-CLIP (Zhang et al., 2024a). Additionally, we modify the class representation, using only the information provided by LLMs. We also evaluate WildIng's robustness to multiple random initializations to analyze the effect of the introduced changes on its performance. Furthermore, we add new baselines for comparison. Finally, we perform additional ablation studies and sensitivity analyses to provide a deeper understanding of the contribution of each component in our approach.

In summary, our main contributions are:

• We introduce a novel WildIng model to represent wildlife monitoring data, enhancing the extraction of geographical domain-invariant features.
• When tested on datasets that differ from its training data, WildIng outperforms previous FMs in recognizing animal species from camera trap images.
• We conduct a series of ablation studies to validate the effectiveness of each component in our model.

2 Related Work

2.1 Foundation Models

In recent years, FMs have emerged as a new approach that achieves remarkable performance across a wide range of tasks without requiring task-specific training. These models leverage large-scale pre-training to learn high-level representations, leading to significant advancements in the field of machine learning (Awais et al., 2025; Huang et al., 2024; Jiang et al., 2023; Touvron et al., 2023). A key advancement in this field was CLIP (Radford et al., 2021), which introduced a new learning approach by aligning visual features with text descriptions. CLIP significantly improved generalization across different tasks.
Later models, such as Long-CLIP (Zhang et al., 2024a), extend the sequence length for better contextual understanding. Furthermore, CLIP-Adapter (Gao et al., 2024), which refines CLIP's learned representations using lightweight adaptation layers, continues to improve the alignment between visual features and text descriptions. More recently, LLMs and VLMs have demonstrated strong capabilities in processing and generating both textual and visual content (Zhang et al., 2024; Abdin et al., 2024; Li et al., 2023). Examples include GPT-4 (Achiam et al., 2023) and LLaVA (Liu et al., 2024), which leverage large-scale datasets to improve language and vision understanding across various applications.

2.2 Foundation Models for Biology

FMs have been adapted to address domain-specific challenges, particularly in biological research, where data is often complex and specialized. Most of the adaptations of FMs in biology are related to processing text, extracting biological information (Jung et al., 2024; Lam et al., 2024), and modeling biological structures (Garau-Luis et al., 2025; Jumper et al., 2021). Beyond language processing and structural modeling, FMs have also been applied to vision-based biological tasks. One example is BioCLIP (Stevens et al., 2024), which extends the principles of CLIP (Radford et al., 2021) to biological data, enabling the classification of diverse categories such as plants, animals, and fungi. Unlike general-purpose vision-language models, BioCLIP integrates structured biological knowledge and leverages taxonomic information, improving performance in fine-grained classification tasks (Stevens et al., 2024).

2.3 Foundation Models for Camera Trap Images

The adaptation of FMs has extended beyond general and biological applications to camera trap image recognition, where they play a crucial role in wildlife monitoring and conservation.
One such model is WildCLIP, which leverages CLIP's ability to align visual features with text descriptions to accurately classify animal species in camera trap images (Gabeff et al., 2024). Similarly, WildMatch introduces a zero-shot classification framework by generating detailed visual descriptions of camera trap images and matching them to an external knowledge base for species identification (Fabian et al., 2023). Another approach, Eco-VLM, enhances models that align visual features with text descriptions for ecological applications by fine-tuning on wildlife-specific datasets and applying text augmentation techniques (Yang et al., 2025b). In contrast to models that align visual features with text descriptions, a more traditional deep learning approach was proposed by Gadot et al. (2024), who explored large-scale training for EfficientNetV2-M (Tan & Le, 2021), a CNN-based architecture.

While previous methods have significantly improved camera trap image recognition in geographically in-domain evaluation, they still struggle when applied to different geographical regions and unseen species (Zhu et al., 2024; Simões et al., 2023; Gadot et al., 2024). To address this limitation, our proposal introduces a more robust representation for wildlife monitoring data. It leverages detailed descriptions from a VLM to incorporate semantic invariant features, which are then used together with image features. Furthermore, the inclusion of more detailed class descriptions generated by the LLM improves the alignment between the input representation and the corresponding class. This approach improves the input representation, enhancing robustness to geographical domain shifts.

Fig. 2 Overview of WildIng. The model integrates image, text, and image-text encoders along with an LLM. By leveraging text descriptions and image features, it extracts invariant features, improving robustness against geographical domain shifts

3 WildIng

3.1 Problem Definition

The objective of this paper is to train a model on an annotated dataset of camera trap images from a specific geographical location, denoted as $\mathcal{D}$, which consists of $N_d$ image-label pairs, $\mathcal{D} = \{(x_i^{\mathcal{D}}, y_i^{\mathcal{D}})\}_{i=1}^{N_d}$, with a set of classes $\mathcal{C}_{\mathcal{D}}$. We then evaluate the model's performance on a different camera trap image dataset from another geographical location, denoted as $\mathcal{S}$, which represents a distinct geographical domain, containing $N_s$ image-label pairs, $\mathcal{S} = \{(x_i^{\mathcal{S}}, y_i^{\mathcal{S}})\}_{i=1}^{N_s}$, with a set of classes $\mathcal{C}_{\mathcal{S}}$. The class sets of the two datasets may or may not overlap, meaning that $\mathcal{C}_{\mathcal{D}} \cap \mathcal{C}_{\mathcal{S}}$ may or may not be empty. Both datasets are derived from the natural world, but their image distributions differ due to being collected from different geographical regions, as illustrated in Figure 1. Our goal is to train a deep learning model using only the training dataset $\mathcal{D}$ and deploy it on the testing dataset $\mathcal{S}$. This setting is highly practical in camera trap image research because the data used for testing usually comes from a different geographical domain than the training data.

3.2 Overview of the Approach

The architecture of our proposed model, WildIng, is illustrated in Figure 2. It consists of three main components: i) text encoder, ii) image encoder, and iii) image-text encoder. In the text component, WildIng uses an LLM to extract class-specific knowledge for each category in our dictionary of classes, $\mathcal{C}_{\mathcal{D}}$. Then, the LLM-generated descriptions are processed by the text encoder. The resulting text embeddings are used to compute class-specific centroids for each class in $\mathcal{C}_{\mathcal{D}}$ (Section 3.3). This process produces a single embedding of dimension $F$ per class.
For the image component, the model uses the image encoder to extract embeddings from a mini-batch of $B$ images (Section 3.4). In the image-text encoder, WildIng employs a VLM coupled with a text encoder and an MLP to compute image-text embeddings from the mini-batch of images (Section 3.5). Text, image, and image-text embeddings are matched using a similarity mechanism (Section 3.6). Finally, we utilize the output of the similarity mechanism to compute a contrastive loss, which is used to train our model (Section 3.7). Most modules in WildIng are frozen, apart from the image-text encoder, as marked in Figure 2.

3.3 Text Encoder

To generate textual descriptions for each category in our dataset, $\mathcal{C}_{\mathcal{D}}$, we use an LLM that provides detailed information about the animals without requiring expert input. The prompt used to extract these descriptions from the LLM is provided in Appendix A. The LLM generates $M_c$ short descriptions for each class $c \in \mathcal{C}_{\mathcal{D}}$. WildIng assumes that this approach introduces more diverse information for representing each class. The generated descriptions are then processed by WildIng using the text encoder. To obtain the final embedding, the model computes the centroid of the resulting embeddings. Specifically, let $\mathbf{P}^{(c)} \in \mathbb{R}^{M_c \times F}$ be the set of $M_c$ embeddings obtained from the LLM-generated descriptions for class $c$. The centroid for each class $c$ is computed as:

$$\mathbf{t}^{(c)} = \frac{1}{M_c} \sum_{i=1}^{M_c} \mathbf{P}^{(c)}_i, \tag{1}$$

where $\mathbf{P}^{(c)}_i$ represents the $i$-th row of $\mathbf{P}^{(c)}$, corresponding to an individual textual description embedding. Finally, the output of the text embedding component in WildIng is a matrix:

$$\mathbf{T} = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_{|\mathcal{C}_{\mathcal{D}}|}]^\top \in \mathbb{R}^{|\mathcal{C}_{\mathcal{D}}| \times F}, \tag{2}$$

which contains the final embeddings for all classes in $\mathcal{C}_{\mathcal{D}}$; row $\mathbf{t}_j$ is the centroid $\mathbf{t}^{(c)}$ of the $j$-th class.

3.4 Image Encoder

3.4.1 Pre-processing

We employ an object detection model to process our camera trap datasets, aiming to extract image crops that contain relevant information for analysis.
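The class-centroid step of Section 3.3 (Eqs. 1 and 2) can be illustrated with a minimal sketch. It assumes the description embeddings have already been produced by a text encoder; the function name `class_centroids` and the toy three-dimensional vectors are hypothetical and chosen only for illustration, and the sketch is not taken from the released code.

```python
def class_centroids(embeddings_by_class):
    """Average the M_c description embeddings of each class (Eq. 1).

    embeddings_by_class maps a class name to a list of M_c vectors,
    each of length F. Returns one F-dimensional centroid per class.
    """
    centroids = {}
    for name, rows in embeddings_by_class.items():
        count = len(rows)      # M_c, number of descriptions for this class
        dim = len(rows[0])     # F, embedding dimension
        centroids[name] = [sum(row[k] for row in rows) / count for k in range(dim)]
    return centroids


# Toy input: two classes with a different number of descriptions each.
toy_embeddings = {
    "zebra": [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
    "puma": [[0.0, 4.0, 4.0]],
}
T = class_centroids(toy_embeddings)
```

Stacking the values of `T` row by row gives the matrix of Eq. 2, with one row per class.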
3.4.2 Image Embeddings

WildIng employs an image encoder to extract feature embeddings from cropped images. The images are processed in mini-batches of size $B$, where each image is transformed into an embedding of dimension $F$ using the image encoder. The output of this stage is a matrix:

$$\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_B]^\top \in \mathbb{R}^{B \times F}, \tag{3}$$

where $\mathbf{v}_i$ represents the visual embedding of the $i$-th image in the mini-batch. This matrix is used as input for the subsequent stages of our framework, where text and image embeddings are aligned and contrasted.

3.5 Image-text Encoder

In the image-text branch of WildIng, we use the mini-batch of cropped images as input. To generate textual descriptions of the animals in these images, we utilize an image-text encoder, as illustrated in Figure 3. This encoder consists of three main components: a VLM, a text encoder, and an MLP. First, the VLM generates textual descriptions based on the input images, using a prompt similar to the one described in (Fabian et al., 2023) and provided in Appendix A. Then, these textual descriptions are processed using the text encoder. Finally, WildIng applies an MLP to refine the extracted embeddings by introducing trainable parameters. As demonstrated in Section 4, incorporating trainable parameters improves the model's performance. However, effective alignment between the image embeddings and the projected representations requires a dedicated similarity mechanism and a contrastive loss function, as detailed in Section 3.6 and Section 3.7. The output of the image-text encoder of WildIng is a matrix:

$$\mathbf{L} = [\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_B]^\top \in \mathbb{R}^{B \times F}, \tag{4}$$

where $\mathbf{l}_i$ represents the transformed embedding of the $i$-th image description in the mini-batch.

Fig. 3 Detailed illustration of the image-text module, which consists of a VLM, a text encoder, and an MLP. This module processes input images and converts them into image-text embeddings

3.6 Similarity Mechanism

The embeddings from the three types of modalities, text ($\mathbf{T}$), image ($\mathbf{V}$), and image-text ($\mathbf{L}$), are the inputs for the similarity method. The process incorporates two stages: i) similarity computation and ii) weighted integration. In the first stage, WildIng computes the cosine similarities between text and image embeddings, as well as text and image-text embeddings. Specifically, let $\mathbf{W} \in \mathbb{R}^{B \times |\mathcal{C}_{\mathcal{D}}|}$ be the matrix of cosine similarities between the text and image embeddings, computed as follows:

$$W_{ij} = \frac{\langle \mathbf{v}_i, \mathbf{t}_j \rangle}{\|\mathbf{v}_i\| \, \|\mathbf{t}_j\|} \quad \forall \; 1 \le i \le B, \; 1 \le j \le |\mathcal{C}_{\mathcal{D}}|, \tag{5}$$

where $W_{ij}$ represents the $(i, j)$ item of the matrix, $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\|\cdot\|$ is the $\ell_2$ norm of a vector. Following the same process, WildIng calculates the cosine similarities between the text and image-text embeddings as follows:

$$Q_{ij} = \frac{\langle \mathbf{l}_i, \mathbf{t}_j \rangle}{\|\mathbf{l}_i\| \, \|\mathbf{t}_j\|} \quad \forall \; 1 \le i \le B, \; 1 \le j \le |\mathcal{C}_{\mathcal{D}}|, \tag{6}$$

where $\mathbf{Q} \in \mathbb{R}^{B \times |\mathcal{C}_{\mathcal{D}}|}$ is the matrix of cosine similarities between the text and image-text embeddings.

Both cosine similarities are combined using a weighted average between the matrices $\mathbf{W}$ and $\mathbf{Q}$, where the weights are controlled by the hyperparameter $\alpha \in [0, 1]$. Specifically, the output of the weighted integration is a matrix $\mathbf{S} \in \mathbb{R}^{B \times |\mathcal{C}_{\mathcal{D}}|}$, defined as follows:

$$\mathbf{S} = \alpha \mathbf{W} + (1 - \alpha) \mathbf{Q}. \tag{7}$$

Since $\alpha \in [0, 1]$, the resulting matrix $\mathbf{S}$ is a convex combination of $\mathbf{W}$ and $\mathbf{Q}$. As a result, each element $S_{ij}$ stays within the range of the cosine similarities being combined, that is, between $-1$ and $1$.

3.7 Contrastive Loss

We train our model using a contrastive loss function, $\mathcal{L}$, which takes the matrix $\mathbf{S}$ as input. The loss function is calculated for each mini-batch as follows:

$$\mathcal{L}(\mathbf{S}) = \frac{1}{B} \sum_{i=1}^{B} -\log \frac{\exp(S_{ik}/\tau)}{\sum_{j=1}^{|\mathcal{C}_{\mathcal{D}}|} \exp(S_{ij}/\tau)}, \tag{8}$$

where $\tau$ is a temperature hyperparameter and $k$ is the index in $\mathcal{C}_{\mathcal{D}}$ of the class of the $i$-th image in the mini-batch.
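As a rough illustration of Eqs. 5 to 8, the sketch below computes the fused similarity matrix and the loss in plain Python. It assumes the three embedding matrices are given as lists of vectors; the function names and the toy two-dimensional values are hypothetical, and the log-sum-exp shift is a standard numerical-stability choice rather than something the paper specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (the ratio in Eqs. 5 and 6)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fused_similarity(V, L, T, alpha):
    """S = alpha * W + (1 - alpha) * Q, one row per image, one column per class (Eq. 7)."""
    return [
        [alpha * cosine(v, t) + (1.0 - alpha) * cosine(l, t) for t in T]
        for v, l in zip(V, L)
    ]


def contrastive_loss(S, labels, tau):
    """Mean negative log-softmax of the true-class similarity (Eq. 8)."""
    total = 0.0
    for row, k in zip(S, labels):
        logits = [s / tau for s in row]
        shift = max(logits)  # subtract the maximum before exponentiating
        log_norm = shift + math.log(sum(math.exp(x - shift) for x in logits))
        total += log_norm - logits[k]
    return total / len(S)


# Toy mini-batch: B = 2 images, 2 classes, F = 2.
T_toy = [[1.0, 0.0], [0.0, 1.0]]   # class centroids
V_toy = [[1.0, 0.0], [0.0, 2.0]]   # image embeddings
L_toy = [[1.0, 1.0], [0.0, 1.0]]   # image-text embeddings
S_toy = fused_similarity(V_toy, L_toy, T_toy, alpha=0.5)
loss_correct = contrastive_loss(S_toy, labels=[0, 1], tau=0.1)
loss_swapped = contrastive_loss(S_toy, labels=[1, 0], tau=0.1)
```

With matching labels the loss is small, and swapping the labels makes it much larger, which is the behaviour the loss is meant to reward.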
This loss function aims to ensure that the embeddings corresponding to the same species category are brought closer together in the feature space.

4 Experiments and Results

In this section, we describe the datasets used in this work, the evaluation protocol, implementation details, results, and a discussion of WildIng. We compare our proposal with CLIP (Radford et al., 2021), CLIP-Adapter (Gao et al., 2024), Long-CLIP (Zhang et al., 2024a), BioCLIP (Stevens et al., 2024), WildCLIP (Gabeff et al., 2024), and some adaptations of Long-CLIP and BioCLIP. Additionally, we conduct ablation studies to analyze the contribution of each component of WildIng, such as the image encoder, the image-text encoder, and the LLM. We explore the effect of incorporating a template set to introduce task-specific information and evaluate different LLMs to assess their impact on performance. Finally, we investigate the sensitivity of WildIng to the hyperparameter $\alpha$ in the similarity mechanism, and to the number of LLM-prompted sentences in the text encoder. All evaluations are reported using accuracy and macro F1-score.

4.1 Datasets

We evaluate WildIng using two publicly available camera trap datasets from different geographical regions: Snapshot Serengeti (Swanson et al., 2015), collected in savanna environments in Africa using Scoutguard cameras, and Terra Incognita (Beery et al., 2018), collected in the southwest of the United States, where the predominant environment is semi-arid desert and pinyon–juniper woodland (Archer & Predick, 2008). The camera trap models used in the Terra Incognita dataset are not specified. This dataset presents several visual challenges, such as poor illumination (especially at night), motion blur due to low shutter speed, occlusions from vegetation or frame edges, and forced perspective when animals appear very close to the camera. Examples of cropped images from these datasets are shown in Figure 4 and their class distributions are shown in Figure 5 and Figure 6.

Fig. 4 Cropped images from the Snapshot Serengeti and Terra Incognita datasets where we observe the geographical domain shift and the difference in classes (different taxonomic groups)

• Snapshot Serengeti (Swanson et al., 2015). We use the version of the Snapshot Serengeti dataset adopted in WildCLIP (Gabeff et al., 2024), which consists of 46 classes. This dataset version contains 380 × 380 pixel image crops, generated by the MegaDetector model from the Snapshot Serengeti project, using a confidence threshold above 0.7. Only images containing single animals were selected. The dataset includes a total of 340,972 images, with 230,971 for training, 24,059 for validation, and 85,942 for testing.
• Terra Incognita (Beery et al., 2018). This dataset consists of 16 classes and introduces two testing groups: Cis-locations and Trans-locations. Cis-locations contain images similar to the training data, while Trans-locations feature images from different environments. These partitions were originally designed to assess the robustness of computer vision models in an in-domain evaluation setting. We filter the images in this dataset using the MegaDetector model from the PyTorch-Wildlife library (Hernandez et al., 2024). The dataset contains a total of 45,912 images, distributed as follows: 12,313 for training, 1,932 for Cis-validation, 1,501 for Trans-validation, 13,052 for Cis-test, and 17,114 for Trans-test.

Fig. 5 Class distribution of the Serengeti dataset

Fig. 6 Class distribution of the Terra Incognita dataset

4.2 Evaluation Protocol

We conduct two experiments to assess the performance of our model in comparison to state-of-the-art (SOTA) methods. In the first experiment, we use the Snapshot Serengeti dataset for training and validation and the Terra Incognita dataset for testing.
This cross-dataset setup allows us to evaluate the model's performance under geographical domain shift, i.e., geographical out-of-domain evaluation. Formally, we define $\mathcal{D}$ as the Snapshot Serengeti dataset and $\mathcal{S}$ as the Terra Incognita dataset. Snapshot Serengeti was collected in various protected areas in Africa, whereas Terra Incognita originates from the southwest of the United States. This experimental setup introduces two key challenges: i) a distribution shift between the datasets $\mathcal{D}$ and $\mathcal{S}$ (geographical domain shift), and ii) a discrepancy in the set of classes, where $\mathcal{C}_{\mathcal{D}} \neq \mathcal{C}_{\mathcal{S}}$. These challenges are illustrated in Figure 4. Due to the difference in class sets, closed-set SOTA methods cannot be used for comparison. We report accuracy results on the Cis-Test and Trans-Test subsets of Terra Incognita.

In the second experiment, we modify our problem definition from Section 3 and use the same dataset for training, validation, and testing to evaluate the model without the challenges introduced by the geographical domain shift and novel classes, i.e., under in-domain evaluation. Specifically, we train and evaluate the model on either the Snapshot Serengeti or the Terra Incognita datasets. This approach allows us to assess the model's accuracy and robustness within a consistent domain.

4.2.1 Implementation Details

For pre-processing, we employ the MegaDetector model (Beery et al., 2019). WildIng is implemented using the 3.5 version of ChatGPT for the LLM, the LongCLIP-B version of Long-CLIP for the text and image encoder, and the 1.5-7B version of LLaVA for the VLM. For training our model in both experiments, we set $\tau = 0.1$. The MLP architecture for the first experiment consists of a single hidden layer with a dimension of 793 and employs the Rectified Linear Unit (ReLU) as the activation function. A skip connection is implemented between the input and output layers. We train the model for 30 epochs and set $\alpha = 0.5$.
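The MLP described above (one hidden layer, ReLU activation, and a skip connection between input and output) could look roughly like the sketch below. It assumes the skip connection is a plain element-wise addition of the input to the MLP output, which is one common reading of the description and is not confirmed by the text; the random initialization, the helper names, and the toy sizes are likewise hypothetical.

```python
import random


def make_mlp(dim, hidden, seed=0):
    """Create weights for a dim -> hidden -> dim MLP (small random init, an assumption)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden, dim), "b1": [0.0] * hidden,
        "W2": matrix(dim, hidden), "b2": [0.0] * dim,
    }


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward(params, x):
    """Hidden layer with ReLU, output layer, then add the input back (skip connection)."""
    hidden = [max(0.0, h) for h in linear(params["W1"], params["b1"], x)]
    out = linear(params["W2"], params["b2"], hidden)
    return [o + xi for o, xi in zip(out, x)]


def count_parameters(params):
    """Total number of weights and biases."""
    total = 0
    for value in params.values():
        if value and isinstance(value[0], list):
            total += sum(len(row) for row in value)
        else:
            total += len(value)
    return total


# Toy sizes for illustration; the paper uses a hidden dimension of 793.
params = make_mlp(dim=4, hidden=3)
refined = forward(params, [1.0, -2.0, 0.5, 0.0])
```

Under this reading, a network whose output layer is all zeros returns its input unchanged, so training only has to learn a correction on top of the text-encoder embedding.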
Optimization is performed using the Stochastic Gradient Descent (SGD) algorithm with a learning rate of 0.09, a momentum of 0.80, and a batch size of 128. We perform our experiments on Tesla P100-PCIE-16GB GPUs.

In the second experiment, we fine-tune WildIng by unfreezing the image encoder and adjusting key hyperparameters, including $\alpha = 0.6$. Specifically, we use a batch size of 256 and train the model for 57 epochs using SGD with a momentum of 0.82 and a learning rate of 1e−3. For the Snapshot Serengeti dataset, we apply an MLP with a single hidden layer with 256 dimensions. For the Terra Incognita dataset, we set an MLP with a single hidden layer of 733 dimensions. Early stopping is applied with a patience of 5 epochs.

To optimize the hyperparameters, we use a random search. Furthermore, we evaluate the best hyperparameter combination using 100 different random seeds for the first experiment and 3 random seeds for the second experiment. For details on the search space and additional information, please refer to Appendix B.

4.3 Quantitative Results

4.3.1 Comparison with the SOTA in Out-of-domain Evaluation

Table 1 presents the zero-shot performance evaluation of various models trained on datasets such as ShareGPT4V and Snapshot Serengeti and evaluated on the Cis-Test and Trans-Test sets of the Terra Incognita dataset. Models such as CLIP, BioCLIP, and Long-CLIP (first three rows) are reported without their standard deviation, as we used the pre-trained models provided by the authors of each model. For the remaining cases, we report the results obtained by training the model with 100 different random seeds.

Table 1 Zero-shot performance results of WildIng and other foundation models on the Terra Incognita dataset (out-of-domain evaluation). All methods are trained on data different from the test dataset. The best method is highlighted in bold, and the second-best is underlined. Results are reported in accuracy (%) and macro F1 score (F1-M)

| Model | Training | Cis-Test | F1-M | Trans-Test | F1-M |
|---|---|---|---|---|---|
| CLIP | OpenAI data | 39.14 | 0.39 | 34.67 | 0.32 |
| Long-CLIP | ShareGPT4V | 42.41 | 0.41 | 37.55 | 0.34 |
| BioCLIP | TREEOFLIFE-10M | 21.12 | 0.20 | 14.53 | 0.15 |
| CLIP Adapter | Snapshot Serengeti | 27.45±2.84 | 0.25±0.03 | 16.17±3.37 | 0.18±0.03 |
| Long-CLIP Adapter | Snapshot Serengeti | 30.40±1.58 | 0.28±0.02 | 18.73±1.54 | 0.19±0.02 |
| BioCLIP Adapter | Snapshot Serengeti | 14.21±3.30 | 0.12±0.03 | 8.59±3.02 | 0.08±0.01 |
| WildCLIP | Snapshot Serengeti | 41.62±0.40 | 0.39±0.01 | 37.52±0.42 | 0.31±0.01 |
| WildCLIP-LwF | Snapshot Serengeti | 43.67±0.12 | 0.40±0.01 | 40.17±0.05 | 0.34±0.01 |
| WildIng (ours) | Snapshot Serengeti | 50.06±2.39 | 0.43±0.01 | 39.96±1.89 | 0.36±0.01 |

Table 2 Performance comparison in Snapshot Serengeti (in-domain evaluation). All models use the ViT-B/16 backbone. The best method is highlighted in bold, and the second-best is underlined. Results are reported in accuracy (%)

| Model | Loss Function | Test |
|---|---|---|
| Linear Probe CLIP | Cross-entropy | 84.84±0.42 |
| CLIP Adapter | Contrastive | 84.77±0.26 |
| Long-CLIP Adapter | Contrastive | 83.97±1.03 |
| BioCLIP Adapter | Contrastive | 80.47±0.89 |
| WildCLIP | Contrastive | 68.73±0.28 |
| WildCLIP-LwF | Contrastive | 69.53±0.02 |
| WildIng (ours) | Contrastive | 90.74±0.05 |

The results indicate that CLIP, trained on OpenAI data, achieves a Cis-Test accuracy of 39.14% and a Trans-Test accuracy of 34.67%, while Long-CLIP, trained on ShareGPT4V, improves these metrics to 42.41% and 37.55%, respectively. BioCLIP, despite being trained on TREEOFLIFE-10M, struggles to generalize to the Terra Incognita dataset, achieving only 21.12% on Cis-Test and 14.53% on Trans-Test.

Based on CLIP Adapter (Gao et al., 2024), we evaluate this strategy and extend it to Long-CLIP and BioCLIP.
The results for these adapter-based models show lower performance compared to models without adapters. Specifically, the CLIP Adapter, Long-CLIP Adapter, and BioCLIP Adapter obtain Cis-Test accuracies of 27.45%, 30.40%, and 14.21%, and Trans-Test accuracies of 16.17%, 18.73%, and 8.59%, respectively. Overall, these results suggest that adapter-based models are highly domain-specific and further widen the gap between domains compared to the original architectures.

WildCLIP and its Learning without Forgetting (LwF) variant, both trained on Snapshot Serengeti, show better performance, achieving 41.62% and 43.67% on Cis-Test, and 37.52% and 40.17% on Trans-Test, respectively. Our proposed method, WildIng, outperforms all previously evaluated models on Cis-Test, achieving 50.06% accuracy, and obtains the second-best accuracy on Trans-Test with 39.96%, demonstrating its effectiveness in geographical out-of-domain evaluation.

Furthermore, the macro F1 scores highlight that WildIng is less biased toward majority classes (see Fig. 6 for the class distribution of Terra Incognita); this behavior is observed in both test sets (Cis-Test and Trans-Test). WildIng is the only model that improves the macro F1 score relative to its base model (Long-CLIP in our case), increasing it from 0.41 to 0.45 on Cis-Test and from 0.34 to 0.37 on Trans-Test. These results highlight that WildIng not only generalizes well across domains, but also improves per-class performance despite being trained on a highly imbalanced dataset such as Snapshot Serengeti (see Fig. 5).

In addition, the standard deviation of our proposal is comparable to that of SOTA models such as CLIP Adapter, Long-CLIP Adapter, and BioCLIP Adapter. These results highlight the advancements of WildIng in learning geographical domain-invariant representations and improving open-set recognition.
By leveraging additional semantic information to represent the input, WildIng surpasses previous SOTA models in handling geographical domain shifts and recognizing unseen classes.

4.3.2 Trainable Parameters and Computational Cost

In our comparison of trainable parameters, we exclude the zero-shot models CLIP, Long-CLIP, and BioCLIP, since these models are not trained as part of this work. WildIng has a total of 813,339 trainable parameters. This is significantly lower than the WildCLIP and WildCLIP-LwF models, which have approximately 86 million trainable parameters. This difference arises because those models unfreeze the CLIP image encoder during training. Nevertheless, WildIng achieves the best performance in out-of-domain evaluation on the Cis-Test set, with an accuracy of 50.06%, and ranks second on the Trans-Test set with an accuracy of 39.96% (see Table 1). On the other hand, the adapter-based models (CLIP Adapter, Long-CLIP Adapter, and BioCLIP Adapter) have the smallest number of trainable parameters, with 262,914. However, these adapter-based models achieve the worst performance among the trained models, as shown in Table 1. Overall, WildIng offers a good trade-off between model complexity and performance.

For computational cost, training the full WildIng model on the Snapshot Serengeti dataset takes around 30 minutes using a single Tesla P100-PCIE-16GB GPU. For comparison, training the standard version of WildCLIP (Gabeff et al., 2024) on the same dataset on an A100-PCIE-40GB GPU takes 13.4 hours. Our proposal therefore offers a lightweight alternative for addressing geographical domain shift, compared with training a large model such as WildCLIP.

4.3.3 In-domain Performance Comparison in the Snapshot Serengeti Dataset

Table 2 shows a comparison of multiple models trained on the Snapshot Serengeti dataset.
The Linear Probe CLIP model, trained with a cross-entropy loss, achieves a test accuracy of 84.84%. Despite this performance, it lacks open-vocabulary capabilities due to its supervised training approach. Among the adapter-based models, the CLIP Adapter, Long-CLIP Adapter, and BioCLIP Adapter achieve test accuracies of 84.77%, 83.97%, and 80.47%, respectively. These results suggest that adapter-based methods are a good option for in-domain evaluation. WildCLIP achieves a test accuracy of 68.73%, demonstrating moderate performance but performing worse than both the adapter-based models and our method. The WildCLIP-LwF variant improves this performance, reaching 69.53%. The LwF strategy contributes positively to retaining learned information, but its improvement remains limited compared to other models like CLIP Adapter and WildIng. The proposed method, WildIng, achieves the highest test accuracy of 90.74%, outperforming all previously evaluated models while maintaining a low standard deviation.

4.3.4 In-domain Performance Comparison in the Terra Incognita Dataset

Similar to Section 4.3.3, Table 3 presents a comparison of different models in the geographical in-domain evaluation. All models in this comparison are trained and evaluated on the Terra Incognita dataset.

Table 3 Performance comparison in Terra Incognita (in-domain evaluation). All models use the ViT-B/16 backbone. The best method is highlighted in bold, and the second-best is underlined. Results are reported in accuracy (%).

| Model | Cis-Test Acc. (%) | Trans-Test Acc. (%) |
|---|---|---|
| Linear Probe CLIP | 78.09±0.34 | 67.23±2.31 |
| CLIP Adapter | 79.45±1.21 | 69.10±2.70 |
| Long-CLIP Adapter | 79.11±0.47 | 69.15±1.33 |
| BioCLIP Adapter | 77.09±1.19 | 63.45±0.85 |
| WildCLIP | 91.70±0.21 | 84.47±0.24 |
| WildCLIP-LwF | 88.93±0.02 | 82.91±0.01 |
| WildIng (ours) | 84.10±3.03 | 75.80±5.38 |

The Linear Probe CLIP model achieves a Cis-Test accuracy of 78.09% and a Trans-Test accuracy of 67.23%. Among the adapter-based methods, the CLIP Adapter, Long-CLIP Adapter, and BioCLIP Adapter achieve Cis-Test accuracies of 79.45%, 79.11%, and 77.09%, respectively, while their Trans-Test accuracies are 69.10%, 69.15%, and 63.45%. These results indicate that adapter-based models provide competitive performance in in-domain evaluation.

WildCLIP achieves a Cis-Test accuracy of 91.70% and a Trans-Test accuracy of 84.47%, demonstrating the highest performance among all evaluated models. The LwF variant of WildCLIP slightly underperforms its base version, reaching a Cis-Test accuracy of 88.93% and a Trans-Test accuracy of 82.91%. The proposed method achieves a Cis-Test accuracy of 84.10%, outperforming the adapter-based models and the Linear Probe CLIP. In the Trans-Test scenario, WildIng achieves 75.80%, showing competitive performance but lower than WildCLIP and WildCLIP-LwF. However, we observe a significant increase in the standard deviation due to the smaller size of the Terra Incognita dataset, which limits the robustness when evaluated with only three random seeds. This effect is further amplified by unfreezing the image encoder and fine-tuning it in this experiment.

It is well known in the domain generalization literature that there is an inherent trade-off between improving accuracy on a specific domain and achieving effective domain alignment for generalization across domains (Nguyen et al., 2022; Yu et al., 2024). To address this limitation, few-shot adaptation and training-free methods could be explored to reduce the dependence on large datasets (Zhang et al., 2022; Zanella & Ben Ayed, 2024; Bendou et al., 2025). In future work, we will explore these approaches, as their investigation falls beyond the scope of the current study.
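As a rough illustration of how entries such as those in Tables 1 to 3 can be computed, the following sketch derives accuracy and macro F1 from a list of predictions and then summarizes per-seed accuracies as mean ± standard deviation. The function names, the toy labels, and the per-seed values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over the classes in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize(per_seed_values):
    """Mean and sample standard deviation (assumed estimator) across seeds."""
    return mean(per_seed_values), stdev(per_seed_values)


# Toy imbalanced example: class 0 is the majority class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
f1m = macro_f1(y_true, y_pred)

# Hypothetical accuracies (%) from three runs with different random seeds.
avg, std = summarize([84.0, 85.0, 86.0])
print(f"accuracy={acc:.3f}, macro F1={f1m:.3f}, seeds: {avg:.2f}±{std:.2f}")
```

In this toy case the accuracy is higher than the macro F1 because the single error falls on the minority class, which is the kind of majority-class bias that macro F1 is meant to expose.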
4.4 Ablation Studies

We conduct several ablation studies on (i) the impact of incorporating a template set to introduce task-specific information; (ii) the impact of the image encoder, the image-text encoder, and the LLM; and (iii) the influence of different LLMs on performance. This analysis is conducted for the out-of-domain evaluation.

Fig. 7 Evaluation of the template set contribution to WildIng in out-of-domain performance

4.4.1 Evaluating the Incorporation of a Template Set

We evaluate the impact of using a predefined template set to describe the dataset categories, following the original configuration used in CLIP (Radford et al., 2021). This approach aims to incorporate task-specific information captured by the template set and serves as a potential alternative to LLM-generated descriptions in scenarios where their use is impractical or computationally expensive. The template set adds context to the descriptions by specifying that the images were captured by camera traps. For example, a template might describe a category as: “A photo captured by a camera trap of a { }”. For a detailed list of the template set, refer to Appendix C.

To integrate template-based descriptions with LLM descriptions, we introduce a hyperparameter β, which controls the contribution of each source of knowledge to the final text embedding for each class. For both the template set and LLM-generated descriptions, we compute the centroid as defined in equation (1). The final text embedding is obtained as a weighted combination of these two centroids,

$t_c = (1 - \beta)\, m_1^{(c)} + \beta\, m_2^{(c)}$,

where $m_1^{(c)}$ represents the centroid of the template-based descriptions, and $m_2^{(c)}$ represents the centroid of the LLM-based descriptions. The parameter β ∈ [0, 1] determines the relative influence of each source. Figure 7 presents the impact of the template set on model performance by varying the hyperparameter β.
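The weighted combination of the two centroids can be sketched as follows. This is a minimal illustration, not the authors' implementation: the centroid is assumed here to be a plain arithmetic mean of the description embeddings (equation (1) is defined earlier in the paper and may include additional steps such as normalization), and the function names and the two-dimensional toy embeddings are hypothetical.

```python
def centroid(embeddings):
    """Element-wise arithmetic mean of a list of equal-length vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]


def class_text_embedding(template_embs, llm_embs, beta):
    """Return t_c = (1 - beta) * m1 + beta * m2 for a single class."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    m1 = centroid(template_embs)  # centroid of the template-based descriptions
    m2 = centroid(llm_embs)       # centroid of the LLM-based descriptions
    return [(1.0 - beta) * a + beta * b for a, b in zip(m1, m2)]


# Hypothetical 2-D embeddings for one class (real embeddings are much larger).
template_embs = [[1.0, 0.0], [3.0, 0.0]]
llm_embs = [[0.0, 2.0], [0.0, 4.0]]

only_llm = class_text_embedding(template_embs, llm_embs, beta=1.0)
only_templates = class_text_embedding(template_embs, llm_embs, beta=0.0)
balanced = class_text_embedding(template_embs, llm_embs, beta=0.5)
```

With beta = 1.0 the result equals the LLM centroid, so the template set has no influence on the class embedding, while beta = 0.0 uses the templates alone.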
Figure 7 shows that the best results are achieved when β = 1, which corresponds to excluding the template set from the text embedding calculation.

Table 4 Ablation studies for performance variations for different design choices of WildIng. All models are trained on Snapshot Serengeti and evaluated on Terra Incognita (out-of-domain evaluation). The best combination is highlighted in bold, and the second-best is underlined. Results are reported in accuracy (%)

| Img Enc | Img-txt Enc | LLM | Cis-Test | Trans-Test |
|---|---|---|---|---|
| ✓ | ✗ | ✗ | 46.67 | 37.18 |
| ✓ | ✗ | ✓ | 40.32 | 39.11 |
| ✗ | ✓ | ✗ | 28.14±2.53 | 21.72±1.79 |
| ✗ | ✓ | ✓ | 33.57±2.81 | 26.66±2.96 |
| ✓ | ✓ | ✗ | 47.62±2.39 | 38.96±1.89 |
| ✓ | ✓ | ✓ | 50.06±2.39 | 39.96±1.89 |

Table 5 Performance comparison with different LLMs. The best combination is highlighted in bold, and the second-best is underlined. Results are reported in accuracy (%)

| LLM | Cis-Test Acc. (%) | Trans-Test Acc. (%) |
|---|---|---|
| LLAMA | 28.82±0.84 | 21.31±0.60 |
| Qwen | 29.70±0.44 | 22.88±0.55 |
| Phi | 29.47±0.62 | 19.96±0.66 |
| ChatGPT | 50.06±2.39 | 39.96±1.89 |

4.4.2 Image Encoder, Image-text Encoder, and LLM

We evaluate the performance of WildIng by removing the image encoder, the image-text encoder, and the textual descriptions generated by the LLM. Table 4 presents the geographical out-of-domain evaluation results for Cis-Test and Trans-Test under different design choices in the model.

Our findings show that the lowest performance occurs when both the image encoder and LLM-generated descriptions are removed (third row in Table 4). This setting is equivalent to setting α = 0 in Eq. (7) and training the model using the camera trap template set introduced in Section 4.4.1. Under this configuration, the model achieves accuracy scores of 28.14% in the Cis-Test and 21.72% in the Trans-Test. Similarly, removing only the image encoder also results in poor performance (fourth row in Table 4). These results highlight the crucial role of the image encoder in WildIng, indicating that the image-text encoder alone cannot fully replace it.
When the image encoder is included, the model shows a significant improvement, achieving accuracies of 46.67% in the Cis-Test and 37.18% in the Trans-Test (first row in Table 4). Starting from the configuration in the first row of Table 4, adding LLM-generated descriptions improves performance in the Trans-Test, increasing accuracy to 39.11%. However, in the Cis-Test, the accuracy decreases to 40.32% (second row in Table 4). These results suggest that while textual descriptions can be beneficial, their effectiveness depends on the test dataset and the type of text information used. They also indicate that using only the image encoder is insufficient to capture geographically invariant features. Standard deviations are not reported in the first two rows in Table 4, as we exclusively used the pre-trained model provided in Long-CLIP (Zhang et al., 2024a).

When both image and image-text encoders are included, the model provides more robust results across both test sets (fifth and sixth rows in Table 4). In particular, incorporating the image encoder, image-text encoder, and LLM-generated descriptions leads to the highest accuracy, with 50.06% in the Cis-Test and 39.96% in the Trans-Test. This highlights the importance of integrating image and image-text embeddings more effectively to capture relationships between images and their categorical descriptions, helping to construct invariant representations against geographical domain shifts.

4.4.3 Evaluating different LLMs

Table 5 shows a comparison of WildIng when trained with descriptions generated by different LLMs, including LLAMA (Touvron et al., 2023), Qwen (Yang et al., 2024), Phi (Abdin et al., 2024), and ChatGPT. The key difference among these models is the quality of the generated descriptions for each category.
We observe that WildIng achieves an accuracy of 28.82% in the Cis-Test and 21.31% in the Trans-Test when trained with descriptions from LLAMA. Similarly, poor results are obtained with Qwen and Phi. In contrast, when using ChatGPT-generated descriptions, the model reaches 50.06% in the Cis-Test and 39.96% in the Trans-Test. This suggests that the descriptions generated by ChatGPT are significantly better than those produced by the other models in this specific task. These results highlight the importance of generating high-quality descriptions that provide the model with relevant information about each category. More informative descriptions enhance the model’s ability to distinguish between different categories, which in turn results in better performance across both test scenarios.

4.5 Sensitivity to the Hyperparameter α

Figure 8 illustrates how variations in the parameter α between 0 and 1 in Eq. (7) affect the performance of WildIng for Cis-Test and Trans-Test in Terra Incognita for out-of-domain evaluation. We observe that the information from both matrices, Q and W, is complementary. Figure 8 shows that the optimal value of α is 0.5, indicating that giving nearly equal importance to both matrices results in the highest accuracy. When α deviates from this optimal value, the accuracy declines, suggesting that overemphasizing either matrix leads to a loss of useful information for classification. This trend is consistent across both evaluation sets, demonstrating the robustness of the model when incorporating information from both matrices.

Fig. 8 Sensitivity analysis of the hyperparameter α of WildIng in the Terra Incognita dataset for out-of-domain evaluation

Fig. 9 Evaluation of the sensitivity of WildIng to the number of LLM-prompted sentences per class

4.6 Sensitivity to the number of LLM-prompted sentences

Figure 9 presents the performance of WildIng when we vary $M_c$, the number of short descriptions per class c in $C_D$.
We observe a consistent improvement in accuracy with more descriptions, especially on the Cis-Test split. These results suggest that WildIng benefits from a larger number of short descriptions that are diverse and clearly describe each class.

4.7 Limitations

Although FMs have shown promising results in recognizing animal species in camera trap images, there is still a gap between their performance on out-of-domain (Table 1) and in-domain (Table 3) geographical evaluation.

Fig. 10 Failure cases of the WildIng model in camera trap image classification. Case a: The VLM generates an incorrect description containing details that do not match the input image. Case b: A blurry image leads to a vague and uninformative description. Case c: When the input image is highly unclear, the VLM produces a random and unrelated description

Figure 10 illustrates the typical failure cases of WildIng. These misclassifications arise mostly when descriptions are inaccurate or ambiguous. To illustrate the model’s sensitivity to input descriptions, Figure 10 presents three examples:

• In case a, the VLM generates hallucinated details that are not present in the image. For example, it identifies the animal as a horse, even though the input image does not match this description. This results in a completely incorrect prediction.
• In case b, the input image is blurry and difficult to interpret. Consequently, the VLM produces a vague description with minimal useful information. Due to the lack of context, the model fails to make an accurate prediction.
• In case c, the image is so unclear that the VLM generates a description unrelated to the input. This random description further misleads the model, leading to an incorrect prediction.

There are diverse Uncertainty Estimation (UE) strategies for handling hallucinated or low-quality captions (Li et al., 2025).
Neighborhood consistency can be used to identify likely unreliable model responses (Khan & Fu, 2024). The method proposed in (Khan & Fu, 2024) feeds n variations of an input prompt to the model (e.g., a VLM) and expects the evaluated model to produce consistent outputs for all n cases, highlighting whether the model’s deterministic embeddings can effectively handle semantic variability. Another solution, which relies on the hidden embeddings of the model, is proposed in (Mushtaq et al., 2025); it combines hidden state representations from the model with token-level uncertainty to obtain a more comprehensive measure of reliability. In the future, these approaches could be further explored, as their inclusion may help mitigate issues related to inaccurate or ambiguous descriptions.

5 Conclusions

In this research, we introduce WildIng, a new model that provides a geographically invariant representation to mitigate performance loss caused by geographical domain shifts in camera trap image recognition. WildIng addresses geographical domain shifts by leveraging robust features extracted from the image and image-text encoders while also incorporating enhanced category descriptions generated by an LLM. Our extensive experiments demonstrate that WildIng outperforms state-of-the-art models in camera trap image classification, particularly when there are geographical domain shifts between the training and testing datasets, all while preserving its open-vocabulary capabilities.

Appendix A Prompts

In this section, we provide the prompts for the LLM and LLaVA used in WildIng.

Appendix A.1 Prompt LLM

The prompt used to generate the LLM description of the animal species follows a structured format based on the methodology described in (Pratt et al., 2023) and is presented below.

You are an AI assistant specialized in biology and providing accurate and detailed descriptions of animal species.
We are creating detailed and specific prompts to describe various species. The goal is to generate multiple sentences that capture different aspects of each species’ appearance and behavior. Please follow the structure and style shown in the examples below. Each species should have a set of descriptions that highlight key characteristics.

Example Structure:

Badger:
• a badger is a mammal with a stout body and short sturdy legs.
• a badger’s fur is coarse and typically grayish-black.
• badgers often feature a white stripe running from the nose to the back of the head dividing into two stripes along the sides of the body to the base of the tail.
• badgers have broad flat heads with small eyes and ears.
• badger noses are elongated and tapered ending in a black muzzle.
• badgers possess strong well-developed claws adapted for digging burrows.
• overall badgers have a rugged and muscular appearance suited for their burrowing lifestyle.

Appendix A.2 Prompt LLaVA

The prompt used in LLaVA follows the approach described in (Fabian et al., 2023) and is structured as follows:

SYSTEM: You are an AI assistant specialized in biology and providing accurate and detailed descriptions of animal species.\n <image> \n USER: You are given the description of an animal species. Provide a very detailed description of the appearance of the species and describe each body part of the animal in detail. Only include details that can be directly visible in a photograph of the animal. Only include information related to the appearance of the animal and nothing else. Make sure to only include information that is present in the species description and is certainly true for the given species. Do not include any information related to the sound or smell of the animal. Do not include any numerical information related to measurements in the text in units: m cm in inches ft feet km/h kg lb lbs. Remove any special characters such as unicode tags from the text. Return the answer as a single paragraph.
Appendix B Hyperparameter Search Space

To select the hyperparameters, we performed a Monte Carlo partitioning of the dataset. We generated three different partitions, each created using a different random seed. For each partition, a subset of the development set (training + validation data) was randomly assigned to the training and validation sets.

We then conducted a random search over a predefined hyperparameter space, testing different hyperparameter combinations. For each combination, we trained the model on all three partitions separately and computed the accuracy for each setting. To determine the final performance of a configuration, we calculated the mean accuracy and standard deviation across the three partitions. This process was repeated 30 times, and the best hyperparameter combination was selected based on the highest mean accuracy.

The search space included the following hyperparameters:

• Batch size: b ∈ {128, 256}
• Hidden dimension: h ∈ {253 + 60k | k ∈ Z, 0 ≤ k ≤ 11}
• Learning rate: η ∈ {0.01, 0.02, . . . , 0.09}
• Momentum: m ∈ {0.80, 0.82, . . . , 0.98}
• Number of epochs: e ∼ U(25, 100) (randomly sampled between 25 and 100)
• Temperature: τ ∈ {0.1, 0.01, 0.001}
• α: α ∈ {0.4, 0.5, 0.6}

To evaluate the robustness of the selected hyperparameter combination, we further examined its consistency by computing the standard deviation in test accuracy across 100 different random seeds for the out-of-domain experiments and 3 different random seeds for the in-domain experiments, because in the latter we unfreeze the image encoder and fine-tune it.

Appendix C Templates

In this section, we provide examples of templates specifically designed for the camera trap image recognition task. These templates are adapted from the ImageNet templates used in CLIP (Radford et al., 2021) and are presented below:

• a photo captured by a camera trap of a {}.
• a camera trap photo of the {} captured in poor conditions.
• a cropped camera trap image of the {}.
• a camera trap image featuring a bright view of the {}.
• a camera trap image of the {} captured in clean conditions.
• a camera trap image of the {} captured in dirty conditions.
• a camera trap image with low light conditions featuring the {}.
• a black and white camera trap image of the {}.
• a cropped camera trap image of a {}.
• a blurry camera trap image of the {}.
• a camera trap image of the {}.
• a camera trap image of a single {}.
• a camera trap image of a {}.
• a camera trap image of a large {}.
• a blurry camera trap image of a {}.
• a pixelated camera trap image of a {}.
• a camera trap image of the weird {}.
• a camera trap image of the large {}.
• a dark camera trap image of a {}.
• a camera trap image of a small {}.

For each template, we replace “{ }” by the specific category in $C_D$.

Acknowledgements This work was supported by Universidad de Antioquia - CODI (project 2024-73410), by the ANR (French National Research Agency) under the JCJC project DeSNAP (ANR-24-CE23-1895-01), and by the Academic Grant from NVIDIA AI.

Funding Open Access funding provided by Colombia Consortium

Data Availability Snapshot Serengeti data and Terra Incognita are publicly available on LILA BC at [https://lila.science/datasets/snapshot-serengeti] and on Caltech Camera Traps at [https://beerys.github.io/CaltechCameraTraps/]. For reproducibility, we use the preprocessed image data of Snapshot Serengeti as provided by WildCLIP [https://doi.org/10.1007/s11263-024-02026-6].

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., et al. (2024). Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Archer, S. R., & Predick, K. I. (2008). Climate change and ecosystems of the southwestern United States. Rangelands, 30(3), 23–28.
Awais, M., Naseer, M., Khan, S., Anwer, R. M., Cholakkal, H., Shah, M., Yang, M.-H., & Khan, F. S. (2025). Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4), 2245–2264.
Beery, S., Morris, D., & Yang, S. (2019). Efficient pipeline for camera trap image review. arXiv preprint arXiv:1907.06772.
Beery, S., Van Horn, G., & Perona, P. (2018). Recognition in Terra Incognita. In IEEE/CVF European Conference on Computer Vision, 456–473.
Bendou, Y., Ouasfi, A., Gripon, V., & Boukhayma, A. (2025). Proker: A kernel perspective on few-shot adaptation of large vision-language models. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, 25092–25102.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics, 4171–4186.
Duan, J., Chen, L., Tran, S., Yang, J., Xu, Y., Zeng, B., & Chilimbi, T. (2022). Multi-modal alignment using representation codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15651–15660.
Fabian, Z., Miao, Z., Li, C., Zhang, Y., Liu, Z., Hernandez, A., Arbelaez, P., Link, A., Montes-Rojas, A., Escucha, R., et al. (2023). Knowledge augmented instruction tuning for zero-shot animal species recognition. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
Fang, Z., Lu, J., & Zhang, G. (2025). Out-of-distribution detection with non-semantic exploration. Information Sciences, 705, Article 121989.
Gabeff, V., Rußwurm, M., Tuia, D., & Mathis, A. (2024). WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models. International Journal of Computer Vision, 132(9), 3770–3786.
Gadot, T., Istrate, S., Kim, H., Morris, D., Beery, S., Birch, T., & Ahumada, J. (2024). To crop or not to crop: Comparing whole-image and cropped classification on a large dataset of camera trap images. IET Computer Vision, 18(8), 1193–1208.
Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., & Qiao, Y. (2024). Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132, 581–595.
Garau-Luis, J. J., Bordes, P., Gonzalez, L., Roller, M., de Almeida, B., Blum, C., Hexemer, L., Laurent, S., Lang, M., Pierrot, T., et al. (2025). Multi-modal transfer learning between biological foundation models. In Advances in Neural Information Processing Systems, 37, 78431–78450.
Giraldo, J. H., Salazar, A., Gomez-Villa, A., & Diaz-Pulido, A. (2019). Camera-trap images segmentation using multi-layer robust principal component analysis. The Visual Computer, 35, 335–347.
Gomez-Villa, A., Salazar, A., & Vargas, F. (2017). Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecological Informatics, 41, 24–32.
Hernandez, A., Miao, Z., Vargas, L., Dodhia, R., & Lavista, J. (2024). Pytorch-Wildlife: A collaborative deep learning framework for conservation. arXiv preprint arXiv:2405.12930.
Hogeweg, L. E., Gangireddy, R., Brunink, D., Kalkman, V. J., Cornelissen, L., & Kamminga, J. W. (2024). Cood: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3971–3980.
Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Patra, B., et al. (2024). Language is not all you need: Aligning perception with language models. In Advances in Neural Information Processing Systems, 36, 72096–72109.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with alphafold. Nature, 596, 583–589.
Jung, S. J., Kim, H., & Jang, K. S. (2024). Llm based biological named entity recognition from scientific literature. In IEEE International Conference on Big Data and Smart Computing (BigComp), 433–435.
Kempf, E., Schrodi, S., Argus, M., & Brox, T. (2025). When and how does clip enable domain and compositional generalization? arXiv preprint arXiv:2502.09507.
Khan, Z., & Fu, Y. (2024). Consistency and uncertainty: Identifying unreliable responses from black-box vision-language models for selective visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10854–10863.
Lam, H. Y. I., Ong, X. E., & Mutwil, M. (2024). Large language models in plant biology. Trends in Plant Science, 29, 1145–1155.
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 19730–19742.
Li, J., Li, Y., Fu, Y., Liu, J., Liu, Y., Yang, M., & King, I. (2025a). Clip-powered domain generalization and domain adaptation: A comprehensive survey. arXiv preprint arXiv:2504.14280.
Li, S., Xu, X., Meng, W., Song, J., Peng, C., & Shen, H. T. (2025). Mitigating hallucinations in large vision-language models via reasoning uncertainty-guided refinement. IEEE Transactions on Multimedia, 27, 7380–7391.
Li, X., Tian, H., Piao, Z., Wang, G., Xiao, Z., Sun, Y., Gao, E., & Holyoak, M. (2022). cameratrapr: An r package for estimating animal density using camera trapping data. Ecological Informatics, 69, Article 101597.
Liang, W., Mao, Y., Kwon, Y., Yang, X., & Zou, J. (2023). Accuracy on the curve: On the nonlinear correlation of ml performance between data subpopulations. In IEEE/CVF International Conference on Machine Learning, 20706–20724.
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. In Advances in Neural Information Processing Systems, 36, 34892–34916.
Luo, G., Zhou, Y., Sun, X., Wu, Y., Gao, Y., & Ji, R. (2024). Towards language-guided visual recognition via dynamic convolutions. International Journal of Computer Vision, 132, 1–19.
Mushtaq, E., Fabian, Z., Bakman, Y. F., Ramakrishna, A., Soltanolkotabi, M., & Avestimehr, S. (2025). Harmony: Hidden activation representations and model output-aware uncertainty estimation for vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 1663–1668.
Nguyen, T., Lyu, B., Ishwar, P., Scheutz, M., & Aeron, S. (2022). Trade-off between reconstruction loss and feature alignment for domain generalization. In 2022 21st IEEE International Conference on Machine Learning and Applications, 794–801.
Norman, D. L., Bischoff, P. H., Wearn, O. R., Ewers, R. M., Rowcliffe, J. M., Evans, B., Sethi, S., Chapman, P. M., & Freeman, R. (2023). Can CNN-based species classification generalise across variation in habitat within a camera trap survey? Methods in Ecology and Evolution, 14(1), 242–251.
Pantazis, O., Brostow, G., Jones, K., & Mac Aodha, O. (2022). SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models. In British Machine Vision Conference.
Pollock, L. J., Kitzes, J., Beery, S., Gaynor, K. M., Jarzyna, M. A., Mac Aodha, O., Meyer, B., Rolnick, D., Taylor, G. W., Tuia, D., et al. (2025). Harnessing artificial intelligence to fill global shortfalls in biodiversity knowledge. Nature Reviews Biodiversity, 1, 166–182.
Pratt, S., Covert, I., Liu, R., & Farhadi, A. (2023). What does a platypus look like? generating customized prompts for zero-shot image classification. In IEEE/CVF International Conference on Computer Vision, 15691–15701.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 139, 8748–8763.
Reynolds, S. A., Beery, S., Burgess, N., Burgman, M., Butchart, S. H., Cooke, S. J., Coomes, D., Danielsen, F., Di Minin, E., Durán, A. P., et al. (2024). The potential for ai to revolutionize conservation: a horizon scan. Trends in Ecology & Evolution, 40, 191–207.
Riz, L., Saltori, C., Wang, Y., Ricci, E., & Poiesi, F. (2024). Novel class discovery meets foundation models for 3d semantic segmentation. International Journal of Computer Vision, 133, 527–548.
Santamaria, J. D., Isaza, C., & Giraldo, J. H. (2025). CATALOG: A camera trap language-guided contrastive learning model. In IEEE/CVF Winter Conference on Applications of Computer Vision, 1197–1206.
Santamaria P, J. D., Giraldo, J. H., Diaz-Pulido, A., & Isaza, C. (2024). Audio vs. visual approach to monitor the critically endangered species atlapetes blancae: Developing deep learning models with limited data. In IARIA Annual Congress on Frontiers in Science, Technology, Services, and Applications, 72–80.
Schneider, S., Greenberg, S., Taylor, G. W., & Kremer, S. C. (2020). Three critical factors affecting automated image species recognition performance for camera traps. Ecology and Evolution, 10(7), 3503–3517.
Simões, F., Bouveyron, C., & Precioso, F. (2023). DeepWILD: Wildlife identification, localisation and estimation on camera trap videos using deep learning. Ecological Informatics, 75, Article 102095.
Stevens, S., Wu, J., Thompson, M. J., Campolongo, E. G., Song, C. H., Carlyn, D. E., Dong, L., Dahdul, W. M., Stewart, C., Berger-Wolf, T., et al. (2024). BioCLIP: A vision foundation model for the tree of life. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19412–19424.
Swanson, A., Kosmala, M., Lintott, C., Simpson, R., Smith, A., & Packer, C. (2015). Data from: Snapshot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna.
Tan, M., & Le, Q. (2021). Efficientnetv2: Smaller models and faster training. In IEEE/CVF International Conference on Machine Learning, 10096–10106.
Tang, L., Jiang, P.-T., Xiao, H., & Li, B. (2025). Towards training-free open-world segmentation via image prompt foundation models. International Journal of Computer Vision, 133, 1–15.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., & Rodriguez, A. (2023). LLaMA: Open and efficient foundation lan- guage models. arXiv preprint arXiv:2302.13971. Tuia,D.,Kellenberger,B.,Beery, S.,Costelloe,B.R., Zuffi, S.,Risse,B., Mathis, A., Mathis, M.W., van Langevelde, F., Burghardt, T., et al. (2022). Perspectives inmachine learning for wildlife conservation. Nature Communications, 13(1), 792. Wald, Y., Feder, A., Greenfeld, D., & Shalit, U. (2021). On calibration and out-of-domain generalization. In Advances in Neural Infor- mation Processing Systems, 34, 2215–2227. Wang, Q., Lin, Y., Chen, Y., Schmidt, L., Han, B., & Zhang, T. (2024). A sober look at the robustness of clips to spurious features. In Advances in Neural Information Processing Systems, 37, 122484– 122523. Wang, Y., & Kang, G. (2025). Attention head purification: A new per- spective to harness clip for domain generalization. Image and Vision Computing, 157, Article 105511. Wu, W., Sun, Z., Song, Y., Wang, J., & Ouyang, W. (2024). Transfer- ring vision-language models for visual recognition: A classifier perspective. International Journal of Computer Vision, 132, 392– 409. 123 http://arxiv.org/abs/2502.09507 http://arxiv.org/abs/2504.14280 http://arxiv.org/abs/2302.13971 183 Page 16 of 16 International Journal of Computer Vision (2026) 134:183 Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. (2024). Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115. Yang, L., Zhang, R.-Y., Chen, Q., & Xie, X. (2025). Learning with enriched inductive biases for vision-language models. Interna- tional Journal of Computer Vision, 133, 3746–3761. Yang, Z., Tian, Y.,Wang, L., & Zhang, J. (2025). Enhancing generaliza- tion in camera trap image recognition: Fine-tuning visual language models. Neurocomputing, 634, Article 129826. Yu, H., Zhang, X., Xu, R., Liu, J., He, Y., & Cui, P. 
(2024). Rethinking the evaluation protocol of domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21897–21908. Yu,R., Liu, S., Yang,X.,&Wang,X. (2023).Distribution shift inversion for out-of-distribution prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3592–3602. Zanella, M., & Ben Ayed, I. (2024). Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVFConfer- ence on Computer Vision and Pattern Recognition, 1593–1603. Zang, Y., Li, W., Han, J., Zhou, K., & Loy, C. C. (2024). Contextual object detection with multimodal large language models. Interna- tional Journal of Computer Vision, 133, 825–843. Zhang, B., Zhang, P., Dong, X., Zang, Y., & Wang, J. (2024a). Long- CLIP: Unlocking the long-text capability of CLIP. In IEEE/CVF European Conference on Computer Vision, 310–325. Zhang, J., Huang, J., Jin, S., & Lu, S. (2024). Vision-Language Models forVision Tasks: ASurvey. IEEETransactions onPattern Analysis and Machine Intelligence, 46(8), 5625–5644. Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., & Li, H. (2022). Tip-adapter: Training-free adaption of clip for few-shot classification. In Proceedings of the IEEE/CVF European confer- ence on computer vision, 493–510. Zhu, L., Yin, W., Yang, Y., Wu, F., Zeng, Z., Gu, Q., Wang, X., Zhou, C., & Ye, N. (2024). Vision-language alignment learning under affinity and divergence principles for few-shot out-of-distribution generalization. International Journal of Computer Vision, 132, 3375–3407. Publisher’s Note Springer Nature remains neutral with regard to juris- dictional claims in published maps and institutional affiliations. 
123 http://arxiv.org/abs/2412.15115 WildIng: A Wildlife Image Invariant Representation Model for Geographical Domain Shift Abstract 1 Introduction 2 Related Work 2.1 Foundation Models 2.2 Foundation Models for Biology 2.3 Foundation Models for Camera Trap Images 3 WildIng 3.1 Problem Definition 3.2 Overview of the Approach 3.3 Text Encoder 3.4 Image Encoder 3.4.1 Pre-processing 3.4.2 Image Embeddings 3.5 Image-text Encoder 3.6 Similarity Mechanism 3.7 Contrastive Loss 4 Experiments and Results 4.1 Datasets 4.2 Evaluation Protocol 4.2.1 Implementation Details 4.3 Quantitative Results 4.3.1 Comparison with the SOTA in Out-of-domain Evaluation 4.3.2 Trainable Parameters and Computational Cost 4.3.3 In-domain Performance Comparison in the Snapshot Serengeti Dataset 4.3.4 In-domain Performance Comparison in the Terra Incognita Dataset 4.4 Ablation Studies 4.4.1 Evaluating the Incorporation of a Template Set 4.4.2 Image Encoder, Image-text Encoder, and LLM 4.4.3 Evaluating different LLMs 4.5 Sensitivity to the Hyperparameter α 4.6 Sensitivity to the number of LLM-prompted sentences 4.7 Limitations 5 Conclusions Appendix A Prompts Appendix A.1 Prompt LLM Appendix A.2 Prompt LLaVA Appendix B Hyperparameter Search Space Appendix C Templates Acknowledgements References