Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimer’s disease (cs.AI updates on arXiv.org, July 24, 2025 at 4:00 am)
arXiv:2507.17486v1 Announce Type: cross
Abstract: Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models that has not yet been applied to medical imaging or anomaly detection. BFNs combine the strengths of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer’s disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on VAEs (beta-VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.
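For readers unfamiliar with reconstruction-based UAD, the sketch below shows the generic pattern the abstract builds on: a generative model trained only on healthy scans produces a pseudo-healthy version of the input, and the voxel-wise residual becomes the anomaly map. The `reconstruct_healthy` callable, the threshold, and the hypometabolism sign convention are illustrative assumptions, not AnoBFN's actual algorithm.

```python
import numpy as np


def detect_anomalies(pet_image: np.ndarray,
                     reconstruct_healthy,          # hypothetical: a trained generative model's sampler
                     threshold: float = 0.2) -> tuple[np.ndarray, np.ndarray]:
    """Generic reconstruction-based UAD: compare a subject's FDG PET image with a
    pseudo-healthy reconstruction from a model trained only on healthy scans.

    `reconstruct_healthy` stands in for the conditional generation step (in AnoBFN,
    a Bayesian flow network run under spatially correlated noise with recursive
    feedback from the input); here it is just a callable returning an image of the
    same shape.
    """
    pseudo_healthy = reconstruct_healthy(pet_image)

    # For FDG PET, Alzheimer's-related anomalies appear as hypometabolism,
    # i.e. lower uptake than the pseudo-healthy reconstruction.
    residual = pseudo_healthy - pet_image

    # Keep only positive deviations (reduced uptake) above a threshold.
    anomaly_map = np.clip(residual, 0.0, None)
    anomaly_mask = anomaly_map > threshold
    return anomaly_map, anomaly_mask


# Hypothetical usage with a dummy "model" that returns a constant image:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan = rng.random((64, 64, 64)).astype(np.float32)
    amap, mask = detect_anomalies(scan, reconstruct_healthy=lambda x: np.full_like(x, x.mean()))
    print(amap.shape, int(mask.sum()))
```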
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks (cs.AI updates on arXiv.org, July 24, 2025 at 4:00 am)
arXiv:2507.01955v2 Announce Type: replace-cross
Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc).
The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework.
We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g. o3, show improvements in geometric tasks. 7) A preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
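The "prompt chaining" idea, translating a standard vision task into a sequence of text-promptable API calls, can be illustrated with a minimal classification chain. The `call_model` function below is a hypothetical stand-in for whichever vendor API is used; the benchmark's real chains are task-specific and considerably more elaborate.

```python
from typing import Callable, Sequence


def classify_via_prompt_chain(image_path: str,
                              labels: Sequence[str],
                              call_model: Callable[[str, str], str]) -> str:
    """Illustrative prompt chain: turn image classification into two text prompts."""
    # Step 1: ask for an open-ended description so the model commits to
    # visual evidence before seeing the label set.
    description = call_model(image_path,
                             "Describe the main object in this image in one sentence.")

    # Step 2: constrain the answer to the benchmark's label vocabulary.
    label_list = ", ".join(labels)
    answer = call_model(image_path,
                        f"The image was described as: '{description}'. "
                        f"Which single label from this list fits best: {label_list}? "
                        "Reply with the label only.")

    # Step 3: normalize the free-form reply onto a valid label.
    answer = answer.strip().lower()
    for label in labels:
        if label.lower() in answer:
            return label
    return "unknown"
```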
How Not to Mislead with Your Data-Driven Story (Towards Data Science, July 23, 2025 at 7:10 pm)
Data storytelling can enlighten—but it can also deceive. When persuasive narratives meet biased framing, cherry-picked data, or misleading visuals, insights risk becoming illusions. This article explores the hidden biases embedded in data-driven storytelling—from the seduction of beautiful charts to the quiet influence of AI-generated insights—and offers practical strategies to tell stories that are not only compelling, but also credible, transparent, and grounded in truth.
The post How Not to Mislead with Your Data-Driven Story appeared first on Towards Data Science.
The Download: what’s next for AI agents, and how Trump protects US tech companies overseas (MIT Technology Review, July 23, 2025 at 12:10 pm)
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.
Navigating the rise of AI agents
"AI agents" is a buzzy term that essentially refers to AI models and algorithms that can not only provide you with information, but take actions on your…
Sam Altman: AI will cause job losses and national security threats (AI News, July 23, 2025 at 10:57 am)
In the halls of power in Washington, OpenAI’s chief, Sam Altman, warned of total job losses from AI and how national security is being rewritten. Altman positions OpenAI as not just a participant, but as the essential architect of our destiny. Holding court at the Federal Reserve’s conference for large banks, Altman clearly stated how…
The post Sam Altman: AI will cause job losses and national security threats appeared first on AI News.
Beyond Algorethics: Addressing the Ethical and Anthropological Challenges of AI Recommender Systems (cs.AI updates on arXiv.org, July 23, 2025 at 4:00 am)
arXiv:2507.16430v1 Announce Type: cross
Abstract: In this paper, I examine the ethical and anthropological challenges posed by AI-driven recommender systems (RSs), which have become central to shaping digital environments and social interactions. By curating personalized content, RSs do not merely reflect user preferences but actively construct individual experiences across social media, entertainment platforms, and e-commerce. Despite their ubiquity, the ethical implications of RSs remain insufficiently explored, even as concerns over privacy, autonomy, and mental well-being intensify. I argue that existing ethical approaches, including algorethics (the effort to embed ethical principles into algorithmic design), are necessary but ultimately inadequate. RSs inherently reduce human complexity to quantifiable dimensions, exploit user vulnerabilities, and prioritize engagement over well-being. Addressing these concerns requires moving beyond purely technical solutions. I propose a comprehensive framework for human-centered RS design, integrating interdisciplinary perspectives, regulatory strategies, and educational initiatives to ensure AI systems foster rather than undermine human autonomy and societal flourishing.
A Well-Designed Experiment Can Teach You More Than a Time Machine! (Towards Data Science, July 23, 2025 at 2:50 am)
How experimentation is more powerful than knowing counterfactuals
The post A Well-Designed Experiment Can Teach You More Than a Time Machine! appeared first on Towards Data Science.
When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction (Towards Data Science, July 22, 2025 at 7:35 pm)
Can large language models learn to reason abstractly from just a few examples? In this piece, I explore this question by testing both text-based (o3-mini) and image-capable (gpt-4.1) models on abstract grid transformation tasks. These experiments reveal the extent to which current models rely on pattern matching, procedural heuristics, and symbolic shortcuts rather than robust generalization. Even with multimodal inputs, reasoning often breaks down in the face of subtle abstraction. The results offer a window into the current capabilities and limitations of in-context meta-learning with LLMs.
The post When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction appeared first on Towards Data Science.
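To make the setup concrete, here is a rough sketch of how an abstract grid-transformation task can be posed to an LLM as a few-shot text prompt. The grid encoding, the example rule, and the unimplemented `ask_llm` call are illustrative assumptions rather than the article's exact protocol.

```python
# Sketch of a few-shot grid-transformation prompt, in the spirit of the
# experiments above. `ask_llm` is a hypothetical stand-in for an API call
# to o3-mini, gpt-4.1, or similar.

Grid = list[list[int]]


def grid_to_text(grid: Grid) -> str:
    """Render a grid of small integers as whitespace-separated rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_prompt(examples: list[tuple[Grid, Grid]], test_input: Grid) -> str:
    """Assemble a few-shot prompt from example (input, output) grid pairs."""
    parts = ["Each task maps an input grid to an output grid. Infer the rule."]
    for i, (inp, out) in enumerate(examples, 1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)


# Toy rule: mirror the grid left-to-right.
examples = [([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
            ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]])]
prompt = build_prompt(examples, test_input=[[5, 0, 0], [0, 6, 0]])
print(prompt)  # send to ask_llm(prompt) and parse the returned grid
```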
The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering (cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am)
arXiv:2507.15003v1 Announce Type: cross
Abstract: The future of software engineering, SE 3.0, is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development.
Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes, enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission (one developer submitted as many PRs in three days as they had in three years), their PRs are structurally simpler, as measured by code complexity metrics.
We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3.
Keywords: AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
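As a purely illustrative example of the analyses the dataset enables (such as the acceptance-rate gap mentioned above), the snippet below computes per-agent PR acceptance rates from a hypothetical CSV export. The file name and column names are assumptions; consult the linked repository for the dataset's actual schema.

```python
import pandas as pd

# Hypothetical export of AIDev-style PR metadata; columns `agent` and `state`
# are assumed for this sketch.
prs = pd.read_csv("aidev_pull_requests.csv")

# Acceptance rate per coding agent: merged PRs / all closed PRs.
closed = prs[prs["state"].isin(["merged", "closed"])]
acceptance = (closed.assign(merged=closed["state"].eq("merged"))
                    .groupby("agent")["merged"]
                    .mean()
                    .sort_values(ascending=False))
print(acceptance)
```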
Benchmarking Foundation Models with Multimodal Public Electronic Health Records (cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am)
arXiv:2507.14824v1 Announce Type: cross
Abstract: Foundation models have emerged as a powerful approach for processing electronic health records (EHRs), offering flexibility to handle diverse medical data modalities. In this study, we present a comprehensive benchmark that evaluates the performance, fairness, and interpretability of foundation models, both as unimodal encoders and as multimodal learners, using the publicly available MIMIC-IV database. To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clinical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants. Our findings demonstrate that incorporating multiple data modalities leads to consistent improvements in predictive performance without introducing additional bias. Through this benchmark, we aim to support the development of effective and trustworthy multimodal artificial intelligence (AI) systems for real-world clinical applications. Our code is available at https://github.com/nliulab/MIMIC-Multimodal.
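The unimodal-versus-multimodal comparison described above can be sketched with a simple late-fusion probe: concatenate per-modality embeddings and train a linear classifier, then compare AUROC against single-modality baselines. Everything below (embedding shapes, the mortality-style label, the logistic-regression probe) is a placeholder, not the benchmark's actual pipeline or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
notes_emb = rng.normal(size=(n, 64))   # placeholder for a clinical-note encoder's output
labs_emb = rng.normal(size=(n, 32))    # placeholder for a lab/time-series encoder's output
y = rng.integers(0, 2, size=n)         # placeholder outcome label (e.g. in-hospital mortality)


def evaluate(features: np.ndarray) -> float:
    """Fit a logistic-regression probe on frozen embeddings and report AUROC."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])


print("notes only  :", round(evaluate(notes_emb), 3))
print("labs only   :", round(evaluate(labs_emb), 3))
print("notes + labs:", round(evaluate(np.hstack([notes_emb, labs_emb])), 3))
```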