Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimer’s disease (cs.AI updates on arXiv.org, July 24, 2025 at 4:00 am)
arXiv:2507.17486v1 Announce Type: cross
Abstract: Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models that has not yet been applied to medical imaging or anomaly detection. BFNs combine the strengths of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer’s disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on VAEs (beta-VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.
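For readers unfamiliar with reconstruction-based UAD, the sketch below shows the generic pattern the abstract builds on: a generative model trained only on healthy scans produces a pseudo-healthy version of the input, and the voxel-wise residual becomes the anomaly map. The `reconstruct_healthy` callable, the threshold, and the hypometabolism sign convention are illustrative assumptions, not AnoBFN's actual algorithm.

```python
import numpy as np


def detect_anomalies(pet_image: np.ndarray,
                     reconstruct_healthy,          # hypothetical: a trained generative model's sampler
                     threshold: float = 0.2) -> tuple[np.ndarray, np.ndarray]:
    """Generic reconstruction-based UAD: compare a subject's FDG PET image with a
    pseudo-healthy reconstruction from a model trained only on healthy scans.

    `reconstruct_healthy` stands in for the conditional generation step (in AnoBFN,
    a Bayesian flow network run under spatially correlated noise with recursive
    feedback from the input); here it is just a callable returning an image of the
    same shape.
    """
    pseudo_healthy = reconstruct_healthy(pet_image)

    # For FDG PET, Alzheimer's-related anomalies appear as hypometabolism,
    # i.e. lower uptake than the pseudo-healthy reconstruction.
    residual = pseudo_healthy - pet_image

    # Keep only positive deviations (reduced uptake) above a threshold.
    anomaly_map = np.clip(residual, 0.0, None)
    anomaly_mask = anomaly_map > threshold
    return anomaly_map, anomaly_mask


# Hypothetical usage with a dummy "model" that returns a constant image:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan = rng.random((64, 64, 64)).astype(np.float32)
    amap, mask = detect_anomalies(scan, reconstruct_healthy=lambda x: np.full_like(x, x.mean()))
    print(amap.shape, int(mask.sum()))
```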
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks (cs.AI updates on arXiv.org, July 24, 2025 at 4:00 am)
arXiv:2507.01955v2 Announce Type: replace-cross
Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc).
The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework.
We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g. o3, show improvements in geometric tasks. 7) A preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
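The "prompt chaining" idea, translating a standard vision task into a sequence of text-promptable API calls, can be illustrated with a minimal classification chain. The `call_model` function below is a hypothetical stand-in for whichever vendor API is used; the benchmark's real chains are task-specific and considerably more elaborate.

```python
from typing import Callable, Sequence


def classify_via_prompt_chain(image_path: str,
                              labels: Sequence[str],
                              call_model: Callable[[str, str], str]) -> str:
    """Illustrative prompt chain: turn image classification into two text prompts."""
    # Step 1: ask for an open-ended description so the model commits to
    # visual evidence before seeing the label set.
    description = call_model(image_path,
                             "Describe the main object in this image in one sentence.")

    # Step 2: constrain the answer to the benchmark's label vocabulary.
    label_list = ", ".join(labels)
    answer = call_model(image_path,
                        f"The image was described as: '{description}'. "
                        f"Which single label from this list fits best: {label_list}? "
                        "Reply with the label only.")

    # Step 3: normalize the free-form reply onto a valid label.
    answer = answer.strip().lower()
    for label in labels:
        if label.lower() in answer:
            return label
    return "unknown"
```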
How Not to Mislead with Your Data-Driven Story (Towards Data Science, July 23, 2025 at 7:10 pm)
Data storytelling can enlighten—but it can also deceive. When persuasive narratives meet biased framing, cherry-picked data, or misleading visuals, insights risk becoming illusions. This article explores the hidden biases embedded in data-driven storytelling—from the seduction of beautiful charts to the quiet influence of AI-generated insights—and offers practical strategies to tell stories that are not only compelling, but also credible, transparent, and grounded in truth.
The post How Not to Mislead with Your Data-Driven Story appeared first on Towards Data Science.
The Download: what’s next for AI agents, and how Trump protects US tech companies overseas (MIT Technology Review, July 23, 2025 at 12:10 pm)
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.
Navigating the rise of AI agents
"AI agents" is a buzzy term that essentially refers to AI models and algorithms that can not only provide you with information, but take actions on your…
Sam Altman: AI will cause job losses and national security threats (AI News, July 23, 2025 at 10:57 am)
In the halls of power in Washington, OpenAI’s chief, Sam Altman, warned of total job losses from AI and how national security is being rewritten. Altman positions OpenAI as not just a participant, but as the essential architect of our destiny. Holding court at the Federal Reserve’s conference for large banks, Altman clearly stated how…
The post Sam Altman: AI will cause job losses and national security threats appeared first on AI News.
Beyond Algorethics: Addressing the Ethical and Anthropological Challenges of AI Recommender Systems (cs.AI updates on arXiv.org, July 23, 2025 at 4:00 am)
arXiv:2507.16430v1 Announce Type: cross
Abstract: In this paper, I examine the ethical and anthropological challenges posed by AI-driven recommender systems (RSs), which have become central to shaping digital environments and social interactions. By curating personalized content, RSs do not merely reflect user preferences but actively construct individual experiences across social media, entertainment platforms, and e-commerce. Despite their ubiquity, the ethical implications of RSs remain insufficiently explored, even as concerns over privacy, autonomy, and mental well-being intensify. I argue that existing ethical approaches, including algorethics (the effort to embed ethical principles into algorithmic design), are necessary but ultimately inadequate. RSs inherently reduce human complexity to quantifiable dimensions, exploit user vulnerabilities, and prioritize engagement over well-being. Addressing these concerns requires moving beyond purely technical solutions. I propose a comprehensive framework for human-centered RS design, integrating interdisciplinary perspectives, regulatory strategies, and educational initiatives to ensure AI systems foster rather than undermine human autonomy and societal flourishing.
A Well-Designed Experiment Can Teach You More Than a Time Machine! (Towards Data Science, July 23, 2025 at 2:50 am)
How experimentation is more powerful than knowing counterfactuals
The post A Well-Designed Experiment Can Teach You More Than a Time Machine! appeared first on Towards Data Science.
When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction (Towards Data Science, July 22, 2025 at 7:35 pm)
Can large language models learn to reason abstractly from just a few examples? In this piece, I explore this question by testing both text-based (o3-mini) and image-capable (gpt-4.1) models on abstract grid transformation tasks. These experiments reveal the extent to which current models rely on pattern matching, procedural heuristics, and symbolic shortcuts rather than robust generalization. Even with multimodal inputs, reasoning often breaks down in the face of subtle abstraction. The results offer a window into the current capabilities and limitations of in-context meta-learning with LLMs.
The post When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction appeared first on Towards Data Science.
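To make the setup concrete, here is a rough sketch of how an abstract grid-transformation task can be posed to an LLM as a few-shot text prompt. The grid encoding, the example rule, and the unimplemented `ask_llm` call are illustrative assumptions rather than the article's exact protocol.

```python
# Sketch of a few-shot grid-transformation prompt, in the spirit of the
# experiments above. `ask_llm` is a hypothetical stand-in for an API call
# to o3-mini, gpt-4.1, or similar.

Grid = list[list[int]]


def grid_to_text(grid: Grid) -> str:
    """Render a grid of small integers as whitespace-separated rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_prompt(examples: list[tuple[Grid, Grid]], test_input: Grid) -> str:
    """Assemble a few-shot prompt from example (input, output) grid pairs."""
    parts = ["Each task maps an input grid to an output grid. Infer the rule."]
    for i, (inp, out) in enumerate(examples, 1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)


# Toy rule: mirror the grid left-to-right.
examples = [([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
            ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]])]
prompt = build_prompt(examples, test_input=[[5, 0, 0], [0, 6, 0]])
print(prompt)  # send to ask_llm(prompt) and parse the returned grid
```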
The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering (cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am)
arXiv:2507.15003v1 Announce Type: cross
Abstract: The future of software engineering, SE 3.0, is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development.
Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes, enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission (one developer submitted as many PRs in three days as they had in three years), their PRs are structurally simpler, as measured by code complexity metrics.
We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3.
Keywords: AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
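As a purely illustrative example of the analyses the dataset enables (such as the acceptance-rate gap mentioned above), the snippet below computes per-agent PR acceptance rates from a hypothetical CSV export. The file name and column names are assumptions; consult the linked repository for the dataset's actual schema.

```python
import pandas as pd

# Hypothetical export of AIDev-style PR metadata; columns `agent` and `state`
# are assumed for this sketch.
prs = pd.read_csv("aidev_pull_requests.csv")

# Acceptance rate per coding agent: merged PRs / all closed PRs.
closed = prs[prs["state"].isin(["merged", "closed"])]
acceptance = (closed.assign(merged=closed["state"].eq("merged"))
                    .groupby("agent")["merged"]
                    .mean()
                    .sort_values(ascending=False))
print(acceptance)
```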
Benchmarking Foundation Models with Multimodal Public Electronic Health Records (cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am)
arXiv:2507.14824v1 Announce Type: cross
Abstract: Foundation models have emerged as a powerful approach for processing electronic health records (EHRs), offering flexibility to handle diverse medical data modalities. In this study, we present a comprehensive benchmark that evaluates the performance, fairness, and interpretability of foundation models, both as unimodal encoders and as multimodal learners, using the publicly available MIMIC-IV database. To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clinical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants. Our findings demonstrate that incorporating multiple data modalities leads to consistent improvements in predictive performance without introducing additional bias. Through this benchmark, we aim to support the development of effective and trustworthy multimodal artificial intelligence (AI) systems for real-world clinical applications. Our code is available at https://github.com/nliulab/MIMIC-Multimodal.
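The unimodal-versus-multimodal comparison described above can be sketched with a simple late-fusion probe: concatenate per-modality embeddings and train a linear classifier, then compare AUROC against single-modality baselines. Everything below (embedding shapes, the mortality-style label, the logistic-regression probe) is a placeholder, not the benchmark's actual pipeline or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
notes_emb = rng.normal(size=(n, 64))   # placeholder for a clinical-note encoder's output
labs_emb = rng.normal(size=(n, 32))    # placeholder for a lab/time-series encoder's output
y = rng.integers(0, 2, size=n)         # placeholder outcome label (e.g. in-hospital mortality)


def evaluate(features: np.ndarray) -> float:
    """Fit a logistic-regression probe on frozen embeddings and report AUROC."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])


print("notes only  :", round(evaluate(notes_emb), 3))
print("labs only   :", round(evaluate(labs_emb), 3))
print("notes + labs:", round(evaluate(np.hstack([notes_emb, labs_emb])), 3))
```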