A Well-Designed Experiment Can Teach You More Than a Time Machine!
Towards Data Science, July 23, 2025 at 2:50 am
How experimentation is more powerful than knowing counterfactuals
The post A Well-Designed Experiment Can Teach You More Than a Time Machine! appeared first on Towards Data Science.
When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction
Towards Data Science, July 22, 2025 at 7:35 pm
Can large language models learn to reason abstractly from just a few examples? In this piece, I explore this question by testing both text-based (o3-mini) and image-capable (gpt-4.1) models on abstract grid transformation tasks. These experiments reveal the extent to which current models rely on pattern matching, procedural heuristics, and symbolic shortcuts rather than robust generalization. Even with multimodal inputs, reasoning often breaks down in the face of subtle abstraction. The results offer a window into the current capabilities and limitations of in-context meta-learning with LLMs.
The post When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction appeared first on Towards Data Science.
The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering
cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am
arXiv:2507.15003v1 Announce Type: cross
Abstract: The future of software engineering–SE 3.0–is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents–OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code–across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development.
Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes–enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission–one developer submitted as many PRs in three days as they had in three years–these are structurally simpler (via code complexity metrics).
We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3.
> AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
Benchmarking Foundation Models with Multimodal Public Electronic Health Records
cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am
arXiv:2507.14824v1 Announce Type: cross
Abstract: Foundation models have emerged as a powerful approach for processing electronic health records (EHRs), offering flexibility to handle diverse medical data modalities. In this study, we present a comprehensive benchmark that evaluates the performance, fairness, and interpretability of foundation models, both as unimodal encoders and as multimodal learners, using the publicly available MIMIC-IV database. To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clinical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants. Our findings demonstrate that incorporating multiple data modalities leads to consistent improvements in predictive performance without introducing additional bias. Through this benchmark, we aim to support the development of effective and trustworthy multimodal artificial intelligence (AI) systems for real-world clinical applications. Our code is available at https://github.com/nliulab/MIMIC-Multimodal.
A Reproducibility Study of Product-side Fairness in Bundle Recommendation
cs.AI updates on arXiv.org, July 22, 2025 at 4:00 am
arXiv:2507.14352v1 Announce Type: cross
Abstract: Recommender systems are known to exhibit fairness issues, particularly on the product side, where products and their associated suppliers receive unequal exposure in recommended results. While this problem has been widely studied in traditional recommendation settings, its implications for bundle recommendation (BR) remain largely unexplored. This emerging task introduces additional complexity: recommendations are generated at the bundle level, yet user satisfaction and product (or supplier) exposure depend on both the bundle and the individual items it contains. Existing fairness frameworks and metrics designed for traditional recommender systems may not directly translate to this multi-layered setting. In this paper, we conduct a comprehensive reproducibility study of product-side fairness in BR across three real-world datasets using four state-of-the-art BR methods. We analyze exposure disparities at both the bundle and item levels using multiple fairness metrics, uncovering important patterns. Our results show that exposure patterns differ notably between bundles and items, revealing the need for fairness interventions that go beyond bundle-level assumptions. We also find that fairness assessments vary considerably depending on the metric used, reinforcing the need for multi-faceted evaluation. Furthermore, user behavior plays a critical role: when users interact more frequently with bundles than with individual items, BR systems tend to yield fairer exposure distributions across both levels. Overall, our findings offer actionable insights for building fairer bundle recommender systems and establish a vital foundation for future research in this emerging domain.
New to LLMs? Start Here
Towards Data Science, May 23, 2025 at 7:51 pm
A guide to Agents, LLMs, RAG, Fine-tuning, LangChain with practical examples to start building
The post New to LLMs? Start Here appeared first on Towards Data Science.
Estimating Product-Level Price Elasticities Using Hierarchical Bayesian
Towards Data Science, May 23, 2025 at 11:58 pm
Using one model to personalize ML results
The post Estimating Product-Level Price Elasticities Using Hierarchical Bayesian appeared first on Towards Data Science.
How to Evaluate LLMs and Algorithms: The Right Way
Towards Data Science, May 23, 2025 at 2:02 pm
All the hard work it takes to integrate large language models and powerful algorithms into your workflows can go to waste if the outputs you see don’t live up to expectations.
The post How to Evaluate LLMs and Algorithms — The Right Way appeared first on Towards Data Science.
Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype
Towards Data Science, May 23, 2025 at 6:43 pm
Improve static analysis and run-time validation with full generic specification
The post Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype appeared first on Towards Data Science.
Prototyping Gradient Descent in Machine Learning
Towards Data Science, May 24, 2025 at 1:12 am
Mathematical theorem and credit transaction prediction using Stochastic / Batch GD
The post Prototyping Gradient Descent in Machine Learning appeared first on Towards Data Science.