From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation (cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am) arXiv:2508.09497v1 Announce Type: cross
Abstract: Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.
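The abstract describes the mechanism without code, but the core idea — scoring all candidates jointly and letting a fine-tuned model decide how many passages to keep, rather than cutting at a fixed Top-K — can be sketched roughly as follows. This is a minimal illustration: the prompt wording, ID-based output format, and stand-in selector are assumptions, not the authors' DPS implementation.

```python
from typing import Callable, List

def select_passages(
    query: str,
    passages: List[str],
    selector: Callable[[str], str],  # any fine-tuned LLM exposed as prompt -> text
) -> List[str]:
    """Dynamic passage selection sketch: the selector sees the query and ALL
    candidates at once (so inter-passage dependencies are visible) and returns
    only the IDs it judges necessary -- the size of the set is not fixed."""
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Query: {query}\n\nCandidate passages:\n{listing}\n\n"
        "Return the IDs of the minimal set of passages needed to answer the "
        "query, as a comma-separated list (e.g. 0,3)."
    )
    reply = selector(prompt)
    ids = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    return [passages[i] for i in sorted(ids) if i < len(passages)]

# Usage with a trivial stand-in selector; a real pipeline would plug the
# fine-tuned model in here and pass the selected passages to the generator,
# leaving the rest of the RAG pipeline unchanged.
demo = lambda _prompt: "0, 2"
print(select_passages("Who advised the founder of X?",
                      ["passage A", "passage B", "passage C"], demo))
```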
Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis (cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am) arXiv:2508.09458v1 Announce Type: cross
Abstract: Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI ‘hallucinations’ (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and lower for questions requiring subjective interpretation or absent in text (e.g., Kirkpatrick’s outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI accuracy depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights.
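For the categorical extraction questions, consistency of this kind is typically quantified with an interrater statistic such as Cohen's kappa. A minimal sketch of an AI-versus-human comparison, using made-up labels rather than the study's data, might look like this:

```python
# Categorical agreement between an AI extractor and a human reviewer.
# The labels below are illustrative placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

ai_answers    = ["RCT", "cohort", "RCT", "qualitative", "cohort", "RCT"]
human_answers = ["RCT", "cohort", "RCT", "cohort",      "cohort", "RCT"]

raw_agreement = sum(a == h for a, h in zip(ai_answers, human_answers)) / len(ai_answers)
kappa = cohen_kappa_score(ai_answers, human_answers)  # chance-corrected agreement
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```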
Your Coding Intent is Secretly in the Context and You Should Deliberately Infer It Before Completion (cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am) arXiv:2508.09537v1 Announce Type: cross
Abstract: Large Language Models (LLMs) are increasingly used for function completion in repository-scale codebases. Prior studies demonstrate that when explicit instructions–such as docstrings–are provided, these models can generate highly accurate implementations. However, in real-world repositories, such annotations are frequently absent, and performance drops substantially without them. To address this gap, we frame the task as a three-stage process. The first stage focuses on intent inference, where the model analyzes the code preceding the target function to uncover cues about the desired functionality. Such preceding context often encodes subtle but critical information, and we design a reasoning-based prompting framework to guide the LLM through step-by-step extraction and synthesis of these signals before any code is generated. The second stage introduces an optional interactive refinement mechanism to handle cases where preceding context alone is insufficient for intent recovery. In this stage, the model proposes a small set of candidate intentions, enabling the developer to select or edit them so that the inferred intent closely matches the actual requirement. Finally, in the third stage, the LLM generates the target function conditioned on the finalized intent. To support this pipeline, we curate a dataset of 40,000 examples annotated with intermediate reasoning traces and corresponding docstrings. Extensive experiments on DevEval and ComplexCodeEval show that our approach consistently boosts multiple LLMs, achieving over 20% relative gains in both reference-based and execution-based metrics, with the interactive refinement stage delivering additional improvements beyond these gains.
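The three stages map naturally onto a small prompting pipeline. The sketch below assumes a generic `llm(prompt) -> str` callable and hypothetical prompt wording; it is not the paper's released code or dataset.

```python
from typing import Callable, List, Optional

def complete_function(
    context: str,
    signature: str,
    llm: Callable[[str], str],
    pick: Optional[Callable[[List[str]], str]] = None,  # developer-in-the-loop hook
) -> str:
    """Sketch of the three stages: (1) infer intent from the preceding repository
    context, (2) optionally refine it by letting a developer pick or edit among
    candidate intentions, (3) generate the function from the finalized intent."""
    # Stage 1: step-by-step intent inference from the preceding code.
    intent = llm(
        "Read the code preceding this function and reason step by step about what "
        f"it should do.\n\nContext:\n{context}\n\nSignature: {signature}\n"
        "State the intended functionality as one short docstring."
    )
    # Stage 2 (optional): interactive refinement via a small set of candidates.
    if pick is not None:
        candidates = [intent] + [
            llm(f"Give an alternative plausible intent for {signature} given:\n{context}")
            for _ in range(2)
        ]
        intent = pick(candidates)  # developer selects or edits a candidate
    # Stage 3: generate the implementation conditioned on the finalized intent.
    return llm(
        f"Implement {signature} so that it satisfies:\n{intent}\n"
        f"Use the surrounding context:\n{context}"
    )
```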
DeepSeek reverts to Nvidia for R2 model after Huawei AI chip fails (AI News, August 14, 2025 at 4:04 pm) DeepSeek’s plan to train its new AI model, R2, on Huawei’s Ascend chips has failed and forced a retreat to Nvidia while delaying launch. For months, the narrative pushed by Beijing has been one of unstoppable technological progress and a march towards self-sufficiency. However, reality has a habit of biting back. The recent troubles of
Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment (cs.AI updates on arXiv.org, August 13, 2025 at 4:00 am) arXiv:2508.08629v1 Announce Type: cross
Abstract: Due to perceptions of efficiency and significant productivity gains, various organisations, including in education, are adopting Large Language Models (LLMs) into their workflows. Educator-facing, learner-facing, and institution-facing LLMs, collectively, Educational Large Language Models (eLLMs), complement and enhance the effectiveness of teaching, learning, and academic operations. However, their integration into an educational setting raises significant cybersecurity concerns. A comprehensive landscape of contemporary attacks on LLMs and their impact on the educational environment is missing. This study presents a generalised taxonomy of fifty attacks on LLMs, which are categorized as attacks targeting either models or their infrastructure. The severity of these attacks is evaluated in the educational sector using the DREAD risk assessment framework. Our risk assessment indicates that token smuggling, adversarial prompts, direct injection, and multi-step jailbreak are critical attacks on eLLMs. The proposed taxonomy, its application in the educational environment, and our risk assessment will help academic and industrial practitioners to build resilient solutions that protect learners and institutions.
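DREAD itself is a simple rubric: each attack is rated (commonly 1-10) on Damage, Reproducibility, Exploitability, Affected users, and Discoverability, and the average gives its risk score. The attack ratings below are placeholders chosen for illustration, not the paper's assessments.

```python
# DREAD risk = mean of Damage, Reproducibility, Exploitability, Affected users,
# and Discoverability. Ratings here are illustrative, not the paper's values.
from statistics import mean

attacks = {
    #                        D  R  E  A  D
    "token smuggling":      (8, 8, 7, 9, 7),
    "direct injection":     (7, 9, 8, 8, 9),
    "multi-step jailbreak": (8, 7, 6, 8, 6),
}

def dread_score(ratings):
    return mean(ratings)

for name, ratings in sorted(attacks.items(), key=lambda kv: -dread_score(kv[1])):
    print(f"{name:22s} DREAD = {dread_score(ratings):.1f}")
```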
Coconut: A Framework for Latent Reasoning in LLMs (Towards Data Science, August 12, 2025 at 5:54 pm) Explaining Coconut (Training Large Language Models to Reason in a Continuous Latent Space) in simple terms
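In brief, Coconut replaces intermediate chain-of-thought tokens with "continuous thoughts": rather than decoding a token at each reasoning step, the model's last hidden state is fed back as the next input embedding. The snippet below is a rough sketch of that feedback loop with a Hugging Face causal LM; the model choice and step count are arbitrary, and the original method's special markers and staged training curriculum are omitted.

```python
# Coconut-style latent reasoning sketch: feed the last hidden state back as the
# next input embedding for a few "latent" steps, then decode normally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model; hidden size must equal the embedding size
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "Question: If Tom has 3 apples and buys 2 more, how many apples does he have?"
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)               # (1, seq_len, hidden)

with torch.no_grad():
    for _ in range(4):                                    # 4 latent reasoning steps
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]    # state at the final position
        embeds = torch.cat([embeds, last_hidden], dim=1)  # feed it back as an input
    # After the latent steps, decode the answer in ordinary token space.
    next_id = model(inputs_embeds=embeds).logits[:, -1, :].argmax(-1)
    print(tok.decode(next_id))
```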
Anthropic details its AI safety strategy (AI News, August 13, 2025 at 9:55 am) Anthropic has detailed its safety strategy to try to keep its popular AI model, Claude, helpful while avoiding perpetuating harms. Central to this effort is Anthropic’s Safeguards team, who aren’t your average tech support group: they’re a mix of policy experts, data scientists, engineers, and threat analysts who know how bad actors think. However, Anthropic’s
Do Biased Models Have Biased Thoughts? (cs.AI updates on arXiv.org, August 13, 2025 at 4:00 am) arXiv:2508.06671v2 Announce Type: replace-cross
Abstract: The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that examines the steps a model follows before it responds, on fairness. More specifically, we ask the following question: do biased models have biased thoughts? To answer it, we conduct experiments on 5 popular large language models, using fairness metrics to quantify 11 different biases in each model’s thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation, with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.
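The headline statistic is a plain correlation between a bias score computed on the reasoning traces and one computed on the final answers. A minimal sketch with synthetic scores (not the paper's measurements):

```python
# Correlate bias measured in chain-of-thought traces with bias in final outputs.
# The scores are synthetic stand-ins, not the paper's data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
thought_bias = rng.uniform(0, 1, size=50)                    # per-prompt bias of the reasoning steps
output_bias = 0.4 * thought_bias + rng.uniform(0, 0.6, 50)   # weakly related output bias

r, p = pearsonr(thought_bias, output_bias)
print(f"Pearson r = {r:.2f}, p-value = {p:.3g}")
# An r well below 1 (the paper reports < 0.6) means biased outputs are not
# reliably preceded by biased reasoning steps.
```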
Why Trump’s “golden dome” missile defense idea is another ripped straight from the movies (MIT Technology Review, August 13, 2025 at 10:00 am) In 1940, a fresh-faced Ronald Reagan starred as US Secret Service agent Brass Bancroft in Murder in the Air, an action film centered on a fictional “superweapon” that could stop enemy aircraft midflight. A mock newspaper in the movie hails it as the “greatest peace argument ever invented.” The experimental weapon is “the exclusive property…
How We Reduced LLM Costs by 90% with 5 Lines of Code (Towards Data Science, August 21, 2025 at 7:08 pm) When clean code hides inefficiencies: what we learned from fixing a few lines of code and saving 90% in LLM cost.