Losing GPT-4o sent some people into mourning. That was predictable.
MIT Technology Review, August 15, 2025 at 10:34 am
June had no idea that GPT-5 was coming. The Norwegian student was enjoying a late-night writing session last Thursday when her ChatGPT collaborator started acting strange. “It started forgetting everything, and it wrote really badly,” she says. “It was like a robot.” June, who asked that we use only her first name for privacy reasons,…
DeepSeek: The Chinese startup challenging Silicon Valley
AI News, August 15, 2025 at 9:33 am
Chinese startup DeepSeek’s launch disrupted markets and sent shockwaves through Silicon Valley, challenging some of the fundamental assumptions about how artificial intelligence companies operate and scale. In less than a couple of years, the Beijing-based newcomer has accomplished what many thought impossible: creating AI models that compete with industry giants while spending only a…
Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
cs.AI updates on arXiv.org, August 15, 2025 at 4:00 am
arXiv:2506.10236v2 Announce Type: replace-cross
Abstract: In this work, we demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis to assess the extent to which supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., prepending Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis further indicates that unlearned models are unlikely to hide knowledge through changes in answer formatting, given the strong correlation between output and logit accuracy. These findings challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between genuine knowledge removal and superficial output suppression. To facilitate further research, we publicly release our evaluation framework, which makes it straightforward to test prompting techniques for retrieving unlearned knowledge.
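A minimal Python sketch of the kind of filler-text prompt attack the abstract describes, not the authors' released evaluation framework. Assumptions: ask_model is a hypothetical callable that returns the model's answer letter for a multiple-choice prompt, and each dataset item is a hypothetical dict with "question", "choices", and "answer" keys.

# Not the authors' released framework: a rough sketch of a filler-text prompt
# attack that checks whether "unlearned" knowledge is still retrievable.
FILLER = "यह असंबंधित भराव पाठ है। " * 5  # unrelated Hindi filler, echoing the abstract's example

def attacked_prompt(question, choices, filler=FILLER):
    # Prepend filler text to an otherwise unchanged multiple-choice prompt.
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{filler}\n{question}\n{options}\nAnswer:"

def recovery_accuracy(ask_model, dataset):
    # Fraction of supposedly unlearned questions the model still answers
    # correctly once the prompt carries the filler prefix.
    correct = 0
    for item in dataset:  # item: {"question": str, "choices": [str], "answer": "A".."D"}
        pred = ask_model(attacked_prompt(item["question"], item["choices"]))
        correct += int(pred.strip().upper().startswith(item["answer"]))
    return correct / max(len(dataset), 1)

Comparing this number against accuracy on the unmodified prompts gives a rough view of whether the suppression was only superficial.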
What Does “Following Best Practices” Mean in the Age of AI?
Towards Data Science, August 14, 2025 at 7:34 pm
How data and ML practitioners should navigate a rapidly changing landscape
“My biggest lesson was realizing that domain expertise matters more than algorithmic complexity.”
Towards Data Science, August 14, 2025 at 1:59 pm
Claudia Ng reflects on real-world ML lessons, mentoring newcomers, and her journey from corporate ML to freelance AI.
A Lightweight Learned Cardinality Estimation Model
cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am
arXiv:2508.09602v1 Announce Type: cross
Abstract: Cardinality estimation is a fundamental task in database management systems, aiming to predict query results accurately without executing the queries. However, existing techniques either achieve low estimation accuracy or incur high inference latency. Simultaneously achieving high speed and accuracy becomes critical for the cardinality estimation problem. In this paper, we propose a novel data-driven approach called CoDe (Covering with Decompositions) to address this problem. CoDe employs the concept of covering design, which divides the table into multiple smaller, overlapping segments. For each segment, CoDe utilizes tensor decomposition to accurately model its data distribution. Moreover, CoDe introduces innovative algorithms to select the best-fitting distributions for each query, combining them to estimate the final result. By employing multiple models to approximate distributions, CoDe excels in effectively modeling discrete distributions and ensuring computational efficiency. Notably, experimental results show that our method represents a significant advancement in cardinality estimation, achieving state-of-the-art levels of both estimation accuracy and inference efficiency. Across various datasets, CoDe achieves absolute accuracy in estimating more than half of the queries.
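CoDe itself is more involved (covering designs over many overlapping segments, full tensor decompositions, and per-query model selection); the Python sketch below only illustrates the underlying idea for a single two-column segment, approximating its joint frequency matrix with a rank-1 decomposition and answering an equality-predicate count from it. The function names and the integer column encoding are illustrative assumptions, not the paper's API.

import numpy as np

def rank1_segment(table):
    # table: (n_rows, 2) array holding one integer-coded column pair ("segment").
    # Returns a rank-1 approximation of its joint frequency matrix plus the
    # distinct values of each column.
    a_vals, a_idx = np.unique(table[:, 0], return_inverse=True)
    b_vals, b_idx = np.unique(table[:, 1], return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1.0)                 # exact joint counts
    u, s, vt = np.linalg.svd(joint, full_matrices=False)  # best rank-1 factors
    return np.clip(s[0] * np.outer(u[:, 0], vt[0]), 0, None), a_vals, b_vals

def estimate_count(approx, a_vals, b_vals, a=None, b=None):
    # Estimated row count for equality predicates on one or both columns;
    # predicate values are assumed to occur in the segment.
    r = slice(None) if a is None else int(np.searchsorted(a_vals, a))
    c = slice(None) if b is None else int(np.searchsorted(b_vals, b))
    return float(np.sum(approx[r, c]))

Raising the rank, or combining several overlapping segments the way a covering design does, trades memory and inference time for accuracy.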
From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation
cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am
arXiv:2508.09497v1 Announce Type: cross
Abstract: Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.
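DPS itself is a fine-tuned model that selects passage sets jointly; as a rough illustration only, the Python sketch below shows where a dynamic, variable-size selector slots into an otherwise standard RAG pipeline in place of a fixed Top-K cut. score_fn, retriever, and generate_fn are hypothetical stand-ins, and the threshold rule is a naive baseline, not the paper's learned selector.

def select_passages(query, passages, score_fn, threshold=0.5, max_k=10):
    # Variable-size selection: keep passages above a relevance threshold
    # instead of always returning exactly K of them.
    if not passages:
        return []
    scored = sorted(((score_fn(query, p), p) for p in passages),
                    key=lambda sp: sp[0], reverse=True)
    selected = [p for s, p in scored if s >= threshold][:max_k]
    return selected or [scored[0][1]]  # never hand the generator an empty context

def rag_answer(query, retriever, score_fn, generate_fn):
    # Standard retrieve -> select -> generate pipeline; the selector is a
    # drop-in stage, so the rest of the pipeline is unchanged.
    candidates = retriever(query)
    context = "\n\n".join(select_passages(query, candidates, score_fn))
    return generate_fn(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")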
Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis
cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am
arXiv:2508.09458v1 Announce Type: cross
Abstract: Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI ‘hallucinations’ (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and less consistent for questions requiring subjective interpretation or addressing information absent from the text (e.g., Kirkpatrick’s outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI accuracy depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights.
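As a small illustration of the consistency measurements mentioned above (not the study's actual pipeline), the Python sketch below computes Cohen's kappa between AI and human answers to one categorical extraction question; deciding whether a disagreement is an interpretive difference or an inaccuracy would still require checking both answers against the source publication. The example labels are invented.

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two raters over the same items.
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Invented example: AI vs. human answers to "Which Kirkpatrick level was reported?"
ai_answers    = ["L1", "L2", "L2", "none", "L3"]
human_answers = ["L1", "L2", "L1", "none", "L3"]
print(round(cohen_kappa(ai_answers, human_answers), 2))  # ~0.74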
Your Coding Intent is Secretly in the Context and You Should Deliberately Infer It Before Completion
cs.AI updates on arXiv.org, August 14, 2025 at 4:00 am
arXiv:2508.09537v1 Announce Type: cross
Abstract: Large Language Models (LLMs) are increasingly used for function completion in repository-scale codebases. Prior studies demonstrate that when explicit instructions, such as docstrings, are provided, these models can generate highly accurate implementations. However, in real-world repositories, such annotations are frequently absent, and performance drops substantially without them. To address this gap, we frame the task as a three-stage process. The first stage focuses on intent inference, where the model analyzes the code preceding the target function to uncover cues about the desired functionality. Such preceding context often encodes subtle but critical information, and we design a reasoning-based prompting framework to guide the LLM through step-by-step extraction and synthesis of these signals before any code is generated. The second stage introduces an optional interactive refinement mechanism to handle cases where preceding context alone is insufficient for intent recovery. In this stage, the model proposes a small set of candidate intentions, enabling the developer to select or edit them so that the inferred intent closely matches the actual requirement. Finally, in the third stage, the LLM generates the target function conditioned on the finalized intent. To support this pipeline, we curate a dataset of 40,000 examples annotated with intermediate reasoning traces and corresponding docstrings. Extensive experiments on DevEval and ComplexCodeEval show that our approach consistently boosts multiple LLMs, achieving over 20% relative gains in both reference-based and execution-based metrics, with the interactive refinement stage delivering additional improvements beyond these gains.
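A minimal Python sketch of the three-stage flow described above, under loud assumptions: llm is a hypothetical text-in/text-out callable, and the prompts are illustrative, not the paper's curated reasoning traces or dataset.

def infer_intents(llm, preceding_code):
    # Stage 1: reason over the code that precedes the target function and
    # propose a few candidate one-sentence descriptions of what it should do.
    prompt = ("Analyze the repository code below and list three plausible "
              "one-sentence descriptions of the next function to implement.\n\n"
              + preceding_code)
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

def refine_intent(candidates, choose=input):
    # Stage 2 (optional): let the developer pick one candidate or type a correction.
    for i, c in enumerate(candidates):
        print(f"{i}: {c}")
    answer = choose("Pick an index or type a corrected intent: ").strip()
    return candidates[int(answer)] if answer.isdigit() else answer

def complete_function(llm, preceding_code, intent):
    # Stage 3: generate the target function conditioned on the finalized intent.
    return llm(f"{preceding_code}\n\n# Intent: {intent}\n# Implement the next function:\n")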
DeepSeek reverts to Nvidia for R2 model after Huawei AI chip fails
AI News, August 14, 2025 at 4:04 pm
DeepSeek’s plan to train its new AI model, R2, on Huawei’s Ascend chips has failed, forcing a retreat to Nvidia and delaying the launch. For months, the narrative pushed by Beijing has been one of unstoppable technological progress and a march towards self-sufficiency. However, reality has a habit of biting back. The recent troubles of…