
Daily AI News

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents (cs.AI updates on arXiv.org)

arXiv:2510.07172v3 Announce Type: replace
Abstract: Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts (systematic alterations of canonical laws) to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.

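The abstract's central device is the counterfactual law shift: a canonical law is systematically altered, and the agent must recover the hidden variant by probing a simulated system rather than fitting a known function. A minimal sketch of the idea (the exponent shift below is a hypothetical illustration, not one of the paper's 324 tasks):

```python
import math

# Canonical Newtonian gravitation: F = G * m1 * m2 / r**2
G = 6.674e-11

def canonical_gravity(m1, m2, r):
    return G * m1 * m2 / r**2

# A "counterfactual law shift": same structure, altered exponent.
# (Hypothetical shift chosen for illustration.)
def shifted_gravity(m1, m2, r, exponent=2.5):
    return G * m1 * m2 / r**exponent

# The agent only sees a black-box oracle and must design its own probes.
def oracle(r):
    return shifted_gravity(1.0, 1.0, r)

# Two probes at different radii expose the exponent via a log-ratio:
# F(r1)/F(r2) = (r2/r1)**e, so e = log(F1/F2) / log(r2/r1).
f1, f2 = oracle(2.0), oracle(4.0)
recovered_exponent = math.log(f1 / f2) / math.log(4.0 / 2.0)
```

Two well-chosen probes suffice in this toy because the functional form is fixed; the benchmark's interactive systems are far less forgiving, which is exactly what the abstract argues static function fitting fails to measure.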


No One Size Fits All: QueryBandits for Hallucination Mitigation (cs.AI updates on arXiv.org)

arXiv:2602.20332v1 Announce Type: cross
Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.

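The paper's best-performing QueryBandit uses Thompson Sampling to pick a query-rewrite strategy online, with a contextual model over semantic features. A stripped-down, non-contextual Beta-Bernoulli sketch of that selection loop (arm names and reward probabilities below are illustrative stand-ins, not the paper's calibrated reward function):

```python
import random

random.seed(0)

# Hypothetical rewrite strategies (illustrative names).
ARMS = ["no_rewrite", "paraphrase", "expand"]

# Beta posterior parameters per arm: start from a uniform Beta(1, 1) prior.
alpha = {a: 1.0 for a in ARMS}
beta = {a: 1.0 for a in ARMS}

# Simulated per-arm "no hallucination" probabilities (stand-in environment).
TRUE_P = {"no_rewrite": 0.3, "paraphrase": 0.6, "expand": 0.4}

pulls = {a: 0 for a in ARMS}
for _ in range(3000):
    # Thompson Sampling: draw a plausible success rate per arm from its
    # posterior, pull the argmax, then update that arm's posterior.
    sampled = {a: random.betavariate(alpha[a], beta[a]) for a in ARMS}
    arm = max(sampled, key=sampled.get)
    reward = 1 if random.random() < TRUE_P[arm] else 0
    pulls[arm] += 1
    alpha[arm] += reward
    beta[arm] += 1 - reward
```

The loop concentrates pulls on the empirically best arm while still exploring, all through forward passes over the query, which is why the approach transfers to closed-source models with no retraining.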


Multimodal Multi-Agent Empowered Legal Judgment Prediction (cs.AI updates on arXiv.org)

arXiv:2601.12815v5 Announce Type: cross
Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task in advancing the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations, but they struggle with multiple allegations and diverse evidence and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset of over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.



Autonomous AI and Ownership Rules (cs.AI updates on arXiv.org)

arXiv:2602.20169v1 Announce Type: cross
Abstract: This Article examines the circumstances in which AI-generated outputs remain linked to their creators and the points at which they lose that connection, whether through accident, deliberate design, or emergent behavior. In cases where AI is traceable to an originator, accession doctrine provides an efficient means of assigning ownership, preserving investment incentives while maintaining accountability. When AI becomes untraceable — whether through carelessness, deliberate obfuscation, or emergent behavior — first possession rules can encourage reallocation to new custodians who are incentivized to integrate AI into productive use. The analysis further explores strategic ownership dissolution, where autonomous AI is intentionally designed to evade attribution, creating opportunities for tax arbitrage and regulatory avoidance. To counteract these inefficiencies, bounty systems, private incentives, and government subsidies are proposed as mechanisms to encourage AI capture and prevent ownerless AI from distorting markets.


Optimizing Token Generation in PyTorch Decoder Models (Towards Data Science)


Hiding host-device synchronization via CUDA stream interleaving.

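The tagline names the technique: overlap host-side work with device-side compute so synchronization points are hidden. The post's actual method uses CUDA streams in PyTorch; as a library-free stand-in, the same overlap pattern can be sketched with a worker thread, launching generation of the next token before host-side processing of the previous one:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for device-side generation and host-side post-processing.
# (Illustrative only: the real technique interleaves CUDA streams.)
def generate_token(step):
    return f"tok{step}"      # "device" work, e.g. a decoder forward pass

def process_token(tok):
    return tok.upper()       # "host" work, e.g. detokenize / stream out

def decode(n_steps):
    processed = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(generate_token, 0)
        for step in range(1, n_steps + 1):
            tok = future.result()            # sync point for the prior step
            if step < n_steps:
                # Launch the next generation *before* host-side processing,
                # so the next sync point is hidden behind the host work.
                future = pool.submit(generate_token, step)
            processed.append(process_token(tok))
    return processed
```

The ordering guarantee comes from `future.result()`, not from timing, so the output is deterministic even though the two kinds of work run concurrently.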

Anthropic: Claude faces ‘industrial-scale’ AI model distillation (AI News)


Anthropic has detailed three “industrial-scale” AI model distillation campaigns by overseas labs designed to extract abilities from Claude. These competitors generated over 16 million exchanges using approximately 24,000 deceptive accounts. Their goal was to acquire proprietary logic to improve their competing platforms. The extraction technique, known as distillation, involves training a weaker system on the …

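The snippet cuts off mid-sentence, but the technique it names is standard: train a smaller student model on a stronger model's output distributions. A toy numeric sketch of that soft-target objective (generic knowledge distillation, not Anthropic's or the attackers' actual setup):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_probs):
    # Cross-entropy of the teacher's soft targets against the student.
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(teacher_probs, probs))

# Toy example: the teacher prefers token 0; the student starts uniform.
teacher = [0.7, 0.2, 0.1]
student = [0.0, 0.0, 0.0]

# Gradient of cross-entropy w.r.t. logits is softmax(student) - teacher.
lr = 0.5
for _ in range(20):
    probs = softmax(student)
    grad = [p - t for p, t in zip(probs, teacher)]
    student = [s - lr * g for s, g in zip(student, grad)]

before = kd_loss([0.0, 0.0, 0.0], teacher)
after = kd_loss(student, teacher)
```

The student's distribution drifts toward the teacher's without any access to the teacher's weights, which is why large volumes of query-response exchanges are enough to mount such a campaign.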

Generate structured output from LLMs with Dottxt Outlines in AWS (Artificial Intelligence)


This post explores Dottxt’s Outlines framework as a practical approach to generating structured outputs, using AWS Marketplace in Amazon SageMaker.

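Outlines enforces structure by compiling the target format (a regex or JSON schema) to a finite-state machine and masking, at each decoding step, every token that would leave the machine. A library-free toy of that masking loop (the vocabulary, scores, and hand-built FSM are illustrative; the real library derives the FSM automatically and uses actual model logits):

```python
# Toy vocabulary with mock model scores (higher = the model's preference).
VOCAB = {"{": 0.1, '"age"': 0.3, ":": 0.2, "4": 0.9, "2": 0.8, "}": 0.95}

# Hand-built finite-state machine for the shape {"age":<digits>}.
# state -> {allowed token -> next state}; "done" is the accepting state.
FSM = {
    "start": {"{": "key"},
    "key":   {'"age"': "colon"},
    "colon": {":": "digit"},
    "digit": {"4": "more", "2": "more"},
    "more":  {"4": "more", "2": "more", "}": "done"},
}

def constrained_decode(max_steps=10):
    state, out = "start", ""
    for _ in range(max_steps):
        if state == "done":
            break
        # The mask: only tokens with an FSM transition are candidates;
        # among those, greedily pick the one the "model" scores highest.
        candidates = FSM[state]
        token = max(candidates, key=VOCAB.get)
        out += token
        state = candidates[token]
    return out

result = constrained_decode()
```

Because invalid tokens are masked before sampling, the output is valid by construction rather than by post-hoc parsing and retrying.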

5 Python Data Validation Libraries You Should Be Using (KDnuggets)


These five libraries approach validation from very different angles, which is exactly why they matter. Each one solves a specific class of problems that appear again and again in modern data and machine learning workflows.

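The snippet doesn't name the five libraries, but the pattern they formalize is declarative, schema-driven checking at the boundary where data enters a pipeline. A stdlib-only sketch of that pattern with a dataclass (a hand-rolled stand-in for the checks a validation library would derive from the schema):

```python
from dataclasses import dataclass

@dataclass
class UserRecord:
    name: str
    age: int
    email: str

    def __post_init__(self):
        # The kinds of checks a validation library generates automatically.
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        if not isinstance(self.age, int) or not (0 <= self.age <= 150):
            raise ValueError("age must be an int in [0, 150]")
        if "@" not in self.email:
            raise ValueError("email must contain '@'")

# Valid data constructs normally; invalid data fails loudly at the boundary.
ok = UserRecord(name="Ada", age=36, email="ada@example.com")
```

Dedicated libraries add what this sketch lacks, such as coercion, nested schemas, and structured error reports, which is the specific class of problems each of the five addresses.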

Optimizing Deep Learning Models with SAM (Towards Data Science)


A deep dive into the Sharpness-Aware Minimization (SAM) algorithm and how it improves the generalizability of modern deep learning models.

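SAM's update has two steps: first ascend to the worst-case weights within an L2 ball of radius rho (the perturbation is rho * g / ||g||), then apply the gradient computed at that perturbed point to the original weights. A dependency-free sketch on a toy quadratic loss (real SAM wraps a base optimizer and needs two forward-backward passes per step):

```python
import math

TARGET = [3.0, -1.0]

def grad(w):
    # Toy loss L(w) = sum((w_i - target_i)^2); gradient is 2 * (w - target).
    return [2.0 * (wi - ti) for wi, ti in zip(w, TARGET)]

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # 1) Ascend: perturb weights toward the worst case in an L2 ball.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 2) Descend: gradient at the perturbed point, applied to the originals.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [0.0, 0.0]
for _ in range(50):
    w = sam_step(w)
```

On this convex toy the ascent step only nudges the descent direction; the generalization benefit the post discusses appears in non-convex deep networks, where the perturbed gradient biases training toward flat minima.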