ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation
cs.AI updates on arXiv.org
arXiv:2510.12047v3 Announce Type: replace
Abstract: Current code generation benchmarks measure functional correctness on well-formed inputs, as test cases are curated to satisfy input preconditions. This leaves a gap: generated programs may appear correct but fail to satisfy contracts, the assertion-level validity constraints for rejecting ill-formed inputs. We introduce ContractEval, a benchmark for evaluating contract-satisfying assertions in code generation, i.e., whether code rejects contract-violating inputs by triggering intended assertions. Built on HumanEval+ and MBPP+, ContractEval augments each task with contract-violation tests derived from reference assertions. We synthesize these via a neuro-symbolic pipeline: an LLM converts assertion clauses into constraints, and an SMT solver enumerates satisfiable violation combinations to generate inputs that violate selected clauses while satisfying the rest. Across five code LLMs, standard prompting yields 0% contract satisfaction; adding a few contract-violation examples raises it to 49–53% while retaining 92% of the original pass@1. Our code is available at https://github.com/suhanmen/ContractEval.
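The violation-enumeration idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the task `take_prefix`, its three clauses, and the helper names are hypothetical, the clause predicates are written by hand rather than produced by an LLM, and a brute-force search over a tiny input domain stands in for the SMT solver. It enumerates every non-empty combination of clauses, keeps only those for which an input exists that violates exactly the selected clauses while satisfying the rest, and checks that the function rejects each such input with an `AssertionError`.

```python
from itertools import combinations, product

# Hypothetical reference solution with three contract clauses (illustrative only).
def take_prefix(n, items):
    assert n >= 0, "n must be non-negative"
    assert len(items) > 0, "items must be non-empty"
    assert n <= len(items), "n must not exceed len(items)"
    return items[:n]

# Hand-written predicates, one per assertion clause (assumed, not LLM-derived).
CLAUSES = {
    "n_nonneg": lambda n, items: n >= 0,
    "nonempty": lambda n, items: len(items) > 0,
    "n_in_range": lambda n, items: n <= len(items),
}

# Tiny finite search space standing in for an SMT solver's model search.
N_VALUES = [-1, 0, 1, 2]
ITEM_VALUES = [[], [7]]

def find_violation(violated):
    """Return the first (n, items) failing exactly the clauses in `violated`."""
    for n, items in product(N_VALUES, ITEM_VALUES):
        if all(pred(n, items) != (name in violated)
               for name, pred in CLAUSES.items()):
            return n, items
    return None  # no witness in this domain; treated here as unsatisfiable

def violation_tests():
    """Map each satisfiable violation combination to one witness input."""
    tests = {}
    names = sorted(CLAUSES)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            witness = find_violation(set(combo))
            if witness is not None:
                tests[combo] = witness
    return tests

def rejects(func, args):
    """True if func raises AssertionError on the contract-violating input."""
    try:
        func(*args)
    except AssertionError:
        return True
    return False
```

In this toy domain, combinations that require both `n < 0` and `n > len(items)` have no witness and are dropped, which is the role the abstract assigns to satisfiability checking. Note that `rejects` only checks that some assertion fired, not that the intended clause was the one triggered.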
Why Apple chose Google over OpenAI: What enterprise AI buyers can learn from the Gemini deal
AI News
Apple’s multi-year agreement to integrate Google’s Gemini models into its revamped Siri marks more than just another Big Tech partnership. The deal, announced Monday, offers a rare window into how one of the world’s most selective technology companies evaluates foundation models, and the criteria should matter to any enterprise weighing similar decisions. The stakes were considerable. Apple had
The post Why Apple chose Google over OpenAI: What enterprise AI buyers can learn from the Gemini deal appeared first on AI News.
The latency trap: Smart warehouses abandon cloud for edge
AI News
While the enterprise world rushes to migrate everything to the cloud, the warehouse floor is moving in the opposite direction. This article explores why the future of automation relies on edge AI to solve the fatal “latency gap” in modern logistics. In the sterilised promotional videos for smart warehouses, autonomous mobile robots (AMRs) glide in
The post The latency trap: Smart warehouses abandon cloud for edge appeared first on AI News.
How to Maximize Claude Code Effectiveness
Towards Data Science
Learn how to get the most out of agentic coding
The post How to Maximize Claude Code Effectiveness appeared first on Towards Data Science.
Cybersecurity researchers have disclosed details of a new campaign dubbed SHADOW#REACTOR that employs an evasive multi-stage attack chain to deliver a commercially available remote administration tool called Remcos RAT and establish persistent, covert remote access. “The infection chain follows a tightly orchestrated execution path: an obfuscated VBS launcher executed via wscript.exe invokes a
January 5th TJS Weekly Security Intelligence Briefing
Week of January 5th, 2026
Classification: TLP: Public
Prepared: January 5, 2026
SECTION A: EXECUTIVE OVERVIEW
For Leadership and Management
A.1 Executive Summary
Risk Posture: ELEVATED
This week’s threat landscape is defined by three operationally significant developments:
Bottom Line: MongoDB patching is the highest-priority technical action. Phishing awareness requires immediate […]
University of Hawaii says a ransomware gang breached its Cancer Center in August 2025, stealing data of study participants, including documents from the 1990s containing Social Security numbers. […]
Microsoft has started retiring the Microsoft Lens PDF scanner app for Android and iOS devices on Friday, January 9th, with plans to remove it from app stores next month. […]
7 Must-Have Tools for Your Coding Workflow
KDnuggets
My go-to tech stack that helps me code faster, stay organized, and ship with confidence.
When Does Adding Fancy RAG Features Work?
Towards Data Science
Looking at the performance of different pipelines
The post When Does Adding Fancy RAG Features Work? appeared first on Towards Data Science.