Q: What is the difference between Nano Banana 2 and Veo 3.1?

Nano Banana 2 (Gemini 3.1 Flash Image, model gemini-3.1-flash-image-preview) is a text-to-image and image-editing model. Veo 3.1 is a text-to-video and image-to-video model that produces 4, 6, or 8 second clips with natively synchronized audio. They are complementary, not competing: use Nano Banana 2 for static graphics, storyboards, and infographics, and Veo 3.1 for cinematic video, animating a frame, or scenes with dialogue and sound.

Q: How much does Veo 3.1 cost per second of video?

Veo 3.1 Standard is $0.40 per second at 720p and 1080p, and $0.60 per second at 4K. Veo 3.1 Fast is $0.15 per second at 720p and 1080p, and $0.35 per second at 4K. A Veo 3.1 Lite preview tier is estimated at roughly $0.05 per second. You are only charged when a video generates successfully.

Nano Banana 2 vs Veo 3.1: Gemini Image vs Video Generation

Q: How much does Nano Banana 2 cost per image?

Text input is $0.25 per million tokens. Image output is billed as tokens, and Google's own pages show two different rates: a standard rate of $60 per million tokens (about $0.045 for a 0.5K image, $0.067 for 1K, $0.101 for 2K, and $0.151 for 4K) and a batch rate of $30 per million tokens (about $0.022 for 0.5K up to $0.076 for 4K). Verify the current rate at the Gemini API pricing page before forecasting cost.

Q: Do Nano Banana 2 images have a watermark?

Yes. Every image generated by Nano Banana 2 carries a mandatory SynthID watermark plus C2PA Content Credentials. This applies to all output, and you cannot opt out.

This is not a fight. Nano Banana 2 (Gemini 3.1 Flash Image) makes and edits still images. Veo 3.1 makes video clips with sound. They sit side by side in the same Gemini API, and choosing between them is a question of what you are producing, not which one is "better." This guide gives you the practitioner's version: what each one actually does, the real cost per image and per second of video, and a clear set of rules for which model to reach for on any given task.

Important Context

Nano Banana 2 (gemini-3.1-flash-image-preview) is in Preview as of this writing; it was released February 26, 2026. Veo 3.1 went to general availability November 17, 2025, and a Lite preview tier was added March 31, 2026. Preview models can change before they stabilize and carry more restrictive rate limits. Pricing was verified June 8, 2026 and changes often, so confirm current rates before you forecast cost.

The Short Answer

THE RULE

Pick By Output

Still image? Nano Banana 2. Moving clip? Veo 3.1.

Nano Banana 2 is for graphics, mockups, infographics, and storyboard frames. Veo 3.1 is for short cinematic video with synchronized audio, including animating a single image into motion. If you need both, they hand off cleanly: design a frame in Nano Banana 2, then feed it to Veo 3.1 as a first frame.

The rule: the output format decides the model. If the deliverable is a static asset (a thumbnail, an ad, a diagram, a product shot, a comic panel), Nano Banana 2 is the tool. If the deliverable moves and has sound (a product teaser, a social clip, an animated logo, a scene with dialogue), Veo 3.1 is the tool. There is no scenario where one substitutes for the other, because one produces a single frozen frame and the other produces a sequence of frames plus an audio track.
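If you route requests in code, the rule above reduces to a two-flag check. The sketch below is one minimal way to write it in Python. The function name pick_model and its two boolean parameters are illustrative assumptions, and the returned strings are simply the model IDs quoted in this guide, which may change.

```python
# Model IDs as quoted in this guide; confirm them against the live docs.
IMAGE_MODEL = "gemini-3.1-flash-image-preview"  # Nano Banana 2
VIDEO_MODEL = "veo-3.1-generate-001"            # Veo 3.1


def pick_model(moves: bool, needs_sound: bool) -> str:
    """Return a model ID for a deliverable using the output-format rule.

    Hypothetical helper: anything that moves or carries audio is a video
    job, and everything else is an image job.
    """
    if moves or needs_sound:
        return VIDEO_MODEL
    return IMAGE_MODEL


# A thumbnail is static and silent; a product teaser moves and has sound.
print(pick_model(moves=False, needs_sound=False))
print(pick_model(moves=True, needs_sound=True))
```

A storyboard frame that will later be animated still counts as a static deliverable at the moment you generate it, so it goes to the image model first.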

The detail that matters most for budgeting: images are cheap and priced per image, while video is expensive and priced per second. The cost math below makes the gap concrete.

These are two different tools in one Gemini API: Nano Banana 2 makes images, Veo 3.1 makes video. We do not choose one over the other; we use each for its job.
Images run roughly $0.02 to $0.15 each; an 8-second 1080p Veo clip runs about $1.20 to $3.20. Video is one to two orders of magnitude more expensive.
Nano Banana 2 supports conversational editing, semantic masking, and legible text rendering, which makes it strong for infographics and mockups.
Veo 3.1 generates clips with natively synchronized audio and can animate a still image into motion.
Every Nano Banana 2 image carries a mandatory SynthID watermark; budget for review of AI-generated assets before publishing.

At a Glance

Attribute	Nano Banana 2	Veo 3.1
Output type	Still images	Video + audio
Resolution	0.5K / 1K / 2K / 4K	720p / 1080p / 4K
Unit cost	~$0.02-$0.15 / image	$0.05-$0.60 / sec
Editing	Yes (conversational)	Extend / add-remove
Input guidance	Up to 14 ref images	First / last frame, up to 3 ref
Provenance	SynthID + C2PA	AI-generated media

Key figure	Metric	Note
14	Max Reference Images (Nano Banana 2)	Gemini API docs
8 sec	Max Veo 3.1 Clip Length	4 / 6 / 8 sec options
4K	Top Resolution (Both)	Image and video
$0.40	Veo 3.1 Standard / Sec (1080p)	Gemini API pricing
SynthID	Watermark on Every Image	Mandatory, plus C2PA

One workflow note worth internalizing early. Because Veo 3.1 accepts a first-frame image, the two models are designed to chain. You can compose an exact opening shot in Nano Banana 2 (precise text, correct branding, the right character) and then hand that frame to Veo 3.1 to animate. That handoff is the single most useful pattern in this whole comparison, and it is the reason "versus" is the wrong frame for these two tools.

What Each Model Is

Image

Nano Banana 2 is the product name for Gemini 3.1 Flash Image (model gemini-3.1-flash-image-preview), released February 26, 2026. It is a text-to-image and image-editing model that brings the Gemini 3 series' world knowledge to fast, high-volume visual generation. It does conversational multi-turn editing, semantic masking, legible text rendering in 10+ languages, and grounding with Google Image Search. There is a higher-fidelity sibling, Nano Banana Pro (gemini-3-pro-image-preview), for premium professional work.

Cost: Text input $0.25/1M tokens. Image output billed as tokens; Google's pages show both a $30/1M (batch) and $60/1M (standard) rate, working out to roughly $0.02 to $0.15 per image depending on resolution and tier.

ai.google.dev image docs

Video

Veo 3.1 is Google's cinematic video generation model (veo-3.1-generate-001 and the Fast variant). It entered public preview October 15, 2025 and reached general availability November 17, 2025. It produces 4, 6, or 8 second clips at 720p, 1080p, or 4K with natively synchronized audio. It supports text-to-video, image-to-video from a first frame, first-and-last-frame generation, extending existing Veo clips, and adding or removing objects. A cost-efficient Veo 3.1 Lite preview arrived March 31, 2026.

Cost: Standard $0.40/sec (720p/1080p), $0.60/sec (4K). Fast $0.15/sec (720p/1080p), $0.35/sec (4K). Lite is estimated near $0.05/sec. You are only billed for clips that generate successfully.

ai.google.dev pricing

The naming is the first thing that trips people up, so here it is plainly. "Nano Banana 2" is a nickname; the model you call in code is gemini-3.1-flash-image-preview. "Nano Banana Pro" is the higher tier, gemini-3-pro-image-preview. Veo 3.1 is a separate model family entirely, with its own veo-3.1 identifiers. Both image models are in Preview at the time of writing, as is the Veo 3.1 Lite tier, which means their rate limits are tighter than production and the specs can move; Veo 3.1 itself reached general availability on November 17, 2025.

Use Case 1 -- Static Assets

When You Need Images: Nano Banana 2

If your deliverable does not move, this is the model. The practitioner reasons to reach for it are specific, and they are mostly about iteration speed and text fidelity rather than raw realism.

Conversational editing means you refine an image by chatting, the way you would brief a designer. Semantic masking lets you describe a region in words ("change the jacket to red") and edit only that part while lighting and unselected objects stay put. Text rendering is the standout: it generates legible, stylized text in 10+ languages and can translate or localize text inside an image, which is why it is strong for infographics, menus, mockups, and marketing assets. You can mix up to 14 reference images to guide a result, and maintain character and object consistency across a set (the API limitation page states up to 4 characters and 10 objects).

USE

High-volume graphics where you iterate fast: ad variants, thumbnails, social tiles. Storyboard frames before committing to video. Infographics and data visualizations where the text has to be correct and readable. Product mockups that need real branding and copy. Resolutions run 0.5K, 1K, 2K, and 4K, so you can draft cheap at 0.5K and finalize at 4K. Grounding with Google Image Search lets it pull real-world visual context, with attribution links returned in the response.

Two Things to Plan For

First, every generated image carries a mandatory SynthID watermark plus C2PA Content Credentials. That is good for provenance but means anyone can detect the asset as AI-generated, so plan your disclosure accordingly. Second, the model will not always honor the exact number of output images you request (it caps at 10 output images per request), and source documents disagree on the input context window (cited variously as 65,536, 128K, and 131,072 tokens). Treat the lower 65,536 figure as the safe planning number and verify against the live spec.

Use Nano Banana 2 for anything static, especially when legible text, fast iteration, or character consistency across a set matters. It is the cheaper, faster half of the pair.

Use Case 2 -- Motion and Sound

When You Need Video: Veo 3.1

The moment your deliverable has to move or make sound, Nano Banana 2 is out of the running and Veo 3.1 is the answer. The defining feature is the audio: Veo 3.1 produces natively synchronized sound, not a silent clip you score later.

4 / 6 / 8s

Veo 3.1 generates clips of 4, 6, or 8 seconds at 720p, 1080p, or 4K, each with synchronized audio. Nano Banana 2 produces no video and no audio at all.

Veo 3.1 works in two basic modes: text-to-video for generating a scene from a written prompt, and image-to-video for animating a still. The image-to-video modes are the practitioner's lever: provide a first frame to control exactly how the clip opens, or provide both first and last frames to constrain start and end. You can reference up to three images to steer the look, extend a previously generated Veo clip to build longer sequences, and add or remove objects within a scene.

USE

Short cinematic shots for social and ads, animated logos and intros, and any scene that needs integrated dialogue or sound effects. The first-frame workflow is where it pairs with Nano Banana 2: design the perfect opening still, then animate it. The billing model is forgiving in one specific way, which is that you are only charged when a clip generates successfully, so a failed generation (for example an audio processing error) does not cost you.

The Cost Reality

Video is expensive per unit of output, and the per-second model punishes long clips. An 8-second 4K Standard clip costs about $4.80 before you account for retries and iteration. Pick the lowest tier and resolution that meets the brief: Fast at 1080p is $0.15/sec, and the Lite preview is estimated near $0.05/sec. Generate at 720p or 1080p for drafts and reserve 4K for the final.

Use Veo 3.1 for anything that moves or needs sound, and specifically when you want to animate a still image you already designed. It is the expensive, high-impact half of the pair.

Real Cost Math: Per Image vs Per Second

Here is the part most comparisons skip. The two models are priced on completely different units, so the only honest way to compare them is to translate both into "cost of a finished deliverable." Start with the discrepancy you need to know about.

Image Pricing Has Two Published Rates

Google's own pages show conflicting numbers for Nano Banana 2 image output. The standard tier lists image output at $60 per 1M tokens, equal to about $0.045 for a 0.5K image, $0.067 for 1K, $0.101 for 2K, and $0.151 for 4K. The batch tier lists $30 per 1M tokens, equal to about $0.022 for 0.5K up to $0.076 for 4K. A separate developer-guide table cites $0.067 per image. We are reporting all of these rather than picking one, because the source material genuinely conflicts. Confirm the rate that applies to your tier at the Gemini API pricing page before you forecast a real budget.

What 1,000 Images Costs

Resolution	Standard ($60/1M)	Batch ($30/1M)	Cost per 1,000 images (batch - standard)
0.5K (512px)	~$0.045 / image	~$0.022 / image	$22 - $45
1K	~$0.067 / image	~$0.034 / image	$34 - $67
2K	~$0.101 / image	~$0.050 / image	$50 - $101
4K	~$0.151 / image	~$0.076 / image	$76 - $151

Note: image output is billed as tokens (0.5K = 747 tokens, 1K = 1,120, 2K = 1,680, 4K = 2,520). Text input is a separate $0.25/1M tokens. Figures rounded; verify at the live pricing page.
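The per-image figures in the table follow directly from the token counts and rates in the note. Here is a minimal Python sketch that reproduces them. The names OUTPUT_TOKENS, RATE_PER_MILLION, and image_cost are illustrative, and the numbers are the ones quoted in this guide rather than values read from any API.

```python
# Output tokens per generated image, by resolution, as quoted in the note above.
OUTPUT_TOKENS = {"0.5K": 747, "1K": 1120, "2K": 1680, "4K": 2520}

# USD per 1M image-output tokens for each tier, as quoted in this guide.
RATE_PER_MILLION = {"standard": 60.0, "batch": 30.0}


def image_cost(resolution: str, tier: str = "standard") -> float:
    """Return the output cost in USD of one image (text input not included)."""
    return OUTPUT_TOKENS[resolution] * RATE_PER_MILLION[tier] / 1_000_000


# Print one row per resolution, mirroring the table's columns.
for res in OUTPUT_TOKENS:
    std = image_cost(res, "standard")
    bat = image_cost(res, "batch")
    print(f"{res}: standard ${std:.3f}, batch ${bat:.3f}, "
          f"per 1,000 images ${bat * 1000:.0f} - ${std * 1000:.0f}")
```

If the live pricing page shows different token counts or rates, change the two dictionaries and the rest of the arithmetic carries over.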

What an 8-Second Clip Costs

Veo 3.1 Tier	720p / 1080p	4K	8-sec clip (1080p)
Standard	$0.40 / sec	$0.60 / sec	$3.20
Fast	$0.15 / sec	$0.35 / sec	$1.20
Lite (preview, est.)	~$0.05 / sec	not published	~$0.40

Note: 4K Standard at 8 seconds is about $4.80. You are only charged for successful generations. Lite is an independent estimate; Google does not publish an exact per-second Lite rate.
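The clip math is a lookup plus one multiplication. The Python sketch below encodes the rates from the table, with the Lite figure kept as the estimate it is. The names RATE_PER_SECOND, ALLOWED_SECONDS, and clip_cost are illustrative assumptions, not part of any SDK.

```python
# USD per second by tier and resolution, as quoted in the table above.
# The Lite rate is an independent estimate, and no 4K Lite rate is published.
RATE_PER_SECOND = {
    "standard": {"720p": 0.40, "1080p": 0.40, "4K": 0.60},
    "fast": {"720p": 0.15, "1080p": 0.15, "4K": 0.35},
    "lite": {"720p": 0.05, "1080p": 0.05},
}

# Clip lengths the guide lists for a single generation.
ALLOWED_SECONDS = (4, 6, 8)


def clip_cost(tier: str, resolution: str, seconds: int) -> float:
    """Return the cost in USD of one successfully generated clip."""
    if seconds not in ALLOWED_SECONDS:
        raise ValueError(f"clip length must be one of {ALLOWED_SECONDS}")
    rates = RATE_PER_SECOND[tier]
    if resolution not in rates:
        raise ValueError(f"no rate listed for {tier} at {resolution}")
    return seconds * rates[resolution]


# Reproduce the right-hand column of the table for 8-second 1080p clips.
for tier in RATE_PER_SECOND:
    print(f"{tier}: ${clip_cost(tier, '1080p', 8):.2f}")
```

Failed generations are not billed, so this models only the clips you actually receive; every successful retry or iteration counts as another full clip.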

The headline number: a single 8-second 1080p Standard video clip ($3.20) costs about as much as 47 1K images at the standard rate, or roughly 95 at the batch rate. If you are budgeting a campaign, model image volume in cents and video in dollars. The cheapest way to get a polished motion asset is often to perfect the key frame as an image first, then animate only the final approved frame.
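To check that ratio yourself, divide the clip price by the per-image price. This short Python sketch uses the token count and rates quoted in this guide; the helper name images_per_clip is an illustrative assumption.

```python
# Figures quoted in this guide: 1K image = 1,120 output tokens,
# $60/1M standard and $30/1M batch, Veo 3.1 Standard 1080p = $0.40/sec.
CLIP_USD = 8 * 0.40
STANDARD_IMAGE_USD = 1120 * 60 / 1_000_000
BATCH_IMAGE_USD = 1120 * 30 / 1_000_000


def images_per_clip(clip_usd: float, image_usd: float) -> int:
    """Return how many whole images cost no more than one clip."""
    return int(clip_usd // image_usd)


print(images_per_clip(CLIP_USD, STANDARD_IMAGE_USD))
print(images_per_clip(CLIP_USD, BATCH_IMAGE_USD))
```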

Pricing Tiers Side by Side

Tier	Nano Banana 2	Veo 3.1
Entry	~$0.022 / img (0.5K image, batch tier)	~$0.05 / sec (Lite preview, estimated)
Mainstream	~$0.067 / img (1K image, standard tier)	$0.15 / sec (Fast, 720p / 1080p)
High-End	~$0.151 / img (4K image, standard tier)	$0.40 / sec (Standard, 720p / 1080p)
Top	$0.25 / 1M text input tokens (separate line item)	$0.60 / sec (Standard, 4K)

Pricing verified: June 8, 2026. Verify at ai.google.dev/pricing before forecasting cost.

Two columns, two units. The left column is per image and the right is per second, which is the whole point; the one exception is the Top tier, where the left entry is the separate text-input token rate. Image pricing carries the standard-versus-batch discrepancy described above, so the image figures here use the standard tier except where labeled. Video pricing is cleaner but climbs fast with resolution and clip length. Note that the Veo Lite tier is a preview and its per-second figure is an independent estimate, not a published Google rate. For developers building production systems, the practical move is to default to the cheapest tier that clears your quality bar and only step up for the final render.

Quick formulas for your spreadsheet: Nano Banana 2 image cost ≈ (output tokens for the resolution) × (rate per token), where rates are $30/1M batch or $60/1M standard, plus $0.25/1M for text input. Veo 3.1 clip cost = (seconds) × (per-second rate for the tier and resolution), billed only on success. For a 10,000-image-per-month graphics pipeline at 1K standard, plan around $670/month; for a hundred 8-second 1080p Fast clips, plan around $120/month. Always reconcile against the live pricing page.
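If you would rather keep the formulas in a script than a spreadsheet, the following Python sketch implements both and reproduces the two monthly examples. The function names and the prompt_tokens parameter are illustrative assumptions; the rates are the ones quoted in this guide.

```python
TEXT_INPUT_RATE = 0.25  # USD per 1M text input tokens, as quoted above


def monthly_image_cost(images: int, output_tokens: int,
                       rate_per_million: float,
                       prompt_tokens: int = 0) -> float:
    """Image spend: output tokens at the tier rate, plus optional text input.

    prompt_tokens is a hypothetical average prompt length per image; leave
    it at 0 to model output cost only.
    """
    output_usd = images * output_tokens * rate_per_million / 1_000_000
    input_usd = images * prompt_tokens * TEXT_INPUT_RATE / 1_000_000
    return output_usd + input_usd


def monthly_video_cost(clips: int, seconds: int,
                       rate_per_second: float) -> float:
    """Video spend: successful clips times seconds times per-second rate."""
    return clips * seconds * rate_per_second


# 10,000 images per month at 1K (1,120 tokens), standard tier.
print(f"${monthly_image_cost(10_000, 1120, 60.0):.2f}")
# 100 eight-second 1080p Fast clips.
print(f"${monthly_video_cost(100, 8, 0.15):.2f}")
```

The image result lands a couple of dollars above the rounded figure in the text because the script uses the unrounded per-image cost.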

Specification Scorecard

Side by side, capability by capability. This is a feature map, not a ranking, because the two models do different jobs.

Spec	Nano Banana 2	Veo 3.1
Model ID	gemini-3.1-flash-image-preview	veo-3.1-generate-001
Output	Still images	Video clips with audio
Resolution	0.5K, 1K, 2K, 4K	720p, 1080p, 4K
Duration	N/A (static)	4, 6, or 8 seconds
Audio	None	Natively synchronized
Editing	Conversational, semantic masking	Extend clips, add/remove objects
Text rendering	Yes, 10+ languages, legible	Not a primary feature
Input guidance	Up to 14 reference images	First / last frame, up to 3 refs
Unit price	~$0.02 - $0.15 / image	$0.05 - $0.60 / second
Provenance	SynthID + C2PA (mandatory)	AI-generated media labeling
Status	Preview (Feb 26, 2026)	GA (Nov 17, 2025)

Use Which When

What is the final deliverable?

Pick what the audience actually sees

Nano Banana 2: High-volume graphics pipelines

Ad variants, thumbnails, and social tiles where you generate at scale and iterate fast. Cheap per image, and you can draft at 0.5K then finalize at 4K.

Nano Banana 2: Infographics and mockups with real text

When the copy inside the image has to be legible and correct, the text rendering and in-image translation are the deciding feature. This is the use case where image models usually fail and Nano Banana 2 is built to succeed.

Veo 3.1: Cinematic clips with integrated sound

Product teasers, social video, and any scene with dialogue or sound effects. Synchronized audio is the feature that makes this a finished asset rather than a silent draft.

Veo 3.1: Animating a single image into motion

Provide a first frame (ideally one you composed in Nano Banana 2) and let Veo animate it. First-and-last-frame mode gives you control over both ends of the shot.

Both: Storyboard then film

The most cost-effective video workflow: design and approve key frames cheaply in Nano Banana 2, then spend the expensive video budget only on animating the frames that made the cut. Explore the wider AI tools landscape to see where these fit a full pipeline.
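To see why storyboarding first is cheaper, compare it with generating a clip for every candidate idea. The Python sketch below is a rough model under stated assumptions: the scenario (20 candidate frames, 3 approved) is hypothetical, each candidate needs exactly one image, each approved frame needs exactly one clip, and the prices are the 1K standard image and 8-second 1080p Standard clip figures quoted in this guide.

```python
IMAGE_USD = 1120 * 60 / 1_000_000  # one 1K image, standard tier
CLIP_USD = 8 * 0.40                # one 8-second 1080p Standard clip


def storyboard_then_film(candidates: int, approved: int) -> float:
    """Cost of drafting every idea as an image, then animating the winners."""
    return candidates * IMAGE_USD + approved * CLIP_USD


def film_everything(candidates: int) -> float:
    """Cost of generating a video clip for every candidate idea."""
    return candidates * CLIP_USD


# Hypothetical campaign: 20 candidate frames, 3 make the cut.
staged = storyboard_then_film(candidates=20, approved=3)
direct = film_everything(candidates=20)
print(f"storyboard then film: ${staged:.2f}")
print(f"film everything:      ${direct:.2f}")
```

Real projects usually need several image iterations per frame and more than one video take, so treat both totals as lower bounds; the gap between them is the point.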

Edge Cases and Gotchas

Nano Banana 2 is the Flash-speed, high-volume tier. For premium professional images, step up to Nano Banana Pro (gemini-3-pro-image-preview), which trades speed and cost for higher fidelity and supports up to 5 characters and 6 high-fidelity objects.

A single Veo 3.1 generation tops out at 8 seconds. To build longer sequences, use the extend feature to continue a previously generated clip, and budget each segment separately since billing is per second of generated video.
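For budgeting a longer sequence, a simple approach is to sum the seconds across segments and multiply by the per-second rate. The Python sketch below assumes every segment, including extensions, is billed at the same per-second rate for its full generated length; that billing detail and the segment lengths in the example are assumptions to verify against the live pricing page, and sequence_cost is an illustrative name.

```python
from typing import Iterable


def sequence_cost(segment_seconds: Iterable[int],
                  rate_per_second: float) -> float:
    """Return the estimated USD cost of a multi-segment sequence.

    Assumes each segment is billed per second of generated video at the
    same rate, with no discount for extensions.
    """
    return sum(segment_seconds) * rate_per_second


# Hypothetical 24-second sequence built from three 8-second segments
# at the Fast 1080p rate quoted in this guide.
print(f"${sequence_cost([8, 8, 8], 0.15):.2f}")
```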

Every Nano Banana 2 image carries a mandatory SynthID watermark plus C2PA Content Credentials. You cannot opt out, so build provenance disclosure into your publishing workflow rather than treating it as optional. For teams formalizing this, see our AI governance frameworks.

Nano Banana 2 and the Veo 3.1 Lite tier are in Preview with tighter rate limits than production, and specs can change. The upside for Veo: failed video generations are not billed. The upside for Nano Banana 2: it uses interim "thought images" that refine the result without being charged. Build retries into your pipeline and verify current limits before scaling.

Go Deeper

Resources from across Tech Jacks Solutions

AI Governance Hub

Provenance, watermarking, and disclosure for AI-generated media

AI Governance Charter (free)

Set your organization's rules for generated content in one document

AI Glossary

Definitions for AI terms used in this article

Fact-checked against vendor documentation and official sources, June 2026. Verify current pricing at ai.google.dev/pricing before purchasing.

Freshness notice: Nano Banana 2 and the Veo 3.1 Lite tier are in Preview, and pricing changes rapidly. This comparison reflects data verified as of June 8, 2026. If you are reading this more than 90 days later, capabilities and rates may have shifted. Check our AI Tools Hub for the latest.

Google, Gemini, Nano Banana, and Veo are trademarks of Google LLC. SynthID is a Google DeepMind technology. C2PA is a project of the Coalition for Content Provenance and Authenticity. Tech Jacks Solutions is not affiliated with or endorsed by Google LLC.
