embedl/Qwen3-1.7B-FlashHead-W4A16 neural network architecture graph

hfviewer renders an interactive architecture graph for the Hugging Face model embedl/Qwen3-1.7B-FlashHead-W4A16. The graph is built from the model's real structure: nodes carry source-faithful module, class, and operation names (embeddings, attention blocks, feed-forward layers, normalization, task heads), edges follow the actual forward dataflow, and repeated blocks are grouped with their true repeat counts. Where the model could be executed, the graph is trace-backed; otherwise it is derived from the reviewed configuration and source code.

Architecture components in this graph, each explained in the hfviewer glossary: Gated MLP (SwiGLU) · Grouped-query attention (GQA) · LM head (output projection) · QK-Norm · Residual (skip) connection · RMSNorm.

Related architecture graphs on hfviewer: embedl/Qwen3-0.6B-FlashHead · embedl/Qwen3-1.7B-FlashHead · Qwen/Qwen3-0.6B · Qwen/Qwen3-8B · Qwen/Qwen3-32B · Qwen/Qwen3-1.7B.

Browse more graphs from embedl or explore other models on the hfviewer home page.

hfviewer by embedl

Interactive model architecture

Paste a Hugging Face link to visualize it

No export step, no config hunt, no model surgery. Paste the link and inspect the graph.

Graph structure: Understand the high-level graph structure of different transformer models. See the Quickstart guide.

URL magic: Replace huggingface.co with hfviewer.com in a model URL to view its graph.

Chrome extension: View each model's graph directly on Hugging Face.

Embed in model card: Embed the architecture graph directly in your Hugging Face model card.
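The URL swap described under "URL magic" can be sketched in a few lines of Python. This is only an illustration of the host replacement: the function name `to_hfviewer_url` is hypothetical, and the sketch assumes the path after the host carries over unchanged, which is what the description above implies.

```python
from urllib.parse import urlsplit, urlunsplit


def to_hfviewer_url(hf_url: str) -> str:
    """Swap the huggingface.co host for hfviewer.com and keep the rest of the URL."""
    parts = urlsplit(hf_url)
    if parts.netloc.lower() != "huggingface.co":
        raise ValueError(f"not a huggingface.co URL: {hf_url!r}")
    # Only the host changes; scheme, path, query, and fragment are preserved.
    return urlunsplit(parts._replace(netloc="hfviewer.com"))


print(to_hfviewer_url("https://huggingface.co/Qwen/Qwen3-1.7B"))
# prints https://hfviewer.com/Qwen/Qwen3-1.7B
```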

MODEL PAGES WITH HFVIEWER

Community showcase.

Add to your model card!

Hugging Face authors are adding the hfviewer model card directly to their READMEs.

Nekochu/nanochat-d24

A 1.4B nanochat-style chat model in the Karpathy lineage, trained end to end on a single RTX 5090.

Quazim0t0/Escarda-86M-Base

An experimental SpikeWhale base language model for compact text generation, using custom Transformers code with MLA, JEPA-inspired design, and two-expert routing.

Quazim0t0/Byrne-86M-Base

A companion SpikeWhale base model for small text-generation experiments.

Quazim0t0/SpikeWhale-SNN-216M

A from-scratch spiking language model: four stacked leaky integrate-and-fire neuron layers trained with surrogate gradients.

Quazim0t0/Positronic-144M

A 144M-parameter causal LM whose channel mixing runs on Kuramoto oscillators instead of a plain MLP, on a conventional transformer backbone.

Quazim0t0/Mycel-LM-79M

A 79M-parameter research model whose channel mixing is a differentiable neighbour-sensing layer inspired by fungal colonies.

RobinsonLabs

RobinsonLabs includes architecture links to the hfviewer graph for each of their model families.

hfviewer card for RobinsonLabs/Qwen3.5-122B-A10B-REAP-20-abliterated

hfviewer card for RobinsonLabs/Qwen3.5-122B-A10B-REAP-30-abliterated

hfviewer card for RobinsonLabs/Qwen3.6-35B-A3B-abliterated

Bertug1911/BrtGPT-1-0719

A BrtGPT conversational text-generation checkpoint trained on LaMini-instruction, with code and math evaluations highlighted in the model card.

haris2k/PhyUS-Net

A physics-guided ultrasound segmentation suite covering UNet, UNet++, SegFormer, SwinUnet, TransUNet, and style-adaptation checkpoints.

Sandroeth/cali-0.1B

A compact CALI causal language model with custom blocks and evaluation-tracker coverage across ARC, HellaSwag, MMLU, TruthfulQA, and WinoGrande.

wop/Cosmos-T-80M

An 80M MiniGPT-style causal language model using the Qwen2.5 tokenizer, 12 decoder blocks, causal attention, and a tied language-model head.

AxiomicLabs/GPT-X2-125M

A 125M custom GPT-X2 language model with RoPE, SwiGLU, grouped-query attention, and curriculum-trained code/math normalization.

AxiomicLabs/GPT-S-5M

A compact 5M GPT-S language model using RoPE, SwiGLU, grouped-query attention, and exclusive shared attention for small-model reasoning experiments.

kalyan-ks/ettin-68m-nemotron-pii

A MIT-licensed Ettin encoder fine-tuned for PII token classification on Nemotron-PII, reporting 96.27 F1 on the test split.

BEE-spoke-data/mega-ar-126m-4k

A compact 126M language model built with MEGA rather than a standard transformer stack, with a 4096-token context length.

DominicTWHV/Horizon-1-Text-Large

A larger, more modern Constellation-One variant for Cockatoo, fine-tuned from answerdotai/ModernBERT-large.

BEE-spoke-data/pegasus-x-base-synthsumm_open-16k

A Pegasus-X based long-document summarization model fine-tuned on synthetic summaries for general summarization with long contexts.

juiceb0xc0de/bella-bartender-gemma-e2b

A Gemma 4 E2B based conversational model tuned toward a more direct, personal voice rather than a generic assistant cadence.

databoyface/bert-base-uncased-ome-v5

A BERT based OME v5 classifier for English emotion examples, fine-tuned over 26 categories and reporting 98.63 percent accuracy on eval.

databoyface/distilroberta-base-ome-v5.2

A DistilRoBERTa-based OME v5.2 emotion classifier for English text, fine-tuned over 26 categories and reporting 98.03 percent eval accuracy.

AINovice2005/ModernBERT-base-lora-cicflow-1m-r4

A LoRA fine-tune of ModernBERT-base for binary classification, designed for high recall with controlled false positives.

If you are interested in deploying these models to edge devices, check out our other products:

embedl deploy: Quantization solved · embedl hub: Compliant MLOps · embedl models: optimized genai

Release note

Introducing hfviewer

The Hugging Face ecosystem already has model cards, Spaces, checkpoints, benchmarks, and demos. What it has still been missing is a fast general-purpose way to see how a model is put together. We built hfviewer.com to fill that gap: paste a Hugging Face model URL, open an interactive architecture graph in the browser, and move between overview and detail without installing anything.

Why we are doing this

This is our way of giving back to the Hugging Face community.

Why we built it

We kept running into the same problem: a model card can tell you what a model is for, but it rarely helps you inspect the actual structure quickly. If you want to understand where the vision encoder enters, how the decoder repeats, whether the model routes through experts, or how a multimodal merge happens, you often end up reading config files, staring at code, or building your own mental graph from scattered clues.

hfviewer is meant to make that first architectural pass much faster. You can open a model directly from the Hugging Face URL, get a visual map in the browser, and then zoom from the broad system shape down into the more specific substructure that matters for understanding deployment, latency, and correctness.

What hfviewer does

Open models directly from Hugging Face

Paste a model URL or repo id and open the graph without a local setup or notebook workflow.

Switch between overview and detail

Granularity levels let you move from the high-level architecture down to more specific traced blocks and paths.

Compare model families

Family pages such as Gemma 4 let you compare multiple related models with synchronized interaction instead of isolated screenshots.

[Video loop: embedl/Cosmos-Reason2-2B-W4A16-Edge2 at 2x speed, cropped to the center column to show the transition between granularity levels.]

A new kind of interactive blog

One of the most interesting things hfviewer enables is not just a prettier model page, but a new kind of technical article. On the Gemma 4 family page, the blog text and the graph are connected. You can read a section about a particular architectural decision, jump into the corresponding part of the graph, and then move back into the article with the surrounding context still intact.

That matters because model understanding is rarely linear. Sometimes you start from prose and need to verify it visually. Sometimes you see a node, a route, or a merge in the graph and want the editorial explanation immediately. We think that graph-to-text and text-to-graph loop is a better format for ML communication than a static diagram dropped into a long post.

Where to start

Open a familiar model such as Qwen/Qwen3.5-4B to get a feel for the main interaction model.
Jump to the Gemma 4 family page to see how the same interface can support a synchronized comparison and an editorial walkthrough.

We are releasing this because we think architecture understanding should be easier to share, easier to discuss, and easier to build on. Again: This is our way of giving back to the Hugging Face community.

Technical overview

Understanding the Gemma 4 family

HANNES VON ESSEN APR 15, 2026

Gemma 4 is easiest to understand as one decoder-centered recipe adapted to three deployment problems. The E2B and E4B members are edge models built for tight memory and latency budgets. The 31B is the dense model for serious long-context and high-quality local or server inference. The 26B-A4B changes the economics instead, exposing far more total capacity while activating only a 3.8B subset per token. The family is therefore more useful to read by bottleneck than by parameter count alone. ¹

One decoder recipe, three bottlenecks

What keeps the family coherent is the shared attention backbone. Across the lineup, local sliding-window attention is interleaved with full global attention, and the final layer is always global. The edge models use 512-token sliding windows and 128K context, while 31B and 26B-A4B move to 1024-token windows and 256K context. That matters because it makes long context an architectural choice rather than just a larger maximum sequence length.

The expensive part of the stack is also where the main optimizations are concentrated. The global layers use unified Keys and Values and apply proportional RoPE, while the cheaper local layers keep the familiar standard RoPE regime. The point is not that every layer sees the whole sequence all the time; it is that the model restores global communication often enough to make the large window operationally meaningful. ¹

The edge models do more than add modalities

The smallest dense models are the most architecturally distinctive. E2B is listed as 2.3B effective parameters but 5.1B with embeddings, and E4B as 4.5B effective but 8B with embeddings. The difference comes from Per-Layer Embeddings, which give each decoder layer its own small embedding for every token instead of forcing a compact model to preserve all linguistic detail through one bottom-layer embedding alone. In the visible graph that extra path shows up as layer-specific text embeddings feeding a Per-layer projection that keeps token-specific text structure available deeper in the stack. ²

That design choice matters because these are also the most ambitious multimodal members at their size. All models accept image input, but only E2B and E4B add native audio, pairing roughly 150M-parameter vision encoders with roughly 300M-parameter audio encoders. In a compact multimodal decoder, reserved image and audio positions can easily erode language precision. Giving later layers a direct token-specific text signal is a clean way to preserve more of that structure. ¹

The vision path is also more flexible than a fixed square-image pipeline. The VisionEncoder preserves natural aspect ratio, uses a 2D positional scheme so height and width are represented separately, and exposes soft visual-token budgets of 70, 140, 280, 560, and 1120 tokens. At masked_scatter, projected image or audio features overwrite reserved placeholder positions in the language-side sequence. That turns visual detail into an explicit latency-quality knob: lower budgets make sense for captioning or video frames, while higher budgets are better suited to OCR, document parsing, and small text. After that replacement everything still goes through the same Decoder cycle. ¹
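The placeholder replacement described above can be illustrated with a small pure-Python stand-in for the semantics of a masked scatter. This is a sketch under stated assumptions, not the model's code: the placeholder id `IMAGE_PLACEHOLDER`, the two-dimensional toy embeddings, and the strict length check are all hypothetical choices made for the example.

```python
def masked_scatter(sequence, mask, source):
    """Copy rows of `source`, in order, into the positions of `sequence` where `mask` is True."""
    if sum(mask) != len(source):
        # Stricter than necessary, but it catches mismatched placeholder counts early.
        raise ValueError("number of placeholder positions must match number of features")
    features = iter(source)
    return [next(features) if is_placeholder else row
            for row, is_placeholder in zip(sequence, mask)]


IMAGE_PLACEHOLDER = -1  # hypothetical reserved token id, for illustration only

token_ids = [11, IMAGE_PLACEHOLDER, IMAGE_PLACEHOLDER, 42]
text_embeddings = [[float(t), float(t)] for t in token_ids]  # toy language-side embeddings
image_features = [[0.1, 0.2], [0.3, 0.4]]                    # toy projected vision features

mask = [t == IMAGE_PLACEHOLDER for t in token_ids]
merged = masked_scatter(text_embeddings, mask, image_features)
print(merged)
# prints [[11.0, 11.0], [0.1, 0.2], [0.3, 0.4], [42.0, 42.0]]
```

The text positions keep their embeddings and only the reserved slots are overwritten, so the merged sequence has the same length as the original and can flow through the decoder unchanged.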

31B is the dense long-context member

The 31B is the cleanest dense expression of the recipe. It has 30.7B parameters, 60 layers, 1024-token sliding windows, 256K context, and a much larger ~550M vision encoder. There is no routing trick and no per-layer embedding trick here; the point is always-on capacity for long documents, repositories, codebases, and large multimodal contexts where dense quality matters more than the cheapest possible token. ¹

[Figure: 31B attention schedule. 5 sliding + 1 full; 1024-token local window; 256K = 262,144 tokens.]

The upper grid magnifies the causal look-back band so the layer schedule stays legible. The ratio strip keeps the real scale visible: most 31B layers only read a 1,024-token local history, and every sixth layer is the expensive causal full pass that reconnects the entire 256K (262,144-token) context.
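The schedule in the figure can be reproduced with a short sketch. It assumes the 5 sliding + 1 full pattern repeats evenly across the 60 layers of the 31B and that the 1,024-token window includes the current token; the function names are hypothetical.

```python
def layer_schedule(num_layers: int, sliding_per_full: int = 5) -> list[str]:
    """Label each layer 'sliding' or 'full', with every (sliding_per_full + 1)-th layer full."""
    period = sliding_per_full + 1
    return ["full" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def tokens_read(position: int, kind: str, window: int = 1024) -> int:
    """How many tokens one causal query at `position` attends to in a layer of this kind."""
    seen = position + 1  # causal attention: positions 0..position
    return seen if kind == "full" else min(seen, window)


schedule = layer_schedule(60)
last_position = 262_144 - 1  # final token of a full 256K context

per_layer = [tokens_read(last_position, kind) for kind in schedule]
print(schedule.count("sliding"), schedule.count("full"), schedule[-1])
# prints 50 10 full
print(sum(per_layer), 60 * 262_144)
# prints 2672640 15728640
```

Under these assumptions the last token reads about 2.7M key positions across the stack instead of the 15.7M it would read if every layer were full attention, and the final layer still comes out global.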

The deployment implication is straightforward. Loading the weights alone is about 58.3 GB in BF16 or 17.4 GB in Q4_0, before runtime overhead and KV cache. So 31B can be made local in quantized form, but its natural home is still a serious workstation or server GPU when long-context and dense-model headroom are the priority. ²

Position handling follows the same logic. Rotary embedding remains the positional backbone. RoPE is the mechanism that injects position into attention by rotating the query and key vectors with a position-dependent phase, so token order is represented inside the attention computation itself. Gemma 4 does not use one RoPE regime everywhere, however. Sliding-attention layers are annotated with standard RoPE, while full-attention layers are annotated with proportional RoPE. Gemma 4's proportional variant changes the rotation schedule for the long-range layers by using a much larger base period and rotating only part of the attention head dimension. The long-range layers therefore age more gracefully as sequence length grows, so the periodic full-attention passes remain useful even when the sequence is very long. The point is clear: long context is treated as an inference-systems problem as much as a modeling problem.
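A minimal rotary-embedding sketch makes the two knobs mentioned above concrete: the base period and the fraction of the head dimension that is rotated. The values used here (`base=1_000_000.0`, `rotary_fraction=0.25`, an 8-dimensional head) are illustrative assumptions, not Gemma 4's actual settings, and the adjacent-pair layout is one of several equivalent conventions.

```python
import math


def rope(vec, position, base=10_000.0, rotary_fraction=1.0):
    """Rotate the leading `rotary_fraction` of `vec` in 2-D pairs; leave the rest untouched."""
    dim = len(vec)
    rotary_dim = int(dim * rotary_fraction) // 2 * 2  # rotated dims come in pairs
    out = list(vec)
    for i in range(0, rotary_dim, 2):
        # Each pair gets its own frequency; a larger base slows every rotation down.
        theta = position * base ** (-i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.75, 1.0]
k = [1.0, 0.5, -0.5, 2.0, 0.25, 1.0, -1.5, 0.5]

# The attention score depends only on the distance between query and key positions.
near = dot(rope(q, 7), rope(k, 3))
far = dot(rope(q, 1007), rope(k, 1003))
print(abs(near - far) < 1e-9)
# prints True

# Partial rotation with a larger base: only the first quarter of the head carries position.
partial = rope(q, 7, base=1_000_000.0, rotary_fraction=0.25)
print(partial[2:] == q[2:])
# prints True
```

The first check shows why position lives inside the attention computation itself: shifting both tokens by 1,000 positions leaves the score unchanged. The second shows the partial variant leaving most of the head position-free.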

26B-A4B changes the cost model

The 26B-A4B asks a different question. It has 25.2B total parameters, 3.8B active parameters, 30 layers, 1024-token sliding windows, 256K context, and an expert layout of 8 active experts, 128 total experts, and 1 shared expert. Instead of sending every token through the same feed-forward path, it uses a Router to decide which Experts handle each token. The model therefore exposes more conditional capacity only where the token needs it. ¹
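The routing idea can be sketched with toy scalar experts. The expert counts (8 active of 128, plus 1 shared) come from the paragraph above; everything else is an assumption for illustration, including the softmax over only the selected logits, the way the shared expert is added, and the sine-based stand-in for learned router scores.

```python
import math


def route(router_logits, top_k=8):
    """Pick the top_k experts by logit and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[e] for e in chosen)  # subtract the max for numerical stability
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}


def moe_forward(x, router_logits, experts, shared_expert, top_k=8):
    """Combine the always-on shared expert with the weighted outputs of the routed experts."""
    weights = route(router_logits, top_k)
    routed = sum(w * experts[e](x) for e, w in weights.items())
    return shared_expert(x) + routed


# Toy stand-ins: each "expert" is a scalar function instead of a feed-forward block.
experts = [lambda x, scale=e: x * scale / 128 for e in range(128)]
shared_expert = lambda x: x
router_logits = [math.sin(e) for e in range(128)]  # deterministic stand-in for learned scores

weights = route(router_logits)
output = moe_forward(1.0, router_logits, experts, shared_expert)
print(len(weights), len(experts))
# prints 8 128
```

Only the 8 selected experts are evaluated for this token, yet all 128 must stay in memory because the router may pick any of them for the next token.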

That makes 31B complementary rather than redundant with it. The dense model buys always-on depth. The MoE model buys conditional feed-forward capacity. The savings, however, are in active compute rather than residency: all 25.2B parameters still need to be loaded for routing, which is why the Q4_0 load footprint is still about 15.6 GB. That is close enough to 31B's 17.4 GB that 26B-A4B makes the most sense on a gaming GPU or workstation, where you want much more headroom than E4B without paying for a dense 31B-style forward pass on every token. ²

Speed, accuracy, and the new local frontier

The edge speed story is the clearest. Published device measurements put E2B at 52 GPU decode tokens per second on a Galaxy S26 Ultra, 57 on an iPhone 17 Pro, and 160 on a MacBook Pro M4 GPU. E4B lands at 22, 25, and 101 tokens per second on those same device classes. That is fast enough to make the E line feel genuinely interactive on phones and laptops rather than merely able to run locally. ³

Those speeds do come with a ceiling. On MMLU Pro and GPQA Diamond, E2B scores 60.0 and 43.4, E4B scores 69.4 and 58.6, 26B-A4B scores 82.6 and 82.3, and 31B scores 85.2 and 84.3. But the smaller models are still more capable than their size class would suggest: E4B already edges the earlier 27B dense baseline on MMLU Pro and outperforms it decisively on GPQA Diamond. That is a strong sign that the compact end of the family is not just cheaper, but genuinely better positioned on the quality curve than the previous generation. ¹

The more important comparison is the same-latency one. In a recent controlled benchmark, E4B reached 0.675 weighted accuracy at 5.458 seconds mean latency, while Qwen3-8B reached 0.322 at 5.041 seconds. E2B reached 0.493 at 4.913 seconds, versus Phi-4-reasoning at 0.427 and 4.857 seconds. That is the real win for the small models: they still give up headroom to the larger members, but they appear to deliver more accuracy at roughly the same latency as nearby alternatives. ⁴

A compact deployment snapshot based on those published figures looks like this.

Model	Best-fit hardware	Published speed / cost signal	Quality signal	What you are buying
E2B	Phone, browser, Raspberry Pi, lightweight laptop	52–57 GPU decode tok/s on flagship phones; 160 tok/s on MacBook Pro M4 GPU	MMLU Pro 60.0; GPQA 43.4; weighted accuracy 0.493 at 4.913s	Fastest local multimodal member, with the lowest reasoning ceiling.
E4B	Phone, laptop edge, offline assistant	22–25 GPU decode tok/s on flagship phones; 101 tok/s on MacBook Pro M4 GPU	MMLU Pro 69.4; GPQA 58.6; weighted accuracy 0.675 at 5.458s	The strongest small-model operating point.
26B-A4B	Gaming GPU, RTX-class workstation, high-end local box	3.8B active params per token; ~15.6 GB Q4_0 load	MMLU Pro 82.6; GPQA 82.3	Much more conditional capacity without dense 31B compute every pass.
31B	Workstation or server GPU	30.7B dense model; ~17.4 GB Q4_0 load or ~58.3 GB BF16 load	MMLU Pro 85.2; GPQA 84.3	Maximum dense quality and long-context headroom.

Values are a deployment snapshot rather than a single apples-to-apples benchmark across one toolchain and one hardware stack.

What Gemma 4 means

What makes the family interesting is that its members are not just scaled copies of one another. The E2B and E4B edge models add per-layer text scaffolding and audio because compact multimodal decoders need extra help preserving language precision under edge constraints. 31B stays dense because long-context quality benefits from always-on capacity. 26B-A4B uses routing because a workstation can often afford the residency of a larger model even when it cannot afford to spend dense 30B-class compute on every token. ⁵

That gives practitioners a cleaner selection rule than parameter count alone. Choose E2B or E4B when privacy, battery, latency, and offline multimodality are the constraint. Choose 26B-A4B when you have a gaming GPU or workstation and want a better capacity-per-token bargain. Choose 31B when dense quality and long-context reliability are worth building heavier hardware around. The family's real contribution is that each member moves a different bottleneck while still sharing the same architectural center. ¹

hfviewer blog.

Architecture trends over 5 years

LFM2.5-Audio: edge-first speech inference

DeepSeek V4 mHC Explained

Borealis - open recipe for training Audio LLM

Understanding the Gemma 4 family

How to visualize any Hugging Face model
