The Hugging Face ecosystem already has model cards, Spaces, checkpoints,
benchmarks, and demos. What it has still been missing is a fast general-purpose
way to see how a model is put together. We built
hfviewer.com to fill that gap:
paste a Hugging Face model URL, open an interactive architecture graph in the
browser, and move between overview and detail without installing anything.
This is our way of giving back to the Hugging Face community.
Why we built it
We kept running into the same problem: a model card can tell you
what a model is for, but it rarely helps you inspect the actual
structure quickly. If you want to understand where the vision encoder enters, how
the decoder repeats, whether the model routes through experts, or how a multimodal
merge happens, you often end up reading config files, staring at code, or building
your own mental graph from scattered clues.
hfviewer is meant to make that first architectural pass much
faster. You can open a model directly from the Hugging Face URL, get a visual map
in the browser, and then zoom from the broad system shape down into the more
specific substructure that matters for understanding deployment, latency, and
correctness.
What hfviewer does
Open models directly from Hugging Face
Paste a model URL or repo id and open the graph without a local setup or
notebook workflow.
Switch between overview and detail
Granularity levels let you move from the high-level architecture down to more
specific traced blocks and paths.
Compare model families
Family pages such as Gemma 4 let you compare
multiple related models with synchronized interaction instead of isolated
screenshots.
Video loop: the granularity transition on
embedl/Cosmos-Reason2-2B-W4A16-Edge2,
shown as a center-column crop at 2x speed with the info panel out of view.
A new kind of interactive blog
One of the most interesting things hfviewer enables is not just a
prettier model page, but a new kind of technical article. On the
Gemma 4 family page, the blog text and the graph are
connected. You can read a section about a particular architectural decision, jump
into the corresponding part of the graph, and then move back into the article with
the surrounding context still intact.
That matters because model understanding is rarely linear. Sometimes you start from
prose and need to verify it visually. Sometimes you see a node, a route, or a merge
in the graph and want the editorial explanation immediately. We think that
graph-to-text and text-to-graph loop is a better format for ML communication than a
static diagram dropped into a long post.
Where to start
Open a familiar model such as Qwen/Qwen3.5-4B to
get a feel for the main interaction model.
Jump to the Gemma 4 family page to see how the
same interface can support a synchronized comparison and an editorial walkthrough.
We are releasing this because we think architecture understanding should be easier
to share, easier to discuss, and easier to build on. Again, this is our
way of giving back to the Hugging Face community.
Technical overview
Understanding the Gemma 4 family
HANNES VON ESSEN
Gemma 4 is easiest to understand as one decoder-centered recipe adapted to three
deployment problems. The
E2B
and
E4B
members are edge models built for tight memory and latency budgets. The
31B
is the dense model for serious long-context and high-quality local or server
inference. The
26B-A4B
changes the economics instead, exposing far more total capacity while activating
only a 3.8B subset per token. The family is therefore more useful to read by
bottleneck than by parameter count alone.
1
One decoder recipe, three bottlenecks
What keeps the family coherent is the shared attention backbone. Across the
lineup,
local sliding-window attention
is interleaved with
full global attention,
and the final layer is always global. The edge models use 512-token sliding
windows and 128K context, while
31B
and
26B-A4B
move to 1024-token windows and 256K context. That matters because it makes
long context an architectural choice rather than just a larger configured
sequence-length limit.
The expensive part of the stack is also where the main optimizations are
concentrated. The global layers use unified Keys and Values and apply
proportional RoPE,
while the cheaper local layers keep the familiar
standard RoPE
regime. The point is not that every layer sees the whole sequence all the
time; it is that the model restores global communication often enough to make
the large window operationally meaningful.
1
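The difference between a local layer and a global layer can be sketched as a question of which key positions a query may read. This is a minimal illustration, not the Gemma 4 implementation: the function name `visible_keys` is hypothetical, and counting the query position itself as part of the window is an assumed convention.

```python
from typing import Optional


def visible_keys(query_pos: int, window: Optional[int]) -> range:
    """Key positions a causal query at `query_pos` may attend to.

    window=None models a full (global) causal layer. An integer models a
    sliding-window layer that sees only the most recent `window` tokens,
    including the query position itself (assumed convention).
    """
    if window is None:
        return range(0, query_pos + 1)
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


query_pos = 100_000
local = visible_keys(query_pos, window=1024)
full = visible_keys(query_pos, window=None)
print(len(local), len(full))  # 1024 100001
```

Deep into a long sequence, the local layer's read set stays fixed at the window size while the global layer's read set keeps growing with position, which is why the global layers are the expensive ones.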
The edge models do more than add modalities
The smallest dense models are the most architecturally distinctive.
E2B
is listed as 2.3B effective parameters but 5.1B with embeddings, and
E4B
as 4.5B effective but 8B with embeddings. The difference comes from
Per-Layer Embeddings,
which give each decoder layer its own small embedding for every token instead
of forcing a compact model to preserve all linguistic detail through one
bottom-layer embedding alone. In the visible graph that extra path shows up as
layer-specific text embeddings feeding a
Per-layer projection
that keeps token-specific text structure available deeper in the stack.
2
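A toy sketch can make the per-layer path concrete. Everything here is an assumption for illustration: the sizes are tiny, the weights are random, and adding the projected signal to the hidden state is one plausible way to combine it, not a confirmed detail of Gemma 4. Attention and feed-forward blocks are omitted.

```python
import random

random.seed(0)

VOCAB, HIDDEN, PLE_DIM, LAYERS = 10, 8, 2, 3  # toy sizes, not Gemma 4's


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One ordinary embedding table feeding the bottom of the stack.
token_embedding = rand_matrix(VOCAB, HIDDEN)
# One extra, much smaller embedding table per decoder layer.
per_layer_embedding = [rand_matrix(VOCAB, PLE_DIM) for _ in range(LAYERS)]
# One projection per layer, mapping PLE_DIM back up to HIDDEN.
per_layer_projection = [rand_matrix(PLE_DIM, HIDDEN) for _ in range(LAYERS)]


def project(vec, matrix):
    """Multiply a row vector by a (len(vec) x cols) matrix."""
    cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vec, matrix)) for c in range(cols)]


def forward(token_id):
    hidden = list(token_embedding[token_id])
    for layer in range(LAYERS):
        # Attention and feed-forward would run here (omitted).
        small = per_layer_embedding[layer][token_id]          # table lookup
        signal = project(small, per_layer_projection[layer])  # per-layer projection
        hidden = [h + s for h, s in zip(hidden, signal)]      # assumed combination
    return hidden


# Parameters added by the per-layer tables alone.
extra_table_params = LAYERS * VOCAB * PLE_DIM
print(len(forward(3)), extra_table_params)  # 8 60
```

The extra tables scale with vocabulary size times layer count, so they add a lot to the stored parameter count, but each token only performs one row lookup per layer. That is consistent with the gap the article describes between the effective and with-embeddings figures.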
That design choice matters because these are also the most ambitious
multimodal members at their size. All models accept
image input,
but only
E2B
and
E4B
add native
audio,
pairing roughly 150M-parameter
vision encoders
with roughly 300M-parameter
audio encoders.
In a compact multimodal decoder, reserved image and audio positions can
easily erode language precision. Giving later layers a direct token-specific
text signal is a clean way to preserve more of that structure.
1
The vision path is also more flexible than a fixed square-image pipeline. The
VisionEncoder
preserves natural aspect ratio, uses a 2D positional scheme so height and
width are represented separately, and exposes soft visual-token budgets of 70,
140, 280, 560, and 1120 tokens. At
masked_scatter,
projected image or audio features overwrite reserved placeholder positions in
the language-side sequence. That turns visual detail into an explicit
latency-quality knob: lower budgets make sense for captioning or video frames,
while higher budgets are better suited to OCR, document parsing, and small
text. After that replacement everything still goes through the same
Decoder cycle.
1
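The placeholder replacement can be sketched in plain Python. The node name matches PyTorch's `Tensor.masked_scatter`, but this sketch avoids any framework; the placeholder id `-1` and the function name `scatter_features` are hypothetical, and real embeddings would be vectors of the model's hidden size rather than one-element lists.

```python
IMAGE_PLACEHOLDER = -1  # hypothetical placeholder token id


def scatter_features(token_ids, text_embeddings, features,
                     placeholder=IMAGE_PLACEHOLDER):
    """Overwrite embeddings at placeholder positions with encoder features.

    Features are consumed in order, one per placeholder position. All other
    positions keep their text embedding.
    """
    n_slots = sum(1 for t in token_ids if t == placeholder)
    if n_slots != len(features):
        raise ValueError(f"{n_slots} placeholder slots but {len(features)} features")
    feature_iter = iter(features)
    return [next(feature_iter) if t == placeholder else emb
            for t, emb in zip(token_ids, text_embeddings)]


token_ids = [5, IMAGE_PLACEHOLDER, IMAGE_PLACEHOLDER, 7]
text_embeddings = [[1.0], [0.0], [0.0], [2.0]]
image_features = [[9.0], [8.0]]  # two projected visual tokens

merged = scatter_features(token_ids, text_embeddings, image_features)
print(merged)  # [[1.0], [9.0], [8.0], [2.0]]
```

In this picture the visual-token budget is simply the number of reserved placeholder positions: a larger budget lengthens the sequence the decoder has to process, which is where the latency cost comes from.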
31B is the dense long-context member
The
31B
is the cleanest dense expression of the recipe. It has 30.7B parameters, 60
layers,
1024-token sliding windows,
256K context, and a much larger
~550M vision encoder.
There is no routing trick and no per-layer embedding trick here; the point is
always-on capacity for long documents, repositories, codebases, and large
multimodal contexts where dense quality matters more than the cheapest
possible token.
1
5 sliding + 1 full, 1024-token local window, 256K = 262,144 tokens
The upper grid magnifies the causal look-back band so the layer schedule
stays legible. The ratio strip keeps the real scale visible: most 31B
layers only read a 1,024-token local history, and every sixth layer is the
expensive causal full pass that reconnects the entire 256K
(262,144-token) context.
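The schedule in the figure can be written out as a short sketch. It assumes the full-attention layer sits at the end of each group of six, which is consistent with the final layer being global, and it only counts how many key positions the last token of a full-length context reads per layer. It is a rough proxy for attention cost, not a latency model.

```python
def layer_schedule(num_layers=60, period=6):
    """Every `period`-th layer is full attention; the rest are sliding."""
    return ["full" if (i + 1) % period == 0 else "sliding"
            for i in range(num_layers)]


schedule = layer_schedule()
WINDOW, CONTEXT = 1024, 262_144

# Keys read by the final token of a full 256K context, per layer type.
keys_read = {"sliding": WINDOW, "full": CONTEXT}
interleaved_total = sum(keys_read[kind] for kind in schedule)
all_full_total = CONTEXT * len(schedule)

print(schedule.count("sliding"), schedule.count("full"), schedule[-1])
print(interleaved_total, all_full_total)
print(round(interleaved_total / all_full_total, 3))
```

Under these assumptions the interleaved stack reads roughly a sixth of the key positions that an all-global stack of the same depth would read, with nearly all of the remaining cost concentrated in the ten full layers.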
The deployment implication is straightforward. Loading the weights alone takes
about 58.3 GB in BF16 or 17.4 GB in Q4_0, before runtime overhead and KV
cache. So 31B can run locally in quantized form, but its natural home is
still a serious workstation or server GPU when long-context and dense-model
headroom are the priority.
2
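A back-of-the-envelope estimate shows where numbers of this size come from. The sketch multiplies the 30.7B parameter count by bits per weight, using 16 bits for BF16 and 4.5 bits for Q4_0 (blocks of 32 weights stored as 16 bytes of 4-bit values plus a 2-byte scale). The function name is hypothetical, and the result is only a rough estimate: it does not reproduce the quoted figures exactly, presumably because they reflect details not modeled here, such as the vision encoder, the GB versus GiB convention, or tensors kept at higher precision.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


PARAMS_31B = 30.7e9
BITS = {
    "BF16": 16.0,
    "Q4_0": 4.5,  # 32 weights per block: 16 bytes of nibbles + 2-byte scale
}

for name, bits in BITS.items():
    size = weight_bytes(PARAMS_31B, bits)
    print(f"{name}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Neither estimate includes the KV cache, which grows with context length and is a separate budget item at 256K.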
Position handling follows the same logic.
Rotary embedding
remains the positional backbone. RoPE is the mechanism that injects position
into attention by rotating the query and key vectors with a position-dependent
phase, so token order is represented inside the attention computation itself.
Gemma 4 does not use one RoPE regime everywhere, however.
Sliding-attention layers are annotated with standard RoPE,
while
full-attention layers are annotated with proportional RoPE.
Gemma 4's proportional variant changes the rotation schedule for the
long-range layers by using a much larger base period and rotating only part
of the attention head dimension. The positional signal in those layers
therefore degrades more slowly as sequence length grows, so the periodic
full-attention passes remain useful even when the sequence is very long. In
other words, long context is treated as an inference-systems problem as much
as a modeling problem.
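The rotation itself is short enough to sketch. The function below rotates adjacent pairs of a vector by a position-dependent angle and exposes the two knobs the paragraph mentions, the base period and the fraction of dimensions that is rotated. The specific values used for the second variant (`base=1_000_000`, `rotary_fraction=0.25`) are illustrative assumptions, not Gemma 4's actual settings, and the adjacent-pair layout is one common convention.

```python
import math


def rope(vec, pos, base=10_000.0, rotary_fraction=1.0):
    """Rotate adjacent pairs of `vec` by angles that depend on `pos`.

    Only the first `rotary_fraction` of the dimensions is rotated; the
    remaining dimensions pass through unchanged.
    """
    dim = len(vec)
    rot_dim = int(dim * rotary_fraction) // 2 * 2  # keep an even count
    out = list(vec)
    for i in range(0, rot_dim, 2):
        theta = pos * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [0.8, 0.1, -0.6, 1.3, -0.2, 0.4, 0.5, -1.0]

for label, kwargs in [
    ("standard", dict(base=10_000.0, rotary_fraction=1.0)),
    ("partial, larger base", dict(base=1_000_000.0, rotary_fraction=0.25)),
]:
    near = dot(rope(q, 5, **kwargs), rope(k, 3, **kwargs))
    far = dot(rope(q, 105, **kwargs), rope(k, 103, **kwargs))
    print(label, abs(near - far) < 1e-9)  # True: only the offset matters
```

Both variants keep the defining property of RoPE: the attention score depends on the distance between query and key rather than on their absolute positions. The unrotated dimensions in the partial variant carry no positional phase at all, so they contribute the same score at any distance.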
26B-A4B changes the cost model
The
26B-A4B
asks a different question. It has 25.2B total parameters, 3.8B active
parameters, 30 layers, 1024-token sliding windows, 256K context, and an
expert layout of 8 active experts, 128 total experts, and 1 shared expert.
Instead of sending every token through the same feed-forward path, it uses a
Router
to decide which
Experts
handle each token. The model therefore exposes more conditional capacity only
where the token needs it.
1
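A minimal sketch of the routing step, using the 8-of-128 layout plus one shared expert described above. The details are assumptions rather than the Gemma 4 implementation: the weights are a softmax over the selected logits only, the shared expert's output is simply added, and the "experts" are toy scalar functions so the example stays self-contained.

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 8


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest logits.

    Weights are a softmax over the selected logits only (an assumption).
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_output(x, experts, shared_expert, router_logits):
    """Shared expert always runs; routed experts are mixed by weight."""
    out = shared_expert(x)
    for idx, weight in route(router_logits):
        out += weight * experts[idx](x)
    return out


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)

# Toy scalar experts: expert i multiplies its input by i.
experts = [lambda x, scale=i: scale * x for i in range(NUM_EXPERTS)]
y = moe_output(1.0, experts, shared_expert=lambda x: x, router_logits=logits)

print(len(selected), round(sum(w for _, w in selected), 6))  # 8 1.0
```

Only 8 of the 128 expert functions are evaluated for this token, yet any of the 128 could have been chosen, so all of them have to be available.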
That makes
31B
complementary rather than redundant with it. The dense model buys always-on
depth. The MoE model buys conditional feed-forward capacity. The savings,
however, are in active compute rather than residency: all 25.2B parameters
still need to be loaded because the router can pick any expert for any token,
which is why the Q4_0 load footprint is still about 15.6 GB. That is close
enough to 31B's 17.4 GB that 26B-A4B
makes the most sense on a gaming GPU or workstation, where you want much more
headroom than E4B without paying for a dense 31B-style forward pass on every
token.
2
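The residency-versus-compute distinction comes down to a few ratios taken from the figures quoted in this article. The sketch treats parameters touched per token as a stand-in for per-token compute, which is a simplification: it ignores attention cost and memory bandwidth.

```python
TOTAL_26B = 25.2e9    # parameters that must be resident
ACTIVE_26B = 3.8e9    # parameters used per token
DENSE_31B = 30.7e9    # dense model: resident and active are the same

Q4_26B_GB, Q4_31B_GB = 15.6, 17.4  # Q4_0 load footprints quoted above

active_share = ACTIVE_26B / TOTAL_26B    # share of its own weights used per token
active_vs_dense = ACTIVE_26B / DENSE_31B  # per-token parameters relative to 31B
footprint_vs_dense = Q4_26B_GB / Q4_31B_GB  # load footprint relative to 31B

print(f"active share of 26B-A4B: {active_share:.1%}")
print(f"active parameters vs 31B: {active_vs_dense:.1%}")
print(f"Q4_0 footprint vs 31B:    {footprint_vs_dense:.1%}")
```

On these numbers the MoE model touches about an eighth of the parameters per token that the dense model does while needing roughly nine tenths of the memory, which is the asymmetry the paragraph describes.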
Speed, accuracy, and the new local frontier
The edge speed story is the clearest. Published device measurements put E2B
at 52 GPU decode tokens per second on a Galaxy S26 Ultra, 57 on an iPhone 17
Pro, and 160 on a MacBook Pro M4 GPU. E4B lands at 22, 25, and 101 tokens
per second on those same device classes. That is fast enough to make the E
line feel genuinely interactive on phones and laptops rather than merely able
to run locally.
3
Those speeds do come with a ceiling. On MMLU Pro and GPQA Diamond, E2B
scores 60.0 and 43.4, E4B scores 69.4 and 58.6, 26B-A4B scores 82.6 and
82.3, and 31B scores 85.2 and 84.3. But the smaller models are still more
capable than their size class would suggest: E4B already edges the earlier
27B dense baseline on MMLU Pro and outperforms it decisively on GPQA
Diamond. That is a strong sign that the compact end of the family is not just
cheaper, but genuinely better positioned on the quality curve than the
previous generation.
1
The more important comparison is the same-latency one. In a recent controlled
benchmark,
E4B
reached 0.675 weighted accuracy at 5.458 seconds mean latency, while
Qwen3-8B reached 0.322 at 5.041 seconds. E2B reached 0.493 at 4.913 seconds,
versus Phi-4-reasoning at 0.427 and 4.857 seconds. That is the real win for
the small models: they still give up headroom to the larger members, but they
appear to deliver more accuracy at roughly the same latency as nearby
alternatives.
4
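The same-latency comparison is easier to read as paired differences. The sketch below uses only the four accuracy and latency figures quoted above and pairs each Gemma 4 edge model with the alternative it is compared against in the text. The variable names are illustrative.

```python
# (weighted accuracy, mean latency in seconds), as quoted in the text
results = {
    "E4B": (0.675, 5.458),
    "Qwen3-8B": (0.322, 5.041),
    "E2B": (0.493, 4.913),
    "Phi-4-reasoning": (0.427, 4.857),
}

pairs = [("E4B", "Qwen3-8B"), ("E2B", "Phi-4-reasoning")]

gaps = {}
for gemma, other in pairs:
    acc_gap = results[gemma][0] - results[other][0]
    lat_gap = results[gemma][1] - results[other][1]
    gaps[(gemma, other)] = (acc_gap, lat_gap)
    print(f"{gemma} vs {other}: {acc_gap:+.3f} accuracy, {lat_gap:+.3f} s latency")
```

In both pairs the Gemma 4 model is slightly slower, by 0.417 s and 0.056 s respectively, so "roughly the same latency" is the accurate framing rather than "faster".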
A compact deployment snapshot based on those published figures looks like
this.
Values are a deployment snapshot rather than a single apples-to-apples
benchmark across one toolchain and one hardware stack.
What Gemma 4 means
What makes the family interesting is that its members are not just scaled
copies of one another. The
E2B
and
E4B
edge models add per-layer text scaffolding and audio because compact
multimodal decoders need extra help preserving language precision under edge
constraints.
31B
stays dense because long-context quality benefits from always-on capacity.
26B-A4B
uses routing because a workstation can often afford the residency of a larger
model even when it cannot afford to spend dense 30B-class compute on every
token.
5
That gives practitioners a cleaner selection rule than parameter count alone.
Choose
E2B
or
E4B
when privacy, battery, latency, and offline multimodality are the
constraint. Choose
26B-A4B
when you have a gaming GPU or workstation and want a better
capacity-per-token bargain. Choose
31B
when dense quality and long-context reliability are worth building heavier
hardware around. The family's real contribution is that each member moves a
different bottleneck while still sharing the same architectural center.
1