Open models directly from Hugging Face
The Hugging Face ecosystem already has model cards, Spaces, checkpoints, benchmarks, and demos. What it still lacks is a fast, general-purpose way to see how a model is put together. We built hfviewer.com to fill that gap: paste a Hugging Face model URL, open an interactive architecture graph in the browser, and move between overview and detail without installing anything.
This is our way of giving back to the Hugging Face community.
We kept running into the same problem: a model card can tell you what a model is for, but it rarely helps you inspect the actual structure quickly. If you want to understand where the vision encoder enters, how the decoder repeats, whether the model routes through experts, or how a multimodal merge happens, you often end up reading config files, staring at code, or building your own mental graph from scattered clues.
hfviewer is meant to make that first architectural pass much faster. You can open a model directly from the Hugging Face URL, get a visual map in the browser, and then zoom from the broad system shape down into the more specific substructure that matters for understanding deployment, latency, and correctness.
Paste a model URL or repo id and open the graph without a local setup or notebook workflow.
Granularity levels let you move from the high-level architecture down to more specific traced blocks and paths.
Family pages such as Gemma 4 let you compare multiple related models with synchronized interaction instead of isolated screenshots.
This loop shows embedl/Cosmos-Reason2-2B-W4A16-Edge2 as a natively rendered higher-resolution center-column crop at 2x speed, keeping the core granularity transition prominent while the info panel stays out of view.
One of the most interesting things hfviewer enables is not just a prettier model page, but a new kind of technical article. On the Gemma 4 family page, the blog text and the graph are connected. You can read a section about a particular architectural decision, jump into the corresponding part of the graph, and then move back into the article with the surrounding context still intact.
That matters because model understanding is rarely linear. Sometimes you start from prose and need to verify it visually. Sometimes you see a node, a route, or a merge in the graph and want the editorial explanation immediately. We think that graph-to-text and text-to-graph loop is a better format for ML communication than a static diagram dropped into a long post.
We are releasing this because we think architecture understanding should be easier to share, easier to discuss, and easier to build on. Again: This is our way of giving back to the Hugging Face community.
Processing a model: the first request can take up to a few minutes while the server analyzes the model and creates the graph.
Gemma 4 is easiest to understand as one decoder-centered recipe adapted to three deployment problems. The E2B and E4B members are edge models built for tight memory and latency budgets. The 31B is the dense model for serious long-context and high-quality local or server inference. The 26B-A4B changes the economics instead, exposing far more total capacity while activating only a 3.8B subset per token. The family is therefore more useful to read by bottleneck than by parameter count alone. 1
What keeps the family coherent is the shared attention backbone. Across the lineup, local sliding-window attention is interleaved with full global attention, and the final layer is always global. The edge models use 512-token sliding windows and 128K context, while 31B and 26B-A4B move to 1024-token windows and 256K context. That matters because it makes long context an architectural choice rather than just a larger context-length setting.
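To make that interleaving concrete, here is a minimal sketch in Python. It assumes a repeating pattern in which every sixth layer is global (the ratio shown for the 31B later in this post) with the final layer forced global; the actual schedule and window sizes come from each model's configuration.

```python
# Minimal sketch of the interleaved attention schedule, assuming a repeating
# 5-local : 1-global pattern (every sixth layer global) with the final layer
# forced to global attention. The real schedule and window sizes are defined
# by each model's configuration.

def attention_schedule(num_layers: int, global_every: int = 6) -> list:
    """Return 'local' or 'global' for each decoder layer."""
    return [
        "global" if ((i + 1) % global_every == 0 or i == num_layers - 1) else "local"
        for i in range(num_layers)
    ]

def visible_history(kind: str, position: int, sliding_window: int) -> int:
    """How many past tokens a layer of this kind can attend to at `position`."""
    if kind == "global":
        return position + 1                       # full causal history
    return min(position + 1, sliding_window)      # local look-back band only

# A 31B-like stack: 60 layers, 1024-token windows, 256K context.
schedule = attention_schedule(60)
print(schedule[:6])                                # five local layers, then a global one
print(visible_history("local", 200_000, 1024))     # 1024
print(visible_history("global", 200_000, 1024))    # 200001
```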
The expensive part of the stack is also where the main optimizations are concentrated. The global layers use unified Keys and Values and apply proportional RoPE, while the cheaper local layers keep the familiar standard RoPE regime. The point is not that every layer sees the whole sequence all the time; it is that the model restores global communication often enough to make the large window operationally meaningful. 1
The smallest dense models are the most architecturally distinctive. E2B is listed as 2.3B effective parameters but 5.1B with embeddings, and E4B as 4.5B effective but 8B with embeddings. The difference comes from Per-Layer Embeddings, which give each decoder layer its own small embedding for every token instead of forcing a compact model to preserve all linguistic detail through one bottom-layer embedding alone. In the visible graph that extra path shows up as layer-specific text embeddings feeding a Per-layer projection that keeps token-specific text structure available deeper in the stack. 2
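A minimal sketch of that extra path, assuming each decoder layer owns a small per-token embedding table whose output is projected to the hidden size and added to that layer's input; the dimensions, the stand-in transformer block, and the exact injection point are illustrative rather than the official implementation.

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """Small, layer-specific token embedding followed by a per-layer projection."""
    def __init__(self, vocab_size: int, ple_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, ple_dim)            # compact per-layer table
        self.proj = nn.Linear(ple_dim, hidden_dim, bias=False)    # "Per-layer projection"

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(token_ids))

class DecoderLayerWithPLE(nn.Module):
    """Stand-in decoder layer that re-injects token-specific text structure."""
    def __init__(self, vocab_size: int, ple_dim: int, hidden_dim: int):
        super().__init__()
        self.ple = PerLayerEmbedding(vocab_size, ple_dim, hidden_dim)
        self.block = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # The layer sees both the running hidden state and a fresh, layer-specific
        # view of the original tokens, instead of relying on the bottom embedding alone.
        return self.block(hidden + self.ple(token_ids))

layer = DecoderLayerWithPLE(vocab_size=256_000, ple_dim=64, hidden_dim=1024)
ids = torch.randint(0, 256_000, (1, 16))        # (batch, seq)
hidden = torch.randn(1, 16, 1024)               # (batch, seq, hidden)
out = layer(hidden, ids)                        # (1, 16, 1024)
```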
That design choice matters because these are also the most ambitious multimodal members at their size. All models accept image input, but only E2B and E4B add native audio, pairing roughly 150M-parameter vision encoders with roughly 300M-parameter audio encoders. In a compact multimodal decoder, reserved image and audio positions can easily erode language precision. Giving later layers a direct token-specific text signal is a clean way to preserve more of that structure. 1
The vision path is also more flexible than a fixed square-image pipeline. The VisionEncoder preserves natural aspect ratio, uses a 2D positional scheme so height and width are represented separately, and exposes soft visual-token budgets of 70, 140, 280, 560, and 1120 tokens. At masked_scatter, projected image or audio features overwrite reserved placeholder positions in the language-side sequence. That turns visual detail into an explicit latency-quality knob: lower budgets make sense for captioning or video frames, while higher budgets are better suited to OCR, document parsing, and small text. After that replacement everything still goes through the same Decoder cycle. 1
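The masked_scatter step is simple enough to sketch directly. The snippet below assumes the text sequence already contains reserved placeholder positions and that the encoder output has been projected to the language model's hidden size; the shapes and the tiny 4-token budget are illustrative.

```python
import torch

hidden_dim = 1024
seq = torch.randn(1, 12, hidden_dim)             # language-side embeddings, placeholders included
image_feats = torch.randn(1, 4, hidden_dim)      # projected visual tokens (tiny budget for the example)

# Placeholder mask: True wherever the sequence holds a reserved image position.
is_image_token = torch.zeros(1, 12, dtype=torch.bool)
is_image_token[0, 3:7] = True                    # four reserved slots

# Overwrite the reserved positions with the projected features; text positions are untouched.
merged = seq.masked_scatter(is_image_token.unsqueeze(-1), image_feats)

assert merged.shape == seq.shape
# From here the merged sequence goes through the same Decoder cycle as text-only input.
```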
The 31B is the cleanest dense expression of the recipe. It has 30.7B parameters, 60 layers, 1024-token sliding windows, 256K context, and a much larger ~550M vision encoder. There is no routing trick and no per-layer embedding trick here; the point is always-on capacity for long documents, repositories, codebases, and large multimodal contexts where dense quality matters more than the cheapest possible token. 1
The upper grid magnifies the causal look-back band so the layer schedule stays legible. The ratio strip keeps the real scale visible: most 31B layers only read a 1,024-token local history, and every sixth layer is the expensive causal full pass that reconnects the entire 256K (262,144-token) context.
The deployment implication is straightforward. Loading the weights alone is about 58.3 GB in BF16 or 17.4 GB in Q4_0, before runtime overhead and KV cache. So 31B can be made local in quantized form, but its natural home is still a serious workstation or server GPU when long-context and dense-model headroom are the priority. 2
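A rough way to sanity-check those residency numbers is bits per weight times parameter count. The sketch below assumes ~16 bits/param for BF16 and ~4.5 bits/param for Q4_0 (4-bit values plus per-block scales); real checkpoint files keep some tensors at other precisions, so the published figures land a few GB away from this estimate.

```python
# Back-of-the-envelope weight-residency estimate, excluding runtime overhead
# and KV cache. Bits-per-weight values are approximations: BF16 is 16 bits,
# Q4_0 is roughly 4.5 bits once per-block scales are counted.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("31B", 30.7), ("26B-A4B", 25.2)]:
    print(f"{name}: BF16 ~ {weight_gb(params, 16):.1f} GB, Q4_0 ~ {weight_gb(params, 4.5):.1f} GB")
# 31B:     BF16 ~ 61.4 GB, Q4_0 ~ 17.3 GB
# 26B-A4B: BF16 ~ 50.4 GB, Q4_0 ~ 14.2 GB
```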
Position handling follows the same logic. Rotary embedding remains the positional backbone. RoPE is the mechanism that injects position into attention by rotating the query and key vectors with a position-dependent phase, so token order is represented inside the attention computation itself. Gemma 4 does not use one RoPE regime everywhere, however. Sliding-attention layers are annotated with standard RoPE, while full-attention layers are annotated with proportional RoPE. Gemma 4's proportional variant changes the rotation schedule for the long-range layers by using a much larger base period and rotating only part of the attention head dimension. The long-range layers therefore age more gracefully as sequence length grows, so the periodic full-attention passes remain useful even when the sequence is very long. The point is clear: long context is treated as an inference-systems problem as much as a modeling problem.
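To make the two regimes concrete, here is a minimal single-head RoPE sketch. The base periods (10,000 vs 1,000,000) and the 50% rotated fraction are placeholders rather than Gemma 4's actual configuration values; the sketch only shows the two knobs the paragraph describes, a larger base period and a partial rotation of the head dimension.

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor,
                base: float, rotate_frac: float = 1.0) -> torch.Tensor:
    """Apply RoPE to the first `rotate_frac` of the head dim with base period `base`."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotate_frac) // 2 * 2         # even number of rotated channels
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    angles = positions[:, None].float() * inv_freq[None, :]        # (seq, rot_dim // 2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat([rotated, x_pass], dim=-1)             # un-rotated channels pass through

q = torch.randn(4096, 128)          # (seq, head_dim) for a single head
pos = torch.arange(4096)

# Local layers: standard RoPE, short base period, full rotation.
q_local = rope_rotate(q, pos, base=10_000.0)
# Global layers: much larger base period and only part of the head dim rotated,
# so long-range phase differences grow more slowly as the sequence gets long.
q_global = rope_rotate(q, pos, base=1_000_000.0, rotate_frac=0.5)
```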
The 26B-A4B asks a different question. It has 25.2B total parameters, 3.8B active parameters, 30 layers, 1024-token sliding windows, 256K context, and an expert layout of 8 active experts, 128 total experts, and 1 shared expert. Instead of sending every token through the same feed-forward path, it uses a Router to decide which Experts handle each token. The model therefore exposes more conditional capacity only where the token needs it. 1
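A minimal sketch of that routing pattern, assuming top-8 gating over 128 routed experts plus one always-on shared expert, with softmax-normalized gates over the selected experts; load balancing, capacity limits, and the exact gating formula used by the model are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Routed feed-forward block: top-k of n_experts per token, plus a shared expert."""
    def __init__(self, hidden: int, ffn: int, n_experts: int = 128, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)

        def make_expert() -> nn.Sequential:
            return nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))

        self.experts = nn.ModuleList([make_expert() for _ in range(n_experts)])
        self.shared = make_expert()                        # always-on shared expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, hidden)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick the 8 best experts per token
        weights = F.softmax(weights, dim=-1)               # normalize gates over the selected experts

        rows = []
        for t in range(x.shape[0]):                        # naive per-token loop, for clarity only
            routed = sum(w * self.experts[int(e)](x[t]) for w, e in zip(weights[t], idx[t]))
            rows.append(routed)
        return self.shared(x) + torch.stack(rows)

moe = MoEFeedForward(hidden=256, ffn=1024)
y = moe(torch.randn(4, 256))      # (4, 256): only 8 of the 128 routed experts ran per token
```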
That makes the 31B complementary to it rather than redundant. The dense model buys always-on depth. The MoE model buys conditional feed-forward capacity. The savings, however, are in active compute rather than residency: all 25.2B parameters still need to be loaded for routing, which is why the Q4_0 load footprint is still about 15.6 GB. That is close enough to 31B's 17.4 GB that 26B-A4B makes the most sense on a gaming GPU or workstation, where you want much more headroom than E4B without paying for a dense 31B-style forward pass on every token. 2
The edge speed story is the clearest. Published device measurements put E2B at 52 GPU decode tokens per second on a Galaxy S26 Ultra, 57 on an iPhone 17 Pro, and 160 on a MacBook Pro M4 GPU. E4B lands at 22, 25, and 101 tokens per second on those same device classes. That is fast enough to make the E line feel genuinely interactive on phones and laptops rather than merely able to run locally. 3
Those speeds do come with a ceiling. On MMLU Pro and GPQA Diamond, E2B scores 60.0 and 43.4, E4B scores 69.4 and 58.6, 26B-A4B scores 82.6 and 82.3, and 31B scores 85.2 and 84.3. But the smaller models are still more capable than their size class would suggest: E4B already edges the earlier 27B dense baseline on MMLU Pro and outperforms it decisively on GPQA Diamond. That is a strong sign that the compact end of the family is not just cheaper, but genuinely better positioned on the quality curve than the previous generation. 1
The more important comparison is the same-latency one. In a recent controlled benchmark, E4B reached 0.675 weighted accuracy at 5.458 seconds mean latency, while Qwen3-8B reached 0.322 at 5.041 seconds. E2B reached 0.493 at 4.913 seconds, versus Phi-4-reasoning at 0.427 and 4.857 seconds. That is the real win for the small models: they still give up headroom to the larger members, but they appear to deliver more accuracy at roughly the same latency as nearby alternatives. 4
A compact deployment snapshot based on those published figures looks like this.
| Model | Best-fit hardware | Published speed / cost signal | Quality signal | What you are buying |
|---|---|---|---|---|
| E2B | Phone, browser, Raspberry Pi, lightweight laptop | 52–57 GPU decode tok/s on flagship phones; 160 tok/s on MacBook Pro M4 GPU | MMLU Pro 60.0; GPQA 43.4; weighted accuracy 0.493 at 4.913s | Fastest local multimodal member, with the lowest reasoning ceiling. |
| E4B | Phone, laptop edge, offline assistant | 22–25 GPU decode tok/s on flagship phones; 101 tok/s on MacBook Pro M4 GPU | MMLU Pro 69.4; GPQA 58.6; weighted accuracy 0.675 at 5.458s | The strongest small-model operating point. |
| 26B-A4B | Gaming GPU, RTX-class workstation, high-end local box | 3.8B active params per token; ~15.6 GB Q4_0 load | MMLU Pro 82.6; GPQA 82.3 | Much more conditional capacity without dense 31B compute every pass. |
| 31B | Workstation or server GPU | 30.7B dense model; ~17.4 GB Q4_0 load or ~58.3 GB BF16 load | MMLU Pro 85.2; GPQA 84.3 | Maximum dense quality and long-context headroom. |
Values are a deployment snapshot rather than a single apples-to-apples benchmark across one toolchain and one hardware stack.
What makes the family interesting is that its members are not just scaled copies of one another. The E2B and E4B edge models add per-layer text scaffolding and audio because compact multimodal decoders need extra help preserving language precision under edge constraints. 31B stays dense because long-context quality benefits from always-on capacity. 26B-A4B uses routing because a workstation can often afford the residency of a larger model even when it cannot afford to spend dense 30B-class compute on every token. 5
That gives practitioners a cleaner selection rule than parameter count alone. Choose E2B or E4B when privacy, battery, latency, and offline multimodality are the constraint. Choose 26B-A4B when you have a gaming GPU or workstation and want a better capacity-per-token bargain. Choose 31B when dense quality and long-context reliability are worth building heavier hardware around. The family's real contribution is that each member moves a different bottleneck while still sharing the same architectural center. 1
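If it helps to have that selection rule in executable form, here is a small heuristic that simply restates the rule above; the constraint names and VRAM thresholds are illustrative, not an official sizing guide.

```python
# A tiny heuristic restating the selection rule above. Constraint names and the
# implied hardware tiers are illustrative; they just codify the prose.

def pick_gemma4_member(on_device: bool, needs_native_audio: bool, vram_gb: float) -> str:
    """Rough selection heuristic; thresholds are illustrative, not official guidance."""
    if on_device or needs_native_audio:
        # Privacy, battery, latency, offline multimodality, or native audio: the E line.
        return "E2B" if vram_gb < 6 else "E4B"
    if vram_gb >= 24:
        # Dense quality and long-context reliability are worth the heavier footprint.
        return "31B"
    if vram_gb >= 16:
        # Gaming GPU or workstation: better capacity-per-token bargain than dense 31B.
        return "26B-A4B"
    return "E4B"

print(pick_gemma4_member(on_device=True, needs_native_audio=True, vram_gb=8))     # E4B
print(pick_gemma4_member(on_device=False, needs_native_audio=False, vram_gb=16))  # 26B-A4B
```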