Unlocking the True Power of LLMs: Beyond Text Generation with Custom Heads

In the rapidly evolving landscape of artificial intelligence, a provocative statement has gained traction: "If your LLM model is used to generate text, you are not using it correctly." At first glance, this might seem absurd. After all, large language models like those powering ChatGPT or Llama have revolutionized content creation, coding assistance, and conversational AI. But the core insight holds up: text generation, while versatile, often scratches only the surface of what these models can achieve. The real magic unfolds when you swap out the standard language modeling (LM) head for custom ones, transforming LLMs into specialized engines for classification, embeddings, reward modeling, and more. These adaptations leverage the frozen backbone of pretrained LLMs while adding lightweight, task-specific layers, enabling efficient, high-impact applications without full retraining.

Custom heads are essentially modular output layers attached to the LLM's hidden states, allowing you to repurpose billion-parameter models for non-generative tasks. They add minimal parameters, often just thousands to millions, while inheriting the model's deep understanding of language. This approach is especially active in 2025, as open-source ecosystems like Hugging Face fill up with tools for head attachment, from linear probes to mixture-of-experts (MoE) setups. Drawing on recent advancements in models like DeepSeek-R1 and Snowflake Arctic Embed, we'll dive into practical uses, complete with pseudo-code and real-world examples. Get ready to supercharge your LLMs for tasks that demand precision, not prose.

Reward Modeling: Aligning AI with Human (or AI) Preferences

One of the most transformative uses of custom heads is reward modeling, which is crucial for reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF). Instead of generating text, the LLM evaluates outputs, assigning each a scalar score for helpfulness, harmlessness, or honesty. This powers alignment in models like Starling-7B, where a simple linear head outputs a single reward value.

Consider Starling-RM-7B-alpha, a Llama2-based reward model that takes prompt-response pairs and assigns higher scores to more helpful, less harmful replies. Trained on datasets like Nectar (derived from GPT-4 preferences), it uses a Bradley-Terry loss to rank preferences. In practice, this head enables scalable RLAIF, bypassing costly human annotations, which is vital for enterprise fine-tuning.

Here's pseudo-code for implementing a reward scalar head:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_llm):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

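    def forward(self, inputs):
        # base_llm is assumed to be a bare backbone that returns last_hidden_state
        outputs = self.base_llm(**inputs)
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        # Decoder-only LLMs have no CLS token: under causal attention only the
        # last real token has seen the whole sequence, so pool there.
        # Assumes right padding and an attention_mask in inputs.
        last_idx = inputs["attention_mask"].sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        reward = self.reward_head(pooled)
        return reward.squeeze(-1)  # One scalar per sequence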

Training uses pairwise preference data: for prompts from sources like Anthropic's HH-RLHF, the loss pushes the reward of the chosen response above that of the rejected one. Recent papers like RM-R1 (arXiv:2505.02387v1) elevate this by infusing reasoning traces, reporting SOTA on RewardBench with up to 13.8% gains over GPT-4o. In applications, integrate this into RLHF pipelines for custom chatbots, for example to ensure medical advice prioritizes accuracy over verbosity. References: Starling-RM-7B-alpha on Hugging Face.
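
The pairwise objective described above can be made concrete with a few lines of arithmetic. This is a minimal sketch in plain Python, assuming the reward head has already produced one scalar per response; the function name `bradley_terry_loss` and the sample scores are illustrative and not taken from Starling's training code. In PyTorch the same quantity is commonly written as `-F.logsigmoid(chosen - rejected).mean()`.

```python
import math

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    losses = []
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        margin = r_c - r_r
        # -log(sigmoid(m)) == log(1 + exp(-m)); branch on sign for numerical stability
        if margin >= 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    return sum(losses) / len(losses)

# Hypothetical reward-head outputs for three (chosen, rejected) pairs
chosen = [2.0, 0.5, 1.2]
rejected = [0.1, 0.7, -0.3]
loss = bradley_terry_loss(chosen, rejected)
```

The loss shrinks as the chosen response's score pulls ahead of the rejected one, and equals log 2 when the two scores are tied.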

Classification Tasks: From Toxicity Detection to Spam Filtering

Move beyond chit-chat: attach a classification head for lightning-fast decisions on sentiment, toxicity, or factuality. A linear layer projects the pooled hidden state to class logits, adding negligible VRAM (<1 MB at inference for 4096-dim inputs).
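
To see why the memory cost is negligible, it helps to count the parameters. A quick back-of-the-envelope sketch, assuming a single linear layer with a bias, six output classes, and fp16 weights; the helper name `linear_head_bytes` and those settings are illustrative choices, not figures from any particular model.

```python
def linear_head_bytes(hidden_size, num_outputs, bytes_per_param=2):
    """Memory for the weights and bias of one nn.Linear-style head."""
    num_params = hidden_size * num_outputs + num_outputs  # weight matrix + bias
    return num_params * bytes_per_param

# Assumed setup: 4096-dim hidden states, 6 labels, fp16 weights (2 bytes each)
size = linear_head_bytes(4096, 6)
print(f"{size} bytes = {size / 1024:.1f} KiB")
```

Under these assumptions the head weighs in at roughly 48 KiB, far below the 1 MB mark even if the weights are kept in fp32.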

Widely deployed in 2025, these heads shine in content moderation. For instance, models trained on the Jigsaw Toxic Comment Classification data detect categories such as toxic, obscene, or identity_hate, and similar Kaggle datasets cover YouTube toxic comments. Fine-tune on UCI SMS Spam or SST-2 for spam or sentiment detection; LLMs like Llama-3-8B reach F1 scores >95% on such tasks, including via in-context learning (ICL).

Pseudo-code example:

import torch
import torch.nn as nn

class LLMWithClassificationHead(nn.Module):
    def __init__(self, base_llm, num_classes):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_classes)

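    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        if getattr(outputs, "pooler_output", None) is not None:
            pooled = outputs.pooler_output  # Encoder models (e.g. BERT) pool the CLS token
        else:
            # Decoder-only models: use the last non-padding token (assumes right padding)
            hidden = outputs.last_hidden_state
            last_idx = inputs["attention_mask"].sum(dim=1) - 1
            pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        logits = self.classifier(pooled)
        return logits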
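
Turning those logits into decisions depends on the task. The Jigsaw labels are not mutually exclusive, so a comment can be both toxic and obscene; a reasonable sketch is therefore an independent sigmoid per label rather than a softmax across labels. The 0.5 threshold and the example logits below are assumptions for illustration.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Return every label whose independent probability clears the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]

# Hypothetical logits for one comment, in LABELS order
example_logits = [3.1, -2.0, 1.4, -4.2, 0.3, -1.5]
flagged = predict_labels(example_logits)
```

For single-label tasks such as SST-2 sentiment, a softmax followed by argmax over the logits is the usual choice instead.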

In real-world use, platforms like Mastodon can leverage ICL-personalized heads for user-specific toxicity blocking: adding a single example to the prompt adapts the model without retraining, hitting 87.5% F1 on wild data (arXiv:2511.05532v1). This empowers ethical AI, from forum moderation to personalized news feeds. See: Classification of Intent in Moderating Online Discussions.

Embeddings and Retrieval: Fueling Semantic Search

Embeddings turn LLMs into vector powerhouses for retrieval, reranking, and duplicate detection. Hidden states are first pooled (mean or CLS token), then an MLP head (e.g., 4096 → 1024 dims) projects the result, consuming 30-80 MB VRAM.

Snowflake's Arctic Embed L v2.0 exemplifies this: a 568M-param model optimized for 8192-token contexts, excelling in multilingual retrieval with Matryoshka Representation Learning for flexible dims (256-1024). Prepend "query: " to search queries (not to documents) for SOTA on BEIR benchmarks.
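
The "flexible dims" idea is easy to illustrate: with Matryoshka-style training, the leading components of a vector are meant to remain useful on their own, so a shorter embedding is obtained by truncating and renormalizing. The sketch below uses toy 8-dim vectors in place of real 1024-dim outputs; the helper names and the numbers are illustrative assumptions, and it assumes the truncated vector is non-zero.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head]

def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dim vectors standing in for full-size model outputs
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
doc = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

full = cosine(truncate_and_normalize(query, 8), truncate_and_normalize(doc, 8))
short = cosine(truncate_and_normalize(query, 4), truncate_and_normalize(doc, 4))
```

In this toy case the similarity barely moves after halving the dimension, which is the property that lets an index trade a little accuracy for much smaller storage.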

Pseudo-code:

import torch
import torch.nn as nn

class EmbeddingModel(nn.Module):
    def __init__(self, base_llm, embed_dim):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.embed_head = nn.Linear(hidden_size, embed_dim) if hidden_size != embed_dim else None

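    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden = outputs.last_hidden_state
        # Mean pooling over real tokens only; padding positions are masked out
        mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        if self.embed_head is not None:
            embedding = self.embed_head(pooled)
        else:
            embedding = pooled
        return embedding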

For contrastive setups like Siamese networks (2x 4-20M params), train on pairs for reranking. Contrastive Retrieval Heads (CoRe) boost BEIR by aggregating <1% of attention heads (arXiv:2510.02219v1). This drives RAG systems: embed docs, retrieve top-k, and verify facts, slashing hallucinations in legal or medical QA. Explore: Snowflake Arctic Embed on Hugging Face.
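
Training on pairs usually means some form of in-batch contrastive loss. Here is a minimal sketch of one common variant (InfoNCE) in plain Python, assuming unit-length embeddings and that document i is the positive for query i; the temperature of 0.05 and the toy vectors are assumptions, and this is not the CoRe method from the cited paper.

```python
import math

def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """In-batch contrastive loss: doc i is the positive for query i,
    every other doc in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        sims = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Cross-entropy with the matching document as target (log-sum-exp trick)
        max_sim = max(sims)
        log_denominator = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims))
        total += log_denominator - sims[i]
    return total / len(query_vecs)

# Toy unit-length embeddings
queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[1.0, 0.0], [0.0, 1.0]]
docs_swapped = [[0.0, 1.0], [1.0, 0.0]]

good = info_nce(queries, docs_aligned)   # each query matches its own doc
bad = info_nce(queries, docs_swapped)    # each query matches the wrong doc
```

The loss is near zero when every query sits closest to its own document and grows large when the pairing is wrong, which is the signal that pulls matching pairs together during training.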

Multi-Task Mastery with MoE Heads

For ultra-multi-tasking, MoE heads route inputs to specialized experts, activating subsets (e.g., 8 of 256) for 100+ tools like in Gorilla-1B. This scales without exploding params (100-300M total, 400MB-1GB VRAM).

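OLMoE, an open model with 7B total and 1B active parameters pretrained on 5.1T tokens, outperforms 7B dense LLMs on MMLU while running on edge devices. The pseudo-code below gates via softmax and blends every expert's output (a dense mixture; sparse MoE layers keep only the top-k experts per input):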

import torch
import torch.nn as nn

class MoEHead(nn.Module):
    def __init__(self, input_dim, num_experts, expert_dim):
        super().__init__()
        self.gate = nn.Linear(input_dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(input_dim, expert_dim) for _ in range(num_experts)])

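    def forward(self, x):
        # Dense soft mixture: every expert runs, weighted by its gate probability
        gate_scores = torch.softmax(self.gate(x), dim=-1)  # (..., num_experts)
        # Stack expert outputs to shape (..., num_experts, expert_dim)
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=-2)
        output = (gate_scores.unsqueeze(-1) * expert_outputs).sum(dim=-2)
        return output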

In 2025 deployments, MoE heads enable hybrid tasks: route to NER for one token, classification for another. DeepSeek-V3's 256-expert MoE accelerates this 2x over dense, ideal for on-device RAG. Reference: OLMoE on OpenReview.
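
Activating only a subset of experts, as in the "8 of 256" example, requires top-k routing, which a plain softmax blend over all experts does not do. The following is a minimal sketch of that idea, assuming 2-D inputs of shape (batch, input_dim) and simple linear experts; the class name `TopKMoEHead` and the default `k=2` are illustrative, and real MoE layers add load-balancing losses and batched dispatch that are omitted here.

```python
import torch
import torch.nn as nn

class TopKMoEHead(nn.Module):
    def __init__(self, input_dim, num_experts, expert_dim, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(input_dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(input_dim, expert_dim) for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (batch, input_dim)
        gate_logits = self.gate(x)
        topk_logits, topk_idx = gate_logits.topk(self.k, dim=-1)
        # Renormalize over the selected experts only
        weights = torch.softmax(topk_logits, dim=-1)  # (batch, k)
        out_dim = self.experts[0].out_features
        output = x.new_zeros(x.size(0), out_dim)
        for e, expert in enumerate(self.experts):
            # Rows (and their top-k slot) that routed to expert e
            rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue  # expert e is not run at all for this batch
            output[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return output
```

Each input only pays for `k` expert forward passes, so compute stays roughly constant as the number of experts grows. Setting `k` equal to the number of experts recovers the dense softmax mixture.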

Sequence Tagging and PII Redaction: Precision Extraction

Sequence tagging heads (e.g., CRF or per-token linear, <50 MB) label each token for NER, slot filling, or PII detection. Using the IOB (inside-outside-beginning) scheme on datasets like CoNLL-2003 or Snips, they extract entities like names or dates.
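
Per-token tags still have to be merged into entities. A small sketch of IOB decoding in plain Python, assuming word-level tokens and tags of the form `B-TYPE`, `I-TYPE`, or `O`; the function name and the example sentence are illustrative.

```python
def iob_to_spans(tokens, tags):
    """Group IOB tags into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["John", "Smith", "flew", "to", "Paris", "on", "Monday"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O"]
print(iob_to_spans(tokens, tags))
```

For PII redaction, the resulting spans can then be replaced with placeholders such as `[PER]` before the text is stored or passed downstream.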

Private AI's NER endpoint detects overlapping PII (e.g., "John Smith" as full name plus components) via token classification. Fine-tune BERT for 96% accuracy on invoices or dialogues.

Pseudo-code:

import torch
import torch.nn as nn

class SequenceTaggingModel(nn.Module):
    def __init__(self, base_llm, num_tags):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.tagger = nn.Linear(hidden_size, num_tags)

    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden_states = outputs.last_hidden_state
        logits = self.tagger(hidden_states)  # Per token
        return logits

In practice: sanitize financial transcripts with Fin-ExBERT (arXiv:2509.23259v1), which achieves >84% F1 on CreditCall12H for slot filling in multi-turn chats. This safeguards privacy in AI applications. See: Token Classification in Hugging Face LLM Course.

Extractive QA and Span Extraction

For pinpointing answers in contexts, span extraction heads predict start and end logits for every token (two linear layers over 4096-dim hidden states, <10 MB), as in SQuAD fine-tuning. BERT-based models hit 88.67 F1 on the SQuAD validation set.

Fin-ExBERT extracts intent-relevant sentences from dialogues, outperforming LLMs on FinQA-10K (4.84/5 human score). Pseudo-code:

import torch
import torch.nn as nn

class SpanExtractionModel(nn.Module):
    def __init__(self, base_llm):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.start_head = nn.Linear(hidden_size, 1)
        self.end_head = nn.Linear(hidden_size, 1)

    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden_states = outputs.last_hidden_state
        start_logits = self.start_head(hidden_states).squeeze(-1)
        end_logits = self.end_head(hidden_states).squeeze(-1)
        return start_logits, end_logits

Apply to legal docs: Retrieve spans for evidence, boosting RAG accuracy. Reference: Question Answering in Hugging Face LLM Course.

Confidence and Uncertainty: Building Trustworthy AI

Regression heads with uncertainty (2 outputs, negligible VRAM) calibrate confidence, outputting scalars plus variance for tasks like fact-checking.

FineCE (arXiv:2508.12040v1) uses Monte Carlo sampling for per-token scores, improving AUROC by 39.5% on GSM8K. Pseudo-code mirrors reward but with dual linear layers.
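
One way to read "dual linear layers" is a head that predicts a mean and a log-variance from the pooled hidden state, trained with a Gaussian negative log-likelihood. The sketch below follows that reading; it is a generic heteroscedastic regression head, not FineCE's Monte Carlo procedure, and the names `UncertaintyHead` and `gaussian_nll` are illustrative.

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.mean_head = nn.Linear(hidden_size, 1)
        # Predicting log-variance keeps the variance positive after exp()
        self.log_var_head = nn.Linear(hidden_size, 1)

    def forward(self, pooled):  # pooled: (batch, hidden_size)
        mean = self.mean_head(pooled).squeeze(-1)
        log_var = self.log_var_head(pooled).squeeze(-1)
        return mean, log_var

def gaussian_nll(mean, log_var, target):
    # Negative log-likelihood of target under N(mean, exp(log_var)),
    # with the constant term dropped
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()
```

At inference, `log_var.exp()` gives a per-example variance: a large value signals that the score should not be trusted, which is the hook for rejecting or escalating an answer.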

This lets RAG pipelines detect hallucinations early, e.g., by rejecting low-confidence medical claims.

Tool Calling and Verification: Agents Without the Fluff

Tool-calling heads (1-5 MB for 50-200 tools) output parallel logits for function selection, enabling ReAct-style inference in one pass. DeepSeek-R1 excels here, supporting JSON schemas for weather APIs or calendars.

Verification heads (8-20M params) for RAG fact-checking use entailment logits (entail/contradict/neutral), as in Atlas-1B.
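
A verification head's three logits become useful once they are turned into a gate. A minimal sketch in plain Python, assuming the label order entail/contradict/neutral and an acceptance threshold of 0.8 on the entailment probability; the threshold, function names, and example logits are assumptions for illustration.

```python
import math

NLI_LABELS = ["entail", "contradict", "neutral"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_supported(logits, min_entail_prob=0.8):
    """Accept a claim only when the entailment probability clears the threshold."""
    probs = dict(zip(NLI_LABELS, softmax(logits)))
    return probs["entail"] >= min_entail_prob, probs

# Hypothetical logits for two (evidence, claim) pairs
supported_a, probs_a = is_supported([4.0, -1.0, 0.5])   # evidence backs the claim
supported_b, probs_b = is_supported([0.2, 2.5, 0.1])    # evidence contradicts it
```

In a RAG pipeline, claims that fail the gate can be dropped, or sent back for another retrieval round, before anything reaches the user.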

Pseudo-code for tool calling:

import torch
import torch.nn as nn

class ToolCallingModel(nn.Module):
    def __init__(self, base_llm, num_tools):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.tool_head = nn.Linear(hidden_size, num_tools)

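    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden = outputs.last_hidden_state
        # Decoder-only models: pool the last non-padding token (assumes right padding),
        # since the first token cannot attend to the rest of the prompt
        last_idx = inputs["attention_mask"].sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        tool_logits = self.tool_head(pooled)  # One logit per tool
        return tool_logits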

In 2025, this powers agentic workflows: Verify retrieved facts before generation, as in Evidence-backed Fact Checking (ACL Anthology). See: DeepSeek API Docs.

The Future: Custom Heads as AGI Building Blocks

Custom heads aren't just tweaks; they're a gateway to modular AGI. By ditching the LM head for these lean alternatives, you unlock efficiency (e.g., 80% param cuts via TARDIS, arXiv:2501.10054v1) and specialization, from personalized moderation to secure RAG. Libraries like transformer-heads (Reddit: r/LocalLLaMA) simplify attachment, while 2025 trends like MH-MoE (arXiv:2404.15045) promise even more scalability.

Experiment today: fine-tune on Hugging Face and integrate into vLLM for inference. The era of text-only LLMs is over, so embrace heads to wield true intelligence. For more on LLM architectures, check The Big LLM Architecture Comparison.