You trained on 2 trillion tokens, paid for eight H100s for three weeks, and the model still hallucinates on your domain. The problem probably was not the model. It was the data. Raw Common Crawl dumps contain near-duplicate web pages, spam, PII, low-quality ad copy, and benchmark test questions. Feed that to any model architecture and you get a model that memorises noise. NeMo Curator exists to fix that problem at GPU scale before you ever submit a training job.
What NeMo Curator Actually Is
NeMo Curator is an open-source Python framework — separate from the NeMo training framework — that builds repeatable, GPU-accelerated data pipelines. It is not a hosted service; you run it on your own cluster. The codebase lives at github.com/NVIDIA-NeMo/Curator and the docs at docs.nvidia.com/nemo/curator.
The current release as of mid-2026 is 26.04, which brought a simplified Resources API and upgraded the Ray runtime. Under the hood, the compute relies on three NVIDIA RAPIDS libraries: cuDF (GPU DataFrame operations replacing pandas), cuML (GPU machine learning including clustering for semantic dedup), and cuGraph (graph algorithms used in connected-components dedup). Dask coordinates the distributed execution across nodes and GPUs; Ray is the newer alternative runtime used for multi-modal pipelines (image, video, audio). For text-only pre-training data, Dask+RAPIDS is the production-tested path.
The Three Deduplication Layers
Deduplication is where most teams underinvest and where NeMo Curator has the clearest performance story. There are three distinct layers, and you generally want to run all three in sequence.
Exact Deduplication
The simplest layer: SHA-256 or MD5 hash every document, group identical hashes, keep one. RAPIDS cuDF runs these hashes GPU-parallel across partitions. For a 1-trillion-token corpus this pass takes minutes on a small GPU cluster. The yield is usually 5-10% reduction on well-scraped web data and up to 40% on datasets that aggregate syndicated content like news aggregators or mirror sites.
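The hash-and-group idea can be sketched on the CPU with the standard library. This is a minimal illustration, not NeMo Curator's implementation: the `exact_dedup` helper and the `id`/`text` record layout are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def exact_dedup(docs):
    """Keep the first document seen for each distinct MD5 digest of its text."""
    groups = defaultdict(list)
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        groups[digest].append(doc["id"])
    # One survivor per hash group: the first id encountered.
    keep = {ids[0] for ids in groups.values()}
    return [doc for doc in docs if doc["id"] in keep]


docs = [
    {"id": 0, "text": "GPU curation at scale."},
    {"id": 1, "text": "GPU curation at scale."},  # byte-identical to id 0
    {"id": 2, "text": "GPU curation at scale!"},  # differs by one character
]
unique_docs = exact_dedup(docs)
print([d["id"] for d in unique_docs])  # [0, 2]
```

Note that document 2 survives: a single changed character produces a different hash, which is exactly the gap the fuzzy layer is meant to close.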
Fuzzy Deduplication (MinHash + LSH)
Near-duplicates (the same article with a changed byline, a Wikipedia paragraph copied across 200 SEO pages) are the real pollution problem. NeMo Curator uses MinHash signatures with Locality Sensitive Hashing (LSH) buckets, then cuGraph connected-components to cluster near-identical documents. NVIDIA benchmarked this stage at deduplicating 1.1 trillion tokens from the RedPajama dataset in 1.8 hours using 64 A100 GPUs, a task that would take CPU-only tools several days. The advertised speedup versus CPU alternatives is 16x for this stage alone.
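To make the three steps concrete (MinHash signatures, LSH banding, connected components), here is a small pure-Python sketch. It is a toy stand-in for the GPU pipeline: the signature size, band layout, 5-character shingles, and the union-find used in place of cuGraph are all assumptions chosen for readability, and candidate pairs are merged without the Jaccard verification a production run would normally add.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64            # hypothetical signature length
BANDS = 16                 # 16 bands of 4 rows each
ROWS = NUM_HASHES // BANDS


def shingles(text, n=5):
    """Character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text):
    """One minimum per salted hash function over the document's shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature


def cluster_near_duplicates(texts):
    """Return one cluster label per document (label = root document index)."""
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # LSH: documents that agree on every row of any band share a bucket.
    buckets = defaultdict(list)
    for doc_id, text in enumerate(texts):
        sig = minhash(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)

    # Connected components: union every document in a bucket with the first.
    for ids in buckets.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])
    return [find(i) for i in range(len(texts))]


texts = [
    "NeMo Curator removes near-duplicate web pages before training a language model on the crawl.",
    "NeMo Curator removes near-duplicate web pages before training a language model on the crawl. Updated.",
    "Quarterly revenue grew twelve percent while operating costs stayed flat across regions.",
]
labels = cluster_near_duplicates(texts)
print(labels)
```

Documents with the same label belong to one near-duplicate cluster; keeping one document per label is the removal step. More rows per band makes matching stricter, more bands makes it more permissive.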
Semantic Deduplication
Two documents can be lexically different yet semantically identical: different paraphrases of the same Wikipedia fact appearing in training data will produce models that overfit to that fact. Semantic dedup generates embeddings with a crossfit-backed encoder, runs GPU-accelerated k-means clustering in cuML, then keeps one representative per cluster above a similarity threshold. It is computationally heavier than fuzzy dedup and is best applied after the other two passes have already reduced corpus size. Run it on the largest GPU cluster you have available; embedding generation is the bottleneck.
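The "one representative above a similarity threshold" step can be illustrated without a GPU. The sketch below assumes the embeddings and k-means cluster assignments already exist (here they are hand-written toy values), and the `semantic_dedup` function and its 0.95 threshold are hypothetical names and numbers, not NeMo Curator's API.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, cluster_ids, threshold=0.95):
    """Within each cluster, drop documents too similar to one already kept."""
    by_cluster = defaultdict(list)
    for doc_id, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(doc_id)

    kept = []
    for members in by_cluster.values():
        kept_here = []
        for doc_id in members:
            # Only compare inside the cluster, which is what keeps this tractable.
            if all(cosine(embeddings[doc_id], embeddings[k]) < threshold
                   for k in kept_here):
                kept_here.append(doc_id)
        kept.extend(kept_here)
    return sorted(kept)


embeddings = [
    [1.00, 0.00, 0.00],
    [0.99, 0.05, 0.00],  # paraphrase of document 0
    [0.00, 1.00, 0.00],  # same cluster, different meaning
    [0.00, 0.00, 1.00],
]
cluster_ids = [0, 0, 0, 1]  # assumed output of a prior k-means pass
kept = semantic_dedup(embeddings, cluster_ids)
print(kept)  # [0, 2, 3]
```

Clustering first means the pairwise comparison only runs inside each cluster, which is why the k-means pass matters for cost.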
Quality Filtering: Heuristics and Classifiers
Deduplication removes redundancy. Quality filtering removes the documents that survived dedup but are still not worth training on: ad-stuffed pages, auto-generated listicles, malformed HTML text, and garbled OCR output. NeMo Curator runs two quality-filter tiers in sequence.
Heuristic Filters
These are deterministic, fast, and run on every document. NeMo Curator ships 30+ built-in heuristics including: word count thresholds, mean word length, fraction of lines ending in punctuation, fraction of alphabetic characters, symbol-to-word ratio, bullet-point density (to catch bad list spam), stop-word presence, and n-gram repetition score (documents with the same sentence repeated 50 times are usually spam). All run as cuDF operations on GPU — meaning a single A100 can process hundreds of gigabytes of text per hour at this stage.
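A few of these heuristics are simple enough to write out by hand. The following is a CPU sketch for calibration experiments, not the built-in filters: the function names, the toy thresholds, and the choice of symbols counted in the symbol-to-word ratio are assumptions for illustration.

```python
def heuristic_checks(text, min_words=5, min_mean_len=3.0, max_mean_len=10.0,
                     min_alpha_frac=0.7, max_symbol_word_ratio=0.1):
    """Return a dict mapping each rule name to True (pass) or False (fail)."""
    words = text.split()
    n_words = len(words)
    mean_len = sum(len(w) for w in words) / n_words if n_words else 0.0

    visible = [c for c in text if not c.isspace()]
    alpha_frac = sum(c.isalpha() for c in visible) / len(visible) if visible else 0.0

    # Hypothetical symbol set; tune it for your own corpus.
    symbols = sum(text.count(s) for s in ("#", "...", "$"))
    symbol_ratio = symbols / n_words if n_words else 0.0

    return {
        "word_count": n_words >= min_words,
        "mean_word_length": min_mean_len <= mean_len <= max_mean_len,
        "alpha_fraction": alpha_frac >= min_alpha_frac,
        "symbol_to_word_ratio": symbol_ratio <= max_symbol_word_ratio,
    }


def keep_document(text, **thresholds):
    """A document survives only if every heuristic passes."""
    return all(heuristic_checks(text, **thresholds).values())


good = ("The scheduler assigns each partition to a GPU worker. "
        "Results are written back to shared storage.")
spam = "$$$ >>> !!! buy now $$$ >>> !!! buy now"

print(keep_document(good))     # True
print(heuristic_checks(spam))  # alpha_fraction and symbol_to_word_ratio fail
```

Returning the per-rule result rather than a single boolean makes it easy to see which rule is responsible when a threshold turns out to be too aggressive.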
Classifier-Based Filtering
After heuristics, you apply ML classifiers that score documents on subjective quality axes: is this text written at a native-speaker level, is it factually coherent, is it domain-appropriate? NeMo Curator ships a distributed classifier module using fastText models (n-gram bag-of-words classifiers trained on a high-quality reference corpus vs. raw Common Crawl). You can plug in your own domain classifier — just drop in a fastText or BERT-family model and use the DistributedDataClassifier wrapper, which handles sharding across GPUs automatically via Dask. For highest recall, teams often train a domain quality classifier on a small hand-labelled seed set of 1,000 to 5,000 documents.
Language Identification and PII Redaction
Language Identification
NeMo Curator uses the fastText lid.176.bin model (roughly 130MB) for language identification. It returns a language code and confidence score per document. You then filter to keep only the languages relevant to your training target, or bucket documents into per-language streams for multilingual training. One critical operational note: the fastText model file must be accessible from all Dask worker nodes, so you either pre-copy it to shared storage or embed it in the worker container image.
PII Redaction
PII in training data is both a legal liability and a model behavior risk — models trained on raw web text will regurgitate email addresses, phone numbers, and real names verbatim. NeMo Curator ships a GPU-accelerated PII detection and redaction module that uses named entity recognition (NER) models to identify and either remove or replace PII entities with typed placeholders (EMAIL_ADDRESS, PHONE_NUMBER, PERSON_NAME). The GPU acceleration here matters: NER inference is transformer-based and would be prohibitively slow at trillion-token scale on CPU.
One design decision teams often miss: redact before dedup or after? Redact after. Exact and fuzzy dedup rely on document content being consistent; redacting first can break hash-based dedup by replacing PII strings differently on near-duplicates that otherwise would have matched. Run dedup on raw content, then redact the surviving corpus.
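Replacement with typed placeholders looks like the sketch below. It deliberately swaps the NER model for two regular expressions so it runs anywhere, which means it only covers pattern-shaped PII (emails, phone numbers) and not names; the patterns and the angle-bracket placeholder format are assumptions for this example.

```python
import re

# Deliberately simple patterns; real-world PII needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace matched PII spans with typed placeholders."""
    text = EMAIL.sub("<EMAIL_ADDRESS>", text)
    return PHONE.sub("<PHONE_NUMBER>", text)


sample = "Contact jane.doe@example.com or call +1 415-555-0100 for details."
redacted = redact(sample)
print(redacted)
# Contact <EMAIL_ADDRESS> or call <PHONE_NUMBER> for details.
```

Typed placeholders preserve the sentence structure, so the model still learns that an email address belongs in that position without seeing a real one.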
Curation Stages at a Glance
| Stage | Purpose | GPU Tooling | Typical Retention |
|---|---|---|---|
| Download + Extract | Fetch WARC/JSONL; strip HTML to text | cuDF text ops | 100% (no filter) |
| Language ID | Keep target languages, discard rest | fastText (CPU), cuDF partitioning | Varies by language mix |
| Exact Dedup | Remove identical documents | cuDF hash + groupby | 90-95% |
| Fuzzy Dedup | Remove near-duplicate documents | cuDF MinHash + LSH; cuGraph connected components | 70-85% |
| Semantic Dedup | Remove semantically equivalent docs | crossfit encoder + cuML k-means | 85-95% of fuzzy-dedup output |
| Heuristic Filters | Remove spam, malformed, too-short docs | cuDF column ops (30+ rules) | 60-80% |
| Classifier Filter | Score + remove low-quality by ML model | DistributedDataClassifier on GPU | 40-70% |
| PII Redaction | Replace personal identifiers | GPU NER transformer model | ~100% (modify, not filter) |
| Blend + Shuffle | Combine sources; randomise document order | Dask shuffle across partitions | 100% of survivors |
GPU Acceleration: RAPIDS and Dask in Practice
The performance story is not just marketing. CPU-based curation at trillion-token scale typically requires multi-week runs on large CPU clusters. NeMo Curator with RAPIDS moves most operations onto GPU memory: a cuDF DataFrame holding a 10-million-document partition fits comfortably in the 80GB of HBM3 on an H100 SXM. Hash operations, string comparisons, and filter predicates run as CUDA kernels instead of Python loops.
Dask wraps the cuDF DataFrames into distributed collections and schedules tasks across GPU workers. You define your pipeline as a DocumentDataset with chained operations; Dask builds a task graph and executes it lazily across all GPUs. Adding nodes is a matter of scaling the Dask cluster — the pipeline itself does not change. For multi-node operation, object storage (S3, MinIO, or NFS over RDMA) is the critical shared layer. Keep your Dask scheduler on a CPU head node and size your GPU workers with one GPU process per physical GPU.
In 2026, NeMo Curator also supports Ray as an alternative runtime, particularly for multi-modal pipelines covering image, video, and audio curation in addition to text. The Ray-based path is newer and slightly less battle-tested at trillion-token text scale than the Dask path, but for mixed-modality pre-training data it is the right choice.
A Real Pipeline: Dedup + Quality Filter + PII Redaction
Below is a minimal but complete NeMo Curator pipeline script that chains fuzzy deduplication, a heuristic quality filter, and PII redaction.
```python
# Class and keyword names vary between NeMo Curator releases;
# check them against the version you have installed before running.
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    MeanWordLengthFilter,
    RepeatedLinesByCharFilter,
)
from nemo_curator.modifiers import PIIModifier
from nemo_curator.utils.distributed_utils import get_client

# --- 1. Start a Dask+RAPIDS GPU cluster ---
client = get_client(cluster_type="gpu", num_workers=8)

# --- 2. Load raw data from JSONL on shared storage ---
dataset = DocumentDataset.read_json(
    "/data/raw/*.jsonl",
    fields=["text", "url", "date"],
    backend="cudf",
    add_filename=True,  # needed so write_to_filename=True works in step 6
)
print(f"Loaded: {len(dataset)} documents")

# --- 3. Fuzzy deduplication (MinHash + LSH) ---
fuzzy_dedup = nc.FuzzyDuplicates(
    cache_dir="/data/dedup_cache",
    num_hashes=260,
    hashes_per_band=13,  # 260 hashes / 13 per band = 20 bands
    jaccard_threshold=0.8,
    perform_removal=True,
)
dataset = fuzzy_dedup(dataset)
print(f"After fuzzy dedup: {len(dataset)} documents")

# --- 4. Heuristic quality filters ---
# invert=False keeps the documents that pass each filter.
quality_pipeline = nc.Sequential([
    nc.ScoreFilter(
        WordCountFilter(min_words=80, max_words=100000),
        score_field="word_count",
        score_type=int,
        invert=False,
    ),
    nc.ScoreFilter(
        MeanWordLengthFilter(min_mean_word_length=4.0, max_mean_word_length=10.0),
        score_field="mean_word_length",
        score_type=float,
        invert=False,
    ),
    nc.ScoreFilter(
        RepeatedLinesByCharFilter(max_repeated_char_frac=0.2),
        score_field="repeated_char_frac",
        score_type=float,
        invert=False,  # was True, which would keep only the repetitive documents
    ),
])
dataset = quality_pipeline(dataset)
print(f"After quality filters: {len(dataset)} documents")

# --- 5. PII redaction (after dedup, so hashing sees the raw text) ---
pii_modifier = PIIModifier(
    supported_entities=[
        "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
        "CREDIT_CARD", "IP_ADDRESS",
    ],
    anonymize_action="replace",
    device="gpu",
)
pii_pipeline = nc.Modify(pii_modifier)
dataset = pii_pipeline(dataset)

# --- 6. Write output, one file per original input shard ---
dataset.to_json("/data/curated/", write_to_filename=True)
print("Done. Curated dataset written to /data/curated/")
```
Expected Output
On a corpus of 50 million documents (~500GB JSONL), running on 8 x H100 80GB GPUs: fuzzy dedup completes in roughly 12-18 minutes; the quality filter pass takes 5-8 minutes; PII redaction takes 20-40 minutes (transformer inference is the bottleneck). Total wall time under 90 minutes.
Common failure mode: over-aggressive filtering shrinks the corpus too far. If min_words=80 is too high for a code or technical dataset where many valid snippets are short, you will silently discard valid training examples. Watch the per-filter discard rate and log it. Write an intermediate checkpoint after each stage (for example, a dataset.to_json call to a per-stage directory) so you can diagnose which stage caused the unexpected corpus reduction.
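Logging the per-stage discard rate takes only a few lines. The helper below is a framework-agnostic sketch over plain Python lists; `run_with_audit` and the `(name, keep_fn)` stage format are hypothetical, and in a real pipeline you would take the counts from the dataset before and after each stage instead.

```python
def run_with_audit(docs, stages):
    """Apply (name, keep_fn) stages in order and record each discard rate."""
    report = []
    for name, keep_fn in stages:
        before = len(docs)
        docs = [d for d in docs if keep_fn(d)]
        rate = (before - len(docs)) / before if before else 0.0
        report.append((name, before, len(docs), rate))
        print(f"{name}: kept {len(docs)}/{before} (discarded {rate:.1%})")
    return docs, report


# Short but valid code snippets: a word-count floor tuned for prose hurts here.
snippets = [
    "print('hi')",
    "def add(a, b): return a + b",
    "for i in range(3): print(i)",
    "x",
]
stages = [
    ("non_empty", lambda d: bool(d.strip())),
    ("min_words>=4", lambda d: len(d.split()) >= 4),
]
survivors, report = run_with_audit(snippets, stages)
```

A stage that suddenly discards half of its input, as the word-count rule does on these snippets, is the signal to revisit the threshold before running on the full corpus.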
Scaling Configuration: Single Node vs. Multi-Node
| Scale | Hardware | Corpus Size | Approx. Runtime (full pipeline) | Notes |
|---|---|---|---|---|
| Dev / Pilot | 1-2 x A100 or H100 | 10-50B tokens | 1-4 hours | Skip semantic dedup for speed |
| Mid-scale | 8 x H100 (DGX H100) | 100B-500B tokens | 4-12 hours | NVLink within-node; NFS for data |
| Large-scale | 32-64 x A100/H100 multi-node | 1-2T tokens | 1-4 hours (fuzzy dedup alone ~1.8h on 64xA100 at 1.1T tokens) | GPUDirect Storage; InfiniBand fabric |
| Very large | 32 x H100 (multi-node) | ~2T tokens | ~0.5 hours (NVIDIA benchmark) | H100 HBM3 bandwidth advantage |
Worked Example: Curating a 5-Billion-Document Corpus
Input: 5 billion documents, 4.8T tokens, sourced from Common Crawl WARC dumps (2022-2024 crawls). Stored as 120,000 JSONL shards on NFS-over-RDMA attached to a 32-node GPU cluster (256 x A100 80GB total). Each node has 8 GPUs and a 200Gb/s InfiniBand HDR uplink.
Language filter: Keep English only. Discards 58% of documents. Remaining: 2.1B docs, ~2.0T tokens. Runtime: 35 minutes.
Exact dedup: Removes 6% of English docs. Remaining: 1.97B docs. Runtime: 18 minutes.
Fuzzy dedup (Jaccard threshold 0.8, MinHash 260 hashes): Removes 22% of remaining docs. Remaining: 1.54B docs, ~1.47T tokens. Runtime: 2.1 hours on 256 A100s.
Heuristic filters (word count 80-100k, mean word length 4-10, repetition fraction <0.2): Removes 28%. Remaining: 1.11B docs, ~1.06T tokens. Runtime: 22 minutes.
Classifier filter (fastText quality model, threshold 0.7): Removes 38%. Remaining: 688M docs, ~655B tokens. Runtime: 55 minutes.
PII redaction: Modifies ~12% of surviving docs; no documents removed. Runtime: 3.5 hours (transformer NER is the pipeline bottleneck).
Final output: 688 million documents, 655 billion tokens, or 13.6% of the original raw token count. Total wall time: ~7.8 hours on 256 x A100 when the stages run back to back (the six stage runtimes above sum to 466 minutes). The resulting corpus trained a 7B model that matched a 13B model trained on the unfiltered 2T-token dataset on domain benchmarks.
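The funnel arithmetic is easy to check by multiplying the per-stage removal rates and summing the runtimes. The script below uses only the figures quoted in the steps above; the variable names are illustrative.

```python
raw_docs = 5.0e9

# (stage, fraction of incoming documents removed, runtime in minutes)
stages = [
    ("language filter",   0.58,  35),
    ("exact dedup",       0.06,  18),
    ("fuzzy dedup",       0.22, 126),  # 2.1 hours
    ("heuristic filters", 0.28,  22),
    ("classifier filter", 0.38,  55),
    ("PII redaction",     0.00, 210),  # 3.5 hours; modifies, removes nothing
]

docs = raw_docs
total_minutes = 0
for name, removed, minutes in stages:
    docs *= 1.0 - removed
    total_minutes += minutes
    print(f"{name:<18} {docs / 1e9:5.2f}B docs remaining ({minutes} min)")

total_hours = total_minutes / 60
print(f"final: {docs / 1e6:.0f}M docs, {docs / raw_docs:.1%} of raw documents")
print(f"total runtime: {total_minutes} min = {total_hours:.1f} h")
```

The unrounded product lands at roughly 687 million documents, within rounding of the 688 million quoted, because the step-by-step figures round at each stage. The runtimes sum to 466 minutes, or about 7.8 hours if the stages run strictly in sequence.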
How Curated Data Feeds NeMo Customization
The output of a NeMo Curator pipeline is a JSONL dataset with a text field and metadata columns. This drops directly into a NeMo pre-training or fine-tuning run. For the fine-tuning case covered in Part 23 (LoRA, SFT, RLHF), you typically run a smaller, more aggressive curation pipeline focused on the specific domain — law, medicine, code, financial filings — rather than the broad heuristic pass used for pre-training data. For SFT specifically, you also want task decontamination: removing any document that contains text overlapping with your evaluation benchmarks. NeMo Curator has a built-in task decontamination module that takes a list of benchmark documents and flags training data that n-gram-matches them.
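N-gram matching against benchmark text can be sketched as follows. This is a simplified stand-in for the built-in module: whitespace tokenization, the 5-gram window, and the `flag_contaminated` helper are assumptions for the example, and a production run would typically use longer n-grams and normalize punctuation.

```python
def ngrams(text, n):
    """Set of lowercased word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(train_docs, benchmark_docs, n=5):
    """Return indices of training documents sharing any n-gram with a benchmark."""
    benchmark_grams = set()
    for doc in benchmark_docs:
        benchmark_grams |= ngrams(doc, n)
    return [i for i, doc in enumerate(train_docs)
            if ngrams(doc, n) & benchmark_grams]


benchmark = ["What is the capital of France and when was it founded as a city"]
train = [
    "Trivia night answers: what is the capital of France and when was it founded? Paris.",
    "The capital markets desk opened at nine this morning in London.",
]
flagged = flag_contaminated(train, benchmark)
print(flagged)  # [0]
```

Whether a flagged document is dropped entirely or only has the overlapping span removed is a policy choice; dropping is the more conservative option for evaluation integrity.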
The blending stage is where domain mixing happens. If you are building a model that needs to be strong in code and in English prose, you blend a code corpus (GitHub, Stack Overflow post-curation) with the general web corpus at a ratio like 20:80. NeMo Curator’s blending module handles the weighted sampling and shuffles across partition boundaries so there are no domain-ordering artifacts in the final JSONL shards.
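A 20:80 blend reduces to per-source quotas followed by a global shuffle. The sketch below works on in-memory lists with a fixed seed; the `blend` function and its arguments are hypothetical and are not the blending module's API.

```python
import random


def blend(sources, weights, total_docs, seed=0):
    """Sample each source in proportion to its weight, then shuffle globally."""
    rng = random.Random(seed)
    weight_sum = sum(weights.values())
    blended = []
    for name, docs in sources.items():
        quota = round(total_docs * weights[name] / weight_sum)
        # rng.sample raises ValueError if a source is smaller than its quota.
        blended.extend((name, doc) for doc in rng.sample(docs, quota))
    rng.shuffle(blended)  # removes domain ordering from the output
    return blended


sources = {
    "code": [f"code-{i}" for i in range(100)],
    "web": [f"web-{i}" for i in range(400)],
}
mixed = blend(sources, weights={"code": 20, "web": 80}, total_docs=50)
counts = {name: sum(1 for src, _ in mixed if src == name) for name in sources}
print(counts)  # {'code': 10, 'web': 40}
```

The final shuffle is the part that matters for training: without it, each shard would contain one domain after another in source order.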
My Take: Invest in Curation Before More GPUs
The pattern I see repeatedly in production: a team trains a 7B or 13B model, is disappointed with the results, and immediately asks about upgrading to a bigger model or buying more H100s. In most cases, the right answer is to go back and curate the training data. Data quality is the highest-ROI intervention available before you hit the true data-limited regime.
Why NeMo Curator specifically: the GPU acceleration is the only reason it is practical at trillion-token scale without a multi-week CPU cluster job. The modular pipeline API means you can plug in your own filters and classifiers without rewriting the infrastructure. The RAPIDS backend means your curation throughput scales near-linearly as you add GPUs — within a node via NVLink, across nodes via InfiniBand.
When NOT to use it: if your dataset is under roughly 10 billion tokens and lives in a controlled source (internal documents, a curated knowledge base, structured PDFs), you do not need NeMo Curator’s scale. A simple pandas + spaCy pipeline on a CPU node will handle it in hours. NeMo Curator is designed for web-scale raw data where CPU-based tools are too slow and the data quality problem is severe.
What to validate first: before running the full pipeline on your entire corpus, curate a 1% sample end-to-end and manually review 200-300 of the retained and 200-300 of the discarded documents at each stage. Check your heuristic thresholds against domain-specific content. A word-count minimum that is sane for news articles will incorrectly discard half your medical case reports or all your legal citations. Get the filter calibration right on the sample before burning 10 hours on the full corpus.
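One way to draw a stable 1% sample is to hash each document id and keep a fixed slice of the hash space, so the same documents are selected on every rerun. This is a sketch under that assumption; `in_sample` and the `doc-N` id format are made up for the example.

```python
import hashlib


def in_sample(doc_id, percent=1.0):
    """Deterministically assign a document to the sample based on its id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100                        # 1% -> buckets 0..99


doc_ids = [f"doc-{i}" for i in range(100_000)]
sample = [d for d in doc_ids if in_sample(d)]
print(f"sampled {len(sample)} of {len(doc_ids)} ids")
```

Because membership depends only on the id, you can change a threshold, rerun the sample, and compare retained and discarded documents for exactly the same inputs.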
What Is Next
With a curated corpus in hand, the next challenge is running the actual training job at scale — across multiple nodes, with checkpoint fault tolerance and recovery. Part 25 covers multi-node training, scheduling, checkpointing, and fault tolerance in the NeMo framework.
For the deployment side — what to do with the model once it is trained — revisit Part 23 and the inference parts of this series. The full picture is in the NVIDIA AI Complete Guide. Have a question about calibrating curation for your specific domain? Drop it in the comments — I read every one.