Dr. Pranay Jha


Data Preparation at Scale with NeMo Curator (NVIDIA AI Series, Part 24)

NeMo Curator is NVIDIA’s GPU-accelerated data curation toolkit that runs exact dedup, fuzzy dedup, semantic dedup, heuristic filtering, classifier-based quality filters, and PII redaction at trillion-token scale using RAPIDS cuDF and Dask. Learn why investing in data curation beats buying more GPUs.

NVIDIA AI Series · Part 24 of 30
TL;DR: NeMo Curator is NVIDIA’s GPU-accelerated data curation toolkit. It runs a modular pipeline — download, language ID, exact dedup, fuzzy dedup, semantic dedup, heuristic quality filters, classifier-based filters, PII redaction, blending — on RAPIDS cuDF and Dask so you can process trillions of tokens on a multi-node GPU cluster in hours rather than weeks. The verdict: invest in curation before buying more GPUs. A 3x cleaner dataset trains a model that outperforms one trained on 5x more raw data.
Who this is for: ML platform engineers and AI infrastructure architects preparing pre-training or fine-tuning datasets for LLMs. You should be comfortable with Python, Dask, and GPU cluster basics. You should have read Part 23 (NeMo Customization: LoRA, SFT, RLHF) since that is where the curated data lands.

You trained on 2 trillion tokens, paid for eight H100s for three weeks, and the model still hallucinates on your domain. The problem probably was not the model. It was the data. Raw Common Crawl dumps contain near-duplicate web pages, spam, PII, low-quality ad copy, and benchmark test questions. Feed that to any model architecture and you get a model that memorises noise. NeMo Curator exists to fix that problem at GPU scale before you ever submit a training job.

What NeMo Curator Actually Is

NeMo Curator is an open-source Python framework — separate from the NeMo training framework — that builds repeatable, GPU-accelerated data pipelines. It is not a hosted service; you run it on your own cluster. The codebase lives at github.com/NVIDIA-NeMo/Curator and the docs at docs.nvidia.com/nemo/curator.

At the time of writing (mid-2026), the current release is 26.04, which brought a simplified Resources API and an upgraded Ray runtime; check the release notes for the exact release date. Under the hood, the compute relies on three NVIDIA RAPIDS libraries: cuDF (GPU DataFrame operations replacing pandas), cuML (GPU machine learning including clustering for semantic dedup), and cuGraph (graph algorithms used in connected-components dedup). Dask coordinates the distributed execution across nodes and GPUs; Ray is the newer alternative runtime used for multi-modal pipelines (image, video, audio). For text-only pre-training data, Dask+RAPIDS is the production-tested path.

[Figure 1 diagram: pipeline stages from raw web data to curated training corpus: Download + Extract → Lang ID (fastText) → Dedup (exact/fuzzy/semantic) → Heuristic (30+ filters) → Classifier (ML quality) → PII Redact (GPU NER) → Blend + Shuffle. Each stage runs on RAPIDS cuDF + Dask across multi-node GPU clusters.]
Figure 1 — NeMo Curator pipeline stages. Each stage is a composable module; you can run all or a subset.

The Three Deduplication Layers

Deduplication is where most teams underinvest and where NeMo Curator has the clearest performance story. There are three distinct layers, and you generally want to run all three in sequence.

Exact Deduplication

The simplest layer: SHA-256 or MD5 hash every document, group identical hashes, keep one. RAPIDS cuDF runs these hashes GPU-parallel across partitions. For a 1-trillion-token corpus this pass takes minutes on a small GPU cluster. The yield is usually 5-10% reduction on well-scraped web data and up to 40% on datasets that aggregate syndicated content like news aggregators or mirror sites.
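
To make the hash-and-group idea concrete, here is a minimal CPU sketch in plain Python. It is not the NeMo Curator API; the `exact_dedup` helper and the toy documents are assumptions for illustration. It only shows the logic that the GPU version parallelises across partitions.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct SHA-256 digest of its text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


# Toy corpus (hypothetical): ids 1 and 2 are byte-identical, id 3 differs by one character.
docs = [
    {"id": 1, "text": "GPU clusters speed up curation."},
    {"id": 2, "text": "GPU clusters speed up curation."},
    {"id": 3, "text": "GPU clusters speed up curation!"},
]
unique_docs = exact_dedup(docs)
print([d["id"] for d in unique_docs])
```

Note that the one-character variant survives. That is exactly the gap the fuzzy layer is there to close.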

Fuzzy Deduplication (MinHash + LSH)

Near-duplicates (the same article with a changed byline, a Wikipedia paragraph copied across 200 SEO pages) are the real pollution problem. NeMo Curator uses MinHash signatures with Locality Sensitive Hashing (LSH) buckets, then cuGraph connected-components to cluster near-identical documents. NVIDIA benchmarked this stage at deduplicating 1.1 trillion tokens from the RedPajama dataset in 1.8 hours using 64 A100 GPUs, a task that would take CPU-only tools several days. The advertised speedup versus CPU alternatives is 16x for this stage alone.
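
The three moving parts (MinHash signatures, LSH banding, connected components) can be sketched on CPU in plain Python. This is a teaching sketch under stated assumptions, not Curator's implementation: the function names, the 5-character shingles, the 32 hashes and the 4 hashes per band are all hypothetical choices, and a union-find stands in for cuGraph. It also skips the optional step of verifying the true Jaccard similarity of each candidate pair.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=5):
    """Character n-grams of a lower-cased, whitespace-normalised document."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text, num_hashes=32):
    """Minimum 64-bit hash of the shingle set under each of num_hashes salts."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(g, digest_size=8, salt=salt).digest() for g in grams
        ))
    return signature


def lsh_candidate_pairs(signatures, hashes_per_band=4):
    """Pairs of documents whose signatures agree on every hash of at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for start in range(0, len(sig), hashes_per_band):
            buckets[(start, tuple(sig[start:start + hashes_per_band]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs


def connected_components(doc_ids, pairs):
    """Union-find over candidate pairs; maps each doc id to its group representative."""
    parent = {d: d for d in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return {d: find(d) for d in doc_ids}


# Toy corpus (hypothetical): docs 0 and 1 differ only in the final character.
docs = {
    0: "The quick brown fox jumps over the lazy dog near the river bank today.",
    1: "The quick brown fox jumps over the lazy dog near the river bank today!",
    2: "Completely unrelated text about GPU memory bandwidth and data curation.",
}
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
groups = connected_components(docs, lsh_candidate_pairs(signatures))
keep = sorted(set(groups.values()))  # one representative per near-duplicate group
print(groups, keep)
```

The connected-components step matters because near-duplication is transitive in practice: if A matches B and B matches C, all three collapse to one representative even when A and C never shared a bucket.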

Semantic Deduplication

Two documents can be lexically different yet semantically identical: different paraphrases of the same Wikipedia fact appearing in training data will produce models that overfit to that fact. Semantic dedup generates embeddings with a crossfit-backed encoder, runs GPU-accelerated k-means clustering in cuML, then keeps one representative per cluster above a similarity threshold. It is computationally heavier than fuzzy dedup and is best applied after the other two passes have already reduced corpus size. Give it the largest GPU allocation you have; embedding generation is the bottleneck.
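
Here is a small sketch of the "one representative per cluster above a similarity threshold" step, in plain Python. It assumes the embeddings and the k-means cluster assignments already exist (here they are hand-written toy values), and the `semantic_dedup` helper and the 0.95 threshold are hypothetical, not Curator defaults.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, cluster_of, threshold=0.95):
    """Within each cluster, drop a document that is too similar to one already kept."""
    by_cluster = defaultdict(list)
    for doc_id in sorted(embeddings):
        by_cluster[cluster_of[doc_id]].append(doc_id)

    kept = []
    for members in by_cluster.values():
        representatives = []
        for doc_id in members:
            if all(cosine(embeddings[doc_id], embeddings[r]) < threshold
                   for r in representatives):
                representatives.append(doc_id)
        kept.extend(representatives)
    return sorted(kept)


# Toy 2-D embeddings and cluster ids (hypothetical stand-ins for encoder + k-means output).
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.05], "c": [0.0, 1.0], "d": [0.6, 0.8]}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
survivors = semantic_dedup(embeddings, cluster_of)
print(survivors)
```

Clustering first is what keeps this tractable: pairwise similarity is only computed inside a cluster, never across the whole corpus.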

[Figure 2 diagram: Raw Corpus → Exact Dedup (SHA-256 hash, cuDF groupby) → Fuzzy Dedup (MinHash + LSH, cuGraph components) → Semantic Dedup (embeddings + k-means, cuML clustering) → Deduplicated Corpus, ready for quality filtering.]
Figure 2 — Three dedup layers in order: exact (cheapest), fuzzy (highest yield), semantic (most precise). Run all three.

Quality Filtering: Heuristics and Classifiers

Deduplication removes redundancy. Quality filtering removes the documents that survived dedup but are still not worth training on: ad-stuffed pages, auto-generated listicles, malformed HTML text, and garbled OCR output. NeMo Curator runs two quality-filter tiers in sequence.

Heuristic Filters

These are deterministic, fast, and run on every document. NeMo Curator ships 30+ built-in heuristics including: word count thresholds, mean word length, fraction of lines ending in punctuation, fraction of alphabetic characters, symbol-to-word ratio, bullet-point density (to catch bad list spam), stop-word presence, and n-gram repetition score (documents with the same sentence repeated 50 times are usually spam). All run as cuDF operations on GPU — meaning a single A100 can process hundreds of gigabytes of text per hour at this stage.
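
A few of these heuristics are simple enough to sketch in plain Python. The scoring functions and every threshold below are hypothetical, chosen so the toy examples behave sensibly; they are not Curator's built-in definitions or defaults.

```python
def heuristic_scores(text):
    """Compute a handful of per-document heuristic scores."""
    words = text.split()
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_words = len(words)
    punct_lines = sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines)
    return {
        "word_count": n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "punct_line_frac": punct_lines / len(lines) if lines else 0.0,
        "alpha_frac": sum(c.isalpha() for c in text) / len(text) if text else 0.0,
        "symbol_word_ratio": (text.count("#") + text.count("...")) / n_words if n_words else 0.0,
    }


# Hypothetical (min, max) bounds; the tiny word-count minimum only suits this toy input.
THRESHOLDS = {
    "word_count": (5, 100_000),
    "mean_word_length": (3.0, 10.0),
    "punct_line_frac": (0.5, 1.0),
    "alpha_frac": (0.6, 1.0),
    "symbol_word_ratio": (0.0, 0.1),
}


def passes(text):
    """A document is kept only if every score falls inside its bounds."""
    scores = heuristic_scores(text)
    return all(lo <= scores[name] <= hi for name, (lo, hi) in THRESHOLDS.items())


good = "Data curation removes duplicates and spam.\nClean corpora train better models."
spam = "BUY NOW ### $$$ ### CLICK ### >>> ###\nBUY NOW ### $$$ ### CLICK ### >>> ###"
print(passes(good), passes(spam))
```

Each rule is cheap on its own. The value comes from stacking many of them and measuring how much each one removes.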

Classifier-Based Filtering

After heuristics, you apply ML classifiers that score documents on subjective quality axes: is this text written at a native-speaker level, is it factually coherent, is it domain-appropriate? NeMo Curator ships a distributed classifier module using fastText models (n-gram bag-of-words classifiers trained on a high-quality reference corpus vs. raw Common Crawl). You can plug in your own domain classifier — just drop in a fastText or BERT-family model and use the DistributedDataClassifier wrapper, which handles sharding across GPUs automatically via Dask. For highest recall, teams often train a domain quality classifier on a small hand-labelled seed set of 1,000 to 5,000 documents.

[Figure 3 diagram: quality filter funnel. Raw corpus 100% → after exact + fuzzy dedup ~70-80% → after heuristic filters ~50-60% → after classifier ~30-40%.]
Figure 3 — Quality filter funnel. Typical yield on Common Crawl: 30-40% of raw tokens survive all stages. That is the point.
Gotcha: Over-aggressive classifier thresholds are the most common self-inflicted wound in data curation. If you push your quality threshold too high you will discard domain-specific documents that look like low-quality web text to a general classifier but contain exactly the technical vocabulary your model needs. Always spot-check a random 500-document sample of the rejected set before committing to a threshold. Tune your fastText or domain classifier on domain-representative data, not just Wikipedia vs. Common Crawl.
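
The spot-check is easy to script. Below is a minimal sketch, assuming each document already carries a classifier score in a `quality` field (a hypothetical field name); the helpers are illustrative and not part of NeMo Curator.

```python
import random


def split_by_threshold(scored_docs, threshold):
    """Partition documents into kept and rejected by classifier score."""
    kept = [d for d in scored_docs if d["quality"] >= threshold]
    rejected = [d for d in scored_docs if d["quality"] < threshold]
    return kept, rejected


def sample_rejected(rejected, k=500, seed=0):
    """Reproducible random sample of the rejected set for manual review."""
    rng = random.Random(seed)
    return rng.sample(rejected, min(k, len(rejected)))


# Synthetic scores (hypothetical) just to exercise the helpers.
scored_docs = [{"id": i, "quality": i / 2000} for i in range(2000)]
kept, rejected = split_by_threshold(scored_docs, threshold=0.75)
review_batch = sample_rejected(rejected, k=500, seed=42)
print(len(kept), len(rejected), len(review_batch))
```

Fixing the seed means two reviewers looking at "the rejected sample" are looking at the same 500 documents, which makes threshold debates much shorter.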

Language Identification and PII Redaction

Language Identification

NeMo Curator uses the fastText lid.176.bin model (roughly 130MB) for language identification. It returns a language code and confidence score per document. You then filter to keep only the languages relevant to your training target, or bucket documents into per-language streams for multilingual training. One critical operational note: the fastText model file must be accessible from all Dask worker nodes, so you either pre-copy it to shared storage or embed it in the worker container image.
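
Once you have a label and confidence per document, the filtering and bucketing step is a few lines. This sketch assumes predictions arrive as `(label, confidence)` pairs with fastText-style `__label__xx` labels; the `bucket_by_language` helper, the `min_confidence` cut-off, and the sample predictions are hypothetical.

```python
from collections import defaultdict


def bucket_by_language(docs, predictions, keep=("en",), min_confidence=0.5):
    """Group documents by predicted language, dropping unwanted or uncertain ones."""
    buckets = defaultdict(list)
    for doc, (label, confidence) in zip(docs, predictions):
        lang = label.replace("__label__", "")
        if lang in keep and confidence >= min_confidence:
            buckets[lang].append(doc)
    return dict(buckets)


docs = ["An English paragraph.", "Ein deutscher Absatz.", "ok lol ??? 1234"]
# Hypothetical model outputs, one (label, confidence) pair per document.
predictions = [("__label__en", 0.98), ("__label__de", 0.91), ("__label__en", 0.31)]

english_only = bucket_by_language(docs, predictions, keep=("en",))
multilingual = bucket_by_language(docs, predictions, keep=("en", "de"))
print(english_only, multilingual)
```

The confidence cut-off matters as much as the language code: short or garbled documents often get a confident-looking label with a low score, and they are rarely worth keeping.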

PII Redaction

PII in training data is both a legal liability and a model behavior risk — models trained on raw web text will regurgitate email addresses, phone numbers, and real names verbatim. NeMo Curator ships a GPU-accelerated PII detection and redaction module that uses named entity recognition (NER) models to identify and either remove or replace PII entities with typed placeholders (EMAIL_ADDRESS, PHONE_NUMBER, PERSON_NAME). The GPU acceleration here matters: NER inference is transformer-based and would be prohibitively slow at trillion-token scale on CPU.
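
To show what typed-placeholder replacement looks like, here is a deliberately simple regex sketch. It is a stand-in, not the NER-based module described above: regexes only catch structured identifiers such as emails and phone numbers, not names. The patterns and the angle-bracket placeholder format are assumptions.

```python
import re

# Simplified patterns for illustration; production PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace matched spans with a typed placeholder instead of deleting them."""
    text = EMAIL_RE.sub("<EMAIL_ADDRESS>", text)
    text = PHONE_RE.sub("<PHONE_NUMBER>", text)
    return text


sample = "Contact jane.doe@example.com or call +1 (555) 010-9999 for details."
redacted = redact(sample)
print(redacted)
```

Replacing rather than deleting keeps the sentence grammatical, so the model still learns that an email address belongs in that position without learning any real one.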

One design decision teams often miss: redact before dedup or after? Redact after. Exact and fuzzy dedup rely on document content being consistent; redacting first can break hash-based dedup by replacing PII strings differently on near-duplicates that otherwise would have matched. Run dedup on raw content, then redact the surviving corpus.

Curation Stages at a Glance

| Stage | Purpose | GPU Tooling | Typical Retention |
| --- | --- | --- | --- |
| Download + Extract | Fetch WARC/JSONL; strip HTML to text | cuDF text ops | 100% (no filter) |
| Language ID | Keep target languages, discard rest | fastText (CPU), cuDF partitioning | Varies by language mix |
| Exact Dedup | Remove identical documents | cuDF hash + groupby | 90-95% |
| Fuzzy Dedup | Remove near-duplicate documents | cuDF MinHash, LSH buckets, cuGraph connected components | 70-85% |
| Semantic Dedup | Remove semantically equivalent docs | crossfit encoder + cuML k-means | 85-95% of fuzzy-dedup output |
| Heuristic Filters | Remove spam, malformed, too-short docs | cuDF column ops (30+ rules) | 60-80% |
| Classifier Filter | Score + remove low-quality by ML model | DistributedDataClassifier on GPU | 40-70% |
| PII Redaction | Replace personal identifiers | GPU NER transformer model | ~100% (modify, not filter) |
| Blend + Shuffle | Combine sources; randomise document order | Dask shuffle across partitions | 100% of survivors |

GPU Acceleration: RAPIDS and Dask in Practice

The performance story is not just marketing. CPU-based curation at trillion-token scale typically requires multi-week runs on large CPU clusters. NeMo Curator with RAPIDS moves most operations onto GPU memory: a cuDF DataFrame holding a 10-million-document partition fits comfortably in the 80GB of HBM3 on an H100 SXM. Hash operations, string comparisons, and filter predicates run as CUDA kernels instead of Python loops.

Dask wraps the cuDF DataFrames into distributed collections and schedules tasks across GPU workers. You define your pipeline as a DocumentDataset with chained operations; Dask builds a task graph and executes it lazily across all GPUs. Adding nodes is a matter of scaling the Dask cluster — the pipeline itself does not change. For multi-node operation, object storage (S3, MinIO, or NFS over RDMA) is the critical shared layer. Keep your Dask scheduler on a CPU head node and size your GPU workers with one GPU process per physical GPU.

In 2026, NeMo Curator also supports Ray as an alternative runtime, particularly for multi-modal pipelines covering image, video, and audio curation in addition to text. The Ray-based path is newer and slightly less battle-tested at trillion-token text scale than the Dask path, so benchmark it on your own workload before committing a 1T+ token text job to it. For mixed-modality pre-training data it is the right choice.

A Real Pipeline: Dedup + Quality Filter + PII Redaction

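Below is a minimal NeMo Curator pipeline script that chains fuzzy deduplication, heuristic quality filters, and PII redaction. Treat the class and parameter names as illustrative: the Curator API has changed between releases, so check each call against the API reference for the version you install.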

import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    MeanWordLengthFilter,
    RepeatedLinesByCharFilter,
)
from nemo_curator.modifiers import PIIModifier
from nemo_curator.utils.distributed_utils import get_client

# --- 1. Start a Dask+RAPIDS GPU cluster ---
client = get_client(cluster_type="gpu", num_workers=8)

# --- 2. Load raw data from JSONL on shared storage ---
dataset = DocumentDataset.read_json(
    "/data/raw/*.jsonl",
    fields=["text", "url", "date"],
    backend="cudf",
)
print(f"Loaded: {len(dataset)} documents")

# --- 3. Fuzzy deduplication (MinHash + LSH) ---
fuzzy_dedup = nc.FuzzyDuplicates(
    cache_dir="/data/dedup_cache",
    num_hashes=260,
    hashes_per_band=13,
    jaccard_threshold=0.8,
    perform_removal=True,
)
dataset = fuzzy_dedup(dataset)
print(f"After fuzzy dedup: {len(dataset)} documents")

# --- 4. Heuristic quality filters ---
quality_pipeline = nc.Sequential([
    nc.ScoreFilter(
        WordCountFilter(min_words=80, max_words=100000),
        score_field="word_count",
        score_type=int,
        invert=False,
    ),
    nc.ScoreFilter(
        MeanWordLengthFilter(min_mean_word_length=4.0, max_mean_word_length=10.0),
        score_field="mean_word_length",
        score_type=float,
        invert=False,
    ),
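    nc.ScoreFilter(
        RepeatedLinesByCharFilter(max_repeated_char_frac=0.2),
        score_field="repeated_char_frac",
        score_type=float,
        # invert=False keeps documents that pass the filter; invert=True would
        # keep only the heavily repeated documents this stage is meant to drop.
        invert=False,
    ),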
])
dataset = quality_pipeline(dataset)
print(f"After quality filters: {len(dataset)} documents")

# --- 5. PII redaction ---
pii_modifier = PIIModifier(
    supported_entities=[
        "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
        "CREDIT_CARD", "IP_ADDRESS",
    ],
    anonymize_action="replace",
    device="gpu",
)
pii_pipeline = nc.Modify(pii_modifier)
dataset = pii_pipeline(dataset)

# --- 6. Write output ---
dataset.to_json("/data/curated/", write_to_filename=True)
print("Done. Curated dataset written to /data/curated/")
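
A quick sanity check on the fuzzy-dedup settings in the script above. Assuming `num_hashes=260` and `hashes_per_band=13` describe standard LSH banding (20 bands of 13 hashes each), the textbook S-curve gives the probability that two documents with a given Jaccard similarity land in at least one shared bucket. The helper name is hypothetical, and the formula is standard LSH theory rather than anything specific to Curator.

```python
def lsh_candidate_probability(jaccard, hashes_per_band, num_bands):
    """Chance that two documents share at least one identical LSH band."""
    return 1.0 - (1.0 - jaccard ** hashes_per_band) ** num_bands


num_hashes, hashes_per_band = 260, 13
num_bands = num_hashes // hashes_per_band  # 20 bands

# Similarity at which the S-curve rises steeply: (1 / bands) ** (1 / hashes_per_band).
approx_threshold = (1.0 / num_bands) ** (1.0 / hashes_per_band)

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    p = lsh_candidate_probability(s, hashes_per_band, num_bands)
    print(f"jaccard={s:.2f}  candidate probability={p:.3f}")
print(f"approximate threshold: {approx_threshold:.3f}")
```

Under these assumptions the steep part of the curve sits close to 0.8, which is why 20 bands of 13 hashes pair naturally with a Jaccard threshold of 0.8. Pairs well below the threshold almost never become candidates, and pairs well above it almost always do.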

Expected Output

On a corpus of 50 million documents (~500GB JSONL), running on 8 x H100 80GB GPUs: fuzzy dedup completes in roughly 12-18 minutes; the quality filter pass takes 5-8 minutes; PII redaction takes 20-40 minutes (transformer inference is the bottleneck). Total wall time under 90 minutes.

Common failure mode: over-aggressive filtering shrinks the corpus too far. If min_words=80 is too high for a code or technical dataset where many valid snippets are short, you will silently discard valid training examples. Log the per-filter discard rate and watch it. Write an intermediate checkpoint after each stage (for example, a to_json call into a per-stage output directory) so you can diagnose which stage caused an unexpected corpus reduction.
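
Per-filter discard logging can be sketched independently of any framework. The `run_stages` helper, the stage names, and the toy documents below are hypothetical; in a real pipeline you would wrap each Curator stage the same way and write the report alongside the checkpoint.

```python
def run_stages(docs, stages):
    """Apply filters in order and record how many documents each one removes."""
    report = []
    for name, keep in stages:
        before = len(docs)
        docs = [d for d in docs if keep(d)]
        removed = before - len(docs)
        rate = removed / before if before else 0.0
        report.append((name, before, len(docs), rate))
    return docs, report


# Toy documents with 10, 50, 90, 120 and 200 words.
docs = [{"text": "word " * n} for n in (10, 50, 90, 120, 200)]
stages = [
    ("min_words_80", lambda d: len(d["text"].split()) >= 80),
    ("max_words_150", lambda d: len(d["text"].split()) <= 150),
]

survivors, report = run_stages(docs, stages)
for name, before, after, rate in report:
    print(f"{name}: {before} -> {after} ({rate:.0%} discarded)")
```

A stage that suddenly discards far more than it did on the calibration sample is the first thing to look at when the final corpus comes out smaller than expected.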

Scaling Configuration: Single Node vs. Multi-Node

| Scale | Hardware | Corpus Size | Approx. Runtime (full pipeline) | Notes |
| --- | --- | --- | --- | --- |
| Dev / Pilot | 1-2 x A100 or H100 | 10-50B tokens | 1-4 hours | Skip semantic dedup for speed |
| Mid-scale | 8 x H100 (DGX H100) | 100B-500B tokens | 4-12 hours | NVLink within-node; NFS for data |
| Large-scale | 32-64 x A100/H100 multi-node | 1-2T tokens | 1-4 hours (fuzzy dedup alone ~1.8h on 64xA100 at 1.1T tokens) | GPUDirect Storage; InfiniBand fabric |
| Very large | 32 x H100 (multi-node) | ~2T tokens | ~0.5 hours (NVIDIA benchmark) | H100 HBM3 bandwidth advantage |
In practice: For multi-node Dask clusters, your biggest operational headache will be the shared data path, not the compute. GPUDirect Storage (GDS) with an NFS-over-RDMA or a parallel filesystem like GPFS/Lustre makes the difference between I/O-bound and compute-bound curation jobs. If you are running on Kubernetes, pair the GPU Operator (covered in the NVIDIA AI Guide) with a CSI driver that supports GDS. Without GDS, your A100s will sit at 30-40% utilization waiting on PCIe data transfers.
[Figure 4 chart: model quality (y-axis) against data quality / curation effort (x-axis), with curated and raw-data curves. Annotation: a 3x curated corpus trains a better model than 10x raw data.]
Figure 4 — Data quality dominates model quality. The returns on curation effort outpace the returns on raw data volume past a threshold. More raw tokens on dirty data produces diminishing gains.

Worked Example: Curating a 5-Billion-Document Corpus

Input: 5 billion documents, 4.8T tokens, sourced from Common Crawl WARC dumps (2022-2024 crawls). Stored as 120,000 JSONL shards on NFS-over-RDMA attached to a 32-node GPU cluster (256 x A100 80GB total). Each node has 8 GPUs and a 200Gb/s InfiniBand HDR uplink.

Language filter: Keep English only. Discards 58% of documents. Remaining: 2.1B docs, ~2.0T tokens. Runtime: 35 minutes.

Exact dedup: Removes 6% of English docs. Remaining: 1.97B docs. Runtime: 18 minutes.

Fuzzy dedup (Jaccard threshold 0.8, MinHash 260 hashes): Removes 22% of remaining docs. Remaining: 1.54B docs, ~1.47T tokens. Runtime: 2.1 hours on 256 A100s.

Heuristic filters (word count 80-100k, mean word length 4-10, repetition fraction <0.2): Removes 28%. Remaining: 1.11B docs, ~1.06T tokens. Runtime: 22 minutes.

Classifier filter (fastText quality model, threshold 0.7): Removes 38%. Remaining: 688M docs, ~655B tokens. Runtime: 55 minutes.

PII redaction: Modifies ~12% of surviving docs; no documents removed. Runtime: 3.5 hours (transformer NER is the pipeline bottleneck).

Final output: 688 million documents, 655 billion tokens, or 13.6% of the original raw token count. Total wall time: ~7.8 hours on 256 x A100 (the six stage runtimes above sum to 466 minutes). The resulting corpus trained a 7B model that matched a 13B model trained on the unfiltered 2T-token dataset on domain benchmarks.
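
The arithmetic in this worked example is easy to re-derive, which is worth doing for your own runs too. The sketch below uses only the removal percentages and stage runtimes stated above; small differences from the quoted figures come from rounding at each step.

```python
# Fraction of documents removed at each filtering stage, as stated above.
removal = [
    ("language filter", 0.58),
    ("exact dedup", 0.06),
    ("fuzzy dedup", 0.22),
    ("heuristic filters", 0.28),
    ("classifier filter", 0.38),
]

# Stage runtimes in minutes (2.1 h = 126 min, 3.5 h = 210 min).
runtime_minutes = {
    "language filter": 35,
    "exact dedup": 18,
    "fuzzy dedup": 126,
    "heuristic filters": 22,
    "classifier filter": 55,
    "PII redaction": 210,
}

docs = 5.0e9
for name, removed in removal:
    docs *= 1.0 - removed
    print(f"after {name}: {docs / 1e9:.2f}B docs")

retained_fraction = docs / 5.0e9
total_hours = sum(runtime_minutes.values()) / 60
print(f"retained: {retained_fraction:.1%} of documents")
print(f"total wall time: {total_hours:.2f} hours")
```

The stated per-stage runtimes sum to roughly 7.8 hours, and PII redaction alone accounts for 45% of that, which is where you would look first to shorten the run.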

How Curated Data Feeds NeMo Customization

The output of a NeMo Curator pipeline is a JSONL dataset with a text field and metadata columns. This drops directly into a NeMo pre-training or fine-tuning run. For the fine-tuning case covered in Part 23 (LoRA, SFT, RLHF), you typically run a smaller, more aggressive curation pipeline focused on the specific domain — law, medicine, code, financial filings — rather than the broad heuristic pass used for pre-training data. For SFT specifically, you also want task decontamination: removing any document that contains text overlapping with your evaluation benchmarks. NeMo Curator has a built-in task decontamination module that takes a list of benchmark documents and flags training data that n-gram-matches them.
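
The n-gram matching idea behind task decontamination can be sketched in a few lines of plain Python. This is not the Curator module itself; the helper names, the whitespace tokenisation, and the 8-word window are assumptions for illustration.

```python
def word_ngrams(text, n):
    """Set of lower-cased word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(train_docs, benchmark_texts, n=8):
    """Return ids of training documents sharing any n-gram with a benchmark item."""
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= word_ngrams(text, n)
    return [doc_id for doc_id, text in train_docs.items()
            if word_ngrams(text, n) & benchmark_grams]


# Hypothetical benchmark question and training documents.
benchmark_texts = ["What is the capital of France and when was it founded as a city"]
train_docs = {
    "clean": "Paris has been an important European trading centre for many centuries now.",
    "leaky": "Quiz answers: what is the capital of France and when was it founded as a city? Paris.",
}
flagged = flag_contaminated(train_docs, benchmark_texts)
print(flagged)
```

The window size is the main tuning knob: too short and common phrases trigger false positives, too long and lightly edited benchmark items slip through.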

The blending stage is where domain mixing happens. If you are building a model that needs to be strong in code and in English prose, you blend a code corpus (GitHub, Stack Overflow post-curation) with the general web corpus at a ratio like 20:80. NeMo Curator’s blending module handles the weighted sampling and shuffles across partition boundaries so there are no domain-ordering artifacts in the final JSONL shards.
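
A 20:80 blend reduces to "take a quota from each source, then shuffle". The sketch below is a plain-Python illustration with a hypothetical `blend` helper and toy sources; it is not Curator's blending module, and the per-source rounding means the output size can differ from the target by a document or two when the weights do not divide evenly.

```python
import random


def blend(sources, weights, total, seed=0):
    """Draw a weighted quota from each source, then shuffle across sources."""
    rng = random.Random(seed)
    weight_sum = sum(weights.values())
    blended = []
    for name, docs in sources.items():
        quota = round(total * weights[name] / weight_sum)
        # Cycle through the source if the quota exceeds its size (upsampling).
        blended.extend(docs[i % len(docs)] for i in range(quota))
    rng.shuffle(blended)
    return blended


# Toy sources (hypothetical): a small code corpus and a larger web corpus.
sources = {
    "code": [f"code-{i}" for i in range(150)],
    "web": [f"web-{i}" for i in range(5000)],
}
mixed = blend(sources, weights={"code": 20, "web": 80}, total=1000, seed=7)
code_share = sum(d.startswith("code-") for d in mixed) / len(mixed)
print(len(mixed), code_share)
```

In this toy run the code source has only 150 documents, so hitting a 200-document quota repeats some of them. That is the trade-off to watch whenever a small domain corpus is blended at a high ratio.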

My Take: Invest in Curation Before More GPUs

The pattern I see repeatedly in production: a team trains a 7B or 13B model, is disappointed with the results, and immediately asks about upgrading to a bigger model or buying more H100s. In most cases, the right answer is to go back and curate the training data. Data quality is the highest-ROI intervention available before you hit the true data-limited regime.

Why NeMo Curator specifically: the GPU acceleration is the only reason it is practical at trillion-token scale without a multi-week CPU cluster job. The modular pipeline API means you can plug in your own filters and classifiers without rewriting the infrastructure. The RAPIDS backend means your curation throughput scales near-linearly as you add GPUs — within a node via NVLink, across nodes via InfiniBand.

When NOT to use it: if your dataset is under roughly 10 billion tokens and lives in a controlled source (internal documents, a curated knowledge base, structured PDFs), you do not need NeMo Curator’s scale. A simple pandas + spaCy pipeline on a CPU node will handle it in hours. NeMo Curator is designed for web-scale raw data where CPU-based tools are too slow and the data quality problem is severe.

What to validate first: before running the full pipeline on your entire corpus, curate a 1% sample end-to-end and manually review 200-300 of the retained and 200-300 of the discarded documents at each stage. Check your heuristic thresholds against domain-specific content. A word-count minimum that is sane for news articles will incorrectly discard half your medical case reports or all your legal citations. Get the filter calibration right on the sample before burning 10 hours on the full corpus.

My take: The 7B model in the worked example above matched a 13B model on domain benchmarks. The 13B model cost 2x the GPU hours to train. The curation run cost about 8 hours on 256 A100s. By any reasonable accounting, curation was the cheaper path. If you are planning a training run and have not yet budgeted for data curation compute, cut your training budget by 20% and spend it on curation instead. You will come out ahead on model quality and total cost.

What Is Next

With a curated corpus in hand, the next challenge is running the actual training job at scale — across multiple nodes, with checkpoint fault tolerance and recovery. Part 25 covers multi-node training, scheduling, checkpointing, and fault tolerance in the NeMo framework.

For the deployment side — what to do with the model once it is trained — revisit Part 23 and the inference parts of this series. The full picture is in the NVIDIA AI Complete Guide. Have a question about calibrating curation for your specific domain? Drop it in the comments — I read every one.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
