The FineWeb Concept Atlas


Author: Nathaniel Monson, Founding Research Scientist
Published: March 05, 2026

We are releasing FineWeb Atlas, a concept-annotated version of the FineWeb-Edu dataset: a 10.18B-token corpus with sub-document-level, human-understandable topic annotations. Each document is broken into chunks, and each chunk is annotated with four types of concepts that capture its primary content, tone, key entities, and purpose. Built using an improved version of our ATLAS pipeline, this dataset should enable new directions in LLM training, steering, and auditing.

FineWeb Atlas annotates 14,868,862 documents (95,486,049 chunks, 10,183,028,973 tokens) with 16,790 human-understandable concepts. The full release is available on HuggingFace and consists of four core artifacts.

Together, these artifacts let anyone explore pretraining data at the concept level, querying, filtering, and analyzing a 10.18B-token corpus with human-understandable concepts.
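As a sketch of the kind of concept-level querying this enables, the snippet below filters toy chunk records by concept label. The field names and schema here are illustrative assumptions, not the actual release format:

```python
# Toy chunk records mimicking an assumed FineWeb Atlas schema:
# each chunk carries four types of concept labels.
chunks = [
    {"text": "Install the plugin, then map WordPress fields to Salesforce objects.",
     "concepts": {"content": ["CRM integration"], "tone": ["Instructional"],
                  "document": ["Tutorial"], "entity": ["Salesforce"]}},
    {"text": "The film lingers on quiet moments rather than plot.",
     "concepts": {"content": ["Film criticism"], "tone": ["Reflective"],
                  "document": ["Review"], "entity": []}},
]

def find_chunks(chunks, concept, concept_type=None):
    """Return chunks annotated with `concept`, optionally restricted to one type."""
    hits = []
    for chunk in chunks:
        types = [concept_type] if concept_type else chunk["concepts"]
        if any(concept in chunk["concepts"][t] for t in types):
            hits.append(chunk)
    return hits

print(len(find_chunks(chunks, "Tutorial", "document")))  # 1
```

The same pattern extends naturally to the released corpus once it is loaded, e.g. via the HuggingFace `datasets` library.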

Below are five examples from the corpus, showing how documents break into chunks and the concept annotations our pipeline assigns to each. These examples span technical how-to content, reviews, fan/community writing, and pop-culture discussion.

Curated webtext example | WordPress to Salesforce integration
Curated webtext example | Master Data Services learning guide
Curated webtext example | indie film review and reflection
Curated webtext example | community wiki maintenance update
Curated webtext example | pop-culture discussion prompt

Figure: Five examples from different corners of webtext. Each panel shows raw text with content, document, tone, and entity labels, illustrating how the annotation stack captures both topic and style across varied writing formats.

The FineWeb Concept Atlas

The concept library spans topics from NASCAR to Hepatitis C, specific entities from Fort Wayne, Indiana to Stockholm, and subjects from blood transfusion protocols to GPS caching. In total it includes 16,790 concepts, reflecting the full breadth of what appears in web-scale text corpora.


Figure: 2D UMAP projection of roughly 3,000 sampled concepts. Nearby points are semantically related concepts, and colors show taxonomy groups, making it easier to see where domains form tight clusters versus broad overlap.


At the annotation stage, we annotate each chunk of text along four dimensions: content, tone, document type, and entities.

Figure: Concept frequency vs mean LLM-judge score. Horizontal position reflects concept prevalence, vertical position reflects label quality, and color highlights where quality failures concentrate across the frequency spectrum (yellow: <= 2, blue: > 2). This is a floor estimate from the shared sampled set (>=10 LLM ratings per concept): we also apply additional quality-improving curation steps (for example, dropping clearly worst concepts), but we do not yet have a robust measurement of how much those steps improve these metrics.

On average, each chunk of text receives 14.73 concept labels: 5.68 content, 6.90 tone, 1.52 document, and 0.63 entity. This density is intentional: a single chunk about GPS caching in a tutorial might carry content labels for the technology, tone labels for its instructional style, a document label marking it as a tutorial, and an entity label for a specific platform or standard. The concept frequency distribution follows a familiar power-law pattern: a small number of high-coverage concepts dominate, while concepts in the long tail each appear in a small fraction of chunks. For example, “matter-of-fact” covers 83.58% of chunks and “Informational” covers 79.82%. At the tail, thousands of concepts capture niche subjects that matter for specific domains.
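The per-type averages and coverage fractions above can be computed with a few lines over the chunk annotations. The records below are a toy stand-in, not the release format:

```python
from collections import Counter

# Toy chunk annotations; real chunks average ~14.7 labels across four types.
chunk_labels = [
    {"content": ["GPS caching", "Navigation"], "tone": ["matter-of-fact", "Informational"],
     "document": ["Tutorial"], "entity": ["GPS"]},
    {"content": ["Film criticism"], "tone": ["matter-of-fact"],
     "document": ["Review"], "entity": []},
]

# Mean number of labels per chunk, broken down by concept type.
type_means = {
    t: sum(len(c[t]) for c in chunk_labels) / len(chunk_labels)
    for t in ("content", "tone", "document", "entity")
}

# Coverage: fraction of chunks carrying each concept. The head of the
# power-law distribution is dominated by labels like "matter-of-fact".
hits = Counter()
for c in chunk_labels:
    for labels in c.values():
        for label in set(labels):
            hits[label] += 1
coverage = {label: n / len(chunk_labels) for label, n in hits.items()}
```

On the full corpus the same aggregation reproduces the reported statistics, e.g. `coverage["matter-of-fact"]` would be 0.8358.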

The co-occurrence matrix reveals which concepts tend to appear together. Some pairings simply confirm expected correlations: the strongest overall pairing, “Informational” and “matter-of-fact”, co-occurs in roughly 72 million chunks, covering 75.53% of the corpus. This reflects the dominant educational tone of FineWeb-Edu’s 10.18B-token corpus. In another case, “Christianity” and “Religion” co-occur in 2,197,944 chunks.
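A co-occurrence matrix of this kind can be built by counting concept pairs within each chunk. This is a minimal sketch over toy data, not the pipeline's implementation:

```python
from collections import Counter
from itertools import combinations

# Toy per-chunk concept sets (all four label types flattened together).
chunk_concepts = [
    {"Informational", "matter-of-fact", "Christianity", "Religion"},
    {"Informational", "matter-of-fact"},
    {"Christianity", "Religion", "Reflective"},
]

# Count co-occurring concept pairs across chunks; sorting each chunk's
# concepts gives a canonical key so (a, b) and (b, a) tally together.
cooccur = Counter()
for concepts in chunk_concepts:
    cooccur.update(combinations(sorted(concepts), 2))

print(cooccur[("Informational", "matter-of-fact")])  # 2
```

At corpus scale the same counts can be kept sparse, since most of the 16,790 × 16,790 possible pairs never co-occur.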

How FineWeb Atlas Was Produced

FineWeb Atlas was built using an improved version of the ATLAS pipeline, which produces concept annotations in three stages: constructing the human-understandable concept library, generating LLM ground-truth labels for sampled chunks, and training a fast annotator (the KNN model evaluated below) to scale those labels across the full corpus.

Annotation Label Quality

We evaluate annotation quality using the same framework as the original ATLAS release: an LLM judge scores each chunk-concept assignment on a 1–5 scale, where a score above 2 counts as a successful annotation (matching the yellow/blue bins in the figures). We report quality for both the ground-truth labels and the trained annotator’s predictions.
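The aggregation itself is simple; a sketch, with the success threshold parameterized to match the yellow (<= 2) / blue (> 2) split used in the figures:

```python
# Each (chunk, concept) assignment receives a 1-5 judge rating; scores
# above `threshold` count as successful annotations.
def success_rate(scores, threshold=2):
    """Fraction of assignments judged successful (score > threshold)."""
    return sum(s > threshold for s in scores) / len(scores)

ratings = [5, 4, 4, 2, 3, 1, 5]  # toy per-assignment judge scores
print(round(success_rate(ratings), 3))  # 0.714
```

Aggregating these rates per concept (rather than per chunk) yields the concept-level quality estimates plotted against frequency above.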

Ground truth vs. model quality distributions (yellow bins are scores <= 2, blue bins are > 2) are compared in a single 4-panel figure:

Figure: Score-distribution comparison between ground truth and the final KNN annotation model, each aggregated two ways (by chunks and by concepts). Panels are normalized to proportions with matched y-axes so distribution-shape differences are directly comparable.
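For intuition, a KNN annotator of the kind evaluated here can be sketched as label propagation from ground-truth chunk embeddings to new chunks via nearest neighbors. The embeddings, labels, and choice of k below are toy assumptions; the post does not specify the real model's features:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_labels(query, labeled, k=3, min_votes=2):
    """Predict labels appearing on at least `min_votes` of the k nearest neighbors."""
    neighbors = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, labels in neighbors for label in labels)
    return {label for label, n in votes.items() if n >= min_votes}

# Toy ground-truth set: (embedding, concept labels) pairs.
labeled = [
    ([1.0, 0.0], {"Tutorial", "Informational"}),
    ([0.9, 0.1], {"Tutorial", "matter-of-fact"}),
    ([0.0, 1.0], {"Review", "Reflective"}),
]
print(knn_labels([1.0, 0.05], labeled, k=2))  # {'Tutorial'}
```

The voting threshold trades precision against recall, which is exactly the behavior the chunk- and concept-level score distributions above are measuring.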

Research We Are Excited to See

We built FineWeb Atlas because we needed concept-level annotations for interpretable model development, but the release is designed to be useful well beyond our own work. We’re excited to see the community explore the directions it opens in model training, steering, and auditing.

The full dataset is available on HuggingFace. If you’re interested in our broader work on interpretable model development, read our other blog posts.