When introducing the FineWeb corpus this week, I claimed it had been derived from the C4 corpus, which is itself built from a single Common Crawl snapshot.
But I checked the original paper and found that I got this wrong. FineWeb is sourced directly from 96 Common Crawl web snapshots, and is therefore much larger than C4.
Thanks to the student who pointed out the size mismatch to me.