Training and fine-tuning language models requires large, high-quality text datasets. YouTube contains an enormous amount of spoken content on every topic imaginable, but extracting this text at scale is technically challenging. Existing scraping approaches are brittle and unreliable.
YouTubeTranscripts.co provides a production-grade API for extracting transcripts at scale. Build domain-specific training datasets from YouTube channels in any niche. Our batch API and AI fallback ensure comprehensive coverage, and the structured JSON output makes data preparation straightforward.
import httpx
import json

API_KEY = "YOUR_API_KEY"

# Collect transcripts for a training dataset
video_urls = [
    # Your list of video URLs for the dataset
    "https://youtube.com/watch?v=VIDEO1",
    "https://youtube.com/watch?v=VIDEO2",
]

dataset = []
# Submit URLs in batches of 25 per request
for i in range(0, len(video_urls), 25):
    batch = video_urls[i:i+25]
    resp = httpx.post(
        "https://api.youtubetranscripts.co/v1/batch",
        json={"urls": batch},
        headers={"x-api-key": API_KEY},
        timeout=120,
    )
    resp.raise_for_status()
    for item in resp.json()["transcripts"]:
        # Skip videos whose transcript could not be retrieved
        if not item.get("error"):
            dataset.append({
                "text": " ".join(s["text"] for s in item["transcript"]),
                "metadata": {
                    "title": item["title"],
                    "channel": item["channel"],
                    "duration": item["duration"],
                },
            })

# Save as JSONL for training
with open("training_data.jsonl", "w") as f:
    for entry in dataset:
        f.write(json.dumps(entry) + "\n")

print(f"Dataset: {len(dataset)} documents")

Build domain-specific text corpora from YouTube
JSONL export for LLM fine-tuning workflows
Batch API for efficient large-scale collection
AI fallback covers videos without captions
Metadata for filtering and categorizing training data
Structured output compatible with Hugging Face datasets
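Once the JSONL file is written, the per-entry metadata makes it easy to filter the corpus before training, for example dropping very short videos. Here is a minimal standard-library sketch; the `training_data.jsonl` filename matches the example above, while the 300-second threshold and `filtered_data.jsonl` output name are illustrative choices for your own pipeline:

```python
import json

def filter_dataset(in_path: str, out_path: str, min_duration: int = 300) -> int:
    """Keep only entries whose video runs at least `min_duration` seconds."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            entry = json.loads(line)
            # `duration` comes from the metadata collected in the example above
            if entry["metadata"]["duration"] >= min_duration:
                dst.write(json.dumps(entry) + "\n")
                kept += 1
    return kept

# Example: kept = filter_dataset("training_data.jsonl", "filtered_data.jsonl")
```

The filtered JSONL loads directly into a Hugging Face dataset with `load_dataset("json", data_files="filtered_data.jsonl")`, so no further conversion step is needed.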
Sign up in 30 seconds and get 150 free API requests. No credit card required. Start building your training datasets today.