LangChain is the most popular framework for building LLM-powered applications. One of its core concepts is the Document, a piece of text with metadata that can be chunked, embedded, and retrieved. This tutorial shows you how to load YouTube video transcripts as LangChain Documents using the YouTubeTranscripts.co API, creating a reusable loader for your RAG pipelines.
Why a Custom Loader
LangChain includes a built-in YouTubeLoader, but it relies on scraping YouTube pages via the youtube-transcript-api package, which can break whenever YouTube changes its frontend. Using YouTubeTranscripts.co as your transcript source gives you a stable, production-grade API with an AI fallback for videos without captions.
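The loader below takes full video URLs. If your pipeline collects links in mixed shapes (youtu.be short links, /shorts/ paths, watch?v= URLs), a small normalization helper keeps your source metadata consistent and avoids duplicate entries. This is a sketch using only the standard library; the function name is our own, not part of LangChain or the API:

```python
from urllib.parse import urlparse, parse_qs


def canonical_watch_url(url: str) -> str:
    """Normalize common YouTube URL shapes to a canonical watch URL."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtu.be"):
        # youtu.be/<id> carries the ID in the path
        video_id = parsed.path.lstrip("/")
    elif "/shorts/" in parsed.path:
        # youtube.com/shorts/<id>
        video_id = parsed.path.split("/shorts/")[1].split("/")[0]
    else:
        # youtube.com/watch?v=<id>, possibly with extra query params
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    return f"https://youtube.com/watch?v={video_id}"


print(canonical_watch_url("https://youtu.be/abc123"))  # https://youtube.com/watch?v=abc123
```

Deduplicating on the normalized URL before loading also saves API requests when the same video appears in several forms.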
Install Dependencies
Install the required packages.
```
pip install langchain langchain-openai langchain-community chromadb httpx
```

Build the Custom Loader
Create a LangChain-compatible document loader that fetches transcripts from our API.
```python
import httpx
from langchain_core.documents import Document
from langchain_core.document_loaders import BaseLoader


class YouTubeTranscriptLoader(BaseLoader):
    def __init__(self, urls: list[str], api_key: str):
        self.urls = urls
        self.api_key = api_key

    def load(self) -> list[Document]:
        documents = []
        for url in self.urls:
            response = httpx.get(
                "https://api.youtubetranscripts.co/v1/transcript",
                params={"url": url, "format": "text"},
                headers={"x-api-key": self.api_key},
            )
            response.raise_for_status()
            data = response.json()
            documents.append(
                Document(
                    page_content=data["text"],
                    metadata={
                        "title": data["title"],
                        "channel": data["channel"],
                        "source": url,
                        "duration": data["duration"],
                    },
                )
            )
        return documents
```
```python
# Usage
loader = YouTubeTranscriptLoader(
    urls=["https://youtube.com/watch?v=V1", "https://youtube.com/watch?v=V2"],
    api_key="YOUR_API_KEY",
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
```

Chunk and Embed
Once loaded as Documents, use LangChain's text splitters and vector stores as usual.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed and store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
print(f"Indexed {len(chunks)} chunks from {len(docs)} videos")
```

Query the Knowledge Base
Run similarity searches or build QA chains over your YouTube knowledge base.
```python
# Similarity search
results = vectorstore.similarity_search("machine learning basics", k=3)
for r in results:
    print(f"[{r.metadata['title']}] {r.page_content[:200]}...")
# QA chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
)
answer = qa.invoke({"query": "Summarize the key concepts discussed across all videos"})
print(answer["result"])
```

Conclusion
You now have a reusable LangChain document loader for YouTube transcripts. This loader is more reliable than the built-in option because it uses a dedicated API instead of web scraping. Use it to build RAG chatbots, knowledge bases, and AI agents over YouTube content. Get your API key at youtubetranscripts.co.
Ready to start extracting YouTube transcripts?
Get 150 free API requests. No credit card required.
Get Your Free API Key