Load YouTube Transcripts as LangChain Documents

LangChain is one of the most popular frameworks for building LLM-powered applications. One of its core concepts is the Document, a piece of text with metadata that can be chunked, embedded, and retrieved. This tutorial shows you how to load YouTube video transcripts as LangChain Documents using the YouTubeTranscripts.co API, creating a reusable loader for your RAG pipelines.

Why a Custom Loader

LangChain includes a built-in YouTubeLoader, but it relies on scraping YouTube pages via the youtube-transcript-api package, which can break whenever YouTube changes its frontend. Using YouTubeTranscripts.co as your transcript source gives you a stable, production-grade API with AI fallback for videos that have no captions.

Install Dependencies

Install the required packages. chromadb backs the Chroma vector store used later in the tutorial.

pip install langchain langchain-openai langchain-community langchain-text-splitters chromadb httpx

Build the Custom Loader

Create a LangChain-compatible document loader that fetches transcripts from our API.

import httpx
from typing import List
from langchain_core.documents import Document
from langchain_core.document_loaders import BaseLoader

class YouTubeTranscriptLoader(BaseLoader):
    """Loads YouTube transcripts from the YouTubeTranscripts.co API as Documents."""

    def __init__(self, urls: List[str], api_key: str):
        self.urls = urls
        self.api_key = api_key

    def load(self) -> List[Document]:
        documents = []
        for url in self.urls:
            response = httpx.get(
                "https://api.youtubetranscripts.co/v1/transcript",
                params={"url": url, "format": "text"},
                headers={"x-api-key": self.api_key},
                timeout=30.0,  # transcript extraction can take a while for long videos
            )
            response.raise_for_status()
            data = response.json()
            documents.append(
                Document(
                    page_content=data["text"],
                    metadata={
                        "title": data["title"],
                        "channel": data["channel"],
                        "source": url,
                        "duration": data["duration"],
                    },
                )
            )
        return documents

# Usage
loader = YouTubeTranscriptLoader(
    urls=["https://youtube.com/watch?v=V1", "https://youtube.com/watch?v=V2"],
    api_key="YOUR_API_KEY",
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
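If you rebuild your index often during development, you will refetch the same transcripts on every run. A small on-disk cache avoids that. The sketch below is framework-agnostic: the fetch function is injected (in practice it would be the httpx call from the loader), and the cache directory name is just an illustrative choice:

```python
import json
from pathlib import Path
from typing import Callable

def cached_fetch(url: str, fetch: Callable[[str], dict],
                 cache_dir: Path = Path(".transcript_cache")) -> dict:
    """Return the transcript payload for url, calling fetch only on a cache miss."""
    cache_dir.mkdir(exist_ok=True)
    # A filename-safe key; any stable hash of the URL would also work
    key = "".join(c if c.isalnum() else "_" for c in url) + ".json"
    cache_file = cache_dir / key
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = fetch(url)  # e.g. the API request made inside the loader
    cache_file.write_text(json.dumps(data))
    return data
```

You could wire this into the loader by wrapping its HTTP call, so repeated loads of the same video hit the cache instead of the API.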

Chunk and Embed

Once loaded as Documents, use LangChain's text splitters and vector stores as usual.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed and store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
print(f"Indexed {len(chunks)} chunks from {len(docs)} videos")
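To build intuition for what chunk_size and chunk_overlap control, here is a plain-Python sliding-window sketch. LangChain's actual splitter is smarter, recursively splitting on paragraph and sentence boundaries, but the size/overlap arithmetic is the same idea:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive character-window chunking: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share the overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_chunks("a" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks))      # 4 windows cover 2500 characters when stepping by 800
print(len(chunks[0]))   # 1000: each full window is chunk_size characters
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighbors, which is why values like 200 on a 1000-character chunk are a common starting point.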

Query the Knowledge Base

Run similarity searches or build QA chains over your YouTube knowledge base.

# Similarity search
results = vectorstore.similarity_search("machine learning basics", k=3)
for r in results:
    print(f"[{r.metadata['title']}] {r.page_content[:200]}...")

# QA chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
)
answer = qa.invoke({"query": "Summarize the key concepts discussed across all videos"})
print(answer["result"])
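If you pass return_source_documents=True to RetrievalQA.from_chain_type, the result dict also carries a source_documents list, and each entry's metadata holds the title and source URL set by the loader. Here is a minimal sketch of turning that metadata into a deduplicated citation footer (plain dicts stand in for Document.metadata so the snippet is self-contained):

```python
def format_sources(metadatas: list[dict]) -> str:
    """Build a deduplicated 'Sources' footer from document metadata dicts."""
    seen, lines = set(), []
    for meta in metadatas:
        if meta["source"] not in seen:
            seen.add(meta["source"])
            lines.append(f"- {meta['title']} ({meta['source']})")
    return "Sources:\n" + "\n".join(lines)
```

Appending this footer to the answer lets users jump back to the original videos, which matters when several chunks from the same video land in the retrieved set.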

Conclusion

You now have a reusable LangChain document loader for YouTube transcripts. This loader is more reliable than the built-in option because it uses a dedicated API instead of web scraping. Use it to build RAG chatbots, knowledge bases, and AI agents over YouTube content. Get your API key at youtubetranscripts.co.

Ready to start extracting YouTube transcripts?

Get 150 free API requests. No credit card required.

Get Your Free API Key