AI YouTube Automation: A Seven-Stage End-to-End Pipeline for Autonomous Video Generation and Publishing with RAG-Enhanced Scripting
Authors:
Adit Jaywant Ghodke, Sahil Arun Sawant, Anurag Devprakash Singh, Aban Khalid Khan
Department of Computer Science and Engineering
Universal College of Engineering, Mumbai, Maharashtra, India
ghodkejaywantadit@gmail.com, sahilsawant736@gmail.com, anuragsingh7250@gmail.com, abankhanak44@gmail.com
March 2026
Abstract: Content creation for video platforms such as YouTube remains a resource-intensive process, demanding expertise across writing, audio production, video editing, and distribution. Existing automation tools address individual stages of this workflow in isolation, leaving creators to manually integrate outputs across tools. This paper presents AI YouTube Automation, a fully automated, seven-stage pipeline that transforms a single text prompt into a published YouTube video without human intervention at any intermediate stage. The system combines Retrieval-Augmented Generation (RAG) for factually grounded script generation using the Groq API (Llama 3.3 70B), Microsoft Edge-TTS for neural narration, per-segment stock footage retrieval from the Pexels API, background music mixing, AI thumbnail generation, and autonomous YouTube upload via the YouTube Data API v3. The pipeline was developed using the Vibe Coding methodology, a structured AI-assisted development approach, and comprises ten Python modules totalling 758 lines of core logic. Across at least 15 real-world production runs, the system achieved a mean end-to-end execution time of 258.9 seconds for a 90-second video, with successful YouTube uploads in every run. A key architectural contribution is a graceful degradation mechanism that maintains pipeline continuity despite partial failures in auxiliary services. The system represents a practical and reproducible approach to fully autonomous educational video production at scale.
Keywords: video automation, retrieval-augmented generation, large language models, text-to-speech synthesis, YouTube API, content generation pipeline, Vibe Coding, segment-matched video assembly, graceful degradation, Groq AI
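The graceful degradation idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stage names, the `run_pipeline` helper, and the choice of which stages count as auxiliary are all hypothetical, chosen only to show how a pipeline might record a failure in a non-critical stage and continue, while still aborting on a critical one.

```python
from typing import Callable, Dict, List, Tuple

# (stage name, stage function, is_critical). All names are hypothetical.
Stage = Tuple[str, Callable[[Dict], Dict], bool]


def run_pipeline(stages: List[Stage], context: Dict) -> Dict:
    """Run stages in order; tolerate failures only in non-critical stages."""
    context = dict(context)  # work on a copy so the caller's dict is untouched
    context.setdefault("warnings", [])
    for name, func, is_critical in stages:
        try:
            # Each stage returns a dict of new outputs to merge into the context.
            context.update(func(context))
        except Exception as exc:
            if is_critical:
                # A critical stage cannot be skipped, so the run is aborted.
                raise RuntimeError(f"critical stage '{name}' failed") from exc
            # Auxiliary stage: record the problem and carry on without its output.
            context["warnings"].append(f"{name}: {exc}")
    return context


# Stand-in stages for illustration; no real services are called.
def write_script(ctx: Dict) -> Dict:
    return {"script": f"Script about {ctx['prompt']}"}


def add_music(ctx: Dict) -> Dict:
    raise ConnectionError("music service unavailable")  # simulated outage


def upload(ctx: Dict) -> Dict:
    return {"uploaded": True}


result = run_pipeline(
    [
        ("script", write_script, True),
        ("background_music", add_music, False),  # assumed auxiliary
        ("upload", upload, True),
    ],
    {"prompt": "black holes"},
)
print(result["warnings"])
```

In this sketch the simulated music outage is logged as a warning and the upload stage still runs; marking the same stage as critical would instead stop the run with a `RuntimeError`.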