# Content Extractor 🔥
Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.
## Features

- **YouTube videos**: extract title, description, transcript, author, duration, and view count
- **Instagram Reels**: extract caption, author, and engagement metrics (via browser automation)
- **Blog posts/articles**: extract title, author, content, tags, and publish date
- **Auto-save to Obsidian**: notes are automatically formatted and saved to your Obsidian vault
- **Smart summaries**: generate key points from the extracted content
## Installation

```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install
```
## Usage

### Basic Usage

```bash
# Extract from a YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from an Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from a blog post
python main.py "https://example.com/article"
```
### Advanced Options

```bash
# Specify the Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"

# Custom output filename
python main.py <url> --output "my-note-title"

# Save to a specific folder in the Obsidian vault
python main.py <url> --folder "Learning/YouTube"

# Only print content, don't save to Obsidian
python main.py <url> --no-save

# Generate a summary
python main.py <url> --summarize
```
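For a sense of how these flags fit together, here is a minimal `argparse` sketch of the CLI documented above. The flag names come from this README; the parser itself is an illustration, not the actual `main.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI matching the flags documented above."""
    parser = argparse.ArgumentParser(description="Extract content from a URL")
    parser.add_argument("url", help="YouTube, Instagram, or blog URL")
    parser.add_argument("--obsidian-path", help="Path to the Obsidian vault")
    parser.add_argument("--output", help="Custom output filename")
    parser.add_argument("--folder", help="Subfolder in the vault to save into")
    parser.add_argument("--no-save", action="store_true",
                        help="Print content instead of saving to Obsidian")
    parser.add_argument("--summarize", action="store_true",
                        help="Generate a summary of the extracted content")
    return parser
```

Note that `argparse` maps `--no-save` to the attribute `args.no_save`, so combinations like `--folder "Learning" --no-save` parse into a single namespace object.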
### Examples

```bash
# Save a YouTube tutorial to the Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract an Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Extract a blog post into an Articles folder
python main.py "https://medium.com/article" --folder "Articles"
```
## Configuration

Create a `.env` file in the project directory to customize settings:

```bash
cp .env.example .env
```

Edit `.env` with your preferences:

```ini
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# OpenAI/OpenRouter
OPENAI_API_KEY=your_key_here
OPENAI_URL=https://openrouter.ai/api/v1/chat/completions
OPENAI_MODEL=gpt-4o-mini
OPENAI_TIMEOUT=30

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
```
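To show how settings like these are typically consumed, here is a stdlib-only sketch of reading them from the environment with the documented defaults. The real `config.py` may load `.env` via `python-dotenv` and use different names; this is an assumption-labeled illustration.

```python
import os
from pathlib import Path


def _env_bool(name: str, default: bool) -> bool:
    """Interpret 'true'/'false'-style values from the environment."""
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")


def load_settings() -> dict:
    """Read the settings shown above, falling back to documented defaults."""
    return {
        "vault_path": Path(os.getenv("OBSIDIAN_VAULT_PATH",
                                     "~/Obsidian Vault")).expanduser(),
        "browser_headless": _env_bool("BROWSER_HEADLESS", True),
        "browser_timeout_ms": int(os.getenv("BROWSER_TIMEOUT", "30000")),
        "max_content_length": int(os.getenv("MAX_CONTENT_LENGTH", "10000")),
        "generate_summary": _env_bool("GENERATE_SUMMARY", True),
        "youtube_language": os.getenv("YOUTUBE_LANGUAGE", "en"),
    }
```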
## Output Format

Notes are saved in Markdown format with:

- Title and metadata (source, URL, extraction date)
- Author, duration, and views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization
Example output:

```markdown
# How to Build AI Agents

## Metadata
- **Source**: YouTube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description
Learn how to build AI agents from scratch...

## Content
[Full transcript or article text...]

## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags
#youtube #video #ai #agents #notes
```
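The layout above can be sketched as a simple template function. This is a hypothetical stand-in for `obsidian_writer.py`; the field names and signature are assumptions, not the module's actual API.

```python
def build_note(meta: dict, description: str, content: str,
               key_points: list, tags: list) -> str:
    """Assemble a Markdown note in the layout shown above."""
    lines = [f"# {meta['title']}", "", "## Metadata"]
    # Emit metadata fields in a fixed order, skipping any that are missing.
    for label, key in [("Source", "source"), ("URL", "url"),
                       ("Extracted", "extracted"), ("Author", "author"),
                       ("Duration", "duration"), ("Views", "views")]:
        if key in meta:
            lines.append(f"- **{label}**: {meta[key]}")
    lines += ["", "## Description", description,
              "", "## Content", content,
              "", "## Key Points", *[f"- {p}" for p in key_points],
              "", "---", "", "## Tags",
              " ".join(f"#{t}" for t in tags)]
    return "\n".join(lines)
```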
## Troubleshooting

### Instagram extraction fails

Instagram requires browser automation. Make sure you've run:

```bash
playwright install
```

If it still fails, Instagram may have changed its UI. The extractor has a fallback mode that will still extract basic info.
### YouTube transcript not available

Some videos don't have captions/transcripts. The extractor falls back to extracting the description only.
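That fallback can be expressed as a small wrapper. This is a sketch, not the extractor's actual code; the fetchers are passed in as zero-argument callables so the real transcript and description lookups could be swapped in.

```python
def extract_with_fallback(fetch_transcript, fetch_description):
    """Try the transcript first; fall back to the description on failure."""
    try:
        return {"content": fetch_transcript(), "source": "transcript"}
    except Exception:
        # Covers missing captions, disabled transcripts, network errors, etc.
        return {"content": fetch_description(), "source": "description"}
```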
### Obsidian vault not found

By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
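The precedence described here (flag, then environment variable, then default) can be sketched in a few lines; the function name is hypothetical.

```python
import os
from pathlib import Path


def resolve_vault_path(cli_path=None) -> Path:
    """Resolve the vault path: --obsidian-path flag first, then the
    OBSIDIAN_VAULT_PATH environment variable, then the default."""
    raw = cli_path or os.getenv("OBSIDIAN_VAULT_PATH") or "~/Obsidian Vault"
    return Path(raw).expanduser()
```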
## Project Structure

```
content-extractor/
├── main.py                  # Main entry point
├── config.py                # Configuration settings
├── obsidian_writer.py       # Obsidian note writer
├── requirements.txt         # Python dependencies
├── .env.example             # Example environment file
├── README.md                # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py     # YouTube extraction
    ├── instagram_extractor.py   # Instagram extraction
    └── blog_extractor.py        # Blog/article extraction
```
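Routing a URL to the right extractor module can be sketched by matching the hostname. How `main.py` actually dispatches is an assumption; this illustrates one common approach.

```python
from urllib.parse import urlparse


def pick_extractor(url: str) -> str:
    """Return the name of the extractor module a URL should be routed to."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in ("youtube.com", "m.youtube.com", "youtu.be"):
        return "youtube_extractor"
    if host in ("instagram.com", "m.instagram.com"):
        return "instagram_extractor"
    # Anything else is treated as a generic blog/article.
    return "blog_extractor"
```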
## Future Enhancements

- AI-powered summarization (using LLMs)
- Podcast/audio extraction (Whisper transcription)
- Twitter/X thread extraction
- LinkedIn post extraction
- Batch processing (extract from multiple URLs)
- Web interface
- Automatic tagging based on content
## License

MIT License - feel free to use and modify!

---

Built with 🔥 by RUBIUS for naki