# Content Extractor 🔥

Extract key information from URLs (YouTube, Instagram, blogs) and save it to Obsidian notes automatically.

## Features

- **YouTube Videos**: Extract title, description, transcript, author, duration, and view count
- **Instagram Reels**: Extract caption, author, and engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, and publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content

## Installation

```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install
```

## Usage

### Basic Usage

```bash
# Extract from a YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from an Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from a blog post
python main.py "https://example.com/article"
```

### Advanced Options

```bash
# Specify the Obsidian vault path
python main.py "<URL>" --obsidian-path "/path/to/Obsidian Vault"

# Use a custom output filename
python main.py "<URL>" --output "my-note-title"

# Save to a specific folder in the Obsidian vault
python main.py "<URL>" --folder "Learning/YouTube"

# Only print the content, don't save to Obsidian
python main.py "<URL>" --no-save

# Generate a summary
python main.py "<URL>" --summarize
```

### Examples

```bash
# Save a YouTube tutorial to the Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract an Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Extract a blog post into the Articles folder
python main.py "https://medium.com/article" --folder "Articles"
```

## Configuration

Create a `.env` file in the project directory to customize settings:

```bash
cp .env.example .env
```

Edit `.env` with your preferences:

```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
```

## Output Format

Notes are saved in markdown format with:

- Title and metadata (source, URL, extraction date)
- Author, duration, and views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization

Example output:

```markdown
# How to Build AI Agents

## Metadata
- **Source**: YouTube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description
Learn how to build AI agents from scratch...

## Content
[Full transcript or article text...]

## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags
#youtube #video #ai #agents #notes
```

## Troubleshooting

### Instagram extraction fails

Instagram requires browser automation. Make sure you've run:

```bash
playwright install
```

If extraction still fails, Instagram may have changed its UI. The extractor has a fallback mode that will still extract basic info.

### YouTube transcript not available

Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.

### Obsidian vault not found

By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
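To illustrate how the pieces above fit together, here is a minimal sketch of the vault-path fallback (CLI flag, then `OBSIDIAN_VAULT_PATH`, then the `~/Obsidian Vault` default) and of assembling a note in the documented layout. The helper names `resolve_vault_path` and `format_note` are hypothetical, not the project's actual API:

```python
import os
from datetime import datetime
from pathlib import Path


def resolve_vault_path(cli_path=None):
    """Resolve the Obsidian vault path.

    Hypothetical helper: prefers the --obsidian-path CLI value, then the
    OBSIDIAN_VAULT_PATH environment variable, then the documented default.
    """
    raw = cli_path or os.environ.get("OBSIDIAN_VAULT_PATH", "~/Obsidian Vault")
    return Path(raw).expanduser()


def format_note(title, source, url, content, tags):
    """Assemble a note body in the markdown layout shown in Output Format."""
    extracted = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    tag_line = " ".join(f"#{t}" for t in tags)
    return (
        f"# {title}\n\n"
        f"## Metadata\n"
        f"- **Source**: {source}\n"
        f"- **URL**: {url}\n"
        f"- **Extracted**: {extracted}\n\n"
        f"## Content\n{content}\n\n"
        f"---\n\n"
        f"## Tags\n{tag_line}\n"
    )
```

Saving then reduces to writing `format_note(...)` to a `.md` file under `resolve_vault_path(...)`, optionally inside the `--folder` subdirectory.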
## Project Structure

```
content-extractor/
├── main.py                     # Main entry point
├── config.py                   # Configuration settings
├── obsidian_writer.py          # Obsidian note writer
├── requirements.txt            # Python dependencies
├── .env.example                # Example environment file
├── README.md                   # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py    # YouTube extraction
    ├── instagram_extractor.py  # Instagram extraction
    └── blog_extractor.py       # Blog/article extraction
```

## Future Enhancements

- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (Whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content

## License

MIT License - feel free to use and modify!

---

Built with 🔥 by RUBIUS for naki