feat: Initial commit - Content Extractor for YouTube, Instagram, and blogs

- YouTube extraction with transcript support
- Instagram reel extraction via browser automation
- Blog/article web scraping
- Auto-save to Obsidian vaults
- Smart key point generation
- Configurable via .env file
- Quick extract shell script

Tech stack: Python, requests, beautifulsoup4, playwright, youtube-transcript-api
2026-03-05 13:02:58 +05:30
commit c997e764b5
12 changed files with 1302 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,192 @@
+# Content Extractor 🔥
+
+Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.
+
+## Features
+
+- **YouTube Videos**: Extract title, description, transcript, author, duration, views
+- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
+- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
+- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
+- **Smart Summaries**: Generates key points from extracted content
+
+## Installation
+
+```bash
+# Navigate to the content-extractor directory (adjust the path to your checkout)
+cd ~/Desktop/itsthatnewshit/content-extractor
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Install Playwright browsers (for Instagram extraction)
+playwright install
+```
+
+## Usage
+
+### Basic Usage
+
+```bash
+# Extract from YouTube video
+python main.py "https://www.youtube.com/watch?v=VIDEO_ID"
+
+# Extract from Instagram reel
+python main.py "https://www.instagram.com/reel/REEL_ID"
+
+# Extract from blog post
+python main.py "https://example.com/article"
+```
+
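The three commands above differ only in the URL, so the tool has to decide which extractor to run from the URL itself. A minimal sketch of that routing, assuming plain hostname matching; `detect_source` and the host sets are hypothetical names for illustration, not confirmed contents of `main.py`:

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube or Instagram (assumed lists); anything else
# is handled as a generic blog/article page.
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}
INSTAGRAM_HOSTS = {"instagram.com"}


def detect_source(url: str) -> str:
    """Return "youtube", "instagram", or "blog" based on the URL's hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.youtube.com like youtube.com
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host in INSTAGRAM_HOSTS:
        return "instagram"
    return "blog"
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids misclassifying a blog post whose path or query string happens to contain "youtube.com".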
+### Advanced Options
+
+```bash
+# Specify Obsidian vault path
+python main.py <url> --obsidian-path "/path/to/Obsidian Vault"
+
+# Custom output filename
+python main.py <url> --output "my-note-title"
+
+# Save to specific folder in Obsidian
+python main.py <url> --folder "Learning/YouTube"
+
+# Only print content, don't save to Obsidian
+python main.py <url> --no-save
+
+# Generate summary
+python main.py <url> --summarize
+```
+
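The flags listed above map naturally onto `argparse`. The following is only a sketch of how such a parser could be declared: the flag names come from this README, while `build_parser`, the help strings, and the defaults are assumptions rather than the actual code in `main.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the options documented in this README (sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract content from a URL and save it as an Obsidian note.",
    )
    parser.add_argument("url", help="YouTube, Instagram, or blog URL")
    parser.add_argument("--obsidian-path", default=None,
                        help="path to the Obsidian vault")
    parser.add_argument("--output", default=None,
                        help="custom note filename")
    parser.add_argument("--folder", default=None,
                        help="folder inside the vault")
    parser.add_argument("--no-save", action="store_true",
                        help="print the content instead of saving it")
    parser.add_argument("--summarize", action="store_true",
                        help="generate a summary")
    return parser
```

With this layout, `build_parser().parse_args()` exposes the dashed flags as `args.obsidian_path` and `args.no_save`, since `argparse` converts dashes to underscores.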
+### Examples
+
+```bash
+# Save YouTube tutorial to Learning folder
+python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"
+
+# Extract Instagram reel without saving
+python main.py "https://instagram.com/reel/xyz789" --no-save
+
+# Save a blog post to the "Articles" folder of the default vault
+python main.py "https://medium.com/article" --folder "Articles"
+```
+
+## Configuration
+
+Create a `.env` file in the project directory to customize settings:
+
+```bash
+cp .env.example .env
+```
+
+Edit `.env` with your preferences:
+
+```env
+# Obsidian vault path
+OBSIDIAN_VAULT_PATH=~/Obsidian Vault
+
+# Browser settings (for Instagram)
+BROWSER_HEADLESS=true
+BROWSER_TIMEOUT=30000
+
+# Content extraction
+MAX_CONTENT_LENGTH=10000
+GENERATE_SUMMARY=true
+
+# YouTube
+YOUTUBE_LANGUAGE=en
+
+# Instagram
+INSTAGRAM_WAIT_TIME=5
+```
+
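One way these variables could be turned into typed settings is sketched below. It reads from a mapping such as `os.environ` and uses the example values above as defaults. `Settings`, `load_settings`, and the accepted boolean spellings are hypothetical, and the units noted in the comments are guesses; the real `config.py` may differ.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    vault_path: Path
    browser_headless: bool
    browser_timeout: int      # presumably milliseconds (assumption)
    max_content_length: int
    generate_summary: bool
    youtube_language: str
    instagram_wait_time: int  # presumably seconds (assumption)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from env, falling back to the values in the example .env."""
    return Settings(
        # expanduser turns a leading "~" into the user's home directory
        vault_path=Path(os.path.expanduser(env.get("OBSIDIAN_VAULT_PATH", "~/Obsidian Vault"))),
        browser_headless=_as_bool(env.get("BROWSER_HEADLESS", "true")),
        browser_timeout=int(env.get("BROWSER_TIMEOUT", "30000")),
        max_content_length=int(env.get("MAX_CONTENT_LENGTH", "10000")),
        generate_summary=_as_bool(env.get("GENERATE_SUMMARY", "true")),
        youtube_language=env.get("YOUTUBE_LANGUAGE", "en"),
        instagram_wait_time=int(env.get("INSTAGRAM_WAIT_TIME", "5")),
    )
```

Passing the environment in as a mapping keeps the function easy to test; loading the `.env` file into the process environment beforehand is left to whatever loader the project uses.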
+## Output Format
+
+Notes are saved as Markdown files containing:
+
+- Title and metadata (source, URL, extraction date)
+- Author, duration, views (when available)
+- Description/summary
+- Full content (transcript or article text)
+- Key points
+- Tags for easy organization
+
+Example output:
+
+```markdown
+# How to Build AI Agents
+
+## Metadata
+- **Source**: Youtube
+- **URL**: https://youtube.com/watch?v=abc123
+- **Extracted**: 2026-02-21 15:30:00
+- **Author**: Tech Channel
+- **Duration**: 12:34
+- **Views**: 1.2M
+
+## Description
+Learn how to build AI agents from scratch...
+
+## Content
+[Full transcript or article text...]
+
+## Key Points
+- Point 1 from the content
+- Point 2 from the content
+- Point 3 from the content
+
+---
+
+## Tags
+#youtube #video #ai #agents #notes
+```
+
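The layout of the example note can be reproduced with a small rendering function. This is a sketch that mirrors the sections shown above; `render_note` and its parameters are hypothetical and not taken from `obsidian_writer.py`.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional


def render_note(title: str, source: str, url: str, extracted: datetime,
                description: str = "", content: str = "",
                key_points: Iterable[str] = (), tags: Iterable[str] = (),
                extra: Optional[Dict[str, str]] = None) -> str:
    """Render a note in the layout shown in the example output."""
    lines = [
        f"# {title}",
        "",
        "## Metadata",
        f"- **Source**: {source}",
        f"- **URL**: {url}",
        f"- **Extracted**: {extracted:%Y-%m-%d %H:%M:%S}",
    ]
    # Optional fields such as Author, Duration, Views are added when available.
    for label, value in (extra or {}).items():
        lines.append(f"- **{label}**: {value}")
    lines += ["", "## Description", description,
              "", "## Content", content,
              "", "## Key Points"]
    lines += [f"- {point}" for point in key_points]
    lines += ["", "---", "", "## Tags", " ".join(f"#{tag}" for tag in tags)]
    return "\n".join(lines) + "\n"
```

Keeping the optional metadata in a dictionary lets each extractor supply only the fields it can actually retrieve, which matches the "when available" wording in the list above.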
+## Troubleshooting
+
+### Instagram extraction fails
+Instagram requires browser automation. Make sure you've run:
+```bash
+playwright install
+```
+
+If it still fails, Instagram may have changed its UI. The extractor has a fallback mode that still extracts basic info.
+
+### YouTube transcript not available
+Some videos have no captions or transcripts. In that case the extractor falls back to the description only.
+
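The transcript fallback can be sketched independently of any particular transcript library by passing the fetcher in as a function. `build_youtube_content` and `fetch_transcript` are hypothetical names; how the real extractor calls `youtube-transcript-api` is not shown in this README.

```python
from typing import Callable, Optional, Tuple


def build_youtube_content(video_id: str, description: str,
                          fetch_transcript: Callable[[str], Optional[str]]
                          ) -> Tuple[str, bool]:
    """Return (content, has_transcript), using the description as fallback."""
    try:
        transcript = fetch_transcript(video_id)
    except Exception:
        # Any fetch error (captions disabled, network failure, ...) is
        # treated the same way as a video without a transcript.
        transcript = None
    if transcript:
        return transcript, True
    return description, False
```

Returning a flag alongside the content lets the note writer label the "Content" section honestly when only the description is available.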
+### Obsidian vault not found
+By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
+
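A sketch of the vault lookup described above, assuming the command-line flag takes precedence over the environment variable, which in turn takes precedence over the default. The README does not state that order, and `resolve_vault` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_VAULT = "~/Obsidian Vault"  # default location named in this README


def resolve_vault(cli_path: Optional[str],
                  env: Mapping[str, str] = os.environ) -> Path:
    """Pick the vault directory and fail early if it does not exist."""
    # Assumed precedence: --obsidian-path, then OBSIDIAN_VAULT_PATH, then default.
    raw = cli_path or env.get("OBSIDIAN_VAULT_PATH") or DEFAULT_VAULT
    vault = Path(os.path.expanduser(raw))
    if not vault.is_dir():
        raise FileNotFoundError(f"Obsidian vault not found: {vault}")
    return vault
```

Failing before any extraction starts avoids running a slow browser session only to discover that the note cannot be written.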
+## Project Structure
+
+```
+content-extractor/
+├── main.py                # Main entry point
+├── config.py              # Configuration settings
+├── obsidian_writer.py     # Obsidian note writer
+├── requirements.txt       # Python dependencies
+├── .env.example           # Example environment file
+├── README.md              # This file
+└── extractors/
+    ├── __init__.py
+    ├── youtube_extractor.py    # YouTube extraction
+    ├── instagram_extractor.py  # Instagram extraction
+    └── blog_extractor.py       # Blog/article extraction
+```
+
+## Future Enhancements
+
+- [ ] AI-powered summarization (using LLMs)
+- [ ] Podcast/audio extraction (Whisper transcription)
+- [ ] Twitter/X thread extraction
+- [ ] LinkedIn post extraction
+- [ ] Batch processing (extract from multiple URLs)
+- [ ] Web interface
+- [ ] Automatic tagging based on content
+
+## License
+
+MIT License - Feel free to use and modify!
+
+---
+
+Built with 🔥 by RUBIUS for naki