feat: Initial commit - Content Extractor for YouTube, Instagram, and blogs

- YouTube extraction with transcript support
- Instagram reel extraction via browser automation
- Blog/article web scraping
- Auto-save to Obsidian vaults
- Smart key point generation
- Configurable via .env file
- Quick extract shell script

Tech stack: Python, requests, beautifulsoup4, playwright, youtube-transcript-api
naki
2026-03-05 13:02:58 +05:30
commit c997e764b5
12 changed files with 1302 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,192 @@
# Content Extractor 🔥
Extract key information from URLs (YouTube, Instagram, blogs) and save it automatically as Obsidian notes.
## Features
- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content
## Installation
```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor
# Install dependencies
pip install -r requirements.txt
# Install Playwright browsers (for Instagram extraction)
playwright install
```
## Usage
### Basic Usage
```bash
# Extract from YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"
# Extract from Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"
# Extract from blog post
python main.py "https://example.com/article"
```
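The three commands above differ only in the URL, so the tool has to pick an extractor from the URL itself. How `main.py` does this is not shown here; the following is a minimal sketch of one possible approach, and `detect_source` is a hypothetical name, not part of the project.

```python
from urllib.parse import urlparse


def detect_source(url: str) -> str:
    """Return "youtube", "instagram", or "blog" based on the URL's host.

    Hypothetical helper: the real dispatch logic in main.py may differ.
    """
    host = (urlparse(url).hostname or "").lower()
    # Drop common prefixes so "www.youtube.com" and "m.youtube.com" match too.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host in ("youtube.com", "youtu.be"):
        return "youtube"
    if host == "instagram.com":
        return "instagram"
    # Anything else is treated as a generic blog/article page.
    return "blog"
```

With this sketch, short links such as `https://youtu.be/abc123` map to `"youtube"`, and any unrecognized host falls through to `"blog"`.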
### Advanced Options
```bash
# Specify Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"
# Custom output filename
python main.py <url> --output "my-note-title"
# Save to specific folder in Obsidian
python main.py <url> --folder "Learning/YouTube"
# Only print content, don't save to Obsidian
python main.py <url> --no-save
# Generate summary
python main.py <url> --summarize
```
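For readers who want to see how these flags fit together, here is a sketch of a matching command-line parser. It assumes `main.py` uses the standard-library `argparse` module, which the README does not confirm; only the flag names are taken from the options above, and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser for the flags documented above (assumed, not the real one)."""
    parser = argparse.ArgumentParser(description="Extract content from a URL")
    parser.add_argument("url", help="YouTube, Instagram, or blog URL")
    parser.add_argument("--obsidian-path", help="Path to the Obsidian vault")
    parser.add_argument("--output", help="Custom note filename")
    parser.add_argument("--folder", help="Folder inside the vault")
    parser.add_argument("--no-save", action="store_true",
                        help="Print the content only; do not write a note")
    parser.add_argument("--summarize", action="store_true",
                        help="Generate a summary")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["https://youtu.be/abc123", "--folder", "Learning", "--no-save"]
)
```

`argparse` converts dashes to underscores, so `--obsidian-path` and `--no-save` are read as `args.obsidian_path` and `args.no_save`.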
### Examples
```bash
# Save YouTube tutorial to Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"
# Extract Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save
# Save a blog post to the Articles folder in the default vault
python main.py "https://medium.com/article" --folder "Articles"
```
## Configuration
Create a `.env` file in the project directory to customize settings:
```bash
cp .env.example .env
```
Edit `.env` with your preferences:
```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault
# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000
# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true
# YouTube
YOUTUBE_LANGUAGE=en
# Instagram
INSTAGRAM_WAIT_TIME=5
```
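The values in `.env` are plain strings, so they need converting before use. The sketch below shows one way `config.py` might do that, using only the standard library. It assumes the variables have already been loaded into the process environment; `load_settings`, the dictionary keys, and the use of the example values as defaults are all assumptions for illustration.

```python
import os
from pathlib import Path


def load_settings(env=os.environ) -> dict:
    """Read the settings listed above from an environment-like mapping.

    Hypothetical helper: defaults mirror the example .env, which is an assumption.
    """
    def as_bool(value: str) -> bool:
        return value.strip().lower() in ("1", "true", "yes", "on")

    return {
        # expanduser() turns a leading "~" into the user's home directory.
        "vault_path": Path(env.get("OBSIDIAN_VAULT_PATH", "~/Obsidian Vault")).expanduser(),
        "browser_headless": as_bool(env.get("BROWSER_HEADLESS", "true")),
        "browser_timeout": int(env.get("BROWSER_TIMEOUT", "30000")),
        "max_content_length": int(env.get("MAX_CONTENT_LENGTH", "10000")),
        "generate_summary": as_bool(env.get("GENERATE_SUMMARY", "true")),
        "youtube_language": env.get("YOUTUBE_LANGUAGE", "en"),
        "instagram_wait_time": int(env.get("INSTAGRAM_WAIT_TIME", "5")),
    }
```

Passing the mapping in as a parameter keeps the function easy to test: a plain dict can stand in for `os.environ`.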
## Output Format
Notes are saved in Markdown format with:
- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization
Example output:
```markdown
# How to Build AI Agents
## Metadata
- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M
## Description
Learn how to build AI agents from scratch...
## Content
[Full transcript or article text...]
## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content
---
## Tags
#youtube #video #ai #agents #notes
```
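The layout above can be produced with simple string assembly. The following is a minimal sketch, not the project's actual `obsidian_writer.py`; `render_note` and the dictionary keys it reads are hypothetical names chosen for illustration.

```python
from datetime import datetime


def render_note(data: dict) -> str:
    """Build the markdown layout shown above from a dict of extracted fields.

    Hypothetical helper: the field names are assumptions, not the project's schema.
    """
    extracted = data.get("extracted") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    lines = [f"# {data['title']}", "## Metadata"]
    lines.append(f"- **Source**: {data['source']}")
    lines.append(f"- **URL**: {data['url']}")
    lines.append(f"- **Extracted**: {extracted}")
    # Optional fields are written only when the extractor found them.
    for key in ("author", "duration", "views"):
        if data.get(key):
            lines.append(f"- **{key.capitalize()}**: {data[key]}")
    if data.get("description"):
        lines += ["## Description", data["description"]]
    lines += ["## Content", data.get("content", "")]
    if data.get("key_points"):
        lines.append("## Key Points")
        lines += [f"- {point}" for point in data["key_points"]]
    if data.get("tags"):
        lines += ["---", "## Tags", " ".join(f"#{tag}" for tag in data["tags"])]
    return "\n".join(lines) + "\n"
```

Skipping missing optional fields means an Instagram or blog note, which has no duration, simply omits that metadata line.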
## Troubleshooting
### Instagram extraction fails
Instagram requires browser automation. Make sure you've run:
```bash
playwright install
```
If it still fails, Instagram may have changed its UI. In that case the extractor uses a fallback mode that still extracts basic information.
### YouTube transcript not available
Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
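The fallback described here can be sketched as a small wrapper. `transcript_or_description` is a hypothetical name, and `fetch_transcript` stands in for whatever function the project uses to call `youtube-transcript-api`; no real library call is shown, because its exact usage in this project is not documented here.

```python
from typing import Callable, Tuple


def transcript_or_description(
    video_id: str,
    description: str,
    fetch_transcript: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (content, has_transcript), falling back to the description.

    fetch_transcript is an assumed callable that returns the transcript text
    or raises when no captions are available.
    """
    try:
        transcript = fetch_transcript(video_id)
    except Exception:
        # Captions disabled, missing, or the request failed: use the fallback.
        transcript = ""
    if transcript.strip():
        return transcript, True
    return description, False
```

Returning a flag alongside the text lets the note writer record whether the content section holds a transcript or only the description.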
### Obsidian vault not found
By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
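A sketch of how the lookup might work is shown below. The precedence order (flag first, then `OBSIDIAN_VAULT_PATH`, then the default) is an assumption, as is the name `resolve_vault_path`; the README only states that both overrides exist.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_vault_path(cli_path: Optional[str] = None, env=os.environ) -> Path:
    """Pick the vault path: flag first, then environment, then the default.

    Hypothetical helper: the real precedence in the tool may differ.
    """
    raw = cli_path or env.get("OBSIDIAN_VAULT_PATH") or "~/Obsidian Vault"
    path = Path(raw).expanduser()
    if not path.is_dir():
        # Fail early with the resolved path so the user can see what was tried.
        raise FileNotFoundError(f"Obsidian vault not found: {path}")
    return path
```

Including the resolved path in the error message makes it clear whether the flag, the `.env` value, or the default was used.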
## Project Structure
```
content-extractor/
├── main.py                    # Main entry point
├── config.py                  # Configuration settings
├── obsidian_writer.py         # Obsidian note writer
├── requirements.txt           # Python dependencies
├── .env.example               # Example environment file
├── README.md                  # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py   # YouTube extraction
    ├── instagram_extractor.py # Instagram extraction
    └── blog_extractor.py      # Blog/article extraction
```
## Future Enhancements
- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content
## License
MIT License - Feel free to use and modify!
---
Built with 🔥 by RUBIUS for naki