- YouTube extraction with transcript support
- Instagram reel extraction via browser automation
- Blog/article web scraping
- Auto-save to Obsidian vaults
- Smart key point generation
- Configurable via .env file
- Quick extract shell script

Tech stack: Python, requests, beautifulsoup4, playwright, youtube-transcript-api
# Content Extractor 🔥

Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.

## Features

- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content

## Installation

```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install
```

## Usage

### Basic Usage

```bash
# Extract from YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from blog post
python main.py "https://example.com/article"
```
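The three commands above differ only in the URL, so the tool has to decide which extractor to run. A minimal sketch of one way to do that, assuming the decision is made from the hostname alone; `detect_source` and the host list are hypothetical and not taken from `main.py`:

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube in this sketch (an assumption, not the tool's actual list).
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}


def detect_source(url: str) -> str:
    """Return "youtube", "instagram", or "blog" based on the URL's hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host == "instagram.com":
        return "instagram"
    # Anything else is handled as a generic blog post or article.
    return "blog"


if __name__ == "__main__":
    for example in (
        "https://www.youtube.com/watch?v=VIDEO_ID",
        "https://www.instagram.com/reel/REEL_ID",
        "https://example.com/article",
    ):
        print(detect_source(example), example)
```

Treating every unrecognized host as a blog keeps the fallback simple, at the cost of sending unsupported sites through the article scraper.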

### Advanced Options

```bash
# Specify Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"

# Custom output filename
python main.py <url> --output "my-note-title"

# Save to specific folder in Obsidian
python main.py <url> --folder "Learning/YouTube"

# Only print content, don't save to Obsidian
python main.py <url> --no-save

# Generate summary
python main.py <url> --summarize
```
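For readers who want to see how these flags fit together, here is a sketch of a matching command-line parser built with the standard `argparse` module. The flag names come from the examples above; the function name `build_parser`, the help strings, and the defaults are assumptions, and the real `main.py` may be organized differently:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags documented above."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract content from a URL and save it as an Obsidian note.",
    )
    parser.add_argument("url", help="YouTube, Instagram, or article URL")
    parser.add_argument("--obsidian-path", help="path to the Obsidian vault")
    parser.add_argument("--output", help="custom note filename")
    parser.add_argument("--folder", help="folder inside the vault")
    parser.add_argument("--no-save", action="store_true",
                        help="print the content instead of saving it")
    parser.add_argument("--summarize", action="store_true",
                        help="generate a summary")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the demo does not depend on sys.argv.
    demo = build_parser().parse_args(
        ["https://youtu.be/abc123", "--folder", "Learning", "--no-save"]
    )
    print(demo)
```

`argparse` converts `--obsidian-path` and `--no-save` to the attributes `obsidian_path` and `no_save`.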

### Examples

```bash
# Save YouTube tutorial to Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Extract blog post to the Articles folder of the default vault
python main.py "https://medium.com/article" --folder "Articles"
```

## Configuration

Create a `.env` file in the project directory to customize settings:

```bash
cp .env.example .env
```

Edit `.env` with your preferences:

```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
```
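The sketch below shows how these variables could be turned into typed settings. It assumes the `.env` values have already been loaded into the process environment (for example by a loader such as python-dotenv) and uses the values shown above as defaults. The `Settings` class and `load_settings` function are hypothetical names, not necessarily what `config.py` contains:

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    obsidian_vault_path: Path
    browser_headless: bool
    browser_timeout: int
    max_content_length: int
    generate_summary: bool
    youtube_language: str
    instagram_wait_time: int


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented example values."""
    return Settings(
        # "~" is not expanded automatically, so expand it here.
        obsidian_vault_path=Path(env.get("OBSIDIAN_VAULT_PATH", "~/Obsidian Vault")).expanduser(),
        browser_headless=_as_bool(env.get("BROWSER_HEADLESS", "true")),
        browser_timeout=int(env.get("BROWSER_TIMEOUT", "30000")),
        max_content_length=int(env.get("MAX_CONTENT_LENGTH", "10000")),
        generate_summary=_as_bool(env.get("GENERATE_SUMMARY", "true")),
        youtube_language=env.get("YOUTUBE_LANGUAGE", "en"),
        instagram_wait_time=int(env.get("INSTAGRAM_WAIT_TIME", "5")),
    )
```

Passing the environment as a mapping makes the loader easy to exercise with a plain dictionary instead of real environment variables.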

## Output Format

Notes are saved in markdown format with:

- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization

Example output:

```markdown
# How to Build AI Agents

## Metadata
- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description
Learn how to build AI agents from scratch...

## Content
[Full transcript or article text...]

## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags
#youtube #video #ai #agents #notes
```
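A note in this layout can be assembled from plain strings. The following is a rough sketch that reproduces the section order of the example; `render_note` and its parameters are hypothetical, and the actual `obsidian_writer.py` may build notes differently:

```python
from datetime import datetime
from typing import Optional, Sequence


def render_note(
    title: str,
    source: str,
    url: str,
    extracted: datetime,
    content: str,
    key_points: Sequence[str] = (),
    tags: Sequence[str] = (),
    description: str = "",
    extra_metadata: Optional[dict] = None,
) -> str:
    """Return the note as a Markdown string in the layout shown above."""
    lines = [
        f"# {title}",
        "",
        "## Metadata",
        f"- **Source**: {source}",
        f"- **URL**: {url}",
        f"- **Extracted**: {extracted:%Y-%m-%d %H:%M:%S}",
    ]
    # Optional fields such as Author, Duration, or Views appear only when provided.
    for key, value in (extra_metadata or {}).items():
        lines.append(f"- **{key}**: {value}")
    if description:
        lines += ["", "## Description", description]
    lines += ["", "## Content", content]
    if key_points:
        lines += ["", "## Key Points"] + [f"- {point}" for point in key_points]
    if tags:
        lines += ["", "---", "", "## Tags", " ".join(f"#{tag}" for tag in tags)]
    return "\n".join(lines) + "\n"
```

Sections with no data are skipped, which matches the "when available" wording in the list above.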

## Troubleshooting

### Instagram extraction fails
Instagram requires browser automation. Make sure you've run:
```bash
playwright install
```

If it still fails, Instagram may have changed its UI. The extractor has a fallback mode that will still extract basic info.

### YouTube transcript not available
Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
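
The fallback described here can be pictured as a small wrapper around whatever call fetches the transcript. This is only an illustration: `fetch_transcript` stands in for the real transcript lookup, and `transcript_or_description` is a hypothetical helper, not a function from the project:

```python
from typing import Callable, Tuple


def transcript_or_description(
    fetch_transcript: Callable[[], str],
    description: str,
) -> Tuple[str, str]:
    """Return (content, kind), where kind is "transcript" or "description"."""
    try:
        transcript = fetch_transcript()
    except Exception:
        # Missing captions, network errors, and similar failures all trigger the fallback.
        transcript = ""
    if transcript.strip():
        return transcript, "transcript"
    return description, "description"
```

Returning the kind alongside the content lets the caller record in the note whether it holds a full transcript or only the description.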

### Obsidian vault not found
By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
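
A sketch of how the lookup could work, assuming the command-line flag takes precedence over the environment variable, which in turn takes precedence over the default. That ordering is an assumption, and `resolve_vault_path` is a hypothetical helper:

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_VAULT = "~/Obsidian Vault"


def resolve_vault_path(
    cli_path: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the vault path (flag, then environment, then default) and check it exists."""
    raw = cli_path or env.get("OBSIDIAN_VAULT_PATH") or DEFAULT_VAULT
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(
            f"Obsidian vault not found at {path}; "
            "pass --obsidian-path or set OBSIDIAN_VAULT_PATH"
        )
    return path
```

Failing early with a message that names both options is usually friendlier than a write error halfway through saving a note.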

## Project Structure

```
content-extractor/
├── main.py                    # Main entry point
├── config.py                  # Configuration settings
├── obsidian_writer.py         # Obsidian note writer
├── requirements.txt           # Python dependencies
├── .env.example               # Example environment file
├── README.md                  # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py   # YouTube extraction
    ├── instagram_extractor.py # Instagram extraction
    └── blog_extractor.py      # Blog/article extraction
```

## Future Enhancements

- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content

## License

MIT License - Feel free to use and modify!

---

Built with 🔥 by RUBIUS for naki