Content Extractor 🔥

Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.

Features

  • YouTube Videos: Extract title, description, transcript, author, duration, views
  • Instagram Reels: Extract caption, author, engagement metrics (via browser automation)
  • Blog Posts/Articles: Extract title, author, content, tags, publish date
  • Auto-save to Obsidian: Notes are automatically formatted and saved to your Obsidian vault
  • Smart Summaries: Generates key points from extracted content
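Routing a URL to the right extractor is the first step of any run. A minimal sketch of that dispatch logic, assuming a simple hostname check (the function name and return values here are illustrative, not the project's actual API):

```python
from urllib.parse import urlparse

def pick_extractor(url: str) -> str:
    """Return which extractor a URL should be routed to.

    A hedged sketch of the dispatch step; the real project keeps its
    extractors in the extractors/ package.
    """
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in ("youtube.com", "youtu.be", "m.youtube.com"):
        return "youtube"
    if host in ("instagram.com", "m.instagram.com"):
        return "instagram"
    return "blog"  # everything else is treated as an article
```

Anything that is not YouTube or Instagram falls through to the blog/article extractor, which matches the three categories listed above.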

Installation

# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install

Usage

Basic Usage

# Extract from YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from blog post
python main.py "https://example.com/article"

Advanced Options

# Specify Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"

# Custom output filename
python main.py <url> --output "my-note-title"

# Save to specific folder in Obsidian
python main.py <url> --folder "Learning/YouTube"

# Only print content, don't save to Obsidian
python main.py <url> --no-save

# Generate summary
python main.py <url> --summarize
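The flags above map naturally onto an argparse parser. A sketch of what main.py's argument handling might look like (help strings and the exact parser layout are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the CLI flags documented above; details are illustrative.
    p = argparse.ArgumentParser(
        prog="main.py",
        description="Extract content from a URL into an Obsidian note",
    )
    p.add_argument("url", help="YouTube, Instagram, or blog URL")
    p.add_argument("--obsidian-path", help="path to the Obsidian vault")
    p.add_argument("--output", help="custom note filename")
    p.add_argument("--folder", help="subfolder inside the vault")
    p.add_argument("--no-save", action="store_true",
                   help="print only, do not save to Obsidian")
    p.add_argument("--summarize", action="store_true",
                   help="generate key points from the content")
    return p
```

Note that argparse turns `--no-save` into `args.no_save`, so the rest of the program checks a plain boolean.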

Examples

# Save YouTube tutorial to Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Extract blog post to default vault
python main.py "https://medium.com/article" --folder "Articles"

Configuration

Create a .env file in the project directory to customize settings:

cp .env.example .env

Edit .env with your preferences:

# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
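Reading these settings back can be as simple as the sketch below. The variable names match the `.env` above, but the assumption that config.py reads them via `os.environ` with these defaults is mine:

```python
import os
from pathlib import Path

def load_config() -> dict:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ
    return {
        "vault_path": Path(env.get("OBSIDIAN_VAULT_PATH", "~/Obsidian Vault")).expanduser(),
        "headless": env.get("BROWSER_HEADLESS", "true").lower() == "true",
        "timeout_ms": int(env.get("BROWSER_TIMEOUT", "30000")),
        "max_content_length": int(env.get("MAX_CONTENT_LENGTH", "10000")),
        "summarize": env.get("GENERATE_SUMMARY", "true").lower() == "true",
        "youtube_language": env.get("YOUTUBE_LANGUAGE", "en"),
        "instagram_wait_s": int(env.get("INSTAGRAM_WAIT_TIME", "5")),
    }
```

Using `expanduser()` on the vault path means the `~` in the default resolves to your home directory.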

Output Format

Notes are saved as Markdown files containing:

  • Title and metadata (source, URL, extraction date)
  • Author, duration, views (when available)
  • Description/summary
  • Full content (transcript or article text)
  • Key points
  • Tags for easy organization

Example output:

# How to Build AI Agents

## Metadata
- **Source**: YouTube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description
Learn how to build AI agents from scratch...

## Content
[Full transcript or article text...]

## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags
#youtube #video #ai #agents #notes
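A note in that shape can be assembled from a plain extraction dict. A minimal sketch of what obsidian_writer.py might do (the field names in the dict are assumptions):

```python
def format_note(data: dict) -> str:
    """Render an extraction result as a Markdown note in the layout shown above."""
    lines = [f"# {data['title']}", "", "## Metadata"]
    for key in ("source", "url", "extracted", "author", "duration", "views"):
        if data.get(key):
            label = "URL" if key == "url" else key.capitalize()
            lines.append(f"- **{label}**: {data[key]}")
    if data.get("description"):
        lines += ["", "## Description", data["description"]]
    if data.get("content"):
        lines += ["", "## Content", data["content"]]
    if data.get("key_points"):
        lines += ["", "## Key Points"] + [f"- {p}" for p in data["key_points"]]
    if data.get("tags"):
        lines += ["", "---", "", "## Tags", " ".join(f"#{t}" for t in data["tags"])]
    return "\n".join(lines) + "\n"
```

Sections are emitted only when the corresponding data exists, which matches the "when available" wording above.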

Troubleshooting

Instagram extraction fails

Instagram requires browser automation. Make sure you've run:

playwright install

If it still fails, Instagram may have changed its UI. The extractor has a fallback mode that will still capture basic information.
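A fallback like the one mentioned above can be approximated by reading the page's Open Graph meta tags, which needs no browser at all. This regex-based sketch is my assumption about how such a fallback could work, not the project's actual implementation (a real HTML parser would be more robust):

```python
import re

def extract_og_tags(html: str) -> dict:
    """Pull og:* meta tags (title, description, image, ...) out of raw HTML.

    A best-effort fallback for when browser automation fails.
    """
    tags = {}
    pattern = r'<meta[^>]+property="og:(\w+)"[^>]+content="([^"]*)"'
    for name, content in re.findall(pattern, html):
        tags[name] = content
    return tags
```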

YouTube transcript not available

Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
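That fallback can be expressed as a small wrapper that tries the transcript first and degrades gracefully. The sketch below uses injected callables to stay library-agnostic; the real extractor wraps an actual transcript API:

```python
def extract_with_fallback(get_transcript, get_description, url: str) -> dict:
    """Try a transcript; fall back to the description if unavailable.

    get_transcript / get_description are hypothetical callables supplied
    by the caller so this sketch does not depend on any specific library.
    """
    try:
        return {"content": get_transcript(url), "source_field": "transcript"}
    except Exception:  # captions disabled, missing, or fetch error
        return {"content": get_description(url), "source_field": "description"}
```

Recording which field the content came from lets the note writer label it honestly.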

Obsidian vault not found

By default, the tool looks for ~/Obsidian Vault. If your vault is elsewhere, use the --obsidian-path flag or set OBSIDIAN_VAULT_PATH in your .env file.
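The lookup order implied above (flag first, then environment variable, then the default) could be sketched as:

```python
import os
from pathlib import Path

def resolve_vault(cli_path=None) -> Path:
    """Resolve the Obsidian vault path.

    --obsidian-path beats OBSIDIAN_VAULT_PATH, which beats the
    ~/Obsidian Vault default; this ordering is an assumption.
    """
    raw = cli_path or os.environ.get("OBSIDIAN_VAULT_PATH") or "~/Obsidian Vault"
    return Path(raw).expanduser()
```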

Project Structure

content-extractor/
├── main.py                 # Main entry point
├── config.py              # Configuration settings
├── obsidian_writer.py     # Obsidian note writer
├── requirements.txt       # Python dependencies
├── .env.example          # Example environment file
├── README.md             # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py    # YouTube extraction
    ├── instagram_extractor.py  # Instagram extraction
    └── blog_extractor.py       # Blog/article extraction

Future Enhancements

  • AI-powered summarization (using LLMs)
  • Podcast/audio extraction (whisper transcription)
  • Twitter/X thread extraction
  • LinkedIn post extraction
  • Batch processing (extract from multiple URLs)
  • Web interface
  • Automatic tagging based on content

License

MIT License - Feel free to use and modify!


Built with 🔥 by RUBIUS for naki
