
# Content Extractor 🔥

Extract key information from URLs (YouTube videos, Instagram reels, blog posts) and automatically save it as notes in your Obsidian vault.

## Features

- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content
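
Each source type above needs its own extractor, so the tool has to decide which one applies to a given URL. The following is a minimal sketch of one way to do that by hostname. `detect_source` and its return labels are hypothetical names used for illustration; the actual dispatch logic in `main.py` may differ.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube in this sketch (an assumption, not an exhaustive list).
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}


def detect_source(url: str) -> str:
    """Hypothetical helper: classify a URL as 'youtube', 'instagram', or 'blog'."""
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" and "example.com" the same way.
    if host.startswith("www."):
        host = host[4:]
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host == "instagram.com":
        return "instagram"
    # Anything else is handled as a generic blog post or article.
    return "blog"
```

Treating every unrecognized host as a blog keeps the article extractor as the catch-all, which matches the three categories listed above.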

## Installation

```bash
# Navigate to the content-extractor directory (adjust the path to your checkout)
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install
```

## Usage

### Basic Usage

```bash
# Extract from YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from blog post
python main.py "https://example.com/article"
```

### Advanced Options

```bash
# Specify Obsidian vault path
python main.py "<url>" --obsidian-path "/path/to/Obsidian Vault"

# Custom output filename
python main.py "<url>" --output "my-note-title"

# Save to specific folder in Obsidian
python main.py "<url>" --folder "Learning/YouTube"

# Only print content, don't save to Obsidian
python main.py "<url>" --no-save

# Generate summary
python main.py "<url>" --summarize
```
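
For readers who want to see how these flags could fit together, here is a rough sketch of a matching `argparse` setup. The flag names come from the examples above. The help strings, and the choice of `argparse` itself, are assumptions about how `main.py` is written.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract content from a URL into an Obsidian note.",
    )
    parser.add_argument("url", help="YouTube, Instagram, or article URL")
    parser.add_argument("--obsidian-path", help="Path to the Obsidian vault")
    parser.add_argument("--output", help="Custom note filename")
    parser.add_argument("--folder", help="Folder inside the vault")
    parser.add_argument("--no-save", action="store_true",
                        help="Print the content instead of saving it")
    parser.add_argument("--summarize", action="store_true",
                        help="Generate a summary")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["https://youtu.be/abc123", "--folder", "Learning", "--no-save"]
)
print(args.folder, args.no_save, args.obsidian_path)
```

Note that `argparse` converts `--obsidian-path` and `--no-save` into the attributes `obsidian_path` and `no_save`.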

### Examples

```bash
# Save YouTube tutorial to Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Save blog post to the Articles folder in the default vault
python main.py "https://medium.com/article" --folder "Articles"
```

## Configuration

Create a `.env` file in the project directory to customize settings:

```bash
cp .env.example .env
```

Edit `.env` with your preferences:

```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
```
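
The sketch below shows one way these variables might be read into typed settings. It assumes the `.env` file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`). The `Settings` class, `load_settings`, and the fallback defaults (copied from the example above) are illustrative and may not match `config.py`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values documented in .env.example."""
    vault_path: Path
    browser_headless: bool
    browser_timeout: int
    max_content_length: int
    generate_summary: bool
    youtube_language: str
    instagram_wait_time: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, using the example values as defaults."""
    env = os.environ if env is None else env
    return Settings(
        # "~" is not expanded automatically, so do it explicitly.
        vault_path=Path(env.get("OBSIDIAN_VAULT_PATH", "~/Obsidian Vault")).expanduser(),
        browser_headless=_as_bool(env.get("BROWSER_HEADLESS", "true")),
        browser_timeout=int(env.get("BROWSER_TIMEOUT", "30000")),
        max_content_length=int(env.get("MAX_CONTENT_LENGTH", "10000")),
        generate_summary=_as_bool(env.get("GENERATE_SUMMARY", "true")),
        youtube_language=env.get("YOUTUBE_LANGUAGE", "en"),
        instagram_wait_time=int(env.get("INSTAGRAM_WAIT_TIME", "5")),
    )
```

Passing a plain dictionary as `env` makes the function easy to try out without touching real environment variables.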

## Output Format

Notes are saved as Markdown and include:

- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization

Example output:

```markdown
# How to Build AI Agents

## Metadata
- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description
Learn how to build AI agents from scratch...

## Content
[Full transcript or article text...]

## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags
#youtube #video #ai #agents #notes
```
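
As an illustration of how a note with this layout could be assembled, here is a small sketch. `render_note` and its parameters are hypothetical; the real `obsidian_writer.py` may build notes differently.

```python
from datetime import datetime
from typing import Iterable, Optional


def render_note(
    title: str,
    source: str,
    url: str,
    content: str,
    *,
    extracted: Optional[datetime] = None,
    author: Optional[str] = None,
    duration: Optional[str] = None,
    views: Optional[str] = None,
    description: str = "",
    key_points: Iterable[str] = (),
    tags: Iterable[str] = (),
) -> str:
    """Hypothetical helper: build a note in the layout shown above."""
    extracted = extracted or datetime.now()
    lines = [
        f"# {title}",
        "",
        "## Metadata",
        f"- **Source**: {source}",
        f"- **URL**: {url}",
        f"- **Extracted**: {extracted:%Y-%m-%d %H:%M:%S}",
    ]
    # Optional fields appear only when the extractor found them.
    for label, value in (("Author", author), ("Duration", duration), ("Views", views)):
        if value:
            lines.append(f"- **{label}**: {value}")
    lines += ["", "## Description", description, "", "## Content", content, "", "## Key Points"]
    lines += [f"- {point}" for point in key_points]
    lines += ["", "---", "", "## Tags", " ".join(f"#{tag}" for tag in tags)]
    return "\n".join(lines) + "\n"
```

Skipping empty optional fields keeps notes for blog posts, which have no duration or view count, free of blank metadata rows.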

## Troubleshooting

### Instagram extraction fails
Instagram requires browser automation. Make sure you've run:
```bash
playwright install
```

If it still fails, Instagram may have changed its UI. In that case the extractor falls back to a mode that still extracts basic info.

### YouTube transcript not available
Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
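
The fallback described here can be sketched as follows. `pick_content` is a hypothetical helper, and `fetch_transcript` stands in for whatever call the YouTube extractor actually uses to download captions.

```python
from typing import Callable, Optional


def pick_content(
    description: str,
    fetch_transcript: Callable[[], Optional[str]],
) -> tuple[str, str]:
    """Return (content, kind), preferring the transcript over the description."""
    try:
        transcript = fetch_transcript()
    except Exception:
        # Missing or disabled captions are treated the same as an empty result.
        transcript = None
    if transcript and transcript.strip():
        return transcript, "transcript"
    return description, "description"
```

Returning the kind alongside the text lets the note writer label whether the content section holds a transcript or only the description.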

### Obsidian vault not found
By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
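
A minimal sketch of how the vault location might be resolved is shown below. The precedence used here (command-line flag first, then `OBSIDIAN_VAULT_PATH`, then the default) is an assumption, as are the `resolve_vault` name and the error message.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_VAULT = "~/Obsidian Vault"


def resolve_vault(
    cli_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Hypothetical helper: pick the vault path and verify that it exists."""
    env = os.environ if env is None else env
    # Assumed precedence: --obsidian-path, then the environment, then the default.
    raw = cli_path or env.get("OBSIDIAN_VAULT_PATH") or DEFAULT_VAULT
    vault = Path(raw).expanduser()
    if not vault.is_dir():
        raise FileNotFoundError(f"Obsidian vault not found: {vault}")
    return vault
```

Failing early with the resolved path in the message makes it easier to spot a typo or an unexpanded `~` in the configured location.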

## Project Structure

```
content-extractor/
├── main.py                 # Main entry point
├── config.py               # Configuration settings
├── obsidian_writer.py      # Obsidian note writer
├── requirements.txt        # Python dependencies
├── .env.example            # Example environment file
├── README.md               # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py    # YouTube extraction
    ├── instagram_extractor.py  # Instagram extraction
    └── blog_extractor.py       # Blog/article extraction
```

## Future Enhancements

- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content

## License

MIT License - Feel free to use and modify!

---

Built with 🔥 by RUBIUS for naki