convert to backblaze fetcher

Jan Bader
2026-04-05 22:01:46 +02:00
parent 66e1c9e0e0
commit a9bb2460c6
15 changed files with 333 additions and 1620 deletions

202
README.md

@@ -1,198 +1,64 @@
# Content Extractor 🔥
# Backblaze Invoice Downloader
Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.
Download Backblaze invoices as PDF using browser automation.
## Features
Backblaze only provides invoices via a web page that must be printed — this tool automates that process using Playwright, filling in configurable fields (VAT ID, document type, company, notes) and exporting each invoice to PDF.
- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content
## Installation
## Setup
```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor
# Install dependencies
pip install -r requirements.txt
# Install Playwright browsers (for Instagram extraction)
playwright install
playwright install chromium
```
## Usage
### Basic Usage
Or with Nix:
```bash
# Extract from YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"
# Extract from Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"
# Extract from blog post
python main.py "https://example.com/article"
```
### Advanced Options
```bash
# Specify Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"
# Custom output filename
python main.py <url> --output "my-note-title"
# Save to specific folder in Obsidian
python main.py <url> --folder "Learning/YouTube"
# Only print content, don't save to Obsidian
python main.py <url> --no-save
# Generate summary
python main.py <url> --summarize
```
### Examples
```bash
# Save YouTube tutorial to Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"
# Extract Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save
# Extract blog post to default vault
python main.py "https://medium.com/article" --folder "Articles"
nix develop
```
## Configuration
Create a `.env` file in the project directory to customize settings:
```bash
cp .env.example .env
```
Edit `.env` with your preferences:
Create a `.env` file (see `.env.example`):
```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault
BACKBLAZE_EMAIL=you@example.com
BACKBLAZE_PASSWORD=your_password
# Browser settings (for Instagram)
INVOICE_VAT_ID=DE123456789
INVOICE_DOCUMENT_TYPE=Invoice
INVOICE_COMPANY=My Company GmbH
INVOICE_NOTES=Internal ref: 12345
OUTPUT_DIR=./invoices
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000
# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true
# OpenAI/OpenRouter
OPENAI_API_KEY=your_key_here
OPENAI_URL=https://openrouter.ai/api/v1/chat/completions
OPENAI_MODEL=gpt-4o-mini
OPENAI_TIMEOUT=30
# YouTube
YOUTUBE_LANGUAGE=en
# Instagram
INSTAGRAM_WAIT_TIME=5
```
## Output Format
## Usage
Notes are saved in markdown format with:
- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization
Example output:
```markdown
# How to Build AI Agents
## Metadata
- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M
## Description
Learn how to build AI agents from scratch...
## Content
[Full transcript or article text...]
## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content
---
## Tags
#youtube #video #ai #agents #notes
```
## Troubleshooting
### Instagram extraction fails
Instagram requires browser automation. Make sure you've run:
```bash
playwright install
python main.py
```
If it still fails, Instagram may have changed their UI. The extractor has a fallback mode that will still extract basic info.
### YouTube transcript not available
Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
### Obsidian vault not found
By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
## Project Structure
### Options
```
content-extractor/
├── main.py # Main entry point
├── config.py # Configuration settings
├── obsidian_writer.py # Obsidian note writer
├── requirements.txt # Python dependencies
├── .env.example # Example environment file
├── README.md # This file
└── extractors/
├── __init__.py
├── youtube_extractor.py # YouTube extraction
├── instagram_extractor.py # Instagram extraction
└── blog_extractor.py # Blog/article extraction
-o, --output DIR Output directory (default: ./invoices)
--headless Run browser headless
--no-headless Show browser window (useful for debugging)
--vat-id ID VAT ID to fill on invoices
--document-type TYPE Document type to select
--company NAME Company name to fill
--notes TEXT Notes to fill on invoices
-v, --verbose Verbose logging
```
## Future Enhancements
CLI arguments override `.env` values.
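One way this precedence could be wired up is to use the `.env` values as `argparse` defaults, so that any flag given on the command line wins. The following is a minimal sketch, not the tool's actual code: the function name `resolve_settings` is hypothetical, and it assumes the `.env` file has already been loaded into a mapping (for example into `os.environ`).

```python
import argparse
import os


def resolve_settings(argv, env=None):
    """Parse CLI flags, falling back to .env-style values (hypothetical helper)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Download Backblaze invoices as PDF")

    # Each default comes from the environment, so an explicit flag overrides it.
    parser.add_argument("-o", "--output", default=env.get("OUTPUT_DIR", "./invoices"))
    parser.add_argument("--vat-id", default=env.get("INVOICE_VAT_ID"))
    parser.add_argument("--document-type", default=env.get("INVOICE_DOCUMENT_TYPE"))
    parser.add_argument("--company", default=env.get("INVOICE_COMPANY"))
    parser.add_argument("--notes", default=env.get("INVOICE_NOTES"))
    parser.add_argument("-v", "--verbose", action="store_true")

    # --headless / --no-headless share one destination; the default is
    # taken from BROWSER_HEADLESS (assumed to be "true" or "false").
    parser.add_argument("--headless", dest="headless", action="store_true")
    parser.add_argument("--no-headless", dest="headless", action="store_false")
    parser.set_defaults(headless=env.get("BROWSER_HEADLESS", "true").lower() == "true")

    return parser.parse_args(argv)
```

With `OUTPUT_DIR=./invoices` in `.env`, calling `resolve_settings(["-o", "archive"])` would yield `archive` as the output directory, while `resolve_settings([])` would keep the `.env` value.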
- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content
## How it works
## License
MIT License - Feel free to use and modify!
---
Built with 🔥 by RUBIUS for naki
1. Logs in to `secure.backblaze.com`
2. Navigates to the billing page
3. Iterates over all billing groups and years
4. For each invoice, opens the invoice page, fills the configured fields, and exports to PDF
5. Skips already-downloaded invoices
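Steps 4 and 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: the helper names, the `<output>/<group>/<invoice_id>.pdf` naming scheme, and the invoice URL are hypothetical, and the field-filling step is left out because the selectors depend on Backblaze's page markup. `page` is expected to be a Playwright sync-API `Page`; its `pdf()` method only works in headless Chromium.

```python
from pathlib import Path


def invoice_pdf_path(output_dir: Path, group: str, invoice_id: str) -> Path:
    """Hypothetical naming scheme: <output_dir>/<group>/<invoice_id>.pdf."""
    return output_dir / group / f"{invoice_id}.pdf"


def needs_download(target: Path) -> bool:
    """Step 5: an invoice is skipped when a non-empty PDF already exists."""
    return not (target.exists() and target.stat().st_size > 0)


def download_if_missing(page, invoice_url: str, target: Path) -> bool:
    """Step 4: open the invoice page and export it to PDF.

    Returns True if a PDF was written, False if the invoice was skipped.
    """
    if not needs_download(target):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    page.goto(invoice_url)
    # Filling VAT ID, document type, company and notes would happen here.
    page.pdf(path=str(target))
    return True
```

Checking for an existing file before navigating keeps repeated runs cheap, since only invoices that are new since the last run trigger a page load.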