convert to backblaze fetcher

2026-04-05 22:01:46 +02:00
parent 66e1c9e0e0
commit a9bb2460c6
15 changed files with 333 additions and 1620 deletions
--- a/README.md
+++ b/README.md
@@ -1,198 +1,64 @@
-# Content Extractor 🔥
+# Backblaze Invoice Downloader

-Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.
+Download Backblaze invoices as PDF using browser automation.

-## Features
+Backblaze only provides invoices via a web page that must be printed — this tool automates that process using Playwright, filling in configurable fields (VAT ID, document type, company, notes) and exporting each invoice to PDF.

- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content
-
-## Installation
+## Setup

 ```bash
-# Navigate to the content-extractor directory
-cd ~/Desktop/itsthatnewshit/content-extractor
-
-# Install dependencies
 pip install -r requirements.txt
-
-# Install Playwright browsers (for Instagram extraction)
-playwright install
+playwright install chromium
 ```

-## Usage
-
-### Basic Usage
+Or with Nix:

 ```bash
-# Extract from YouTube video
-python main.py "https://www.youtube.com/watch?v=VIDEO_ID"
-
-# Extract from Instagram reel
-python main.py "https://www.instagram.com/reel/REEL_ID"
-
-# Extract from blog post
-python main.py "https://example.com/article"
-```
-
-### Advanced Options
-
-```bash
-# Specify Obsidian vault path
-python main.py <url> --obsidian-path "/path/to/Obsidian Vault"
-
-# Custom output filename
-python main.py <url> --output "my-note-title"
-
-# Save to specific folder in Obsidian
-python main.py <url> --folder "Learning/YouTube"
-
-# Only print content, don't save to Obsidian
-python main.py <url> --no-save
-
-# Generate summary
-python main.py <url> --summarize
-```
-
-### Examples
-
-```bash
-# Save YouTube tutorial to Learning folder
-python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"
-
-# Extract Instagram reel without saving
-python main.py "https://instagram.com/reel/xyz789" --no-save
-
-# Extract blog post to default vault
-python main.py "https://medium.com/article" --folder "Articles"
+nix develop
 ```

 ## Configuration

-Create a `.env` file in the project directory to customize settings:
-
-```bash
-cp .env.example .env
-```
-
-Edit `.env` with your preferences:
+Create a `.env` file (see `.env.example`):

 ```env
-# Obsidian vault path
-OBSIDIAN_VAULT_PATH=~/Obsidian Vault
+BACKBLAZE_EMAIL=you@example.com
+BACKBLAZE_PASSWORD=your_password

-# Browser settings (for Instagram)
+INVOICE_VAT_ID=DE123456789
+INVOICE_DOCUMENT_TYPE=Invoice
+INVOICE_COMPANY=My Company GmbH
+INVOICE_NOTES=Internal ref: 12345
+
+OUTPUT_DIR=./invoices
 BROWSER_HEADLESS=true
-BROWSER_TIMEOUT=30000
-
-# Content extraction
-MAX_CONTENT_LENGTH=10000
-GENERATE_SUMMARY=true
-
-# OpenAI/OpenRouter
-OPENAI_API_KEY=your_key_here
-OPENAI_URL=https://openrouter.ai/api/v1/chat/completions
-OPENAI_MODEL=gpt-4o-mini
-OPENAI_TIMEOUT=30
-
-# YouTube
-YOUTUBE_LANGUAGE=en
-
-# Instagram
-INSTAGRAM_WAIT_TIME=5
 ```

-## Output Format
+## Usage

-Notes are saved in markdown format with:
-
- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization
-
-Example output:
-
-```markdown
-# How to Build AI Agents
-
-## Metadata
- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M
-
-## Description
-Learn how to build AI agents from scratch...
-
-## Content
-[Full transcript or article text...]
-
-## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content
-
---
-
-## Tags
-#youtube #video #ai #agents #notes
-```
-
-## Troubleshooting
-
-### Instagram extraction fails
-Instagram requires browser automation. Make sure you've run:
 ```bash
-playwright install
+python main.py
 ```

-If it still fails, Instagram may have changed their UI. The extractor has a fallback mode that will still extract basic info.
-
-### YouTube transcript not available
-Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
-
-### Obsidian vault not found
-By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
-
-## Project Structure
+### Options

 ```
-content-extractor/
-├── main.py                 # Main entry point
-├── config.py              # Configuration settings
-├── obsidian_writer.py     # Obsidian note writer
-├── requirements.txt       # Python dependencies
-├── .env.example          # Example environment file
-├── README.md             # This file
-└── extractors/
-    ├── __init__.py
-    ├── youtube_extractor.py    # YouTube extraction
-    ├── instagram_extractor.py  # Instagram extraction
-    └── blog_extractor.py       # Blog/article extraction
+-o, --output DIR       Output directory (default: ./invoices)
+--headless             Run browser headless
+--no-headless          Show browser window (useful for debugging)
+--vat-id ID            VAT ID to fill on invoices
+--document-type TYPE   Document type to select
+--company NAME         Company name to fill
+--notes TEXT           Notes to fill on invoices
+-v, --verbose          Verbose logging
 ```

-## Future Enhancements
+CLI arguments override `.env` values.

- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content
+## How it works

-## License
-
-MIT License - Feel free to use and modify!
-
---
-
-Built with 🔥 by RUBIUS for naki
+1. Logs in to `secure.backblaze.com`
+2. Navigates to the billing page
+3. Iterates over all billing groups and years
+4. For each invoice, opens the invoice page, fills the configured fields, and exports to PDF
+5. Skips already-downloaded invoices