convert to backblaze fetcher
This commit is contained in:
202
README.md
202
README.md
@@ -1,198 +1,64 @@
|
||||
# Content Extractor 🔥
|
||||
# Backblaze Invoice Downloader
|
||||
|
||||
Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.
|
||||
Download Backblaze invoices as PDF using browser automation.
|
||||
|
||||
## Features
|
||||
Backblaze only provides invoices via a web page that must be printed — this tool automates that process using Playwright, filling in configurable fields (VAT ID, document type, company, notes) and exporting each invoice to PDF.
|
||||
|
||||
- **YouTube Videos**: Extract title, description, transcript, author, duration, views
|
||||
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
|
||||
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
|
||||
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
|
||||
- **Smart Summaries**: Generates key points from extracted content
|
||||
|
||||
## Installation
|
||||
## Setup
|
||||
|
||||
```bash
|
||||
# Navigate to the content-extractor directory
|
||||
cd ~/Desktop/itsthatnewshit/content-extractor
|
||||
|
||||
# Install dependencies
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Install Playwright browsers (for Instagram extraction)
|
||||
playwright install
|
||||
playwright install chromium
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Usage
|
||||
Or with Nix:
|
||||
|
||||
```bash
|
||||
# Extract from YouTube video
|
||||
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"
|
||||
|
||||
# Extract from Instagram reel
|
||||
python main.py "https://www.instagram.com/reel/REEL_ID"
|
||||
|
||||
# Extract from blog post
|
||||
python main.py "https://example.com/article"
|
||||
```
|
||||
|
||||
### Advanced Options
|
||||
|
||||
```bash
|
||||
# Specify Obsidian vault path
|
||||
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"
|
||||
|
||||
# Custom output filename
|
||||
python main.py <url> --output "my-note-title"
|
||||
|
||||
# Save to specific folder in Obsidian
|
||||
python main.py <url> --folder "Learning/YouTube"
|
||||
|
||||
# Only print content, don't save to Obsidian
|
||||
python main.py <url> --no-save
|
||||
|
||||
# Generate summary
|
||||
python main.py <url> --summarize
|
||||
```
|
||||
|
||||
### Examples
|
||||
|
||||
```bash
|
||||
# Save YouTube tutorial to Learning folder
|
||||
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"
|
||||
|
||||
# Extract Instagram reel without saving
|
||||
python main.py "https://instagram.com/reel/xyz789" --no-save
|
||||
|
||||
# Extract blog post to default vault
|
||||
python main.py "https://medium.com/article" --folder "Articles"
|
||||
nix develop
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
Create a `.env` file in the project directory to customize settings:
|
||||
|
||||
```bash
|
||||
cp .env.example .env
|
||||
```
|
||||
|
||||
Edit `.env` with your preferences:
|
||||
Create a `.env` file (see `.env.example`):
|
||||
|
||||
```env
|
||||
# Obsidian vault path
|
||||
OBSIDIAN_VAULT_PATH=~/Obsidian Vault
|
||||
BACKBLAZE_EMAIL=you@example.com
|
||||
BACKBLAZE_PASSWORD=your_password
|
||||
|
||||
# Browser settings (for Instagram)
|
||||
INVOICE_VAT_ID=DE123456789
|
||||
INVOICE_DOCUMENT_TYPE=Invoice
|
||||
INVOICE_COMPANY=My Company GmbH
|
||||
INVOICE_NOTES=Internal ref: 12345
|
||||
|
||||
OUTPUT_DIR=./invoices
|
||||
BROWSER_HEADLESS=true
|
||||
BROWSER_TIMEOUT=30000
|
||||
|
||||
# Content extraction
|
||||
MAX_CONTENT_LENGTH=10000
|
||||
GENERATE_SUMMARY=true
|
||||
|
||||
# OpenAI/OpenRouter
|
||||
OPENAI_API_KEY=your_key_here
|
||||
OPENAI_URL=https://openrouter.ai/api/v1/chat/completions
|
||||
OPENAI_MODEL=gpt-4o-mini
|
||||
OPENAI_TIMEOUT=30
|
||||
|
||||
# YouTube
|
||||
YOUTUBE_LANGUAGE=en
|
||||
|
||||
# Instagram
|
||||
INSTAGRAM_WAIT_TIME=5
|
||||
```
|
||||
|
||||
## Output Format
|
||||
## Usage
|
||||
|
||||
Notes are saved in markdown format with:
|
||||
|
||||
- Title and metadata (source, URL, extraction date)
|
||||
- Author, duration, views (when available)
|
||||
- Description/summary
|
||||
- Full content (transcript or article text)
|
||||
- Key points
|
||||
- Tags for easy organization
|
||||
|
||||
Example output:
|
||||
|
||||
```markdown
|
||||
# How to Build AI Agents
|
||||
|
||||
## Metadata
|
||||
- **Source**: Youtube
|
||||
- **URL**: https://youtube.com/watch?v=abc123
|
||||
- **Extracted**: 2026-02-21 15:30:00
|
||||
- **Author**: Tech Channel
|
||||
- **Duration**: 12:34
|
||||
- **Views**: 1.2M
|
||||
|
||||
## Description
|
||||
Learn how to build AI agents from scratch...
|
||||
|
||||
## Content
|
||||
[Full transcript or article text...]
|
||||
|
||||
## Key Points
|
||||
- Point 1 from the content
|
||||
- Point 2 from the content
|
||||
- Point 3 from the content
|
||||
|
||||
---
|
||||
|
||||
## Tags
|
||||
#youtube #video #ai #agents #notes
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Instagram extraction fails
|
||||
Instagram requires browser automation. Make sure you've run:
|
||||
```bash
|
||||
playwright install
|
||||
python main.py
|
||||
```
|
||||
|
||||
If it still fails, Instagram may have changed their UI. The extractor has a fallback mode that will still extract basic info.
|
||||
|
||||
### YouTube transcript not available
|
||||
Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.
|
||||
|
||||
### Obsidian vault not found
|
||||
By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
|
||||
|
||||
## Project Structure
|
||||
### Options
|
||||
|
||||
```
|
||||
content-extractor/
|
||||
├── main.py # Main entry point
|
||||
├── config.py # Configuration settings
|
||||
├── obsidian_writer.py # Obsidian note writer
|
||||
├── requirements.txt # Python dependencies
|
||||
├── .env.example # Example environment file
|
||||
├── README.md # This file
|
||||
└── extractors/
|
||||
├── __init__.py
|
||||
├── youtube_extractor.py # YouTube extraction
|
||||
├── instagram_extractor.py # Instagram extraction
|
||||
└── blog_extractor.py # Blog/article extraction
|
||||
-o, --output DIR Output directory (default: ./invoices)
|
||||
--headless Run browser headless
|
||||
--no-headless Show browser window (useful for debugging)
|
||||
--vat-id ID VAT ID to fill on invoices
|
||||
--document-type TYPE Document type to select
|
||||
--company NAME Company name to fill
|
||||
--notes TEXT Notes to fill on invoices
|
||||
-v, --verbose Verbose logging
|
||||
```
|
||||
|
||||
## Future Enhancements
|
||||
CLI arguments override `.env` values.
|
||||
|
||||
- [ ] AI-powered summarization (using LLMs)
|
||||
- [ ] Podcast/audio extraction (whisper transcription)
|
||||
- [ ] Twitter/X thread extraction
|
||||
- [ ] LinkedIn post extraction
|
||||
- [ ] Batch processing (extract from multiple URLs)
|
||||
- [ ] Web interface
|
||||
- [ ] Automatic tagging based on content
|
||||
## How it works
|
||||
|
||||
## License
|
||||
|
||||
MIT License - Feel free to use and modify!
|
||||
|
||||
---
|
||||
|
||||
Built with 🔥 by RUBIUS for naki
|
||||
1. Logs in to `secure.backblaze.com`
|
||||
2. Navigates to the billing page
|
||||
3. Iterates over all billing groups and years
|
||||
4. For each invoice, opens the invoice page, fills the configured fields, and exports to PDF
|
||||
5. Skips already-downloaded invoices
|
||||
|
||||
Reference in New Issue
Block a user