feat: Initial commit - Content Extractor for YouTube, Instagram, and blogs

- YouTube extraction with transcript support
- Instagram reel extraction via browser automation
- Blog/article web scraping
- Auto-save to Obsidian vaults
- Smart key point generation
- Configurable via .env file
- Quick extract shell script

Tech stack: Python, requests, beautifulsoup4, playwright, youtube-transcript-api
Author: naki
Date: 2026-03-05 13:02:58 +05:30
Commit: c997e764b5
12 changed files with 1302 additions and 0 deletions

.env.example Normal file

@@ -0,0 +1,21 @@
# Content Extractor Configuration

# Obsidian vault path (default: ~/Obsidian Vault)
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram extraction)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction settings
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube settings
YOUTUBE_LANGUAGE=en

# Instagram settings
INSTAGRAM_WAIT_TIME=5

# Logging
LOG_LEVEL=INFO

.gitignore vendored Normal file

@@ -0,0 +1,52 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Environment
.env
.env.local

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Logs
*.log
logs/

# OS
.DS_Store
Thumbs.db

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# Playwright
.playwright/

README.md Normal file

@@ -0,0 +1,192 @@
# Content Extractor 🔥

Extract key information from URLs (YouTube, Instagram, blogs) and save to Obsidian notes automatically.

## Features

- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content

## Installation

```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install
```

## Usage

### Basic Usage

```bash
# Extract from YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from blog post
python main.py "https://example.com/article"
```

### Advanced Options

```bash
# Specify Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"

# Custom output filename
python main.py <url> --output "my-note-title"

# Save to specific folder in Obsidian
python main.py <url> --folder "Learning/YouTube"

# Only print content, don't save to Obsidian
python main.py <url> --no-save

# Generate summary
python main.py <url> --summarize
```

### Examples

```bash
# Save YouTube tutorial to Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Extract blog post to default vault
python main.py "https://medium.com/article" --folder "Articles"
```

## Configuration

Create a `.env` file in the project directory to customize settings:

```bash
cp .env.example .env
```

Edit `.env` with your preferences:

```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
```

## Output Format

Notes are saved in markdown format with:

- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization

Example output:

```markdown
# How to Build AI Agents

## Metadata

- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description

Learn how to build AI agents from scratch...

## Content

[Full transcript or article text...]

## Key Points

- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags

#youtube #video #ai #agents #notes
```

## Troubleshooting

### Instagram extraction fails

Instagram requires browser automation. Make sure you've run:

```bash
playwright install
```

If it still fails, Instagram may have changed its UI. The extractor has a fallback mode that will still extract basic info.

### YouTube transcript not available

Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.

### Obsidian vault not found

By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.

## Project Structure

```
content-extractor/
├── main.py                  # Main entry point
├── config.py                # Configuration settings
├── obsidian_writer.py       # Obsidian note writer
├── requirements.txt         # Python dependencies
├── .env.example             # Example environment file
├── README.md                # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py     # YouTube extraction
    ├── instagram_extractor.py   # Instagram extraction
    └── blog_extractor.py        # Blog/article extraction
```

## Future Enhancements

- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content

## License

MIT License - Feel free to use and modify!

---

Built with 🔥 by RUBIUS for naki

config.py Normal file

@@ -0,0 +1,47 @@
"""
Configuration for Content Extractor
"""
import os
from pathlib import Path

from dotenv import load_dotenv

# Load environment variables
load_dotenv()


class Config:
    """Configuration settings for content extractor."""

    # Obsidian vault path (default to common locations)
    OBSIDIAN_VAULT_PATH = os.getenv(
        "OBSIDIAN_VAULT_PATH",
        os.path.expanduser("~/Obsidian Vault")  # Default location
    )

    # Browser settings (for Instagram and dynamic content)
    BROWSER_HEADLESS = os.getenv("BROWSER_HEADLESS", "true").lower() == "true"
    BROWSER_TIMEOUT = int(os.getenv("BROWSER_TIMEOUT", "30000"))  # 30 seconds

    # Content extraction settings
    MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", "10000"))  # Max chars
    GENERATE_SUMMARY = os.getenv("GENERATE_SUMMARY", "true").lower() == "true"

    # YouTube settings
    YOUTUBE_LANGUAGE = os.getenv("YOUTUBE_LANGUAGE", "en")

    # Instagram settings (requires browser automation)
    INSTAGRAM_WAIT_TIME = int(os.getenv("INSTAGRAM_WAIT_TIME", "5"))  # seconds

    # Logging
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
    LOG_FILE = os.getenv("LOG_FILE", "content_extractor.log")

    @classmethod
    def validate(cls):
        """Validate configuration."""
        # Check if Obsidian vault path exists
        if not Path(cls.OBSIDIAN_VAULT_PATH).exists():
            print(f"⚠️  Warning: Obsidian vault path does not exist: {cls.OBSIDIAN_VAULT_PATH}")
            print("   You can set OBSIDIAN_VAULT_PATH environment variable or use --obsidian-path flag")
        return True
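The env-parsing pattern used by `Config` above (strings from `os.getenv` coerced to bool/int with defaults) can be exercised standalone; this is a minimal sketch of the same idiom, not part of the commit:

```python
import os

# Env vars arrive as strings; mirror config.py's coercion pattern.
os.environ["BROWSER_HEADLESS"] = "False"
headless = os.getenv("BROWSER_HEADLESS", "true").lower() == "true"
timeout = int(os.getenv("BROWSER_TIMEOUT", "30000"))

print(headless)  # False: "False".lower() is not "true"
print(timeout)   # 30000 (default used, variable unset)
```

Note the comparison against the literal `"true"`, so any other string (including `"False"` or `"1"`) disables the flag.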

extract.sh Executable file

@@ -0,0 +1,25 @@
#!/bin/bash
# Content Extractor - Quick extraction script
# Usage: ./extract.sh <url> [folder]
if [ -z "$1" ]; then
    echo "Usage: $0 <url> [folder]"
    echo ""
    echo "Examples:"
    echo "  $0 https://youtube.com/watch?v=abc123"
    echo "  $0 https://instagram.com/reel/xyz789 Learning"
    echo "  $0 https://medium.com/article Articles"
    exit 1
fi

URL="$1"
FOLDER="${2:-Content Extractor}"

echo "🔥 Content Extractor"
echo "===================="
echo "URL: $URL"
echo "Folder: $FOLDER"
echo ""

cd "$(dirname "$0")"
python main.py "$URL" --folder "$FOLDER"

extractors/__init__.py Normal file

@@ -0,0 +1,13 @@
"""
Content Extractors Package
"""

from .youtube_extractor import YouTubeExtractor
from .blog_extractor import BlogExtractor
from .instagram_extractor import InstagramExtractor

__all__ = [
    "YouTubeExtractor",
    "BlogExtractor",
    "InstagramExtractor",
]

extractors/blog_extractor.py Normal file

@@ -0,0 +1,224 @@
"""
Blog/Article Extractor

Extracts:
- Title, author, publish date
- Main article content
- Tags/categories
- Summary
"""
import re
from typing import Dict, Any
from urllib.parse import urlparse

try:
    import requests
    from bs4 import BeautifulSoup
except ImportError:
    requests = None
    BeautifulSoup = None


class BlogExtractor:
    """Extract content from blog posts and articles."""

    def __init__(self, url: str):
        self.url = url
        self.html = None
        self.soup = None
        self._fetch_page()

    def _fetch_page(self):
        """Fetch the webpage."""
        if requests is None:
            raise ImportError("requests not installed. Run: pip install requests")
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        try:
            response = requests.get(self.url, headers=headers, timeout=30)
            response.raise_for_status()
            self.html = response.text
        except Exception as e:
            raise Exception(f"Failed to fetch page: {str(e)}")

    def _parse_html(self):
        """Parse HTML with BeautifulSoup."""
        if BeautifulSoup is None:
            raise ImportError("beautifulsoup4 not installed. Run: pip install beautifulsoup4")
        if self.soup is None:
            self.soup = BeautifulSoup(self.html, 'lxml')

    def extract(self) -> Dict[str, Any]:
        """Extract all content from the page."""
        self._parse_html()
        content = {
            "title": self._get_title(),
            "description": self._get_description(),
            "author": self._get_author(),
            "publish_date": self._get_publish_date(),
            "content": self._get_content(),
            "key_points": self._generate_key_points(),
            "tags": self._get_tags(),
        }
        return content

    def _get_title(self) -> str:
        """Get page title."""
        # Try Open Graph title first
        og_title = self.soup.find('meta', property='og:title')
        if og_title and og_title.get('content'):
            return og_title['content'].strip()
        # Try Twitter card title
        twitter_title = self.soup.find('meta', attrs={'name': 'twitter:title'})
        if twitter_title and twitter_title.get('content'):
            return twitter_title['content'].strip()
        # Try h1 tag
        h1 = self.soup.find('h1')
        if h1:
            return h1.get_text().strip()
        # Fallback to <title> tag
        title_tag = self.soup.find('title')
        if title_tag:
            return title_tag.get_text().strip()
        return "Untitled Article"

    def _get_description(self) -> str:
        """Get page description."""
        # Try Open Graph description
        og_desc = self.soup.find('meta', property='og:description')
        if og_desc and og_desc.get('content'):
            return og_desc['content'].strip()
        # Try meta description
        meta_desc = self.soup.find('meta', attrs={'name': 'description'})
        if meta_desc and meta_desc.get('content'):
            return meta_desc['content'].strip()
        return ""

    def _get_author(self) -> str:
        """Get article author."""
        # Try Open Graph author
        og_author = self.soup.find('meta', property='article:author')
        if og_author and og_author.get('content'):
            return og_author['content'].strip()
        # Try meta author
        meta_author = self.soup.find('meta', attrs={'name': 'author'})
        if meta_author and meta_author.get('content'):
            return meta_author['content'].strip()
        # Try to find author in byline
        byline = self.soup.find(class_=re.compile(r'byline|author', re.I))
        if byline:
            return byline.get_text().strip()
        return "Unknown"

    def _get_publish_date(self) -> str:
        """Get publish date."""
        # Try Open Graph publish time
        og_time = self.soup.find('meta', property='article:published_time')
        if og_time and og_time.get('content'):
            return og_time['content'][:10]  # YYYY-MM-DD
        # Try meta publish date
        meta_time = self.soup.find('meta', attrs={'name': 'date'})
        if meta_time and meta_time.get('content'):
            return meta_time['content'][:10]
        # Try time tag
        time_tag = self.soup.find('time')
        if time_tag and time_tag.get('datetime'):
            return time_tag['datetime'][:10]
        return "Unknown"

    def _get_content(self) -> str:
        """Extract main article content."""
        # Remove unwanted elements
        for element in self.soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
            element.decompose()
        # Try to find main content area
        content_areas = [
            self.soup.find('article'),
            self.soup.find(class_=re.compile(r'article|content|post|entry', re.I)),
            self.soup.find(id=re.compile(r'article|content|post', re.I)),
            self.soup.find('main'),
        ]
        content_elem = next((elem for elem in content_areas if elem), None)
        if content_elem:
            # Get paragraphs from content area
            paragraphs = content_elem.find_all('p')
        else:
            # Fallback to all paragraphs
            paragraphs = self.soup.find_all('p')
        # Extract text from paragraphs
        text_parts = []
        for p in paragraphs:
            text = p.get_text().strip()
            if len(text) > 50:  # Filter out short paragraphs
                text_parts.append(text)
        # Join and clean
        content = "\n\n".join(text_parts)
        content = re.sub(r'\n{3,}', '\n\n', content)  # Remove excessive newlines
        return content[:10000]  # Limit length

    def _generate_key_points(self) -> list:
        """Generate key points from content."""
        content = self._get_content()
        if not content:
            return []
        # Extract first few sentences as key points
        sentences = re.split(r'[.!?]+', content)
        key_points = []
        for sentence in sentences[:5]:
            sentence = sentence.strip()
            if len(sentence) > 30 and len(sentence) < 200:
                key_points.append(sentence + '.')
        return key_points

    def _get_tags(self) -> list:
        """Get article tags/categories."""
        tags = []
        # Try Open Graph article tags
        og_tags = self.soup.find_all('meta', property='article:tag')
        for tag in og_tags:
            if tag.get('content'):
                tags.append(tag['content'].lower().replace(' ', '-'))
        # Try to find tag elements
        tag_elements = self.soup.find_all(class_=re.compile(r'tag|category|label', re.I))
        for elem in tag_elements[:5]:  # Limit to 5
            text = elem.get_text().strip().lower()
            if len(text) < 30:
                tags.append(text.replace(' ', '-'))
        # Add domain-based tag
        domain = urlparse(self.url).netloc
        if domain:
            tags.append(domain.replace('www.', '').split('.')[0])
        return list(set(tags))[:10]  # Remove duplicates and limit
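The key-point heuristic in `_generate_key_points` (split on sentence punctuation, keep medium-length sentences) can be sketched in isolation, with no network or HTML parsing involved:

```python
import re

def key_points(content: str, limit: int = 5) -> list:
    # Same heuristic as BlogExtractor._generate_key_points:
    # split on . ! ?, keep sentences between 30 and 200 chars.
    sentences = re.split(r'[.!?]+', content)
    points = []
    for s in sentences[:limit]:
        s = s.strip()
        if 30 < len(s) < 200:
            points.append(s + '.')
    return points

text = ("Web scraping extracts structured data from HTML pages. "
        "Short one. "
        "Meta tags like og:title often give cleaner titles than the raw markup.")
print(key_points(text))  # two points; "Short one" is filtered out
```

Because only the first `limit` sentences are inspected, this favors article ledes, which usually summarize the piece anyway.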

extractors/instagram_extractor.py Normal file

@@ -0,0 +1,175 @@
"""
Instagram Reel Extractor

Extracts:
- Title/caption
- Author/creator
- Description
- Transcript (if available via captions)
- Metadata (views, likes, etc.)

Note: Instagram requires browser automation. Uses Playwright.
"""
import time
from typing import Dict, Any
from urllib.parse import urlparse

try:
    from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout
except ImportError:
    sync_playwright = None
    PlaywrightTimeout = Exception  # keep the except clause below valid without playwright


class InstagramExtractor:
    """Extract content from Instagram reels."""

    def __init__(self, url: str, headless: bool = True):
        self.url = url
        self.headless = headless
        self.data = {}
        if sync_playwright is None:
            raise ImportError("playwright not installed. Run: pip install playwright && playwright install")

    def extract(self) -> Dict[str, Any]:
        """Extract content from Instagram reel."""
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=self.headless)
                page = browser.new_page(
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
                )
                # Navigate to the reel
                print("📱 Loading Instagram reel...")
                page.goto(self.url, timeout=30000)
                # Wait for content to load
                time.sleep(3)
                # Try to close any cookies/login prompts
                try:
                    page.click('button:has-text("Not Now")', timeout=3000)
                except Exception:
                    pass
                try:
                    page.click('button:has-text("Allow")', timeout=3000)
                except Exception:
                    pass
                # Extract data
                self.data = self._extract_data(page)
                browser.close()
        except PlaywrightTimeout:
            print("⚠️  Timeout loading Instagram page")
            self.data = self._fallback_extract()
        except Exception as e:
            print(f"⚠️  Error: {str(e)}")
            self.data = self._fallback_extract()
        return self.data

    def _extract_data(self, page) -> Dict[str, Any]:
        """Extract data from loaded page."""
        data = {
            "title": "Instagram Reel",
            "description": "",
            "author": "Unknown",
            "content": "",
            "key_points": [],
            "tags": ["instagram", "reel"],
        }
        # Try to get caption/description
        try:
            # Look for caption text
            captions = page.query_selector_all('h1, h2, span')
            for caption in captions:
                text = caption.inner_text()
                if len(text) > 20 and len(text) < 500:
                    if not data["description"]:
                        data["description"] = text
                        break
        except Exception as e:
            print(f"⚠️  Could not extract caption: {e}")
        # Try to get author
        try:
            author_elem = page.query_selector('a[href*="/"] h1, a[href*="/"] h2, header span')
            if author_elem:
                data["author"] = author_elem.inner_text().strip()
        except Exception:
            pass
        # Try to get engagement metrics
        try:
            likes_elem = page.query_selector('span:has-text("likes"), span:has-text("views")')
            if likes_elem:
                data["views"] = likes_elem.inner_text().strip()
        except Exception:
            pass
        # Extract any visible text as content
        try:
            # Get all text content
            body_text = page.inner_text('body')
            # Filter for meaningful content
            lines = body_text.split('\n')
            meaningful_lines = [
                line.strip() for line in lines
                if len(line.strip()) > 30 and len(line.strip()) < 300
            ]
            data["content"] = "\n\n".join(meaningful_lines[:10])[:5000]
        except Exception as e:
            print(f"⚠️  Could not extract page text: {e}")
        # Generate key points from description
        if data["description"]:
            sentences = data["description"].split('.')[:3]
            data["key_points"] = [s.strip() + '.' for s in sentences if len(s.strip()) > 20]
        # Add URL-based tags
        parsed = urlparse(self.url)
        if '/reel/' in parsed.path:
            data["tags"].append("reel")
        if '/video/' in parsed.path:
            data["tags"].append("video")
        return data

    def _fallback_extract(self) -> Dict[str, Any]:
        """Fallback extraction when browser automation fails."""
        print("⚠️  Using fallback extraction method...")
        # Try to extract what we can from the URL itself
        data = {
            "title": "Instagram Content",
            "description": "[Could not extract - Instagram requires login]",
            "author": "Unknown",
            "content": "",
            "key_points": [
                "Instagram content extraction requires browser automation",
                "Consider using Instagram's official API or downloading the video manually",
            ],
            "tags": ["instagram", "fallback"],
        }
        # Extract reel ID from URL
        try:
            parsed = urlparse(self.url)
            path_parts = parsed.path.split('/')
            for i, part in enumerate(path_parts):
                if part in ['reel', 'p', 'tv'] and i + 1 < len(path_parts):
                    reel_id = path_parts[i + 1]
                    data["key_points"].append(f"Reel ID: {reel_id}")
                    break
        except Exception:
            pass
        return data
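The URL-path walk that `_fallback_extract` uses to recover a reel ID needs no browser at all, so it can be sketched and exercised on its own:

```python
from typing import Optional
from urllib.parse import urlparse

def reel_id(url: str) -> Optional[str]:
    # Same walk as InstagramExtractor._fallback_extract: the ID is
    # the path segment immediately after 'reel', 'p', or 'tv'.
    parts = urlparse(url).path.split('/')
    for i, part in enumerate(parts):
        if part in ('reel', 'p', 'tv') and i + 1 < len(parts):
            return parts[i + 1]
    return None

print(reel_id("https://www.instagram.com/reel/xyz789/"))  # xyz789
print(reel_id("https://www.instagram.com/explore/"))      # None
```

This is why the fallback can still report something useful even when Instagram blocks the automated browser outright.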

extractors/youtube_extractor.py Normal file

@@ -0,0 +1,203 @@
"""
YouTube Video Extractor

Extracts:
- Title, description, author
- Transcript/captions
- Duration, views, publish date
- Tags/categories
"""
import re
from typing import Dict, Any

try:
    from youtube_transcript_api import YouTubeTranscriptApi
    from youtube_transcript_api._errors import TranscriptsDisabled, NoTranscriptFound
except ImportError:
    YouTubeTranscriptApi = None

try:
    from pytubefix import YouTube  # More reliable than pytube
except ImportError:
    try:
        from pytube import YouTube
    except ImportError:
        YouTube = None


class YouTubeExtractor:
    """Extract content from YouTube videos."""

    def __init__(self, url: str):
        self.url = url
        self.video_id = self._extract_video_id(url)
        self.youtube = None

    def _extract_video_id(self, url: str) -> str:
        """Extract video ID from YouTube URL."""
        patterns = [
            r'(?:youtube\.com\/watch\?v=|youtu\.be\/)([a-zA-Z0-9_-]{11})',
            r'youtube\.com\/embed\/([a-zA-Z0-9_-]{11})',
            r'youtube\.com\/v\/([a-zA-Z0-9_-]{11})',
        ]
        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)
        raise ValueError(f"Could not extract YouTube video ID from: {url}")

    def _init_youtube(self):
        """Initialize YouTube object."""
        if YouTube is None:
            raise ImportError("pytube or pytubefix not installed. Run: pip install pytubefix")
        if self.youtube is None:
            self.youtube = YouTube(self.url)

    def extract(self) -> Dict[str, Any]:
        """Extract all content from YouTube video."""
        self._init_youtube()
        content = {
            "title": self._get_title(),
            "description": self._get_description(),
            "author": self._get_author(),
            "duration": self._get_duration(),
            "publish_date": self._get_publish_date(),
            "views": self._get_views(),
            "content": self._get_transcript(),
            "key_points": self._generate_key_points(),
            "tags": self._get_tags(),
        }
        return content

    def _get_title(self) -> str:
        """Get video title."""
        try:
            self._init_youtube()
            return self.youtube.title
        except Exception:
            return f"Video {self.video_id}"

    def _get_description(self) -> str:
        """Get video description."""
        try:
            self._init_youtube()
            return self.youtube.description or ""
        except Exception:
            return ""

    def _get_author(self) -> str:
        """Get video author/channel name."""
        try:
            self._init_youtube()
            return self.youtube.author
        except Exception:
            return "Unknown"

    def _get_duration(self) -> str:
        """Get video duration in readable format."""
        try:
            self._init_youtube()
            seconds = self.youtube.length
            minutes, secs = divmod(seconds, 60)
            hours, minutes = divmod(minutes, 60)
            if hours > 0:
                return f"{hours}:{minutes:02d}:{secs:02d}"
            else:
                return f"{minutes}:{secs:02d}"
        except Exception:
            return "Unknown"

    def _get_publish_date(self) -> str:
        """Get video publish date."""
        try:
            self._init_youtube()
            if hasattr(self.youtube, 'publish_date') and self.youtube.publish_date:
                return self.youtube.publish_date.strftime("%Y-%m-%d")
        except Exception:
            pass
        return "Unknown"

    def _get_views(self) -> str:
        """Get view count."""
        try:
            self._init_youtube()
            views = self.youtube.views
            if views > 1_000_000:
                return f"{views / 1_000_000:.1f}M"
            elif views > 1_000:
                return f"{views / 1_000:.1f}K"
            else:
                return str(views)
        except Exception:
            return "Unknown"

    def _get_transcript(self) -> str:
        """Get video transcript/captions."""
        if YouTubeTranscriptApi is None:
            return "[Transcript not available - youtube-transcript-api not installed]"
        try:
            # New API requires creating an instance
            api = YouTubeTranscriptApi()
            transcript_list = api.list(self.video_id)
            # Try to find English transcript
            transcript = None
            for t in transcript_list:
                if t.language_code == 'en':
                    transcript = t
                    break
            # Fallback to first available
            if transcript is None:
                transcript = next(iter(transcript_list), None)
            if transcript is None:
                return "[No transcript available]"
            transcript_data = transcript.fetch()
            # New API returns FetchedTranscript with snippets
            if hasattr(transcript_data, 'snippets'):
                full_text = " ".join([snippet.text for snippet in transcript_data.snippets])
            else:
                # Fallback for older API format
                full_text = " ".join([entry['text'] for entry in transcript_data])
            # Clean up the text
            full_text = full_text.replace("\n", " ").strip()
            return full_text[:10000]  # Limit length
        except Exception as e:
            return f"[Transcript not available: {str(e)}]"

    def _generate_key_points(self) -> list:
        """Generate key points from transcript (simple extraction)."""
        transcript = self._get_transcript()
        if not transcript or transcript.startswith("["):
            return []
        # Simple sentence extraction (first few sentences as key points)
        sentences = transcript.split('.')[:5]
        key_points = [s.strip() + '.' for s in sentences if len(s.strip()) > 20]
        return key_points[:5]

    def _get_tags(self) -> list:
        """Get video tags."""
        try:
            self._init_youtube()
            if hasattr(self.youtube, 'keywords'):
                return self.youtube.keywords[:10] if self.youtube.keywords else []
        except Exception:
            pass
        return ["youtube", "video"]
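The ID patterns in `_extract_video_id` are plain regexes and can be exercised without pytube or the network; a minimal standalone sketch of the same matching:

```python
import re

# Same URL shapes YouTubeExtractor._extract_video_id accepts:
# watch?v=, youtu.be short links, and /embed/ URLs.
PATTERNS = [
    r'(?:youtube\.com/watch\?v=|youtu\.be/)([a-zA-Z0-9_-]{11})',
    r'youtube\.com/embed/([a-zA-Z0-9_-]{11})',
]

def video_id(url: str) -> str:
    for pattern in PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    raise ValueError(f"Could not extract YouTube video ID from: {url}")

print(video_id("https://youtu.be/dQw4w9WgXcQ"))  # dQw4w9WgXcQ
```

The `{11}` quantifier matters: YouTube IDs are exactly 11 characters from the `[a-zA-Z0-9_-]` alphabet, which keeps trailing query parameters out of the match.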

main.py Normal file

@@ -0,0 +1,199 @@
#!/usr/bin/env python3
"""
Content Extractor - Extract key information from URLs and save to Obsidian

Supports:
- YouTube videos (transcripts, descriptions, metadata)
- Blog posts & articles (web scraping)
- Instagram reels (via browser automation)
- Generic URLs (text extraction)

Usage:
    python main.py <url> [--obsidian-path <path>] [--output <filename>]
"""
import argparse
import sys
from datetime import datetime

from extractors.youtube_extractor import YouTubeExtractor
from extractors.blog_extractor import BlogExtractor
from extractors.instagram_extractor import InstagramExtractor
from obsidian_writer import ObsidianWriter
from config import Config


def detect_source_type(url: str) -> str:
    """Detect the type of content based on URL."""
    if "youtube.com" in url or "youtu.be" in url:
        return "youtube"
    elif "instagram.com" in url:
        return "instagram"
    else:
        return "blog"


def extract_content(url: str, source_type: str) -> dict:
    """Extract content from URL based on source type."""
    print(f"🔍 Extracting content from {source_type}...")
    if source_type == "youtube":
        extractor = YouTubeExtractor(url)
    elif source_type == "instagram":
        extractor = InstagramExtractor(url)
    else:
        extractor = BlogExtractor(url)
    return extractor.extract()


def main():
    parser = argparse.ArgumentParser(
        description="Extract content from URLs and save to Obsidian notes"
    )
    parser.add_argument("url", help="URL to extract content from")
    parser.add_argument(
        "--obsidian-path",
        type=str,
        default=Config.OBSIDIAN_VAULT_PATH,
        help="Path to Obsidian vault"
    )
    parser.add_argument(
        "--output",
        type=str,
        default=None,
        help="Output filename (without .md extension)"
    )
    parser.add_argument(
        "--folder",
        type=str,
        default="Content Extractor",
        help="Folder in Obsidian vault to save notes"
    )
    parser.add_argument(
        "--no-save",
        action="store_true",
        help="Only print extracted content, don't save to Obsidian"
    )
    parser.add_argument(
        "--summarize",
        action="store_true",
        help="Generate a summary of the content"
    )
    args = parser.parse_args()

    # Detect source type
    source_type = detect_source_type(args.url)
    print(f"📌 Detected source type: {source_type}")

    # Extract content
    try:
        content = extract_content(args.url, source_type)
    except Exception as e:
        print(f"❌ Extraction failed: {e}")
        sys.exit(1)

    if not content:
        print("❌ No content could be extracted")
        sys.exit(1)

    # Generate output filename
    if args.output:
        filename = args.output
    else:
        # Generate from title or URL
        title = content.get("title", "Untitled")
        filename = f"{title[:50]}_{datetime.now().strftime('%Y%m%d')}"
        # Sanitize filename
        filename = "".join(c for c in filename if c.isalnum() or c in " -_").strip()

    # Create markdown content
    markdown = generate_markdown(content, source_type, args.url)

    # Print preview
    print("\n" + "=" * 80)
    print("📝 EXTRACTED CONTENT PREVIEW")
    print("=" * 80)
    print(markdown[:2000] + "..." if len(markdown) > 2000 else markdown)
    print("=" * 80)

    # Save to Obsidian
    if not args.no_save:
        writer = ObsidianWriter(args.obsidian_path)
        output_path = writer.save_note(markdown, filename, args.folder)
        print(f"\n✅ Note saved to: {output_path}")
    else:
        print("\n⚠️  Note not saved (--no-save flag)")
    return content


def generate_markdown(content: dict, source_type: str, url: str) -> str:
    """Generate markdown content for Obsidian note."""
    lines = []

    # Header
    lines.append(f"# {content.get('title', 'Untitled')}")
    lines.append("")

    # Metadata
    lines.append("## Metadata")
    lines.append("")
    lines.append(f"- **Source**: {source_type.capitalize()}")
    lines.append(f"- **URL**: {url}")
    lines.append(f"- **Extracted**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if content.get("author"):
        lines.append(f"- **Author**: {content.get('author')}")
    if content.get("duration"):
        lines.append(f"- **Duration**: {content.get('duration')}")
    if content.get("publish_date"):
        lines.append(f"- **Published**: {content.get('publish_date')}")
    if content.get("views"):
        lines.append(f"- **Views**: {content.get('views')}")
    lines.append("")

    # Description/Summary
    if content.get("description"):
        lines.append("## Description")
        lines.append("")
        lines.append(content.get("description", ""))
        lines.append("")

    # Main Content (transcript, article text, etc.)
    if content.get("content"):
        lines.append("## Content")
        lines.append("")
        lines.append(content.get("content", ""))
        lines.append("")

    # Key Points/Summary
    if content.get("key_points"):
        lines.append("## Key Points")
        lines.append("")
        for point in content.get("key_points", []):
            lines.append(f"- {point}")
        lines.append("")

    # Tags
    lines.append("---")
    lines.append("")
    lines.append("## Tags")
    lines.append("")
    tags = content.get("tags", [])
    if not tags:
        tags = ["content-extractor", source_type, "notes"]
    lines.append(" ".join(f"#{tag}" for tag in tags))
    lines.append("")

    return "\n".join(lines)


if __name__ == "__main__":
    main()
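The substring-based routing in main.py's `detect_source_type` is simple enough to exercise standalone; a minimal sketch of the same dispatch rule:

```python
def detect_source_type(url: str) -> str:
    # Mirrors main.py's routing: YouTube and Instagram domains get
    # dedicated extractors, everything else is scraped as a blog.
    if "youtube.com" in url or "youtu.be" in url:
        return "youtube"
    if "instagram.com" in url:
        return "instagram"
    return "blog"

print(detect_source_type("https://youtu.be/abc12345678"))   # youtube
print(detect_source_type("https://medium.com/an-article"))  # blog
```

Substring matching keeps the dispatcher trivial, at the cost of misrouting URLs that merely mention those domains in a path or query string; the blog fallback absorbs anything unrecognized.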

obsidian_writer.py Normal file

@@ -0,0 +1,128 @@
"""
Obsidian Note Writer

Saves extracted content as markdown notes in Obsidian vault.
"""
from pathlib import Path
from datetime import datetime
from typing import Optional


class ObsidianWriter:
    """Write content to Obsidian vault as markdown notes."""

    def __init__(self, vault_path: str):
        self.vault_path = Path(vault_path).expanduser()
        self._validate_vault()

    def _validate_vault(self):
        """Validate that the path is an Obsidian vault."""
        if not self.vault_path.exists():
            print(f"⚠️  Creating Obsidian vault directory: {self.vault_path}")
            self.vault_path.mkdir(parents=True, exist_ok=True)
        # Check if it looks like an Obsidian vault
        obsidian_config = self.vault_path / ".obsidian"
        if not obsidian_config.exists():
            print(f"⚠️  Warning: {self.vault_path} doesn't look like an Obsidian vault")
            print("   (No .obsidian directory found)")
            print("   Notes will still be saved, but you may want to set the correct vault path")

    def save_note(
        self,
        content: str,
        filename: str,
        folder: Optional[str] = None,
        subfolder: Optional[str] = None
    ) -> Path:
        """
        Save a note to Obsidian vault.

        Args:
            content: Markdown content to save
            filename: Filename without .md extension
            folder: Folder in vault (default: root)
            subfolder: Subfolder within folder (optional)

        Returns:
            Path to saved file
        """
        # Build path
        if folder:
            note_dir = self.vault_path / folder
            if subfolder:
                note_dir = note_dir / subfolder
        else:
            note_dir = self.vault_path
        # Create directory if it doesn't exist
        note_dir.mkdir(parents=True, exist_ok=True)
        # Sanitize filename
        filename = self._sanitize_filename(filename)
        # Add .md extension
        filepath = note_dir / f"{filename}.md"
        # Handle duplicate filenames
        counter = 1
        original_filepath = filepath
        while filepath.exists():
            filepath = original_filepath.with_name(f"{filename}_{counter}.md")
            counter += 1
        # Write the file
        try:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(content)
            print(f"✅ Note saved: {filepath.name}")
            return filepath
        except Exception as e:
            raise Exception(f"Failed to save note: {str(e)}")

    def _sanitize_filename(self, filename: str) -> str:
        """Sanitize filename for filesystem."""
        # Remove invalid characters
        invalid_chars = '<>:"/\\|?*'
        for char in invalid_chars:
            filename = filename.replace(char, '')
        # Replace spaces with hyphens (optional, but cleaner)
        # filename = filename.replace(' ', '-')
        # Limit length
        if len(filename) > 100:
            filename = filename[:100]
        return filename.strip()

    def create_daily_note(self, content: str) -> Path:
        """Create/update a daily note."""
        today = datetime.now().strftime("%Y-%m-%d")
        folder = "Daily Notes"
        return self.save_note(content, today, folder)

    def append_to_note(self, filename: str, content: str, folder: Optional[str] = None) -> Path:
        """Append content to an existing note."""
        if folder:
            note_dir = self.vault_path / folder
        else:
            note_dir = self.vault_path
        filepath = note_dir / f"{filename}.md"
        # If file doesn't exist, create it
        if not filepath.exists():
            return self.save_note(content, filename, folder)
        # Append to existing file
        try:
            with open(filepath, 'a', encoding='utf-8') as f:
                f.write("\n\n---\n\n")
                f.write(content)
            print(f"✅ Content appended to: {filepath.name}")
            return filepath
        except Exception as e:
            raise Exception(f"Failed to append to note: {str(e)}")
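The character-stripping rule in `_sanitize_filename` can be sketched as a free function and checked without touching the filesystem:

```python
def sanitize(filename: str, max_len: int = 100) -> str:
    # Mirrors ObsidianWriter._sanitize_filename: drop characters that
    # are invalid in Windows/macOS filenames, then cap the length.
    for ch in '<>:"/\\|?*':
        filename = filename.replace(ch, '')
    return filename[:max_len].strip()

print(sanitize('How to: Build "AI" Agents?'))  # How to Build AI Agents
```

Note it removes rather than replaces the offending characters, so `a/b` and `ab` collide; the duplicate-filename loop in `save_note` is what keeps such collisions from overwriting notes.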

requirements.txt Normal file

@@ -0,0 +1,23 @@
# Content Extractor Dependencies

# Web scraping
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0

# YouTube
youtube-transcript-api>=0.6.0
pytube>=15.0.0

# Browser automation (for Instagram and dynamic content)
playwright>=1.40.0

# Text processing
markdown>=3.5.0

# Utilities
python-dotenv>=1.0.0
pydantic>=2.5.0

# Date handling
python-dateutil>=2.8.0