feat: Initial commit - Content Extractor for YouTube, Instagram, and blogs
- YouTube extraction with transcript support
- Instagram reel extraction via browser automation
- Blog/article web scraping
- Auto-save to Obsidian vaults
- Smart key point generation
- Configurable via .env file
- Quick extract shell script

Tech stack: Python, requests, beautifulsoup4, playwright, youtube-transcript-api
.env.example (new file, 21 lines)
@@ -0,0 +1,21 @@
# Content Extractor Configuration

# Obsidian vault path (default: ~/Obsidian Vault)
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram extraction)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction settings
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube settings
YOUTUBE_LANGUAGE=en

# Instagram settings
INSTAGRAM_WAIT_TIME=5

# Logging
LOG_LEVEL=INFO
.gitignore (new file, vendored, 52 lines)
@@ -0,0 +1,52 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Environment
.env
.env.local

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Logs
*.log
logs/

# OS
.DS_Store
Thumbs.db

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# Playwright
.playwright/
README.md (new file, 192 lines)
@@ -0,0 +1,192 @@
# Content Extractor 🔥

Extract key information from URLs (YouTube, Instagram, blogs) and save it to Obsidian notes automatically.

## Features

- **YouTube Videos**: Extract title, description, transcript, author, duration, views
- **Instagram Reels**: Extract caption, author, engagement metrics (via browser automation)
- **Blog Posts/Articles**: Extract title, author, content, tags, publish date
- **Auto-save to Obsidian**: Notes are automatically formatted and saved to your Obsidian vault
- **Smart Summaries**: Generates key points from extracted content

## Installation

```bash
# Navigate to the content-extractor directory
cd ~/Desktop/itsthatnewshit/content-extractor

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for Instagram extraction)
playwright install
```

## Usage

### Basic Usage

```bash
# Extract from a YouTube video
python main.py "https://www.youtube.com/watch?v=VIDEO_ID"

# Extract from an Instagram reel
python main.py "https://www.instagram.com/reel/REEL_ID"

# Extract from a blog post
python main.py "https://example.com/article"
```

### Advanced Options

```bash
# Specify the Obsidian vault path
python main.py <url> --obsidian-path "/path/to/Obsidian Vault"

# Custom output filename
python main.py <url> --output "my-note-title"

# Save to a specific folder in Obsidian
python main.py <url> --folder "Learning/YouTube"

# Only print content, don't save to Obsidian
python main.py <url> --no-save

# Generate a summary
python main.py <url> --summarize
```

### Examples

```bash
# Save a YouTube tutorial to the Learning folder
python main.py "https://youtu.be/abc123" --folder "Learning" --output "Python Tutorial"

# Extract an Instagram reel without saving
python main.py "https://instagram.com/reel/xyz789" --no-save

# Extract a blog post to the default vault
python main.py "https://medium.com/article" --folder "Articles"
```

## Configuration

Create a `.env` file in the project directory to customize settings:

```bash
cp .env.example .env
```

Edit `.env` with your preferences:

```env
# Obsidian vault path
OBSIDIAN_VAULT_PATH=~/Obsidian Vault

# Browser settings (for Instagram)
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Content extraction
MAX_CONTENT_LENGTH=10000
GENERATE_SUMMARY=true

# YouTube
YOUTUBE_LANGUAGE=en

# Instagram
INSTAGRAM_WAIT_TIME=5
```

## Output Format

Notes are saved in Markdown format with:

- Title and metadata (source, URL, extraction date)
- Author, duration, views (when available)
- Description/summary
- Full content (transcript or article text)
- Key points
- Tags for easy organization

Example output:

```markdown
# How to Build AI Agents

## Metadata
- **Source**: Youtube
- **URL**: https://youtube.com/watch?v=abc123
- **Extracted**: 2026-02-21 15:30:00
- **Author**: Tech Channel
- **Duration**: 12:34
- **Views**: 1.2M

## Description
Learn how to build AI agents from scratch...

## Content
[Full transcript or article text...]

## Key Points
- Point 1 from the content
- Point 2 from the content
- Point 3 from the content

---

## Tags
#youtube #video #ai #agents #notes
```

## Troubleshooting

### Instagram extraction fails
Instagram requires browser automation. Make sure you've run:

```bash
playwright install
```

If it still fails, Instagram may have changed its UI. The extractor has a fallback mode that will still extract basic info.
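
To see what the browser is doing while you debug, you can disable headless mode; `BROWSER_HEADLESS` is read from the environment (or `.env`) by `config.py` and passed through to the extractor:

```bash
# Run with a visible browser window to inspect what the page actually renders
BROWSER_HEADLESS=false python main.py "https://www.instagram.com/reel/REEL_ID" --no-save
```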

### YouTube transcript not available
Some videos don't have captions/transcripts. The extractor will fall back to extracting the description only.

### Obsidian vault not found
By default, the tool looks for `~/Obsidian Vault`. If your vault is elsewhere, use the `--obsidian-path` flag or set `OBSIDIAN_VAULT_PATH` in your `.env` file.
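
For example (the vault path here is illustrative):

```bash
python main.py "https://example.com/article" --obsidian-path "$HOME/Notes/MyVault"
```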

## Project Structure

```
content-extractor/
├── main.py                    # Main entry point
├── config.py                  # Configuration settings
├── obsidian_writer.py         # Obsidian note writer
├── requirements.txt           # Python dependencies
├── .env.example               # Example environment file
├── README.md                  # This file
└── extractors/
    ├── __init__.py
    ├── youtube_extractor.py   # YouTube extraction
    ├── instagram_extractor.py # Instagram extraction
    └── blog_extractor.py      # Blog/article extraction
```

## Future Enhancements

- [ ] AI-powered summarization (using LLMs)
- [ ] Podcast/audio extraction (whisper transcription)
- [ ] Twitter/X thread extraction
- [ ] LinkedIn post extraction
- [ ] Batch processing (extract from multiple URLs)
- [ ] Web interface
- [ ] Automatic tagging based on content

## License

MIT License - Feel free to use and modify!

---

Built with 🔥 by RUBIUS for naki
config.py (new file, 47 lines)
@@ -0,0 +1,47 @@
"""
Configuration for Content Extractor
"""

import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()


class Config:
    """Configuration settings for content extractor."""

    # Obsidian vault path (default to common locations)
    OBSIDIAN_VAULT_PATH = os.getenv(
        "OBSIDIAN_VAULT_PATH",
        os.path.expanduser("~/Obsidian Vault")  # Default location
    )

    # Browser settings (for Instagram and dynamic content)
    BROWSER_HEADLESS = os.getenv("BROWSER_HEADLESS", "true").lower() == "true"
    BROWSER_TIMEOUT = int(os.getenv("BROWSER_TIMEOUT", "30000"))  # 30 seconds

    # Content extraction settings
    MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", "10000"))  # Max chars
    GENERATE_SUMMARY = os.getenv("GENERATE_SUMMARY", "true").lower() == "true"

    # YouTube settings
    YOUTUBE_LANGUAGE = os.getenv("YOUTUBE_LANGUAGE", "en")

    # Instagram settings (requires browser automation)
    INSTAGRAM_WAIT_TIME = int(os.getenv("INSTAGRAM_WAIT_TIME", "5"))  # seconds

    # Logging
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
    LOG_FILE = os.getenv("LOG_FILE", "content_extractor.log")

    @classmethod
    def validate(cls):
        """Validate configuration."""
        # Check if Obsidian vault path exists
        if not Path(cls.OBSIDIAN_VAULT_PATH).exists():
            print(f"⚠️  Warning: Obsidian vault path does not exist: {cls.OBSIDIAN_VAULT_PATH}")
            print("   You can set OBSIDIAN_VAULT_PATH environment variable or use --obsidian-path flag")
        return True
extract.sh (new executable file, 25 lines)
@@ -0,0 +1,25 @@
#!/bin/bash
# Content Extractor - Quick extraction script
# Usage: ./extract.sh <url> [folder]

if [ -z "$1" ]; then
    echo "Usage: $0 <url> [folder]"
    echo ""
    echo "Examples:"
    echo "  $0 https://youtube.com/watch?v=abc123"
    echo "  $0 https://instagram.com/reel/xyz789 Learning"
    echo "  $0 https://medium.com/article Articles"
    exit 1
fi

URL="$1"
FOLDER="${2:-Content Extractor}"

echo "🔥 Content Extractor"
echo "===================="
echo "URL: $URL"
echo "Folder: $FOLDER"
echo ""

cd "$(dirname "$0")"
python main.py "$URL" --folder "$FOLDER"
extractors/__init__.py (new file, 13 lines)
@@ -0,0 +1,13 @@
"""
Content Extractors Package
"""

from .youtube_extractor import YouTubeExtractor
from .blog_extractor import BlogExtractor
from .instagram_extractor import InstagramExtractor

__all__ = [
    "YouTubeExtractor",
    "BlogExtractor",
    "InstagramExtractor",
]
extractors/blog_extractor.py (new file, 224 lines)
@@ -0,0 +1,224 @@
"""
Blog/Article Extractor

Extracts:
- Title, author, publish date
- Main article content
- Tags/categories
- Summary
"""

import re
from typing import Dict, Any, Optional
from urllib.parse import urlparse

try:
    import requests
    from bs4 import BeautifulSoup
except ImportError:
    requests = None
    BeautifulSoup = None


class BlogExtractor:
    """Extract content from blog posts and articles."""

    def __init__(self, url: str):
        self.url = url
        self.html = None
        self.soup = None
        self._fetch_page()

    def _fetch_page(self):
        """Fetch the webpage."""
        if requests is None:
            raise ImportError("requests not installed. Run: pip install requests")

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }

        try:
            response = requests.get(self.url, headers=headers, timeout=30)
            response.raise_for_status()
            self.html = response.text
        except Exception as e:
            raise Exception(f"Failed to fetch page: {str(e)}")

    def _parse_html(self):
        """Parse HTML with BeautifulSoup."""
        if BeautifulSoup is None:
            raise ImportError("beautifulsoup4 not installed. Run: pip install beautifulsoup4")

        if self.soup is None:
            self.soup = BeautifulSoup(self.html, 'lxml')

    def extract(self) -> Dict[str, Any]:
        """Extract all content from the page."""
        self._parse_html()

        content = {
            "title": self._get_title(),
            "description": self._get_description(),
            "author": self._get_author(),
            "publish_date": self._get_publish_date(),
            "content": self._get_content(),
            "key_points": self._generate_key_points(),
            "tags": self._get_tags(),
        }

        return content

    def _get_title(self) -> str:
        """Get page title."""
        # Try Open Graph title first
        og_title = self.soup.find('meta', property='og:title')
        if og_title and og_title.get('content'):
            return og_title['content'].strip()

        # Try Twitter card title
        twitter_title = self.soup.find('meta', attrs={'name': 'twitter:title'})
        if twitter_title and twitter_title.get('content'):
            return twitter_title['content'].strip()

        # Try h1 tag
        h1 = self.soup.find('h1')
        if h1:
            return h1.get_text().strip()

        # Fallback to <title> tag
        title_tag = self.soup.find('title')
        if title_tag:
            return title_tag.get_text().strip()

        return "Untitled Article"

    def _get_description(self) -> str:
        """Get page description."""
        # Try Open Graph description
        og_desc = self.soup.find('meta', property='og:description')
        if og_desc and og_desc.get('content'):
            return og_desc['content'].strip()

        # Try meta description
        meta_desc = self.soup.find('meta', attrs={'name': 'description'})
        if meta_desc and meta_desc.get('content'):
            return meta_desc['content'].strip()

        return ""

    def _get_author(self) -> str:
        """Get article author."""
        # Try Open Graph author
        og_author = self.soup.find('meta', property='article:author')
        if og_author and og_author.get('content'):
            return og_author['content'].strip()

        # Try meta author
        meta_author = self.soup.find('meta', attrs={'name': 'author'})
        if meta_author and meta_author.get('content'):
            return meta_author['content'].strip()

        # Try to find author in byline
        byline = self.soup.find(class_=re.compile(r'byline|author', re.I))
        if byline:
            return byline.get_text().strip()

        return "Unknown"

    def _get_publish_date(self) -> str:
        """Get publish date."""
        # Try Open Graph publish time
        og_time = self.soup.find('meta', property='article:published_time')
        if og_time and og_time.get('content'):
            return og_time['content'][:10]  # YYYY-MM-DD

        # Try meta publish date
        meta_time = self.soup.find('meta', attrs={'name': 'date'})
        if meta_time and meta_time.get('content'):
            return meta_time['content'][:10]

        # Try time tag
        time_tag = self.soup.find('time')
        if time_tag and time_tag.get('datetime'):
            return time_tag['datetime'][:10]

        return "Unknown"

    def _get_content(self) -> str:
        """Extract main article content."""
        # Remove unwanted elements
        for element in self.soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
            element.decompose()

        # Try to find main content area
        content_areas = [
            self.soup.find('article'),
            self.soup.find(class_=re.compile(r'article|content|post|entry', re.I)),
            self.soup.find(id=re.compile(r'article|content|post', re.I)),
            self.soup.find('main'),
        ]

        content_elem = next((elem for elem in content_areas if elem), None)

        if content_elem:
            # Get paragraphs from content area
            paragraphs = content_elem.find_all('p')
        else:
            # Fallback to all paragraphs
            paragraphs = self.soup.find_all('p')

        # Extract text from paragraphs
        text_parts = []
        for p in paragraphs:
            text = p.get_text().strip()
            if len(text) > 50:  # Filter out short paragraphs
                text_parts.append(text)

        # Join and clean
        content = "\n\n".join(text_parts)
        content = re.sub(r'\n{3,}', '\n\n', content)  # Remove excessive newlines

        return content[:10000]  # Limit length

    def _generate_key_points(self) -> list:
        """Generate key points from content."""
        content = self._get_content()

        if not content:
            return []

        # Extract first few sentences as key points
        sentences = re.split(r'[.!?]+', content)
        key_points = []

        for sentence in sentences[:5]:
            sentence = sentence.strip()
            if len(sentence) > 30 and len(sentence) < 200:
                key_points.append(sentence + '.')

        return key_points

    def _get_tags(self) -> list:
        """Get article tags/categories."""
        tags = []

        # Try Open Graph article tags
        og_tags = self.soup.find_all('meta', property='article:tag')
        for tag in og_tags:
            if tag.get('content'):
                tags.append(tag['content'].lower().replace(' ', '-'))

        # Try to find tag elements
        tag_elements = self.soup.find_all(class_=re.compile(r'tag|category|label', re.I))
        for elem in tag_elements[:5]:  # Limit to 5
            text = elem.get_text().strip().lower()
            if len(text) < 30:
                tags.append(text.replace(' ', '-'))

        # Add domain-based tag
        domain = urlparse(self.url).netloc
        if domain:
            tags.append(domain.replace('www.', '').split('.')[0])

        return list(set(tags))[:10]  # Remove duplicates and limit
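
For reference, the extractors can also be used directly from Python. A minimal sketch (the URL is illustrative; assumes the dependencies from requirements.txt are installed):

```python
from extractors.blog_extractor import BlogExtractor

# The page is fetched in __init__; extract() parses it and returns a dict
extractor = BlogExtractor("https://example.com/some-article")
data = extractor.extract()
print(data["title"], "by", data["author"], "on", data["publish_date"])
print(data["content"][:200])
```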
extractors/instagram_extractor.py (new file, 175 lines)
@@ -0,0 +1,175 @@
"""
Instagram Reel Extractor

Extracts:
- Title/caption
- Author/creator
- Description
- Transcript (if available via captions)
- Metadata (views, likes, etc.)

Note: Instagram requires browser automation. Uses Playwright.
"""

import time
from typing import Dict, Any
from urllib.parse import urlparse

try:
    from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout
except ImportError:
    sync_playwright = None


class InstagramExtractor:
    """Extract content from Instagram reels."""

    def __init__(self, url: str, headless: bool = True):
        self.url = url
        self.headless = headless
        self.data = {}

        if sync_playwright is None:
            raise ImportError("playwright not installed. Run: pip install playwright && playwright install")

    def extract(self) -> Dict[str, Any]:
        """Extract content from Instagram reel."""
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=self.headless)
                page = browser.new_page(
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
                )

                # Navigate to the reel
                print("📱 Loading Instagram reel...")
                page.goto(self.url, timeout=30000)

                # Wait for content to load
                time.sleep(3)

                # Try to close any cookies/login prompts
                try:
                    page.click('button:has-text("Not Now")', timeout=3000)
                except Exception:
                    pass

                try:
                    page.click('button:has-text("Allow")', timeout=3000)
                except Exception:
                    pass

                # Extract data
                self.data = self._extract_data(page)

                browser.close()
        except PlaywrightTimeout:
            print("⚠️  Timeout loading Instagram page")
            self.data = self._fallback_extract()
        except Exception as e:
            print(f"⚠️  Error: {str(e)}")
            self.data = self._fallback_extract()

        return self.data

    def _extract_data(self, page) -> Dict[str, Any]:
        """Extract data from loaded page."""
        data = {
            "title": "Instagram Reel",
            "description": "",
            "author": "Unknown",
            "content": "",
            "key_points": [],
            "tags": ["instagram", "reel"],
        }

        # Try to get caption/description
        try:
            # Look for caption text
            captions = page.query_selector_all('h1, h2, span')
            for caption in captions:
                text = caption.inner_text()
                if len(text) > 20 and len(text) < 500:
                    if not data["description"]:
                        data["description"] = text
                        break
        except Exception as e:
            print(f"⚠️  Could not extract caption: {e}")

        # Try to get author
        try:
            author_elem = page.query_selector('a[href*="/"] h1, a[href*="/"] h2, header span')
            if author_elem:
                data["author"] = author_elem.inner_text().strip()
        except Exception:
            pass

        # Try to get engagement metrics
        try:
            likes_elem = page.query_selector('span:has-text("likes"), span:has-text("views")')
            if likes_elem:
                data["views"] = likes_elem.inner_text().strip()
        except Exception:
            pass

        # Extract any visible text as content
        try:
            # Get all text content
            body_text = page.inner_text('body')

            # Filter for meaningful content
            lines = body_text.split('\n')
            meaningful_lines = [
                line.strip() for line in lines
                if len(line.strip()) > 30 and len(line.strip()) < 300
            ]

            data["content"] = "\n\n".join(meaningful_lines[:10])[:5000]
        except Exception as e:
            print(f"⚠️  Could not extract page text: {e}")

        # Generate key points from description
        if data["description"]:
            sentences = data["description"].split('.')[:3]
            data["key_points"] = [s.strip() + '.' for s in sentences if len(s.strip()) > 20]

        # Add URL-based tags (without duplicating defaults)
        parsed = urlparse(self.url)
        if '/reel/' in parsed.path and "reel" not in data["tags"]:
            data["tags"].append("reel")
        if '/video/' in parsed.path and "video" not in data["tags"]:
            data["tags"].append("video")

        return data

    def _fallback_extract(self) -> Dict[str, Any]:
        """Fallback extraction when browser automation fails."""
        print("⚠️  Using fallback extraction method...")

        # Try to extract what we can from the URL itself
        data = {
            "title": "Instagram Content",
            "description": "[Could not extract - Instagram requires login]",
            "author": "Unknown",
            "content": "",
            "key_points": [
                "Instagram content extraction requires browser automation",
                "Consider using Instagram's official API or downloading the video manually",
            ],
            "tags": ["instagram", "fallback"],
        }

        # Extract reel ID from URL
        try:
            parsed = urlparse(self.url)
            path_parts = parsed.path.split('/')
            for i, part in enumerate(path_parts):
                if part in ['reel', 'p', 'tv'] and i + 1 < len(path_parts):
                    reel_id = path_parts[i + 1]
                    data["key_points"].append(f"Reel ID: {reel_id}")
                    break
        except Exception:
            pass

        return data
extractors/youtube_extractor.py (new file, 203 lines)
@@ -0,0 +1,203 @@
"""
YouTube Video Extractor

Extracts:
- Title, description, author
- Transcript/captions
- Duration, views, publish date
- Tags/categories
"""

import re
from typing import Dict, Any

try:
    from youtube_transcript_api import YouTubeTranscriptApi
except ImportError:
    YouTubeTranscriptApi = None

try:
    from pytubefix import YouTube  # More reliable than pytube
except ImportError:
    try:
        from pytube import YouTube
    except ImportError:
        YouTube = None


class YouTubeExtractor:
    """Extract content from YouTube videos."""

    def __init__(self, url: str):
        self.url = url
        self.video_id = self._extract_video_id(url)
        self.youtube = None

    def _extract_video_id(self, url: str) -> str:
        """Extract video ID from YouTube URL."""
        patterns = [
            r'(?:youtube\.com\/watch\?v=|youtu\.be\/)([a-zA-Z0-9_-]{11})',
            r'youtube\.com\/embed\/([a-zA-Z0-9_-]{11})',
            r'youtube\.com\/v\/([a-zA-Z0-9_-]{11})',
        ]

        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)

        raise ValueError(f"Could not extract YouTube video ID from: {url}")

    def _init_youtube(self):
        """Initialize YouTube object."""
        if YouTube is None:
            raise ImportError("pytube or pytubefix not installed. Run: pip install pytubefix")

        if self.youtube is None:
            self.youtube = YouTube(self.url)

    def extract(self) -> Dict[str, Any]:
        """Extract all content from YouTube video."""
        self._init_youtube()

        content = {
            "title": self._get_title(),
            "description": self._get_description(),
            "author": self._get_author(),
            "duration": self._get_duration(),
            "publish_date": self._get_publish_date(),
            "views": self._get_views(),
            "content": self._get_transcript(),
            "key_points": self._generate_key_points(),
            "tags": self._get_tags(),
        }

        return content

    def _get_title(self) -> str:
        """Get video title."""
        try:
            self._init_youtube()
            return self.youtube.title
        except Exception:
            return f"Video {self.video_id}"

    def _get_description(self) -> str:
        """Get video description."""
        try:
            self._init_youtube()
            return self.youtube.description or ""
        except Exception:
            return ""

    def _get_author(self) -> str:
        """Get video author/channel name."""
        try:
            self._init_youtube()
            return self.youtube.author
        except Exception:
            return "Unknown"

    def _get_duration(self) -> str:
        """Get video duration in readable format."""
        try:
            self._init_youtube()
            seconds = self.youtube.length
            minutes, secs = divmod(seconds, 60)
            hours, minutes = divmod(minutes, 60)

            if hours > 0:
                return f"{hours}:{minutes:02d}:{secs:02d}"
            else:
                return f"{minutes}:{secs:02d}"
        except Exception:
            return "Unknown"

    def _get_publish_date(self) -> str:
        """Get video publish date."""
        try:
            self._init_youtube()
            if hasattr(self.youtube, 'publish_date') and self.youtube.publish_date:
                return self.youtube.publish_date.strftime("%Y-%m-%d")
        except Exception:
            pass
        return "Unknown"

    def _get_views(self) -> str:
        """Get view count."""
        try:
            self._init_youtube()
            views = self.youtube.views
            if views > 1_000_000:
                return f"{views / 1_000_000:.1f}M"
            elif views > 1_000:
                return f"{views / 1_000:.1f}K"
            else:
                return str(views)
        except Exception:
            return "Unknown"

    def _get_transcript(self) -> str:
        """Get video transcript/captions."""
        if YouTubeTranscriptApi is None:
            return "[Transcript not available - youtube-transcript-api not installed]"

        try:
            # New API requires creating an instance
            api = YouTubeTranscriptApi()
            transcript_list = api.list(self.video_id)

            # Try to find English transcript
            transcript = None
            for t in transcript_list:
                if t.language_code == 'en':
                    transcript = t
                    break

            # Fallback to first available
            if transcript is None:
                transcript = next(iter(transcript_list), None)

            if transcript is None:
                return "[No transcript available]"

            transcript_data = transcript.fetch()

            # New API returns FetchedTranscript with snippets
            if hasattr(transcript_data, 'snippets'):
                full_text = " ".join([snippet.text for snippet in transcript_data.snippets])
            else:
                # Fallback for older API format
                full_text = " ".join([entry['text'] for entry in transcript_data])

            # Clean up the text
            full_text = full_text.replace("\n", " ").strip()

            return full_text[:10000]  # Limit length
        except Exception as e:
            return f"[Transcript not available: {str(e)}]"

    def _generate_key_points(self) -> list:
        """Generate key points from transcript (simple extraction)."""
        transcript = self._get_transcript()

        if not transcript or transcript.startswith("["):
            return []

        # Simple sentence extraction (first few sentences as key points)
        sentences = transcript.split('.')[:5]
        key_points = [s.strip() + '.' for s in sentences if len(s.strip()) > 20]

        return key_points[:5]

    def _get_tags(self) -> list:
        """Get video tags."""
        try:
            self._init_youtube()
            if hasattr(self.youtube, 'keywords'):
                return self.youtube.keywords[:10] if self.youtube.keywords else []
        except Exception:
            pass
        return ["youtube", "video"]
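
Likewise, a minimal direct-use sketch for the YouTube extractor (the 11-character video ID is a placeholder; metadata comes from pytube/pytubefix, the transcript from youtube-transcript-api):

```python
from extractors.youtube_extractor import YouTubeExtractor

# Placeholder 11-character video ID; any real watch URL works
extractor = YouTubeExtractor("https://www.youtube.com/watch?v=XXXXXXXXXXX")
data = extractor.extract()
print(data["title"], data["duration"], data["views"])
# If no captions exist, "content" holds a bracketed placeholder instead of a transcript
print(data["content"][:300])
```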
main.py (new file, 199 lines)
@@ -0,0 +1,199 @@
#!/usr/bin/env python3
"""
Content Extractor - Extract key information from URLs and save to Obsidian

Supports:
- YouTube videos (transcripts, descriptions, metadata)
- Blog posts & articles (web scraping)
- Instagram reels (via browser automation)
- Generic URLs (text extraction)

Usage:
    python main.py <url> [--obsidian-path <path>] [--output <filename>]
"""

import argparse
import sys
from datetime import datetime

from extractors.youtube_extractor import YouTubeExtractor
from extractors.blog_extractor import BlogExtractor
from extractors.instagram_extractor import InstagramExtractor
from obsidian_writer import ObsidianWriter
from config import Config


def detect_source_type(url: str) -> str:
    """Detect the type of content based on URL."""
    if "youtube.com" in url or "youtu.be" in url:
        return "youtube"
    elif "instagram.com" in url:
        return "instagram"
    else:
        return "blog"


def extract_content(url: str, source_type: str) -> dict:
    """Extract content from URL based on source type."""
    print(f"🔍 Extracting content from {source_type}...")

    if source_type == "youtube":
        extractor = YouTubeExtractor(url)
    elif source_type == "instagram":
        # Honor the BROWSER_HEADLESS setting from config/.env
        extractor = InstagramExtractor(url, headless=Config.BROWSER_HEADLESS)
    else:
        extractor = BlogExtractor(url)

    return extractor.extract()


def main():
    parser = argparse.ArgumentParser(
        description="Extract content from URLs and save to Obsidian notes"
    )
    parser.add_argument("url", help="URL to extract content from")
    parser.add_argument(
        "--obsidian-path",
        type=str,
        default=Config.OBSIDIAN_VAULT_PATH,
        help="Path to Obsidian vault"
    )
    parser.add_argument(
        "--output",
        type=str,
        default=None,
        help="Output filename (without .md extension)"
    )
    parser.add_argument(
        "--folder",
        type=str,
        default="Content Extractor",
        help="Folder in Obsidian vault to save notes"
    )
    parser.add_argument(
        "--no-save",
        action="store_true",
        help="Only print extracted content, don't save to Obsidian"
    )
    parser.add_argument(
        "--summarize",
        action="store_true",
        help="Generate a summary of the content"
    )

    args = parser.parse_args()

    # Detect source type
    source_type = detect_source_type(args.url)
    print(f"📌 Detected source type: {source_type}")

    # Extract content
    try:
        content = extract_content(args.url, source_type)
    except Exception as e:
        print(f"❌ Extraction failed: {e}")
        sys.exit(1)

    if not content:
        print("❌ No content could be extracted")
        sys.exit(1)

    # Generate output filename
    if args.output:
        filename = args.output
    else:
        # Generate from title or URL
        title = content.get("title", "Untitled")
        filename = f"{title[:50]}_{datetime.now().strftime('%Y%m%d')}"
        # Sanitize filename
        filename = "".join(c for c in filename if c.isalnum() or c in " -_").strip()

    # Create markdown content
    markdown = generate_markdown(content, source_type, args.url)

    # Print preview
    print("\n" + "="*80)
    print("📝 EXTRACTED CONTENT PREVIEW")
    print("="*80)
    print(markdown[:2000] + "..." if len(markdown) > 2000 else markdown)
    print("="*80)

    # Save to Obsidian
    if not args.no_save:
        writer = ObsidianWriter(args.obsidian_path)
        output_path = writer.save_note(markdown, filename, args.folder)
        print(f"\n✅ Note saved to: {output_path}")
    else:
        print("\n⚠️  Note not saved (--no-save flag)")

    return content


def generate_markdown(content: dict, source_type: str, url: str) -> str:
    """Generate markdown content for Obsidian note."""
    lines = []

    # Header
    lines.append(f"# {content.get('title', 'Untitled')}")
    lines.append("")

    # Metadata
    lines.append("## Metadata")
    lines.append("")
    lines.append(f"- **Source**: {source_type.capitalize()}")
    lines.append(f"- **URL**: {url}")
    lines.append(f"- **Extracted**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    if content.get("author"):
        lines.append(f"- **Author**: {content.get('author')}")
    if content.get("duration"):
        lines.append(f"- **Duration**: {content.get('duration')}")
    if content.get("publish_date"):
        lines.append(f"- **Published**: {content.get('publish_date')}")
    if content.get("views"):
        lines.append(f"- **Views**: {content.get('views')}")

    lines.append("")

    # Description/Summary
    if content.get("description"):
        lines.append("## Description")
        lines.append("")
        lines.append(content.get("description", ""))
        lines.append("")

    # Main Content (transcript, article text, etc.)
    if content.get("content"):
        lines.append("## Content")
        lines.append("")
        lines.append(content.get("content", ""))
        lines.append("")

    # Key Points/Summary
    if content.get("key_points"):
        lines.append("## Key Points")
        lines.append("")
        for point in content.get("key_points", []):
            lines.append(f"- {point}")
        lines.append("")

    # Tags
    lines.append("---")
    lines.append("")
    lines.append("## Tags")
    lines.append("")
    tags = content.get("tags", [])
    if not tags:
        tags = ["content-extractor", source_type, "notes"]
    lines.append(" ".join(f"#{tag}" for tag in tags))
    lines.append("")

    return "\n".join(lines)


if __name__ == "__main__":
    main()
obsidian_writer.py (new file, 128 lines)
@@ -0,0 +1,128 @@
"""
Obsidian Note Writer

Saves extracted content as markdown notes in Obsidian vault.
"""

from pathlib import Path
from datetime import datetime
from typing import Optional


class ObsidianWriter:
    """Write content to Obsidian vault as markdown notes."""

    def __init__(self, vault_path: str):
        self.vault_path = Path(vault_path).expanduser()
        self._validate_vault()

    def _validate_vault(self):
        """Validate that the path is an Obsidian vault."""
        if not self.vault_path.exists():
            print(f"⚠️  Creating Obsidian vault directory: {self.vault_path}")
            self.vault_path.mkdir(parents=True, exist_ok=True)

        # Check if it looks like an Obsidian vault
        obsidian_config = self.vault_path / ".obsidian"
        if not obsidian_config.exists():
            print(f"⚠️  Warning: {self.vault_path} doesn't look like an Obsidian vault")
            print("   (No .obsidian directory found)")
            print("   Notes will still be saved, but you may want to set the correct vault path")

    def save_note(
        self,
        content: str,
        filename: str,
        folder: Optional[str] = None,
        subfolder: Optional[str] = None
    ) -> Path:
        """
        Save a note to Obsidian vault.

        Args:
            content: Markdown content to save
            filename: Filename without .md extension
            folder: Folder in vault (default: root)
            subfolder: Subfolder within folder (optional)

        Returns:
            Path to saved file
        """
        # Build path
        if folder:
            note_dir = self.vault_path / folder
            if subfolder:
                note_dir = note_dir / subfolder
        else:
            note_dir = self.vault_path

        # Create directory if it doesn't exist
        note_dir.mkdir(parents=True, exist_ok=True)

        # Sanitize filename
        filename = self._sanitize_filename(filename)

        # Add .md extension
        filepath = note_dir / f"{filename}.md"

        # Handle duplicate filenames
        counter = 1
        original_filepath = filepath
        while filepath.exists():
            filepath = original_filepath.with_name(f"{filename}_{counter}.md")
            counter += 1

        # Write the file
        try:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(content)
            print(f"✅ Note saved: {filepath.name}")
            return filepath
        except Exception as e:
            raise Exception(f"Failed to save note: {str(e)}")

    def _sanitize_filename(self, filename: str) -> str:
        """Sanitize filename for filesystem."""
        # Remove invalid characters
        invalid_chars = '<>:"/\\|?*'
        for char in invalid_chars:
            filename = filename.replace(char, '')

        # Replace spaces with hyphens (optional, but cleaner)
        # filename = filename.replace(' ', '-')

        # Limit length
        if len(filename) > 100:
            filename = filename[:100]

        return filename.strip()

    def create_daily_note(self, content: str) -> Path:
        """Create/update a daily note."""
        today = datetime.now().strftime("%Y-%m-%d")
        folder = "Daily Notes"
        return self.save_note(content, today, folder)

    def append_to_note(self, filename: str, content: str, folder: Optional[str] = None) -> Path:
        """Append content to an existing note."""
        if folder:
            note_dir = self.vault_path / folder
        else:
            note_dir = self.vault_path

        filepath = note_dir / f"{filename}.md"

        # If file doesn't exist, create it
        if not filepath.exists():
            return self.save_note(content, filename, folder)

        # Append to existing file
        try:
            with open(filepath, 'a', encoding='utf-8') as f:
                f.write("\n\n---\n\n")
                f.write(content)
            print(f"✅ Content appended to: {filepath.name}")
            return filepath
        except Exception as e:
            raise Exception(f"Failed to append to note: {str(e)}")
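
The writer can also be driven on its own; a small sketch (the vault path is illustrative):

```python
from obsidian_writer import ObsidianWriter

writer = ObsidianWriter("~/Obsidian Vault")
# save_note() de-duplicates filenames (note.md, note_1.md, ...) automatically
path = writer.save_note("# Test Note\n\nHello from Content Extractor.", "test-note", folder="Content Extractor")
print(path)
```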
requirements.txt (new file, 23 lines)
@@ -0,0 +1,23 @@
# Content Extractor Dependencies

# Web scraping
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0

# YouTube
youtube-transcript-api>=0.6.0
pytube>=15.0.0

# Browser automation (for Instagram and dynamic content)
playwright>=1.40.0

# Text processing
markdown>=3.5.0

# Utilities
python-dotenv>=1.0.0
pydantic>=2.5.0

# Date handling
python-dateutil>=2.8.0