Thursday, May 8, 2025

Web Scraping and Data Extraction - Social Media Hashtag Monitor

 


Notes:

  • Problem Solved: Scrapes Twitter for real-time hashtag mentions.

  • Customization Benefits: Track campaigns, analyze sentiment, or discover influencers.

  • Further Adoption: Store in a database, analyze sentiment, or trigger alerts.

Python Code:


import snscrape.modules.twitter as sntwitter

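def get_tweets_by_hashtag(hashtag, max_tweets=50):
    """Return up to max_tweets tweets that mention the hashtag (given without '#')."""
    tweets = []
    for i, tweet in enumerate(sntwitter.TwitterHashtagScraper(hashtag).get_items()):
        if i >= max_tweets:
            break
        # Newer snscrape releases name the text field rawContent; older ones use content
        text = getattr(tweet, 'rawContent', None) or tweet.content
        tweets.append({'user': tweet.user.username, 'content': text})
    return tweets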

# Example usage
tweets = get_tweets_by_hashtag("AI")
for t in tweets[:5]:
    print(f"{t['user']}: {t['content'][:80]}")
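
The "store in a database" idea from the notes could be sketched with the standard-library sqlite3 module. This is only an illustration: the function name save_tweets, the table name hashtag_mentions, and the sample rows are assumptions, and the input is the list of dicts that get_tweets_by_hashtag returns.

```python
import sqlite3

def save_tweets(conn, hashtag, tweets):
    """Store tweet dicts ({'user': ..., 'content': ...}) and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashtag_mentions ("
        "hashtag TEXT NOT NULL, username TEXT NOT NULL, content TEXT NOT NULL, "
        "UNIQUE (hashtag, username, content))"
    )
    # INSERT OR IGNORE skips rows that an earlier run already stored
    conn.executemany(
        "INSERT OR IGNORE INTO hashtag_mentions (hashtag, username, content) "
        "VALUES (?, ?, ?)",
        [(hashtag, t['user'], t['content']) for t in tweets],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM hashtag_mentions").fetchone()[0]

# Example usage with made-up rows in an in-memory database
conn = sqlite3.connect(":memory:")
sample = [{'user': 'alice', 'content': 'Trying a new #AI tool'},
          {'user': 'bob', 'content': '#AI reading list'}]
print(save_tweets(conn, "AI", sample))
```

The UNIQUE constraint means the scraper can be re-run on a schedule without storing the same mention twice.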


Web Scraping and Data Extraction - PDF Invoice Parser

 


Notes:

  • Problem Solved: Extracts structured data (like totals, dates) from PDF invoices.

  • Customization Benefits: Works with invoice templates or billing automation systems.

  • Further Adoption: Connect to accounting software or ERP platforms.

Python Code:


import pdfplumber

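def extract_invoice_data(pdf_path):
    """Read 'Label: value' fields from the first page of a text-based PDF invoice."""
    with pdfplumber.open(pdf_path) as pdf:
        # Fall back to an empty string if the page yields no text (e.g. a scanned image)
        text = pdf.pages[0].extract_text() or ""
    data = {}
    for line in text.split('\n'):
        # Split on the first colon only, so values such as "10:30" stay intact
        label, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if "Invoice Number" in label:
            data['invoice_number'] = value
        elif "Total Amount" in label:
            data['total_amount'] = value
        elif "Date" in label and "Due" not in label:
            # Skip "Due Date" so it does not overwrite the invoice date
            data['date'] = value
    return data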

# Example usage
# print(extract_invoice_data("invoice_sample.pdf"))
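
Before the extracted fields go to accounting software, the strings usually need to become typed values. The sketch below shows one way to do that; normalize_invoice_fields is a hypothetical helper, and it assumes amounts use "." as the decimal separator and dates follow the ISO format in date_format.

```python
from datetime import datetime
from decimal import Decimal
import re

def normalize_invoice_fields(data, date_format="%Y-%m-%d"):
    """Convert the string fields from extract_invoice_data into typed values."""
    result = dict(data)
    if 'total_amount' in data:
        # Keep digits, decimal points, and minus signs; drop currency symbols and commas
        cleaned = re.sub(r"[^\d.\-]", "", data['total_amount'])
        result['total_amount'] = Decimal(cleaned)
    if 'date' in data:
        result['date'] = datetime.strptime(data['date'], date_format).date()
    return result

# Made-up values in the shape extract_invoice_data returns
sample = {'invoice_number': 'INV-001', 'total_amount': '$1,234.50', 'date': '2025-05-08'}
print(normalize_invoice_fields(sample))
```

Decimal avoids the rounding drift that float introduces in currency arithmetic. A value that does not parse raises an exception here, which is often preferable to silently storing a bad total.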

Web Scraping and Data Extraction - Real-Time News Extractor

 


Notes:

  • Problem Solved: Fetches the latest headlines from a news site's RSS feed on each run.

  • Customization Benefits: Filter by topic or sentiment, or push to dashboards.

  • Further Adoption: Use for trend analysis, sentiment detection, or alert systems.

Python Code:


import feedparser

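def get_news_rss(feed_url):
    """Return the title and link of each entry in an RSS or Atom feed."""
    feed = feedparser.parse(feed_url)
    # Entries behave like dicts; .get() avoids an AttributeError when a field is missing
    return [{'title': entry.get('title', ''), 'link': entry.get('link', '')} for entry in feed.entries]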

rss_url = "http://feeds.bbci.co.uk/news/rss.xml"
headlines = get_news_rss(rss_url)
for news in headlines[:5]:
    print(news['title'], "-", news['link'])
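
Filtering by topic, as the notes suggest, can start as a simple keyword match on the titles. The sketch below works on the list that get_news_rss returns; filter_headlines and the sample entries are illustrative assumptions.

```python
def filter_headlines(headlines, keywords):
    """Keep headlines whose title contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [h for h in headlines if any(k in h['title'].lower() for k in lowered)]

# Made-up entries in the shape get_news_rss returns
sample = [
    {'title': 'Markets rally as inflation cools', 'link': 'https://example.com/a'},
    {'title': 'Local team wins cup final', 'link': 'https://example.com/b'},
]
print(filter_headlines(sample, ['inflation', 'economy']))
```

Substring matching is crude (it would match "art" inside "party"), so a word-boundary regex or a classifier may be a better fit once the keyword list grows.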

Web Scraping and Data Extraction - Job Listing Aggregator

 


Notes:

  • Problem Solved: Extracts job postings from a job board's search results (Indeed in this example).

  • Customization Benefits: Filter by keywords, location, or salary.

  • Further Adoption: Feed into job boards, CRMs, or recruitment analytics platforms.

Python Code:


import requests
from bs4 import BeautifulSoup

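def scrape_indeed_jobs(query, location):
    base_url = "https://www.indeed.com/jobs"
    params = {"q": query, "l": location}
    # Send a browser-like User-Agent and a timeout; the site may still block automated requests
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(base_url, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    jobs = []
    for job_card in soup.select('.result'):
        title = job_card.select_one('h2.jobTitle')
        company = job_card.select_one('.companyName')
        # Skip cards that lack either element instead of failing on None
        if title is None or company is None:
            continue
        jobs.append({'title': title.get_text(strip=True), 'company': company.get_text(strip=True)})
    return jobs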

print(scrape_indeed_jobs("data analyst", "New York, NY"))
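
To aggregate results from several boards, the per-board lists need merging and de-duplicating. The sketch below assumes every scraper returns dicts with 'title' and 'company' keys, as scrape_indeed_jobs does; merge_job_listings and the sample listings are hypothetical.

```python
def merge_job_listings(*sources, keyword=None):
    """Combine job lists, drop duplicates by (title, company), optionally filter by keyword."""
    seen = set()
    merged = []
    for jobs in sources:
        for job in jobs:
            # Normalize case and whitespace so the same posting matches across boards
            key = (job['title'].strip().lower(), job['company'].strip().lower())
            if key in seen:
                continue
            if keyword and keyword.lower() not in job['title'].lower():
                continue
            seen.add(key)
            merged.append(job)
    return merged

# Made-up listings from two boards
board_a = [{'title': 'Data Analyst', 'company': 'Acme'},
           {'title': 'Data Engineer', 'company': 'Globex'}]
board_b = [{'title': 'data analyst', 'company': 'ACME'},
           {'title': 'Senior Data Analyst', 'company': 'Initech'}]
print(merge_job_listings(board_a, board_b, keyword='analyst'))
```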

Web Scraping and Data Extraction - E-commerce Price Tracker

 


Notes:

  • Problem Solved: Tracks product prices across e-commerce sites (e.g., Amazon, Flipkart).

  • Customization Benefits: Monitor competitors, automate pricing strategies, or trigger alerts.

  • Further Adoption: Integrate with BI tools, pricing engines, or push notifications.

Python Code:


import requests
from bs4 import BeautifulSoup

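def get_amazon_price(product_url, headers):
    response = requests.get(product_url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="productTitle")
    price = soup.find('span', {'class': 'a-offscreen'})
    # Either element can be missing (layout change, CAPTCHA page); report None for it
    return {
        'title': title.get_text(strip=True) if title is not None else None,
        'price': price.get_text(strip=True) if price is not None else None,
    }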

# Example usage
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.amazon.com/dp/B08N5WRWNW'  # Example product
print(get_amazon_price(url, headers))
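
The scraped price is a display string, so triggering alerts requires turning it into a number first. Below is a minimal sketch of that step; parse_price and should_alert are hypothetical helpers, and the parsing assumes prices written with "." as the decimal separator.

```python
from decimal import Decimal, InvalidOperation
import re

def parse_price(price_text):
    """Convert a price string such as '$1,299.99' to Decimal, or None if it cannot be parsed."""
    if not price_text:
        return None
    # Drop currency symbols, commas, and spaces
    cleaned = re.sub(r"[^\d.]", "", price_text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def should_alert(price_text, target_price):
    """Return True when the parsed price is at or below target_price."""
    price = parse_price(price_text)
    return price is not None and price <= Decimal(str(target_price))

print(should_alert('$89.00', 100))
```

Returning None for an unparseable price keeps a missing or changed page element from being mistaken for a price drop.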

Customer Relationship Management (CRM) - Voice of Customer Analyzer

 


Notes:

  • Problem Solved: Performs sentiment analysis on customer reviews or NPS responses.

  • Customization Benefits: Tailor sentiment thresholds or keywords per product.

  • Further Adoption: Feed results into product improvement or alerting systems.

Python Code:


from textblob import TextBlob
import pandas as pd

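def analyze_sentiment(text):
    # Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    return TextBlob(text).sentiment.polarity

feedback_df = pd.read_csv("customer_feedback.csv")  # Column: 'feedback'
# Empty cells load as NaN, which TextBlob rejects, so treat them as empty strings
feedback_df['feedback'] = feedback_df['feedback'].fillna("").astype(str)
feedback_df['sentiment_score'] = feedback_df['feedback'].apply(analyze_sentiment)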

# Categorize feedback
feedback_df['sentiment_label'] = feedback_df['sentiment_score'].apply(
    lambda x: 'Positive' if x > 0.1 else ('Negative' if x < -0.1 else 'Neutral')
)
print(feedback_df[['feedback', 'sentiment_label']])
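
Tailoring thresholds per product, as mentioned in the notes, could look like the sketch below. It replaces the fixed ±0.1 cutoffs with a lookup table; the product name and its cutoffs are made-up values, and label_sentiment is a hypothetical helper that takes a polarity score such as the one analyze_sentiment returns.

```python
# Default cutoffs match the lambda above: (positive cutoff, negative cutoff)
DEFAULT_THRESHOLDS = (0.1, -0.1)
# Hypothetical per-product overrides
PRODUCT_THRESHOLDS = {'premium_plan': (0.3, -0.05)}

def label_sentiment(score, product=None):
    """Map a polarity score to a label using the product's cutoffs, if any."""
    positive_cutoff, negative_cutoff = PRODUCT_THRESHOLDS.get(product, DEFAULT_THRESHOLDS)
    if score > positive_cutoff:
        return 'Positive'
    if score < negative_cutoff:
        return 'Negative'
    return 'Neutral'

print(label_sentiment(0.2), label_sentiment(0.2, 'premium_plan'))
```

With a 'product' column in the CSV, the function could be applied row by row in place of the lambda.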

Customer Relationship Management (CRM) - Sales Forecasting Tool

 


Notes:

  • Problem Solved: Predicts closed deals from the number of pipeline opportunities, using a linear fit to historical data.

  • Customization Benefits: Incorporate external data like seasonality or macroeconomic factors.

  • Further Adoption: Display results in BI dashboards or CRM widgets.

Python Code:


import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales_pipeline.csv")  # Columns: 'month', 'opportunities', 'closed_deals'
X = df[['opportunities']]
y = df['closed_deals']

model = LinearRegression()
model.fit(X, y)

# Forecast next month's sales
next_opps = pd.DataFrame({'opportunities': [150]})
forecast = model.predict(next_opps)
print(f"Predicted sales for next month: {forecast[0]:.2f}")
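
With a single feature, an ordinary least-squares fit reduces to a slope and an intercept that can be computed by hand. The sketch below is only meant to show what the fitted line represents, not to replace the scikit-learn model; fit_line is a hypothetical helper and the numbers are made up.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up history: opportunities per month and the deals closed from them
opportunities = [100, 120, 140, 160]
closed_deals = [20, 24, 28, 32]
slope, intercept = fit_line(opportunities, closed_deals)
print(f"Forecast for 150 opportunities: {slope * 150 + intercept:.2f}")
```

The slope reads as the expected number of additional closed deals per extra opportunity, which is a quick sanity check on any forecast.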

IoT (Internet of Things) Automation - Smart Energy Usage Tracker

Notes:

  • Problem Solved: Logs and analyzes power usage from smart meters.

  • Customization Benefits: Track per-device energy and set ale...