Student Review Scraping Architecture

Summary of the favcollege system for scraping, analyzing, and caching student reviews from Twitter and Reddit.

Key Takeaway

The architecture solves three problems: sourcing authentic student reviews at scale (scraping), understanding them (sentiment analysis + topic categorization), and serving them fast (denormalized caching). The denormalized cache pattern — storing aggregates on the parent table and syncing via a separate script — is a practical tradeoff between real-time accuracy and page load speed.

Sentiment Analysis Pipeline

Keyword-based using the sentiment npm library. Score normalized to -1..+1, then mapped to 1-5 stars: rating = round((sentiment + 1) * 2 + 1).

Thresholds: >0.2 positive (4-5 stars), -0.2..+0.2 neutral (3 stars), 0.2 negative (1-2 stars).

Topic Categorization

Seven categories detected via keyword matching: academics, campus life, career, value, facilities, diversity, location. Each review can have multiple categories.

Denormalized Cache Pattern

Reviews live in student_feedback table. College pages read from cached fields on colleges table (totalReviews, averageRating, sentimentScore). A sync script (npm run update-counts) recalculates aggregates and updates cached values. Must run after every scrape.

This avoids expensive COUNT/AVG aggregate queries on every page load while accepting eventual consistency.

Review Quality Filters

  • Minimum 100 characters
  • Must use first-person pronouns (I, my, me)
  • No retweets, simple replies, promotional content
  • Twitter: engagement metrics map to helpful_votes; 100+ engagement = is_top_comment
  • Reddit: karma and account age checks; verified student flairs prioritized

See also: source-college-scorecard-api, sentiment-analysis