Web Scraping Application for Real Estate Data Extraction
A general-purpose web scraping tool to extract and manage real estate data from various online sources using Python.
Shipped January 2026
A general-purpose web scraping application designed to extract, clean, and manage real estate and related data from various online sources, including Google Sheets and realtor websites.
Features
- Scrapes real estate agent and listing data from realtor.com and rew.ca.
- Integrates with Google Sheets and Google Drive APIs for data storage and management.
- Batch downloads and merges data from multiple sources.
- Cleans and filters scraped data for further analysis.
- Supports exporting data in CSV and JSON formats.
- Configurable via YAML files.
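To make the scraping feature concrete, here is a minimal sketch using requests and BeautifulSoup. The URL and CSS selectors are placeholders, not the actual ones used against rew.ca or realtor.com; the real extraction logic lives in the scraper modules listed under Project Structure.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selectors -- the real ones live in rew_scraper3.py
# and realtor_scraper_sheets_4.py.
LISTINGS_URL = "https://example.com/listings"

def scrape_listings(url):
    """Fetch a listings page and extract address/price pairs."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for card in soup.select("div.listing-card"):  # selector is an assumption
        address = card.select_one(".address")
        price = card.select_one(".price")
        rows.append({
            "address": address.get_text(strip=True) if address else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return rows

if __name__ == "__main__":
    print(scrape_listings(LISTINGS_URL))
```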
Tech Stack
- Python 3
- Libraries: BeautifulSoup, requests, pandas, google-api-python-client, google-auth, PyYAML
- Google Sheets and Drive APIs
Getting Started
Prerequisites
- Python 3.6+
- Google API credentials (`credentials.json` and `token.json`) for Sheets and Drive access; a sketch of the loading pattern follows.
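For reference, the standard google-auth pattern for loading a saved token and refreshing it when expired looks like the sketch below. The scopes are assumptions; the project's actual credential handling lives in `get_creds.py` and may differ.

```python
from google.oauth2.credentials import Credentials
from google.auth.transport.requests import Request

# Assumed scopes -- the real ones depend on what get_creds.py requests.
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]

def load_credentials(token_path="token.json"):
    """Load a saved OAuth token and refresh it if it has expired."""
    creds = Credentials.from_authorized_user_file(token_path, SCOPES)
    if creds.expired and creds.refresh_token:
        creds.refresh(Request())
        # Persist the refreshed token so the next run reuses it.
        with open(token_path, "w") as token_file:
            token_file.write(creds.to_json())
    return creds
```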
Installation
```bash
git clone https://github.com/justin-napolitano/project-web-scraping.git
cd project-web-scraping
pip install -r requirements.txt  # Assumed requirements file
```
Configuration
- Edit `config.yaml` to set directories, file names, and task flags.
- Place Google API credentials in the project root.
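Loading the configuration comes down to a single `yaml.safe_load` call with PyYAML. The key names below are illustrative only; the real schema is whatever `config.yaml` defines.

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys for illustration -- consult config.yaml for the real schema.
data_dir = config.get("data_dir", "./data")
tasks = config.get("tasks", {})
```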
Running
```bash
python main.py
```
This will load the configuration, initialize services, and run the scraping and data processing pipeline.
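Once credentials are available, initializing the Sheets service reduces to standard google-api-python-client calls. The sketch below reads a range from a spreadsheet; the spreadsheet ID and range are placeholders, and `goog_sheets.py` wraps calls of this kind.

```python
from googleapiclient.discovery import build
from google.oauth2.credentials import Credentials

# Placeholder spreadsheet ID and range -- goog_sheets.py wraps calls like this.
creds = Credentials.from_authorized_user_file("token.json")
service = build("sheets", "v4", credentials=creds)
result = (
    service.spreadsheets()
    .values()
    .get(spreadsheetId="YOUR_SPREADSHEET_ID", range="Sheet1!A1:Z100")
    .execute()
)
rows = result.get("values", [])
```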
Project Structure
- `main.py`: Entry point of the application.
- `program_skeleton.py`: Core workflow orchestrator managing tasks.
- `load_vars.py`: Loads and sets environment variables and config.
- `get_creds.py`: Handles Google API credential loading.
- `goog_sheets.py`: Google Sheets API interactions.
- `google_drive.py`: Google Drive API interactions.
- `batch_download.py`: Batch download logic for Google Sheets data.
- `readwrite.py`: Utilities for reading and writing data files.
- `clean_df.py`: Data cleaning functions.
- `df_filter.py`: Data filtering logic (partially obsolete).
- `merge.py`: Functions to merge CSV data files.
- `download.py`: Downloads PDFs from URLs.
- `fix_files.py`: Fixes file naming inconsistencies.
- `confirm_drcts.py`: Ensures the folder structure exists.
- `log.py`: Logging and garbage-collection utilities.
- `rew_scraper.py` and `rew_scraper3.py`: Scrapers for rew.ca.
- `realtor_scraper_sheets_4.py`: Scraper for realtor.com, integrated with Google Sheets.
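As an example of the data-handling side, merging downloaded CSVs (the job of `merge.py`) reduces to a pandas concat. The paths here are hypothetical; the module's actual locations come from `config.yaml`.

```python
import glob
import pandas as pd

# Hypothetical paths -- merge.py reads its actual locations from config.yaml.
frames = [pd.read_csv(path) for path in glob.glob("data/*.csv")]
if frames:
    # Combine all downloads and drop exact duplicate rows picked up
    # across overlapping sources.
    merged = pd.concat(frames, ignore_index=True).drop_duplicates()
    merged.to_csv("data/merged.csv", index=False)
```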
Future Work / Roadmap
- Refactor and unify scraping modules for better maintainability.
- Improve error handling and logging across all modules.
- Automate credential refresh and token management.
- Enhance configuration flexibility to support more data sources.
- Implement more robust data validation and deduplication.
- Add unit and integration tests.
- Document all functions and classes with detailed docstrings.
Note: Some modules and functions are partially implemented or marked as obsolete and may require cleanup or rework.