Loc Prodifier: Merging Data in Google BigQuery
A Python tool for merging staging data into BigQuery production tables while avoiding duplicates, with support for local and Cloud Run execution.
Shipped January 2026
Overview
Loc Prodifier is a Python tool designed to merge data from staging tables into production tables within Google BigQuery while preventing duplicate records. It supports both local execution and deployment on Google Cloud Run, enabling scalable and parallel processing of multiple tables using Google Cloud Workflows.
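At its core, this kind of deduplicating load is a BigQuery MERGE statement keyed on a unique identifier, so rows already present in production are skipped rather than re-inserted. Below is a minimal sketch of that pattern using the google-cloud-bigquery client; the id key column and the assumption of identical schemas are placeholders, since the actual merge logic lives in loc_prodifier.py.

from google.cloud import bigquery

def merge_staging_into_prod(client: bigquery.Client, dataset_id: str,
                            staging_table_id: str, prod_table_id: str) -> int:
    """Insert rows from staging into prod that are not already present.

    Assumes both tables share a schema and a unique `id` column; adjust the
    ON clause to match the real merge keys used by loc_prodifier.py.
    """
    sql = f"""
    MERGE `{client.project}.{dataset_id}.{prod_table_id}` AS prod
    USING `{client.project}.{dataset_id}.{staging_table_id}` AS staging
    ON prod.id = staging.id
    WHEN NOT MATCHED THEN
      INSERT ROW
    """
    job = client.query(sql)   # run the merge as a query job
    job.result()              # block until the job finishes
    return job.num_dml_affected_rows or 0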
Features
- Merges data from staging tables into production tables without duplicates.
- Supports parallel execution across multiple tables.
- Can run locally with custom credentials or be deployed on Google Cloud Run.
- Configurable via command-line arguments.
Tech Stack
- Python 3.7+
- Google Cloud BigQuery
- Google Cloud Run
- Google Cloud Workflows
- Docker
Getting Started
Prerequisites
- Python 3.7 or higher
- Google Cloud SDK
- Docker
- Google Cloud project with BigQuery, Cloud Run, and Artifact Registry enabled
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/loc_prodifier.git
cd loc_prodifier
- Install required Python packages:
pip install -r requirements.txt
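To confirm the dependencies installed and your default credentials resolve, a quick sanity check against the BigQuery client (assuming application-default credentials are already configured) looks like this:

from google.cloud import bigquery

client = bigquery.Client()  # picks up application-default credentials
print(f"Authenticated against project: {client.project}")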
Running Locally
Ensure your Google Cloud credentials JSON file is available, then run:
python loc_prodifier.py --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
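The --local flag suggests the script swaps application-default credentials for an explicit service-account file. The sketch below shows one plausible wiring of those command-line arguments to a client; the credentials path and argument handling are assumptions, not the script's actual code.

import argparse
from google.cloud import bigquery

parser = argparse.ArgumentParser(description="Merge a staging table into prod")
parser.add_argument("--dataset_id", required=True)
parser.add_argument("--staging_table_id", required=True)
parser.add_argument("--prod_table_id", required=True)
parser.add_argument("--local", action="store_true",
                    help="use a local service-account JSON instead of default credentials")
args = parser.parse_args()

if args.local:
    # Hypothetical filename; point this at your own credentials file.
    client = bigquery.Client.from_service_account_json("credentials.json")
else:
    client = bigquery.Client()

# Reuses the merge sketch from the Overview section.
merge_staging_into_prod(client, args.dataset_id, args.staging_table_id, args.prod_table_id)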
Running with Docker
- Build the Docker image:
docker build -t my-bigquery-script .
- Run the container:
docker run --rm my-bigquery-script --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
Deploying to Google Cloud Run
- Use the provided cloudbuild.yaml to build and push the Docker image to Artifact Registry.
- Deploy the Cloud Run job using the Cloud Build steps or manually with gcloud commands.
- Use workflow.yaml to orchestrate parallel merges via Google Cloud Workflows; a local stand-in for that fan-out is sketched below.
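Cloud Workflows handles the production fan-out across tables. For local testing, the same fan-out can be approximated with a thread pool, since each merge is an independent BigQuery job. This is only a local stand-in, not what workflow.yaml does on GCP, and the table pairs below are hypothetical.

from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

# Hypothetical (staging, prod) table pairs to merge in parallel.
TABLE_PAIRS = [
    ("staging_books", "prod_books"),
    ("staging_authors", "prod_authors"),
]

client = bigquery.Client()

def run_merge(pair):
    staging, prod = pair
    # merge_staging_into_prod is the sketch from the Overview section.
    return merge_staging_into_prod(client, "your_dataset_id", staging, prod)

with ThreadPoolExecutor(max_workers=4) as pool:
    for rows in pool.map(run_merge, TABLE_PAIRS):
        print(f"Merged {rows} new rows")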
Project Structure
loc_prodifier/
├── cloudbuild.yaml # Cloud Build configuration for building and deploying
├── Dockerfile # Docker image build instructions
├── gcputils/ # Google Cloud utility submodule (BigQuery, Storage, Logging, Secrets)
│ ├── BigQueryClient.py # BigQuery client wrapper
│ ├── gcpclient.py # Google Cloud Storage client
│ ├── GoogleCloudLogging.py # Cloud Logging client
│ ├── GoogleSecretManager.py # Secret Manager client
│ └── ...
├── loc_prodifier.py # Main script for merging tables
├── readme-prodifier.md # Original README content
├── requirements.txt # Python dependencies
└── workflow.yaml # Cloud Workflows definition for parallel execution
Future Work / Roadmap
- Add support for configurable merge conditions and update clauses.
- Enhance error handling and logging integration.
- Provide more detailed usage examples and automated tests.
- Expand support for additional data sources or cloud providers.
- Implement monitoring and alerting for workflow executions.
Note: Some documentation and features are inferred based on available code and files.