
Java Data Ingestion from Google Cloud to PostgreSQL

A Java-based workflow for ingesting JSON data from Google Cloud Storage into PostgreSQL, with error handling for data integrity.

Shipped January 2026

A Java-based data ingestion workflow designed to download JSON data from a Google Cloud Storage bucket, parse it, and insert it into a PostgreSQL database. It handles unique constraint violations gracefully to maintain data integrity.


Features

  • Connects to Google Cloud Storage to list and download JSON files.
  • Parses JSON data and processes various entities such as Items, Resources, Contributors, Call Numbers, and Subjects.
  • Inserts parsed data into PostgreSQL tables with error handling for unique constraint violations.
  • Modular processors for different data components to maintain clean separation of concerns.
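The unique-constraint handling mentioned above can take two forms in JDBC: catching the violation after the fact, or telling PostgreSQL to ignore duplicates up front. The sketch below shows both; the class, table, and column names are illustrative, not the repository's actual API.

```java
import java.sql.SQLException;

public class ConstraintHandling {
    // PostgreSQL signals a unique-constraint violation with SQLState 23505.
    static final String UNIQUE_VIOLATION = "23505";

    // Reactive approach: inspect the SQLState after an INSERT fails.
    static boolean isUniqueViolation(SQLException e) {
        return UNIQUE_VIOLATION.equals(e.getSQLState());
    }

    // Proactive approach: let PostgreSQL skip duplicate rows entirely.
    // Table and columns here are hypothetical.
    static final String UPSERT_SQL =
        "INSERT INTO items (id, title) VALUES (?, ?) ON CONFLICT (id) DO NOTHING";
}
```

Checking the SQLState rather than the exception message keeps the duplicate check independent of driver wording.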

Tech Stack

  • Java 11
  • Maven for build and dependency management
  • PostgreSQL as the relational database
  • Google Cloud Storage as the data source
  • JSON processing with org.json

Getting Started

Prerequisites

  • Java 11 or higher installed
  • Maven installed
  • PostgreSQL running locally or accessible
  • Google Cloud Storage bucket with JSON files
  • Service account key JSON file for GCS authentication
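If you prefer Application Default Credentials over a hardcoded path in the code, the standard GOOGLE_APPLICATION_CREDENTIALS variable can point at the key file instead (the path below is illustrative):

```shell
export GOOGLE_APPLICATION_CREDENTIALS="resources/secret.json"
```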

Installation

  1. Clone the repository:

git clone https://github.com/justin-napolitano/sup-court-data-ingestion.git
cd sup-court-data-ingestion

  2. Update the database connection parameters and the Google Cloud credentials path in DataIngestionMain.java.

  3. Build the project using Maven:

mvn clean package
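The connection parameters you edit in DataIngestionMain.java likely boil down to a PostgreSQL JDBC URL. A sketch of how those pieces might be assembled, with made-up field names and defaults:

```java
// Illustrative connection settings; the actual fields in
// DataIngestionMain.java may be named and structured differently.
public class DbConfig {
    static final String HOST = "localhost";
    static final int PORT = 5432;          // default PostgreSQL port
    static final String DATABASE = "ingestion";

    // Build the standard PostgreSQL JDBC URL from its parts.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }
}
```

A URL in this shape is what `DriverManager.getConnection` expects, alongside the username and password.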

Running

Run the main class using Maven exec plugin:

mvn exec:java -Dexec.mainClass="com.data_ingestion.DataIngestionMain"

Project Structure

sup-court-data-ingestion/
├── pom.xml
├── readme.md
├── resources/
│   └── secret.json  # Google Cloud service account key
├── src/
│   ├── main/
│   │   ├── java/
│   │   │   └── com/data_ingestion/
│   │   │       ├── CallNumbersProcessor.java
│   │   │       ├── ContributorsProcessor.java
│   │   │       ├── DataIngestionClient.java
│   │   │       ├── DataIngestionMain.java
│   │   │       ├── GCSClient.java
│   │   │       ├── ItemsProcessor.java
│   │   │       ├── ResourcesProcessor.java
│   │   │       └── SubjectsProcessor.java
│   └── test/
│       └── java/
│           └── com/example/AppTest.java
└── target/  # Maven build output
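The per-entity processors above (Items, Resources, Contributors, Call Numbers, Subjects) suggest a common shape, even if the repository's classes don't literally share an interface. A hypothetical sketch of that separation of concerns:

```java
import java.util.List;

// Hypothetical common contract for the *Processor classes: each processor
// knows its target table and turns a record's raw JSON into rows to insert.
interface EntityProcessor {
    String tableName();
    List<String> process(String rawJson);
}

// A trivial processor in the same spirit as SubjectsProcessor: it just
// normalizes the input. Real processors would extract fields from the JSON.
class SubjectsLikeProcessor implements EntityProcessor {
    public String tableName() { return "subjects"; }
    public List<String> process(String rawJson) {
        return List.of(rawJson.trim());
    }
}
```

Keeping each entity behind its own processor means a schema change to one table touches one class, which matches the feature list's "clean separation of concerns."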

Future Work / Roadmap

  • Add comprehensive unit and integration tests for processors and clients.
  • Implement configuration management to externalize DB and GCS credentials.
  • Enhance error handling and logging with a structured logging framework.
  • Support incremental data ingestion and data update scenarios.
  • Containerize the application for easier deployment.
  • Add support for parallel processing to improve ingestion speed.
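The parallel-processing roadmap item could be approached with a fixed-size thread pool, one task per file. This is only a sketch under that assumption; `ingestOne` stands in for the real download-parse-insert step and just returns a row count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIngest {
    // Ingest every file on a bounded pool and return the total rows inserted.
    static int ingestAll(List<String> files, int threads) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (String file : files) {
            tasks.add(() -> ingestOne(file));
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int total = 0;
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();   // rethrows any per-file failure
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder for the real per-file download/parse/insert work.
    static int ingestOne(String file) {
        return 1;
    }
}
```

Bounding the pool size matters here: each worker holds a database connection, so the thread count should not exceed what PostgreSQL (or a connection pool) can serve.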

For any questions or contributions, please open an issue or submit a pull request.
