Lightweight Python API Wrapper for Apache Spark

A Python API wrapper for Apache Spark that simplifies data manipulation and session management with easy CSV loading.

Shipped January 2026

This lightweight wrapper provides a simple interface for instantiating Spark sessions and loading CSV data into Spark DataFrames, cutting down the setup code that routine data manipulation tasks otherwise require.

Features

  • Simplified Spark session management
  • Load CSV files as Spark DataFrames with header support

Tech Stack

  • Python
  • Apache Spark (PySpark)

Getting Started

Prerequisites

  • Python 3.6+
  • Apache Spark installed and configured

Installation

Clone the repository:

git clone https://github.com/justin-napolitano/project-spark-api.git
cd project-spark-api

Install PySpark (if not already installed):

pip install pyspark
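
To confirm that PySpark is importable, printing its version is a quick sanity check:

python -c "import pyspark; print(pyspark.__version__)"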

Usage

Example usage in Python:

from sparkAPI import SparkAPI

# Instantiate the wrapper, which sets up the underlying Spark session.
spark_api = SparkAPI()

# Load a CSV file into a Spark DataFrame and print its first rows
# (show() displays 20 rows by default).
df = spark_api.load_spark_data_from_csv('path/to/your/file.csv')
df.show()

Project Structure

project-spark-api/
├── sparkAPI.py       # Main API wrapper class for Spark session and data loading
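
For orientation, here is a minimal sketch of what sparkAPI.py could look like. The class and method names come from the usage example above; the SparkSession handling and the header=True option are assumptions based on the feature list, so the actual file may differ.

from pyspark.sql import SparkSession


class SparkAPI:
    """Thin wrapper that manages a SparkSession and loads CSV data."""

    def __init__(self):
        # Create a SparkSession, or reuse one if it already exists.
        self.spark = SparkSession.builder.getOrCreate()

    def load_spark_data_from_csv(self, path):
        # Read the CSV with the first row treated as column headers.
        return self.spark.read.csv(path, header=True)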

Future Work / Roadmap

  • Add support for additional data formats (e.g., JSON, Parquet)
  • Implement data transformation utilities
  • Enable configuration options for the Spark session (e.g., app name, master URL); see the sketch after this list
  • Add error handling and logging
  • Provide unit tests and examples
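
As a concrete example of the session-configuration item, the constructor could grow keyword arguments along these lines. The parameter names and defaults here are hypothetical, not part of the current code:

from pyspark.sql import SparkSession


class SparkAPI:
    def __init__(self, app_name="project-spark-api", master="local[*]"):
        # Hypothetical options: both defaults are illustrative only.
        # "local[*]" runs Spark locally using all available cores.
        self.spark = (
            SparkSession.builder
            .appName(app_name)
            .master(master)
            .getOrCreate()
        )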

Need more context?

Want help adapting this playbook?

Send me the constraints and I'll annotate the relevant docs, share risks I see, and outline the first sprint so the work keeps moving.