datascience

Document Parser: A Python API for Document Processing

Explore a document parsing pipeline that processes various formats, extracts entities, and provides an API for efficient document handling.

Shipped January 2026

A document parsing pipeline designed to process various document types, tokenize and chunk content, extract entities, relationships, and citations, and store the results in a database. It leverages the Unstructured library and provides an API for document partitioning.

Features

  • Supports parsing of multiple document formats including PDFs, Word documents, HTML, emails, and images.
  • Tokenization and chunking of documents for downstream processing.
  • Extraction of entities, relationships, and citations from documents.
  • Integration with the Unstructured library for document partitioning.
  • API interface for document processing with support for form parameters.
  • Handles compressed files (gzip) and supports content type detection and validation.
  • Docker-compose setup for running the Unstructured service.

Tech Stack

  • Python 3.x
  • FastAPI for API implementation
  • Unstructured library for document partitioning
  • pypdf for PDF manipulation
  • Pandas for data handling
  • Docker for containerization
  • pytest for testing

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Docker and Docker Compose (for running the Unstructured service)

Installation

  1. Clone the repository:
git clone https://github.com/justin-napolitano/document-parser.git
cd document-parser/unstructured-api
  1. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt

Running the Unstructured Service

Start the Unstructured document parser service using Docker Compose:

docker-compose up -d

This will start the service on port 9000.

Running the API

Run the FastAPI app:

uvicorn prepline_general.api.app:app --host 0.0.0.0 --port 8000

The API documentation will be available at http://localhost:8000/general/docs.

Running Tests

Run the test suite with pytest:

pytest

Project Structure

.
├── docker-compose.yml          # Docker Compose config for Unstructured service
├── index.md                   # Project goals and overview
├── sample-docs/               # Sample documents for testing
├── scripts/                   # Utility scripts including smoketest
├── unstructured-api/          # Main API and processing code
│   ├── prepline_general/      # Core API modules and utilities
│   │   ├── api/               # FastAPI app, routers, models, utils
│   │   ├── filetypes.py       # File type detection and validation
│   │   ├── models/            # Pydantic models for form parameters
│   │   ├── utils.py           # Helpers for type parsing and conversion
│   │   ├── openapi.py         # Custom OpenAPI schema generation
│   │   ├── general.py         # API endpoints and processing logic
│   │   └── ...
│   ├── LICENSE.md             # Apache 2.0 License
│   ├── README.md              # Unstructured API announcement and info
│   ├── CHANGELOG.md           # Version history and changes
│   └── test_general/          # Tests for API and utilities
└── README.md                  # This file

Future Work / Roadmap

  • Add detailed support for more document types and complex layouts.
  • Improve entity and relationship extraction capabilities.
  • Enhance chunking strategies and support for multipage sections.
  • Add authentication and rate limiting to the API.
  • Provide hosted deployment options and scalability improvements.
  • Expand test coverage and add benchmarks.
  • Improve documentation with usage examples and tutorials.

Assumptions

  • Primary language is Python based on code and dependencies.
  • The project is a wrapper around the Unstructured library with added API and utilities.
  • Some details on usage and installation are inferred from typical FastAPI and Docker setups.

License

This project uses the Apache License 2.0 as indicated in the LICENSE.md file.

Need more context?

Want help adapting this playbook?

Send me the constraints and I'll annotate the relevant docs, share risks I see, and outline the first sprint so the work keeps moving.