
Codeforces Scraper

About

The Codeforces Scraper is an automated data extraction tool that scrapes competitive programming problem descriptions from the Codeforces platform. Once the raw HTML or text is extracted, the application uses the Google Gemini API to parse the unstructured data and convert it into a structured format such as JSON or Markdown. The structured output can then be indexed, stored, or migrated into custom databases or study platforms.

Tech Stack

Python
BeautifulSoup
Playwright
Google Gemini API
JSON
Pydantic

Features

Automated Problem Extraction

Fetches problem statements, input/output constraints, and test cases directly from Codeforces problem URLs.

AI-Powered Parsing

Uses the Gemini API to interpret the scraped text and separate the problem description from the constraints and examples.

Structured Output Generation

Converts raw, unstructured webpage data into machine-readable formats that downstream tools can consume directly.

Architecture

01

Scraping Layer

A web crawler navigates to the target Codeforces problem URL and extracts the raw DOM elements containing the problem statement.
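
As a rough illustration of the extraction step, the sketch below pulls the text out of a statement container in an already-fetched HTML string. It is not the project's code: it uses the standard library's html.parser as a stand-in for BeautifulSoup, it omits the page fetch (which the stack suggests is handled by Playwright), and the `problem-statement` class name and the sample markup are assumptions to verify against the live page. It also assumes well-formed markup in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Assumed container class; check it against the real page before relying on it.
TARGET_CLASS = "problem-statement"
# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class StatementExtractor(HTMLParser):
    """Collects text inside the first element whose class list contains TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the target element (0 = outside)
        self.done = False   # set once the target element has been closed
        self.chunks = []    # stripped text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if TARGET_CLASS in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_statement(html: str) -> str:
    """Return the statement text, one fragment per line."""
    parser = StatementExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


# Hypothetical markup standing in for a fetched problem page.
SAMPLE = """
<html><body>
<div class="menu">Home</div>
<div class="problem-statement">
  <div class="header"><div class="title">A. Example Problem</div></div>
  <div><p>Print the sum of two integers.</p></div>
</div>
<div class="footer">Footer text</div>
</body></html>
"""

statement_text = extract_statement(SAMPLE)
```

Navigation and footer text fall outside the target element and are dropped, which keeps the input to the next stage small.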

02

AI Processing Pipeline

The raw text is passed to the Gemini API with a prompt that instructs the model to identify and categorize specific fields, such as the description, constraints, and examples.
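
One way this stage could be organized is sketched below, under stated assumptions: the field names, the prompt wording, and the helper names (`build_prompt`, `parse_reply`, `structure_problem`) are all hypothetical, and the model call is injected as a plain callable so that no particular Gemini client signature is implied. The stub `fake_model` only imitates a reply wrapped in a Markdown code fence, which is a common shape for model output.

```python
import json
from typing import Callable

# Hypothetical field names; the project's actual schema is not documented here.
FIELDS = ("title", "statement", "input_spec", "output_spec", "samples")
FENCE = "`" * 3  # Markdown code fence marker, built indirectly to keep this listing clean


def build_prompt(raw_text: str) -> str:
    """Ask for a single JSON object with a fixed set of keys."""
    return (
        "Extract the following fields from the competitive programming problem "
        f"below and reply with one JSON object with exactly these keys: "
        f"{', '.join(FIELDS)}.\n"
        "Copy text verbatim and use null for anything that is missing.\n\n"
        f"PROBLEM:\n{raw_text}"
    )


def parse_reply(reply: str) -> dict:
    """Strip an optional Markdown code fence, then decode and check the JSON object."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (for example a fence followed by "json").
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present.
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def structure_problem(raw_text: str, call_model: Callable[[str], str]) -> dict:
    """Run prompt -> model -> parsed dict; call_model wraps the real API client."""
    return parse_reply(call_model(build_prompt(raw_text)))


def fake_model(prompt: str) -> str:
    """Stub used for local testing; returns a fenced JSON reply."""
    payload = {
        "title": "A. Example",
        "statement": "Add two numbers.",
        "input_spec": "Two integers.",
        "output_spec": "Their sum.",
        "samples": [{"input": "1 2", "output": "3"}],
    }
    return f"{FENCE}json\n{json.dumps(payload)}\n{FENCE}"


result = structure_problem("raw statement text", fake_model)
```

Injecting the model call keeps the prompt and parsing logic testable offline; in the real pipeline, `call_model` would wrap whichever Gemini client the project uses.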

03

Formatting & Output Layer

The response from Gemini is validated and serialized into a structured format, ready to be saved or pushed to a database.
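
A minimal sketch of validate-then-serialize follows. It uses standard-library dataclasses as a stand-in so that it runs without dependencies; the project lists Pydantic, which would express the same checks declaratively. The `Problem` and `Sample` types and their field names are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Sample:
    input: str
    output: str


@dataclass
class Problem:
    # Hypothetical schema; the project's real model may differ.
    title: str
    statement: str
    input_spec: str
    output_spec: str
    samples: list

    def __post_init__(self):
        # Required text fields must be non-empty strings.
        for name in ("title", "statement", "input_spec", "output_spec"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if not isinstance(self.samples, list):
            raise ValueError("samples must be a list")
        # Coerce raw dicts into Sample objects; a missing or unexpected key
        # raises TypeError from the Sample constructor.
        self.samples = [
            s if isinstance(s, Sample) else Sample(**s) for s in self.samples
        ]


def validate_and_serialize(data: dict) -> str:
    """Validate a parsed model reply and return it as pretty-printed JSON."""
    problem = Problem(**data)
    return json.dumps(asdict(problem), indent=2, ensure_ascii=False)


GOOD = {
    "title": "A. Example",
    "statement": "Decide whether the number is even.",
    "input_spec": "One integer n.",
    "output_spec": "YES or NO.",
    "samples": [{"input": "4", "output": "YES"}],
}

serialized = validate_and_serialize(GOOD)
```

Rejecting a malformed reply at this point means that a bad model response fails loudly before anything is written to storage.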

Future Improvements

Batch Scraping

Implement a queue system to scrape and structure all problems from a specific contest in a single run.

Test Case Auto-Runner Generation

Expand the AI prompt to automatically generate boilerplate test-runner code for instant local testing.

Markdown/PDF Export

Add export options that render the structured JSON data as Markdown files or PDF documents.
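
The Markdown half of this could start from something like the sketch below. The dictionary keys and the heading layout are assumptions chosen for illustration, and PDF export is left out because it would need an extra rendering library.

```python
FENCE = "`" * 3  # Markdown code fence marker


def to_markdown(problem: dict) -> str:
    """Render a structured problem (hypothetical keys) as a Markdown document."""
    lines = [
        f"# {problem['title']}", "",
        problem["statement"], "",
        "## Input", "", problem["input_spec"], "",
        "## Output", "", problem["output_spec"], "",
    ]
    # Each sample gets its own section with fenced input and output blocks.
    for number, sample in enumerate(problem.get("samples", []), start=1):
        lines += [
            f"## Example {number}", "",
            "Input:", "", FENCE, sample["input"], FENCE, "",
            "Output:", "", FENCE, sample["output"], FENCE, "",
        ]
    return "\n".join(lines).rstrip() + "\n"


EXAMPLE = {
    "title": "A. Example",
    "statement": "Print the sum of two integers.",
    "input_spec": "Two integers a and b.",
    "output_spec": "One integer, a + b.",
    "samples": [{"input": "1 2", "output": "3"}],
}

markdown = to_markdown(EXAMPLE)
```

Writing `markdown` to a `.md` file is then a single call to `pathlib.Path.write_text`.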