TodoFEC-parser

This project aligns with the TodoFEC initiative to create a standardized set of data tasks for comparing data processing frameworks. We parse the FEC data, store it in Parquet files, and upload those files to a public S3 bucket so that everyone can access the data easily.

Query FEC Data on S3 with DuckDB

The FEC data for this project is available as Parquet files in an S3 bucket, so you can query it directly with DuckDB without downloading anything first.

  1. Install duckdb
   pip install duckdb
  2. Open duckdb
   duckdb
  3. Run a query: use the following command to query the Parquet file directly from S3
   select count(*) from read_parquet('s3://datarecce-todofec/pac_summary_2024.parquet');
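
The same query can also be run from Python rather than the DuckDB shell. The following is a minimal sketch, not part of this repository: the helper names `build_count_query` and `count_rows` are hypothetical, and the explicit `httpfs` install/load is an assumption (recent DuckDB versions usually load that extension automatically for `s3://` paths).

```python
# Hypothetical helpers; running count_rows needs `pip install duckdb`
# and network access to the public bucket.
PARQUET_URI = "s3://datarecce-todofec/pac_summary_2024.parquet"


def build_count_query(uri: str) -> str:
    """Return the same count(*) query used in the shell example."""
    return f"select count(*) from read_parquet('{uri}');"


def count_rows(uri: str) -> int:
    """Run the count query through the DuckDB Python API."""
    import duckdb  # imported lazily so build_count_query works without it

    con = duckdb.connect()  # in-memory database
    # Assumption: httpfs is what serves s3:// paths; these two calls may be
    # redundant on DuckDB versions that autoload the extension.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    row = con.execute(build_count_query(uri)).fetchone()
    return row[0]
```

Calling `count_rows(PARQUET_URI)` should return the row count as an integer. If the request is rejected, the bucket region may need to be set explicitly (for example via DuckDB's `s3_region` setting); the correct region is not stated in this README.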

Here are the S3 URIs of the available datasets:

s3://datarecce-todofec/all_candidates_2024.parquet
s3://datarecce-todofec/candidate_master_2024.parquet
s3://datarecce-todofec/candidate_committee_linkage_2024.parquet
s3://datarecce-todofec/house_senate_2024.parquet
s3://datarecce-todofec/committee_master_2024.parquet
s3://datarecce-todofec/pac_summary_2024.parquet
s3://datarecce-todofec/contributions_from_committees_to_candidates_2024.parquet
s3://datarecce-todofec/operating_expenditures_2024.parquet
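
Because the URIs above follow one naming pattern (bucket, dataset name, year), they can be generated instead of copied by hand. This is a small sketch under that assumption; `dataset_uri` and `describe_query` are hypothetical names, and the generated `describe` statements are meant to be pasted into DuckDB to inspect each file's columns.

```python
BUCKET = "datarecce-todofec"

# Dataset names taken from the URI list above.
DATASETS_2024 = [
    "all_candidates",
    "candidate_master",
    "candidate_committee_linkage",
    "house_senate",
    "committee_master",
    "pac_summary",
    "contributions_from_committees_to_candidates",
    "operating_expenditures",
]


def dataset_uri(name: str, year: int = 2024) -> str:
    """Build the S3 URI for one dataset, following the pattern listed above."""
    return f"s3://{BUCKET}/{name}_{year}.parquet"


def describe_query(name: str, year: int = 2024) -> str:
    """Return a DuckDB statement that lists the columns of one dataset."""
    return f"describe select * from read_parquet('{dataset_uri(name, year)}');"


# Print one schema-inspection statement per dataset.
for dataset in DATASETS_2024:
    print(describe_query(dataset))
```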

System Prerequisites

Before you begin you'll need the following on your system:

  • Python >= 3.12 (see here)
  • Python Poetry >= 1.8 (see here)

Setup dependencies

Install the Python dependencies:

poetry install

Run the script

Once installation has completed you can start parsing data.

poetry run python main.py

The result

tree --du -h datarecce-todofec/
[804M]  datarecce-todofec/
├── [354M]  parquet
│   ├── [173K]  all_candidates_2020.parquet
│   ├── [164K]  all_candidates_2024.parquet
│   ├── [ 86K]  candidate_committee_linkage_2024.parquet
│   ├── [330K]  candidate_master_2024.parquet
│   ├── [885K]  committee_master_2024.parquet
│   ├── [ 21M]  contributions_from_committees_to_candidates_2020.parquet
│   ├── [ 14M]  contributions_from_committees_to_candidates_2024.parquet
│   ├── [118K]  house_senate_2024.parquet
│   ├── [ 36M]  operating_expenditures_2024.parquet
│   ├── [449K]  pac_summary_2024.parquet
│   └── [281M]  transactions_between_committees_2024.parquet
└── [450M]  raw
    └── [450M]  bulk-downloads
        ├── [ 28M]  2020
        │   ├── [ 28M]  pas220.zip
        │   └── [179K]  weball20.zip
        └── [422M]  2024
            ├── [ 91K]  ccl24.zip
            ├── [855K]  cm24.zip
            ├── [343K]  cn24.zip
            ├── [ 45M]  oppexp24.zip
            ├── [356M]  oth24.zip
            ├── [ 19M]  pas224.zip
            ├── [169K]  weball24.zip
            ├── [448K]  webk24.zip
            └── [119K]  webl24.zip
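
The listing suggests a one-to-one relationship between each raw bulk-download archive and a Parquet file. The sketch below spells out that relationship as inferred from the file names in the tree and the FEC bulk-data naming scheme; it is an illustration, not the logic in `main.py`, and `FEC_PREFIX_TO_DATASET` and `parquet_name` are hypothetical names.

```python
# Archive prefix -> dataset name, inferred from the tree output above.
FEC_PREFIX_TO_DATASET = {
    "weball": "all_candidates",
    "cn": "candidate_master",
    "ccl": "candidate_committee_linkage",
    "webl": "house_senate",
    "cm": "committee_master",
    "webk": "pac_summary",
    "pas2": "contributions_from_committees_to_candidates",
    "oppexp": "operating_expenditures",
    "oth": "transactions_between_committees",
}


def parquet_name(zip_name: str) -> str:
    """Map a bulk-download archive name to the Parquet file name it appears to produce."""
    if not zip_name.endswith(".zip"):
        raise ValueError(f"not a zip archive: {zip_name}")
    stem = zip_name[: -len(".zip")]
    # The last two characters are the two-digit election cycle, e.g. "24".
    prefix, cycle = stem[:-2], stem[-2:]
    if prefix not in FEC_PREFIX_TO_DATASET or not cycle.isdigit():
        raise ValueError(f"unrecognized archive name: {zip_name}")
    year = 2000 + int(cycle)
    return f"{FEC_PREFIX_TO_DATASET[prefix]}_{year}.parquet"
```

For example, `parquet_name("pas224.zip")` gives `contributions_from_committees_to_candidates_2024.parquet`, which matches the entry in the `parquet` directory above.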