Cloud Cost Optimization Pipeline with PySpark, Dataproc & BigQuery
Key Highlights
Built using Apache Spark on Dataproc to handle large-scale cost data efficiently.
Automated using Cloud Composer (Airflow) for scheduled, reliable job orchestration.
Delivered processed outputs to Google BigQuery for seamless integration with BI tools like Looker Studio and Power BI.
Supports daily automated refresh, cost categorization, and monthly financial aggregation (a sketch of this transformation step follows the list).
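As a concrete illustration of the categorization and monthly aggregation step, here is a minimal PySpark sketch. The raw input path, column names (provider, service_name, usage_date, cost_usd), categorization rules, and Parquet output format are illustrative assumptions; only the final_agg output location and the overall flow come from this write-up.

```python
# Hypothetical excerpt in the spirit of transform_cost_data.py.
# Schema and category rules below are assumptions, not the actual ones.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("apptio-cost-transform").getOrCreate()

# Raw daily cost export landed in GCS (path and CSV format assumed).
raw = spark.read.option("header", True).csv("gs://dataproc-buck02/Apptio_Raw/*.csv")

categorized = (
    raw.withColumn("cost_usd", F.col("cost_usd").cast("double"))
       .withColumn(
           "cost_category",
           F.when(F.col("service_name").rlike("(?i)compute|vm"), "Compute")
            .when(F.col("service_name").rlike("(?i)storage|bucket"), "Storage")
            .when(F.col("service_name").rlike("(?i)sql|bigquery|database"), "Data & Analytics")
            .otherwise("Other"),
       )
       .withColumn("billing_month", F.date_format(F.col("usage_date"), "yyyy-MM"))
)

# Monthly financial aggregation per provider / service / category.
monthly = (
    categorized.groupBy("billing_month", "provider", "service_name", "cost_category")
               .agg(F.round(F.sum("cost_usd"), 2).alias("total_cost_usd"))
)

# Final aggregate written back to GCS for the downstream BigQuery load.
monthly.write.mode("overwrite").parquet(
    "gs://dataproc-buck02/Apptio_Transformed/final_agg/"
)
```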
Project Overview
Designed and deployed a fully automated, scalable ETL pipeline on Google Cloud to transform and analyze enterprise-level cloud cost data (Apptio-style). The solution helps finance and infrastructure teams monitor cloud expenditure, identify cost optimization opportunities, and make data-driven budgeting decisions.
Pipeline Artifacts
transform_cost_data.py (PySpark ETL script)
apptio_etl_pipeline.py (Airflow DAG; see the sketch after this list)
BigQuery table: apptio_dataset.monthly_cloud_costs
GCS output path: gs://dataproc-buck02/Apptio_Transformed/final_agg/
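The DAG below is a hedged sketch of how apptio_etl_pipeline.py could wire these pieces together on Cloud Composer: submit transform_cost_data.py to Dataproc, then load the aggregated output from GCS into BigQuery. The project ID, region, cluster name, script location, and file format are assumptions, not details taken from the actual DAG.

```python
# Hypothetical sketch of apptio_etl_pipeline.py; identifiers marked as
# assumptions are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

PROJECT_ID = "my-gcp-project"       # assumption
REGION = "us-central1"              # assumption
CLUSTER_NAME = "cost-etl-cluster"   # assumption

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        # transform_cost_data.py staged in GCS (exact location assumed)
        "main_python_file_uri": "gs://dataproc-buck02/scripts/transform_cost_data.py"
    },
}

with DAG(
    dag_id="apptio_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # daily automated refresh
    catchup=False,
) as dag:

    # Run the PySpark transformation on Dataproc.
    transform_costs = DataprocSubmitJobOperator(
        task_id="transform_costs",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # Load the aggregated output from GCS into BigQuery for BI tools.
    load_to_bigquery = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="dataproc-buck02",
        source_objects=["Apptio_Transformed/final_agg/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="apptio_dataset.monthly_cloud_costs",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    transform_costs >> load_to_bigquery
```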
Business Impact
Provides centralized visibility into cloud spending across providers and services (see the example query after this list).
Enables up-to-date financial reporting through the automated daily refresh.
Supports cost governance, forecasting, and waste reduction initiatives.
Streamlines collaboration between finance, DevOps, and infrastructure teams.
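For consumers that are not connecting through Looker Studio or Power BI, the published table can also be queried directly from Python. This is a small illustrative example; the column names follow the assumptions made in the PySpark sketch above and may differ from the real schema.

```python
# Illustrative consumption example: querying the published table from Python.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT billing_month, provider, cost_category,
           SUM(total_cost_usd) AS month_cost_usd
    FROM `apptio_dataset.monthly_cloud_costs`
    WHERE billing_month >= '2024-01'
    GROUP BY billing_month, provider, cost_category
    ORDER BY billing_month, month_cost_usd DESC
"""

# Print a simple per-provider, per-category monthly spend breakdown.
for row in client.query(query).result():
    print(row.billing_month, row.provider, row.cost_category, row.month_cost_usd)
```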