Skip to content

abhisheksunny/Data-Warehouse-using-Redshift

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Background Information

The music streaming startup, Sparkify, has grown their user base and song database and currently have their data in S3, in a directory of containing files of JSON logs of user activity on the app, and another directory with JSON metadata about the songs on their app.

Purpose

Converting song and log data files to properly formatted and segregated tables in a Star Schema that would allow easy and instant query over the data for fetching meaningful results and insights. AWS Redshift is used for data ingestion and ETL transformation.


Running Instruction

Update dwh.cfg file with Redshift cluster details.
Execute python create_tables.py for creating staging, fact and dimension tables.
Execute python etl.py for loading data to staging tables and then to remaining Fact and Dimension Tables.


Files

  • Dashboard.ipynb - Dashboard for running Ad-Hoc or Analytical queries after data injestion.
  • create_tables.py - Python Script for dropping and creating tables.
  • dwh.cfg - Configuration file, containing connection details and data location.
  • etl.py - Python Script for ingesting data.
  • sql_queries.py - All the PostgreSQL queries used in the project at one place.

Data Ingestion

Song File - The data within these JSON files is loaded into staging_songs without any filteration or modification and then this staging table is used, to populate songs and artists tables.

Log File - Just as song data, Log File is loaded from JSON Files onto staging_events and from this staging table all the other tables(users, songplays, time) are populated.


Dependencies

  • psycopg2
  • sconfigparser

Additional Design Information

Fact Table

songplays - records in log data associated with song plays

  • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users - users in the app

  • user_id, first_name, last_name, gender, level

songs - songs in music database

  • song_id, title, artist_id, year, duration

artists - artists in music database

  • artist_id, name, location, lattitude, longitude

time - timestamps of records in songplays broken down into specific units

  • start_time, hour, day, week, month, year, weekday

Stagging Tables

staging_events - Log data files

  • event_id, artist, auth, first_name, gender, item_session, last_name, length, level, location, method, page, registration, session_id, song, status, ts, user_agent, user_id

staging_songs - Song metadata data files

  • num_songs, artist_id, artist_latitude, artist_longitude, artist_location, artist_name, song_id, title, duration, year

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published