This Capstone Project required me to manage an ETL process for a Loan Application dataset and a Credit Card dataset using the following technologies: Python (Pandas and advanced modules such as Matplotlib), MariaDB, Apache Spark (Spark Core, Spark SQL), and Python visualization and analytics libraries.
Workflow Diagram.
1. Load Credit Card Database
For this application, I wrote a Python and PySpark SQL program that reads (extracts) the data from the existing JSON files and loads it into the SQL database according to the specifications in the mapping document.
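The extract-and-load step can be sketched as below. This is a minimal illustration, not the project's actual code: the database name `example_db`, the MySQL-compatible JDBC URL and driver class, and the function names are all assumptions, and the column transformations required by the mapping document are omitted. The JDBC driver JAR is assumed to be on Spark's classpath.

```python
def jdbc_options(table, user, password,
                 host="localhost", port=3306, database="example_db"):
    """Build the option dict for Spark's JDBC writer.

    The URL scheme and driver class are assumptions (MySQL-compatible
    connector talking to MariaDB); adjust them to the driver you installed.
    """
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def load_json_to_table(spark, json_path, table, user, password, multiline=False):
    """Read a JSON file into a Spark DataFrame and append it to a table.

    `spark` is an existing SparkSession. Set `multiline=True` when the file
    holds one JSON array spanning several lines rather than one record per line.
    """
    df = spark.read.option("multiLine", multiline).json(json_path)
    # Any renaming or reformatting from the mapping document would go here.
    (df.write.format("jdbc")
       .options(**jdbc_options(table, user, password))
       .mode("append")
       .save())
    return df
```

With a session created through `SparkSession.builder.appName("load").getOrCreate()`, a call such as `load_json_to_table(spark, "customers.json", "customers", "user", "secret")` would load one file; the file and table names here are placeholders.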
2. Application Front-End
This application displays the data that the previous application loaded into the database. It also prompts the user for input, which determines what data is selected or modified.
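A prompt loop of this kind might look like the sketch below. The source does not say which inputs the front end asks for, so the month prompt, the `ask` helper, and the retry limit are purely illustrative assumptions.

```python
def parse_month(raw):
    """Convert user text to a month number, rejecting out-of-range values."""
    month = int(raw)  # raises ValueError for non-numeric text
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return month


def ask(prompt, parse, input_fn=input, retries=3):
    """Prompt until `parse` accepts the answer or the retries run out.

    `input_fn` is injectable so the loop can be exercised without a terminal.
    """
    for _ in range(retries):
        raw = input_fn(prompt).strip()
        try:
            return parse(raw)
        except ValueError as err:
            print(f"Invalid input: {err}")
    raise ValueError("too many invalid attempts")
```

The validated value can then be used to filter the DataFrame before it is shown, which keeps malformed input out of the query.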
3. Data Analysis and Visualization
In addition to querying the database and displaying or updating the data using Spark DataFrames, this application visualizes the data as plots using Matplotlib.
Number of transactions per category from the highest to the lowest.
Number of Customers by State.
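A plot such as the transactions-per-category chart could be produced roughly as follows. This is a sketch under assumptions: the records are taken to be plain dictionaries already pulled from the database, and the key name `category` and both function names are hypothetical.

```python
from collections import Counter


def count_by(records, key):
    """Return (value, count) pairs sorted from the highest count to the lowest."""
    return Counter(record[key] for record in records).most_common()


def plot_counts(pairs, title, out_path):
    """Draw the counts as a bar chart and save it to `out_path`."""
    # Imported here so the counting helper works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    labels = [label for label, _ in pairs]
    counts = [count for _, count in pairs]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(labels, counts)
    ax.set_title(title)
    ax.set_ylabel("Count")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

The same two helpers cover the customers-by-state plot by counting on a state key instead.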
4. Loan Application Dataset
This application sends an HTTP request to an API endpoint, receives a JSON response, and loads the data into the SQL database.
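The request step can be sketched with the standard library alone. The endpoint URL is not given in this document, so any URL passed in is a placeholder, and the `fetch_json` name and injectable `opener` parameter are assumptions made for illustration.

```python
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen, timeout=10):
    """Send a GET request and return (status code, decoded JSON payload).

    `opener` defaults to urllib's `urlopen`; it is a parameter so the
    function can be tried against a stub instead of a live endpoint.
    """
    with opener(url, timeout=timeout) as response:
        status = getattr(response, "status", None)
        payload = json.loads(response.read().decode("utf-8"))
    return status, payload
```

The returned records can then be written to the database in the same way as the credit card data.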
5. Data Analysis and Visualization for Loan Application
This application queries the data that the previous application loaded into the database. It also visualizes the data in the form of plots.
Percentage of rejection for married male applicants.
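One way to compute this figure is shown below, reading it as the share of married male applicants whose application was rejected. The field names (`Gender`, `Married`, `Application_Status`) and the value codes (`"Male"`, `"Yes"`, `"N"`) are assumptions about the dataset, not taken from this document.

```python
def rejection_rate(records, gender="Male", married="Yes", rejected_code="N"):
    """Percentage of applicants in the selected group who were rejected.

    Returns 0.0 when the group is empty to avoid dividing by zero.
    """
    group = [r for r in records
             if r["Gender"] == gender and r["Married"] == married]
    if not group:
        return 0.0
    rejected = sum(1 for r in group if r["Application_Status"] == rejected_code)
    return 100.0 * rejected / len(group)
```

The approved remainder (100 minus this value) gives the second slice if the result is drawn as a pie chart.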