Pet Projects in Data Science & Analytics

Interpreting impact of words in reviews on the ratings

Rotten tomatoes

 

In this assignment, I analyze movie reviews to determine whether movies are good or bad. I downloaded a large number of movie reviews from the Rotten Tomatoes website, along with a file, "movies.dat", that contains metadata for about 65,000 movies. Using Python and commonly used machine learning algorithms from the scikit-learn package, the goal is to build a classifier that labels a movie as Fresh or Rotten based on the text of its reviews.
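As a rough illustration of the kind of text classifier described above, here is a minimal bag-of-words Naive Bayes sketch in plain Python. It is a stand-in written with the standard library only, not the project's scikit-learn code; the toy reviews and the helper names `tokenize`, `train`, and `predict` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up reviews (assumption); the project itself used Rotten Tomatoes data.
TRAIN = [
    ("a moving and brilliant film", "fresh"),
    ("brilliant acting and a great story", "fresh"),
    ("dull plot and terrible acting", "rotten"),
    ("a boring and terrible mess", "rotten"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-posterior, using Laplace smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict("brilliant story", model))
print(predict("terrible boring plot", model))
```

The per-label word counts are also what makes such a model interpretable: words with a much higher smoothed probability under one label than the other are the ones pushing a review toward Fresh or Rotten.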

 

The second part of this project examines the correlation between predictor variables in the Boston data set used below and applies dimensionality reduction with PCA (principal component analysis) to improve prediction performance.
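To make the correlation and PCA steps concrete, the sketch below computes a covariance matrix, a Pearson correlation, and the first principal component for a tiny two-column dataset. It is an assumption-laden illustration: the numbers are made up, the helper names are hypothetical, and the leading component is found by power iteration in plain Python rather than with a library PCA routine.

```python
import math


def center(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    d = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=100):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up data with two strongly correlated columns (assumption).
DATA = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]

centered = center(DATA)
cov = covariance_matrix(centered)
correlation = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
pc1 = first_component(cov)

# Project each row onto the first component: two columns reduced to one.
scores = [sum(x * c for x, c in zip(row, pc1)) for row in centered]
print(correlation, pc1, scores)
```

When two predictors are this strongly correlated, a single component captures most of their joint variance, which is the motivation for reducing dimensions before fitting a model.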

 

The output of this project can be viewed here.

Supervised and unsupervised machine learning on Boston dataset

Boston

 

For this project, I used data from the UCI Machine Learning Repository, which obtained the dataset from the StatLib library maintained at Carnegie Mellon University.

 

The goal of this analysis is to implement some basic machine learning algorithms and measure the impact of several predictor variables on the median price of houses in Boston.

 

Using Python for machine learning, I evaluated the impact of 13 factors on the median value of a house in Boston. The output can be viewed here.
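One simple way to measure the impact of a single factor on house value is an ordinary least squares fit, sketched below in plain Python. The `rooms` and `price` values are made up for illustration, and the helper names `fit_line` and `r_squared` are hypothetical; the project's own models and results are not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up values (assumption): average rooms per house and price in $1000s.
rooms = [5.0, 6.0, 6.5, 7.0, 8.0]
price = [15.0, 21.0, 23.5, 27.0, 33.0]

slope, intercept = fit_line(rooms, price)
fit_quality = r_squared(rooms, price, slope, intercept)
print(slope, intercept, fit_quality)
```

The slope reads as the change in price for one extra unit of the predictor. With 13 predictors the same idea extends to multiple regression, where each coefficient is interpreted with the other factors held fixed.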

Scraping course-catalog data from IIT Kharagpur's website

IIT-Kharagpur

 

As part of this project, I scraped course-catalog data from the website of the Computer Science and Engineering department of the Indian Institute of Technology Kharagpur.

 

I did this in an IPython notebook using the Beautiful Soup library. The resulting data frame was written to a CSV file, which was hosted on AWS and then converted into a MySQL database on an AWS EC2 instance.

 

The code can be viewed here. I have also included a visualization there that shows how the number of courses varies with the number of prerequisites.
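The scrape-to-CSV step and the prerequisite counts described above can be sketched roughly as follows. This is a stand-in that uses only the Python standard library (`html.parser` and `csv`) instead of Beautiful Soup, and the sample HTML, course codes, and column layout are hypothetical, not taken from the actual catalog page.

```python
import csv
import io
from collections import Counter
from html.parser import HTMLParser

# Hypothetical catalog markup (assumption); the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Code</th><th>Title</th><th>Prerequisites</th></tr>
  <tr><td>CS101</td><td>Programming</td><td></td></tr>
  <tr><td>CS201</td><td>Algorithms</td><td>CS101</td></tr>
  <tr><td>CS301</td><td>Compilers</td><td>CS101, CS201</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def count_prereqs(cell):
    """Number of comma-separated prerequisites in a cell (0 if empty)."""
    return len([p for p in cell.split(",") if p.strip()])


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *courses = parser.rows

# Write the rows as CSV; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()

# How many courses have 0, 1, 2, ... prerequisites.
prereq_histogram = Counter(count_prereqs(row[2]) for row in courses)
print(csv_text)
print(sorted(prereq_histogram.items()))
```

Swapping the buffer for `open("courses.csv", "w", newline="")` would produce a file suitable for loading into a database table.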

Evaluating the impact of Progresa

Progresa

 

For this project, I used data from the Progresa program, a government social assistance program in Mexico. This program, as well as the details of its impact, is described in the paper "School subsidies for the poor: evaluating the Mexican Progresa poverty program", by Paul Schultz. The goal of this analysis is to implement some basic econometric techniques to measure the impact of Progresa on secondary school enrollment rates.

 

The timeline of the program was:

 

Using Python for data analysis, I evaluated the impact of the program on socio-economic outcomes of individuals in Mexico. The output can be viewed here.
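The text above does not say which econometric techniques were used. One common choice for evaluating a program like this is a difference-in-differences estimate, sketched below under that assumption. The records are made up, the "before"/"after" period labels are placeholders rather than the program's actual timeline, and the helper name `enrollment_rate` is hypothetical.

```python
from statistics import mean

# Made-up records (assumption): (in_treated_village, period, enrolled).
RECORDS = [
    (True, "before", 1), (True, "before", 0), (True, "before", 1), (True, "before", 0),
    (True, "after", 1), (True, "after", 1), (True, "after", 1), (True, "after", 0),
    (False, "before", 1), (False, "before", 0), (False, "before", 1), (False, "before", 0),
    (False, "after", 1), (False, "after", 0), (False, "after", 1), (False, "after", 0),
]


def enrollment_rate(records, treated, period):
    """Mean enrollment for one group in one period."""
    return mean(e for t, p, e in records if t == treated and p == period)


# Change over time in treated villages minus change in control villages.
treated_change = (
    enrollment_rate(RECORDS, True, "after") - enrollment_rate(RECORDS, True, "before")
)
control_change = (
    enrollment_rate(RECORDS, False, "after") - enrollment_rate(RECORDS, False, "before")
)
did_estimate = treated_change - control_change
print(did_estimate)
```

Subtracting the control group's change removes trends that affected everyone, so the remainder is attributed to the program, provided both groups would otherwise have followed parallel trends.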

Sortable barchart visualization

Sorting

 

For this assignment, I built a sortable bar chart visualization with the D3.js JavaScript library and hosted it on an AWS EC2 instance.

Statistical and exploratory data analysis from US states 2010 census data

US Map

 

The United States, a world superpower, has openly available data on its 50 states. Here we go back in time and look at a few visualizations based on data collected in the 1970s. The documentation of the 'state' data set can be read here.

 

I performed exploratory data analysis on this data using RStudio. The output can be viewed here.

San Francisco Crime Classification

Criminal

 

In an attempt to predict the category of crimes that occurred in the City by the Bay, I used a dataset retrieved from Kaggle: a collection of all crimes committed in San Francisco from 6 June 2003 to 13 May 2015.

 

I performed exploratory data analysis on this data using RStudio. The output can be viewed here.

Data analysis of flights from NYC in 2013

New York airport

 

New York City's airports are among the busiest in the world, with incoming and outgoing flights all year round. The R package 'nycflights13' contains information about all flights that departed from NYC in 2013. The documentation of the 'nycflights13' package can be read here.

 

I performed exploratory data analysis on this data using RStudio. The output can be viewed here.

Understanding top sorting algorithms

Java AWT Sorting

 

For this assignment, I used Java's AWT and Swing toolkits to graphically simulate five widely used sorting techniques: Quick Sort, Merge Sort, Insertion Sort, Bubble Sort and Selection Sort.
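A graphical simulation of this kind needs the intermediate states of each algorithm, not just the sorted result. The sketch below shows one way to expose those states for two of the five algorithms, written in Python rather than the project's Java; the generator names `bubble_sort_steps` and `insertion_sort_steps` are hypothetical and the drawing code is left out.

```python
def bubble_sort_steps(values):
    """Yield a snapshot of the list after every swap made by bubble sort."""
    items = list(values)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
                yield list(items)
        if not swapped:
            break  # no swaps in a full pass: already sorted


def insertion_sort_steps(values):
    """Yield a snapshot of the list after every swap made by insertion sort."""
    items = list(values)
    for i in range(1, len(items)):
        j = i
        # Move the new element left until it is in order.
        while j > 0 and items[j - 1] > items[j]:
            items[j - 1], items[j] = items[j], items[j - 1]
            j -= 1
            yield list(items)


# Each frame could be drawn as one step of the animation.
frames = list(bubble_sort_steps([3, 1, 2]))
for frame in frames:
    print(frame)
```

Using generators keeps the algorithm separate from the rendering loop: a GUI timer can pull one snapshot per tick and redraw the bars.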