Exploratory Data Analysis: Prosper Loan Dataset

by Tyler Julian Last Updated: 4/05/2018

========================================================

Abstract

In finance, everyone is looking for the best deal. Best interest rate, best return, low risk and high reward… as both borrowers and lenders, we want it all. This primarily takes form in the interest rate of a loan. Unfortunately, the rate that borrowers want and the rate that lenders want do not always line up. Even harder is understanding the factors that determine the interest rate to begin with. For borrowers, credit score is the best-known factor, but what else affects a loan? The amount of the loan? Where you live? How much income you have? This project explores some of these trends in a real-life dataset and aims to reveal the biggest factors that influence a borrower’s interest rate.

Data Source

This dataset contains loan data from Prosper, an online marketplace that brings investors and borrowers together to fund small to medium sized loans. Borrowers can get loans up to $35,000, with the interest rates set by Prosper. Lenders get to choose what loans they want to fund, with higher risk loans providing a higher return.

The dataset can be downloaded from this project’s GitHub repository.

Initial Exploration

A logical start to understanding this dataset would be to look at its shape and variables.

## [1] 113937     81

The dataset contains 113,937 observations of 81 variables.

Let’s look at a sample of those variables:

##  [1] "ListingKey"          "ListingNumber"       "ListingCreationDate"
##  [4] "CreditGrade"         "Term"                "LoanStatus"         
##  [7] "ClosedDate"          "BorrowerAPR"         "BorrowerRate"       
## [10] "LenderYield"

There are a multitude of variables in this dataset, some of whose meanings may not be intuitive at first glance. This reference, adapted from the Prosper API, provides context for each of the variables.

Univariate Exploration

Borrower Interest Rate

One of the most important variables for both a borrower and a lender is the interest rate of the loan. The interest rate is ultimately set by Prosper and is determined by a variety of factors. This is the variable of most interest in this exploration.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The interest rate distribution seems to be slightly right skewed, with most values falling between 14% and 25%. There is an abnormal spike of interest rates occurring at 32%. This could be an artifact of the bin width. Let’s graph interest rate again, but with smaller bin widths to investigate:

Unfortunately, it seems bin width is not the answer. The 32% rate still has an extremely high count. At the moment, there appears to be no explanation for why this rate occurs so frequently. One option would be to take the subset of borrowers with a 32% interest rate and compare them with the entire population. However, that is beyond the scope of this analysis, as there are other variables to explore.

Average Credit Score

As one of the most well known indicators of an individual’s financial credibility, Credit Score is the most basic requirement for getting a loan.

This dataset does not provide a single credit score, but rather a range. The two variables, CreditScoreRangeLower and CreditScoreRangeUpper, store the lower and upper bounds of a borrower’s credit score. To visualize the data properly, the two bounds need to be combined into an average. The newly created variable will be named CreditScoreAverage.
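The averaging step can be sketched in R as follows. This is a minimal illustration, not necessarily the exact code used for this report: the data frame `loans_toy` and its values are invented stand-ins, and only the three column names come from the analysis.

```r
# Toy stand-in for the loans data; values are invented for illustration.
loans_toy <- data.frame(
  CreditScoreRangeLower = c(640, 680, 720, NA),
  CreditScoreRangeUpper = c(659, 699, 739, NA)
)

# Average the lower and upper bounds row by row.
# Rows with a missing bound stay NA rather than being silently dropped.
loans_toy$CreditScoreAverage <-
  (loans_toy$CreditScoreRangeLower + loans_toy$CreditScoreRangeUpper) / 2

summary(loans_toy$CreditScoreAverage)
```

Because the arithmetic is vectorized, the same line works unchanged on the full dataset, and missing ranges show up as NA’s in the summary.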

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0   670.0   690.0   695.6   730.0   890.0     591

With a mean of 695.6 and a median of 690, the measures of central tendency are consistent with each other. There are 591 NA’s in the data, and the 1st and 3rd quartiles are within expectations. However, the minimum value is far from the rest of the data. The value itself also doesn’t make sense, as the lowest possible FICO credit score is 300. Let’s plot the data to get a better understanding of the variable’s distribution.

There seems to be a handful of credit scores close to zero. Because these scores are abysmally low and are nowhere near the rest of the distribution, they are treated as outliers. With this new knowledge, the data is plotted again with smaller bin widths, better breaks, and without the outliers.

After those adjustments, the data now has a semi-bell curve. From 680 to 900, the shape of the plot is almost perfect in terms of being normal. From 440 to 680, it has more deviation, but still follows a normal shape.

This plot follows the initial intuition about credit score. It makes sense that most people fall right in the middle of the ranges, with a majority of the scores floating around the average score of 700. You have a smaller subset of people that maybe haven’t been up to date on payments or have defaulted, leading to very poor scores. There is also another subset of people who are maybe very diligent about their payments, resulting in high scores.

Income

Many lenders also look at a borrower’s income when determining the risk associated with a borrower. The intuition is that a borrower with more disposable income should be more likely to pay their monthly payments on a loan. This makes them a less risky investment.

Income for borrowers is stored in the variable IncomeRange:

##  Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...

str(IncomeRange) shows that income is factored into levels:

## [1] "$0"             "$1-24,999"      "$100,000+"      "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed"  "Not employed"

Two things to note:

The levels are not ordered.
Some of the levels can be combined.

For this analysis, the levels need to be ordered from least to greatest. This will provide a much more organized plot and move the data type from nominal to ordinal. The levels Not employed and $0, while technically having different meanings, both indicate that the borrower does not have a primary job that provides income. As a result, these two levels will be combined to reduce the complexity of the variable.

## [1] "Not displayed"  "$0"             "$1-24,999"      "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "$100,000+"
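One way to get from the original levels to the ordered ones is sketched below. It is a hedged illustration on an invented vector (`income` and `income_order` are names chosen for this example), using only base R.

```r
# Toy stand-in for the IncomeRange column; values are invented for illustration.
income <- factor(c("$25,000-49,999", "Not employed", "$0", "$100,000+",
                   "$1-24,999", "Not displayed", "$50,000-74,999",
                   "$75,000-99,999"))

# Merge "Not employed" into "$0" by renaming the level; levels that end up
# with the same name are combined automatically.
levels(income)[levels(income) == "Not employed"] <- "$0"

# Rebuild the factor with an explicit least-to-greatest order.
income_order <- c("Not displayed", "$0", "$1-24,999", "$25,000-49,999",
                  "$50,000-74,999", "$75,000-99,999", "$100,000+")
income <- factor(income, levels = income_order, ordered = TRUE)

levels(income)
table(income)
```

Setting `ordered = TRUE` is what moves the variable from nominal to ordinal; plotting functions then follow the level order rather than alphabetical order.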

Now that the levels have been properly transformed, the non-NA data can be visualized as a bar chart:

## 
##  Not displayed             $0      $1-24,999 $25,000-49,999 $50,000-74,999 
##           7741           1427           7274          32192          31050 
## $75,000-99,999      $100,000+ 
##          16916          17337

The chart shows that most borrowers have income that falls within two bins: $25,000-49,999 and $50,000-74,999. There are more borrowers that fall above these ranges than below them. This might suggest that borrowers with higher reported incomes are more likely to take out a loan than those with lower reported incomes.

Stated Monthly Income

Monthly income is very similar to income range. The biggest and most important difference is that this variable is quantitative rather than qualitative.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

There is a very large outlier in the dataset, with one individual claiming to bring in $1.75 million every month. The top 1% of the data will be removed so the distribution can be viewed more clearly.
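Trimming the top 1% can be done with a quantile cutoff. The sketch below uses an invented vector, `stated_income`, in place of StatedMonthlyIncome; the approach (filter before plotting) is an assumption, and limiting the plot’s x-axis to the same cutoff would give a similar picture.

```r
# Toy stand-in for StatedMonthlyIncome: 99 ordinary values plus one extreme one.
# The numbers are invented for illustration.
stated_income <- c(seq(1000, 10800, by = 100), 1750000)

# 99th percentile cutoff; values above it are the "top 1%".
cutoff <- quantile(stated_income, probs = 0.99, na.rm = TRUE)

# Keep everything at or below the cutoff for plotting.
trimmed <- stated_income[stated_income <= cutoff]

max(stated_income)
max(trimmed)
length(stated_income) - length(trimmed)
```

Note that the cutoff is computed from the data itself, so a single extreme value is dropped without having to pick a dollar threshold by hand.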

The distribution is skewed right with most individuals making about $4,750 a month, or about $57,000 a year. This closely matches the results from the IncomeRange variable. A majority of borrowers make less than $7,500 a month, while some borrowers make much more.

Debt to Income Ratio

Debt to income ratio is exactly what it sounds like: a borrower’s debt divided by their income at the time the credit profile was pulled. A higher debt to income ratio might suggest that an individual is a riskier investment, since they are already burdened with more debt relative to their income.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

A quick look at the summary of the variable shows that the median ratio is 0.22, while the mean is a bit higher at 0.276. The maximum is incredibly high, showing a debt to income ratio of over 1000%. The variable reference page shows that any ratio higher than 10.01 is capped at that number.

In order to deal with these potential outliers, the top 1% of values are excluded from the resulting plot:

The distribution of debt to income seems to be slightly right skewed with a long tail. There are small peaks at all of the 0.5 breaks, which might suggest that the data was rounded upon retrieval. The right skew would explain the higher mean than the median. It makes intuitive sense that the majority of individuals have a low debt to income ratio on average, with the count decreasing as the debt to income ratio climbs higher. It becomes increasingly difficult for an individual to pay their expenses as the debt to income ratio increases due to interest accrued on the debt.

Term Length

Term length is the length of time the loan has until it has to be paid off. A longer term length means the borrowers have to pay less per month, but perhaps more interest over the life of the loan.

## [1] 36 60 12

Looking at only the unique values of Term, the only term lengths offered by Prosper are 12, 36, and 60. The API shows that this variable is measured in months.

The Term variable currently has no levels, but can be easily factored since there are so few unique values.

## 
##    12    36    60 
##  1614 87778 24545

It appears that the vast majority of loans have a term length of 36, accounting for 77% of all loans. Term lengths of 60 make up most of the remaining loans, with 12 month terms only accounting for 1.4% of all loans.
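The percentages can be checked directly from the counts in the table above. The sketch below rebuilds a Term vector from those three counts (the variable names are chosen for this example), converts it to a factor, and computes each term’s share.

```r
# Rebuild the Term column from the counts reported in the table above.
term <- rep(c(12, 36, 60), times = c(1614, 87778, 24545))

# Treat the three term lengths as ordered categories rather than numbers.
term <- factor(term, levels = c(12, 36, 60), ordered = TRUE)

# Counts per term, then each term's share of all loans in percent.
term_counts <- table(term)
term_share <- round(100 * prop.table(term_counts), 1)
term_share
```

The shares come out to roughly 1.4%, 77.0%, and 21.5% for the 12, 36, and 60 month terms respectively.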

Estimated Return

For lenders, the return they receive on their investment is a very important factor when choosing which loans to fund. Prosper has a variable for this value, EstimatedReturn, displayed as a percentage. This variable is calculated by taking the difference between a loan’s EstimatedEffectiveYield and EstimatedLoss. Both of these variables are described in detail on the variable reference page.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084

A summary shows that the average expected return is around 9.6%. This is higher than the 6.7% average total return listed on Prosper’s website. There are also a lot of NAs for this particular variable, probably attributable to the fact that this variable wasn’t being tracked until the start of July 2009.

The distribution appears to be approximately normally distributed. However, some of the values seem to have very high counts, which makes it much harder to see the data in the tails. By scaling the y-axis by square roots, the tails should be easier to see.

Now the tails are much easier to see. The distribution does indeed remain approximately normally distributed, with the peak right around the mean and median. Most of the estimated returns are positive.

## [1] 0.9982798

In fact, over 99% of all loans through Prosper have a positive estimated return. From a lender standpoint, this is excellent news. However, I doubt that these estimations live up to reality. This variable is more theoretical and makes assumptions about the actions of the borrower. It assumes that the borrower makes all of their payments and does not default on the loan, which I don’t think the data will support. Further investigation into the result of loans is required.
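A proportion like the one above can be computed as the mean of a logical comparison. The sketch below uses an invented vector in place of EstimatedReturn. Missing values are excluded from the denominator here; that must also be the case for the reported figure, since with 29,084 NA’s out of 113,937 loans the share could not exceed about 74% otherwise.

```r
# Toy stand-in for EstimatedReturn; values are invented for illustration.
estimated_return <- c(0.074, 0.092, -0.010, 0.117, NA, 0.284, 0.096, NA)

# Comparing to zero gives TRUE/FALSE (NA stays NA); the mean of a logical
# vector is the fraction of TRUE values among the non-missing entries.
share_positive <- mean(estimated_return > 0, na.rm = TRUE)
share_positive
```

In other words, the 99.8% figure describes loans that have an estimated return at all, not every loan in the dataset.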

Borrower State

BorrowerState is exactly what it sounds like: the state the borrower currently resides in when acquiring the loan.

##  Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...

The structure of the variable shows that each state is treated as its own level.

Quickly plotting the variable shows that the levels need to be reordered if they are going to be visualised in a bar chart. This data will be reordered from highest count to lowest count.
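Reordering a factor by frequency can be sketched as below. The `state` vector is invented for illustration; the idea is simply to sort the counts and reuse their order as the level order.

```r
# Toy stand-in for BorrowerState; values are invented for illustration.
state <- factor(c("CA", "TX", "CA", "NY", "CA", "TX", "FL", "NY", "CA", "TX"))

# Count each state, sort the counts from highest to lowest, and use the
# resulting order as the new level order.
state_counts <- sort(table(state), decreasing = TRUE)
state <- factor(state, levels = names(state_counts))

levels(state)
table(state)
```

Bar chart functions draw one bar per level in level order, so after this step the bars run from the most common state to the least common one.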

California appears to have many more borrowers than any other state. This is understandable since Prosper was founded in San Francisco, California. The states following California (Texas, New York, and Florida) are also metro hubs with large tech industries. Individuals who are more technologically inclined might be more willing to take out a loan online.

Credit Grade and Prosper Rating

In an effort to make investments easier to assess for potential investors, Prosper created the “CreditGrade” scoring system. This was eventually replaced with the “Prosper Rating” system in 2009, which is very similar to CreditGrade. In order to analyze these variables throughout all of the data, the two rating systems need to be combined into one.

## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR" "NC"

## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR"

Looking at the levels of the two variables, Credit Grade has an extra level called NC. The remaining levels are the same for both. It is also important to note that the variables have no NA values. Rather, missing values are stored as “”.

In order to manipulate these variables effectively, the “” values need to be substituted with NA. Once the “” values are transformed into NA’s, ProsperRatings and CreditGrade can be combined into a single column and ordered from the lowest rating to the highest rating.

##    NC    HR     E     D     C     B     A    AA 
##   141 10443 13084 19427 23994 19970 17866  8881

The plot for the combined rating systems turns out very well. The data appears roughly normally distributed, with the C rating having the most values. Apart from the rarely used NC level, the highest rating, AA, has the fewest values.

Prosper Score

A borrower’s Prosper Score is an internal measurement used by Prosper to measure the risk of a borrower. The higher the score, the lower the risk of the borrower to lenders.

##  num [1:113937] NA 7 NA 9 4 10 2 4 9 11 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    4.00    6.00    5.95    8.00   11.00   29084

It appears that there are quite a few “NA’s” in this variable. There are also some values above ten. The Prosper API doesn’t explain these values; according to its documentation, scores should only go up to ten. As a result, NA’s and values greater than 10 will be left out of the resulting plot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   5.862   8.000  10.000

## 
##     1     2     3     4     5     6     7     8     9    10 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750

The ProsperScore distribution appears roughly normally distributed and looks similar to the combined rating system explored previously. Most values fall in the middle, between 4 and 8, with the median value at 6. The tails on either end of the distribution have much lower counts.

Loan Status

Each loan has a loan status associated with it. This determines whether the current loan has been completed, whether it is late on its payments, or whether it has been defaulted on.

##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (>120 days)"   "Past Due (1-15 days)"  
##  [9] "Past Due (16-30 days)"  "Past Due (31-60 days)" 
## [11] "Past Due (61-90 days)"  "Past Due (91-120 days)"

Currently, there is a lot of detail in the levels. However, the granularity of the data can be reduced by combining some of the levels. More specifically, some of the Past Due bins can be combined. FinalPaymentInProgress loans can also be lumped with Current loans.

Once the categorical bins have been combined, the levels can also be ordered from the most favorable status to the least favorable status.

## 
##           Completed             Current PastDue (1-30 days) 
##               38074               56781                1071 
##  PastDue (>30 days)           Defaulted          Chargedoff 
##                 996                5018               11992 
##           Cancelled 
##                   5
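Collapsing and reordering the levels can be done in one step by assigning a named list to `levels()`. The sketch below uses one invented observation per original level; the grouping follows the combined level names shown in the table above.

```r
# The twelve original LoanStatus levels, one observation each for illustration.
status <- factor(c("Cancelled", "Chargedoff", "Completed", "Current",
                   "Defaulted", "FinalPaymentInProgress",
                   "Past Due (>120 days)", "Past Due (1-15 days)",
                   "Past Due (16-30 days)", "Past Due (31-60 days)",
                   "Past Due (61-90 days)", "Past Due (91-120 days)"))

# Assigning a named list to levels() maps each new level (the name) to the
# old levels it absorbs, and sets the new level order at the same time.
levels(status) <- list(
  "Completed"           = "Completed",
  "Current"             = c("Current", "FinalPaymentInProgress"),
  "PastDue (1-30 days)" = c("Past Due (1-15 days)", "Past Due (16-30 days)"),
  "PastDue (>30 days)"  = c("Past Due (31-60 days)", "Past Due (61-90 days)",
                            "Past Due (91-120 days)", "Past Due (>120 days)"),
  "Defaulted"           = "Defaulted",
  "Chargedoff"          = "Chargedoff",
  "Cancelled"           = "Cancelled"
)

table(status)
```

Any original level left out of the list would become NA, so it is worth checking that the total count is unchanged after the mapping.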

A vast majority of the loans are either completed or active with up-to-date payments. However, there is also a sizable minority of loans that either go into default or are charged off. These negative results account for approximately 15% of all loans from Prosper. Recalling the results from the EstimatedReturn distribution, it is clear that not all investments result in a positive return. Lenders that invested in loans that were either defaulted on or charged off probably made a net loss.

Loan Amount

Loan amount is the total amount of cash given to the borrower once the loan is approved. Loans start at $1,000 and go up to a maximum of $35,000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The distribution is right skewed with multi-modal peaks. These peaks are all at the $5,000 increments, which is probably attributable to the fact most people get loans at even amounts. 75% of the loans are $12,000 or less, with only a handful of loans at the maximum amount of $35,000.

Employment Status

A common metric requested from lenders is the borrower’s current employment status. Employment status gives the lender a gauge on the potential cashflow the borrower currently has. More cashflow usually represents a safer loan.

## [1] ""              "Employed"      "Full-time"     "Not available"
## [5] "Not employed"  "Other"         "Part-time"     "Retired"      
## [9] "Self-employed"

Currently, the levels are not ordered and are listed nominally. Not available and "" basically represent the same idea and can be combined.

To better analyse the variable, the levels will be properly ordered and the two levels mentioned earlier will be removed from the plot.

It appears that most borrowers are employed in some manner. Very few report being unemployed, retired, or part-time. I imagine most borrowers just report being “Employed” and leave out the details of their type of employment.

Employment Duration

Along with employment status, lenders also consider the borrower’s employment duration. Longer durations usually signify more stable income, and thus a safer investment.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

The summary of the variable, together with the Prosper API, confirms that EmploymentStatusDuration is measured in months. The variable will be viewed in years for the following plots, as years are much easier to visualize and understand.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.167   5.583   8.006  11.420  62.920    7625
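The conversion is a simple division by 12. As a quick check, the sketch below applies it to the quartiles and maximum from the first summary (the vector name and element labels are chosen for this example).

```r
# Quartiles and maximum of EmploymentStatusDuration in months, as reported in
# the first summary above.
duration_months <- c(q1 = 26, median = 67, q3 = 137, max = 755)

# Convert months to years.
duration_years <- duration_months / 12
round(duration_years, 2)
```

The results (2.17, 5.58, 11.42, and 62.92 years) agree with the second summary.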

The initial plot shows that the data is right skewed and unimodal. There is a long tail with some values way beyond the 3rd quartile. The middle 50% of borrowers have employment durations between roughly 2 and 11.5 years, with a median of about five and a half years. Approximately 80% of borrowers have at least one year of employment duration.

Summary of Univariate Analysis

The dataset consists of 81 variables with 113,937 observations.

The variables of interest in this analysis are:

Interest Rate
Average Credit Score
Income range
Monthly Income
Debt/Income Ratio
Term Length
Estimated Return
Borrower State
Borrower Rating
Prosper Score
Loan Status
Loan Amount
Employment Status
Employment Duration

Some of the more important observations:

The interest rate distribution was slightly right skewed, with the average rate being ~19%.
The median credit score was 690.
The loan amount distribution was multimodal with most loans being less than or equal to $15,000.
Most loans had terms of 36 months.
Borrower ratings and Prosper Score were normally distributed, with most loans having moderate risk.
The average borrower was employed for 5.5 years and made between $50,000 and $75,000.

The main feature of interest in this dataset is the borrower’s interest rate. The goal is to make a model that can predict the interest rate of a Prosper loan, given some list of variables.

There are several variables that could potentially predict the interest rate a borrower has. Obviously Credit Score will be used, as it is one of the more well known predictors for interest rate. Income, Loan Amount, Employment Duration/Status, and Debt/Income Ratio all might be correlated with interest rate as well, and will definitely appear in the bivariate analysis. Rating and Prosper Score are other interesting variables to explore, especially with some of the other supporting variables.

In this analysis, there were several variables that were adapted so they could be properly used. Income, Term Length, Loan Status, Borrower State, and Employment Status all had their levels added, combined, or reordered. One new variable, Rating, was created by combining the two rating systems implemented by Prosper. This rating gives a score to each borrower in the dataset that reflects their riskiness and return for lenders.

From the univariate exploration, there were a few unusual observations. The interest rate distribution had an unusually high occurrence count at 32%. This could be due to rounding, or it could be a specific point that Prosper’s interest rate algorithms arrive at. There were also some values outside of the range of credit scores. This could be due to those borrowers having no credit.

Bivariate Exploration

Interest Rate vs. Credit Score

A borrower’s credit score is widely known as being the biggest factor in determining the interest rate of a loan. Considering that, it will be the first variable paired with interest rate.

The initial plot shows that there may be a correlation between interest rate and credit score. There appears to be a slight downward trend in the data. As credit score increases, the interest rate seems to decrease. This is the relationship that was expected.

## [1] -0.4615667

A short correlation test supports this theory. The test shows that there is a moderate negative correlation between credit score and interest rate.

## 
## Call:
## lm(formula = BorrowerRate ~ CreditScoreAverage, data = loans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49873 -0.05051 -0.01165  0.04585  0.21868 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.539e-01  2.070e-03   267.5   <2e-16 ***
## CreditScoreAverage -5.190e-04  2.963e-06  -175.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0663 on 113344 degrees of freedom
##   (591 observations deleted due to missingness)
## Multiple R-squared:  0.213,  Adjusted R-squared:  0.213 
## F-statistic: 3.068e+04 on 1 and 113344 DF,  p-value: < 2.2e-16

Fitting a linear model to this data shows that there is indeed a relationship. While credit score is correlated with interest rate, it explains only 21.3% of the variance in interest rate according to the R^2 value.
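The link between the correlation coefficient and R^2 can be seen on a small example. The sketch below uses invented data (the data frame `toy` is not from the Prosper dataset) and the same `cor()` and `lm()` calls that produce the kind of output shown above.

```r
# Toy data, invented for illustration: rates fall as credit scores rise,
# with some scatter added by hand.
toy <- data.frame(
  CreditScoreAverage = c(620, 650, 680, 700, 720, 750, 780, 810),
  BorrowerRate       = c(0.31, 0.24, 0.27, 0.19, 0.21, 0.12, 0.15, 0.08)
)

# Pearson correlation between the two variables.
r <- cor(toy$CreditScoreAverage, toy$BorrowerRate)

# Simple linear regression of rate on credit score.
fit <- lm(BorrowerRate ~ CreditScoreAverage, data = toy)
r_squared <- summary(fit)$r.squared

# With a single predictor, R-squared is the square of the correlation.
c(r = r, r_squared = r_squared, r_times_r = r^2)
```

The same identity holds for the reported values: squaring the correlation of -0.4615667 gives about 0.213, which is the Multiple R-squared in the model summary.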

With this first variable, the underlying factors that may predict the interest rate of a loan are coming to the surface. However, credit score was an obvious first choice. What other variables in tandem with credit score might provide us with a stronger predictive model?

Interest Rate vs. Loan Amount

One intuition is that the size of the loan might impact the interest rate. As the size of the loan goes up, lenders might require that the borrowers are more trustworthy and less risky than for a small loan.

Overall, there appears to be a slight downward trend in the data. As the amount of the loan increases, the interest rate decreases. The dark bands correspond to the multimodal peaks of the loan amount histogram, where borrowers request loans at similar, even amounts.

## [1] -0.3289599

## 
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount, data = loans)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.221676 -0.053914 -0.001008  0.055049  0.283705 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.256e-01  3.491e-04   646.3   <2e-16 ***
## LoanOriginalAmount -3.941e-06  3.351e-08  -117.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07065 on 113935 degrees of freedom
## Multiple R-squared:  0.1082, Adjusted R-squared:  0.1082 
## F-statistic: 1.383e+04 on 1 and 113935 DF,  p-value: < 2.2e-16

The scatterplot and the correlation test both support the relationship between loan amount and interest rate. The small negative correlation means that as the size of the loan increases, the interest rate on average decreases a little.

Unfortunately, fitting a linear model to this variable does not yield the best results. The relationship is not strong enough, and thus accounts for only about 11% of the variation in interest rates (R^2 = 0.1082).

Interest Rate vs. Rating

The rating of a loan is a snapshot of a borrower’s riskiness at Prosper. I suspect that there might be a relationship between the rating of a loan and its interest rate, considering that the rating system is used to determine the risk and return of a loan.

## 
## Call:
## lm(formula = BorrowerRate ~ ratings, data = loans)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.296560 -0.016560 -0.000755  0.020135  0.237194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.222360   0.002951  75.350  < 2e-16 ***
## ratingsHR    0.074200   0.002971  24.976  < 2e-16 ***
## ratingsE     0.061532   0.002967  20.740  < 2e-16 ***
## ratingsD     0.014906   0.002962   5.033 4.84e-07 ***
## ratingsC    -0.031404   0.002960 -10.611  < 2e-16 ***
## ratingsB    -0.068102   0.002961 -22.996  < 2e-16 ***
## ratingsA    -0.107167   0.002963 -36.173  < 2e-16 ***
## ratingsAA   -0.135754   0.002974 -45.642  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03504 on 113798 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.7806, Adjusted R-squared:  0.7805 
## F-statistic: 5.783e+04 on 7 and 113798 DF,  p-value: < 2.2e-16

There is clearly a strong relationship between interest rate and ratings; in fact it is the strongest relationship we have seen so far. The difference between the median of the lowest rating and the highest rating is almost 25 percentage points! Ratings account for 78% of the variance in interest rate according to the R^2 value. It could be that rating is tied to other variables that are correlated with interest rate, such as credit score or the amount of the loan. Regardless, this is definitely a variable that could be used to predict the interest rate of a loan.
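The coefficients in the summary above can be read as differences in mean interest rate relative to a baseline rating. The sketch below demonstrates this on invented data with three rating levels, assuming R’s default treatment coding for an unordered factor (which the coefficient names ratingsHR, ratingsE, and so on suggest).

```r
# Toy data, invented for illustration: three rating levels, two loans each.
toy <- data.frame(
  ratings      = factor(rep(c("NC", "HR", "AA"), each = 2),
                        levels = c("NC", "HR", "AA")),
  BorrowerRate = c(0.20, 0.24, 0.28, 0.32, 0.07, 0.11)
)

fit <- lm(BorrowerRate ~ ratings, data = toy)

# Mean rate per rating, computed directly.
group_means <- tapply(toy$BorrowerRate, toy$ratings, mean)

# The intercept is the mean of the baseline level (the first level, "NC");
# every other coefficient is that level's mean minus the baseline mean.
from_model <- c(NC = coef(fit)[["(Intercept)"]],
                HR = coef(fit)[["(Intercept)"]] + coef(fit)[["ratingsHR"]],
                AA = coef(fit)[["(Intercept)"]] + coef(fit)[["ratingsAA"]])

rbind(group_means, from_model)
```

Read the same way, the reported model puts the mean rate at about 0.222 for NC, 0.222 + 0.074 = 0.297 for HR, and 0.222 - 0.136 = 0.087 for AA, a gap of roughly 21 percentage points between the mean HR and mean AA rates.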

Interest Rate vs. Prosper Score

Similar to Rating, Prosper Score is another risk measurement of a borrower. The major difference is that it is used internally. I expect that there will be some sort of relationship between it and interest rate, just like with Rating.

## [1] -0.6497361

## 
## Call:
## lm(formula = BorrowerRate ~ ProsperScore, data = loans)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.187124 -0.041790 -0.009099  0.036780  0.236614 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.174e-01  5.251e-04   604.5   <2e-16 ***
## ProsperScore -2.040e-02  8.195e-05  -249.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05673 on 84851 degrees of freedom
##   (29084 observations deleted due to missingness)
## Multiple R-squared:  0.4222, Adjusted R-squared:  0.4222 
## F-statistic: 6.199e+04 on 1 and 84851 DF,  p-value: < 2.2e-16

As predicted, there is a moderate negative correlation between interest rate and Prosper score. As the score goes up (that is, as risk goes down), interest rates tend to decrease. Scores near the middle appear to have the most variance in interest rate, with some individuals having rates at 10% on the low side and 35% on the high side. This variable also had a fairly high R^2 value at 42%, making it a good candidate feature for the linear model.

Interest Rate vs. Income

It would make sense that individuals who have more cash flow are rewarded with lower interest rates. The intuition is that higher income results in a better ability to pay off a loan. This means a less risky investment for the lender and lower interest rates for the borrower.

## [1] -0.0889818

Based on both the boxplot and the scatter plot, there does appear to be a relationship between income and interest rate, but it is not very strong. Borrowers with more income do have slightly lower interest rates on average, with each range having a lower median interest rate than the previous. Stated monthly income follows a similar pattern, but further shows how weak the correlation is.

These findings are surprising. I expected income to be correlated much more highly with interest rate. It is possible that these variables aren’t very reliable in determining a borrower’s trustworthiness. Income can fluctuate greatly from month to month, and sometimes isn’t verifiable. These factors together may make income less of a determinant for interest rate and more of a secondary feature to describe the borrower.

Interest Rate vs. Borrower State

Where someone lives in the United States can have a profound impact on their finances, due to cost of living, taxes, etc. It would make sense for interest rates to vary by state as well.

This initial plot, sorted by the median interest rate, shows quite the contrary. Interest rates do not seem to differ much between states on average. The horizontal red line is the median interest rate across all states. Overall, there are slight deviations between states, but they all still lie close to the overall median interest rate. The range between the highest and lowest medians is about 6 percentage points. A few of the states that deviate the most are Maine and Iowa on the low side and Alabama and North Dakota on the high side.
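The per-state medians behind a plot like this can be computed with `tapply()`. The sketch below uses invented data for three states; the column names match the dataset, but the values do not.

```r
# Toy data, invented for illustration.
toy <- data.frame(
  BorrowerState = c("CA", "CA", "CA", "ME", "ME", "ME", "AL", "AL", "AL"),
  BorrowerRate  = c(0.15, 0.19, 0.22, 0.12, 0.16, 0.18, 0.20, 0.23, 0.27)
)

# Median interest rate per state, sorted from lowest to highest.
state_medians <- sort(tapply(toy$BorrowerRate, toy$BorrowerState, median))

# Overall median (the horizontal reference line in the plot).
overall_median <- median(toy$BorrowerRate)

# Spread between the highest and lowest state medians, in percentage points.
spread_pp <- 100 * diff(range(state_medians))

state_medians
overall_median
spread_pp
```

The names of the sorted result can be reused as factor levels to order the boxes in a plot by median rate.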

Interest Rate vs. Employment

Employment history and duration are other important metrics that might have a relationship with interest rate. Borrowers who are currently employed and who have a longer employment duration may be considered to be a safer investment.

For the most part, employment status does not seem to be related to interest rate. There is not much variance between the different employment statuses with the exception of being not employed. Reporting as “Not employed” seems to be the only response that may be related to a higher interest rate.

## [1] -0.01990744

There also does not seem to be any correlation between employment duration and interest rate. The scatter plot shows no discernible trend or pattern. The correlation coefficient is approximately zero, indicating no linear relationship.

It is surprising to find that employment status and duration don’t have much of a relationship with interest rate. The only red flags are when someone reports being unemployed. The difference between working for 6 months and working for 30 years doesn’t seem to have much of an impact on the interest rate a borrower would receive for a loan.

Summary Bivariate Analysis

The bivariate section definitely shed some light into the relationships between interest rate and the other variables in the dataset.

Interest rate appears to be negatively correlated with several variables in the dataset. There were some variables that had no relationships with interest rate at all. Employment duration and employment status both had no correlation with interest rate. Income had a slight negative correlation, but nothing that might suggest that they are closely related. Surprisingly, interest rates did not fluctate much between different states, even though cost of living is different amongst each of them.

The strongest relationships were between interest rate and the borrower’s rating, Prosper Score, and credit score. All of these variables had moderate to strong negative correlations with interest rate. Together, these three variables might provide the foundations of a linear model for predicting interest rate.

Multivariate Exploration

Interest Rate by Rating Density Plot

If borrower ratings and interest rate are truly related, there should be a clear segmentation between the number of borrowers and their interest rate at each rating.

This density plot shows how each rating corresponds with a specific range of interest rates. The highest rating, “AA”, peaks at the low end of the interest rates. The rating under that, “A”, peaks at a slightly higher interest rate. As you move down the tier list, the distributions peak at higher and higher interest rates, with the high risk (“HR”) accounts peaking at the notorious 32% mark.

This clear delineation of the rating further supports the strong relationship between interest rates and ratings.
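One way to put a number on where each rating peaks is to estimate a kernel density per rating and take the rate at which it is highest. The following is a rough sketch on invented data; the rating centres are made up to mimic the ordering described above, and `density_peak` is a hypothetical helper, not part of the original analysis.

```r
# Toy data: invented rates clustered around a different centre for each rating.
set.seed(1)
centres <- c(AA = 0.08, A = 0.12, B = 0.16, C = 0.20, D = 0.25, E = 0.29, HR = 0.32)
loans_toy <- data.frame(
  ratings      = factor(rep(names(centres), each = 200), levels = names(centres)),
  BorrowerRate = rnorm(200 * length(centres), mean = rep(centres, each = 200), sd = 0.01)
)

# Hypothetical helper: the rate at which the kernel density estimate peaks.
density_peak <- function(rates) {
  d <- density(rates)
  d$x[which.max(d$y)]
}

# Peak location for each rating, in rating order (AA first, HR last).
peaks <- tapply(loans_toy$BorrowerRate, loans_toy$ratings, density_peak)
print(round(peaks, 3))
```

If the ratings are as cleanly segmented as the plot suggests, the peaks should increase steadily from “AA” to “HR”.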

Loan Amount vs. Interest Rate by Rating

Loan amount and interest rate had a slight negative correlation in the bivariate section. Let’s explore this variable further, but with ratings in the mix as well.

This plot still shows the small correlation between rating and loan amount from the bivariate section. Overall, there is a rightward shift in the distributions of loan amounts as the rating increases. Most of the high risk loans are for smaller amounts, less than $5,000. The loan amounts above $25,000 contain almost no high risk accounts and only “B” or higher ratings. The downward trend in the interest rate as the rating increases further demonstrates the strong relationship between ratings and interest rate seen in the previous plot.
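The shift in loan amounts across ratings can also be checked with a simple cross-tabulation. This is a minimal sketch on invented rows; `LoanOriginalAmount` is an assumed column name, and the band boundaries simply reuse the $5,000 and $25,000 marks mentioned above.

```r
# Toy data, invented for illustration.
loans_toy <- data.frame(
  ratings = factor(c("HR", "HR", "HR", "D", "D", "B", "B", "A", "AA", "AA"),
                   levels = c("HR", "D", "B", "A", "AA")),
  LoanOriginalAmount = c(2000, 3500, 4000, 4500, 9000, 12000, 26000, 15000, 28000, 30000)
)

# Bin loan amounts into bands; right = FALSE makes each band [lower, upper).
loans_toy$AmountBand <- cut(
  loans_toy$LoanOriginalAmount,
  breaks = c(0, 5000, 25000, Inf),
  labels = c("< $5,000", "$5,000 to $24,999", ">= $25,000"),
  right  = FALSE
)

# Count of loans in each band for each rating.
band_by_rating <- table(loans_toy$AmountBand, loans_toy$ratings)
print(band_by_rating)

# Share of each rating's loans falling in each band (columns sum to 1).
band_share <- prop.table(band_by_rating, margin = 2)
print(round(band_share, 2))
```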

Average Credit Score vs. Interest Rate by Rating

Of all of the variables, credit score, rating, and interest rate are the variables that might be the most closely related.

I believe that this plot perfectly captures the relationship between interest rate, credit score, and ratings. There is a clear separation between the ratings as the credit score increases. The lower-risk accounts with high credit scores get the best ratings from Prosper, and thus the lowest interest rates. As the credit score decreases, the average interest rate increases and the ratings begin to change. There is a nice progression of the ratings as the interest rate increases and the credit score decreases.
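The data behind a plot like this can be sketched as a grouped mean. In the example below, `CreditScoreRangeLower` and `CreditScoreRangeUpper` are assumed column names, and deriving `CreditScoreAverage` as their midpoint is an assumption about how that variable was built; all values are invented.

```r
# Toy data, invented for illustration.
loans_toy <- data.frame(
  ratings               = c("AA", "AA", "A", "A", "C", "C", "HR", "HR"),
  CreditScoreRangeLower = c(780, 780, 720, 720, 660, 660, 600, 600),
  CreditScoreRangeUpper = c(799, 799, 739, 739, 679, 679, 619, 619),
  BorrowerRate          = c(0.07, 0.09, 0.11, 0.13, 0.19, 0.21, 0.31, 0.33)
)

# Assumption: the average credit score is the midpoint of the reported range.
loans_toy$CreditScoreAverage <-
  (loans_toy$CreditScoreRangeLower + loans_toy$CreditScoreRangeUpper) / 2

# Mean interest rate for every (credit score, rating) combination present.
mean_rates <- aggregate(BorrowerRate ~ CreditScoreAverage + ratings,
                        data = loans_toy, FUN = mean)

# Order by credit score so the trend in the mean rate is easy to read.
mean_rates <- mean_rates[order(mean_rates$CreditScoreAverage), ]
print(mean_rates)
```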

Something very peculiar is the band of “HR” borrowers that persists across the top of the distribution as the credit score increases. They all appear to be around, you guessed it, 32%. There must be a specific scenario where a borrower has a decent credit score, but is still classified as a high risk investment. This phenomenon may be the subject of further study in the future.
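A first step toward studying that band might be to isolate the “HR” borrowers with decent credit scores and tabulate their rates. This is only a sketch on invented rows; the 680 cut-off is an arbitrary threshold picked for illustration, not a value from the analysis.

```r
# Toy data, invented for illustration.
loans_toy <- data.frame(
  ratings            = c("HR", "HR", "HR", "HR", "E", "C", "A", "AA"),
  CreditScoreAverage = c(540, 700, 720, 760, 650, 690, 740, 800),
  BorrowerRate       = c(0.29, 0.32, 0.32, 0.30, 0.28, 0.19, 0.11, 0.07)
)

# High-risk borrowers whose credit score is nonetheless decent.
# The 680 cut-off is an arbitrary threshold chosen for this sketch.
hr_high_score <- subset(loans_toy, ratings == "HR" & CreditScoreAverage >= 680)

# How often each interest rate occurs in that group, most common first.
rate_counts <- sort(table(hr_high_score$BorrowerRate), decreasing = TRUE)
print(rate_counts)
```

On the real data, the rows in `hr_high_score` could then be compared against the rest of the “HR” group on other variables to look for what sets them apart.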

With the major relationships explored, a linear model can be built for predicting interest rate.

## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ratings, data = loans)
## m2: lm(formula = BorrowerRate ~ ratings + ProsperScore, data = loans)
## m3: lm(formula = BorrowerRate ~ ratings + ProsperScore + CreditScoreAverage, 
##     data = loans)
## 
## ======================================================================
##                             m1              m2              m3        
## ----------------------------------------------------------------------
##   (Intercept)               0.222***        0.312***        0.282***  
##                            (0.003)         (0.000)         (0.001)    
##   ratings: HR/NC            0.074***                                  
##                            (0.003)                                    
##   ratings: E/NC             0.062***       -0.024***       -0.024***  
##                            (0.003)         (0.000)         (0.000)    
##   ratings: D/NC             0.015***       -0.073***       -0.073***  
##                            (0.003)         (0.000)         (0.000)    
##   ratings: C/NC            -0.031***       -0.126***       -0.126***  
##                            (0.003)         (0.000)         (0.000)    
##   ratings: B/NC            -0.068***       -0.167***       -0.169***  
##                            (0.003)         (0.000)         (0.000)    
##   ratings: A/NC            -0.107***       -0.211***       -0.213***  
##                            (0.003)         (0.000)         (0.000)    
##   ratings: A/NCA           -0.136***       -0.247***       -0.251***  
##                            (0.003)         (0.000)         (0.001)    
##   ProsperScore                              0.001***        0.002***  
##                                            (0.000)         (0.000)    
##   CreditScoreAverage                                        0.000***  
##                                                            (0.000)    
## ----------------------------------------------------------------------
##   R-squared                 0.781           0.915           0.915     
##   adj. R-squared            0.781           0.915           0.915     
##   sigma                     0.035           0.022           0.022     
##   F                     57826.500      130230.538      114629.424     
##   p                         0.000           0.000           0.000     
##   Log-likelihood       219909.979      204322.264      204552.934     
##   Deviance                139.732          40.241          40.023     
##   AIC                 -439801.958     -408626.528     -409085.868     
##   BIC                 -439715.177     -408542.390     -408992.381     
##   N                    113806           84853           84853         
## ======================================================================

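Ultimately, the full model (m3) can account for 91.5% of the variance in the interest rate of a loan. Most of that comes from ratings alone (78.1% in m1); adding Prosper Score raises it to 91.5% in m2, while adding credit score on top leaves R-squared unchanged to three decimal places. This is a solid start for the model, considering it only uses 3 features.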
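The nested-model comparison can be sketched as follows. The formulas mirror the calls shown in the table, but the data here is invented (with rating built in as the dominant effect), so the coefficients and R-squared values will not match the table.

```r
# Toy data: invented so that rating explains most of the rate.
set.seed(7)
n <- 700
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")
rating_effect <- c(HR = 0.32, E = 0.29, D = 0.25, C = 0.20, B = 0.16, A = 0.12, AA = 0.08)

loans_toy <- data.frame(
  ratings = factor(sample(rating_levels, n, replace = TRUE), levels = rating_levels)
)
loans_toy$ProsperScore       <- sample(1:10, n, replace = TRUE)
loans_toy$CreditScoreAverage <- round(runif(n, 600, 850))
loans_toy$BorrowerRate <- unname(rating_effect[as.character(loans_toy$ratings)]) -
  0.002 * loans_toy$ProsperScore + rnorm(n, sd = 0.015)

# Three nested models; update() adds one predictor at a time.
m1 <- lm(BorrowerRate ~ ratings, data = loans_toy)
m2 <- update(m1, . ~ . + ProsperScore)
m3 <- update(m2, . ~ . + CreditScoreAverage)

# Compare explained variance across the models.
r_squared <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                    function(m) summary(m)$r.squared)
print(round(r_squared, 3))

# F-tests for whether each added predictor improves the fit.
print(anova(m1, m2, m3))
```

One caveat when applying this to the real data: in the table above, N differs between m1 (113,806) and m2/m3 (84,853), presumably because Prosper Score is missing for some loans. `anova()` requires all models to be fit to the same rows, so the models would need to be refit on the complete cases before they can be compared this way.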

Summary of Multivariate Analysis

The multivariate analysis further supported the relationships between interest rate and the other variables in the dataset. Rating continued to show a strong connection with interest rates across both loan amounts and credit score. The density plot of ratings by interest rate further supported the clear segmentation of ratings by interest rate.

The linear model that was built to predict interest rate can account for 91.5% of the variance in interest rate. It was built using the Ratings, Prosper Score, and Credit Score variables.

Conclusion / Final Plots

Relationship between Interest Rate and Rating

Ratings are clearly segmented by interest rates. Low risk borrowers occur most often at low interest rates. As the risk level of the rating increases, borrower frequency peaks at higher and higher interest rates.

Interest Rates Across US States

For the most part, median interest rates remain consistent across the various states in the US. The largest deviation from the overall median is at most 3%, with most states deviating by less than 1%. The states with the largest deviations also tend to have the smallest sample sizes.

The Power of Credit Score and Rating on Interest Rates

As credit score increases, interest rates fall. The bands of diverging colors show the change in rating as you get higher credit scores. The safest investments appear at the highest credit scores with low interest rates and the best ratings. The high risk investments usually have low credit scores, high interest rates, and low ratings.

Reflection

Overall, this was a very interesting and enlightening dataset to explore. It was really surprising to me how impactful credit score was on interest rate, and how negligible income, locale, and employment were. Perhaps some of those other variables are used to determine whether a borrower can get a loan to begin with rather than what their interest rate will be.

The hardest part of this project was sifting through the vast number of variables. There are so many potential combinations of categorical and continuous variables that I might have missed some interesting trends. I sometimes found myself at a dead end in the later bi/multivariate sections, wishing that I could go back and add more variables without increasing the bulk of the project. I think in the future I will use more visualization tools to preview as much of the data as possible before diving into analyzing multiple variables at once. That way, I can filter out the variables that have no trends and go straight to the more impactful relationships.

I definitely believe that more research can be done on this dataset. There are variables that I didn’t explore due to the length of the project and the depth of the dataset. I would love to one day revisit this project and improve the model further by adding more features. I could also take a deep dive into investigating the unusually high occurrence of the 32% interest rate. In the end, I learned a lot about EDA in this project and now have a whole toolkit of methods at my disposal to explore other datasets.