Calculating Alpha, Beta, and R-Squared (R²) of a Custom Defi Token Index

Jun 10, 2021

Hey everyone!

I wanted to share something I’ve been working on in my free time, both to help inform your decisions and to gather feedback and constructive criticism from those of you with a ton more experience, insight, and technical prowess than myself (probably most of you out there :) ). As the title indicates, my little mini-project is essentially a comparison of the daily returns of a custom-made index of Defi-related tokens against a benchmark index of the general cryptocurrency landscape. I make this comparison through the lens of the CAPM by calculating beta (a measure of systematic risk), alpha (its “edge” over the market), and R-squared (how much of the index’s movement is explained by the benchmark).

This is for those interested in the recent growth of decentralized finance, but more generally, for anyone whose curiosity extends into finance, crypto, data analysis, or computer science.

Any feedback and constructive criticism is totally welcome. Feel free to make recommendations or pick this thing apart if you see something that could be better or is just plain wrong. I’m an open book and would love to get better at this so I can keep coming back with better projects and better insights to share. Keep reading, and much love!

Summary:

I always feel like it’s a good idea to start with a general overview and then get into the most interesting parts of the nitty-gritty. Here is the basic process from start to finish, and the results:

1. Create a cap-weighted benchmark index to represent the crypto market as a whole, and a second cap-weighted index for Defi tokens. From here on I will refer to the Defi token index as ‘Defi’ and the general cryptocurrency index as ‘Benchmark’ for the sake of brevity.
2. Plot the daily returns against each other and run a linear regression to calculate the beta, alpha, and r-squared of Defi (response variable) against the Benchmark (predictor).
3. Adjust for any outliers and look for any news-driven or rate-driven rationale behind these large deviations.
4. Make a guess as to why the data came out like it did, and list any potential holes in my logic and any shaky assumptions that the analysis is built on.

Results:

I find that Defi has:

○ Beta of ~.878, alpha of ~.0066, and an r-squared of .301.

There were a couple large spikes in Defi’s daily returns, and removing those outliers gives us:

○ Beta of ~.822, alpha of ~.0042, and an r-squared of ~.332

Admittedly, I had to brush up on these metrics, so here is a quick recap. Alpha is the excess return of the investment relative to the return of the benchmark; in the regression it is the intercept, i.e. the average daily return left over once the benchmark’s movement is accounted for. Beta is the measure of volatility of Defi in relation to the overall market (represented here by the benchmark). R-squared is the percentage of the variance in Defi’s returns that can be explained by movements in the benchmark.
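For reference, here is how the three metrics fall out of a one-variable least-squares regression of Defi’s daily return on the Benchmark’s daily return. The symbols (r for a daily return, bars for sample means, ε for the residual) are notation chosen for this sketch. Note that textbook CAPM works with returns in excess of a risk-free rate, whereas the regression in this post uses raw daily returns, so alpha here is best read as the regression intercept.

```latex
\begin{aligned}
r_{\text{defi},t} &= \alpha + \beta\, r_{\text{bench},t} + \varepsilon_t \\
\beta &= \frac{\operatorname{Cov}(r_{\text{bench}},\, r_{\text{defi}})}{\operatorname{Var}(r_{\text{bench}})} \\
\alpha &= \bar{r}_{\text{defi}} - \beta\, \bar{r}_{\text{bench}} \\
R^2 &= 1 - \frac{\sum_t \hat{\varepsilon}_t^{\,2}}{\sum_t \left(r_{\text{defi},t} - \bar{r}_{\text{defi}}\right)^2}
     = \operatorname{Corr}(r_{\text{bench}},\, r_{\text{defi}})^2
\end{aligned}
```

The last equality only holds for a regression with a single predictor and an intercept, which is the setup used here.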

Daily Returns Timeseries

Defi vs. Benchmark Daily Returns (same data)

I want to caveat by saying I had some doubts about the results of my analysis, for several reasons:

I always figured that the crypto market generally moves in tandem. When bitcoin and eth go up, the rest of the tokens/coins seem to go up as well. The R-squared of ~.3 only partly supports this notion: it corresponds to a correlation of roughly 0.55 (the square root of R-squared in a one-variable regression), which is moderate rather than strong, and it implies that only 30%-33% of the variance in Defi’s returns can be explained by movements in the benchmark index.
The beta of ~.88 is a little more believable, but again (maybe my eyes fool me) it seems like movements in Defi’s returns stretch farther than corresponding movements in the benchmark’s returns.
These metrics are derived from the Capital Asset Pricing Model, which is a very traditional lens to view the space from. It’s also predicated on the assumptions that a) markets are very competitive and efficient and b) the markets are dominated by rational, risk-averse investors who seek to maximize returns. I feel like I can refute at least one of the two assumptions with a simple “stonk” or “lambo” meme, but I’ll just leave that here for you to decide.
I tried to double-check my doubts above by switching the predictor and response variables (same r-squared), shifting the data set by 1–3 days to see if there was a systematic lag (didn’t seem so), removing a couple of outliers per above (not much change), and doing an RSS calculation (in line with the r-squared I got).
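The swap and lag checks in the last item can be sketched roughly as follows. This is a minimal illustration on synthetic returns, not the actual script or data: the series, the seed, and the helper name `r_squared` are all made up for the example.

```python
import numpy as np

# Synthetic daily returns standing in for the real series (assumption, not real data)
rng = np.random.default_rng(0)
bench = rng.normal(0.005, 0.04, 180)
defi = 0.006 + 0.9 * bench + rng.normal(0.0, 0.05, 180)


def r_squared(x, y):
    """R-squared of an OLS fit of y on x, computed as 1 - RSS/TSS."""
    slope, intercept = np.polyfit(x, y, 1)
    rss = np.sum((y - (intercept + slope * x)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss


# Swapping predictor and response leaves R-squared unchanged
r2_forward = r_squared(bench, defi)
r2_swapped = r_squared(defi, bench)

# Shift Defi by 1-3 days relative to the benchmark to look for a systematic lag
lagged_r2 = {lag: r_squared(bench[:-lag], defi[lag:]) for lag in (1, 2, 3)}

print(f"forward: {r2_forward:.3f}, swapped: {r2_swapped:.3f}")
print({lag: round(value, 3) for lag, value in lagged_r2.items()})
```

The swapped fit matches because, with one predictor, R-squared is just the squared correlation, which is symmetric in the two series. If a lagged fit scored clearly higher than the same-day fit, that would hint at one index leading the other.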

For further exploration, I get into the technical details/implementation below — seeing as I’m pretty new to this I imagine there is (a lot of) room for improvement/optimization. My code is probably horrendously ugly, or inelegant at best (I just learned Python 3 months ago [as of April 1, 2021]) so have at it! I want to get better.

Implementation:

Creating the Benchmarks

1. For the total crypto index, I looked at how other entities were creating indices, including the composition methodology and the respective index weighting of different assets. These included the CMC Markets All Crypto Index, the Bloomberg Galaxy Crypto Index, and the Solactive CMC200. I didn’t use a divisor like they did because I am just measuring relative change and not creating a palatable index product for investor consumption.
2. I decided to choose my own asset makeup using the following: BTC, ETH, LTC, BCH, Monero, EOS, Dogecoin. I wanted to use currencies that have been in circulation for a while, have consistently ranked near the top, and capture a large percentage of the total market cap (standing around $1.99T at the time of this writing). Collectively, these assets make up about 70% of the total market, which I feel is a decent capture. I wanted to avoid coins that were a little more nuanced (e.g. exchange tokens like Binance Coin) or coins that don’t yet fulfill their stated purpose (e.g. Cardano, which is a smart contract platform that isn’t live yet), both due to recent disproportionate rises and due to the risk associated with adoption or actual use that could potentially skew the metric. There’s also a bajillion other coins out there, and I don’t mean to offend anyone by leaving them out of this calculation. I’m not super familiar with them and they would probably only make up a small share of the benchmark, so I just went with what I know. But feel free to make your case in the comments.
3. I also decided to leave price out of the explicit calculation and just use daily market caps, as that would capture the asset’s weight as well as the price action. Here I was looking for the daily percentage change weighted by each asset’s position in the market. I tried to emulate having a position in a weighted index like the SPY, but obviously in the context of cryptocurrency. I used Coingecko’s free API to request each asset’s daily marketcap and put them in an SQLite database. Then I summed up the marketcaps of each asset in the benchmark and grouped by date. CoinMarketCap’s numbers were pretty much in line with Coingecko’s, but their API isn’t free, so there you go. With Coingecko, you don’t need an API key and everything is publicly available and free. Thanks to Coingecko! Here is the short script I used:

import sqlite3

connection = sqlite3.connect('defibeta.db')
cursor = connection.cursor()
# sum the daily market caps of the benchmark coins into one row per date
cursor.execute("""
    CREATE TABLE IF NOT EXISTS benchmark AS
    SELECT datetime(date, 'unixepoch') AS date, sum(marketcap) AS total_marketcap
    FROM market_cap
    WHERE coin_id IN ('bitcoin', 'ethereum', 'litecoin', 'bitcoin-cash', 'monero', 'eos', 'dogecoin')
    GROUP BY date;
""")
connection.commit()
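To see what that query produces, here is a small self-contained sketch against an in-memory database. The `market_cap` schema (a `coin_id`, a unix-epoch `date`, and a `marketcap` per row) is inferred from the query above rather than shown in the post, and the rows are made-up numbers, not real market caps.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # throwaway database for the sketch
cursor = connection.cursor()

# Assumed schema: one row per coin per day, date stored as a unix timestamp
cursor.execute("CREATE TABLE market_cap (coin_id TEXT, date INTEGER, marketcap REAL)")
rows = [
    ("bitcoin", 1601942400, 200.0),   # 2020-10-06, illustrative values only
    ("ethereum", 1601942400, 40.0),
    ("uniswap", 1601942400, 1.0),     # not in the benchmark, so filtered out
    ("bitcoin", 1602028800, 210.0),   # 2020-10-07
    ("ethereum", 1602028800, 42.0),
    ("uniswap", 1602028800, 1.1),
]
cursor.executemany("INSERT INTO market_cap VALUES (?, ?, ?)", rows)

# Same idea as the benchmark query: filter to the basket, sum per date
cursor.execute("""
    CREATE TABLE benchmark AS
    SELECT datetime(market_cap.date, 'unixepoch') AS date,
           sum(marketcap) AS total_marketcap
    FROM market_cap
    WHERE coin_id IN ('bitcoin', 'ethereum')
    GROUP BY market_cap.date
""")

benchmark_rows = cursor.execute(
    "SELECT date, total_marketcap FROM benchmark ORDER BY date"
).fetchall()
print(benchmark_rows)
connection.close()
```

Qualifying the grouping column as `market_cap.date` is a choice made for this sketch so that the grouping does not depend on how SQLite resolves the `date` alias against the column of the same name.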

4. For the Defi Index, I chose the following assets: Aave, Pancake Swap, Uniswap, Sushiswap, Yearn.finance, Compound, UMA, 0x, Bancor, and Mkr. These are all related to either DEXs, liquidity pool management, lending protocols, or some combination or derivative of these trends. I won’t get into all of them here, but I will touch on Mkr’s inclusion in this index due to its representation of voting power for DAI governance, which makes up a large part of the defi ecosystem. DAI is a stablecoin purely collateralized by crypto assets, as opposed to Tether or USDC, which are in part issued by regulated financial institutions and backed by reserve assets. Although Mkr isn’t really a transacting token, in the spirit of Defi I thought I’d include it in the defi basket here. I used the same script as above, but querying for the Defi assets’ ‘coin_id’ values.

5. Both benchmarks’ data runs from 10/6/2020 to the present. As defi is such a new trend, I wanted to start at a date that would allow me enough data to create a benchmark and adjust for the weird price movements that often happen when an asset is introduced. 10/5/2020 is basically the earliest date that I could gather data on Aave, and more than half the assets in the basket are nearly as young. This is a list of the earliest dates, per asset, that the data is available for: Aave (10/3/2020), Pancake Swap (9/29/2020), Uniswap (9/16/2020), Sushiswap (8/28/20), Yearn.finance (7/22/20), Compound (6/15/20), 0x (11/9/17), Bancor (6/26/20), Mkr (1/30/17). As you can see, most of the assets are pretty new.

Adjusting the Benchmarks

I made a slight adjustment to the overall benchmark makeup to give a little more weight to assets other than bitcoin. By simply summing the respective marketcaps of the benchmark index (before the adjustment), bitcoin would have made up about 80%, followed by ~17.5% for ETH and less than 5% for the remaining assets. To adjust the benchmark, I doubled the weighting of ETH and the other assets, and reduced BTC’s weight by about 25%. The adjusted weights (making up the benchmark I used for the analysis) are as follows: 59% BTC, 35% ETH, 6% for the remaining assets. I did this by creating a separate sqlite table from the table with the raw data.
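The adjustment script itself isn’t reproduced in this post, so the following is only one plausible way to get a similar result, sketched in pandas rather than SQL. The multipliers (0.75 for BTC, 2.0 for everything else), the column names, and the market cap figures are assumptions chosen to mirror the description above.

```python
import pandas as pd

# Illustrative market caps chosen so the raw weights are 80% / 17.5% / 2.5%
caps = pd.DataFrame(
    {"bitcoin": [800.0, 820.0], "ethereum": [175.0, 180.0], "others": [25.0, 24.0]},
    index=pd.to_datetime(["2021-03-30", "2021-03-31"]),
)

# Hypothetical multipliers: cut BTC by 25%, double ETH and the remaining assets
multipliers = pd.Series({"bitcoin": 0.75, "ethereum": 2.0, "others": 2.0})

adjusted = caps * multipliers                    # Series index aligns with the columns
adjusted_total = adjusted.sum(axis=1)            # adjusted benchmark value per day
weights = adjusted.div(adjusted_total, axis=0)   # resulting index weights per day

print(weights.round(3))
```

With these inputs the first day comes out at 60% BTC, 35% ETH, and 5% for the rest, close to the 59/35/6 split quoted above. Because the multipliers are applied to market caps rather than fixed as weights, the weights still drift from day to day as prices move.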

One thing that I did not adjust for is free float, or basically the proportion of assets that are actually liquid. Currently about 15% of ETH has been dormant for 2–3 years, and about 20% of BTC, although I get conflicting reports from different sources. I thought about this a lot and decided against adjusting for it, as it left a little more room for error than I cared for. I also did a quick check in Excel using the 15% and 20% numbers above and found that it wouldn’t actually change the weighting by that much, and seeing as I had already performed a custom adjustment, it didn’t seem like a fruitful exercise.

Calculating Alpha, Beta, R-Squared using Python

To calculate the metrics for the mini-project, I did the following:

1. Queried sqlite for the daily aggregate market caps of the respective indices (benchmark & defi) and put them in Pandas dataframes.
2. From each dataframe I took the marketcap column and called .pct_change() on it to get the daily returns, which I then converted to numpy arrays.
3. Then I used the statsmodels package to do an ordinary least squares (OLS) linear regression on the two variables. I basically followed along with the script and logic from the QuantInsti blog without defining my own linreg function. I just took the pieces and put them together in the main script. From a mathematical standpoint, you are basically finding the line that minimizes the sum of the squared vertical distances from the actual data points to said line. It is depicted in the scatterplot as the red dotted line. The slope of your regression line is the beta and the y-intercept is the alpha.

Here is the script I used:

import sqlite3
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from matplotlib import dates as mpl_dates

connection = sqlite3.connect('defibeta.db')
# read_sql_query already returns a DataFrame
adj_benchmark_df = pd.read_sql_query("""SELECT * FROM benchmark_adjusted;""", connection)
defi_df = pd.read_sql_query("""SELECT * FROM defi_benchmark;""", connection)
# create series of benchmark and defi returns (the first row is NaN, so skip it)
returns_adj_benchmark = adj_benchmark_df['total_marketcap'].pct_change()[1:]
returns_defi = defi_df['total_marketcap'].pct_change()[1:]
# linear regression set up
x = np.array(returns_adj_benchmark)
y = np.array(returns_defi)
x = sm.add_constant(x)  # adds the intercept column
model = sm.OLS(y, x)
results = model.fit()
alpha = results.params[0]  # intercept
beta = results.params[1]  # slope
r2 = results.rsquared
print(f'alpha: {alpha}')
print(f'beta: {beta}')
print(f'r squared: {r2}')
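One thing the script above takes for granted is that both tables cover exactly the same dates in the same order, since the two return series are paired by position. A more defensive variant, sketched here with made-up numbers, joins the two tables on their date column before computing returns. The short column names `bench` and `defi` are introduced just for this example.

```python
import pandas as pd

# Stand-ins for the two query results; the Defi table is missing the first date
bench_df = pd.DataFrame({
    "date": ["2020-10-05 00:00:00", "2020-10-06 00:00:00",
             "2020-10-07 00:00:00", "2020-10-08 00:00:00"],
    "total_marketcap": [100.0, 110.0, 121.0, 133.1],
})
defi_df = pd.DataFrame({
    "date": ["2020-10-06 00:00:00", "2020-10-07 00:00:00", "2020-10-08 00:00:00"],
    "total_marketcap": [10.0, 12.0, 9.0],
})

# Inner join keeps only the dates present in both tables
merged = (
    bench_df.merge(defi_df, on="date", how="inner", suffixes=("_bench", "_defi"))
    .sort_values("date")
    .set_index("date")
    .rename(columns={"total_marketcap_bench": "bench", "total_marketcap_defi": "defi"})
)

# Daily returns for both indices, row-aligned by date; drop the leading NaN row
returns = merged.pct_change(fill_method=None).dropna()
print(returns)
```

The `returns["bench"]` and `returns["defi"]` columns can then be passed to the regression in place of the positionally sliced series.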

Plotting the data

I used matplotlib here to do an overlaid timeseries plot of the returns of both defi and the benchmark, and then a scatterplot to help visualize the regression line in relation to the data points of one return versus the other for each particular day.
I really struggled with manipulating tick frequency on my plots, so I used some hacky workarounds to create date labels. In the interest of time and sanity, after a couple of hours at it I didn’t even bother trying to adjust the tick frequency that the date plots show automatically. I tried to add ticks for the 15th of each month, but I’ll leave that struggle for future me to deal with (Sucka!).

Time Series:

# convert date column to datetime objects
date_labels = []
for i in defi_df['date'][1:]:
    k = i.split()[0]  # keep the date part, drop the time
    clean_date = datetime.strptime(k, '%Y-%m-%d')
    date_labels.append(clean_date)
# plot time series overlay
plt.style.use('seaborn-v0_8')  # named 'seaborn' on Matplotlib older than 3.6
plt.figure(figsize=(10,5))
plt.plot_date(date_labels, returns_adj_benchmark, color='#228833', markersize=0.05, linestyle='--', label='Benchmark')
plt.plot_date(date_labels, returns_defi, color='#ee6677', markersize=0.05, linestyle='solid', alpha=0.8, label='Defi')
plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%b %d %y')
plt.gca().xaxis.set_major_formatter(date_format)
plt.axhline(0, color='#0099ff', linestyle='--', linewidth=0.5)
plt.ylabel("Daily Return")
plt.title('Daily Index Returns: Benchmark vs. Defi (Timeseries)')
plt.legend()
plt.show()
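On the tick-frequency struggle mentioned above: Matplotlib’s date locators can place ticks on specific days of the month, which might be what future me needs. Below is a minimal sketch on synthetic data, with major ticks on the 1st of each month and minor ticks on the 15th. The data and the headless `Agg` backend are assumptions made so the example runs anywhere; drop the `matplotlib.use` line to get an interactive window.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import dates as mpl_dates

# Synthetic daily returns over roughly the same window as the post (not real data)
dates = pd.date_range("2020-10-07", "2021-03-31", freq="D").to_pydatetime()
returns = np.random.default_rng(0).normal(0.0, 0.04, len(dates))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(dates, returns, linewidth=0.8)

# Major ticks on the 1st of every month, labelled like the original plot
ax.xaxis.set_major_locator(mpl_dates.MonthLocator())
ax.xaxis.set_major_formatter(mpl_dates.DateFormatter('%b %d %y'))

# Minor ticks on the 15th of every month, labelled with just the day
ax.xaxis.set_minor_locator(mpl_dates.DayLocator(bymonthday=15))
ax.xaxis.set_minor_formatter(mpl_dates.DateFormatter('%d'))

fig.autofmt_xdate()
fig.canvas.draw()  # force the tick positions to be computed
```

Setting the locator and formatter on the axis directly replaces the manual date-label bookkeeping, and the same two calls work on the axes returned by `plt.gca()` in the script above.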

Scatterplot:

# create linear regression line
linex = x[:, 1]  # remove constant
x2 = np.linspace(-0.2, 0.3, 100)
y_hat = x2 * beta + alpha
# `ols` and `total_variance` were never defined in the script as posted;
# the two definitions below are assumptions inferred from the plot labels
ols = results.resid  # deviation of each day's Defi return from the fitted line
avg_defi_return = y.mean()
# daily return correlation scatter plot
plt.figure(figsize=(10,10))
plt.scatter(linex, y, alpha=0.8, c=ols, cmap='RdPu')
plt.colorbar(orientation='horizontal', aspect=50, label='Deviation of Returns from OLS')
plt.plot(x2, y_hat, 'r', linestyle='--', alpha=0.8, label='Regression Line')
plt.xlim(-0.2,0.3)
plt.ylim(-0.2,0.3)
# set background color and ref lines
ax = plt.gca()
ax.set_facecolor('#000000')
plt.axvline(0, color='#e1e1e1', linewidth=0.5)
plt.axhline(0, color='#e1e1e1', linewidth=0.5)
plt.axhline(avg_defi_return, color='#a7eec9', linestyle='--', linewidth=0.8, label='Avg Daily Defi Return')
# labels, titles, legend
plt.xlabel('Benchmark Daily Return')
plt.ylabel('Defi Daily Return')
plt.title('Defi vs. Benchmark Return')
plt.legend()
plt.show()

Analysis/Insight

Outliers and potential drivers/events around that time: Looking at the time series plot, you can see a couple of spikes in the defi daily returns, one around Jan 24–26 and another around Mar 6–8. Due to some of the doubts previously mentioned, I a) eliminated those dates from the data set (separately) to see how that would affect the metrics I was going after and b) looked into those dates to see if any significant events happened around that time that might have contributed to the increase in returns (or strengthen the case for a simple random walk). I ended up with a beta of ~.822, alpha of ~.0042, and an r-squared of ~.332, so the metrics didn’t change much. The alpha decreased a bit, the beta decreased a bit, and the r-squared increased a bit, which all makes sense within the context of eliminating outliers. My search for significant news was just as underwhelming. I didn’t really find anything significant that would have basically doubled or tripled avg daily returns. Some events included dYdX getting $10m Series B funding (Jan 26), reddit announcing a partnership with the Ethereum Foundation (Jan 27), and Sushiswap trading being added to Coinbase (Mar 9). Coincidentally, the first spike coincided with the whole GameStop fiasco (which reached $347 on Jan 27).
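The outlier step can be sketched like this: mask out the two date windows and refit. Everything here is illustrative. The returns are synthetic with one artificial spike injected, the helper name `fit` is made up, and both windows are dropped in a single pass, which may differ from how the removal was actually done.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns with a date index (assumption, not the real data)
rng = np.random.default_rng(1)
dates = pd.date_range("2020-10-07", "2021-03-31", freq="D")
bench = pd.Series(rng.normal(0.005, 0.04, len(dates)), index=dates)
defi = 0.006 + 0.9 * bench + pd.Series(rng.normal(0.0, 0.05, len(dates)), index=dates)
defi.loc[pd.Timestamp("2021-01-25")] += 0.30  # inject an artificial spike


def fit(x, y):
    """Return (alpha, beta, r_squared) for a one-variable OLS fit of y on x."""
    beta, alpha = np.polyfit(x, y, 1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return alpha, beta, r2


# Date windows to exclude, taken from the spikes described above
windows = [("2021-01-24", "2021-01-26"), ("2021-03-06", "2021-03-08")]
keep = pd.Series(True, index=dates)
for start, end in windows:
    keep.loc[start:end] = False  # label slicing on dates includes both endpoints

full = fit(bench.to_numpy(), defi.to_numpy())
trimmed = fit(bench[keep].to_numpy(), defi[keep].to_numpy())

print("all days:        alpha=%.4f beta=%.3f r2=%.3f" % full)
print("without windows: alpha=%.4f beta=%.3f r2=%.3f" % trimmed)
```

Comparing the two printed lines shows how sensitive each metric is to a handful of days, which is the same comparison as the before and after numbers in the Results section.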

I have to admit I am still scratching my head a little bit. I would have expected a little more correlation and a higher beta — but keeping in mind the data set is so small and recent, I’m going to default to the numbers for now. I’m not sure that there is any actionable insight on this, aside from the fact that Defi is clearly growing (see plot below of growth in the defi benchmark marketcap).

Conclusion

In conclusion, I understand that the essence of defi isn’t really capitalizing on the gains of the tokens themselves, but rather using them as a medium of transacting between lenders and borrowers without a centralized entity. I get that, and I think there is room for better metrics for other aspects of this new asset class, including but not limited to lending rates and profitability, growth in dapp usage, and growth in use cases (demand) for borrowing. For now I can’t think of any reason to borrow crypto assets aside from either leveraging a long position, entering a short position, or simply earning interest on existing assets that are just sitting there. In the ‘real world’, as I say to myself with air quotes, there are many use cases for lending and borrowing outside of pure finance. You borrow to buy a house, you ‘borrow’ when your business buys things on credit terms, you borrow each time your credit card is swiped. My hope is that there really is something here to the Defi dynamic that can extend to a broader audience and broader use cases. I plan to explore this further and share this with you guys as I go (hopefully using your feedback to sharpen my capabilities).

[ Originally published on Reddit — April 1, 2021]