<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://lihaoranicefire.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://lihaoranicefire.github.io/" rel="alternate" type="text/html" /><updated>2025-09-06T04:07:55+00:00</updated><id>https://lihaoranicefire.github.io/feed.xml</id><title type="html">Home</title><subtitle>Ph.D in Mathematics, Fixed-Income Quant Researcher at LSEG</subtitle><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><entry><title type="html">Quant interview prep</title><link href="https://lihaoranicefire.github.io/QuantPrep/" rel="alternate" type="text/html" title="Quant interview prep" /><published>2024-10-01T00:00:00+00:00</published><updated>2024-10-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/QuantPrep</id><content type="html" xml:base="https://lihaoranicefire.github.io/QuantPrep/"><![CDATA[<h1 id="quant-prep">Quant Prep</h1>

<ul>
  <li><a href="#quant-prep">Quant Prep</a>
    <ul>
      <li><a href="#brain-teaser">Brain Teaser</a>
        <ul>
          <li><a href="#question">Question:</a></li>
        </ul>
      </li>
      <li><a href="#mathematics">Mathematics</a></li>
      <li><a href="#statistics">Statistics</a></li>
      <li><a href="#finance">Finance</a></li>
      <li><a href="#numeric-analysis">Numeric Analysis</a>
        <ul>
          <li><a href="#optimization">Optimization</a></li>
          <li><a href="#linear-optimization">Linear optimization</a></li>
        </ul>
      </li>
      <li><a href="#coding">Coding</a>
        <ul>
          <li><a href="#master-theorem-for-divide-and-conquer-recurrences">Master theorem for divide-and-conquer recurrences</a></li>
          <li><a href="#binary-search">Binary search</a></li>
          <li><a href="#sorting">Sorting</a></li>
          <li><a href="#heap-priority-queue">Heap (Priority queue)</a></li>
          <li><a href="#bitmask">Bitmask</a></li>
          <li><a href="#random">Random</a></li>
          <li><a href="#graph">Graph</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h2 id="brain-teaser">Brain Teaser</h2>

<h4 id="question">Question:</h4>

<p>One hundred tigers and one sheep are put on a magic island that only has grass. Tigers can eat grass, but they would rather eat the sheep. Assume: A. Only one tiger can eat the sheep, and that tiger itself becomes a sheep after it eats the sheep. B. All tigers are smart and perfectly rational and they want to survive. Will the sheep be eaten?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> Four people, \(A,B,C\) and \(D\), need to get across a river. The only way to cross the river is by an old bridge, which holds at most 2 people at a time. Being dark, they can’t cross the bridge without a torch, of which they only have one. So each pair can only walk at the speed of the slower person. They need to get all of them across to the other side as quickly as possible. \(A\) is the slowest and takes 10 minutes to cross; \(B\) takes 5 minutes; \(C\) takes 2 minutes; and \(D\) takes 1 minute. What is the minimum time to get all of them across to the other side?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> Suppose that you are blindfolded in a room and told that there are 1000 coins on the floor. 980 of the coins have tails up and the other 20 coins have heads up. Can you separate the coins into two piles so as to guarantee both piles have an equal number of heads? Assume that you cannot tell a coin’s side by touching it, but you are allowed to turn over any number of coins.</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> One hundred prisoners are given the chance to be set free tomorrow. They are all told that each will be given a red or blue hat to wear. Each prisoner can see everyone else’s hat but not his own. The hat colors are assigned randomly and once the hats are placed on the top of each prisoner’s head they cannot communicate with one another in any form, or else they are immediately executed. The prisoners will be called out in random order and the prisoner called out will guess the color of his hat. Each prisoner declares the color of his hat so that everyone else can hear it. If a prisoner guesses correctly the color of his hat, he is set free immediately; otherwise he is executed.</p>

<p>They are given the night to come up with a strategy among themselves to save as many prisoners as possible. What is the best strategy they can adopt and how many prisoners can they guarantee to save? What if there are 3 possible hat colors?</p>

<p><strong>Q:</strong> Seven prisoners are given the chance to be set free tomorrow. An executioner will put a hat on each prisoner’s head. Each hat can be one of the seven colors of the rainbow and the hat colors are assigned completely at the executioner’s discretion. Every prisoner can see the hat colors of the other six prisoners, but not his own. They cannot communicate with others in any form, or else they are immediately executed. Then each prisoner writes down his guess of his own hat color. If at least one prisoner correctly guesses the color of his hat, they all will be set free immediately; otherwise they will be executed. They are given the night to come up with a strategy. Is there a strategy with which they can guarantee that they will be set free?</p>

<h2 id="mathematics">Mathematics</h2>

<p><strong>Q:</strong> Can you pack 53 bricks of dimensions \(1\times1\times4\) into a \(6\times6\times6\) box?</p>

<p><strong>Q:</strong> A basketball player is taking 100 free throws. She scores one point if the ball passes through the hoop and zero points if she misses. She has scored on her first throw and missed on her second. For each subsequent throw, the probability of her scoring is the fraction of throws she has made so far. For example, if she has scored 23 points after the 40th throw, the probability that she will score on the 41st throw is 23/40. After 100 throws (including the first and the second), what is the probability that she scores exactly 50 baskets?</p>

<p><strong>A:</strong> Note that any ordering of a fixed number of scores and misses over throws 3–100 has the same probability: the same factors appear in the product, just in a different order. For simplicity, consider the probability that she scores throws 3–51 and misses throws 52–100, which is</p>

\[\frac{1}{2}\cdot\frac{2}{3}\cdot\frac{3}{4}\cdots\frac{49}{50}\cdot\frac{1}{51}\cdot\frac{2}{52}\cdots\frac{49}{99}=\frac{(49!)^2}{99!}\]

<p>The final answer is this probability multiplied by the number of score/miss patterns over throws 3–100 that give exactly 50 scores in total (throw 1 scored and throw 2 missed, so 49 of the remaining 98 throws must be scores). So the answer is</p>

\[\frac{(49!)^2}{99!}\cdot\binom{98}{49}=\frac{1}{99}\]
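<p>The \(1/99\) answer is easy to sanity-check with a quick Monte Carlo (a sketch; the trial count and seed are arbitrary):</p>

```python
import random

def prob_exactly_50(trials=50_000, seed=0):
    """Simulate the reinforced free-throw process and estimate
    P(exactly 50 scores in 100 throws)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        scored, taken = 1, 2              # throw 1 scored, throw 2 missed
        for _ in range(98):               # throws 3..100
            if rng.random() < scored / taken:
                scored += 1
            taken += 1
        hits += (scored == 50)
    return hits / trials
```

<p>The estimate should land near \(1/99\approx0.0101\); in fact the final score count turns out to be uniform on \(\{1,\dots,99\}\).</p>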

<p><strong>Q:</strong> What is the expected number of cards that need to be turned over in a regular 52-card deck in order to see the first ace?</p>

<p><strong>A:</strong> Let \(X_i\) be 1 if non-ace card \(i\) (\(i=1,\dots,48\)) is turned over before all four aces, and 0 otherwise. Among card \(i\) and the four aces, each of the 5 cards is equally likely to appear first, so \(E(X_i)=\frac{1}{5}\). The number of cards turned over up to and including the first ace is \(X=X_1+\cdots + X_{48}+1\), and thus</p>

\[E(X)=1+\sum_iE(X_i)=1+48\cdot\frac{1}{5}=\frac{53}{5}\]
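<p>The \(53/5=10.6\) answer can be confirmed by simulation (a sketch; cards \(0\)–\(3\) stand in for the aces):</p>

```python
import random

def avg_cards_to_first_ace(trials=100_000, seed=1):
    """Average 1-based position of the first ace in a shuffled 52-card deck."""
    rng = random.Random(seed)
    deck = list(range(52))
    total = 0
    for _ in range(trials):
        rng.shuffle(deck)
        total += next(i for i, c in enumerate(deck) if c < 4) + 1
    return total / trials
```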

<p><strong>Q:</strong> There are \(N\) distinct types of coupons in cereal boxes and each type, independent of prior selections, is equally likely to be in a box.</p>

<ol>
  <li>If a child wants to collect a complete set of coupons with at least one of each type, how many coupons(boxes) on average are needed to make such a complete set?</li>
  <li>If the child has collected \(n\) coupons, what is the expected number of distinct coupon types?</li>
</ol>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> You just had two dice custom-made. Instead of the numbers 1–6, you place single-digit numbers on the faces of each die so that every morning you can arrange the dice in a way as to make the two front faces show the current day of the month. You must use both dice (in other words, days 1–9 must be shown as 01–09), but you can switch the order of the dice if you want. What numbers do you have to put on the six faces of each of the two dice to achieve that?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> A sultan has captured 50 wise men. He has a glass currently standing bottom down. Every minute he calls one of the wise men, who can choose either to turn it over (set it upside down or bottom down) or to do nothing. The wise men will be called randomly, possibly an infinite number of times. When someone called to the sultan correctly states that all wise men have already been called to the sultan at least once, everyone goes free. But if his statement is wrong, the sultan puts everyone to death. The wise men are allowed to communicate only once before they are imprisoned in separate rooms (one per room). Design a strategy that lets the wise men go free.</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> You are holding two glass balls in a 100-story building. If a ball is thrown out of the window, it will not break if the floor number is less than \(X\), and it will always break if the floor number is equal to or greater than \(X\). You would like to determine \(X\). What is the strategy that will minimize the number of drops in the worst case?</p>

<p><strong>Q:</strong> At a theater ticket office, \(2n\) people are waiting to buy tickets. \(n\) of them have only \(\$ 5\) bills and the other \(n\) people have only \(\$ 10\) bills. The ticket seller has no change to start with. If each person buys one \(\$ 5\) ticket, what is the probability that all people will be able to buy tickets without having to change positions?</p>

<p><strong>Q:</strong> Assume you have a fair coin. What is the expected number of coin tosses to get \(n\) heads in a row?</p>

<p><strong>Q:</strong> The Boston Red Sox and the Colorado Rockies are playing in the World Series finals. In case you are not familiar with the World Series, there are a maximum of 7 games and the first team that wins 4 games claims the championship. You have \(\$ 100\) to place a double-or-nothing bet on the Red Sox. Unfortunately, you can only bet on each individual game, not the series as a whole. How much should you bet on each game so that if the Red Sox win the whole series, you win exactly \(\$ 100\), and if the Red Sox lose, you lose exactly \(\$ 100\)?</p>

<p><strong>Q:</strong> A casino comes up with a fancy dice game. It allows you to roll a die as many times as you want until a 6 appears. After each roll, if 1 appears, you will win \(\$ 1\); if 2 appears, you will win \(\$ 2\); …; if 5 appears, you win \(\$ 5\); but if 6 appears all the money you have won in the game is lost and the game stops. After each roll, if the die shows 1–5, you can decide whether to keep the money or keep on rolling. How much are you willing to pay to play the game (if you are risk neutral)?</p>

<p><strong>Q:</strong> Suppose \(X\) is a Brownian motion with no drift, i.e. \(dX_t=dW_t\). If \(X\) starts at 0, what is the probability that \(X\) hits 3 before hitting -5? What if \(X\) has drift \(m\), i.e. \(dX_t=mdt+dW_t\)?</p>

<p><strong>Q:</strong> A couple decide to start having children and keep having children until they have more girls than boys. How many children do they expect to have?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> Consider a shuffled deck of 52 cards. How many cards on average do you need to draw before you draw a King?</p>

<p><strong>Q:</strong> Suppose \(S_n\) is a biased random walk with probability \(p&lt;1/2\) of moving up and \(1-p\) of moving down, with \(S_0=0\). What is the expected number of steps for \(S_n\) to reach \(\alpha\) or \(-\beta\) (\(\alpha,\beta\in\mathbb Z_{\geq1}\))?</p>

<p><strong>A:</strong> First we verify two martingales, conditioning on \(S_n=s_n\):</p>

\[\begin{align*}
\mathbb E\left[S_{n+1}+(1-2p)(n+1)\,\middle|\,S_n=s_n\right]&amp;=\mathbb E\left[S_{n+1}\,\middle|\,S_n=s_n\right]+(1-2p)(n+1)\\
&amp;=p(s_n+1)+(1-p)(s_n-1)+(1-2p)(n+1)\\
&amp;=s_n+(1-2p)n
\end{align*}\]

<p>and</p>

\[\begin{align*}
\mathbb E\left[\left(\frac{1-p}{p}\right)^{S_{n+1}}\,\middle|\,S_n=s_n\right]&amp;=p\left(\frac{1-p}{p}\right)^{s_n+1}+(1-p)\left(\frac{1-p}{p}\right)^{s_n-1}\\
&amp;=p\left(\frac{1-p}{p}\right)\left(\frac{1-p}{p}\right)^{s_n}+(1-p)\left(\frac{p}{1-p}\right)\left(\frac{1-p}{p}\right)^{s_n}\\
&amp;=\left(\frac{1-p}{p}\right)^{s_n}
\end{align*}\]

<p>Let \(p_\alpha\) be the probability that \(S_n\) reaches \(\alpha\) before \(-\beta\), and let \(T\) be the stopping time. By the optional stopping theorem,</p>

\[1=\left(\frac{1-p}{p}\right)^0=\mathbb E\left[\left(\frac{1-p}{p}\right)^{S_{T}}\right]=p_\alpha\left(\frac{1-p}{p}\right)^\alpha+(1-p_\alpha)\left(\frac{1-p}{p}\right)^{-\beta}\]

<p>From which we get</p>

\[p_\alpha=\frac{1-\left(\frac{1-p}{p}\right)^{-\beta}}{\left(\frac{1-p}{p}\right)^\alpha-\left(\frac{1-p}{p}\right)^{-\beta}}\]

<p>On the other hand, we also have</p>

\[0=\mathbb E\left[S_T+(1-2p)T\right]=\mathbb E\left[S_T\right]+(1-2p)\mathbb E\left[T\right]=p_\alpha\alpha+(1-p_\alpha)\cdot(-\beta)+(1-2p)\mathbb E\left[T\right]\]

<p>From which we can deduce</p>

\[\mathbb E\left[T\right]=\frac{1}{1-2p}\left(\frac{\left(\frac{1-p}{p}\right)^\alpha-1}{\left(\frac{1-p}{p}\right)^\alpha-\left(\frac{1-p}{p}\right)^{-\beta}}\beta - \frac{1-\left(\frac{1-p}{p}\right)^{-\beta}}{\left(\frac{1-p}{p}\right)^\alpha-\left(\frac{1-p}{p}\right)^{-\beta}}\alpha\right)\]
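<p>The optional-stopping identity \(\mathbb E[T]=\dfrac{(1-p_\alpha)\beta-p_\alpha\alpha}{1-2p}\) can be checked numerically (a sketch; the parameter values are arbitrary):</p>

```python
import random

def expected_hitting_time(p, alpha, beta):
    """E[T] for the biased walk (p < 1/2) via the two martingales above."""
    r = (1 - p) / p
    p_alpha = (1 - r ** (-beta)) / (r ** alpha - r ** (-beta))
    return ((1 - p_alpha) * beta - p_alpha * alpha) / (1 - 2 * p)

def simulated_hitting_time(p, alpha, beta, trials=20_000, seed=2):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = steps = 0
        while -beta < s < alpha:
            s += 1 if rng.random() < p else -1
            steps += 1
        total += steps
    return total / trials
```

<p>A quick sanity check: with \(\alpha=\beta=1\) the walk stops in exactly one step, so \(\mathbb E[T]=1\).</p>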

<p><strong>Q:</strong> You play a game with a biased coin where there is a 40% chance of heads and 60% chance of tails. You may place a bet; If heads is flipped then you receive your bet back plus the same in winnings. If tails is flipped then you lose your bet. You have \(\$ 10\) and you want to turn this into \(\$ 20\) by continuously betting \(\$ 1\) at a time, walking away when you either have a total of \(\$ 20\) or are bankrupt. What is the probability you will leave with \(\$ 20\)?</p>
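<p>This is exactly the hitting probability \(p_\alpha\) derived above, with \(p=0.4\) and \(\alpha=\beta=10\). A minimal sketch (the function name is ours):</p>

```python
def win_prob(p, stake, target, bet=1):
    """P(reach `target` before 0) betting `bet` per round with win prob p != 1/2."""
    r = (1 - p) / p                    # the ratio from the exponential martingale
    alpha = (target - stake) // bet    # net wins needed
    beta = stake // bet                # net losses allowed
    return (1 - r ** (-beta)) / (r ** alpha - r ** (-beta))
```

<p>For \(p=0.4\) this gives roughly \(1.7\%\): the house edge compounds badly over many small bets.</p>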

<p>Properties of characteristic functions \(\varphi_X(t)=E[e^{itX}]\)</p>

<ul>
  <li>If \(X\sim\mathcal N(\mu,\sigma^2)\), \(\varphi_X(t)=e^{it\mu-\frac{1}{2}\sigma^2t^2}\)</li>
  <li>
\[E[X^k]=\frac{1}{i^k}\varphi^{(k)}_X(0)\]
  </li>
  <li>\(\varphi_{c_1X_1+\cdots+c_nX_n+b}(t)=e^{itb}\varphi_{X_1}(c_1t)\cdots\varphi_{X_n}(c_nt)\) if \(X_i\) are independent</li>
  <li>\(\varphi_{X,Y}(s,t)=\varphi_X(s)\varphi_Y(t)\) if \(X,Y\) are independent</li>
</ul>

<p>Properties of Fourier transform</p>

<ul>
  <li>
\[\hat f(-\xi)=\overline{\hat f(\xi)}\]
  </li>
  <li>
\[\widehat{af(x)+bg(x)}=a\hat f(\xi)+b\hat g(\xi)\]
  </li>
  <li>
\[\widehat{f(x-a)}=\hat f(\xi)e^{-2\pi ia\xi}\]
  </li>
  <li>
\[\widehat{f(x)e^{2\pi iax}}=\hat f(\xi-a)\]
  </li>
  <li>
\[\widehat{f(ax)}=\frac{1}{|a|}\hat f(\frac{\xi}{a})\]
  </li>
  <li>
\[\widehat{\hat f(\xi)}=f(-x)\]
  </li>
  <li>
\[\widehat{(f*g)(x)}=\hat f(\xi)\hat g(\xi)\]
  </li>
  <li>
\[\widehat{f(x)g(x)}=(\hat f*\hat g)(\xi)\]
  </li>
  <li>
\[\widehat{f^{(n)}(x)}=(2\pi i\xi)^n\hat f(\xi)\]
  </li>
  <li>
\[\widehat{x^nf(x)}=(\frac{i}{2\pi})^n\hat f^{(n)}(\xi)\]
  </li>
</ul>

<p>Common distributions</p>

<ul>
  <li>binomial distribution</li>
  <li>geometric distribution</li>
  <li>negative binomial distribution</li>
  <li>Poisson distribution</li>
  <li>Poisson process</li>
  <li>Exponential distribution</li>
</ul>

<p><strong>Q:</strong> What is the law of large numbers?</p>

<p><strong>A:</strong></p>

<p>Weak form: \(\dfrac{(X_1+\cdots+X_n)}{n}\xrightarrow{P}\mu\)</p>

<p>Proof: Chebyshev’s inequality:
\(P(|\bar X - \mu|\geq\epsilon)\leq\dfrac{\sigma^2}{n\epsilon^2}\)</p>

<p>Strong form (needs \(E[X_i^4]&lt;\infty\)):
\(\dfrac{(X_1+\cdots+X_n)}{n}\xrightarrow{\text{a.s.}}\mu\)</p>

<p>Proof: Assume \(\mu=0\). Markov’s inequality applied to \(S_n^4\) gives
\(P(|S_n|\geq n\epsilon)\leq\dfrac{E[S_n^4]}{(n\epsilon)^4}\leq\dfrac{C}{\epsilon^4n^2}\),
which is summable in \(n\), so the Borel–Cantelli lemma finishes the proof.</p>
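<p>A tiny empirical illustration of the law of large numbers (a sketch with \(X_i\sim U(0,1)\), so \(\mu=1/2\)):</p>

```python
import random

def running_mean(n, seed=0):
    """Mean of n i.i.d. Uniform(0,1) draws; should approach mu = 1/2."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n
```

<p>By Chebyshev, the typical deviation shrinks like \(\sigma/\sqrt n\) with \(\sigma^2=1/12\).</p>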

<p><strong>Q:</strong> What is the central limit theorem? Does it imply the law of large numbers?</p>

<p><strong>A:</strong> \(X_i\) are i.i.d. with \(E[X_i]=\mu\), \(Var[X_i]=\sigma^2\)</p>

\[Z_n=\frac{\dfrac{X_1+\cdots+X_n}{n}-\mu}{\sigma/\sqrt n}\xrightarrow{\mathcal D}\mathcal N(0,1)\]

<p>Note that \(Z_n=\sum_i\dfrac{1}{\sqrt n}Y_i\), where \(Y_i=\dfrac{X_i-\mu}{\sigma}\). Then apply characteristic functions: \(\varphi_{Z_n}(t)=\varphi_{Y}\!\left(\dfrac{t}{\sqrt n}\right)^n=\left(1-\dfrac{t^2}{2n}+o\!\left(\dfrac{1}{n}\right)\right)^n\to e^{-t^2/2}\), the characteristic function of \(\mathcal N(0,1)\). Under its own hypotheses the CLT does imply the weak law, since \(\bar X_n-\mu=\frac{\sigma}{\sqrt n}Z_n\xrightarrow{P}0\); but the law of large numbers holds more generally (it needs only a finite mean).</p>
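<p>A quick empirical check of the CLT (a sketch with uniform summands; seed and sizes are arbitrary): about \(95\%\) of the standardized means should fall within \(\pm1.96\).</p>

```python
import math
import random

def standardized_mean(n, rng):
    """Z_n for X_i ~ Uniform(0,1), where mu = 1/2 and sigma^2 = 1/12."""
    mean = sum(rng.random() for _ in range(n)) / n
    return (mean - 0.5) / (math.sqrt(1 / 12) / math.sqrt(n))

def coverage(z=1.96, trials=5_000, n=200, seed=3):
    rng = random.Random(seed)
    return sum(abs(standardized_mean(n, rng)) < z for _ in range(trials)) / trials
```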

<p><strong>Q:</strong> How do you generate a random variable that follows \(\mathcal N(\mu,\sigma^2)\)?</p>
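<p><strong>A:</strong> One standard method is the Box–Muller transform: from two independent \(U(0,1)\) draws, \(Z=\sqrt{-2\ln U_1}\cos(2\pi U_2)\sim\mathcal N(0,1)\), then set \(X=\mu+\sigma Z\). A minimal sketch:</p>

```python
import math
import random

def normal_box_muller(mu=0.0, sigma=1.0, rng=random):
    """One N(mu, sigma^2) draw via Box-Muller.
    Using 1 - random() keeps the log argument in (0, 1]."""
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z
```

<p>Alternatives include inverse-transform sampling with the normal quantile function and the polar (Marsaglia) variant, which avoids the trig calls.</p>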

<p><strong>Q:</strong> Variance reduction techniques to improve the efficiency of Monte Carlo simulation</p>

<ul>
  <li>Low-discrepancy (quasi-random) sequences, e.g. Sobol or Halton points, which fill the unit cube more evenly than i.i.d. uniform draws</li>
</ul>
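<p>Beyond low-discrepancy sequences, another standard technique is antithetic variates: pair each draw \(U\) with \(1-U\) so the two evaluations are negatively correlated. A sketch estimating \(E[e^U]=e-1\):</p>

```python
import math
import random

def mc_antithetic(f, n, rng):
    """Antithetic-variates Monte Carlo estimate of E[f(U)], U ~ Uniform(0,1)."""
    total = 0.0
    for _ in range(n // 2):
        u = rng.random()
        total += f(u) + f(1.0 - u)   # negatively correlated pair
    return total / (2 * (n // 2))
```

<p>Other common tools: control variates, importance sampling, and stratified sampling.</p>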

<h2 id="statistics">Statistics</h2>

<p>t distribution: for an i.i.d. normal sample, \(\dfrac{\bar X-\mu}{s/\sqrt n}\sim t_{n-1}\); in a regression with \(p\) predictors the analogous statistic has \(n-p-1\) degrees of freedom</p>

<p>chi-squared distribution:</p>

<ul>
  <li>Suppose \(Z_1,\dots,Z_k\sim\mathcal N(0,1)\) are independent, then \(\sum_{i=1}^k Z_i^2\sim\chi^2_k\)</li>
  <li>
\[\sum_{i=1}^n(X_i-\bar X)^2\sim\sigma^2\chi^2_{n-1}\]
  </li>
  <li>
\[\sum_{i=1}^n(X_i-\hat X_i)^2\sim\sigma^2\chi^2_{n-p-1}\]
  </li>
</ul>

<p>F distribution</p>

<p><strong>Q:</strong> What are skewness and kurtosis?</p>

<p><strong>A:</strong> They are the third and fourth standardized moments \(\tilde\mu_3\) and \(\tilde\mu_4\), where \(\tilde\mu_n=\mathbb E\left[\left(\frac{X-\mu}{\sigma}\right)^n\right]\)</p>
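<p>In code these are just sample averages of standardized powers (a sketch using the population-style \(1/n\) normalization, not the bias-corrected one):</p>

```python
import math

def standardized_moment(xs, k):
    """k-th standardized sample moment: k=3 gives skewness, k=4 kurtosis."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(((x - mu) / sigma) ** k for x in xs) / n
```

<p>For a normal sample, skewness is near 0 and kurtosis near 3 (excess kurtosis subtracts the 3).</p>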

<h2 id="finance">Finance</h2>

<p>Let’s denote</p>

<ul>
  <li>\(T\): maturity date</li>
  <li>\(t\): the current time</li>
  <li>\(\tau=T-t\): time to maturity</li>
  <li>\(S\): stock price at time \(t\)</li>
  <li>\(r\): continuous risk-free interest rate</li>
  <li>\(y\): continuous dividend yield</li>
  <li>\(\sigma\): annualized asset volatility</li>
  <li>\(c,C,p,P\): prices of a European call, American call, European put, and American put</li>
  <li>\(D\): present value of future dividends</li>
  <li>\(K\): strike price</li>
  <li>\(PV\): present value</li>
</ul>

<p><strong>Q:</strong> How do vanilla European/American option prices change when \(S,K,\tau,\sigma,r,D\) change?</p>

<p><strong>A:</strong></p>

<table>
  <thead>
    <tr><th></th><th>\(c\)</th><th>\(p\)</th><th>\(C\)</th><th>\(P\)</th></tr>
  </thead>
  <tbody>
    <tr><td>\(S \uparrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td></tr>
    <tr><td>\(K \uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td></tr>
    <tr><td>\(\tau \uparrow\)</td><td>?</td><td>?</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td></tr>
    <tr><td>\(\sigma \uparrow\)</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td></tr>
    <tr><td>\(r \uparrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td></tr>
    <tr><td>\(D \uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td></tr>
  </tbody>
</table>

<p><strong>Q:</strong> Explain put-call parity</p>

<p><strong>A:</strong> With a continuous dividend yield \(y\): \(c-p=Se^{-y\tau}-Ke^{-r\tau}\). With discrete dividends of present value \(D\): \(c-p=(S-D)-Ke^{-r\tau}\). The two forms agree when \(S-D=Se^{-y\tau}\), i.e. when the yield and the discrete dividends are equivalent.</p>

<p><strong>Q:</strong> Why should you never exercise an American call on a non-dividend-paying stock before maturity?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> A European put option on a non-dividend-paying stock with strike price \(\$ 80\) is currently priced at \(\$ 8\) and a put option on the same stock with strike price \(\$ 90\) is priced at \(\$ 9\). Is there an arbitrage opportunity in these two options?</p>

<p><strong>Q:</strong> What are return on risk and the Sharpe ratio?</p>

<p><strong>A:</strong> \(\dfrac{r}{\sigma}\) and \(\dfrac{r-r_f}{\sigma}\)</p>

<p><strong>Q:</strong> Derive Black-Scholes-Merton differential equation</p>

<p><strong>A:</strong> Suppose the stock price \(S\) follows a geometric Brownian motion (so \(S\) is log-normal)</p>

\[dS=\mu Sdt+\sigma SdW\]

<p>and \(V=V(S,t)\) is a derivative, then by Itô’s lemma</p>

\[dV=\left(\frac{\partial V}{\partial t}+\mu S\frac{\partial V}{\partial S}+\frac{1}{2}\sigma^2S^2\frac{\partial^2 V}{\partial S^2}\right)dt+\sigma S\frac{\partial V}{\partial S}dW\]

<p>Consider portfolio \(\Pi=V-\frac{\partial V}{\partial S}S\), then</p>

\[d\Pi=dV-\frac{\partial V}{\partial S}dS=\left(\frac{\partial V}{\partial t}+\sigma^2S^2\frac{1}{2}\frac{\partial^2 V}{\partial S^2}\right)dt\]

<p>The diffusion term has cancelled, so this portfolio is instantaneously riskless and must earn the risk-free rate: \(d\Pi=r\left(V-\dfrac{\partial V}{\partial S}S\right)dt\). Therefore</p>

\[\frac{\partial V}{\partial t}+rS\frac{\partial V}{\partial S}+\frac{1}{2}\sigma^2S^2\frac{\partial^2 V}{\partial S^2}=rV\]

<p>This is the Black-Scholes-Merton differential equation.</p>

<p><strong>Q:</strong> Explain Feynman-Kac theorem (see <a href="https://math.nyu.edu/~kohn/pde.finance/2015/section1.pdf">https://math.nyu.edu/~kohn/pde.finance/2015/section1.pdf</a>)</p>

<p><strong>A:</strong> Suppose</p>

\[dX_t=\mu(X_t,t)dt+\sigma(X_t,t)dW_t\]

<p>And</p>

\[u(x,t)=\mathbb E\left[e^{-\int_t^Tr(X_\tau,\tau)d\tau}\phi(X_T)+\int_t^Te^{-\int_t^\tau r(X_s,s)ds}f(X_\tau,\tau)\middle|X_t=x\right]\]

<p>Where</p>

<ul>
  <li>\(\mu\) is the mean return rate</li>
  <li>\(\sigma\) is the volatility</li>
  <li>\(r\) is the interest rate</li>
  <li>\(X_t\) is the underlying state process (e.g. the asset price) at time \(t\)</li>
  <li>\(W_t\) is a Wiener process</li>
  <li>\(f\) is the running payoff</li>
  <li>\(\phi\) is the final time payoff</li>
</ul>

<p>Then \(u\) solves</p>

\[\frac{\partial u}{\partial t}(x,t)+\mu(x,t)\frac{\partial u}{\partial x}(x,t)+\frac{1}{2}\sigma(x,t)^2\frac{\partial^2 u}{\partial x^2}(x,t)-r(x,t)u(x,t)+f(x,t)=0,\quad u(x,T)=\phi(x)\]

<p><strong>Q:</strong> Explain the solution to Black-Scholes-Merton equation (see <a href="https://www.math.cmu.edu/~gautam/sj/teaching/2016-17/944-scalc-finance1/pdfs/ch4-rnm.pdf">https://www.math.cmu.edu/~gautam/sj/teaching/2016-17/944-scalc-finance1/pdfs/ch4-rnm.pdf</a>)</p>

<p><strong>A:</strong> The Black-Scholes-Merton prices of a European call and put are</p>

\[c = Se^{-y\tau}N(d_+) - Ke^{-r\tau}N(d_-),\quad p = Ke^{-r\tau}N(-d_-) - Se^{-y\tau}N(-d_+)\]

<p>Where \(d_{\pm}=\dfrac{\ln(\frac{S}{K})+(r-y\pm\frac{\sigma^2}{2})\tau}{\sigma\sqrt\tau}\)</p>
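<p>A direct transcription of these formulas (a sketch using only the standard library; \(y\) is the dividend yield):</p>

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_prices(S, K, tau, sigma, r, y=0.0):
    """European call and put under Black-Scholes-Merton."""
    sqrt_tau = math.sqrt(tau)
    d_plus = (math.log(S / K) + (r - y + 0.5 * sigma ** 2) * tau) / (sigma * sqrt_tau)
    d_minus = d_plus - sigma * sqrt_tau
    c = S * math.exp(-y * tau) * norm_cdf(d_plus) - K * math.exp(-r * tau) * norm_cdf(d_minus)
    p = K * math.exp(-r * tau) * norm_cdf(-d_minus) - S * math.exp(-y * tau) * norm_cdf(-d_plus)
    return c, p
```

<p>Put-call parity \(c-p=Se^{-y\tau}-Ke^{-r\tau}\) holds by construction, which makes a convenient unit test.</p>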

<p><strong>Q:</strong> Explain the Greek letters</p>

<p><strong>A:</strong></p>

<ul>
  <li>Delta \(\Delta\): partial derivative with respect to \(S\)
\(\frac{\partial c}{\partial S}=e^{-y\tau}N(d_+),\quad\frac{\partial p}{\partial S}=-e^{-y\tau}N(-d_+)\)</li>
  <li>Gamma \(\Gamma\): second partial derivative with respect to \(S\)
\(\frac{\partial^2 c}{\partial S^2}=\frac{\partial^2 p}{\partial S^2}=\frac{e^{-y\tau}N'(d_+)}{S\sigma\sqrt\tau}\)</li>
  <li>Theta \(\Theta\): partial derivative with respect to \(t\)
\(\frac{\partial c}{\partial t}=-\frac{\partial c}{\partial\tau}=-\frac{Se^{-y\tau}N'(d_+)\sigma}{2\sqrt\tau}+ySe^{-y\tau}N(d_+)-rKe^{-r\tau}N(d_-)\), and for the put
\(\frac{\partial p}{\partial t}=-\frac{Se^{-y\tau}N'(d_+)\sigma}{2\sqrt\tau}-ySe^{-y\tau}N(-d_+)+rKe^{-r\tau}N(-d_-)\)</li>
  <li>Vega \(\nu\): partial derivative with respect to \(\sigma\)
\(\frac{\partial c}{\partial \sigma}=\frac{\partial p}{\partial \sigma}=Se^{-y\tau}N'(d_+)\sqrt\tau\)</li>
  <li>Rho \(\rho\): partial derivative with respect to \(r\)
\(\frac{\partial c}{\partial r}=K\tau e^{-r\tau}N(d_-),\quad\frac{\partial p}{\partial r}=-K\tau e^{-r\tau}N(-d_-)\)</li>
</ul>

<p>We need the following straightforward yet useful identity</p>

\[Se^{-y\tau}N'(d_+) = Ke^{-r\tau}N'(d_-)\]
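<p>This identity is exact (take logs of both sides and use \(d_+^2-d_-^2=2\bigl(\ln\frac{S}{K}+(r-y)\tau\bigr)\)), and it is easy to confirm numerically (a sketch):</p>

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def identity_gap(S, K, tau, sigma, r, y=0.0):
    """Should be ~0:  S e^{-y tau} N'(d+)  minus  K e^{-r tau} N'(d-)."""
    d_plus = (math.log(S / K) + (r - y + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d_minus = d_plus - sigma * math.sqrt(tau)
    return S * math.exp(-y * tau) * norm_pdf(d_plus) - K * math.exp(-r * tau) * norm_pdf(d_minus)
```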

<h2 id="numeric-analysis">Numeric Analysis</h2>

<h3 id="optimization">Optimization</h3>

<p>Suppose the problem is to minimize \(f(x)\) subject to \(g_i(x)\leq 0\) and \(h_j(x)=0\).</p>

<p>Consider the Lagrange function</p>

\[L(x,\lambda_i, \mu_j) = f(x) + \sum_i\lambda_ig_i(x) + \sum_j\mu_jh_j(x)\]

<p>For a feasible \(x\), \(\sup_{\lambda_i\geq0,\,\mu_j}L(x,\lambda_i,\mu_j)=f(x)\) (and the supremum is \(+\infty\) at infeasible points), so a minimizer of \(f\) coincides with a minimizer of</p>

\[\max_{\lambda_i\geq0,\,\mu_j}L(x,\lambda_i, \mu_j)\]

<h3 id="linear-optimization">Linear optimization</h3>

<p>Consider maximizing \(\mathbf c^T\mathbf x\) subject to \(A\mathbf x=\mathbf b\), \(\mathbf x\geq\mathbf0\). Introducing a free multiplier \(\mathbf y\) for the equality constraint,</p>

\[\max_{\mathbf x\geq \mathbf 0}\min_{\mathbf y}\;\mathbf c^T\mathbf x - \mathbf y^T(A\mathbf x-\mathbf b)
= \min_{\mathbf y}\max_{\mathbf x\geq \mathbf 0}\;\mathbf x^T(\mathbf c-A^T\mathbf y)+\mathbf b^T\mathbf y\]

<p>The inner maximum on the right is finite only when \(A^T\mathbf y\geq\mathbf c\) (otherwise some coordinate of \(\mathbf x\) can be sent to \(+\infty\)), so the dual problem is minimizing \(\mathbf b^T\mathbf y\) subject to \(A^T\mathbf y\geq\mathbf c\)</p>
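<p>A numeric illustration of strong duality on a tiny example (a sketch; it assumes <code>scipy</code> is available, and since <code>linprog</code> minimizes, the primal objective is negated):</p>

```python
import numpy as np
from scipy.optimize import linprog

# Primal: max 3x1 + 2x2  s.t.  x1 + x2 = 4, x >= 0   (optimum 12 at x = (4, 0))
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([4.0])
primal = linprog(-c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2)

# Dual: min 4y  s.t.  y >= 3, y >= 2  (y free; optimum 12 at y = 3)
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(None, None)])

primal_value = -primal.fun
dual_value = dual.fun
```

<p>Both optimal values equal 12, as strong duality predicts.</p>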

<h2 id="coding">Coding</h2>

<h3 id="master-theorem-for-divide-and-conquer-recurrences">Master theorem for divide-and-conquer recurrences</h3>

<p>Suppose a problem with \(n\) inputs can be split into \(a\) subproblems with \(n/b\) inputs each; the running time then satisfies \(T(n)=aT(n/b)+f(n)\), where \(a\geq1\), \(b&gt;1\), \(f(n)\geq0\).</p>

<ul>
  <li>If \(f(n)=O(n^{\log_ba-\epsilon})\) for some \(\epsilon&gt;0\), then \(T(n)=\Theta(n^{\log_ba})\)</li>
  <li>If \(f(n)=\Theta(n^{\log_ba}\log^kn)\) for some \(k\geq0\), then \(T(n)=\Theta(n^{\log_ba}\log^{k+1}n)\)</li>
  <li>If \(f(n)=\Omega(n^{\log_ba+\epsilon})\) for some \(\epsilon&gt;0\), and \(af(n/b)\leq cf(n)\) for some \(c&lt;1\) and all large \(n\), then \(T(n)=\Theta(f(n))\)</li>
</ul>

<h3 id="binary-search">Binary search</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bisect_left(a, x, l=0, r=None):
    # `r=len(a)` is not a valid default: `a` is not in scope at def time
    if r is None: r = len(a)
    while l &lt; r:
        m = (l + r) // 2
        if a[m] &lt; x: l = m + 1
        else: r = m
    return l

def bisect_right(a, x, l=0, r=None):
    if r is None: r = len(a)
    while l &lt; r:
        m = (l + r) // 2
        if x &lt; a[m]: r = m
        else: l = m + 1
    return l
</code></pre></div></div>

<p>Binary search without knowing the size of the array.</p>

\[[1], [2,3], [4,7],[8,15],\cdots,[2^k,2^{k+1}-1]\]
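<p>Concretely: probe indices \(2^k-1\) until the value there is at least \(x\) (treating out-of-range reads as \(+\infty\)), then bisect inside the last bracket. A sketch against an assumed accessor <code>get(i)</code>:</p>

```python
def search_unbounded(get, x):
    """bisect_left over an array of unknown size, accessed only via get(i);
    get(i) is assumed to return float('inf') past the end."""
    l, r = 0, 1
    while get(r - 1) < x:     # grow the bracket [2^k, 2^{k+1})
        l, r = r, 2 * r
    while l < r:              # ordinary bisect_left inside the bracket
        m = (l + r) // 2
        if get(m) < x:
            l = m + 1
        else:
            r = m
    return l
```

<p>The doubling phase costs \(O(\log i)\) probes, where \(i\) is the answer’s position, so the total remains logarithmic.</p>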

<h3 id="sorting">Sorting</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge_sort(a, l=0, r=None):
    # `r=len(a)` is not a valid default: `a` is not in scope at def time
    if r is None: r = len(a)
    if l + 1 &gt;= r: return
    m = (l + r) // 2
    merge_sort(a, l, m)
    merge_sort(a, m, r)
    j, k, A = l, m, []
    while j &lt; m and k &lt; r:
        if a[j] &lt; a[k]:
            A.append(a[j]); j += 1
        else:
            A.append(a[k]); k += 1
    a[l:r] = A + a[j:m] + a[k:r]
</code></pre></div></div>

<h3 id="heap-priority-queue">Heap (Priority queue)</h3>

<p>A (min-)heap is an array \(a\) of length \(n\) such that \(a_k\leq a_{2k+1},a_{2k+2}\) (0-indexed).</p>

<ul>
  <li>the left and right children of \(a_k\) are \(a_{2k+1}\) and \(a_{2k+2}\)</li>
  <li>the parent of \(a_k\) is \(a_{\lfloor(k-1)/2\rfloor}\)</li>
  <li>the leaves are \(a_{\lfloor n/2\rfloor},\cdots,a_{n-1}\)</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sift_down</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span><span class="p">:</span>
    <span class="s">'''
    Sift down the element at p, and return it
    '''</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">n</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="mi">2</span><span class="o">*</span><span class="n">p</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">p</span><span class="o">+</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span>
        <span class="k">if</span> <span class="n">l</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">a</span><span class="p">[</span><span class="n">l</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span> <span class="n">i</span> <span class="o">=</span> <span class="n">l</span>
        <span class="k">if</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">a</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span> <span class="n">i</span> <span class="o">=</span> <span class="n">r</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="n">p</span><span class="p">:</span> <span class="k">break</span>
        <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">sift_up</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
    <span class="s">'''
    Sift up the element at p, and return it
    '''</span>
    <span class="k">while</span> <span class="n">p</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">n</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="p">(</span><span class="n">p</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span>
        <span class="k">if</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]:</span> <span class="k">break</span>
        <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">heapify</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">):</span>
        <span class="n">sift_down</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">heappop</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="n">top</span><span class="p">,</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">a</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="n">sift_down</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">top</span>

<span class="k">def</span> <span class="nf">heappush</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">a</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">sift_up</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
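<p>These operations are also provided by Python&#8217;s standard-library <code class="language-plaintext highlighter-rouge">heapq</code> module (a min-heap over a plain list), which can serve as a reference implementation:</p>

```python
import heapq

a = [5, 3, 8, 1, 9, 2]
heapq.heapify(a)             # a now satisfies a[k] <= a[2k+1], a[2k+2]
heapq.heappush(a, 0)         # append, then sift up
smallest = heapq.heappop(a)  # save the root, move the last element up, sift down
```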

<h3 id="bitmask">Bitmask</h3>

<p><code class="language-plaintext highlighter-rouge">(i+j) % 2</code> is the same as <code class="language-plaintext highlighter-rouge">(i ^ j) &amp; 1</code> (for single bits <code class="language-plaintext highlighter-rouge">i</code>, <code class="language-plaintext highlighter-rouge">j</code> this is just <code class="language-plaintext highlighter-rouge">i ^ j</code>)</p>

<p>Every submask (subset of the set bits) of <code class="language-plaintext highlighter-rouge">n</code> can be enumerated in decreasing order:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b</span> <span class="o">=</span> <span class="n">n</span>
<span class="k">while</span> <span class="n">b</span><span class="p">:</span>
    <span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="n">b</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">n</span>
</code></pre></div></div>
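<p>The same loop, packaged as a generator so the submasks can actually be consumed (a small sketch; the name <code class="language-plaintext highlighter-rouge">submasks</code> is illustrative):</p>

```python
def submasks(n):
    """Yield every submask of the bitmask n, from n down to 0."""
    b = n
    while b:
        yield b
        b = (b - 1) & n   # next smaller submask of n
    yield 0               # the empty subset
```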

<h3 id="random">Random</h3>

<p>Knuth shuffle</p>
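<p>A minimal sketch of the Knuth (Fisher-Yates) shuffle: walk the array backwards, swapping each element with a uniformly chosen element at or before it, so each of the \(n!\) permutations is equally likely.</p>

```python
import random

def knuth_shuffle(a):
    """In-place Fisher-Yates shuffle: uniform over all permutations."""
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)   # j uniform over 0..i inclusive
        a[i], a[j] = a[j], a[i]
    return a
```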

<h3 id="graph">Graph</h3>

<p>Union-Find</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">UnionFind</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">rank</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">components</span> <span class="o">=</span> <span class="n">n</span>
    <span class="k">def</span> <span class="nf">find</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">!=</span> <span class="n">x</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span>
    <span class="k">def</span> <span class="nf">union</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="n">rx</span><span class="p">,</span> <span class="n">ry</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">rx</span> <span class="o">==</span> <span class="n">ry</span><span class="p">:</span> <span class="k">return</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">ry</span><span class="p">]:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">=</span> <span class="n">ry</span>
        <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">ry</span><span class="p">]:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">ry</span><span class="p">]</span> <span class="o">=</span> <span class="n">rx</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">ry</span><span class="p">]</span> <span class="o">=</span> <span class="n">rx</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">components</span> <span class="o">-=</span> <span class="mi">1</span>
</code></pre></div></div>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Quant Prep]]></summary></entry><entry><title type="html">Study Note - Options, Futures and Other Derivatives (John Hull)</title><link href="https://lihaoranicefire.github.io/FinanceTheoryNotes/" rel="alternate" type="text/html" title="Study Note - Options, Futures and Other Derivatives (John Hull)" /><published>2024-08-15T00:00:00+00:00</published><updated>2024-08-15T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/FinanceTheoryNotes</id><content type="html" xml:base="https://lihaoranicefire.github.io/FinanceTheoryNotes/"><![CDATA[<p>A study note taken for the book <em>Options, Futures and Other Derivatives</em> by John Hull</p>

<h2 id="chapter-1---introduction">Chapter 1 - Introduction</h2>

<ul>
  <li>exchange/over-the-counter(OTC) market</li>
  <li>forward/spot/futures contract</li>
  <li>long/short position</li>
  <li>call/put option</li>
  <li>exercise or strike price</li>
  <li>expiration date or maturity (american/european options diff)</li>
  <li>types of traders: hedgers/speculators/arbitrageurs using futures/options</li>
</ul>

<h3 id="practice-questions">Practice questions</h3>
<p>1.29</p>

<h2 id="chapter-2---futures-markets-and-central-counterparties">Chapter 2 - Futures Markets and Central Counterparties</h2>

<h3 id="22---specification-of-a-futures-contract">2.2 - Specification of a futures contract</h3>
<ul>
  <li>closing out</li>
  <li>asset, contract size, delivery arrangement, price quote, price limit, position limit</li>
</ul>

<h3 id="24---margin-accountsexchange-market">2.4 - Margin Accounts(exchange market)</h3>
<ul>
  <li>daily settlement, variation margin, maintenance margin, margin call, clearing house</li>
</ul>

<h3 id="25---otc-markets">2.5 - OTC markets</h3>
<ul>
  <li>central counterparty(CCP), bilateral/collateral clearing</li>
</ul>

<h3 id="26---market-quotes">2.6 - Market quotes</h3>
<ul>
  <li>open/high/low/settlement price</li>
  <li>trading volume, open interest</li>
  <li>pattern of futures: normal/inverted market</li>
</ul>

<h3 id="27---delivery">2.7 - Delivery</h3>
<ul>
  <li>first/last notice day, last trading day</li>
  <li>cash settlement</li>
</ul>

<h3 id="28---types-of-traders-and-types-of-orders">2.8 - Types of traders and types of orders</h3>
<ul>
  <li>types of traders: futures commission merchants (FCM)/locals</li>
  <li>types of speculators: scalpers, day traders, position traders</li>
  <li>types of orders: market orders, limit orders, stop orders, stop-and-limit orders, market-if-touched (MIT) orders or board orders, discretionary orders or market-not-held orders</li>
  <li>types of orders: day order, time-of-day order, open order or good-till-canceled order, fill-or-kill order</li>
</ul>

<h3 id="29---regulation">2.9 - Regulation</h3>
<ul>
  <li>“corner the market”</li>
</ul>

<h3 id="210---accounting-and-tax">2.10 - Accounting and tax</h3>
<ul>
  <li>hedge accounting</li>
  <li>corporate/noncorporate taxpayer</li>
  <li>capital gain/loss, ordinary income</li>
  <li>long/short-term capital gains</li>
  <li>capital loss deduction, carry back/forward</li>
  <li>60/40 rule, hedge transaction</li>
</ul>

<h3 id="211---forward-vs-futures-contracts">2.11 - Forward vs. Futures contracts</h3>
<ul>
  <li>futures prices: US cents per unit of foreign currency</li>
  <li>forward prices: units of foreign currency per USD</li>
</ul>

<h2 id="chapter-3---hedging-strategies-using-futures">Chapter 3 - Hedging strategies using futures</h2>

<h3 id="31---basic-principles">3.1 - Basic principles</h3>
<ul>
  <li>short/long hedge</li>
</ul>

<h3 id="32---arguments-against-hedging">3.2 - Arguments against hedging</h3>
<ul>
  <li>shareholders can hedge themselves</li>
  <li>if hedging is not the norm in the industry, hedging may increase rather than reduce the fluctuation of profit margins</li>
  <li>hedging can offset a gain elsewhere, leading to a worse outcome</li>
</ul>

<h3 id="33---basis-risk">3.3 - Basis risk</h3>
<ul>
  <li>Basis = Spot price of asset to be hedged - Futures price of contract used</li>
</ul>

<h3 id="34---cross-hedging">3.4 - Cross hedging</h3>
<ul>
  <li>cross hedging</li>
  <li>hedge ratio, minimum variance hedge ratio, hedge effectiveness, optimal number of contracts</li>
  <li>tailing the hedge</li>
</ul>

<h3 id="35---stock-index-futures">3.5 - Stock index futures</h3>
<ul>
  <li>stock index</li>
  <li>Dow Jones Industrial Average</li>
  <li>Standard &amp; Poor’s 500 (S&amp;P500)</li>
  <li>Nasdaq-100</li>
</ul>

<h3 id="36---stack-and-roll">3.6 - Stack and roll</h3>

<h3 id="appendix---capital-asset-pricing-model-capm">Appendix - Capital asset pricing model (CAPM)</h3>
<ul>
  <li>systematic/nonsystematic risk</li>
  <li>Expected return on asset = \(R_F+\beta(R_M-R_F)\), \(R_M\) is the return on the market, \(R_F\) is the return on a risk-free investment, \(\beta\) is a parameter measuring systematic risk</li>
  <li>assumptions for CAPM</li>
</ul>
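<p>The CAPM formula above translates directly into code (a one-line sketch; the function name is illustrative):</p>

```python
def capm_expected_return(r_f, beta, r_m):
    """Expected return on an asset: R_F + beta * (R_M - R_F)."""
    return r_f + beta * (r_m - r_f)
```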

<h2 id="chapter-4---interest-rates">Chapter 4 - Interest rates</h2>

<h3 id="41---types-of-rates">4.1 - Types of rates</h3>
<ul>
  <li>credit risk, credit spread</li>
  <li>treasury rates</li>
  <li>overnight rates: (effective) federal funds rate, SONIA, ESTER, SARON, TONAR</li>
  <li>repo rates (very slightly below fed funds rate, but secured): overnight repo, term repos, SOFR</li>
</ul>

<h3 id="42---reference-rates">4.2 - Reference rates</h3>
<ul>
  <li>LIBOR</li>
  <li>The new reference rates: SOFR in the US, overnight rates in other countries</li>
  <li>Longer rates are determined from overnight rates by compounding them daily. The (annualized) interest rate for a period of \(D\) days is
  \(\left[\left(1+r_1\frac{d_1}{360}\right)\cdots\left(1+r_n\frac{d_n}{360}\right)-1\right]\times\frac{360}{D}\)</li>
  <li>the new reference rates are essentially risk-free, so they face the problem of incorporating a credit spread</li>
</ul>
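<p>A sketch of the daily-compounding formula above (actual/360, as in the formula; <code class="language-plaintext highlighter-rouge">segments</code> is a list of <code class="language-plaintext highlighter-rouge">(rate, days)</code> pairs covering the period):</p>

```python
def compounded_reference_rate(segments):
    """Annualized rate over the whole period from per-segment overnight rates."""
    growth, D = 1.0, 0
    for r, d in segments:
        growth *= 1 + r * d / 360   # grow by each segment's simple interest
        D += d
    return (growth - 1) * 360 / D
```

A single segment recovers its own rate; splitting a period into sub-segments at the same rate yields a slightly higher annualized rate, reflecting the compounding.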

<h3 id="43---the-risk-free-rate">4.3 - The risk-free rate</h3>
<ul>
  <li>Banks don’t use treasury rates as the risk-free rates for pricing the derivatives, instead they use the overnight rates</li>
</ul>

<h3 id="44---measuring-interest-rates">4.4 - Measuring interest rates</h3>
<ul>
  <li>compounding frequency: annually/semiannually/quarterly/monthly/weekly/daily. Suppose it is \(m\), then after \(n\) years, the terminal value of an investment of \(A\) at an interest rate of \(r\) per annum is
  \(A\left(1+\frac{r}{m}\right)^{mn}\)</li>
  <li>equivalent annual interest rate
  \(A\left(1+\frac{r_1}{m_1}\right)^{m_1n}=A\left(1+\frac{r_2}{m_2}\right)^{m_2n}\Rightarrow r_2=m_2\left[\left(1+\frac{r_1}{m_1}\right)^{m_1/m_2}-1\right]\)</li>
  <li>continuous compounding: \(Ae^{rn}\)</li>
  <li>
\[Ae^{r_en}=A\left(1+\frac{r}{m}\right)^{mn}\Rightarrow r_e=m\ln\left(1+\frac{r}{m}\right),\quad r=m(e^{r_e/m}-1)\]
  </li>
</ul>
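<p>The equivalences above, as conversion helpers (a sketch; rates are per annum):</p>

```python
import math

def to_continuous(r, m):
    """Continuous rate equivalent to r compounded m times per year: m*ln(1 + r/m)."""
    return m * math.log(1 + r / m)

def from_continuous(rc, m):
    """m-times-per-year rate equivalent to continuous rate rc: m*(e^(rc/m) - 1)."""
    return m * (math.exp(rc / m) - 1)
```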

<h3 id="45---zero-rates">4.5 - Zero rates</h3>
<ul>
  <li>bond with coupon</li>
  <li>\(n\)-year zero-coupon/spot/zero rate: no intermediate payments</li>
</ul>

<h3 id="46---bond-pricing">4.6 - Bond pricing</h3>
<ul>
  <li>principal (also par value or face value) \(A\). Suppose the (annualized, continuously compounded) zero rates for the \(mn\) payment dates are \(r_1,\cdots,r_{mn}\) and the coupon rate is \(c\), paid \(m\) times per year; the bond price is
  \(B=A\left[\sum_{i=1}^{mn}\frac{c}{m}e^{-r_i\,i/m}+e^{-r_{mn}n}\right]\)</li>
  <li>bond yield: the single discount rate \(r\) such that
  \(A\left[\sum_{i=1}^{mn}\frac{c}{m}e^{-r\,i/m}+e^{-rn}\right]=B\)</li>
  <li>par yield: the coupon rate \(c\) such that
  \(A\left[\sum_{i=1}^{mn}\frac{c}{m}e^{-r_i\,i/m}+e^{-r_{mn}n}\right]=A\)</li>
</ul>
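<p>A sketch of bond pricing off a zero curve, plus the bond yield found by bisection (dependency-free; names and the bracketing interval are illustrative):</p>

```python
import math

def bond_price(face, coupon_rate, m, zero_rates):
    """Discount each cash flow at its zero rate.
    zero_rates[i-1] is the continuously compounded rate for time i/m years."""
    price = 0.0
    periods = len(zero_rates)
    for i, r in enumerate(zero_rates, start=1):
        cf = face * coupon_rate / m
        if i == periods:          # final payment includes the principal
            cf += face
        price += cf * math.exp(-r * i / m)
    return price

def bond_yield(face, coupon_rate, m, periods, price, lo=-0.5, hi=1.0):
    """Bisect for the flat continuously compounded rate that reprices the bond."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if bond_price(face, coupon_rate, m, [mid] * periods) > price:
            lo = mid              # price decreases as the rate increases
        else:
            hi = mid
    return (lo + hi) / 2
```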

<h3 id="47---determining-zero-rates">4.7 - Determining zero rates</h3>
<ul>
  <li>bootstrap method</li>
  <li>zero curve</li>
</ul>

<h3 id="48---forward-rates">4.8 - Forward rates</h3>
<ul>
  <li>Suppose \(r_1\), \(r_2\) are the zero rates for maturities \(t_1\), \(t_2\), and \(r_f\) is the forward interest rate for the period of time between \(t_1\) and \(t_2\), then
  \(Ae^{r_1t_1}e^{r_f(t_2-t_1)}=Ae^{r_2t_2}\Rightarrow r_f=\frac{r_2t_2-r_1t_1}{t_2-t_1}=r_2+(r_2-r_1)\frac{t_1}{t_2-t_1}\)</li>
  <li>instantaneous forward rate for a maturity of \(t\)
  \(r_f=r+t\frac{\partial r}{\partial t}\)</li>
</ul>
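<p>The forward-rate formula above as a one-line function (continuously compounded zero rates assumed):</p>

```python
def forward_rate(r1, t1, r2, t2):
    """Forward rate between t1 and t2 implied by zero rates r1, r2 (t2 > t1)."""
    return (r2 * t2 - r1 * t1) / (t2 - t1)
```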

<h3 id="49---forward-rate-agreements-fra">4.9 - Forward rate agreements (FRA)</h3>

<h3 id="410---duration">4.10 - Duration</h3>
<ul>
  <li>Suppose the bond provides the holder with cash flows \(c_i\) at time \(t_i\) (\(1\leq i\leq n\)), and bond yield is \(y\) (continuously compounded), then bond price
  \(B=\sum_{i=1}^nc_ie^{-yt_i}\)</li>
  <li>The duration is the weighted average of the cash-flow times
  \(D=\sum_{i=1}^nt_i\frac{c_ie^{-yt_i}}{B}\)</li>
  <li>For a small change \(\Delta y\) in the yield,
  \(\Delta B=\frac{dB}{dy}\Delta y=-\Delta y\sum_{i=1}^nt_ic_ie^{-yt_i}=-BD\Delta y\Rightarrow\frac{\Delta B}{B}=-D\Delta y\)
  which describes the percentage change in the bond price</li>
  <li>If \(y\) is expressed with a compounding frequency of \(m\) times per year, then
  \(B=\sum_{i=1}^nc_i\left(1+\frac{y}{m}\right)^{-mt_i},\quad D=\sum_{i=1}^nt_i\frac{c_i\left(1+\frac{y}{m}\right)^{-mt_i}}{B},\quad \frac{\Delta B}{B}=-\frac{D\Delta y}{1+y/m}\)</li>
  <li>modified duration \(D^*=\frac{D}{1+y/m}\) so that \(\frac{\Delta B}{B}=-D^*\Delta y\)</li>
  <li>dollar duration: \(D_{\$}=BD^*\) is product of modified duration and bond price so that \(\Delta B=-D_{\$}\Delta y\)</li>
  <li>The duration of a bond portfolio can be defined as a weighted average of the durations of the individual bonds, with weights being proportional to the value of the bond prices.</li>
</ul>

<h3 id="411---convexity">4.11 - Convexity</h3>
<ul>
  <li>The duration relationship applies only to small changes in yields</li>
  <li>convexity
  \(C=\frac{1}{B}\frac{d^2B}{dy^2}=\frac{\sum_{i=1}^nc_it_i^2e^{-yt_i}}{B}\)
  Then
  \(\Delta B=\frac{dB}{dy}\Delta y+\frac{1}{2}\frac{d^2B}{dy^2}\Delta y^2\Rightarrow \frac{\Delta B}{B}=-D\Delta y+\frac{C}{2}\Delta y^2\)</li>
</ul>
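<p>The duration and convexity formulas above can be computed together in one pass over the cash flows (a sketch assuming a continuously compounded yield; the function name is illustrative):</p>

```python
import math

def price_duration_convexity(cashflows, y):
    """cashflows: list of (t_i, c_i) pairs; y: continuously compounded yield.
    Returns (B, D, C) per the definitions above."""
    B = sum(c * math.exp(-y * t) for t, c in cashflows)
    D = sum(t * c * math.exp(-y * t) for t, c in cashflows) / B
    C = sum(t * t * c * math.exp(-y * t) for t, c in cashflows) / B
    return B, D, C
```

For a single cash flow at time \(t\), this gives \(D=t\) and \(C=t^2\), as expected.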

<h3 id="412---theories-of-the-term-structure-of-interest-rates">4.12 - Theories of the term structure of interest rates</h3>
<ul>
  <li>liquidity preference theory</li>
  <li>net interest income</li>
</ul>

<h2 id="chapter-5---determination-of-forward-and-futures-prices">Chapter 5 - Determination of forward and futures prices</h2>

<h3 id="51---investment-assets-vs-consumption-assets">5.1 - Investment assets vs consumption assets</h3>
<ul>
  <li>investment assets</li>
  <li>consumption assets</li>
</ul>

<h3 id="52---short-selling">5.2 - Short selling</h3>

<h3 id="53---assumption-and-notation">5.3 - Assumption and notation</h3>
<p>Assumptions:</p>
<ol>
  <li>The market participants are subject to no transaction costs when they trade</li>
  <li>The market participants are subject to the same tax rate on all net trading profits</li>
  <li>The market participants can borrow money at the same risk-free rate of interest as
they can lend money</li>
  <li>The market participants take advantage of arbitrage opportunities as they occur</li>
</ol>

<p>Notations:</p>
<ul>
  <li>\(T\): Time until delivery date in a forward or futures contract (in years)</li>
  <li>\(S_0\) : Price of the asset underlying the forward or futures contract today</li>
  <li>\(F_0\) : Forward or futures price today</li>
  <li>\(r\): Zero-coupon risk-free rate of interest per annum, expressed with continuous compounding, for an investment maturing at the delivery date (i.e., in \(T\) years)</li>
</ul>

<h3 id="54---forward-price-for-an-investment-asset">5.4 - Forward price for an investment asset</h3>
<p>The forward price for an investment asset should be \(F_0=S_0e^{rT}\). If \(F_0&gt;S_0e^{rT}\), then</p>
<ol>
  <li>Borrow \(S_0\) dollars with an interest rate of \(r\) for \(T\) years</li>
  <li>Buy 1 unit of the asset</li>
  <li>Short a forward contract of 1 unit of asset that delivers in \(T\) years</li>
</ol>

<p>The net gain will be \(F_0-S_0e^{rT}\). If \(F_0&lt;S_0e^{rT}\), then</p>
<ol>
  <li>Short sale 1 unit of asset for \(S_0\) dollars</li>
  <li>Invest the proceeds at an interest rate of \(r\) for \(T\) years</li>
  <li>Enter a forward contract of 1 unit of asset that delivers in \(T\) years</li>
</ol>

<p>The net gain will be \(S_0e^{rT}-F_0\). Even if short selling is not possible, investors who hold the asset purely as an investment can sell it and enter the long forward contract, enforcing the same bound</p>

<h3 id="55---known-income">5.5 - Known income</h3>
<p>Consider an investment asset that will provide a perfectly predictable income with a present value of \(I\) during the life of the forward contract. We have
\(F_0 = (S_0 - I)e^{rT}\)</p>

<h3 id="56---known-yield-">5.6 - Known yield <a name="5.6KnownYield"></a></h3>
<p>Suppose the known yield of the forward contract is \(q\), compounded continuously, to make sure there is no positive net gain from selling and reinvesting in the asset, we need \(S_0e^{rT}=e^{qT}F_0\Rightarrow F_0=S_0e^{(r-q)T}\)</p>

<h3 id="57---valuing-forward-contracts">5.7 - Valuing forward contracts</h3>
<p>Suppose in addition that \(K\) is the delivery price negotiated some time ago when the forward contract was entered, with delivery in \(T\) years, \(f\) is the value of the forward contract today (if it were to be sold), and \(F_0\) is the current forward price for delivery in \(T\) years. Then
\(f=(F_0-K)e^{-rT}\)
If the underlying asset provides a known income with present value \(I\), then
\(f=S_0-I-Ke^{-rT}\)
If the underlying asset provides a known yield \(q\), then
\(f=S_0e^{-qT}-Ke^{-rT}\)</p>
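<p>A sketch tying 5.4-5.7 together (function names are illustrative; income and yield default to zero, recovering \(F_0=S_0e^{rT}\), and the combined form \((S_0-I)e^{(r-q)T}\) reduces to each case separately):</p>

```python
import math

def forward_price(S0, r, T, income_pv=0.0, q=0.0):
    """F0 = (S0 - I) e^((r - q) T); with I = q = 0 this is S0 e^(rT)."""
    return (S0 - income_pv) * math.exp((r - q) * T)

def forward_value(F0, K, r, T):
    """Value today of a long forward with delivery price K: (F0 - K) e^(-rT)."""
    return (F0 - K) * math.exp(-r * T)
```

A newly entered forward (with \(K=F_0\)) is worth exactly zero.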

<h3 id="58---are-forward-prices-and-futures-prices-equal">5.8 - Are forward prices and futures prices equal</h3>
<ul>
  <li>if \(r\) is a known function of time, forward prices and futures prices are equal</li>
  <li>if \(S\) is strongly positively correlated with \(r\), a long futures contract is more desirable than a long forward contract</li>
  <li>if \(S\) is strongly negatively correlated with \(r\), a long forward contract is more desirable than a long futures contract</li>
</ul>

<h3 id="59---futures-prices-of-stock-indices">5.9 - Futures prices of stock indices</h3>
<ul>
  <li>index arbitrage, program trading</li>
</ul>

<h3 id="510---forward-and-futures-contracts-on-currencies">5.10 - Forward and futures contracts on currencies</h3>
<p>\(S_0\) is the spot exchange rate in domestic currency, \(r_f\) is the value of the foreign risk-free interest rate when money is invested for time \(T\). \(r\) is the domestic risk-free rate when money is invested for this period of time. The relationship between \(F_0\) and \(S_0\) is
\(F_0 = S_0e^{(r-r_f)T}\)
The equation is the same as the one for a known yield in <a href="#5.6KnownYield">5.6</a>, because a foreign currency can be regarded as an investment asset paying a known yield. The yield is the risk-free rate of interest in the foreign currency</p>

<h3 id="511---futures-on-commodities">5.11 - Futures on commodities</h3>
<p>Storage costs can be treated as negative income. If \(U\) is the present value of all storage costs, then
\(F_0 = (S_0 + U) e^{rT}\)
If the storage costs incurred at any time are proportional to the price of the commodity, they can be treated as a negative yield, so
\(F_0 = S_0 e^{(r+u)T}\)
Consumption commodities usually provide no income, but are subject to significant storage costs.</p>

<h4 id="convenience-yield">Convenience yield</h4>
<p>\(F_0e^{yT} = (S_0+U)e^{rT}\)
or
\(F_0e^{yT} = S_0e^{(r+u)T}\)
If shortages are more likely to occur, or if inventories are low, the convenience yield is higher</p>

<h3 id="512---the-cost-of-carry">5.12 - The cost of carry</h3>
<p>The cost of carry \(c\) is</p>
<ul>
  <li>\(r\) for a non-dividend-paying stock</li>
  <li>\(r-q\) for a stock with dividend yield rate \(q\)</li>
  <li>\(r-r_f\) for a currency</li>
  <li>\(r-q+u\) for a commodity that provides income at rate \(q\) and requires storage costs at rate \(u\)</li>
</ul>

<p>If \(y\) is the convenience yield rate, we then have
\(F_0=S_0e^{(c-y)T}\)
For a futures contract, the party with the short position can choose to deliver at any time in a certain period (giving a few days’ notice of intention to deliver). If \(c&gt;y\), the party with the short position will deliver as early as possible; if \(c&lt;y\), as late as possible</p>

<h3 id="514---futures-prices-and-expected-future-spot-prices">5.14 - Futures prices and expected future spot prices</h3>
<h4 id="keynes-and-hicks-argument">Keynes and Hicks’ argument</h4>
<ul>
  <li>Expected future spot price</li>
  <li>If hedgers hold short positions and speculators hold long positions, the futures prices will be above the expected spot prices</li>
  <li>If hedgers hold long positions and speculators hold short positions, the futures prices will be below the expected spot prices</li>
</ul>

<h4 id="risk-and-return">Risk and return</h4>
<p>The modern approach is based on the relationship between risk and expected return in the economy. Suppose \(k\) is an investor’s required rate of return on the investment. Then the present value of this investment is
\(-F_0e^{-rT}+\mathbb E(S_T)e^{-kT}\)
Setting this to zero gives \(F_0 = \mathbb E(S_T)e^{(r-k)T}\). If the return is</p>
<ul>
  <li>uncorrelated with the stock market, then \(k=r\Rightarrow F_0=\mathbb E(S_T)\)</li>
  <li>positively correlated with the stock market, then \(k&gt;r\Rightarrow F_0&lt;\mathbb E(S_T)\). This is known as normal backwardation</li>
  <li>negatively correlated with the stock market, then \(k&lt;r\Rightarrow F_0&gt;\mathbb E(S_T)\). This is known as contango</li>
</ul>

<h2 id="chapter-6---interest-rate-futures">Chapter 6 - Interest rate futures</h2>

<h3 id="61---day-count-and-quotation-conventions">6.1 - Day count and quotation conventions</h3>
<p>Day count between dates is defined as
\(\frac{\text{Number of days between dates}}{\text{Number of days in reference period}}\)
Three day count conventions commonly used in the US are</p>
<ul>
  <li>Actual / actual (in period)</li>
  <li>30 / 360</li>
  <li>Actual / 360</li>
</ul>
<p>The interest earned between dates is
\(\text{Day count}\times\text{Interest earned in reference period}\)</p>

<h4 id="price-quotations-of-us-treasury-bills">Price Quotations of U.S. Treasury Bills</h4>
<p>Suppose the face value of a Treasury bill is 100, \(P\) is the quoted price (an annualized discount rate of \(P\%\)), \(n\) is the remaining life measured in calendar days, and \(Y\) is the cash price (present value), then
\(Y=100-100\times\frac{P}{100}\times\frac{n}{360}\Leftrightarrow P=\frac{360}{n}(100-Y)\)</p>

<h4 id="price-quotations-of-us-treasury-bonds">Price Quotations of U.S. Treasury Bonds</h4>
<p>Treasury bond prices in the United States are quoted in dollars and thirty-seconds of a dollar. For example, 120-15 means \(120\frac{15}{32}\). We have
\(\text{cash price (or dirty price)} = \text{quoted price (or clean price)} + \text{accrued interest since last coupon date}\)
The interest is accrued using the face value and the day count</p>

<h3 id="62---treasury-bond-futures">6.2 - Treasury bond futures</h3>

<h2 id="chapter-7---swaps">Chapter 7 - Swaps</h2>

<h2 id="chapter-8---securitization-and-the-financial-crisis-of-2007-8">Chapter 8 - Securitization and the financial crisis of 2007-8</h2>

<h2 id="chapter-9---xvas">Chapter 9 - XVAs</h2>

<h2 id="chapter-10---mechanics-of-options-markets">Chapter 10 - Mechanics of options markets</h2>

<h3 id="102---option-positions">10.2 - Option positions</h3>
<p>Suppose the strike price is \(K\) and the terminal price is \(S_T\), the payoff for european option is</p>
<ul>
  <li>long call \(\max(S_T-K,0)\)</li>
  <li>short call \(-\max(S_T-K,0)\)</li>
  <li>long put \(\max(K-S_T,0)\)</li>
  <li>short put \(-\max(K-S_T,0)\)</li>
</ul>
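<p>The four payoffs in code (the short positions are just the negatives of the long ones):</p>

```python
def call_payoff(S_T, K):
    """Long European call payoff at expiry: max(S_T - K, 0)."""
    return max(S_T - K, 0.0)

def put_payoff(S_T, K):
    """Long European put payoff at expiry: max(K - S_T, 0)."""
    return max(K - S_T, 0.0)

# At expiry, long call minus long put equals S_T - K for any S_T.
```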

<h3 id="103---underlying-assets">10.3 - Underlying assets</h3>
<p>options on stock/ETP/currency/stock index/futures</p>

<h3 id="104---specification-of-stock-options">10.4 - Specification of stock options</h3>
<ul>
  <li>expiration dates</li>
  <li>strike prices</li>
  <li>option class</li>
  <li>option series</li>
  <li>
    <table>
      <thead>
        <tr>
          <th> </th>
          <th>in the money</th>
          <th>at the money</th>
          <th>out of the money</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>call</td>
          <td>\(S&gt;K\)</td>
          <td>\(S=K\)</td>
          <td>\(S&lt;K\)</td>
        </tr>
        <tr>
          <td>put</td>
          <td>\(S&lt;K\)</td>
          <td>\(S=K\)</td>
          <td>\(S&gt;K\)</td>
        </tr>
      </tbody>
    </table>
  </li>
  <li>intrinsic/time value</li>
  <li>FLEX options</li>
  <li>dividends effect on strike price</li>
  <li>\(n\)-for-\(m\) stock split, its effect on strike price</li>
  <li>position/exercise limits</li>
</ul>

<h3 id="105---trading">10.5 - Trading</h3>
<ul>
  <li>market makers</li>
  <li>bid-ask spread</li>
</ul>

<h3 id="107---margin-requirements">10.7 - Margin requirements</h3>
<ul>
  <li>naked options</li>
</ul>

<h3 id="108---the-options-clearing-coroperation-ooc">10.8 - The Options Clearing Corporation (OCC)</h3>

<h3 id="109---regulation">10.9 - Regulation</h3>

<h3 id="1010---taxation">10.10 - Taxation</h3>
<p>Treat as capital gain/loss</p>
<ul>
  <li>wash sale rule</li>
  <li>constructive sales</li>
</ul>

<h3 id="1011---warrants-employee-stock-options-and-convertibles">10.11 - Warrants, employee stock options, and convertibles</h3>
<ul>
  <li>warrants</li>
  <li>employee stock options</li>
  <li>convertibles (bonds)</li>
  <li>exotic option</li>
</ul>

<h2 id="chapter-11---properties-of-stock-options">Chapter 11 - Properties of stock options</h2>

<h3 id="111---factors-affecting-option-prices">11.1 - Factors affecting option prices</h3>
<ul>
  <li>current stock price \(S_0\)</li>
  <li>strike price \(K\)</li>
  <li>maturity date \(T\)</li>
  <li>volatility \(\sigma\)</li>
  <li>risk-free rate \(r\)</li>
  <li>dividends</li>
  <li>American calls and puts become more valuable as \(T\) increases; this is not necessarily true for European options, since a dividend may fall within the longer life</li>
  <li>options are more valuable as \(\sigma\) increases, because the benefits are limitless whereas the loss is at most the cost of the option</li>
  <li>If \(r\) increases, the expected return demanded on the stock increases, while the present value of any future cash flow decreases. The combined impact increases the value of call options and decreases the value of put options</li>
  <li>if the ex-dividend date is in the life of a call/put option, the value of the option is negatively/positively related to the size of the dividend</li>
</ul>

<h3 id="112---assumputions-and-notation">11.2 - Assumptions and notation</h3>
<p>Assumptions:</p>
<ol>
  <li>There are no transaction costs</li>
  <li>All trading profits (net of trading losses) are subject to the same tax rate</li>
  <li>Borrowing and lending are possible at the risk-free interest rate</li>
</ol>

<p>Notations:</p>
<ul>
  <li>\(S_T\): Stock price on the expiration date \(T\)</li>
  <li>\(C\): Value of American call option to buy one share</li>
  <li>\(P\): Value of American put option to sell one share</li>
  <li>\(c\): Value of European call option to buy one share</li>
  <li>\(p\): Value of European put option to sell one share</li>
</ul>

<h3 id="113---upper-and-lower-bounds-for-option-prices">11.3 - Upper and lower bounds for option prices</h3>

<h4 id="upper-bounds">Upper bounds</h4>
<ul>
  <li>
\[c, C\leq S_0\]
  </li>
  <li>\(P\leq K\), \(p\leq Ke^{-rT}\)</li>
</ul>

<h4 id="lower-bound-for-european-callsputs-on-non-dividend-paying-stocks">Lower bound for european calls/puts on non-dividend-paying stocks</h4>
<ul>
  <li>\(c\geq \max(S_0-Ke^{-rT},0)\), \(p\geq\max(Ke^{-rT}-S_0,0)\)</li>
</ul>

<h3 id="114---put-call-parity">11.4 - Put-call parity</h3>
<p>Consider two portfolios:</p>
<ul>
  <li>1 european call(no dividends) + 1 bond(0-coupon) with payoff of \(K\) at \(T\)</li>
  <li>1 european put(no dividends) + 1 share of stock</li>
</ul>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>\(S_T&gt;K\)</th>
      <th>\(S_T&lt;K\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>call + bond</td>
      <td>\((S_T-K)+K=S_T\)</td>
      <td>\(0+K=K\)</td>
    </tr>
    <tr>
      <td>put + stock</td>
      <td>\(0+S_T=S_T\)</td>
      <td>\((K-S_T)+S_T=K\)</td>
    </tr>
  </tbody>
</table>

<p>The payoff of both portfolios is \(\max(S_T,K)\), so
\(c+Ke^{-rT}=p+S_0\)</p>
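<p>The table above can be checked mechanically (a sketch with \(K=100\) and illustrative terminal prices):</p>

```python
# Verify the payoff table: both portfolios pay max(S_T, K) at maturity.
def call_plus_bond(S_T, K):
    return max(S_T - K, 0) + K        # call payoff + bond face value

def put_plus_stock(S_T, K):
    return max(K - S_T, 0) + S_T      # put payoff + one share

payoffs = [(call_plus_bond(S, 100), put_plus_stock(S, 100)) for S in (80, 100, 120)]
```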

<h3 id="115---calls-on-a-non-dividend-paying-stock">11.5 - Calls on a non-dividend-paying stock</h3>
<p>It is never optimal to exercise an American call on a non-dividend-paying stock early, so \(c=C\)</p>

<h3 id="116---puts-on-a-non-dividend-paying-stock">11.6 - Puts on a non-dividend-paying stock</h3>
<p>It can be optimal to exercise an American put on a non-dividend-paying stock early. Early exercise becomes more attractive as \(S_0\) decreases, \(r\) increases, or \(\sigma\) decreases. In general \(P\geq\max(K-S_0,0)\)</p>

<h3 id="117---effect-of-dividends">11.7 - Effect of dividends</h3>

<p>\(c\geq\max(S_0-D-Ke^{-rT},0)\quad p\geq\max(D+Ke^{-rT}-S_0,0)\)
the put-call parity becomes
\(c+D+Ke^{-rT}=p+S_0\)
\(S_0-D-K\leq C-P\leq S_0-Ke^{-rT}\)</p>

<h3 id="practice-questions-1">Practice questions</h3>
<h4 id="q1123">Q11.23</h4>
<p>For american options with no dividends, we have
\(C+Ke^{-rT}\leq P+S_0\leq C+K\)
Which is equivalent to
\(S_0-K\leq C-P\leq S_0-Ke^{-rT}\)</p>

<h2 id="chapter-12---trading-strategies-involving-options">Chapter 12 - Trading strategies involving options</h2>

<h2 id="chapter-13---binomial-trees">Chapter 13 - Binomial trees</h2>

<h2 id="chapter-14---wiener-processes-and-itos-lemma">Chapter 14 - Wiener processes and Ito’s lemma</h2>

<h3 id="141---the-markov-property">14.1 - The markov property</h3>
<ul>
  <li>markov process is a stochastic process that only depends on the current value and time, not the history</li>
</ul>

<h3 id="142---continuous-time-stochastic-processes">14.2 - Continuous-time stochastic processes</h3>
<h4 id="wiener-process-or-brownian-motion">wiener process or brownian motion</h4>
<p>Suppose \(\mathcal N(\mu,\sigma^2)\) stands for the normal distribution of mean \(\mu\) and variance \(\sigma^2\). \(z\) is a wiener process if it has the following two properties</p>
<ul>
  <li>Property 1. The change of \(z\) in a small period of time \(\Delta t\) is \(\Delta z\sim\mathcal N(0,\Delta t)\)</li>
  <li>Property 2. The values of \(\Delta z\) for two different short intervals of time are independent</li>
</ul>

<p>Property 2 implies that \(z\) follows a markov process. Together with Property 1, we can deduce \(z(t_2)-z(t_1)\sim\mathcal N(0,t_2-t_1)\)
When \(\Delta t\) is small, \(\sqrt{\Delta t}\) is much bigger than \(\Delta t\), so</p>
<ol>
  <li>The expected length of path followed by \(z\) in any time interval is infinite</li>
  <li>The expected number of times \(z\) equals any particular value in any time interval is infinite</li>
</ol>

<h4 id="generalized-wiener-process">generalized wiener process</h4>
<p>Suppose \(dz\sim\mathcal N(0,dt)\), the generalized wiener process \(x\) is such that \(dx=adt+bdz\) Where \(a\) is the drift rate and \(b\) is the variance rate. We can deduce
\(\Delta x=a\Delta t+b\Delta z,\quad \Delta z\sim\mathcal N(0,\Delta t)\)
Thus \(\Delta x\sim\mathcal N(a\Delta t,b^2\Delta t)\) and \(x(t_2)-x(t_1)\sim\mathcal N(a(t_2-t_1), b^2(t_2-t_1))\)</p>

<h4 id="ito-process">ito process</h4>
<p>\(dx=a(x,t)dt+b(x,t)dz\)
So for small \(\Delta t\), \(\Delta x=a(x,t)\Delta t+b(x,t)\Delta z\) with \(\Delta z\sim\mathcal N(0,\Delta t)\). This is still a markov process</p>

<h3 id="143---the-process-for-a-stock-price">14.3 - The process for a stock price</h3>
<p>Assume the expected return and the volatility are constant. Then the stock price satisfies
\(\frac{dS}{S}=\mu dt+\sigma dz\Rightarrow dS=\mu Sdt+\sigma Sdz\)
where \(\mu\) is the stock’s expected rate of return and \(\sigma\) is the volatility of the stock price. \(\sigma^2\) is referred to as its variance rate.</p>

<p>In a risk-neutral world, \(\mu\) equals the risk-free rate \(r\)</p>

<p>This is known as geometric brownian motion, and the discrete-time version of the model is
\(\frac{\Delta S}{S}=\mu\Delta t+\sigma\Delta z,\quad \Delta z\sim\mathcal N(0,\Delta t)\)
\(\frac{\Delta S}{S}\sim\mathcal N(\mu\Delta t,\sigma^2\Delta t)\) is the return in a period of time \(\Delta t\)</p>
<h4 id="monte-carlo-simulation">monte carlo simulation</h4>
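<p>A minimal Monte Carlo sketch of the discrete-time model \(\Delta S=S(\mu\Delta t+\sigma\Delta z)\) (the parameter values below are illustrative):</p>

```python
import math
import random

# Simulate one terminal stock price with Euler steps of the discrete-time model.
def simulate_gbm_terminal(S0, mu, sigma, T, n_steps, rng):
    dt = T / n_steps
    S = S0
    for _ in range(n_steps):
        dz = rng.gauss(0.0, math.sqrt(dt))   # Delta z ~ N(0, dt)
        S += S * (mu * dt + sigma * dz)
    return S

rng = random.Random(0)
terminal = [simulate_gbm_terminal(100.0, 0.08, 0.2, 1.0, 252, rng) for _ in range(5000)]
mean_ST = sum(terminal) / len(terminal)   # should be close to S0 * e^{mu T}
```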

<h3 id="144---the-parameters">14.4 - The parameters</h3>
<ul>
  <li>\(\mu\) should increase if the risk is higher or the interest rates are higher</li>
  <li>\(\sigma\) should be approximately the standard deviation of the stock price in 1 year</li>
  <li>\(\sigma\) is critically important to the determination of value of many derivatives</li>
</ul>

<h3 id="145---correlated-processes">14.5 - Correlated processes</h3>
<p>Suppose
\(dx_1=a_1dt+b_1dz_1,\quad dx_2=a_2dt+b_2dz_2\)
And \(dz_1,dz_2\) have correlation \(\rho\). In practice we can set
\(dz_1=u\sqrt{dt},\quad dz_2=(\rho u+\sqrt{1-\rho^2}v)\sqrt{dt}\)
where \(u,v\sim\mathcal N(0,1)\) are independent</p>
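<p>This construction can be checked by simulation: with \(u,v\) independent standard normals, the sample correlation of the increments should approach \(\rho\) (a sketch; parameters are illustrative):</p>

```python
import math
import random

# Generate correlated Wiener increments via dz2 = (rho*u + sqrt(1-rho^2)*v)*sqrt(dt).
def correlated_increments(rho, dt, n, rng):
    dz1, dz2 = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        dz1.append(u * math.sqrt(dt))
        dz2.append((rho * u + math.sqrt(1 - rho**2) * v) * math.sqrt(dt))
    return dz1, dz2

rng = random.Random(1)
dz1, dz2 = correlated_increments(0.7, 1 / 252, 50000, rng)
m1, m2 = sum(dz1) / len(dz1), sum(dz2) / len(dz2)
cov = sum((a - m1) * (b - m2) for a, b in zip(dz1, dz2)) / len(dz1)
var1 = sum((a - m1) ** 2 for a in dz1) / len(dz1)
var2 = sum((b - m2) ** 2 for b in dz2) / len(dz2)
sample_rho = cov / math.sqrt(var1 * var2)   # should be near 0.7
```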

<h3 id="146---itos-lemma">14.6 - Ito’s lemma</h3>
<p>Suppose \(x\) follows ito process
\(dx=a(x,t)dt+b(x,t)dz,\quad dz\sim\mathcal N(0,dt)\)
Then \(G=G(x,t)\) as a function of \(x,t\) follows the ito process
\(dG=\left(\frac{\partial G}{\partial x}a+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2\right)dt+\frac{\partial G}{\partial x}bdz\)
So if \(G\) is a function of \(S, t\), then
\(dG=\left(\frac{\partial G}{\partial S}\mu S+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial S^2}\sigma^2S^2\right)dt+\frac{\partial G}{\partial S}\sigma Sdz\)</p>

<h4 id="application-to-forward-contracts">Application to forward contracts</h4>
<p>Consider a forward contract on a non-dividend-paying stock. \(F\) is
the forward price at a general time \(t\), \(S\) is the stock price at time \(t\), with \(t&lt;T\). Then \(F = Se^{r(T-t)}\), and
\(dF=\left(e^{r(T-t)}\mu S-Sre^{r(T-t)}\right)dt+e^{r(T-t)}\sigma Sdz=(\mu-r)Fdt+\sigma Fdz\)
is a geometric brownian motion. It has the same volatility as \(S\) and an expected growth rate of \(\mu - r\) rather than \(\mu\).</p>

<h3 id="147---the-lognormal-property-">14.7 - The lognormal property <a name="14.7LogNormalProperty"></a></h3>
<p>Consider \(G=\ln S\), then we have
\(dG=\left(\mu-\frac{\sigma^2}{2}\right)dt+\sigma dz\)
So
\(\ln S_T-\ln S_0\sim\mathcal N\left(\left(\mu-\frac{\sigma^2}{2}\right)T, \sigma^2T\right)\Rightarrow\ln S_T\sim\mathcal N\left(\ln S_0+\left(\mu-\frac{\sigma^2}{2}\right)T, \sigma^2T\right)\)</p>

<h3 id="148---fractional-brownian-motion">14.8 - Fractional brownian motion</h3>
<p>Suppose \(dx=\sigma dz\) with \(x_0=0\), where \(z\) is a Wiener process, then \(E(x_t-x_s)=0\) and
\(\begin{align*}
E((x_t-x_s)^2)&amp;=E(x_t^2)+E(x_s^2)-2E(x_tx_s)\\
&amp;=E(x_t^2)+E(x_s^2)-2E((x_t-x_s+x_s)x_s)\\
&amp;=E(x_t^2)+E(x_s^2)-2E((x_t-x_s)x_s)-2E(x_s^2)\\
&amp;=E(x_t^2)-E(x_s^2)\\
&amp;=\sigma^2t-\sigma^2s\\
&amp;=\sigma^2(t-s)
\end{align*}\)
here \(x_t-x_s\) and \(x_s\) are uncorrelated. In a fractional or fractal brownian motion, we assume
\(E((x_t-x_s)^2)=\sigma^2(t-s)^{2H}\)
\(H\) is the Hurst exponent. When \(H=0.5\), it becomes a regular brownian motion. Also
\(E(x_tx_s)=\frac{1}{2}\left(E(x_t^2)+E(x_s^2)-E((x_t-x_s)^2)\right)=\frac{1}{2}\sigma^2[t^{2H}+s^{2H}-(t-s)^{2H}]\)
So the correlation between \(x_t\) and \(x_s\) is
\(\frac{t^{2H}+s^{2H}-(t-s)^{2H}}{2s^Ht^H}\)
Fractional brownian motion is non-markov. If \(t&gt;s&gt;u\)
\(\begin{align*}
E[(x_t-x_s)(x_s-x_u)]&amp;=E[x_tx_s-x_s^2-x_tx_u+x_sx_u]\\
&amp;=\frac{\sigma^2}{2}[(t-u)^{2H}-(t-s)^{2H}-(s-u)^{2H}]
\end{align*}\)
When \(H\) decreases or the time step decreases, the volatility increases, so the process becomes noisier</p>
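<p>One way to simulate fractional brownian motion exactly on a grid is to Cholesky-factor the covariance \(E(x_tx_s)\) derived above (a sketch; the grid size, seed, and parameters are illustrative):</p>

```python
import numpy as np

# Exact simulation of fBM on a time grid via Cholesky of the covariance
# E(x_t x_s) = (sigma^2/2) * (t^{2H} + s^{2H} - |t-s|^{2H}).
def fbm_paths(H, sigma, T, n, m, rng):
    t = np.linspace(T / n, T, n)                      # avoid t = 0 (degenerate row)
    tt, ss = np.meshgrid(t, t, indexing="ij")
    cov = 0.5 * sigma**2 * (tt**(2 * H) + ss**(2 * H) - np.abs(tt - ss)**(2 * H))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))   # tiny jitter for stability
    return t, L @ rng.standard_normal((n, m))         # each column is one path

rng = np.random.default_rng(0)
t, X = fbm_paths(H=0.3, sigma=1.0, T=1.0, n=50, m=20000, rng=rng)
var_T = X[-1].var()   # should be near sigma^2 * T^{2H} = 1
```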

<h3 id="appendix---a-nonrigorous-derivation-of-itos-lemma">Appendix - A nonrigorous derivation of ito’s lemma</h3>
<p>\(\Delta G=\frac{\partial G}{\partial x}\Delta x+\frac{\partial G}{\partial t}\Delta t+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}\Delta x^2+\frac{\partial^2 G}{\partial x\partial t}\Delta x\Delta t+\frac{1}{2}\frac{\partial^2 G}{\partial t^2}\Delta t^2\)
and
\(\Delta x=a(x,t)\Delta t+b(x,t)\Delta z,\quad \Delta z\sim\mathcal N(0,\Delta t)\)
so \(\Delta x^2=b^2\Delta z^2+a^2\Delta t^2+2ab\Delta z\Delta t=b^2\Delta z^2+O(\Delta t)\), substitute to get
\(\Delta G=\frac{\partial G}{\partial x}\Delta x+\frac{\partial G}{\partial t}\Delta t+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2\Delta z^2+O(\Delta t)\)
And we know that \(E(\Delta z^2)=\Delta t\), \(\mathrm{Var}(\Delta z^2)=2\Delta t^2\). As \(\Delta t\to0\), the variance is of higher order than the mean, so \(dz^2=dt\) becomes nonstochastic, therefore we have
\(\begin{align*}
dG&amp;=\frac{\partial G}{\partial x}dx+\frac{\partial G}{\partial t}dt+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2dt\\
&amp;=\frac{\partial G}{\partial x}(adt+bdz)+\frac{\partial G}{\partial t}dt+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2dt\\
&amp;=\left(\frac{\partial G}{\partial x}a+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2\right)dt+\frac{\partial G}{\partial x}bdz\\
\end{align*}\)
When \(dx=adt+\sum_{i=1}^mb_idz_i\), we have
\(dG=\left(\frac{\partial G}{\partial x}a+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}\sum_{i=1}^m\sum_{j=1}^mb_ib_j\rho_{ij}\right)dt+\frac{\partial G}{\partial x}\sum_{i=1}^mb_idz_i\)
Here \(\rho_{ij}\) is the correlation coefficient between \(dz_i,dz_j\). When \(G\) is a function of variables \(x_1,x_2,\dots,x_n,t\) and \(dx_i = a_i dt + b_i dz_i\), we have
\(dG=\left(\sum_{i=1}^n\frac{\partial G}{\partial x_i}a_i+\frac{\partial G}{\partial t}+\sum_{i=1}^n\sum_{j=1}^n\frac{1}{2}\frac{\partial^2 G}{\partial x_i\partial x_j}b_ib_j\rho_{ij}\right)dt+\sum_{i=1}^n\frac{\partial G}{\partial x_i}b_idz_i\)</p>

<h2 id="chapter-15---the-black-scholes-merton-model">Chapter 15 - The Black-Scholes-Merton model</h2>

<h3 id="151---lognormal-property-of-stock-prices">15.1 - Lognormal property of stock prices</h3>
<p>Recall from <a href="#14.7LogNormalProperty">14.7</a>
\(\ln S_T\sim\mathcal N\left(\ln S_0+\left(\mu-\frac{\sigma^2}{2}\right)T, \sigma^2T\right)\)
so that
\(\mathbb E(S_T)=S_0e^{\mu T}\)
\(\mathrm{Var}(S_T) = S_0^2e^{2\mu T}(e^{\sigma^2T}-1)\)</p>

<h3 id="152---the-distribution-of-the-rate-of-return">15.2 - The distribution of the rate of return</h3>
<p>Suppose the continuously compounded rate of return per annum realized between time 0 and \(T\) is \(x\), then \(S_T=S_0e^{xT}\Rightarrow x=\frac{1}{T}\ln\frac{S_T}{S_0}\), hence
\(x\sim\mathcal N\left(\mu-\frac{\sigma^2}{2},\frac{\sigma^2}{T}\right)\)</p>

<h3 id="153---the-expected-return">15.3 - The expected return</h3>
<p>Greater risk or higher level of interest rate would mean higher expected return \(\mu\).</p>

<p>Note that \(\mathbb E(x)=\mu-\frac{\sigma^2}{2}&lt;\mu\). This is because of the ambiguity about what “expected return” means. In the sense of arithmetic mean we have
\(\ln[\mathbb E(S_T)]=\ln(S_0)+\mu T\)
however in the sense of geometric mean we have
\(\mathbb E[\ln(S_T)]=\ln(S_0)+E(x)T\)
In fact \(\ln[\mathbb E(S_T)]&gt;\mathbb E[\ln(S_T)]\) since \(\ln\) is concave (Jensen’s inequality), so \(\mathbb E(x)&lt;\mu\)</p>

<h3 id="154---volatility">15.4 - Volatility</h3>
<p>To estimate the volatility of a stock price empirically, we observe \(n+1\) prices \(S_0,S_1,\cdots,S_n\) at fixed intervals and suppose \(\tau\) is the length of each interval in years. Then the estimate
\(s=\sqrt{\frac{\sum_{i=1}^n(u_i-\bar u)^2}{n-1}}\)
is the sample standard deviation of \(u_i=\ln(S_i/S_{i-1})\)</p>

<p>The standard deviation of \(u_i\) is \(\sigma\sqrt\tau\), and \(\hat\sigma=s/\sqrt\tau\) is an estimator of \(\sigma\) with standard error
\(\frac{\sigma}{\sqrt{n-1}}\sqrt{(n-1)-\frac{2\Gamma(\frac{n}{2})^2}{\Gamma(\frac{n-1}{2})^2}}\approx\frac{\sigma}{\sqrt{2(n-1)}}\approx\hat\sigma/\sqrt{2n}\)
Which uses
\(\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}=\sqrt{\frac{n-1}{2}}\left(1-\frac{1}{4(n-1)}+O\left(\frac{1}{n^2}\right)\right)\)
If \(D\) is the amount of the dividend in some period, then \(u_i=\ln\frac{S_i+D}{S_{i-1}}\)</p>

<p>Research shows that volatility is much higher when the exchange is open for trading than when it is closed. We calculate
\(\text{Volatility per annum}=\text{Volatility per trading day}\times\sqrt{\text{Number of trading days per annum}}\)
The life of an option is
\(T=\frac{\text{Number of trading days until option maturity}}{252}\)
Where the number of trading days in a year is usually assumed to be 252</p>
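<p>Putting this together, a minimal sketch of the estimator \(\hat\sigma=s/\sqrt\tau\) on synthetic daily prices (true \(\sigma=0.2\), drift ignored, seed illustrative):</p>

```python
import math
import random

# Annualized volatility estimate from daily closes: sigma_hat = s / sqrt(tau),
# with u_i = ln(S_i / S_{i-1}) and tau = 1/252.
def annualized_vol(prices, trading_days=252):
    u = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    n = len(u)
    u_bar = sum(u) / n
    s = math.sqrt(sum((x - u_bar) ** 2 for x in u) / (n - 1))   # sample std of u_i
    return s * math.sqrt(trading_days)

rng = random.Random(2)
true_sigma = 0.2
prices = [100.0]
for _ in range(5000):
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, true_sigma * math.sqrt(1 / 252))))
sigma_hat = annualized_vol(prices)   # should be close to 0.2
```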

<h3 id="155-the-idea-underlying-black-scholes-merton-differential-equation">15.5 The idea underlying Black-Scholes-Merton differential equation</h3>

<p>Assumptions</p>
<ol>
  <li>The stock price follows the process developed in Chapter 14 with \(\mu\) and \(\sigma\) constant.</li>
  <li>The short selling of securities with full use of proceeds is permitted.</li>
  <li>There are no transaction costs or taxes. All securities are perfectly divisible.</li>
  <li>There are no dividends during the life of the derivative.</li>
  <li>There are no riskless arbitrage opportunities.</li>
  <li>Security trading is continuous.</li>
  <li>The risk-free rate of interest, \(r\), is constant and the same for all maturities.</li>
</ol>

<h3 id="156-derivation-of-black-scholes-merton-differential-equation">15.6 Derivation of Black-Scholes-Merton differential equation</h3>

<p>The idea is to use a portfolio of the stock and the derivative to eliminate the Wiener process</p>

<h2 id="chapter-16---employee-stock-options">Chapter 16 - Employee stock options</h2>

<h2 id="chapter-17---options-on-stock-indices-and-currencies">Chapter 17 - Options on stock indices and currencies</h2>

<h2 id="chapter-18---futures-options-and-blacks-model">Chapter 18 - Futures options and Black’s model</h2>

<h2 id="chapter-19---the-greek-letters">Chapter 19 - The Greek letters</h2>

<h2 id="chapter-20---volatility-smiles-and-volatility-surfaces">Chapter 20 - Volatility smiles and volatility surfaces</h2>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[A study note taken for the book Options, Futures and Other Derivatives by John Hull]]></summary></entry><entry><title type="html">Song Lyrics</title><link href="https://lihaoranicefire.github.io/JapaneseSongLyrics/" rel="alternate" type="text/html" title="Song Lyrics" /><published>2024-06-01T00:00:00+00:00</published><updated>2024-06-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/JapaneseSongLyrics</id><content type="html" xml:base="https://lihaoranicefire.github.io/JapaneseSongLyrics/"><![CDATA[<p>A list of song lyrics</p>

<h2 id="ヒロイン">ヒロイン</h2>

\[君の毎日に\quad 僕は\;\overset{にあ}{似合}わない\;かな\]

\[白い空から\quad 雪が\;\overset{お}{落}ちた\]

\[\overset{べつ}{別}に\;\overset{い}{良}いさ\;と\quad \overset{は}{吐}き\overset{だ}{出}した\;ため\overset{いき}{息}\;が\]

\[\overset{すこ}{少}し\;\overset{のこ}{残}って\quad\overset{さび}{寂}しそう\;に\;\overset{き}{消}えた\]

\[君の街にも\quad 降っている\;かな\]

\[ああ\;\overset{いま}{今}\;隣で\]

\[雪が\;\overset{きれい}{綺麗}\;と\;\overset{わら}{笑}うのは君が\;いい\]

\[でも\;\overset{さむ}{寒}い\;ね\;って\;\overset{うれ}{嬉}しそう\;なの\;も\]

\[\overset{ころ}{転}びそう\;になって\;\overset{つか}{掴}んだ\;\overset{て}手のその \overset{さき}先で\]

\[ありがとう\;って\;\overset{たの}{楽}しそう\;なのも\]

\[それも君がいい\]]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[A list of song lyrics]]></summary></entry><entry><title type="html">Machine learning notes</title><link href="https://lihaoranicefire.github.io/MLNotes/" rel="alternate" type="text/html" title="Machine learning notes" /><published>2023-03-15T00:00:00+00:00</published><updated>2023-03-15T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/MLNotes</id><content type="html" xml:base="https://lihaoranicefire.github.io/MLNotes/"><![CDATA[<h2 id="data-preprocessing">Data Preprocessing</h2>

<h3 id="data-scaling-and-standardizing">Data scaling And standardizing</h3>
<p>\(x_i\leftarrow\dfrac{x_i-\mu}{\sigma}\)</p>

<h4 id="pros--cons">Pros &amp; Cons</h4>
<p>Features are brought to a common scale, which prevents features with large magnitudes from dominating distance-based or gradient-based methods</p>

<h4 id="usage">Usage</h4>
<p><code class="language-plaintext highlighter-rouge">sklearn.preprocessing.StandardScaler</code></p>

<h3 id="imputation">Imputation</h3>
<p>The process of replacing missing values is known as data <em>imputation</em></p>

<h4 id="examples">Examples</h4>
<ul>
  <li>Constant imputation: replace with constants</li>
  <li>Linear interpolation/regression imputation: replace using a regression model</li>
  <li>Median/mean/mode/(sample statistic) imputation: replace with the median/mean/mode/(sample statistic)</li>
  <li>Forward/backward fill: replace with the previous/next value</li>
  <li>KNN: replace using the mode of the closest \(k\) neighbors</li>
</ul>
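<p>A few of these strategies in pandas (a sketch; the series values are illustrative):</p>

```python
import numpy as np
import pandas as pd

# A small series with two missing values to impute.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

constant = s.fillna(0.0)        # constant imputation
mean_imp = s.fillna(s.mean())   # mean imputation (mean of observed values = 3.0)
ffill = s.ffill()               # forward fill with the previous value
interp = s.interpolate()        # linear interpolation between neighbors
```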

<h6 id="remark">Remark</h6>
<p>You should not use the labels or the test set to impute the training set</p>

<h3 id="pipelines">Pipelines</h3>

<h5 id="definition">Definition</h5>
<p>A <em>pipeline</em> is a series of data processing components arranged sequentially, each component in the pipeline performs a specific task.</p>

<h6 id="pros--cons-1">Pros &amp; Cons</h6>
<p>This process streamlines the workflow and makes it easier to combine and experiment with different algorithms and models.</p>

<h6 id="example">Example</h6>
<p>Learning cubic polynomial \(y=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\epsilon\)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ('poly', PolynomialFeatures(3, interaction_only=False, include_bias=False)),
    ('reg', LinearRegression(copy_X=True))
])
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">PolynomialFeatures</code> generates higher powers \(x^n\) from \(x\).</p>
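<p>A self-contained fit of this pipeline on synthetic cubic data \(y=1+2x-x^3+\epsilon\) (the data, seed, and coefficients below are illustrative) recovers the coefficients:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y = 1 + 2x - x^3 + small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(500, 1))
y = 1 + 2 * x[:, 0] - x[:, 0] ** 3 + rng.normal(0, 0.1, size=500)

pipe = Pipeline([
    ('poly', PolynomialFeatures(3, include_bias=False)),  # features: x, x^2, x^3
    ('reg', LinearRegression()),
])
pipe.fit(x, y)
coefs = pipe.named_steps['reg'].coef_            # approx [2, 0, -1]
intercept = pipe.named_steps['reg'].intercept_   # approx 1
```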

<h2 id="supervised-learning">Supervised Learning</h2>

<p>Given Dataset \(D=\{(\mathbf x^{(i)},\mathbf y^{(i)})\}_{i=1}^N\), where \(\mathbf x^{(i)}\) are <em>feature vectors</em>, its entries are called <em>features</em>, and \(\mathbf y^{(i)}\) are <em>labels</em> or <em>predictions</em>. Assume \(\mathbf y=f(\mathbf x)+\boldsymbol\epsilon\) is the true relation, where \(f\) is a (typically continuous) function and \(\boldsymbol\epsilon\) is a random noise (typically \(\mathbb E(\boldsymbol\epsilon)=\mathbf 0\) and independent). <em>Supervised learning</em> is to “learn” a <em>model</em> \(\hat f\) of \(f\) and make predictions \(\hat{\mathbf y}=\hat f(\mathbf x)\).</p>

<h3 id="bias-variance-trade-off">Bias-variance trade-off</h3>

<p>Suppose \(\mathbb E[\boldsymbol\epsilon]=\mathbf 0\), \(\hat f(\mathbf x)=\hat f(\mathbf x;D)\) with \(D\) sampled from joint probability distribution of \((\mathbf X,\boldsymbol\epsilon)\). Consider the expected total error at a fixed test input \(\mathbf x\) (so \(\mathbb E=\mathbb E_{\mathbf X,\boldsymbol\epsilon|\mathbf X=\mathbf x}\))
\(\begin{align*}
\mathbb E[\|\mathbf y-\hat{\mathbf y}\|^2] &amp;= \mathbb E[\|f+\boldsymbol\epsilon-\hat f\|^2]\\
&amp;=\mathbb E[\|f-\hat f\|^2]+2\mathbb E[(f-\hat f)\cdot\boldsymbol\epsilon]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\mathbb E[\|f-\hat f\|^2]+2\mathbb E[f-\hat f]\cdot\mathbb E[\boldsymbol\epsilon]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\mathbb E[\|f-\mathbb E\hat f+\mathbb E\hat f-\hat f\|^2]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\mathbb E[\|f-\mathbb E\hat f\|^2]+2\mathbb E[(f-\mathbb E\hat f)\cdot(\mathbb E\hat f-\hat f)]+\mathbb E[\|\mathbb E\hat f-\hat f\|^2]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\|f-\mathbb E\hat f\|^2+2(f-\mathbb E\hat f)\cdot\mathbb E[\mathbb E\hat f-\hat f]+\mathbb E[\|\mathbb E\hat f\|^2-2\mathbb E\hat f\cdot\hat f+\|\hat f\|^2]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\|f-\mathbb E\hat f\|^2+\mathbb E[\|\hat f\|^2]-\|\mathbb E\hat f\|^2+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\|f-\mathbb E\hat f\|^2+\text{Var}[\hat f]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
\end{align*}\)
Here \(\mathbb E[\|\boldsymbol\epsilon\|^2]\) is referred to as the <em>irreducible error</em>, so we have the simplified version
\(\text{total error = bias}^2 + \text{Variance + irreducible error}\)</p>

<p>When underfitting, the model is too simple, so the bias is huge (e.g., using a linear equation to approximate a quadratic). When overfitting, the model is too complex, so the variance is great (e.g., using a high-degree polynomial to approximate a linear relation with small random noise). One has to trade off bias against variance so that neither is significant.</p>

<h3 id="objective--loss-function">Objective &amp; Loss function</h3>

<p>To improve the model, we need loss functions</p>
<ul>
  <li><em>Mean squared error (MSE)</em>: \(\displaystyle\frac{1}{N}\sum_{i=1}^N\|\mathbf y^{(i)}-\hat{\mathbf y}^{(i)}\|^2\). Used in regression</li>
  <li><em>Mean Absolute Error (MAE)</em>: \(\displaystyle\frac{1}{N}\sum_{i=1}^N\|\mathbf y^{(i)}-\hat{\mathbf y}^{(i)}\|\).</li>
  <li><em>Root Mean Square Error (RMSE)</em>: \(\displaystyle\sqrt{\frac{1}{N}\sum_{i=1}^N\|\mathbf y^{(i)}-\hat{\mathbf y}^{(i)}\|^2}\).</li>
  <li><em>Logistic Loss or Cross-Entropy Loss</em>: \(\displaystyle-\sum_i[y_i\log(\hat p_i)+(1-y_i)\log(1-\hat p_i)]\). Used in binary classification. Or \(\displaystyle-\sum_{c=1}^C\sum_{i}y_{i,c}\log(\hat p_{i,c})\). Used in multiclass classification</li>
</ul>

<p>But to prevent overfitting, we also include a regularization term \(\Omega(\theta)\). The objective function is then the sum of the loss \(L(\theta)\) and \(\Omega(\theta)\)</p>

<h3 id="k-fold-cross-validation--grid-search">\(k\)-fold cross validation &amp; grid search</h3>

<ol>
  <li><em>\(k\)-fold cross validation</em> divides the dataset into \(k\) subsets. The model is trained \(k\) times, each time singling out one subset as the validation set and training on the remaining \(k-1\); the \(k\) validation scores are then averaged.</li>
  <li><em>Grid search</em> provides an array of candidate values for each hyperparameter, tests the model with every combination, and chooses the best one.</li>
</ol>

<h4 id="usage-1">Usage</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error

GridSearchCV(
    cv = KFold(n_splits=5, random_state=30293, shuffle=True),
    estimator = KNeighborsRegressor(),
    param_grid = {
        'n_neighbors': range(1, 50),
        'weights': ['uniform', 'distance']
    },
    scoring = 'neg_mean_squared_error'
)
</code></pre></div></div>

<h6 id="remark-1">Remark</h6>
<p>When should you not use cross-validation, and use simple validation instead?</p>
<ol>
  <li>Dataset size is too small. This can lead to deficiencies in both model fitting and estimation.</li>
  <li>Model training time is too long. The extra fits might not be worth the time.</li>
</ol>

<h3 id="gradient-descent">Gradient descent</h3>
<p>The method of <em>gradient descent</em> decreases the loss function \(\ell\) via the update \(\beta\leftarrow\beta-\alpha\nabla\ell(\beta)\), where \(\alpha\) is the learning rate.
Some common adjustments are</p>
<ol>
  <li><em>Mini-batch gradient descent</em>: instead of using the entire dataset, cycle through mini-batches to compute gradients.</li>
  <li><em>Stochastic gradient descent</em>: compute the gradient from a single randomly chosen sample (or mini-batch) each step.
    <ul>
      <li>Pros: the gradient noise helps avoid getting stuck in a local minimum.</li>
    </ul>
  </li>
</ol>

<h4 id="comparisons-of-common-gradient-descent-methods">Comparisons of common gradient descent methods</h4>
<ul>
  <li><em>Stochastic gradient descent(SGD)</em> updates the parameter using the gradient of a single sample.
  Gradient descent is
  \(\theta_{t+1}=\theta_t-\lambda\cdot\nabla L(\theta_t)\)
  When \(L(\theta)=\dfrac{1}{N}\sum_iL_i(\theta)\), SGD replaces \(\nabla L\) with \(\nabla L_i\) for a randomly chosen \(i\)</li>
  <li><em>Momentum</em> \(v\) is defined by
  \(\begin{cases}
  v_{t+1}=\beta\cdot v_t+(1-\beta)\cdot\nabla L(\theta_t)\\
  \theta_{t+1}=\theta_t-\lambda\cdot v_{t+1}
  \end{cases}\)
  This includes the “inertia” from the previous momentums and gradients, it helps accelerate convergence in the direction of persistent gradient, and reduce oscillations.</li>
  <li><em>Adaptive gradient(Adagrad)</em> is mathematically described by
  \(\begin{cases}
  G_{t+1}=G_t+|\nabla L(\theta_t)|^2\\
  \theta_{t+1}=\theta_t-\dfrac{\lambda}{\sqrt{G_{t+1}+\epsilon}}\cdot \nabla L(\theta_t)
  \end{cases}\)
  \(G_t\) is the accumulated squared gradient (like a second moment). Dividing by \(\sqrt{G_{t+1}+\epsilon}\) keeps the effective step bounded when gradients are large. This method works well with sparse data but might overly reduce the learning rate when some frequently occurring features have large gradients.</li>
  <li><em>Root mean squared propagation(RMSprop)</em> slightly changes Adagrad
  \(\begin{cases}
  G_{t+1}=\beta\cdot G_t+(1-\beta)|\nabla L(\theta_t)|^2\\
  \theta_{t+1}=\theta_t-\dfrac{\lambda}{\sqrt{G_{t+1}+\epsilon}}\cdot \nabla L(\theta_t)
  \end{cases}\)
  The exponential decay of \(G_t\) helps mitigate the problem of the diminishing learning rate</li>
  <li><em>Adaptive moment estimate(Adam)</em> combines momentum and RMSProp
  \(\begin{cases}
  m_{t+1}=\beta_1\cdot m_t+(1-\beta_1)\cdot\nabla L(\theta_t)\\
  v_{t+1}=\beta_2\cdot v_t+(1-\beta_2)|\nabla L(\theta_t)|^2\\
  \hat m_{t+1} = \dfrac{m_{t+1}}{1-\beta_1^{t+1}}\\
  \hat v_{t+1} = \dfrac{v_{t+1}}{1-\beta_2^{t+1}}\\
  \theta_{t+1}=\theta_t-\dfrac{\lambda}{\sqrt{\hat v_{t+1}+\epsilon}}\cdot \hat m_{t+1}
  \end{cases}\)
  The normalization prevents bias from zero initialization (\(m_0=v_0=0\)): dividing by \(1-\beta_1^{t+1},1-\beta_2^{t+1}\) de-biases the early estimates, and as \(t\) grows, \(\beta^t\) decays exponentially to 0 so the correction vanishes.</li>
</ul>
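<p>The Adam update above can be sketched in NumPy (the quadratic test function, seedless setup, and hyperparameters are illustrative; \(\epsilon\) is placed outside the square root here, a common implementation choice):</p>

```python
import numpy as np

# Minimal Adam loop: minimize f(theta) = ||theta - 1||^2, whose gradient is 2(theta - 1).
def adam_minimize(grad, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=3000):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # first-moment estimate
    v = np.zeros_like(theta)   # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta = adam_minimize(lambda th: 2.0 * (th - 1.0), [5.0, -3.0])   # should approach [1, 1]
```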

<h3 id="regularization">Regularization</h3>
<p>Regularization adds penalty terms to the loss function to discourage overly complex models. It controls the magnitude of the coefficient vector \(\beta\)</p>
<ol>
  <li><em>Ridge regularization</em> is to add \(\lambda\|\beta\|_2^2\)</li>
  <li><em>Lasso regularization</em> is to add \(\lambda\|\beta\|_1\)</li>
</ol>

<h4 id="pros--cons-2">Pros &amp; Cons</h4>
<ol>
  <li>Lasso works better for feature selection, so it is better if there are a large amount of features. But it might only randomly choose some of highly correlated features (colinearity).</li>
  <li>Ridge is better if it depends on almost all the features, because it handles colinearity better. However it is computationally costly with a large number of predictors</li>
</ol>
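<p>For intuition, ridge has the closed form \(\beta=(X^TX+\lambda I)^{-1}X^Ty\); a minimal NumPy sketch (synthetic data and illustrative \(\lambda\) values) shows the shrinkage:</p>

```python
import numpy as np

# Closed-form ridge regression: beta = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

beta_small = ridge(X, y, lam=0.01)    # close to the true coefficients
beta_large = ridge(X, y, lam=1000.0)  # heavily shrunk toward zero
```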

<h4 id="elastic-net">Elastic net</h4>
<p>Sometimes it might be better to simply use the <em>elastic net</em> regularization, which adds \(\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2\)</p>

<h3 id="confusion-matrix">Confusion matrix</h3>
<p>The <em>confusion matrix</em> is the \(2\times2\) contingency table, where the rows are the predicted values, and columns are the actual values.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Positive</th>
      <th>Negative</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Positive</td>
      <td>TP</td>
      <td>FP</td>
    </tr>
    <tr>
      <td>Negative</td>
      <td>FN</td>
      <td>TN</td>
    </tr>
  </tbody>
</table>

<p>We define</p>
<ul>
  <li><em>Accuracy</em> = \(\dfrac{TP+TN}{TP+FP+FN+TN}\)
Accuracy is used if the dataset is balanced and equally distributed, e.g. spam detection</li>
  <li><em>Precision</em> = \(\dfrac{TP}{TP+FP}\)
Precision is used if the cost for false positive is high, e.g. Fraud detection</li>
  <li><em>Recall(Sensitivity)</em> = \(\dfrac{TP}{TP+FN}\)
Recall is used if the cost for false negative is high, e.g. disease detection</li>
  <li><em>Specificity</em> = \(\dfrac{TN}{TN+FP}\)</li>
  <li><em>F1 score</em> = harmonic mean of Precision and Recall, i.e.
\(\text{F1 score} = \dfrac{2}{\dfrac{1}{\text{Precision}}+\dfrac{1}{\text{Recall}}} = 2\dfrac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\)
F1 score is the single metric of both the precision and recall which balances the Precision-Recall tradeoff by taking both into account, especially if there is an uneven class distribution, e.g. search engine ranking for relevance.</li>
</ul>
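<p>These metrics can be computed directly from the four counts (the counts below are illustrative):</p>

```python
# Compute the metrics above from confusion-matrix counts.
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, specificity, f1

acc, prec, rec, spec, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```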

<h3 id="roc">ROC</h3>

<p>In binary classification, \(\hat Y\) is usually a continuous score. The <em>receiver operating characteristic curve (ROC)</em> is the parametrized curve \((\text{fpr}(t),\text{tpr}(t))\), \(t\in\mathbb R\), where</p>
<ul>
  <li>\(\displaystyle\text{tpr}(t)=\frac{TP}{TP+FN}=\mathbb P(\hat Y\geq t\mid Y=1)\) is the true positive rate (recall)</li>
  <li>\(\displaystyle\text{fpr}(t)=\frac{FP}{FP+TN}=\mathbb P(\hat Y\geq t\mid Y=0)\) is the false positive rate (1 − specificity)</li>
  <li>\(t\) is the cut-off</li>
</ul>

<p>It is not hard to conclude</p>
<ul>
  <li>A totally random model corresponds to the diagonal line, where \(\hat Y\) is independent of \(Y\) and thus \(\text{tpr}(t)=\text{fpr}(t)=\mathbb P(\hat Y\geq t)\)</li>
  <li>The perfect model corresponds to the two segments \((0,0)\to(0,1)\) and \((0,1)\to(1,1)\), where \(\mathbb P(\hat Y\geq t_0)=\mathbb P(Y=1)\) for some \(t_0\), and for every \(t\) either \(\text{tpr}(t)=1\) or \(\text{fpr}(t)=0\)</li>
  <li>\(\text{tpr}(-\infty)=\text{fpr}(-\infty)=1\), \(\text{tpr}(\infty)=\text{fpr}(\infty)=0\), and \(\text{tpr},\text{fpr}\) are non-increasing</li>
</ul>

<p>The <em>Area under ROC (AUROC/AUC)</em> is a single-number summary of classifier performance: \(\frac{1}{2}\) corresponds to random guessing, while \(1\) corresponds to outstanding discrimination. The AUC equals the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. Indeed, suppose \(Z_1\sim\hat Y|Y=1\) (with cdf \(1-\text{tpr}\)) and \(Z_0\sim\hat Y|Y=0\) (with cdf \(1-\text{fpr}\)) are independent; then
\(\begin{align*}
\text{AUC}&amp;=\int_0^1ydx=\int_{+\infty}^{-\infty}\text{tpr}(t)d\text{fpr}(t)\\
&amp;=\int_{+\infty}^{-\infty}\text{tpr}(t)\text{fpr}'(t)dt\\
&amp;=\int_{+\infty}^{-\infty}\left(-\int_t^\infty\text{tpr}'(s)ds\right)\text{fpr}'(t)dt\\
&amp;=\int_{-\infty}^\infty\int_{-\infty}^\infty\text{tpr}'(s)\text{fpr}'(t)\mathbf 1_{s\geq t}(s,t)dsdt\\
&amp;=\mathbb P(Z_1\geq Z_0)
\end{align*}\)
Suppose \(\{Z_0^{(i)}\}_{i=1}^{n_0}\sim\hat Y|Y=0\), \(\{Z_1^{(j)}\}_{j=1}^{n_1}\sim\hat Y|Y=1\) are (independent) samples, then an unbiased estimator of AUC is \(\dfrac{U}{n_0n_1}\), where
\(U=\sum_{i=1}^{n_0}\sum_{j=1}^{n_1}\mathbf 1_{Z_1^{(j)}\geq Z_0^{(i)}}=n_0n_1+\frac{n_0(n_0+1)}{2}-R_0\)
and \(R_0\) is the sum of the ranks of the \(Z_0^{(i)}\) among all the samples. This is precisely the <em>Wilcoxon-Mann-Whitney (WMW)</em> \(U\)-statistic.</p>

<ul>
  <li>Proof: Suppose the ranks of \(Z_0^{(i)}\) are \(r_1&lt;\cdots&lt;r_{n_0}\), then \(R_0=r_1+\cdots+ r_{n_0}\) and 
  \(\begin{align*}
  U&amp;=(n_0+n_1-r_{n_0})+(n_0+n_1-1-r_{n_0-1})+\cdots+(n_0+n_1-(n_0-1)-r_1)\\
  &amp;=n_0(n_0+n_1)-R_0-\frac{n_0(n_0-1)}{2}\\
  &amp;=n_0n_1+\frac{n_0(n_0+1)}{2}-R_0
  \end{align*}\)</li>
</ul>
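<p>The rank identity above is easy to verify numerically. The sketch below (illustrative, with simulated Gaussian scores) computes \(U\) both as the raw pairwise count and via \(n_0n_1+\frac{n_0(n_0+1)}{2}-R_0\).</p>

```python
import random

random.seed(7)
z0 = [random.gauss(0.0, 1.0) for _ in range(50)]   # scores given Y = 0
z1 = [random.gauss(1.0, 1.0) for _ in range(60)]   # scores given Y = 1
n0, n1 = len(z0), len(z1)

# Direct pairwise count of concordant pairs.
u_pairs = sum(1 for a in z0 for b in z1 if b >= a)

# Rank-based formula: R0 is the sum of the ranks of the z0's among all samples.
ranks = {v: k + 1 for k, v in enumerate(sorted(z0 + z1))}  # ties a.s. absent
r0 = sum(ranks[v] for v in z0)
u_ranks = n0 * n1 + n0 * (n0 + 1) // 2 - r0

auc_estimate = u_pairs / (n0 * n1)
```

<p>The two computations agree exactly, and the resulting AUC estimate is well above \(\frac12\) since the positive scores were simulated with a higher mean.</p>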

<h3 id="r-squared">R Squared</h3>

<p><em>Total sum of squares</em> \(SS_{tot}=\sum_i(y_i-\bar y)^2\)</p>

<p><em>Residual sum of squares</em> \(SS_{res}=\sum_i(y_i-\hat y_i)^2\)</p>

<p><em>Coefficient of determination (\(R^2\))</em> is defined to be \(1-\dfrac{SS_{res}}{SS_{tot}}\)</p>

<p>If \(R^2=0\), the model does no better than always predicting the constant mean \(\bar y\); if \(R^2=1\), the predictions are exact. Note that \(R^2\) can even be negative when the model fits worse than the constant mean.</p>
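<p>A direct implementation of the definition (a minimal sketch, with illustrative names):</p>

```python
def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, [1.0, 2.0, 3.0, 4.0])  # exact predictions
constant = r_squared(y, [2.5] * 4)            # always predict the mean
```
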

<p>GAIN/LIFT charts</p>

<h3 id="k-nearest-neighbors">\(k\)-nearest neighbors</h3>
<p>The <em>\(k\)-nearest neighbors</em> algorithm assigns to a point the majority label among its \(k\) nearest neighbors.</p>

<h3 id="linear-and-logistic-regression">Linear and logistic regression</h3>
<p><em>Logistic regression</em> is used for binary classification, modeling \(p(x)=\dfrac{1}{1+e^{-\beta x}}\).</p>

<h4 id="interaction-terms-in-linear-regression">Interaction terms in linear regression</h4>
<p>When you have categorical variables, consider adding interaction terms, since a categorical variable may change the effect of other variables on the response.</p>

<h4 id="residual-plots">Residual plots</h4>
<ul>
  <li>Residual vs. features. It helps find missing signals and identify missing interaction terms.</li>
  <li>Residual vs. predicted values. It helps detect heteroscedasticity and remaining nonlinearity.</li>
</ul>

<h4 id="feature-selection">Feature selection</h4>
<p>The <em>best subsets selection</em> tries every possible subset of features and then chooses the best one. This is computationally very costly. Instead we could do</p>
<ul>
  <li><em>Forward selection</em>: Start with the baseline model (no features selected). At each step, try adding each of the remaining features to the current model, keep the best performing one (minimal MSE), discard the others, and iterate; if none is better than the current model, stop and use the current model.</li>
  <li><em>Backward selection</em>: Start with a model that includes all features, then remove features one at a time. If removing any feature makes the model worse, stop and use the current model.</li>
</ul>

<p>We could also simply try lasso regularization, which shrinks the coefficients of uninformative features toward zero.</p>
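<p>The forward-selection loop is short enough to write out. Below is a self-contained sketch (all names are illustrative): the model is ordinary least squares solved via the normal equations, the score is in-sample MSE, and the toy response depends on two of the three candidate features.</p>

```python
import random

def ols_fit(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    p, n = len(X[0]), len(X)
    A = [[sum(X[i][q] * X[i][r] for i in range(n)) for r in range(p)] for q in range(p)]
    b = [sum(X[i][q] * y[i] for i in range(n)) for q in range(p)]
    for c in range(p):                       # forward elimination with pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for c in reversed(range(p)):             # back substitution
        beta[c] = (b[c] - sum(A[c][k] * beta[k] for k in range(c + 1, p))) / A[c][c]
    return beta

def mse(X, y, beta):
    return sum((y[i] - sum(bk * xk for bk, xk in zip(beta, X[i]))) ** 2
               for i in range(len(y))) / len(y)

def forward_selection(features, y):
    """features: dict name -> column of values; returns greedily selected names."""
    y_bar = sum(y) / len(y)
    selected, best_mse = [], sum((v - y_bar) ** 2 for v in y) / len(y)
    while True:
        best = None
        for name in features:
            if name in selected:
                continue
            cols = selected + [name]
            X = [[features[c][i] for c in cols] for i in range(len(y))]
            m = mse(X, y, ols_fit(X, y))
            if m < best_mse - 1e-12:         # keep only strict improvements
                best, best_mse = name, m
        if best is None:                     # nothing improves: stop
            return selected
        selected.append(best)

random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]     # pure noise feature
y = [2 * a - b for a, b in zip(x1, x2)]            # depends only on x1, x2
chosen = forward_selection({"x1": x1, "x2": x2, "noise": noise}, y)
```

<p>On this toy data the procedure picks the two informative features and stops, since adding the noise feature no longer improves the MSE.</p>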

<h4 id="regressino-version-of-classification-algorithms">Regression versions of classification algorithms</h4>
<ol>
  <li><em>\(k\)-nearest neighbors regression</em> takes the average of the \(k\) nearest values.</li>
  <li><em>Tree regression</em> uses MSE as the loss function</li>
  <li><em>Support vector regression</em></li>
</ol>

<h3 id="supported-vector-machine">Support vector machine</h3>
<p>In binary classification, we are given a dataset \(\{(\mathbf x_i,y_i)\}_{i=1}^N\), where \(y_i=\pm1\) is the label. Naively, a <em>support vector machine</em> finds a decision boundary that maximizes the margin between the two classes.</p>

<h4 id="hard-margin">Hard margin</h4>
<p>If the data is linearly separable, we wish to find a hyperplane \(\mathbf w\cdot\mathbf x-b=\mathbf0\) that separates the two classes with maximal margin. Equivalently, we solve the following problem: find \(\mathbf w\) and \(b\) minimizing \(\|\mathbf w\|_2^2\) subject to
\(y_i(\mathbf w\cdot\mathbf x_i-b)\geq 1\)
The geometric interpretation rests on the fact:
\(\text{The distance between the origin and the plane }\mathbf w\cdot\mathbf x-b=0\text{ is }\frac{|b|}{\|\mathbf w\|_2}\)
We choose \(\mathbf w,b\) such that the hyperplanes \(\mathbf w\cdot\mathbf x-b=1\) and \(\mathbf w\cdot\mathbf x-b=-1\) barely touch the two classes. The margin between each class and the decision boundary is then \(\frac{1}{\|\mathbf w\|_2}\). Note that this max-margin hyperplane is completely determined by the \(\mathbf x_i\) that lie nearest to it; they are called <em>support vectors</em>.</p>

<h4 id="hinge-loss">Hinge loss</h4>
<p>The <em>hinge loss</em> is the function \(\ell(y)=\max(0,1-t\cdot y)\), where \(t=\pm1\) is the true label and \(y\) is the model's score.</p>

<h4 id="soft-margin">Soft margin</h4>
<p>If the dataset is not linearly separable, we introduce the <em>hinge function</em> \(\max(0,1-y_i(\mathbf w\cdot\mathbf x_i-b))\), which penalizes data on the wrong side of the margin. We can define the loss function
\(\lambda\|\mathbf w\|_2^2+\frac{1}{N}\sum_{i=1}^N\max(0,1-y_i(\mathbf w\cdot\mathbf x_i-b))\)
If \(\lambda\) is small, this behaves essentially like the hard-margin SVM. Equivalently, the soft-margin problem can be written with slack variables: minimize \(\lambda\|\mathbf w\|_2^2+\frac{1}{N}\sum_{i=1}^N\zeta_i\) subject to
\(y_i(\mathbf w\cdot\mathbf x_i-b)\geq 1-\zeta_i,\quad \zeta_i\geq0\)</p>

<h4 id="nonlinear-kernels">Nonlinear kernels</h4>
<p>Sometimes the data is very hard to separate; we then consider transformations \(\varphi\) that map \(\mathbf x_i\) into higher dimensional spaces (even infinite dimensional ones!). With sufficiently good choices, we never need to evaluate \(\varphi\) itself; we only need to know \(\kappa(\mathbf x_i,\mathbf x_j)=\varphi(\mathbf x_i)\cdot\varphi(\mathbf x_j)\). The function \(\kappa\) is called a <em>kernel</em>. Common examples are</p>
<ol>
  <li>Linear: \(\kappa(\mathbf x_i,\mathbf x_j)=\mathbf x_i\cdot\mathbf x_j\).</li>
  <li>Polynomial: \(\kappa(\mathbf x_i,\mathbf x_j)=(\mathbf x_i\cdot\mathbf x_j+r)^d\).
 Note that, for example, if we choose \(\varphi(x_1,x_2)=(x_1^2,\sqrt2x_1x_2,x_2^2)\), then
 \(\varphi(\mathbf x)\cdot\varphi(\mathbf y)=x_1^2y_1^2+2x_1x_2y_1y_2+x_2^2y_2^2=(x_1y_1+x_2y_2)^2=(\mathbf x\cdot\mathbf y)^2\)</li>
  <li>Gaussian Radial Kernel: \(\kappa(\mathbf x_i,\mathbf x_j)=\exp(-\gamma\|\mathbf x_i-\mathbf x_j\|_2^2)\).</li>
  <li>Sigmoid: \(\kappa(\mathbf x_i,\mathbf x_j)=\tanh(\gamma\mathbf x_i\cdot\mathbf x_j+r)\).</li>
</ol>
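<p>The polynomial-kernel identity above can be checked numerically. This little sketch (illustrative) confirms that the explicit feature map \(\varphi(x_1,x_2)=(x_1^2,\sqrt2x_1x_2,x_2^2)\) reproduces \((\mathbf x\cdot\mathbf y)^2\), so the kernel never has to work in the 3-dimensional feature space.</p>

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel with r = 0."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(y))    # inner product in feature space
rhs = dot(x, y) ** 2         # kernel evaluated in the original space
```
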

<p>We can solve the dual optimization problem.</p>

<h3 id="bayes-based-classifiers">Bayes’ based classifiers</h3>

<h4 id="linear-discriminant-analysis-lda">Linear discriminant analysis (LDA)</h4>
<p>Assume \(X|y=c\sim\mathcal N(\mu_c,\sigma^2)\), in the case where \(X\) has one feature, we have
\(f_c(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_c)^2}{2\sigma^2}\right)\)
Then Bayes’ rule tells us
\(P(y=c|X)=\frac{\pi_c\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_c)^2}{2\sigma^2}\right)}{\sum_{l=1}^C\pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_l)^2}{2\sigma^2}\right)}\)
Here \(\pi_c\) denotes \(P(y=c)\). So we could estimate
\(\hat\mu_c=\frac{1}{N_c}\sum_{y_i=c}X_i\)
\(\hat\sigma^2=\frac{1}{N-C}\sum_{c=1}^C\sum_{y_i=c}(X_i-\hat\mu_c)^2\)
We make predictions by choosing the \(c\) that maximizes the posterior \(P(y=c|X)\); this is equivalent to choosing the largest <em>discriminant function</em>
\(\delta_c(X)=X\frac{\mu_c}{\sigma^2}-\frac{\mu_c^2}{2\sigma^2}+\log(\pi_c)\)
Here we should use \(\hat\mu_c,\hat\sigma\). In the case where \(X\) has \(m\) features, we have \(X|y=c\sim\mathcal N(\mu_c,\Sigma)\), and
\(f_c(\mathbf x)=\frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf x-\mu_c)^T\Sigma^{-1}(\mathbf x-\mu_c)\right)\)
And the discriminant function becomes
\(\delta_c(X)=X^T\Sigma^{-1}\mu_c-\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c+\log(\pi_c)\)</p>

<h4 id="quadratic-discriminant-analysis-qda">Quadratic discriminant analysis (QDA)</h4>
<p>Assume \(X|y=c\sim\mathcal N(\mu_c,\Sigma_c)\), we get discriminant
\(\begin{align*}
\delta_c(X)&amp; = -\frac{1}{2} \left( X - \mu_c \right)^T \Sigma_c^{-1}  \left(X- \mu_c  \right) - \frac{1}{2}\log\left(|\Sigma_c| \right) + \log(\pi_c)\\
&amp;= -\frac{1}{2} X^{T} \Sigma^{-1}_c X + X^{T} \Sigma^{-1}_c \mu_c - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c - \frac{1}{2}\log\left(|\Sigma_c| \right) + \log(\pi_c)
\end{align*}\)</p>

<h4 id="naive-bayes-classifier">Naive Bayes classifier</h4>
<p>Assume that, for each given class \(c\), the \(m\) features are independent; we then have
\(f_c(X)=f^{(1)}_c(X_1)\cdots f^{(m)}_c(X_m)\)
Then by Bayes rule
\(P(y=c|X)=\frac{\pi_cf^{(1)}_c(X_1)\cdots f^{(m)}_c(X_m)}{\sum_{l=1}^C\pi_lf^{(1)}_l(X_1)\cdots f^{(m)}_l(X_m)}\)
To estimate \(f^{(i)}_c\) we assume some kind of distribution and then estimate its parameters</p>
<ul>
  <li>If \(X_i\) is quantitative, we assume a normal distribution</li>
  <li>If \(X_i\) is categorical, we assume a Bernoulli distribution</li>
</ul>

<h4 id="pros--cons-3">Pros &amp; Cons</h4>
<ol>
  <li>LDA works better for smaller datasets while QDA needs larger datasets, since QDA estimates a separate covariance matrix for each class and thus has more parameters</li>
  <li>LDA works better if the data can be mostly separated by linear decision boundaries. QDA works better if the decision boundaries are not linear.</li>
  <li>If we have a really small amount of data, we can use the naive Bayes model. It is in general a decent classifier.</li>
</ol>

<h3 id="decision-trees">Decision trees</h3>

<h4 id="pros--cons-4">Pros &amp; Cons</h4>
<ul>
  <li>Pros
    <ul>
      <li>Very fast and needs very little data preprocessing</li>
    </ul>
  </li>
  <li>Cons
    <ul>
      <li>The algorithm is greedy, so it might not create an optimal tree</li>
      <li>Decision trees have orthogonal boundaries, which might not be ideal</li>
      <li>Decision trees are sensitive to training data</li>
    </ul>
  </li>
</ul>

<h4 id="gini-impurity">Gini impurity</h4>
<p><em>Gini impurity</em> \(I_G\) is defined by
\(I_G(p)=\sum_ip_i(1-p_i)=\sum_i(p_i-p_i^2)=1-\sum_ip_i^2\)
For \(N\) classes, \(I_G(p)\) lies between \(0\) and \(1-\dfrac{1}{N}\): it is \(0\) when all samples belong to a single class, and it attains the maximum \(1-\dfrac{1}{N}\) when the classes are evenly distributed.</p>
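<p>A two-line implementation of the formula (illustrative names):</p>

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) from raw class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

pure = gini([10, 0])        # a single class
even = gini([5, 5, 5, 5])   # N = 4 classes evenly distributed: 1 - 1/4
```
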

<h4 id="cross-entropy">Cross entropy</h4>
<p>The <em>information content (surprisal)</em> of an event \(A\) is quantified as \(\log\left(\dfrac{1}{P(A)}\right)=-\log P(A)\). The expected surprisal of \(A\) is \(-P(A)\log P(A)\). The <em>Entropy</em> under \(P\) is the sum of expected surprisal
\(H(P)=-\mathbb E_P[\log P]=-\sum_ip_i\log(p_i)\)
The <em>Cross-entropy</em> of \(Q\) under \(P\) is
\(H(P,Q)=-\mathbb E_P[\log Q]=-\sum_ip_i\log(q_i)\)
which measures the discrepancy using \(Q\) as predictions given the actual distribution is \(P\).</p>

<p>The <em>KL divergence (relative entropy)</em> of \(P\) from \(Q\) is
\(D_{KL}(P||Q)=\sum_ip_i\log\left(\dfrac{p_i}{q_i}\right)=H(P,Q)-H(P)\)
This is always nonnegative (<em>Gibb’s inequality</em>) since
\(\begin{align*}
-D_{KL}(P||Q)&amp;=\sum_ip_i\ln\left(\dfrac{q_i}{p_i}\right)\\
&amp;\leq\sum_ip_i\left(\dfrac{q_i}{p_i}-1\right)\\
&amp;=\sum_iq_i-\sum_ip_i\\
&amp;=0
\end{align*}\)
So the minimum of the cross entropy \(H(P)\) is attained \(\iff P=Q\)</p>

<p>Note that minimizing the cross entropy and minimizing the relative entropy are equivalent, because the entropy of \(P\) is fixed.</p>
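<p>The identity \(D_{KL}(P||Q)=H(P,Q)-H(P)\) and Gibbs’ inequality are easy to check numerically; a minimal sketch (illustrative distributions):</p>

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL divergence as cross entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
```

<p>For any two distributions the divergence is nonnegative, and it vanishes exactly when \(P=Q\).</p>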

<h4 id="cart-algorithm">CART algorithm</h4>
<p>The <em>CART</em> (Classification and Regression Trees) algorithm is a decision tree-based algorithm that can be used for both classification and regression problems in machine learning. It works by recursively partitioning the training data into smaller subsets using binary splits.</p>

<h4 id="random-forest">Random Forest</h4>
<p>The <em>random forest</em> model is built from many different decision trees. These trees are made “different” through a variety of random perturbations (bootstrapped samples and random feature subsets). Finally, average the predictions of all trees (or take a majority vote for classification).</p>

<p><a href="https://xgboost.readthedocs.io/en/stable/tutorials/model.html">XGBoost Tutorials</a></p>

<h4 id="boosting">Boosting</h4>
<p>A statistical learning algorithm is said to be a</p>
<ul>
  <li><em>weak learner</em> if it does slightly better than random guessing.</li>
  <li><em>strong learner</em> if it can be made arbitrarily close to the true relationship</li>
</ul>

<p>Thanks to PAC (probably approximately correct) learnability, one can show that there exist <em>boosting</em> algorithms that can turn weak learners into strong learners.</p>

<p>For example, a decision tree with a single layer (a decision stump) is a weak learner, whereas a full decision tree is a strong learner.</p>

<h4 id="adaptive-boosting">Adaptive boosting</h4>
<p><em>Adaptive boosting</em> builds a strong learner iteratively by learning the weaknesses of the previous weak learners. Suppose we have iteratively built the first \(j\) weak learners and now construct the \(j+1\)-th. Suppose the prediction of \(y_i\) by the \(j\)-th weak learner is \(\hat y^{(j)}_i\), and the current weight assigned to \(y_i\) is \(w_i\). We calculate the <em>weighted error rate</em> = 1 - weighted accuracy
\(r_j=\frac{\sum_{\hat y^{(j)}_i\neq y_i}w_i}{\sum_{i=1}^Nw_i}\)
We then calculate the weight assigned to the \(j\)-th weak learner
\(\alpha_j=\eta\log\left(\frac{1-r_j}{r_j}\right)\)
where \(\eta\) is the learning rate. Finally we update the training sample weights for the \(j+1\)-th weak learner
\(w_i\leftarrow\begin{cases}
w_i, &amp; \hat y^{(j)}_i=y_i\\
w_i\exp(\alpha_j), &amp; \hat y^{(j)}_i\neq y_i
\end{cases}\)</p>
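<p>One round of the update above, written out as code (a sketch; the names and the tiny example are my own, with \(\eta=1\)):</p>

```python
import math

def adaboost_round(weights, correct, eta=1.0):
    """weights: current sample weights; correct: bool per sample."""
    total = sum(weights)
    r = sum(w for w, ok in zip(weights, correct) if not ok) / total  # error rate
    alpha = eta * math.log((1 - r) / r)                              # learner weight
    new_w = [w if ok else w * math.exp(alpha)                        # boost mistakes
             for w, ok in zip(weights, correct)]
    return r, alpha, new_w

w = [1.0, 1.0, 1.0, 1.0]
r, alpha, w2 = adaboost_round(w, [True, True, True, False])
```

<p>With one mistake out of four equally weighted samples, \(r_j=\frac14\), \(\alpha_j=\log 3\), and the misclassified sample's weight is tripled while the others stay at 1.</p>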

<h4 id="gradient-boosting">Gradient boosting</h4>
<p><em>Gradient boosting</em> iteratively builds an ensemble of weak learners, where each learner is trained directly to model the previous learners’ errors. Suppose we have built the first \(j\) weak learners. We build the \(j+1\)-th weak learner by training it to predict the residual \(r_j\) of the current ensemble, setting \(h_{j+1}(X)=\hat r_j\) as its estimate of the residual, and then computing the new residual \(r_{j+1}=r_j-h_{j+1}(X)\). In the end the strong learner \(h(X)\) is the sum of all the weak learners \(h_j(X)\).</p>

<p><em>XGBoost (extreme Gradient boosting)</em> is a specific implementation of gradient boosting that is optimized for performance, efficiency, and scalability. So it is very popular.</p>
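<p>The residual-fitting loop is easiest to see on a one-dimensional toy problem. The sketch below (illustrative, not how XGBoost is implemented) uses regression stumps as the weak learners under squared loss: each stump is fit to the current residual, and the residual is then updated.</p>

```python
def fit_stump(x, r):
    """Best single-split predictor of the residual r by squared error."""
    best = None
    for s in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not right:                      # split must leave both sides nonempty
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm

def boost(x, y, rounds=50):
    residual = list(y)                     # r_0 is y itself
    learners = []
    for _ in range(rounds):
        h = fit_stump(x, residual)
        learners.append(h)
        residual = [ri - h(xi) for xi, ri in zip(x, residual)]  # r_{j+1} = r_j - h_{j+1}(x)
    return lambda xi: sum(h(xi) for h in learners)   # strong learner = sum of stumps

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0]
model = boost(x, y)
train_mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(x)
```

<p>Because each stump removes part of the remaining residual, the training error shrinks geometrically over the rounds on this toy data.</p>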

<h3 id="time-series">Time series</h3>

<ol>
  <li>A <em>time series</em> is a sequence of data points \(\{(\mathbf x_t,y_t)\}\) where \(\mathbf x_t\) is a collection of features, \(y_t\) is a numeric variable of interest, and \(t\) stands for time.</li>
  <li>Given a time series
\(\{(\mathbf x_{t_i},y_{t_i})\}_{i=1}^n\), a <em>forecast</em> is
\(y_t=f(\mathbf x_t,t|\{y_\tau\}_{\tau&lt;t})+\epsilon_t\).</li>
  <li>A model for time series is a series of random variables \(\{y_t\}_{t\in T}\), where \(y_t\) only depends on \(\mathbf x_t,t\), and \(\mathbf x_t\) is a collection of features that only depends on \(t\).</li>
</ol>

<h4 id="baseline-forecasting-models">Baseline forecasting models</h4>
<ol>
  <li>without trend nor seasonality
    <ul>
      <li><em>Average forecast</em> assumes \(y_t\) are independent and identically distributed. The forecast \(y_t=\dfrac{1}{n}\sum\limits_{i=1}^ny_i+\epsilon\) takes the historical average.</li>
      <li><em>Naive forecast</em> assumes \(y_t\) is a random walk. The forecast \(y_{t}=y_n+\epsilon\) only uses the last observation.</li>
    </ul>
  </li>
  <li>with trend but not seasonality
    <ul>
      <li>Linear trend forecast assumes \(E(y_t)=\beta t\). The forecast is \(y_t=\hat\beta t+\epsilon\) with \(\hat\beta\) being the average of first differences \(y_{i+1}-y_i\). An intercept term can be added.</li>
      <li>Random walk with drift assumes \(y_{t+1}=y_t+\beta+\epsilon\). The forecast is \(y_t=y_n+\hat\beta(t-n)+\epsilon\) with \(\hat\beta\) being the average of first differences.</li>
    </ul>
  </li>
  <li>with seasonality but not trend
    <ul>
      <li>Seasonal average forecast assumes \(\{y_{r+km}\}_{k}\) are independent and identically distributed for each \(0\leq r&lt;m\). The forecast is
 \(y_t=\dfrac{1}{\lfloor n/m\rfloor+1}\sum\limits_{k=0}^{\lfloor n/m\rfloor}y_{r+km},\quad r=t\mod m\)</li>
      <li>Seasonal naive forecast assumes \(\{y_{r+km}\}_{k}\) are random walks. The forecast is
 \(y_t=y_\tau+\epsilon,\quad \tau=t-\left(\left\lfloor\frac{t-n}{m}\right\rfloor+1\right)m\)</li>
    </ul>
  </li>
</ol>
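<p>The non-seasonal baselines above amount to a few lines of code (a sketch with an illustrative toy history):</p>

```python
history = [10.0, 12.0, 11.0, 13.0, 12.0]

average_forecast = sum(history) / len(history)   # i.i.d. assumption
naive_forecast = history[-1]                     # random-walk assumption

# Random walk with drift: last value plus the average first difference.
drift = sum(b - a for a, b in zip(history, history[1:])) / (len(history) - 1)
drift_forecast_1step = naive_forecast + drift
```
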

<h4 id="stationary-series">Stationary series</h4>
<ol>
  <li>A time series is <em>strictly stationary</em> if \(y_{t_1},\cdots,y_{t_n}\) and \(y_{t_1+\tau},\cdots,y_{t_n+\tau}\) has the same joint probability distribution for any \(n,\tau,t_1,\cdots, t_n\). In particular, we would have
    <ul>
      <li>\(E(y_t)=\mu\) and \(\operatorname{Var}(y_t)=\sigma^2\).</li>
      <li>The joint distribution of \(y_{t_1},\cdots,y_{t_n}\) only depends on \(t_{i+1}-t_i\), these are referred to as the <em>lags</em>.</li>
    </ul>
  </li>
  <li>A time series is <em>stationary</em> if
 \(E(y_t)=\mu,\qquad\operatorname{Cov}(y_t,y_{t+\tau})=\gamma(\tau)\)
 here \(\gamma(\tau)\) is called the <em>autocovariance</em>, and note that \(\operatorname{Var}(y_t)=\gamma(0)=\sigma^2\).</li>
</ol>

<h5 id="examples-1">Examples</h5>
<ol>
  <li><em>White noise</em> is a stationary time series with zero mean, constant variance, and zero correlation between different times.</li>
  <li>The first differences \(y_{t+1}-y_t\) of a random walk \(y_{t+1}=y_t+\epsilon\) (they are white noise).</li>
  <li>A moving average process \(y_t=\beta_0\epsilon_t+\beta_1\epsilon_{t-1}+\cdots+\beta_q\epsilon_{t-q}\).</li>
</ol>

<h5 id="differencing">Differencing</h5>
<p>The \(d\)-th differences \(\nabla^{d}y_t=\nabla^{d-1}y_t-\nabla^{d-1}y_{t-1}\) often produce a stationary series from a non-stationary one.</p>
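<p>A quick illustration (not from the notes): first differences remove a linear trend and second differences remove a quadratic one.</p>

```python
def diff(seq):
    """First differences of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

linear = [2.0 * t for t in range(6)]          # y_t = 2t
quadratic = [float(t * t) for t in range(6)]  # y_t = t^2

d1_linear = diff(linear)              # constant: the trend is removed
d2_quadratic = diff(diff(quadratic))  # constant after two differences
```
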

<h4 id="arima">ARIMA</h4>
<ol>
  <li>A time series is <em>autoregressive</em> (AR) of order \(p\) if
 \(y_t=\alpha_1y_{t-1}+\cdots+\alpha_py_{t-p}+\epsilon_t\)</li>
  <li>A time series is autoregressive of order \(p\) with moving average noise (ARMA) of order \(q\) if
 \(y_t=\alpha_1y_{t-1}+\cdots+\alpha_py_{t-p}+\beta_0\epsilon_t+\beta_1\epsilon_{t-1}+\cdots+\beta_q\epsilon_{t-q}\)</li>
  <li>An autoregressive integrated moving average model (ARIMA(\(p,d,q\))) is a time series whose \(d\)-th difference is an ARMA(\(p,q\)).</li>
</ol>

<h2 id="unsupervised-learning">Unsupervised Learning</h2>

<h3 id="principal-components-analysis-following-scikit-learn">Principal components analysis (Following scikit-learn)</h3>

<p><em>Principal components analysis (PCA)</em> is a <em>dimension reduction</em> algorithm. Its goal is to project the data onto a lower dimensional subspace along directions that maximize variance.</p>

<p>Suppose there are \(N\) observations
\(\{\mathbf x^{(i)}=(x^{(i)}_1,\cdots,x^{(i)}_p)\}_{i=1}^N\)
of \(p\) features \(\mathbf X=(X_1,\cdots,X_p)\), then
\(\mathbb EX_q=\frac{1}{N}\sum_{i=1}^Nx^{(i)}_q, \quad \text{Cov}(X_q,X_r)=\mathbb E[(X_q-\mathbb EX_q)(X_r-\mathbb EX_r)]\)
Denote \(A=\begin{bmatrix}\mathbf x^{(1)}\\\vdots\\\mathbf x^{(N)}\end{bmatrix}\), and \(\bar A\) whose \(q\)-th column consists of only \(\mathbb EX_q\), then the covariance matrix is
\(\Sigma=\text{Cov}(\mathbf X,\mathbf X)=\mathbb E[(\mathbf X-\mathbb E\mathbf X)^T(\mathbf X-\mathbb E\mathbf X)]=\frac{1}{N-1}(A-\bar A)^T(A-\bar A)\)
A heuristic algorithm could be</p>

<ol>
  <li>Center the dataset so that each feature has zero mean \(\iff A\leftarrow A-\bar A\)</li>
  <li>Induction on \(k\). Choose the \(k\)-th weight vector \(\mathbf w^{(k)}=(w^{(k)}_1,\cdots, w^{(k)}_p)^T\in\mathbb R^p\) such that
 \(\|\mathbf w^{(k)}\|=1, \qquad \mathbf w^{(k)}\perp\mathbf w^{(i)},\quad\forall i&lt;k\)
 that maximizes variance
 \(\text{Var}(\mathbf X^T\mathbf w^{(k)})=(\mathbf w^{(k)})^T\text{Var}(\mathbf X)\mathbf w^{(k)}=(\mathbf w^{(k)})^T\Sigma\mathbf w^{(k)}\)</li>
</ol>

<p>This is just <em>singular value decomposition</em> for \(A-\bar A\). Suppose
\(A-\bar A=V^TSW\)
is the singular decomposition, then
\(\Sigma=W^T\frac{S^2}{N-1}W\)</p>

<p>The \(k\)-th principal component of \(\mathbf x^{(i)}\) is \(\mathbf x^{(i)}\cdot\mathbf w^{(k)}\). The <em>explained variances</em> are the diagonal elements in \(\dfrac{S^2}{N-1}\).</p>
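<p>The first PCA step can be sketched by hand on 2-dimensional data (illustrative; real implementations use the SVD directly). Power iteration on the covariance matrix recovers the leading weight vector \(\mathbf w^{(1)}\) and its explained variance.</p>

```python
import math
import random

random.seed(0)
n = 500
# 2-D data stretched along the x-axis, so the first PC should be close to (1, 0).
data = [(random.gauss(0, 3), random.gauss(0, 0.5)) for _ in range(n)]

mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]     # step 1: zero-mean features

cxx = sum(x * x for x, _ in centered) / (n - 1)    # covariance matrix entries
cxy = sum(x * y for x, y in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)

v = (1.0, 1.0)
for _ in range(100):                               # power iteration on Sigma
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(w[0], w[1])
    v = (w[0] / norm, w[1] / norm)

# Variance explained by the leading direction: v^T Sigma v.
explained = cxx * v[0] ** 2 + 2 * cxy * v[0] * v[1] + cyy * v[1] ** 2
```
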

<h3 id="t-distributed-stochastic-neighbor-embedding">\(t\)-distributed stochastic neighbor embedding</h3>

<p><em>\(t\)-distributed stochastic neighbor embedding (tSNE)</em> typically reduces the dimension of the set of \(m\) features down to 2 or 3 for visualization. Suppose \(y_i\) is a low dimensional projection of \(x_i\); we define conditional probabilities</p>

\[p_{j|i}=\frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k\neq i}\exp(-\|x_i-x_k\|^2/2\sigma_i^2)}\]

\[q_{j|i}=\frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k\neq i}(1+\|y_i-y_k\|^2)^{-1}}\]

<p>We set
\(p_{i|i}=q_{i|i}=0\). The \(p_{j|i}\) and \(q_{j|i}\) are expected to be close, so we choose the cost function to be the KL divergence and minimize it over the \(y_i\)’s.</p>

<h4 id="pros--cons-5">Pros &amp; Cons</h4>
<ol>
  <li>Since it is stochastic, it generates slightly different results each time.</li>
  <li>Unlike PCA, it is not reusable for making predictions on new data.</li>
  <li>The magnitude of the distances between clusters shouldn’t be interpreted.</li>
  <li>tSNE results should not be used as statistical evidence or proof of something, and it sometimes can produce clusters that aren’t actually true. Thus it is always a good practice to run it a few times to ensure that the cluster persists.</li>
</ol>

<h3 id="k-means-clustering">\(K\) means clustering</h3>
<p><em>\(K\) means clustering</em> tries to divide a dataset into \(k\) clusters. Start with random guess of \(k\) centroids. Then group all points according to distance to the centroids. Recalculate the centroid as the average of each group. Repeat these steps until you see no change of groups.</p>
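<p>The loop above in a few lines (a bare-bones sketch; the initial centroids are passed in explicitly here, whereas real implementations use random restarts or k-means++):</p>

```python
import math
import random

def kmeans(points, init, iters=100):
    """points, init: lists of coordinate tuples; returns the final centroids."""
    centroids = list(init)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                         # assign to the nearest centroid
            j = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]  # recompute centroids as means
        if new == centroids:                     # no change of groups: stop
            return centroids
        centroids = new
    return centroids

rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.2), rng.gauss(0, 0.2)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.2), rng.gauss(5, 0.2)) for _ in range(50)]
centers = sorted(kmeans(blob_a + blob_b, init=[blob_a[0], blob_b[0]]))
```

<p>On two well-separated blobs the centroids converge to the blob means in a couple of iterations.</p>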

<h4 id="how-to-choose-the-best-k">How to choose the best \(K\)?</h4>
<p>Typically we run the algorithm multiple times, examining the behaviour of the model for different values of \(K\) according to some metric, and then choose the best.</p>
<ol>
  <li>The <em>elbow method</em>. We first calculate the <em>inertia</em> of the resulting clustering, defined to be
 \(\sum_{i=1}^n\operatorname{dist}(X^{(i)},c^{(i)})^2\)
 and then look for the “elbow” in the graph of inertia against \(K\).</li>
  <li>The <em>Silhouette method</em>. The <em>Silhouette score</em> for a data point \(x_i\) with \(i\) in cluster \(I\) is defined to be
 \(\dfrac{b-a}{\max(a,b)}=\begin{cases}
 1-a/b, &amp; a&lt;b\\
 0, &amp; a=b\\
 b/a-1, &amp; a&gt;b
 \end{cases}\)
 where \(a=\dfrac{1}{|I|-1}\sum_{j\in I,j\neq i}d(x_i,x_j)\) is the average distance between \(x_i\) and the other points in \(I\), and \(b=\min\limits_{J\neq I}\dfrac{1}{|J|}\sum_{j\in J}d(x_i,x_j)\) is the minimal average distance between \(x_i\) and the points of some other cluster \(J\neq I\). Note that this score ranges from -1 to 1. The higher the score, the better the clustering.</li>
</ol>

<p>We can use it to generate <em>silhouette plots</em>.</p>

<h3 id="hierarchical-clustering">Hierarchical clustering</h3>
<p><em>Hierarchical clustering</em> starts with each point as its own cluster and works its way up by merging clusters, generating a <em>dendrogram</em>. To decide when to merge clusters, we need a measure of cluster <em>linkage</em>:</p>
<ol>
  <li><em>single linkage</em>. The minimal distance between two points in two clusters.</li>
  <li><em>complete linkage</em>. The maximal distance between two points in two clusters.</li>
  <li><em>centroid linkage</em>. The distance between centroids.</li>
</ol>

<h2 id="neural-networks">Neural Networks</h2>
<p>Start with \(n\) observations with \(m\) features</p>

<h4 id="perceptron">Perceptron</h4>
<p>A <em>perceptron</em> is to a neuron as an artificial neural network is to an actual neural network. With a predefined <em>activation function</em> \(\sigma\) (some nonlinear function) it outputs
\(\hat y=\sigma(w_1x_1+\cdots+w_mx_m+b)=\sigma(\mathbf x\cdot\mathbf w)\)
where augmenting \(\mathbf x\) by 1 and \(\mathbf w\) by \(b\) adds a <em>bias</em> term. The decision boundary is still linear, which is not ideal, so we introduce multilayer neural networks.</p>
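<p>A single perceptron in code (a sketch; the sigmoid is used as the activation \(\sigma\) and the weights are illustrative):</p>

```python
import math

def perceptron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w . x + bias
    return 1 / (1 + math.exp(-z))                  # sigma = sigmoid

out = perceptron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

<p>Here \(z=0.5\cdot1-0.25\cdot2+0=0\), so the output is exactly \(\sigma(0)=0.5\), i.e. the point sits on the decision boundary.</p>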

<h4 id="feed-forward-network-architecture">Feed forward network architecture</h4>
<p>A <em>feed forward network architecture</em> is a multilayered neural network where each layer consists of many perceptrons. Feeding forward through each layer is a matrix multiplication followed by the activation. In terms of equations,
\(h_1=\sigma(W_1\mathbf x),\cdots,\hat y=\sigma(W_{k+1}h_k)\)</p>

<h4 id="backpropagation">Backpropagation</h4>
<p>To adjust the weights in the neural network, we need <em>backpropagation</em>. If we take the loss function to be \(\ell=(\hat y-y)^2\), then \(\nabla\ell\) can be computed using the chain rule, and we update the weights by \(\mathbf w\leftarrow\mathbf w-\eta\nabla\ell(\mathbf w)\). This process runs through the whole set of training points; a complete pass is referred to as an <em>epoch</em>.</p>

<h4 id="convolution-neural-network">Convolution neural network</h4>
<p><em>Convolution layers</em> perform convolutions over the image (a square matrix of data points). A <em>pooling layer</em> with a stride takes the maximum/minimum/average of the points in each window, which downsamples our observations or degrades the image. Then we feed into a fully connected layer. The reason for pooling is that otherwise the computation in the dense layer could be huge.</p>

<p><em>Padding</em> simply adds zeros or constants at the boundary of the image.</p>

<h4 id="recurrent-neural-network">Recurrent neural network</h4>
<p>The set up in equations is
\(h_1=\sigma(W_{xh}X^{(1)}),\cdots,h_t=\sigma(W_{xh}X^{(t)}+W_{hh}h_{t-1}),\hat y^{(t)}=\sigma'(W_{hy}h_t)\)</p>

<h3 id="long-short-term-memory">Long short-term memory</h3>
<p><em>Long short-term memory (LSTM)</em> is an improved RNN model that overcomes the issue of vanishing gradients and captures long term dependencies much better than a plain RNN.</p>

<h3 id="transformer-model">Transformer model</h3>
<ol>
  <li>Input embedding</li>
  <li>Positional embedding</li>
  <li>Multi-head Attention</li>
</ol>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Data Preprocessing]]></summary></entry><entry><title type="html">LLM &amp;amp; Gen AI Notes</title><link href="https://lihaoranicefire.github.io/LLM-GenAI/" rel="alternate" type="text/html" title="LLM &amp;amp; Gen AI Notes" /><published>2023-03-15T00:00:00+00:00</published><updated>2023-03-15T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/LLM-GenAI</id><content type="html" xml:base="https://lihaoranicefire.github.io/LLM-GenAI/"><![CDATA[<h1 id="large-language-model-and-generative-ai">Large Language Model and Generative AI</h1>
<p>A <em>language model</em> is a model that estimates the probability <code class="language-plaintext highlighter-rouge">p(s)</code> of occurrence of a sentence <code class="language-plaintext highlighter-rouge">s</code>.</p>

<h2 id="natural-language-processing">Natural Language Processing</h2>

<h3 id="word-sense-disambiguation-and-uncertainty-in-language">Word sense disambiguation and uncertainty in language</h3>
<ol>
  <li><em>Lexical ambiguity</em>. For example, “silver” can be a noun, an adjective, or a verb.</li>
  <li><em>Syntactic ambiguity</em>. For example, “The man saw the girl with the telescope”. It is ambiguous whether the man saw the girl carrying a telescope or he saw her through his telescope.</li>
  <li><em>Semantic ambiguity</em>. For example, “The car hit the dog while it was moving”. It is unclear whether the dog or the car is moving.</li>
</ol>

<p>There are several potential phases of natural language processing, summarized below.
<img src="https://www.tutorialspoint.com/natural_language_processing/images/phases_or_logical_steps.jpg" alt="NLP phases" /></p>
<ol>
  <li><em>Morphological processing</em> refers to the cognitive mechanisms involved in recognizing and understanding the structure and meaning of words based on their constituent <em>morphemes</em>. Morphemes are the smallest units of meaning in a language, including prefixes, suffixes, roots, and other meaningful elements. For example, a word like “uneasy” can be broken into two sub-word tokens as “un-easy”.</li>
  <li><em>Syntax analysis</em>, also known as <em>parsing</em>, is the process of analyzing the grammatical structure of sentences in natural language to determine their syntactic relationships and properties. It involves breaking down sentences into their constituent parts and representing them according to the rules of a formal grammar.</li>
  <li><em>Semantic analysis</em>, is the process of understanding the meaning of text or speech in natural language. It involves analyzing the semantics, or meaning, of words, phrases, sentences, and larger units of discourse to extract and represent the underlying information.</li>
  <li><em>Pragmatic analysis</em>, studies how language is used in context to convey meaning beyond the literal interpretation of words and sentences. It focuses on the social, cultural, and situational aspects of language use, as well as the intentions and beliefs of speakers and listeners.</li>
</ol>

<p><em>Part-of-speech (POS)</em> tagging is the process of assigning a specific part of speech (such as noun, verb, adjective, etc.) to each word in a given text corpus. POS tagging is essential for understanding the syntactic structure and meaning of a sentence. It helps disambiguate the meaning of words that may have multiple possible interpretations based on their context.</p>

<h3 id="text-normalization">Text normalization</h3>
<p><em>Text normalization</em> refers to the process of transforming text into a canonical, standardized form. Some common techniques are</p>
<ol>
  <li>Lowercasing</li>
  <li>Tokenization</li>
  <li>Removing punctuation, special characters, numbers, stop words</li>
  <li>Stemming</li>
  <li>Lemmatization</li>
  <li>Spelling corrections</li>
  <li>Handling contractions</li>
</ol>

<h2 id="recurrent-neural-network">Recurrent Neural Network</h2>
<p>A <em>recurrent neural network (RNN)</em> is a class of neural networks that discover the sequential nature of the input data. Inputs could be text, speech, time series, etc.</p>

<p>The architecture of the simplest RNN is
\(h_t=\tanh(W_{h}\cdot h_{t-1}+W_{x}\cdot x_{t}+b)\)</p>

<p>Types of RNN</p>
<ul>
  <li>one-to-one: traditional neural network</li>
  <li>one-to-many: music generation</li>
  <li>many-to-one: sentiment classification</li>
  <li>many-to-many (equal): name entity recognition</li>
  <li>many-to-many (unequal): machine translation</li>
</ul>

<p>The loss function is the sum of losses of all time steps.</p>

<p>Due to the number of layers in a deep neural network, the gradients, which by the chain rule are products of many matrices, will shrink exponentially if the factors are small (&lt;1) and will blow up if they are large (&gt;1). This is called the <em>vanishing or exploding gradient problem</em>.</p>

<h2 id="long-short-term-memory">Long Short Term Memory</h2>
<p><em>Long short term memory (LSTM)</em> is a special kind of RNN designed to avoid the long-term dependency problem. All RNNs have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a single <code class="language-plaintext highlighter-rouge">tanh</code> layer, whereas an LSTM has four, interacting as below
\(\begin{cases}
f_t=\sigma(W_{fh}\cdot h_{t-1}+W_{fx}\cdot x_t+b_f) \\
i_t=\sigma(W_{ih}\cdot h_{t-1}+W_{ix}\cdot x_t+b_i) \\
o_t=\sigma(W_{oh}\cdot h_{t-1}+W_{ox}\cdot x_t+b_o) \\
\tilde C_t=\tanh(W_{Ch}\cdot h_{t-1}+W_{Cx}\cdot x_t+b_C) \\
C_t=f_t\cdot C_{t-1}+i_t\cdot \tilde C_t \\
h_t=o_t\cdot \tanh(C_t)
\end{cases}\)
Here <code class="language-plaintext highlighter-rouge">\sigma</code> is the sigmoid function, <code class="language-plaintext highlighter-rouge">f_t, i_t, o_t</code> are the forget, input, output gates, <code class="language-plaintext highlighter-rouge">C_t</code> is the cell state, and <code class="language-plaintext highlighter-rouge">h_t</code> is the hidden state.</p>
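<p>The six equations above can be sketched directly in NumPy; the weight names, hidden size, and input size here are illustrative assumptions:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x_t, W, b):
    """One LSTM step; W holds the eight weight matrices, b the four biases (keys are illustrative)."""
    f = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])        # forget gate
    i = sigmoid(W["ih"] @ h_prev + W["ix"] @ x_t + b["i"])        # input gate
    o = sigmoid(W["oh"] @ h_prev + W["ox"] @ x_t + b["o"])        # output gate
    C_tilde = np.tanh(W["Ch"] @ h_prev + W["Cx"] @ x_t + b["C"])  # candidate cell state
    C = f * C_prev + i * C_tilde                                  # new cell state
    h = o * np.tanh(C)                                            # new hidden state
    return h, C

rng = np.random.default_rng(1)
d_h, d_x = 4, 3  # arbitrary hidden/input sizes
W = {k: rng.normal(size=(d_h, d_h if k.endswith("h") else d_x))
     for k in ["fh", "fx", "ih", "ix", "oh", "ox", "Ch", "Cx"]}
b = {k: np.zeros(d_h) for k in ["f", "i", "o", "C"]}
h, C = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_x), W, b)
```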

<h2 id="gated-recurrent-unit">Gated Recurrent Unit</h2>
<p><em>Gated recurrent unit (GRU)</em> is a variant of LSTM that has a simpler internal structure, and uses gating mechanisms to control and manage the flow of information between cells in the neural network.
\(\begin{cases}
z_t=\sigma(W_{zh}\cdot h_{t-1}+W_{zx}\cdot x_t) \\
r_t=\sigma(W_{rh}\cdot h_{t-1}+W_{rx}\cdot x_t) \\
\tilde h_t=\tanh(W_{hh}\cdot r_t h_{t-1}+W_{hx}\cdot x_t) \\
h_t=(1-z_t)\cdot h_{t-1}+z_t\cdot \tilde h_t \\
\end{cases}\)</p>

<p>Here <code class="language-plaintext highlighter-rouge">z_t</code> is the update gate, <code class="language-plaintext highlighter-rouge">r_t</code> is the reset (relevance) gate, and <code class="language-plaintext highlighter-rouge">\tilde h_t</code> is the candidate hidden state.</p>
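<p>A corresponding NumPy sketch of one GRU step, with illustrative weight names and sizes (biases omitted, as in the equations above):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W):
    """One GRU step following the equations above (weight names are illustrative)."""
    z = sigmoid(W["zh"] @ h_prev + W["zx"] @ x_t)              # update gate
    r = sigmoid(W["rh"] @ h_prev + W["rx"] @ x_t)              # reset (relevance) gate
    h_tilde = np.tanh(W["hh"] @ (r * h_prev) + W["hx"] @ x_t)  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(2)
d_h, d_x = 4, 3  # arbitrary hidden/input sizes
W = {k: rng.normal(size=(d_h, d_h if k.endswith("h") else d_x))
     for k in ["zh", "zx", "rh", "rx", "hh", "hx"]}
h = gru_step(np.zeros(d_h), rng.normal(size=d_x), W)
```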

<p>Other variants include</p>

<table>
  <thead>
    <tr>
      <th>Bidirectional (BRNN)</th>
      <th>Deep (DRNN)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/bidirectional-rnn-ltr.png?e3e66fae56ea500924825017917b464a" alt="BRNN" /></td>
      <td><img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/deep-rnn-ltr.png?f57da6de44ddd4709ad3b696cac6a912" alt="DRNN" /></td>
    </tr>
  </tbody>
</table>

<h2 id="word-representations">Word Representations</h2>
<p>There are two main ways of representing words</p>
<ol>
  <li>1-hot representation, denoted <code class="language-plaintext highlighter-rouge">o_w</code>.</li>
  <li>word embedding, denoted <code class="language-plaintext highlighter-rouge">e_w</code>.</li>
</ol>

<p>The <em>embedding matrix</em> <code class="language-plaintext highlighter-rouge">E</code> such that <code class="language-plaintext highlighter-rouge">e_w = Eo_w</code> can be learnt using target/context likelihood models by defining the conditional probability as
\(p(w_o|w_i)=\frac{\exp(e_{w_o}\cdot e_{w_i})}{\sum_{w\in V}\exp(e_w\cdot e_{w_i})}\)</p>

<h3 id="bow">BOW</h3>
<p>The <em>bag-of-words (BOW)</em> model treats a document as a collection of words, ignoring their order and structure; the representation is the sum of the 1-hot encodings. It is huge and sparse, it disregards semantic meaning, and it cannot handle out-of-vocabulary words.</p>

<p><em>TF-IDF</em> is the product of <em>term frequency (TF)</em> and <em>inverse document frequency (IDF)</em>. TF-IDF helps rank documents based on their relevance to a query. Documents containing rare or distinctive terms (with high TF-IDF scores) are considered more relevant.</p>
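<p>A minimal pure-Python TF-IDF sketch, assuming the common conventions tf = count / document length and idf = log(N / document frequency) (the text does not fix a particular variant):</p>

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF scores per document: tf(t, d) * log(N / df(t))."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()})
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" occurs in every document, so its idf (and TF-IDF) is 0; "cat" is more distinctive.
```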

<p><em><code class="language-plaintext highlighter-rouge">n</code>-grams</em> are contiguous sequences of <code class="language-plaintext highlighter-rouge">n</code> items from a given sequence of text or speech.</p>
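<p>Extracting <code class="language-plaintext highlighter-rouge">n</code>-grams from a token sequence is a one-liner:</p>

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["to", "be", "or", "not", "to", "be"], 2)
# → [("to", "be"), ("be", "or"), ("or", "not"), ("not", "to"), ("to", "be")]
```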

<h3 id="word2vec">Word2Vec</h3>
<p><em>word2vec</em> is a framework for learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include</p>

<ol>
  <li><em>skip-gram</em> maximizes
 \(\frac{1}{T}\sum_{t=1}^T\sum_{-c\leq j\leq c,j\neq0}\log p(w_{t+j}|w_t)\)</li>
  <li><em>continuous bag-of-words (CBOW)</em> maximizes
 \(\frac{1}{T}\sum_{t=1}^T\sum_{-c\leq j\leq c,j\neq0}\log p(w_t|w_{t+j})\)</li>
</ol>

<p>Computing softmax probabilities over the full vocabulary is computationally expensive. Two common ways around this are hierarchical softmax and negative sampling.</p>

<p><em>Hierarchical softmax</em> speeds up the sum in the denominator of the conditional probability with the help of a binary tree structure. The leaf nodes represent words, and internal nodes measure connections between their child nodes. Concretely, let <code class="language-plaintext highlighter-rouge">n(w_o,k)</code> be the node on the unique path from the root to <code class="language-plaintext highlighter-rouge">w_o</code> with <code class="language-plaintext highlighter-rouge">w_o</code> being its <code class="language-plaintext highlighter-rouge">k</code>-th generation descendant, and store with each internal node a weight vector <code class="language-plaintext highlighter-rouge">v_n</code>. We define
\(\begin{align*}
p(n\to\text{left}|w_i)&amp;=\sigma(v_n\cdot e_i)\\
p(n\to\text{right}|w_i)&amp;=1-p(n\to\text{left}|w_i)=\sigma(-v_n\cdot e_i)
\end{align*}\)
and
\(p(w_o|w_i)=\prod_{k}p(n(w_o,k+1)\to n(w_o,k))\)
The internal node embeddings are learned during model training. The tree structure reduces the complexity of estimating the denominator from <code class="language-plaintext highlighter-rouge">O(V)</code> to <code class="language-plaintext highlighter-rouge">O(\log V)</code>.</p>

<p><em>Negative sampling</em> transforms word prediction into a binary classification problem: the model is trained to distinguish between positive (actual context words, label <code class="language-plaintext highlighter-rouge">y=1</code>) and negative (randomly sampled noise, label <code class="language-plaintext highlighter-rouge">y=0</code>) examples. Concretely, we use probabilities
\(\begin{align*}
p(y=1|w_o,w_i)&amp;=\sigma(e_o\cdot e_i)\\
p(y=0|w_o,w_i)&amp;=1-\sigma(e_o\cdot e_i)=\sigma(-e_o\cdot e_i)
\end{align*}\)
Here <code class="language-plaintext highlighter-rouge">\sigma(x)=\dfrac{1}{1+e^{-x}}</code> is the sigmoid function. We define the loss to be
\(\mathcal L=-\sum_{i,o}\left[\log p(y=1|w_o,w_i)+\sum_{w\sim P_n}\log p(y=0|w,w_i)\right]\)
Here <code class="language-plaintext highlighter-rouge">w\sim P_n</code> is a negatively sampled noise word, drawn from the noise distribution <code class="language-plaintext highlighter-rouge">P_n(w)=U(w)^{3/4}/Z</code>, where <code class="language-plaintext highlighter-rouge">U</code> is the unigram (word-frequency) distribution, <code class="language-plaintext highlighter-rouge">Z</code> is a normalizing constant, and the <code class="language-plaintext highlighter-rouge">3/4</code> power flattens the Zipfian frequency distribution toward rarer words.</p>
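<p>A NumPy sketch of the negative-sampling loss for one (input, context) pair, with hypothetical embedding dimensions and word counts:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(e_i, e_o, e_negs):
    """Negative-sampling loss: -[log sigma(e_o . e_i) + sum_k log sigma(-e_k . e_i)]."""
    pos = np.log(sigmoid(e_o @ e_i))                          # label y = 1: true context word
    neg = sum(np.log(sigmoid(-e_n @ e_i)) for e_n in e_negs)  # label y = 0: sampled noise words
    return -(pos + neg)

rng = np.random.default_rng(3)
e_i, e_o = rng.normal(size=8), rng.normal(size=8)  # hypothetical embeddings, dimension 8
loss = neg_sampling_loss(e_i, e_o, rng.normal(size=(5, 8)))  # 5 noise samples

# Noise distribution P_n(w) = U(w)^{3/4} / Z over hypothetical word frequencies
counts = np.array([100.0, 10.0, 1.0])
P_n = counts ** 0.75 / np.sum(counts ** 0.75)
```

<p>Raising the counts to the 3/4 power gives rare words a larger sampling probability than their raw frequency would.</p>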

<h4 id="pros--cons">Pros &amp; Cons</h4>
<ol>
  <li>skip-gram is better suited for rare words because rare words often have unique contexts.</li>
  <li>skip-gram is known for capturing fine-grained semantic relationships between words. Since it learns separate embeddings for each word, which can represent subtle semantic nuances and capture relationships between words that may appear in diverse contexts.</li>
  <li>CBOW is faster to train. Since it aggregates context information from multiple words to predict a single target word. This approach tends to be computationally more efficient, especially for large vocabularies.</li>
  <li>CBOW performs better for frequent words because it averages context vectors. Frequent words tend to occur in various contexts, and CBOW can effectively aggregate this information to learn robust representations for them.</li>
  <li>skip-gram tends to perform better with larger datasets, while CBOW may perform better with smaller datasets.</li>
</ol>

<h3 id="glove">GloVe</h3>
<p><em>GloVe (global vectors)</em> is a word embedding technique that uses a co-occurrence matrix <code class="language-plaintext highlighter-rouge">X</code> where each <code class="language-plaintext highlighter-rouge">X_{ij}</code> denotes the number of times <code class="language-plaintext highlighter-rouge">w_j</code> occurs in the context of <code class="language-plaintext highlighter-rouge">w_i</code>. The co-occurrence probability is defined to be
\(P_{ij}=p(w_j|w_i)=\frac{X_{ij}}{X_i}\)
Here <code class="language-plaintext highlighter-rouge">X_i=\sum_kX_{ik}</code> is the number of occurrence of <code class="language-plaintext highlighter-rouge">w_i</code>. Define
\(F(w_i,w_j,w_k)=\frac{P_{ik}}{P_{jk}}\)
This ratio sheds some light on the correlation of the probe word <code class="language-plaintext highlighter-rouge">w_k</code> with the words <code class="language-plaintext highlighter-rouge">w_i</code> and <code class="language-plaintext highlighter-rouge">w_j</code>. If the ratio is large, then the probe word is related to <code class="language-plaintext highlighter-rouge">w_i</code> but not <code class="language-plaintext highlighter-rouge">w_j</code>, and vice versa; if it is close to 1, then <code class="language-plaintext highlighter-rouge">w_k</code> is likely related to both of <code class="language-plaintext highlighter-rouge">w_i,w_j</code> or to neither. Since we would like linearity in the word embeddings, we expect <code class="language-plaintext highlighter-rouge">F</code> to satisfy
\(F((e_{w_i}-e_{w_j})\cdot e_{w_k})=\frac{F(e_{w_i}\cdot e_{w_k})}{F(e_{w_j}\cdot e_{w_k})}=\frac{P_{ik}}{P_{jk}}\)
The solution would be <code class="language-plaintext highlighter-rouge">F=\exp</code>, so
\(F(e_{w_i}\cdot e_{w_k})=\exp(e_{w_i}\cdot e_{w_k})\)
Hence
\(e_{w_i}\cdot e_{w_k}=\log P_{ik}=\log X_{ik} - \log X_i\)
Since <code class="language-plaintext highlighter-rouge">\log X_i</code> is independent of <code class="language-plaintext highlighter-rouge">k</code> and breaks the symmetry between <code class="language-plaintext highlighter-rouge">i,k</code>, we can add a bias term <code class="language-plaintext highlighter-rouge">b_{w_i}</code> to <code class="language-plaintext highlighter-rouge">e_{w_i}</code> to absorb <code class="language-plaintext highlighter-rouge">-\log X_i</code>, and add <code class="language-plaintext highlighter-rouge">b_{w_k}</code> to <code class="language-plaintext highlighter-rouge">e_{w_k}</code> to restore the symmetry. The cost function can then be defined simply as
\(J(\theta)=\frac{1}{2}\sum_{i,j}f(X_{ij})(e_{w_i}\cdot e_{w_j}+b_{w_i}+b_{w_j}'-\log X_{ij})^2\)
where <code class="language-plaintext highlighter-rouge">f(c)</code> is a weighting function that is non-decreasing and goes to zero as <code class="language-plaintext highlighter-rouge">c\to 0</code>. For example, with some adjustable <code class="language-plaintext highlighter-rouge">c_{\max}</code>
\(f(c)=\begin{cases}
\left(c/c_{\max}\right)^\alpha, &amp;\text{if $c&lt;c_{\max}$}\\
1, &amp;\text{otherwise}
\end{cases}\)
Given the symmetry of <code class="language-plaintext highlighter-rouge">e_{w_i},e_{w_j}</code>, the final word embedding is <code class="language-plaintext highlighter-rouge">\dfrac{e_{w_i}+e_{w_j}}{2}</code>.</p>

<h3 id="perplexity">Perplexity</h3>
<p><em>Perplexity</em> quantifies how surprised the model is when it sees new data. Suppose <code class="language-plaintext highlighter-rouge">s_1,\cdots,s_N</code> are new sentences for testing, each with <code class="language-plaintext highlighter-rouge">m_i</code> words, then the perplexity is defined as
\(PP=\prod_{i=1}^N\left(\frac{1}{p(s_i)}\right)^{1/m_i}=\prod_{i=1}^N\left(\prod_{k=1}^{m_i}\frac{1}{p(w_k|w_0w_1\cdots w_{k-1})}\right)^{1/m_i}\)</p>
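<p>The definition translates directly to code; the sentence probabilities below are made-up numbers for illustration:</p>

```python
def perplexity(sentence_probs, lengths):
    """Perplexity over test sentences: product over i of (1 / p(s_i))^(1 / m_i)."""
    pp = 1.0
    for p, m in zip(sentence_probs, lengths):
        pp *= (1.0 / p) ** (1.0 / m)
    return pp

# Two hypothetical sentences: p(s_1) = 0.25 with 2 words, p(s_2) = 0.0625 with 4 words.
pp = perplexity([0.25, 0.0625], [2, 4])
# → 2 * 2 = 4: the model is as "surprised" as a uniform choice among 4 words per step.
```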

<h2 id="machine-translation">Machine Translation</h2>
<p>A naive approach is to do a <em>greedy search</em>, meaning choosing the most likely next word each time, until the token <code class="language-plaintext highlighter-rouge">&lt;EOS&gt;</code> is selected. This doesn’t necessarily give the best outcome.</p>

<p><em>Beam search</em> keeps the top <code class="language-plaintext highlighter-rouge">N</code> most likely sequences at each step; <code class="language-plaintext highlighter-rouge">N</code> is known as the <em>beam width</em>. The process ends when it meets some stopping criterion, for example when the token <code class="language-plaintext highlighter-rouge">&lt;EOS&gt;</code> is selected.</p>

<p>Beam search tends to favor shorter sequences; to avoid this, <em>length normalization</em> uses the loss function
\(\frac{1}{T^\alpha}\sum_{t=1}^{T}\log p(w_t|w_0,w_1,\cdots,w_{t-1})\)
on the sentence <code class="language-plaintext highlighter-rouge">s=w_0w_1\cdots w_T</code>.</p>
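<p>A self-contained sketch of beam search with length normalization over a hypothetical toy language model (the <code class="language-plaintext highlighter-rouge">next_probs</code> function and its vocabulary are invented for illustration):</p>

```python
import math

def beam_search(next_probs, beam_width, max_len, alpha=0.7):
    """Beam search over a model next_probs(prefix) -> {word: prob};
    finished sequences are ranked by log-prob divided by T^alpha."""
    beams = [((), 0.0)]  # (sequence, total log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for w, p in next_probs(seq).items():
                candidates.append((seq + (w,), lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:beam_width]:  # keep only the top beam_width sequences
            (completed if seq[-1] == "<EOS>" else beams).append((seq, lp))
        if not beams:
            break
    # Length normalization: divide the log-probability by T^alpha.
    return max(completed + beams, key=lambda c: c[1] / len(c[0]) ** alpha)

def next_probs(prefix):  # hypothetical model that can emit at most two "a" tokens
    return {"a": 0.6, "<EOS>": 0.4} if len(prefix) < 2 else {"<EOS>": 1.0}

best, score = beam_search(next_probs, beam_width=2, max_len=5)
# → best is ("a", "a", "<EOS>"): normalization lets the longer sequence win.
```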

<p>When the predicted translation <code class="language-plaintext highlighter-rouge">\hat y</code> is bad, we can perform <em>error analysis</em> against a good translation <code class="language-plaintext highlighter-rouge">y^*</code></p>

<table>
  <thead>
    <tr>
      <th>Case</th>
      <th><code class="language-plaintext highlighter-rouge">p(y^*\|x)&gt;p(\hat y\|x)</code></th>
      <th><code class="language-plaintext highlighter-rouge">p(y^*\|x)\leq p(\hat y\|x)</code></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Root cause</td>
      <td>Beam search faulty</td>
      <td>RNN faulty</td>
    </tr>
    <tr>
      <td>Remedies</td>
      <td>Increase beam width</td>
      <td><ul><li>Try a different architecture</li><li>Regularize</li><li>Get more data</li></ul></td>
    </tr>
  </tbody>
</table>

<p>The <em>bilingual evaluation understudy (bleu)</em> score is a metric that quantifies how good a machine translation is by computing a similarity score
\(\text{bleu score}=(\text{brevity penalty})\times\exp\left(\frac{1}{n}\sum_{k=1}^n\log p_k\right)\)
based on <code class="language-plaintext highlighter-rouge">n</code>-gram precision
\(p_n=\dfrac{\text{number of matching $n$-grams}}{\text{total number of $n$-grams}}\)
and the brevity penalty is a factor that penalizes short translations; for example, it could be
\(\text{brevity penalty}=\begin{cases}
1,&amp;\operatorname{len}(\hat y)\geq \operatorname{len}(y^*)\\
e^{1-\operatorname{len}(y^*)/\operatorname{len}(\hat y)},&amp;\operatorname{len}(\hat y)&lt;\operatorname{len}(y^*)
\end{cases}\)</p>
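<p>A simplified single-reference BLEU sketch following the formulas above (real BLEU clips counts against multiple references and smooths zero matches; this sketch assumes at least one matching <code class="language-plaintext highlighter-rouge">n</code>-gram per order):</p>

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        matches = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        log_precisions.append(math.log(matches / sum(cand.values())))
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu(["the", "cat", "sat"], ["the", "cat", "sat"])
# → 1.0 for a perfect match
```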

<h2 id="seq2seq">Seq2Seq</h2>
<p><em>Sequence to sequence</em> is a popular model used in tasks like machine translation, video captioning, question answering, speech recognition, etc.</p>

<p>It employs an encoder-decoder architecture. Both the encoder and the decoder are LSTM/GRU models. The encoder reads the input sequence and summarizes the information into a <em>context vector</em>. The decoder is initialized with the final state of the encoder, i.e. the context vector of the encoder’s final cell is the input to the first cell of the decoder network.
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*3q4gK4QGQNEkUC3bH5vOqA.jpeg" alt="Encoder-Decoder Architecture" /></p>

<h2 id="transformer">Transformer</h2>

<h3 id="attention">Attention</h3>
<p><img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*EC04ZMiCnLBT3IG0tdU33g.jpeg" alt="QueryKeyValue" /></p>

<ol>
  <li><em>Query</em></li>
  <li><em>Key</em></li>
  <li><em>Value</em></li>
</ol>

<p>An <em>attention</em> function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.</p>
<ol>
  <li>
    <p><em>Scaled dot-product attention</em> computes attention score as
 \(\text{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
 Here queries and keys are of dimension <code class="language-plaintext highlighter-rouge">d_k</code> and values are of dimension <code class="language-plaintext highlighter-rouge">d_v</code>.</p>

    <p>Another most commonly used attention function is <em>additive</em>
 \(\text{Attention}(Q,K,V)=\operatorname{softmax}(V^T\tanh(W[Q,K]+b))\)
 Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.</p>
  </li>
  <li>
    <p><em>Multi-head attention</em>: instead of performing a single attention function with <code class="language-plaintext highlighter-rouge">d_{\text{model}}</code>-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values <code class="language-plaintext highlighter-rouge">h</code> times with different, learned linear projections to <code class="language-plaintext highlighter-rouge">d_k</code>, <code class="language-plaintext highlighter-rouge">d_k</code> and <code class="language-plaintext highlighter-rouge">d_v</code> dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding <code class="language-plaintext highlighter-rouge">d_v</code>-dimensional output values. These are concatenated and once again projected, resulting in the final values.</p>

    <p>Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
 \(\begin{align*}
 \text{MultiHead}(Q,K,V)&amp;=\operatorname{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O\\
 \text{head}_i&amp;=\text{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})
 \end{align*}\)
 Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.</p>
  </li>
</ol>
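<p>Scaled dot-product attention is a few lines of NumPy; the query/key/value shapes below are arbitrary illustrative choices:</p>

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one probability row per query
    return weights @ V, weights

rng = np.random.default_rng(5)
Q = rng.normal(size=(2, 4))  # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))  # 3 keys,    d_k = 4
V = rng.normal(size=(3, 6))  # 3 values,  d_v = 6
out, w = attention(Q, K, V)  # out: one d_v-dimensional vector per query
```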

<p><img src="https://machinelearningmastery.com/wp-content/uploads/2022/03/dotproduct_1.png" alt="AttentionMechanism" /></p>
<h3 id="positional-encoding">Positional encoding</h3>
<p>We want to inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add <em>positional encodings</em> to the input embeddings at the bottoms of the encoder and decoder stacks.
\(\begin{align*}
\text{PE}_{(\text{pos},2i)}&amp;=\sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)\\
\text{PE}_{(\text{pos},2i+1)}&amp;=\cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)
\end{align*}\)
The model easily learns to attend by relative positions, since for any fixed offset <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">\text{PE}_{\text{pos}+k}</code> can be represented as a linear function of <code class="language-plaintext highlighter-rouge">\text{PE}_{\text{pos}}</code>.</p>
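<p>The sinusoidal encoding above can be built in vectorized NumPy (this sketch assumes an even <code class="language-plaintext highlighter-rouge">d_{\text{model}}</code>):</p>

```python
import numpy as np

def positional_encoding(max_pos, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_pos)[:, None]            # (max_pos, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    PE = np.empty((max_pos, d_model))
    PE[:, 0::2] = np.sin(angles)                 # even dimensions
    PE[:, 1::2] = np.cos(angles)                 # odd dimensions
    return PE

PE = positional_encoding(50, 8)
# Row 0 (position 0) is [0, 1, 0, 1, ...] since sin(0) = 0 and cos(0) = 1.
```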

<p>The architecture of the transformer model looks as follows</p>

<p><img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="TransformerModelArchitecture" /></p>

<h2 id="bert">BERT</h2>
<p><em>Bidirectional encoder representations from transformers (BERT)</em> is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create models for a wide range of tasks.
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*cDlhkuE8b8IBadV9vONOmg.png" alt="BERTPreTraining&amp;FineTuning" /></p>

<h3 id="inputoutput-representation">Input/Output representation</h3>
<p>The input, a pair of sentences, is represented as one token sequence. The first token is always <code class="language-plaintext highlighter-rouge">[CLS]</code>, and the final hidden state vector <code class="language-plaintext highlighter-rouge">C</code> corresponding to this token is used as the aggregate sequence representation for classification tasks. We separate the pair with a <code class="language-plaintext highlighter-rouge">[SEP]</code> token. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*D0_sVWpmOSaGCvm6gk9aHA.jpeg" alt="BERTInputEmbedding" /></p>

<h3 id="pre-training">Pre-training</h3>
<ol>
  <li><em>Masked LM (MLM)</em> masks some percentage of the input tokens at random and then predicts those masked tokens. This is more powerful than shallowly concatenating a left-to-right and a right-to-left model.
 The masking percentage is usually taken to be 15%: with too little masking, training is too expensive; with too much masking, there is not enough context.</li>
  <li><em>Next sentence prediction (NSP)</em> trains a binary classifier to predict whether sentence <code class="language-plaintext highlighter-rouge">B</code> is the next sentence after sentence <code class="language-plaintext highlighter-rouge">A</code>. Only the aggregate representation <code class="language-plaintext highlighter-rouge">C</code> is used for this task.</li>
</ol>

<p><img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*TT1uyr3LF0HBW71dA5516g.png" alt="BERTVariousTasks" /></p>

<p><em>Domain adaptation</em> refers to the process of adapting a model trained on data from one domain (source domain) to perform well on data from a different domain (target domain).</p>

<h2 id="gan">GAN</h2>
<p><em>Generative Adversarial Networks (GAN)</em> consist of two neural networks, a <em>generator</em> <code class="language-plaintext highlighter-rouge">G</code> and a <em>discriminator</em> <code class="language-plaintext highlighter-rouge">D</code>, where <code class="language-plaintext highlighter-rouge">D(x)</code> represents the probability that <code class="language-plaintext highlighter-rouge">x</code> comes from the data rather than the generator. Let <code class="language-plaintext highlighter-rouge">z</code> be a random noise vector. We simultaneously train <code class="language-plaintext highlighter-rouge">D</code> to maximize the probability of assigning the correct label to both training samples and samples from <code class="language-plaintext highlighter-rouge">G</code>, and train <code class="language-plaintext highlighter-rouge">G</code> to minimize <code class="language-plaintext highlighter-rouge">\log(1-D(G(z)))</code>. In other words, <code class="language-plaintext highlighter-rouge">D</code> and <code class="language-plaintext highlighter-rouge">G</code> play the following minimax game with value function
\(\min_{G}\max_{D} V(D,G)=E_{x}[\log D(x)] + E_z[\log(1-D(G(z)))]\)</p>

<h2 id="vae">VAE</h2>
<p><em>Variational Autoencoders (VAE)</em></p>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Large Language Model and Generative AI A language model is model that estimating the probability p(s) of occurrence of a sentence s.]]></summary></entry><entry><title type="html">MATH868C 2020Fall Several Complex Variables</title><link href="https://lihaoranicefire.github.io/math868C/" rel="alternate" type="text/html" title="MATH868C 2020Fall Several Complex Variables" /><published>2020-09-01T00:00:00+00:00</published><updated>2020-09-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/math868C</id><content type="html" xml:base="https://lihaoranicefire.github.io/math868C/"><![CDATA[<p>Please see <a href="/files/math868C_2020Fall/math868C_2020Fall.pdf">PDF</a></p>

<iframe src="/files/math868C_2020Fall/math868C_2020Fall.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Please see PDF]]></summary></entry><entry><title type="html">Some LaTeXified old paper</title><link href="https://lihaoranicefire.github.io/posts/2012/08/blog-post-1/" rel="alternate" type="text/html" title="Some LaTeXified old paper" /><published>2020-02-01T00:00:00+00:00</published><updated>2020-02-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/posts/2012/08/latexify-A_naive_guide_to_mixed_Hodge_theory</id><content type="html" xml:base="https://lihaoranicefire.github.io/posts/2012/08/blog-post-1/"><![CDATA[<p>A LaTeXified paper: <em>A naive guide to mixed Hodge theory</em> by Alan H. Durfee</p>

<p>please see <a href="/files/A_naive_guide_to_mixed_Hodge_theory.pdf">PDF</a></p>

<iframe src="/files/A_naive_guide_to_mixed_Hodge_theory.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[A LaTeXified paper: A naive guide to mixed Hodge theory by Alan H. Durfee]]></summary></entry><entry><title type="html">一元五次方程不可解之证明</title><link href="https://lihaoranicefire.github.io/quintic-equation-galois/" rel="alternate" type="text/html" title="一元五次方程不可解之证明" /><published>2020-01-01T00:00:00+00:00</published><updated>2020-01-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/quintic-equation-galois</id><content type="html" xml:base="https://lihaoranicefire.github.io/quintic-equation-galois/"><![CDATA[<p>An exposition of insolvability of the quintic equations, please see <a href="/files/quintic_equation_galois.pdf">PDF</a></p>

<iframe src="/files/quintic_equation_galois.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="exposition" /><summary type="html"><![CDATA[An exposition of insolvability of the quintic equations, please see PDF]]></summary></entry><entry><title type="html">Qualification Examination Solutions</title><link href="https://lihaoranicefire.github.io/umd-quals/" rel="alternate" type="text/html" title="Qualification Examination Solutions" /><published>2019-08-01T00:00:00+00:00</published><updated>2019-08-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/umd-quals</id><content type="html" xml:base="https://lihaoranicefire.github.io/umd-quals/"><![CDATA[<p>A partial solution set to UMD Math Qualification Examinations, please see <a href="/files/umd_quals.pdf">PDF</a></p>

<iframe src="/files/umd_quals.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="exercise" /><summary type="html"><![CDATA[A partial solution set to UMD Math Qualification Examinations, please see PDF]]></summary></entry><entry><title type="html">Undergraduate thesis</title><link href="https://lihaoranicefire.github.io/undergrad-thesis/" rel="alternate" type="text/html" title="Undergraduate thesis" /><published>2018-07-01T00:00:00+00:00</published><updated>2018-07-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/undergrad-thesis</id><content type="html" xml:base="https://lihaoranicefire.github.io/undergrad-thesis/"><![CDATA[<p>My undergraduate thesis, please see <a href="/files/Polynomials_Disguised_in_Different_Senarios.pdf">PDF</a></p>

<iframe src="/files/Polynomials_Disguised_in_Different_Senarios.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="exercise" /><summary type="html"><![CDATA[My undergraduate thesis, please see PDF]]></summary></entry></feed>