<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://lihaoranicefire.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://lihaoranicefire.github.io/" rel="alternate" type="text/html" /><updated>2025-09-06T04:07:55+00:00</updated><id>https://lihaoranicefire.github.io/feed.xml</id><title type="html">Home</title><subtitle>Ph.D in Mathematics, Fixed-Income Quant Researcher at LSEG</subtitle><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><entry><title type="html">Quant interview prep</title><link href="https://lihaoranicefire.github.io/QuantPrep/" rel="alternate" type="text/html" title="Quant interview prep" /><published>2024-10-01T00:00:00+00:00</published><updated>2024-10-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/QuantPrep</id><content type="html" xml:base="https://lihaoranicefire.github.io/QuantPrep/"><![CDATA[<h1 id="quant-prep">Quant Prep</h1>

<ul>
  <li><a href="#quant-prep">Quant Prep</a>
    <ul>
      <li><a href="#brain-teaser">Brain Teaser</a>
        <ul>
          <li><a href="#question">Question:</a></li>
        </ul>
      </li>
      <li><a href="#mathematics">Mathematics</a></li>
      <li><a href="#statistics">Statistics</a></li>
      <li><a href="#finance">Finance</a></li>
      <li><a href="#numeric-analysis">Numeric Analysis</a>
        <ul>
          <li><a href="#optimization">Optimization</a></li>
          <li><a href="#linear-optimization">Linear optimization</a></li>
        </ul>
      </li>
      <li><a href="#coding">Coding</a>
        <ul>
          <li><a href="#master-theorem-for-divide-and-conquer-recurrences">Master theorem for divide-and-conquer recurrences</a></li>
          <li><a href="#binary-search">Binary search</a></li>
          <li><a href="#sorting">Sorting</a></li>
          <li><a href="#heap-priority-queue">Heap (Priority queue)</a></li>
          <li><a href="#bitmask">Bitmask</a></li>
          <li><a href="#random">Random</a></li>
          <li><a href="#graph">Graph</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h2 id="brain-teaser">Brain Teaser</h2>

<h4 id="question">Question:</h4>

<p>One hundred tigers and one sheep are put on a magic island that only has grass. Tigers can eat grass, but they would rather eat the sheep. Assume: A. Only one tiger can eat the sheep, and that tiger itself becomes a sheep after it eats the sheep. B. All tigers are smart and perfectly rational and they want to survive. Will the sheep be eaten?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> Four people, \(A,B,C\) and \(D\), need to get across a river. The only way to cross the river is by an old bridge, which holds at most 2 people at a time. Being dark, they can’t cross the bridge without a torch, of which they only have one. So each pair can only walk at the speed of the slower person. They need to get all of them across to the other side as quickly as possible. \(A\) is the slowest and takes 10 minutes to cross; \(B\) takes 5 minutes; \(C\) takes 2 minutes; and \(D\) takes 1 minute. What is the minimum time to get all of them across to the other side?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> Suppose that you are blindfolded in a room and told that there are 1000 coins on the floor. 980 of the coins have tails up and the other 20 coins have heads up. Can you separate the coins into two piles so as to guarantee both piles have an equal number of heads? Assume that you cannot tell a coin’s side by touching it, but you are allowed to turn over any number of coins.</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> One hundred prisoners are given the chance to be set free tomorrow. They are all told that each will be given a red or blue hat to wear. Each prisoner can see everyone else’s hat but not his own. The hat colors are assigned randomly and once the hats are placed on the top of each prisoner’s head they cannot communicate with one another in any form, or else they are immediately executed. The prisoners will be called out in random order and the prisoner called out will guess the color of his hat. Each prisoner declares the color of his hat so that everyone else can hear it. If a prisoner guesses correctly the color of his hat, he is set free immediately; otherwise he is executed.</p>

<p>They are given the night to come up with a strategy among themselves to save as many prisoners as possible. What is the best strategy they can adopt and how many prisoners can they guarantee to save? What if there are 3 possible hat colors?</p>

<p><strong>Q:</strong> Seven prisoners are given the chance to be set free tomorrow. An executioner will put a hat on each prisoner’s head. Each hat can be one of the seven colors of the rainbow and the hat colors are assigned completely at the executioner’s discretion. Every prisoner can see the hat colors of the other six prisoners, but not his own. They cannot communicate with others in any form, or else they are immediately executed. Then each prisoner writes down his guess of his own hat color. If at least one prisoner correctly guesses the color of his hat, they all will be set free immediately; otherwise they will be executed. They are given the night to come up with a strategy. Is there a strategy with which they can guarantee that they will be set free?</p>

<h2 id="mathematics">Mathematics</h2>

<p><strong>Q:</strong> Can you pack 53 bricks of dimensions \(1\times1\times4\) into a \(6\times6\times6\) box?</p>

<p><strong>Q:</strong> A basketball player is taking 100 free throws. She scores one point if the ball passes through the hoop and zero points if she misses. She has scored on her first throw and missed on her second. For each subsequent throw, the probability of her scoring is the fraction of throws she has made so far. For example, if she has scored 23 points after the 40th throw, the probability that she will score on the 41st throw is 23/40. After 100 throws (including the first and the second), what is the probability that she scores exactly 50 baskets?</p>

<p><strong>A:</strong> Note that any ordering of a fixed number of scores and misses over throws 3–100 has the same probability: the same factors appear in the product, just in a different order. For simplicity, consider the probability that she scores throws 3–51 and misses throws 52–100, which is</p>

\[\frac{1}{2}\cdot\frac{2}{3}\cdot\frac{3}{4}\cdots\frac{49}{50}\cdot\frac{1}{51}\cdot\frac{2}{52}\cdots\frac{49}{99}=\frac{(49!)^2}{99!}\]

<p>The final answer is this probability multiplied by the number of score/miss patterns over throws 3–100 that give exactly 50 scores in total (throw 1 scored and throw 2 missed, so 49 of the remaining 98 throws must be scores). So the answer is</p>

\[\frac{(49!)^2}{99!}\cdot\binom{98}{49}=\frac{1}{99}\]
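<p>The \(1/99\) answer is easy to sanity-check with a quick Monte Carlo (a sketch; the trial count and seed are arbitrary):</p>

```python
import random

def prob_exactly_50(trials=50_000, seed=0):
    """Simulate the reinforced free-throw process and estimate
    P(exactly 50 scores in 100 throws)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        scored, taken = 1, 2              # throw 1 scored, throw 2 missed
        for _ in range(98):               # throws 3..100
            if rng.random() < scored / taken:
                scored += 1
            taken += 1
        hits += (scored == 50)
    return hits / trials
```

<p>The estimate should land near \(1/99\approx0.0101\); in fact the final score count turns out to be uniform on \(\{1,\dots,99\}\).</p>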

<p><strong>Q:</strong> What is the expected number of cards that need to be turned over in a regular 52-card deck in order to see the first ace?</p>

<p><strong>A:</strong> Let \(X_i\) be 1 if non-ace card \(i\) (\(i=1,\dots,48\)) is turned over before all four aces, and 0 otherwise. Among card \(i\) and the four aces, each of the 5 cards is equally likely to appear first, so \(E(X_i)=\frac{1}{5}\). The number of cards turned over up to and including the first ace is \(X=X_1+\cdots + X_{48}+1\), and thus</p>

\[E(X)=1+\sum_iE(X_i)=1+48\cdot\frac{1}{5}=\frac{53}{5}\]
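<p>The \(53/5=10.6\) answer can be confirmed by simulation (a sketch; cards \(0\)–\(3\) stand in for the aces):</p>

```python
import random

def avg_cards_to_first_ace(trials=100_000, seed=1):
    """Average 1-based position of the first ace in a shuffled 52-card deck."""
    rng = random.Random(seed)
    deck = list(range(52))
    total = 0
    for _ in range(trials):
        rng.shuffle(deck)
        total += next(i for i, c in enumerate(deck) if c < 4) + 1
    return total / trials
```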

<p><strong>Q:</strong> There are \(N\) distinct types of coupons in cereal boxes and each type, independent of prior selections, is equally likely to be in a box.</p>

<ol>
  <li>If a child wants to collect a complete set of coupons with at least one of each type, how many coupons(boxes) on average are needed to make such a complete set?</li>
  <li>If the child has collected \(n\) coupons, what is the expected number of distinct coupon types?</li>
</ol>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> You just had two dice custom-made. Instead of the numbers 1–6, you place single-digit numbers on the faces of each die so that every morning you can arrange the dice in a way as to make the two front faces show the current day of the month. You must use both dice (in other words, days 1–9 must be shown as 01–09), but you can switch the order of the dice if you want. What numbers do you have to put on the six faces of each of the two dice to achieve that?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> A sultan has captured 50 wise men. He has a glass currently standing bottom down. Every minute he calls one of the wise men, who can choose either to turn it over (set it upside down or bottom down) or to do nothing. The wise men will be called randomly, possibly an infinite number of times. When someone called to the sultan correctly states that all wise men have already been called to the sultan at least once, everyone goes free. But if his statement is wrong, the sultan puts everyone to death. The wise men are allowed to communicate only once before they are imprisoned in separate rooms (one per room). Design a strategy that lets the wise men go free.</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> You are holding two glass balls in a 100-story building. If a ball is thrown out of the window, it will not break if the floor number is less than \(X\), and it will always break if the floor number is equal to or greater than \(X\). You would like to determine \(X\). What is the strategy that will minimize the number of drops in the worst case?</p>

<p><strong>Q:</strong> At a theater ticket office, \(2n\) people are waiting to buy tickets. \(n\) of them have only \(\$ 5\) bills and the other \(n\) people have only \(\$ 10\) bills. The ticket seller has no change to start with. If each person buys one \(\$ 5\) ticket, what is the probability that all people will be able to buy tickets without having to change positions?</p>

<p><strong>Q:</strong> Assume you have a fair coin. What is the expected number of coin tosses to get \(n\) heads in a row?</p>

<p><strong>Q:</strong> The Boston Red Sox and the Colorado Rockies are playing in the World Series finals. In case you are not familiar with the World Series, there are a maximum of 7 games and the first team that wins 4 games claims the championship. You have \(\$ 100\) to place a double-or-nothing bet on the Red Sox. Unfortunately, you can only bet on each individual game, not the series as a whole. How much should you bet on each game so that if the Red Sox win the whole series, you win exactly \(\$ 100\), and if the Red Sox lose, you lose exactly \(\$ 100\)?</p>

<p><strong>Q:</strong> A casino comes up with a fancy dice game. It allows you to roll a die as many times as you want until a 6 appears. After each roll, if 1 appears, you will win \(\$ 1\); if 2 appears, you will win \(\$ 2\); …; if 5 appears, you win \(\$ 5\); but if 6 appears all the money you have won in the game is lost and the game stops. After each roll, if the die shows 1–5, you can decide whether to keep the money or keep on rolling. How much are you willing to pay to play the game (if you are risk neutral)?</p>

<p><strong>Q:</strong> Suppose \(X\) is a Brownian motion with no drift, i.e. \(dX_t=dW_t\). If \(X\) starts at 0, what is the probability that \(X\) hits 3 before hitting -5? What if \(X\) has drift \(m\), i.e. \(dX_t=mdt+dW_t\)?</p>

<p><strong>Q:</strong> A couple decide to start having children and keep having children until they have more girls than boys. How many children do they expect to have?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> Consider a shuffled deck of 52 cards. How many cards on average do you need to draw before you draw a King?</p>

<p><strong>Q:</strong> Suppose \(S_n\) is a biased random walk with probability \(p&lt;1/2\) of moving up and \(1-p\) of moving down, with \(S_0=0\). What is the expected number of steps for \(S_n\) to reach \(\alpha\) or \(-\beta\) (\(\alpha,\beta\in\mathbb Z_{\geq1}\))?</p>

<p><strong>A:</strong> First we verify two martingales, conditioning on \(S_n=s_n\):</p>

\[\begin{align*}
\mathbb E\left[S_{n+1}+(1-2p)(n+1)\,\middle|\,S_n=s_n\right]&amp;=\mathbb E\left[S_{n+1}\,\middle|\,S_n=s_n\right]+(1-2p)(n+1)\\
&amp;=p(s_n+1)+(1-p)(s_n-1)+(1-2p)(n+1)\\
&amp;=s_n+(1-2p)n
\end{align*}\]

<p>and</p>

\[\begin{align*}
\mathbb E\left[\left(\frac{1-p}{p}\right)^{S_{n+1}}\,\middle|\,S_n=s_n\right]&amp;=p\left(\frac{1-p}{p}\right)^{s_n+1}+(1-p)\left(\frac{1-p}{p}\right)^{s_n-1}\\
&amp;=p\left(\frac{1-p}{p}\right)\left(\frac{1-p}{p}\right)^{s_n}+(1-p)\left(\frac{p}{1-p}\right)\left(\frac{1-p}{p}\right)^{s_n}\\
&amp;=\left(\frac{1-p}{p}\right)^{s_n}
\end{align*}\]

<p>Let \(p_\alpha\) be the probability that \(S_n\) reaches \(\alpha\) before \(-\beta\), and let \(T\) be the stopping time. By the optional stopping theorem,</p>

\[1=\left(\frac{1-p}{p}\right)^0=\mathbb E\left[\left(\frac{1-p}{p}\right)^{S_{T}}\right]=p_\alpha\left(\frac{1-p}{p}\right)^\alpha+(1-p_\alpha)\left(\frac{1-p}{p}\right)^{-\beta}\]

<p>From which we get</p>

\[p_\alpha=\frac{1-\left(\frac{1-p}{p}\right)^{-\beta}}{\left(\frac{1-p}{p}\right)^\alpha-\left(\frac{1-p}{p}\right)^{-\beta}}\]

<p>On the other hand, we also have</p>

\[0=\mathbb E\left[S_T+(1-2p)T\right]=\mathbb E\left[S_T\right]+(1-2p)\mathbb E\left[T\right]=p_\alpha\alpha+(1-p_\alpha)\cdot(-\beta)+(1-2p)\mathbb E\left[T\right]\]

<p>From which we can deduce</p>

\[\mathbb E\left[T\right]=\frac{1}{1-2p}\left(\frac{\left(\frac{1-p}{p}\right)^\alpha-1}{\left(\frac{1-p}{p}\right)^\alpha-\left(\frac{1-p}{p}\right)^{-\beta}}\beta - \frac{1-\left(\frac{1-p}{p}\right)^{-\beta}}{\left(\frac{1-p}{p}\right)^\alpha-\left(\frac{1-p}{p}\right)^{-\beta}}\alpha\right)\]
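<p>The optional-stopping identity \(\mathbb E[T]=\dfrac{(1-p_\alpha)\beta-p_\alpha\alpha}{1-2p}\) can be checked numerically (a sketch; the parameter values are arbitrary):</p>

```python
import random

def expected_hitting_time(p, alpha, beta):
    """E[T] for the biased walk (p < 1/2) via the two martingales above."""
    r = (1 - p) / p
    p_alpha = (1 - r ** (-beta)) / (r ** alpha - r ** (-beta))
    return ((1 - p_alpha) * beta - p_alpha * alpha) / (1 - 2 * p)

def simulated_hitting_time(p, alpha, beta, trials=20_000, seed=2):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = steps = 0
        while -beta < s < alpha:
            s += 1 if rng.random() < p else -1
            steps += 1
        total += steps
    return total / trials
```

<p>A quick sanity check: with \(\alpha=\beta=1\) the walk stops in exactly one step, so \(\mathbb E[T]=1\).</p>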

<p><strong>Q:</strong> You play a game with a biased coin where there is a 40% chance of heads and 60% chance of tails. You may place a bet; If heads is flipped then you receive your bet back plus the same in winnings. If tails is flipped then you lose your bet. You have \(\$ 10\) and you want to turn this into \(\$ 20\) by continuously betting \(\$ 1\) at a time, walking away when you either have a total of \(\$ 20\) or are bankrupt. What is the probability you will leave with \(\$ 20\)?</p>
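<p>This is exactly the hitting probability \(p_\alpha\) derived above, with \(p=0.4\) and \(\alpha=\beta=10\). A minimal sketch (the function name is ours):</p>

```python
def win_prob(p, stake, target, bet=1):
    """P(reach `target` before 0) betting `bet` per round with win prob p != 1/2."""
    r = (1 - p) / p                    # the ratio from the exponential martingale
    alpha = (target - stake) // bet    # net wins needed
    beta = stake // bet                # net losses allowed
    return (1 - r ** (-beta)) / (r ** alpha - r ** (-beta))
```

<p>For \(p=0.4\) this gives roughly \(1.7\%\): the house edge compounds badly over many small bets.</p>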

<p>Properties of characteristic functions \(\varphi_X(t)=E[e^{itX}]\)</p>

<ul>
  <li>If \(X\sim\mathcal N(\mu,\sigma^2)\), \(\varphi_X(t)=e^{it\mu-\frac{1}{2}\sigma^2t^2}\)</li>
  <li>
\[E[X^k]=\frac{1}{i^k}\varphi^{(k)}_X(0)\]
  </li>
  <li>\(\varphi_{c_1X_1+\cdots+c_nX_n+b}(t)=e^{itb}\varphi_{X_1}(c_1t)\cdots\varphi_{X_n}(c_nt)\) if \(X_i\) are independent</li>
  <li>\(\varphi_{X,Y}(s,t)=\varphi_X(s)\varphi_Y(t)\) if \(X,Y\) are independent</li>
</ul>

<p>Properties of Fourier transform</p>

<ul>
  <li>
\[\hat f(-\xi)=\overline{\hat f(\xi)}\]
  </li>
  <li>
\[\widehat{af(x)+bg(x)}=a\hat f(\xi)+b\hat g(\xi)\]
  </li>
  <li>
\[\widehat{f(x-a)}=\hat f(\xi)e^{-2\pi ia\xi}\]
  </li>
  <li>
\[\widehat{f(x)e^{2\pi iax}}=\hat f(\xi-a)\]
  </li>
  <li>
\[\widehat{f(ax)}=\frac{1}{|a|}\hat f(\frac{\xi}{a})\]
  </li>
  <li>
\[\widehat{\hat f(\xi)}=f(-x)\]
  </li>
  <li>
\[\widehat{(f*g)(x)}=\hat f(\xi)\hat g(\xi)\]
  </li>
  <li>
\[\widehat{f(x)g(x)}=(\hat f*\hat g)(\xi)\]
  </li>
  <li>
\[\widehat{f^{(n)}(x)}=(2\pi i\xi)^n\hat f(\xi)\]
  </li>
  <li>
\[\widehat{x^nf(x)}=(\frac{i}{2\pi})^n\hat f^{(n)}(\xi)\]
  </li>
</ul>

<p>Common distributions</p>

<ul>
  <li>binomial distribution</li>
  <li>geometric distribution</li>
  <li>negative binomial distribution</li>
  <li>Poisson distribution</li>
  <li>Poisson process</li>
  <li>Exponential distribution</li>
</ul>

<p><strong>Q:</strong> What is the law of large numbers?</p>

<p><strong>A:</strong></p>

<p>Weak form: \(\dfrac{(X_1+\cdots+X_n)}{n}\xrightarrow{P}\mu\)</p>

<p>Proof: Chebyshev’s inequality:
\(P(|\bar X - \mu|\geq\epsilon)\leq\dfrac{\sigma^2}{n\epsilon^2}\)</p>

<p>Strong form (needs \(E[X_i^4]&lt;\infty\)):
\(\dfrac{(X_1+\cdots+X_n)}{n}\xrightarrow{\text{a.s.}}\mu\)</p>

<p>Proof: Assume \(\mu=0\). Markov’s inequality applied to \(S_n^4\) gives
\(P(|S_n|\geq n\epsilon)\leq\dfrac{E[S_n^4]}{(n\epsilon)^4}\leq\dfrac{C}{\epsilon^4n^2}\),
which is summable in \(n\), so the Borel–Cantelli lemma finishes the proof.</p>
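<p>A tiny empirical illustration of the law of large numbers (a sketch with \(X_i\sim U(0,1)\), so \(\mu=1/2\)):</p>

```python
import random

def running_mean(n, seed=0):
    """Mean of n i.i.d. Uniform(0,1) draws; should approach mu = 1/2."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n
```

<p>By Chebyshev, the typical deviation shrinks like \(\sigma/\sqrt n\) with \(\sigma^2=1/12\).</p>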

<p><strong>Q:</strong> What is the central limit theorem? Does it imply the law of large numbers?</p>

<p><strong>A:</strong> \(X_i\) are i.i.d. with \(E[X_i]=\mu\), \(Var[X_i]=\sigma^2\)</p>

\[Z_n=\frac{\dfrac{X_1+\cdots+X_n}{n}-\mu}{\sigma/\sqrt n}\xrightarrow{\mathcal D}\mathcal N(0,1)\]

<p>Note that \(Z_n=\sum_i\dfrac{1}{\sqrt n}Y_i\), where \(Y_i=\dfrac{X_i-\mu}{\sigma}\). Then apply characteristic functions: \(\varphi_{Z_n}(t)=\varphi_{Y}\!\left(\dfrac{t}{\sqrt n}\right)^n=\left(1-\dfrac{t^2}{2n}+o\!\left(\dfrac{1}{n}\right)\right)^n\to e^{-t^2/2}\), the characteristic function of \(\mathcal N(0,1)\). Under its own hypotheses the CLT does imply the weak law, since \(\bar X_n-\mu=\frac{\sigma}{\sqrt n}Z_n\xrightarrow{P}0\); but the law of large numbers holds more generally (it needs only a finite mean).</p>
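<p>A quick empirical check of the CLT (a sketch with uniform summands; seed and sizes are arbitrary): about \(95\%\) of the standardized means should fall within \(\pm1.96\).</p>

```python
import math
import random

def standardized_mean(n, rng):
    """Z_n for X_i ~ Uniform(0,1), where mu = 1/2 and sigma^2 = 1/12."""
    mean = sum(rng.random() for _ in range(n)) / n
    return (mean - 0.5) / (math.sqrt(1 / 12) / math.sqrt(n))

def coverage(z=1.96, trials=5_000, n=200, seed=3):
    rng = random.Random(seed)
    return sum(abs(standardized_mean(n, rng)) < z for _ in range(trials)) / trials
```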

<p><strong>Q:</strong> How do you generate a random variable that follows \(\mathcal N(\mu,\sigma^2)\)?</p>
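<p><strong>A:</strong> One standard method is the Box–Muller transform: from two independent \(U(0,1)\) draws, \(Z=\sqrt{-2\ln U_1}\cos(2\pi U_2)\sim\mathcal N(0,1)\), then set \(X=\mu+\sigma Z\). A minimal sketch:</p>

```python
import math
import random

def normal_box_muller(mu=0.0, sigma=1.0, rng=random):
    """One N(mu, sigma^2) draw via Box-Muller.
    Using 1 - random() keeps the log argument in (0, 1]."""
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z
```

<p>Alternatives include inverse-transform sampling with the normal quantile function and the polar (Marsaglia) variant, which avoids the trig calls.</p>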

<p><strong>Q:</strong> Variance reduction techniques to improve the efficiency of Monte Carlo simulation</p>

<ul>
  <li>Low-discrepancy (quasi-random) sequences, e.g. Sobol or Halton points, which fill the unit cube more evenly than i.i.d. uniform draws</li>
</ul>
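<p>Beyond low-discrepancy sequences, another standard technique is antithetic variates: pair each draw \(U\) with \(1-U\) so the two evaluations are negatively correlated. A sketch estimating \(E[e^U]=e-1\):</p>

```python
import math
import random

def mc_antithetic(f, n, rng):
    """Antithetic-variates Monte Carlo estimate of E[f(U)], U ~ Uniform(0,1)."""
    total = 0.0
    for _ in range(n // 2):
        u = rng.random()
        total += f(u) + f(1.0 - u)   # negatively correlated pair
    return total / (2 * (n // 2))
```

<p>Other common tools: control variates, importance sampling, and stratified sampling.</p>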

<h2 id="statistics">Statistics</h2>

<p>t distribution: for an i.i.d. normal sample, \(\dfrac{\bar X-\mu}{s/\sqrt n}\sim t_{n-1}\); in a regression with \(p\) predictors the analogous statistic has \(n-p-1\) degrees of freedom</p>

<p>chi-squared distribution:</p>

<ul>
  <li>Suppose \(Z_1,\dots,Z_k\sim\mathcal N(0,1)\) are independent, then \(\sum_{i=1}^k Z_i^2\sim\chi^2_k\)</li>
  <li>
\[\sum_{i=1}^n(X_i-\bar X)^2\sim\sigma^2\chi^2_{n-1}\]
  </li>
  <li>
\[\sum_{i=1}^n(X_i-\hat X_i)^2\sim\sigma^2\chi^2_{n-p-1}\]
  </li>
</ul>

<p>F distribution</p>

<p><strong>Q:</strong> What are skewness and kurtosis?</p>

<p><strong>A:</strong> They are the third and fourth standardized moments \(\tilde\mu_3\) and \(\tilde\mu_4\), where \(\tilde\mu_n=\mathbb E\left[\left(\frac{X-\mu}{\sigma}\right)^n\right]\)</p>
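<p>In code these are just sample averages of standardized powers (a sketch using the population-style \(1/n\) normalization, not the bias-corrected one):</p>

```python
import math

def standardized_moment(xs, k):
    """k-th standardized sample moment: k=3 gives skewness, k=4 kurtosis."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(((x - mu) / sigma) ** k for x in xs) / n
```

<p>For a normal sample, skewness is near 0 and kurtosis near 3 (excess kurtosis subtracts the 3).</p>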

<h2 id="finance">Finance</h2>

<p>Let’s denote</p>

<ul>
  <li>\(T\): maturity date</li>
  <li>\(t\): the current time</li>
  <li>\(\tau=T-t\): time to maturity</li>
  <li>\(S\): stock price at time \(t\)</li>
  <li>\(r\): continuous risk-free interest rate</li>
  <li>\(y\): continuous dividend yield</li>
  <li>\(\sigma\): annualized asset volatility</li>
  <li>\(c,C,p,P\): prices of a European call, American call, European put, and American put</li>
  <li>\(D\): present value of future dividends</li>
  <li>\(K\): strike price</li>
  <li>\(PV\): present value</li>
</ul>

<p><strong>Q:</strong> How do vanilla European/American option prices change when \(S,K,\tau,\sigma,r,D\) change?</p>

<p><strong>A:</strong></p>

<table>
  <thead>
    <tr><th></th><th>\(c\)</th><th>\(p\)</th><th>\(C\)</th><th>\(P\)</th></tr>
  </thead>
  <tbody>
    <tr><td>\(S \uparrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td></tr>
    <tr><td>\(K \uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td></tr>
    <tr><td>\(\tau \uparrow\)</td><td>?</td><td>?</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td></tr>
    <tr><td>\(\sigma \uparrow\)</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td><td>\(\uparrow\)</td></tr>
    <tr><td>\(r \uparrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td></tr>
    <tr><td>\(D \uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td><td>\(\downarrow\)</td><td>\(\uparrow\)</td></tr>
  </tbody>
</table>

<p><strong>Q:</strong> Explain put-call parity</p>

<p><strong>A:</strong> With a continuous dividend yield \(y\): \(c-p=Se^{-y\tau}-Ke^{-r\tau}\). With discrete dividends of present value \(D\): \(c-p=(S-D)-Ke^{-r\tau}\). The two forms agree when \(S-D=Se^{-y\tau}\), i.e. when the yield and the discrete dividends are equivalent.</p>

<p><strong>Q:</strong> Why should you never exercise an American call on a non-dividend-paying stock before maturity?</p>

<p><strong>A:</strong></p>

<p><strong>Q:</strong> A European put option on a non-dividend-paying stock with strike price \(\$ 80\) is currently priced at \(\$ 8\) and a put option on the same stock with strike price \(\$ 90\) is priced at \(\$ 9\). Is there an arbitrage opportunity in these two options?</p>

<p><strong>Q:</strong> What are return on risk and the Sharpe ratio?</p>

<p><strong>A:</strong> \(\dfrac{r}{\sigma}\) and \(\dfrac{r-r_f}{\sigma}\)</p>

<p><strong>Q:</strong> Derive Black-Scholes-Merton differential equation</p>

<p><strong>A:</strong> Suppose the stock price \(S\) follows a geometric Brownian motion (so \(S\) is log-normal)</p>

\[dS=\mu Sdt+\sigma SdW\]

<p>and \(V=V(S,t)\) is a derivative, then by Itô’s lemma</p>

\[dV=\left(\frac{\partial V}{\partial t}+\mu S\frac{\partial V}{\partial S}+\frac{1}{2}\sigma^2S^2\frac{\partial^2 V}{\partial S^2}\right)dt+\sigma S\frac{\partial V}{\partial S}dW\]

<p>Consider portfolio \(\Pi=V-\frac{\partial V}{\partial S}S\), then</p>

\[d\Pi=dV-\frac{\partial V}{\partial S}dS=\left(\frac{\partial V}{\partial t}+\sigma^2S^2\frac{1}{2}\frac{\partial^2 V}{\partial S^2}\right)dt\]

<p>The diffusion term has cancelled, so this portfolio is instantaneously riskless and must earn the risk-free rate: \(d\Pi=r\left(V-\dfrac{\partial V}{\partial S}S\right)dt\). Therefore</p>

\[\frac{\partial V}{\partial t}+rS\frac{\partial V}{\partial S}+\frac{1}{2}\sigma^2S^2\frac{\partial^2 V}{\partial S^2}=rV\]

<p>This is the Black-Scholes-Merton differential equation.</p>

<p><strong>Q:</strong> Explain Feynman-Kac theorem (see <a href="https://math.nyu.edu/~kohn/pde.finance/2015/section1.pdf">https://math.nyu.edu/~kohn/pde.finance/2015/section1.pdf</a>)</p>

<p><strong>A:</strong> Suppose</p>

\[dX_t=\mu(X_t,t)dt+\sigma(X_t,t)dW_t\]

<p>And</p>

\[u(x,t)=\mathbb E\left[e^{-\int_t^Tr(X_\tau,\tau)d\tau}\phi(X_T)+\int_t^Te^{-\int_t^\tau r(X_s,s)ds}f(X_\tau,\tau)\middle|X_t=x\right]\]

<p>Where</p>

<ul>
  <li>\(\mu\) is the mean return rate</li>
  <li>\(\sigma\) is the volatility</li>
  <li>\(r\) is the interest rate</li>
  <li>\(X_t\) is the underlying state process (e.g. the asset price) at time \(t\)</li>
  <li>\(W_t\) is a Wiener process</li>
  <li>\(f\) is the running payoff</li>
  <li>\(\phi\) is the final time payoff</li>
</ul>

<p>Then \(u\) solves</p>

\[\frac{\partial u}{\partial t}(x,t)+\mu(x,t)\frac{\partial u}{\partial x}(x,t)+\frac{1}{2}\sigma(x,t)^2\frac{\partial^2 u}{\partial x^2}(x,t)-r(x,t)u(x,t)+f(x,t)=0,\quad u(x,T)=\phi(x)\]

<p><strong>Q:</strong> Explain the solution to Black-Scholes-Merton equation (see <a href="https://www.math.cmu.edu/~gautam/sj/teaching/2016-17/944-scalc-finance1/pdfs/ch4-rnm.pdf">https://www.math.cmu.edu/~gautam/sj/teaching/2016-17/944-scalc-finance1/pdfs/ch4-rnm.pdf</a>)</p>

<p><strong>A:</strong> The Black-Scholes-Merton prices of a European call and put are</p>

\[c = Se^{-y\tau}N(d_+) - Ke^{-r\tau}N(d_-),\quad p = Ke^{-r\tau}N(-d_-) - Se^{-y\tau}N(-d_+)\]

<p>Where \(d_{\pm}=\dfrac{\ln(\frac{S}{K})+(r-y\pm\frac{\sigma^2}{2})\tau}{\sigma\sqrt\tau}\)</p>
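<p>A direct transcription of these formulas (a sketch using only the standard library; \(y\) is the dividend yield):</p>

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_prices(S, K, tau, sigma, r, y=0.0):
    """European call and put under Black-Scholes-Merton."""
    sqrt_tau = math.sqrt(tau)
    d_plus = (math.log(S / K) + (r - y + 0.5 * sigma ** 2) * tau) / (sigma * sqrt_tau)
    d_minus = d_plus - sigma * sqrt_tau
    c = S * math.exp(-y * tau) * norm_cdf(d_plus) - K * math.exp(-r * tau) * norm_cdf(d_minus)
    p = K * math.exp(-r * tau) * norm_cdf(-d_minus) - S * math.exp(-y * tau) * norm_cdf(-d_plus)
    return c, p
```

<p>Put-call parity \(c-p=Se^{-y\tau}-Ke^{-r\tau}\) holds by construction, which makes a convenient unit test.</p>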

<p><strong>Q:</strong> Explain the Greek letters</p>

<p><strong>A:</strong></p>

<ul>
  <li>Delta \(\Delta\): partial derivative with respect to \(S\)
\(\frac{\partial c}{\partial S}=e^{-y\tau}N(d_+),\quad\frac{\partial p}{\partial S}=-e^{-y\tau}N(-d_+)\)</li>
  <li>Gamma \(\Gamma\): second partial derivative with respect to \(S\)
\(\frac{\partial^2 c}{\partial S^2}=\frac{\partial^2 p}{\partial S^2}=\frac{e^{-y\tau}N'(d_+)}{S\sigma\sqrt\tau}\)</li>
  <li>Theta \(\Theta\): partial derivative with respect to \(t\)
\(\frac{\partial c}{\partial t}=-\frac{\partial c}{\partial\tau}=-\frac{Se^{-y\tau}N'(d_+)\sigma}{2\sqrt\tau}+ySe^{-y\tau}N(d_+)-rKe^{-r\tau}N(d_-)\), and for the put
\(\frac{\partial p}{\partial t}=-\frac{Se^{-y\tau}N'(d_+)\sigma}{2\sqrt\tau}-ySe^{-y\tau}N(-d_+)+rKe^{-r\tau}N(-d_-)\)</li>
  <li>Vega \(\nu\): partial derivative with respect to \(\sigma\)
\(\frac{\partial c}{\partial \sigma}=\frac{\partial p}{\partial \sigma}=Se^{-y\tau}N'(d_+)\sqrt\tau\)</li>
  <li>Rho \(\rho\): partial derivative with respect to \(r\)
\(\frac{\partial c}{\partial r}=K\tau e^{-r\tau}N(d_-),\quad\frac{\partial p}{\partial r}=-K\tau e^{-r\tau}N(-d_-)\)</li>
</ul>

<p>We need the following straightforward yet useful identity</p>

\[Se^{-y\tau}N'(d_+) = Ke^{-r\tau}N'(d_-)\]
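<p>This identity is exact (take logs of both sides and use \(d_+^2-d_-^2=2\bigl(\ln\frac{S}{K}+(r-y)\tau\bigr)\)), and it is easy to confirm numerically (a sketch):</p>

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def identity_gap(S, K, tau, sigma, r, y=0.0):
    """Should be ~0:  S e^{-y tau} N'(d+)  minus  K e^{-r tau} N'(d-)."""
    d_plus = (math.log(S / K) + (r - y + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d_minus = d_plus - sigma * math.sqrt(tau)
    return S * math.exp(-y * tau) * norm_pdf(d_plus) - K * math.exp(-r * tau) * norm_pdf(d_minus)
```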

<h2 id="numeric-analysis">Numeric Analysis</h2>

<h3 id="optimization">Optimization</h3>

<p>Suppose the problem is to minimize \(f(x)\) subject to \(g_i(x)\leq 0\) and \(h_j(x)=0\).</p>

<p>Consider the Lagrange function</p>

\[L(x,\lambda_i, \mu_j) = f(x) + \sum_i\lambda_ig_i(x) + \sum_j\mu_jh_j(x)\]

<p>For a feasible \(x\), \(\sup_{\lambda_i\geq0,\,\mu_j}L(x,\lambda_i,\mu_j)=f(x)\) (and the supremum is \(+\infty\) at infeasible points), so a minimizer of \(f\) coincides with a minimizer of</p>

\[\max_{\lambda_i\geq0,\,\mu_j}L(x,\lambda_i, \mu_j)\]

<h3 id="linear-optimization">Linear optimization</h3>

<p>Consider maximizing \(\mathbf c^T\mathbf x\) subject to \(A\mathbf x=\mathbf b\), \(\mathbf x\geq\mathbf0\). Introducing a free multiplier \(\mathbf y\) for the equality constraint,</p>

\[\max_{\mathbf x\geq \mathbf 0}\min_{\mathbf y}\;\mathbf c^T\mathbf x - \mathbf y^T(A\mathbf x-\mathbf b)
= \min_{\mathbf y}\max_{\mathbf x\geq \mathbf 0}\;\mathbf x^T(\mathbf c-A^T\mathbf y)+\mathbf b^T\mathbf y\]

<p>The inner maximum on the right is finite only when \(A^T\mathbf y\geq\mathbf c\) (otherwise some coordinate of \(\mathbf x\) can be sent to \(+\infty\)), so the dual problem is minimizing \(\mathbf b^T\mathbf y\) subject to \(A^T\mathbf y\geq\mathbf c\)</p>
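<p>A numeric illustration of strong duality on a tiny example (a sketch; it assumes <code>scipy</code> is available, and since <code>linprog</code> minimizes, the primal objective is negated):</p>

```python
import numpy as np
from scipy.optimize import linprog

# Primal: max 3x1 + 2x2  s.t.  x1 + x2 = 4, x >= 0   (optimum 12 at x = (4, 0))
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([4.0])
primal = linprog(-c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2)

# Dual: min 4y  s.t.  y >= 3, y >= 2  (y free; optimum 12 at y = 3)
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(None, None)])

primal_value = -primal.fun
dual_value = dual.fun
```

<p>Both optimal values equal 12, as strong duality predicts.</p>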

<h2 id="coding">Coding</h2>

<h3 id="master-theorem-for-divide-and-conquer-recurrences">Master theorem for divide-and-conquer recurrences</h3>

<p>Suppose a problem with \(n\) inputs can be split into \(a\) subproblems with \(n/b\) inputs each; the running time then satisfies \(T(n)=aT(n/b)+f(n)\), where \(a\geq1\), \(b&gt;1\), \(f(n)\geq0\).</p>

<ul>
  <li>If \(f(n)=O(n^{\log_ba-\epsilon})\) for some \(\epsilon&gt;0\), then \(T(n)=\Theta(n^{\log_ba})\)</li>
  <li>If \(f(n)=\Theta(n^{\log_ba}\log^kn)\) for some \(k\geq0\), then \(T(n)=\Theta(n^{\log_ba}\log^{k+1}n)\)</li>
  <li>If \(f(n)=\Omega(n^{\log_ba+\epsilon})\) for some \(\epsilon&gt;0\), and \(af(n/b)\leq cf(n)\) for some \(c&lt;1\) and all large \(n\), then \(T(n)=\Theta(f(n))\)</li>
</ul>

<h3 id="binary-search">Binary search</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bisect_left(a, x, l=0, r=None):
    # `r=len(a)` is not a valid default: `a` is not in scope at def time
    if r is None: r = len(a)
    while l &lt; r:
        m = (l + r) // 2
        if a[m] &lt; x: l = m + 1
        else: r = m
    return l

def bisect_right(a, x, l=0, r=None):
    if r is None: r = len(a)
    while l &lt; r:
        m = (l + r) // 2
        if x &lt; a[m]: r = m
        else: l = m + 1
    return l
</code></pre></div></div>

<p>Binary search without knowing the size of the array.</p>

\[[1], [2,3], [4,7],[8,15],\cdots,[2^k,2^{k+1}-1]\]
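<p>Concretely: probe indices \(2^k-1\) until the value there is at least \(x\) (treating out-of-range reads as \(+\infty\)), then bisect inside the last bracket. A sketch against an assumed accessor <code>get(i)</code>:</p>

```python
def search_unbounded(get, x):
    """bisect_left over an array of unknown size, accessed only via get(i);
    get(i) is assumed to return float('inf') past the end."""
    l, r = 0, 1
    while get(r - 1) < x:     # grow the bracket [2^k, 2^{k+1})
        l, r = r, 2 * r
    while l < r:              # ordinary bisect_left inside the bracket
        m = (l + r) // 2
        if get(m) < x:
            l = m + 1
        else:
            r = m
    return l
```

<p>The doubling phase costs \(O(\log i)\) probes, where \(i\) is the answer’s position, so the total remains logarithmic.</p>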

<h3 id="sorting">Sorting</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge_sort(a, l=0, r=None):
    # `r=len(a)` is not a valid default: `a` is not in scope at def time
    if r is None: r = len(a)
    if l + 1 &gt;= r: return
    m = (l + r) // 2
    merge_sort(a, l, m)
    merge_sort(a, m, r)
    j, k, A = l, m, []
    while j &lt; m and k &lt; r:
        if a[j] &lt; a[k]:
            A.append(a[j]); j += 1
        else:
            A.append(a[k]); k += 1
    a[l:r] = A + a[j:m] + a[k:r]
</code></pre></div></div>

<h3 id="heap-priority-queue">Heap (Priority queue)</h3>

<p>A (min-)heap is an array \(a\) of length \(n\) such that \(a_k\leq a_{2k+1},a_{2k+2}\) (0-indexed).</p>

<ul>
  <li>the left and right children of \(a_k\) are \(a_{2k+1}\) and \(a_{2k+2}\)</li>
  <li>the parent of \(a_k\) is \(a_{\lfloor(k-1)/2\rfloor}\)</li>
  <li>the leaves are \(a_{\lfloor n/2\rfloor},\cdots,a_{n-1}\)</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sift_down</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span><span class="p">:</span>
    <span class="s">'''
    Sift down the element at p, and return it
    '''</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">n</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="mi">2</span><span class="o">*</span><span class="n">p</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">p</span><span class="o">+</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span>
        <span class="k">if</span> <span class="n">l</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">a</span><span class="p">[</span><span class="n">l</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span> <span class="n">i</span> <span class="o">=</span> <span class="n">l</span>
        <span class="k">if</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">a</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span> <span class="n">i</span> <span class="o">=</span> <span class="n">r</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="n">p</span><span class="p">:</span> <span class="k">break</span>
        <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">sift_up</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
    <span class="s">'''
    Sift up the element at p, and return it
    '''</span>
    <span class="k">while</span> <span class="n">p</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">n</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="p">(</span><span class="n">p</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span>
        <span class="k">if</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]:</span> <span class="k">break</span>
        <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">],</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">heapify</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">):</span>
        <span class="n">sift_down</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">heappop</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="n">top</span><span class="p">,</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">a</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="n">sift_down</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">top</span>

<span class="k">def</span> <span class="nf">heappush</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">a</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">sift_up</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
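<p>These operations are also provided by Python&#8217;s standard-library <code class="language-plaintext highlighter-rouge">heapq</code> module (a min-heap over a plain list), which can serve as a reference implementation:</p>

```python
import heapq

a = [5, 3, 8, 1, 9, 2]
heapq.heapify(a)             # a now satisfies a[k] <= a[2k+1], a[2k+2]
heapq.heappush(a, 0)         # append, then sift up
smallest = heapq.heappop(a)  # save the root, move the last element up, sift down
```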

<h3 id="bitmask">Bitmask</h3>

<p><code class="language-plaintext highlighter-rouge">(i+j) % 2</code> is the same as <code class="language-plaintext highlighter-rouge">(i ^ j) &amp; 1</code> (for single bits <code class="language-plaintext highlighter-rouge">i</code>, <code class="language-plaintext highlighter-rouge">j</code> this is just <code class="language-plaintext highlighter-rouge">i ^ j</code>)</p>

<p>Every submask (subset of the set bits) of <code class="language-plaintext highlighter-rouge">n</code> can be enumerated in decreasing order:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b</span> <span class="o">=</span> <span class="n">n</span>
<span class="k">while</span> <span class="n">b</span><span class="p">:</span>
    <span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="n">b</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">n</span>
</code></pre></div></div>
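<p>The same loop, packaged as a generator so the submasks can actually be consumed (a small sketch; the name <code class="language-plaintext highlighter-rouge">submasks</code> is illustrative):</p>

```python
def submasks(n):
    """Yield every submask of the bitmask n, from n down to 0."""
    b = n
    while b:
        yield b
        b = (b - 1) & n   # next smaller submask of n
    yield 0               # the empty subset
```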

<h3 id="random">Random</h3>

<p>Knuth shuffle</p>
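<p>A minimal sketch of the Knuth (Fisher-Yates) shuffle: walk the array backwards, swapping each element with a uniformly chosen element at or before it, so each of the \(n!\) permutations is equally likely.</p>

```python
import random

def knuth_shuffle(a):
    """In-place Fisher-Yates shuffle: uniform over all permutations."""
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)   # j uniform over 0..i inclusive
        a[i], a[j] = a[j], a[i]
    return a
```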

<h3 id="graph">Graph</h3>

<p>Union-Find</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">UnionFind</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">rank</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">components</span> <span class="o">=</span> <span class="n">n</span>
    <span class="k">def</span> <span class="nf">find</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">!=</span> <span class="n">x</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span>
    <span class="k">def</span> <span class="nf">union</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="n">rx</span><span class="p">,</span> <span class="n">ry</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">rx</span> <span class="o">==</span> <span class="n">ry</span><span class="p">:</span> <span class="k">return</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">ry</span><span class="p">]:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">=</span> <span class="n">ry</span>
        <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">ry</span><span class="p">]:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">ry</span><span class="p">]</span> <span class="o">=</span> <span class="n">rx</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">ry</span><span class="p">]</span> <span class="o">=</span> <span class="n">rx</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">rx</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">components</span> <span class="o">-=</span> <span class="mi">1</span>
</code></pre></div></div>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Quant Prep]]></summary></entry><entry><title type="html">Study Note - Options, Futures and Other Derivatives (John Hull)</title><link href="https://lihaoranicefire.github.io/FinanceTheoryNotes/" rel="alternate" type="text/html" title="Study Note - Options, Futures and Other Derivatives (John Hull)" /><published>2024-08-15T00:00:00+00:00</published><updated>2024-08-15T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/FinanceTheoryNotes</id><content type="html" xml:base="https://lihaoranicefire.github.io/FinanceTheoryNotes/"><![CDATA[<p>A study note taken for the book <em>Options, Futures and Other Derivatives</em> by John Hull</p>

<h2 id="chapter-1---introduction">Chapter 1 - Introduction</h2>

<ul>
  <li>exchange/over-the-counter(OTC) market</li>
  <li>forward/spot/futures contract</li>
  <li>long/short position</li>
  <li>call/put option</li>
  <li>exercise or strike price</li>
  <li>expiration date or maturity (american/european options diff)</li>
  <li>types of traders: hedgers/speculators/arbitrageurs using futures/options</li>
</ul>

<h3 id="practice-questions">Practice questions</h3>
<p>1.29</p>

<h2 id="chapter-2---futures-markets-and-central-counterparties">Chapter 2 - Futures Markets and Central Counterparties</h2>

<h3 id="22---specification-of-a-futures-contract">2.2 - Specification of a futures contract</h3>
<ul>
  <li>closing out</li>
  <li>asset, contract size, delivery arrangement, price quote, price limit, position limit</li>
</ul>

<h3 id="24---margin-accountsexchange-market">2.4 - Margin Accounts(exchange market)</h3>
<ul>
  <li>daily settlement, variation margin, maintenance margin, margin call, clearing house</li>
</ul>

<h3 id="25---otc-markets">2.5 - OTC markets</h3>
<ul>
  <li>central counterparty(CCP), bilateral/collateral clearing</li>
</ul>

<h3 id="26---market-quotes">2.6 - Market quotes</h3>
<ul>
  <li>open/high/low/settlement price</li>
  <li>trading volume, open interest</li>
  <li>pattern of futures: normal/inverted market</li>
</ul>

<h3 id="27---delivery">2.7 - Delivery</h3>
<ul>
  <li>first/last notice day, last trading day</li>
  <li>cash settlement</li>
</ul>

<h3 id="28---types-of-traders-and-types-of-orders">2.8 - Types of traders and types of orders</h3>
<ul>
  <li>types of traders: futures commission merchants (FCM)/locals</li>
  <li>types of speculators: scalpers, day traders, position traders</li>
  <li>types of orders: market orders, limit orders, stop orders, stop-and-limit orders, market-if-touched (MIT) orders or board orders, discretionary orders or market-not-held orders</li>
  <li>types of orders: day order, time-of-day order, open order or good-till-canceled order, fill-or-kill order</li>
</ul>

<h3 id="29---regulation">2.9 - Regulation</h3>
<ul>
  <li>“corner the market”</li>
</ul>

<h3 id="210---accounting-and-tax">2.10 - Accounting and tax</h3>
<ul>
  <li>hedge accounting</li>
  <li>corporate/noncorporate taxpayer</li>
  <li>capital gain/loss, ordinary income</li>
  <li>long/short-term capital gains</li>
  <li>capital loss deduction, carry back/forward</li>
  <li>60/40 rule, hedge transaction</li>
</ul>

<h3 id="211---forward-vs-futures-contracts">2.11 - Forward vs. Futures contracts</h3>
<ul>
  <li>futures prices: US cents per unit of foreign currency</li>
  <li>forward prices: units of foreign currency per USD</li>
</ul>

<h2 id="chapter-3---hedging-strategies-using-futures">Chapter 3 - Hedging strategies using futures</h2>

<h3 id="31---basic-principles">3.1 - Basic principles</h3>
<ul>
  <li>short/long hedge</li>
</ul>

<h3 id="32---arguments-against-hedging">3.2 - Arguments against hedging</h3>
<ul>
  <li>shareholders can hedge themselves</li>
  <li>if hedging is not the norm in the industry, hedging may increase rather than reduce the fluctuation of profit margins</li>
  <li>hedging can offset a gain elsewhere, leading to a worse outcome</li>
</ul>

<h3 id="33---basis-risk">3.3 - Basis risk</h3>
<ul>
  <li>Basis = Spot price of asset to be hedged - Futures price of contract used</li>
</ul>

<h3 id="34---cross-hedging">3.4 - Cross hedging</h3>
<ul>
  <li>cross hedging</li>
  <li>hedge ratio, minimum variance hedge ratio, hedge effectiveness, optimal number of contracts</li>
  <li>tailing the hedge</li>
</ul>

<h3 id="35---stock-index-futures">3.5 - Stock index futures</h3>
<ul>
  <li>stock index</li>
  <li>Dow Jones Industrial Average</li>
  <li>Standard &amp; Poor’s 500 (S&amp;P500)</li>
  <li>Nasdaq-100</li>
</ul>

<h3 id="36---stack-and-roll">3.6 - Stack and roll</h3>

<h3 id="appendix---capital-asset-pricing-model-capm">Appendix - Capital asset pricing model (CAPM)</h3>
<ul>
  <li>systematic/nonsystematic risk</li>
  <li>Expected return on asset = \(R_F+\beta(R_M-R_F)\), \(R_M\) is the return on the market, \(R_F\) is the return on a risk-free investment, \(\beta\) is a parameter measuring systematic risk</li>
  <li>assumptions for CAPM</li>
</ul>
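<p>The CAPM formula above translates directly into code (a one-line sketch; the function name is illustrative):</p>

```python
def capm_expected_return(r_f, beta, r_m):
    """Expected return on an asset: R_F + beta * (R_M - R_F)."""
    return r_f + beta * (r_m - r_f)
```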

<h2 id="chapter-4---interest-rates">Chapter 4 - Interest rates</h2>

<h3 id="41---types-of-rates">4.1 - Types of rates</h3>
<ul>
  <li>credit risk, credit spread</li>
  <li>treasury rates</li>
  <li>overnight rates: (effective) federal funds rate, SONIA, ESTER, SARON, TONAR</li>
  <li>repo rates (very slightly below fed funds rate, but secured): overnight repo, term repos, SOFR</li>
</ul>

<h3 id="42---reference-rates">4.2 - Reference rates</h3>
<ul>
  <li>LIBOR</li>
  <li>The new reference rates: SOFR in the US, overnight rates in other countries</li>
  <li>Longer rates are determined from overnight rates by compounding them daily. The (annualized) interest rate for a period of \(D\) days is
  \(\left[\left(1+r_1\frac{d_1}{360}\right)\cdots\left(1+r_n\frac{d_n}{360}\right)-1\right]\times\frac{360}{D}\)</li>
  <li>the new reference rates are essentially risk-free, so they face the problem of incorporating a credit spread</li>
</ul>
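<p>A sketch of the daily-compounding formula above (actual/360, as in the formula; <code class="language-plaintext highlighter-rouge">segments</code> is a list of <code class="language-plaintext highlighter-rouge">(rate, days)</code> pairs covering the period):</p>

```python
def compounded_reference_rate(segments):
    """Annualized rate over the whole period from per-segment overnight rates."""
    growth, D = 1.0, 0
    for r, d in segments:
        growth *= 1 + r * d / 360   # grow by each segment's simple interest
        D += d
    return (growth - 1) * 360 / D
```

A single segment recovers its own rate; splitting a period into sub-segments at the same rate yields a slightly higher annualized rate, reflecting the compounding.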

<h3 id="43---the-risk-free-rate">4.3 - The risk-free rate</h3>
<ul>
  <li>Banks don’t use treasury rates as the risk-free rates for pricing the derivatives, instead they use the overnight rates</li>
</ul>

<h3 id="44---measuring-interest-rates">4.4 - Measuring interest rates</h3>
<ul>
  <li>compounding frequency: annually/semiannually/quarterly/monthly/weekly/daily. Suppose it is \(m\), then after \(n\) years, the terminal value of an investment of \(A\) at an interest rate of \(r\) per annum is
  \(A\left(1+\frac{r}{m}\right)^{mn}\)</li>
  <li>equivalent annual interest rate
  \(A\left(1+\frac{r_1}{m_1}\right)^{m_1n}=A\left(1+\frac{r_2}{m_2}\right)^{m_2n}\Rightarrow r_2=m_2\left[\left(1+\frac{r_1}{m_1}\right)^{m_1/m_2}-1\right]\)</li>
  <li>continuous compounding: \(Ae^{rn}\)</li>
  <li>
\[Ae^{r_en}=A\left(1+\frac{r}{m}\right)^{mn}\Rightarrow r_e=m\ln\left(1+\frac{r}{m}\right),\quad r=m(e^{r_e/m}-1)\]
  </li>
</ul>
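<p>The equivalences above, as conversion helpers (a sketch; rates are per annum):</p>

```python
import math

def to_continuous(r, m):
    """Continuous rate equivalent to r compounded m times per year: m*ln(1 + r/m)."""
    return m * math.log(1 + r / m)

def from_continuous(rc, m):
    """m-times-per-year rate equivalent to continuous rate rc: m*(e^(rc/m) - 1)."""
    return m * (math.exp(rc / m) - 1)
```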

<h3 id="45---zero-rates">4.5 - Zero rates</h3>
<ul>
  <li>bond with coupon</li>
  <li>\(n\)-year zero-coupon/spot/zero rate: no intermediate payments</li>
</ul>

<h3 id="46---bond-pricing">4.6 - Bond pricing</h3>
<ul>
  <li>principal (also par value or face value) \(A\). Suppose the (annualized, continuously compounded) zero rates for the \(mn\) payment dates are \(r_1,\cdots,r_{mn}\) and the coupon rate is \(c\), paid \(m\) times per year; the bond price is
  \(B=A\left[\sum_{i=1}^{mn}\frac{c}{m}e^{-r_i\,i/m}+e^{-r_{mn}n}\right]\)</li>
  <li>bond yield: the single discount rate \(r\) such that
  \(A\left[\sum_{i=1}^{mn}\frac{c}{m}e^{-r\,i/m}+e^{-rn}\right]=B\)</li>
  <li>par yield: the coupon rate \(c\) such that
  \(A\left[\sum_{i=1}^{mn}\frac{c}{m}e^{-r_i\,i/m}+e^{-r_{mn}n}\right]=A\)</li>
</ul>
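<p>A sketch of bond pricing off a zero curve, plus the bond yield found by bisection (dependency-free; names and the bracketing interval are illustrative):</p>

```python
import math

def bond_price(face, coupon_rate, m, zero_rates):
    """Discount each cash flow at its zero rate.
    zero_rates[i-1] is the continuously compounded rate for time i/m years."""
    price = 0.0
    periods = len(zero_rates)
    for i, r in enumerate(zero_rates, start=1):
        cf = face * coupon_rate / m
        if i == periods:          # final payment includes the principal
            cf += face
        price += cf * math.exp(-r * i / m)
    return price

def bond_yield(face, coupon_rate, m, periods, price, lo=-0.5, hi=1.0):
    """Bisect for the flat continuously compounded rate that reprices the bond."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if bond_price(face, coupon_rate, m, [mid] * periods) > price:
            lo = mid              # price decreases as the rate increases
        else:
            hi = mid
    return (lo + hi) / 2
```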

<h3 id="47---determining-zero-rates">4.7 - Determining zero rates</h3>
<ul>
  <li>bootstrap method</li>
  <li>zero curve</li>
</ul>

<h3 id="48---forward-rates">4.8 - Forward rates</h3>
<ul>
  <li>Suppose \(r_1\), \(r_2\) are the zero rates for maturities \(t_1\), \(t_2\), and \(r_f\) is the forward interest rate for the period of time between \(t_1\) and \(t_2\), then
  \(Ae^{r_1t_1}e^{r_f(t_2-t_1)}=Ae^{r_2t_2}\Rightarrow r_f=\frac{r_2t_2-r_1t_1}{t_2-t_1}=r_2+(r_2-r_1)\frac{t_1}{t_2-t_1}\)</li>
  <li>instantaneous forward rate for a maturity of \(t\)
  \(r_f=r+t\frac{\partial r}{\partial t}\)</li>
</ul>
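<p>The forward-rate formula above as a one-line function (continuously compounded zero rates assumed):</p>

```python
def forward_rate(r1, t1, r2, t2):
    """Forward rate between t1 and t2 implied by zero rates r1, r2 (t2 > t1)."""
    return (r2 * t2 - r1 * t1) / (t2 - t1)
```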

<h3 id="49---forward-rate-agreements-fra">4.9 - Forward rate agreements (FRA)</h3>

<h3 id="410---duration">4.10 - Duration</h3>
<ul>
  <li>Suppose the bond provides the holder with cash flows \(c_i\) at time \(t_i\) (\(1\leq i\leq n\)), and bond yield is \(y\) (continuously compounded), then bond price
  \(B=\sum_{i=1}^nc_ie^{-yt_i}\)</li>
  <li>The duration is the weighted average of the cash-flow times
  \(D=\sum_{i=1}^nt_i\frac{c_ie^{-yt_i}}{B}\)</li>
  <li>For a small change \(\Delta y\) in the yield,
  \(\Delta B=\frac{dB}{dy}\Delta y=-\Delta y\sum_{i=1}^nt_ic_ie^{-yt_i}=-BD\Delta y\Rightarrow\frac{\Delta B}{B}=-D\Delta y\)
  which describes the percentage change in the bond price</li>
  <li>If \(y\) is expressed with a compounding frequency of \(m\) times per year, then
  \(B=\sum_{i=1}^nc_i\left(1+\frac{y}{m}\right)^{-mt_i},\quad D=\sum_{i=1}^nt_i\frac{c_i\left(1+\frac{y}{m}\right)^{-mt_i}}{B},\quad \frac{\Delta B}{B}=-\frac{D\Delta y}{1+y/m}\)</li>
  <li>modified duration \(D^*=\frac{D}{1+y/m}\) so that \(\frac{\Delta B}{B}=-D^*\Delta y\)</li>
  <li>dollar duration: \(D_{\$}=BD^*\) is product of modified duration and bond price so that \(\Delta B=-D_{\$}\Delta y\)</li>
  <li>The duration of a bond portfolio can be defined as a weighted average of the durations of the individual bonds, with weights being proportional to the value of the bond prices.</li>
</ul>

<h3 id="411---convexity">4.11 - Convexity</h3>
<ul>
  <li>The duration relationship applies only to small changes in yields</li>
  <li>convexity
  \(C=\frac{1}{B}\frac{d^2B}{dy^2}=\frac{\sum_{i=1}^nc_it_i^2e^{-yt_i}}{B}\)
  Then
  \(\Delta B=\frac{dB}{dy}\Delta y+\frac{1}{2}\frac{d^2B}{dy^2}\Delta y^2\Rightarrow \frac{\Delta B}{B}=-D\Delta y+\frac{C}{2}\Delta y^2\)</li>
</ul>
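<p>The duration and convexity formulas above can be computed together in one pass over the cash flows (a sketch assuming a continuously compounded yield; the function name is illustrative):</p>

```python
import math

def price_duration_convexity(cashflows, y):
    """cashflows: list of (t_i, c_i) pairs; y: continuously compounded yield.
    Returns (B, D, C) per the definitions above."""
    B = sum(c * math.exp(-y * t) for t, c in cashflows)
    D = sum(t * c * math.exp(-y * t) for t, c in cashflows) / B
    C = sum(t * t * c * math.exp(-y * t) for t, c in cashflows) / B
    return B, D, C
```

For a single cash flow at time \(t\), this gives \(D=t\) and \(C=t^2\), as expected.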

<h3 id="412---theories-of-the-term-structure-of-interest-rates">4.12 - Theories of the term structure of interest rates</h3>
<ul>
  <li>liquidity preference theory</li>
  <li>net interest income</li>
</ul>

<h2 id="chapter-5---determination-of-forward-and-futures-prices">Chapter 5 - Determination of forward and futures prices</h2>

<h3 id="51---investment-assets-vs-consumption-assets">5.1 - Investment assets vs consumption assets</h3>
<ul>
  <li>investment assets</li>
  <li>consumption assets</li>
</ul>

<h3 id="52---short-selling">5.2 - Short selling</h3>

<h3 id="53---assumption-and-notation">5.3 - Assumption and notation</h3>
<p>Assumptions:</p>
<ol>
  <li>The market participants are subject to no transaction costs when they trade</li>
  <li>The market participants are subject to the same tax rate on all net trading profits</li>
  <li>The market participants can borrow money at the same risk-free rate of interest as
they can lend money</li>
  <li>The market participants take advantage of arbitrage opportunities as they occur</li>
</ol>

<p>Notations:</p>
<ul>
  <li>\(T\): Time until delivery date in a forward or futures contract (in years)</li>
  <li>\(S_0\) : Price of the asset underlying the forward or futures contract today</li>
  <li>\(F_0\) : Forward or futures price today</li>
  <li>\(r\): Zero-coupon risk-free rate of interest per annum, expressed with continuous compounding, for an investment maturing at the delivery date (i.e., in \(T\) years)</li>
</ul>

<h3 id="54---forward-price-for-an-investment-asset">5.4 - Forward price for an investment asset</h3>
<p>The forward price for an investment asset should be \(F_0=S_0e^{rT}\). If \(F_0&gt;S_0e^{rT}\), then</p>
<ol>
  <li>Borrow \(S_0\) dollars with an interest rate of \(r\) for \(T\) years</li>
  <li>Buy 1 unit of the asset</li>
  <li>Short a forward contract of 1 unit of asset that delivers in \(T\) years</li>
</ol>

<p>The net gain will be \(F_0-S_0e^{rT}\). If \(F_0&lt;S_0e^{rT}\), then</p>
<ol>
  <li>Short sale 1 unit of asset for \(S_0\) dollars</li>
  <li>Invest the proceeds at an interest rate of \(r\) for \(T\) years</li>
  <li>Enter a forward contract of 1 unit of asset that delivers in \(T\) years</li>
</ol>

<p>The net gain will be \(S_0e^{rT}-F_0\). Even if short selling is not possible, investors who hold the asset purely as an investment can sell it and enter the long forward contract, enforcing the same bound</p>

<h3 id="55---known-income">5.5 - Known income</h3>
<p>Consider an investment asset that will provide a perfectly predictable income with a present value of \(I\) during the life of the forward contract. We have
\(F_0 = (S_0 - I)e^{rT}\)</p>

<h3 id="56---known-yield-">5.6 - Known yield <a name="5.6KnownYield"></a></h3>
<p>Suppose the known yield of the forward contract is \(q\), compounded continuously, to make sure there is no positive net gain from selling and reinvesting in the asset, we need \(S_0e^{rT}=e^{qT}F_0\Rightarrow F_0=S_0e^{(r-q)T}\)</p>

<h3 id="57---valuing-forward-contracts">5.7 - Valuing forward contracts</h3>
<p>Suppose in addition that \(K\) is the delivery price negotiated some time ago when the forward contract was entered, with delivery in \(T\) years, \(f\) is the value of the forward contract today (if it were to be sold), and \(F_0\) is the current forward price for delivery in \(T\) years. Then
\(f=(F_0-K)e^{-rT}\)
If the underlying asset provides a known income with present value \(I\), then
\(f=S_0-I-Ke^{-rT}\)
If the underlying asset provides a known yield \(q\), then
\(f=S_0e^{-qT}-Ke^{-rT}\)</p>
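<p>A sketch tying 5.4-5.7 together (function names are illustrative; income and yield default to zero, recovering \(F_0=S_0e^{rT}\), and the combined form \((S_0-I)e^{(r-q)T}\) reduces to each case separately):</p>

```python
import math

def forward_price(S0, r, T, income_pv=0.0, q=0.0):
    """F0 = (S0 - I) e^((r - q) T); with I = q = 0 this is S0 e^(rT)."""
    return (S0 - income_pv) * math.exp((r - q) * T)

def forward_value(F0, K, r, T):
    """Value today of a long forward with delivery price K: (F0 - K) e^(-rT)."""
    return (F0 - K) * math.exp(-r * T)
```

A newly entered forward (with \(K=F_0\)) is worth exactly zero.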

<h3 id="58---are-forward-prices-and-futures-prices-equal">5.8 - Are forward prices and futures prices equal</h3>
<ul>
  <li>if \(r\) is a known function of time, forward prices and futures prices are equal</li>
  <li>if \(S\) is strongly positively correlated with \(r\), a long futures contract is more desirable than a long forward contract</li>
  <li>if \(S\) is strongly negatively correlated with \(r\), a long forward contract is more desirable than a long futures contract</li>
</ul>

<h3 id="59---futures-prices-of-stock-indices">5.9 - Futures prices of stock indices</h3>
<ul>
  <li>index arbitrage, program trading</li>
</ul>

<h3 id="510---forward-and-futures-contracts-on-currencies">5.10 - Forward and futures contracts on currencies</h3>
<p>\(S_0\) is the spot exchange rate in domestic currency, \(r_f\) is the value of the foreign risk-free interest rate when money is invested for time \(T\). \(r\) is the domestic risk-free rate when money is invested for this period of time. The relationship between \(F_0\) and \(S_0\) is
\(F_0 = S_0e^{(r-r_f)T}\)
The equation is the same as the one for a known yield in <a href="#5.6KnownYield">5.6</a>, because a foreign currency can be regarded as an investment asset paying a known yield. The yield is the risk-free rate of interest in the foreign currency</p>

<h3 id="511---futures-on-commodities">5.11 - Futures on commodities</h3>
<p>Storage costs can be treated as negative income. If \(U\) is the present value of all storage costs, then
\(F_0 = (S_0 + U) e^{rT}\)
If the storage costs incurred at any time are proportional to the price of the commodity, they can be treated as a negative yield, so
\(F_0 = S_0 e^{(r+u)T}\)
Consumption commodities usually provide no income, but are subject to significant storage costs.</p>

<h4 id="convenience-yield">Convenience yield</h4>
<p>\(F_0e^{yT} = (S_0+U)e^{rT}\)
or
\(F_0e^{yT} = S_0e^{(r+u)T}\)
If shortages are more likely to occur, or if inventories are low, the convenience yield is higher</p>

<h3 id="512---the-cost-of-carry">5.12 - The cost of carry</h3>
<p>The cost of carry \(c\) is</p>
<ul>
  <li>\(r\) for a non-dividend-paying stock</li>
  <li>\(r-q\) for a stock with dividend yield rate \(q\)</li>
  <li>\(r-r_f\) for a currency</li>
  <li>\(r-q+u\) for a commodity that provides income at rate \(q\) and requires storage costs at rate \(u\)</li>
</ul>

<p>If \(y\) is the convenience yield rate, we then have
\(F_0=S_0e^{(c-y)T}\)
For a futures contract, the party with the short position can choose to deliver at any time in a certain period (giving a few days’ notice of intention to deliver). If \(c&gt;y\), the party with the short position will deliver as early as possible; if \(c&lt;y\), as late as possible</p>

<h3 id="514---futures-prices-and-expected-future-spot-prices">5.14 - Futures prices and expected future spot prices</h3>
<h4 id="keynes-and-hicks-argument">Keynes and Hicks’ argument</h4>
<ul>
  <li>Expected future spot price</li>
  <li>If hedgers hold short positions and speculators hold long positions, the futures prices will be above the expected spot prices</li>
  <li>If hedgers hold long positions and speculators hold short positions, the futures prices will be below the expected spot prices</li>
</ul>

<h4 id="risk-and-return">Risk and return</h4>
<p>The modern approach is based on the relationship between risk and expected return in the economy. Suppose \(k\) is an investor’s required rate of return on the investment. Then the present value of this investment is
\(-F_0e^{-rT}+\mathbb E(S_T)e^{-kT}\)
Setting this to zero gives \(F_0 = \mathbb E(S_T)e^{(r-k)T}\). If the return is</p>
<ul>
  <li>uncorrelated with the stock market, then \(k=r\Rightarrow F_0=\mathbb E(S_T)\)</li>
  <li>positively correlated with the stock market, then \(k&gt;r\Rightarrow F_0&lt;\mathbb E(S_T)\). This is known as normal backwardation</li>
  <li>negatively correlated with the stock market, then \(k&lt;r\Rightarrow F_0&gt;\mathbb E(S_T)\). This is known as contango</li>
</ul>

<h2 id="chapter-6---interest-rate-futures">Chapter 6 - Interest rate futures</h2>

<h3 id="61---day-count-and-quotation-conventions">6.1 - Day count and quotation conventions</h3>
<p>Day count between dates is defined as
\(\frac{\text{Number of days between dates}}{\text{Number of days in reference period}}\)
Three day count conventions commonly used in the US are</p>
<ul>
  <li>Actual / actual (in period)</li>
  <li>30 / 360</li>
  <li>Actual / 360</li>
</ul>
<p>The interest earned between dates is
\(\text{Day count}\times\text{Interest earned in reference period}\)</p>

<h4 id="price-quotations-of-us-treasury-bills">Price Quotations of U.S. Treasury Bills</h4>
<p>Suppose the face value of a Treasury bill is 100, \(P\) is the quoted price (an annualized discount rate of \(P\%\)), \(n\) is the remaining life measured in calendar days, and \(Y\) is the cash price (present value), then
\(Y=100-100\times\frac{P}{100}\times\frac{n}{360}\Leftrightarrow P=\frac{360}{n}(100-Y)\)</p>

<h4 id="price-quotations-of-us-treasury-bonds">Price Quotations of U.S. Treasury Bonds</h4>
<p>Treasury bond prices in the United States are quoted in dollars and thirty-seconds of a dollar. For example, 120-15 means \(120\frac{15}{32}\). We have
\(\text{cash price (or dirty price)} = \text{quoted price (or clean price)} + \text{accrued interest since last coupon date}\)
The interest is accrued using the face value and the day count</p>

<h3 id="62---treasury-bond-futures">6.2 - Treasury bond futures</h3>

<h2 id="chapter-7---swaps">Chapter 7 - Swaps</h2>

<h2 id="chapter-8---securitization-and-the-financial-crisis-of-2007-8">Chapter 8 - Securitization and the financial crisis of 2007-8</h2>

<h2 id="chapter-9---xvas">Chapter 9 - XVAs</h2>

<h2 id="chapter-10---mechanics-of-options-markets">Chapter 10 - Mechanics of options markets</h2>

<h3 id="102---option-positions">10.2 - Option positions</h3>
<p>Suppose the strike price is \(K\) and the terminal price is \(S_T\), the payoff for european option is</p>
<ul>
  <li>long call \(\max(S_T-K,0)\)</li>
  <li>short call \(-\max(S_T-K,0)\)</li>
  <li>long put \(\max(K-S_T,0)\)</li>
  <li>short put \(-\max(K-S_T,0)\)</li>
</ul>
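<p>The four payoffs in code (the short positions are just the negatives of the long ones):</p>

```python
def call_payoff(S_T, K):
    """Long European call payoff at expiry: max(S_T - K, 0)."""
    return max(S_T - K, 0.0)

def put_payoff(S_T, K):
    """Long European put payoff at expiry: max(K - S_T, 0)."""
    return max(K - S_T, 0.0)

# At expiry, long call minus long put equals S_T - K for any S_T.
```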

<h3 id="103---underlying-assets">10.3 - Underlying assets</h3>
<p>options on stock/ETP/currency/stock index/futures</p>

<h3 id="104---specification-of-stock-options">10.4 - Specification of stock options</h3>
<ul>
  <li>expiration dates</li>
  <li>strike prices</li>
  <li>option class</li>
  <li>option series</li>
  <li>
    <table>
      <thead>
        <tr>
          <th> </th>
          <th>in the money</th>
          <th>at the money</th>
          <th>out of the money</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>call</td>
          <td>\(S&gt;K\)</td>
          <td>\(S=K\)</td>
          <td>\(S&lt;K\)</td>
        </tr>
        <tr>
          <td>put</td>
          <td>\(S&lt;K\)</td>
          <td>\(S=K\)</td>
          <td>\(S&gt;K\)</td>
        </tr>
      </tbody>
    </table>
  </li>
  <li>intrinsic/time value</li>
  <li>FLEX options</li>
  <li>dividends effect on strike price</li>
  <li>\(n\)-for-\(m\) stock split, its effect on strike price</li>
  <li>position/exercise limits</li>
</ul>

<h3 id="105---trading">10.5 - Trading</h3>
<ul>
  <li>market makers</li>
  <li>bid-ask spread</li>
</ul>

<h3 id="107---margin-requirements">10.7 - Margin requirements</h3>
<ul>
  <li>naked options</li>
</ul>

<h3 id="108---the-options-clearing-coroperation-ooc">10.8 - The Options Clearing Corporation (OCC)</h3>

<h3 id="109---regulation">10.9 - Regulation</h3>

<h3 id="1010---taxation">10.10 - Taxation</h3>
<p>Treat as capital gain/loss</p>
<ul>
  <li>wash sale rule</li>
  <li>constructive sales</li>
</ul>

<h3 id="1011---warrants-employee-stock-options-and-convertibles">10.11 - Warrants, employee stock options, and convertibles</h3>
<ul>
  <li>warrants</li>
  <li>employee stock options</li>
  <li>convertibles (bonds)</li>
  <li>exotic option</li>
</ul>

<h2 id="chapter-11---properties-of-stock-options">Chapter 11 - Properties of stock options</h2>

<h3 id="111---factors-affecting-option-prices">11.1 - Factors affecting option prices</h3>
<ul>
  <li>current stock price \(S_0\)</li>
  <li>strike price \(K\)</li>
  <li>maturity date \(T\)</li>
  <li>volatility \(\sigma\)</li>
  <li>risk-free rate \(r\)</li>
  <li>dividends</li>
  <li>American calls and puts become more valuable as \(T\) increases; this is not necessarily true for European options, since a dividend may fall within the longer life</li>
  <li>options are more valuable as \(\sigma\) increases, because the benefits are limitless whereas the loss is at most the cost of the option</li>
  <li>If \(r\) increases, the expected return demanded on the stock increases, while the present value of any future cash flow decreases. The combined impact increases the value of call options and decreases the value of put options</li>
  <li>if the ex-dividend date is in the life of a call/put option, the value of the option is negatively/positively related to the size of the dividend</li>
</ul>

<h3 id="112---assumputions-and-notation">11.2 - Assumptions and notation</h3>
<p>Assumptions:</p>
<ol>
  <li>There are no transaction costs</li>
  <li>All trading profits (net of trading losses) are subject to the same tax rate</li>
  <li>Borrowing and lending are possible at the risk-free interest rate</li>
</ol>

<p>Notations:</p>
<ul>
  <li>\(S_T\): Stock price on the expiration date \(T\)</li>
  <li>\(C\): Value of American call option to buy one share</li>
  <li>\(P\): Value of American put option to sell one share</li>
  <li>\(c\): Value of European call option to buy one share</li>
  <li>\(p\): Value of European put option to sell one share</li>
</ul>

<h3 id="113---upper-and-lower-bounds-for-option-prices">11.3 - Upper and lower bounds for option prices</h3>

<h4 id="upper-bounds">Upper bounds</h4>
<ul>
  <li>
\[c, C\leq S_0\]
  </li>
  <li>\(P\leq K\), \(p\leq Ke^{-rT}\)</li>
</ul>

<h4 id="lower-bound-for-european-callsputs-on-non-dividend-paying-stocks">Lower bound for european calls/puts on non-dividend-paying stocks</h4>
<ul>
  <li>\(c\geq \max(S_0-Ke^{-rT},0)\), \(p\geq\max(Ke^{-rT}-S_0,0)\)</li>
</ul>

<h3 id="114---put-call-parity">11.4 - Put-call parity</h3>
<p>Consider two portfolios:</p>
<ul>
  <li>1 european call(no dividends) + 1 bond(0-coupon) with payoff of \(K\) at \(T\)</li>
  <li>1 european put(no dividends) + 1 share of stock</li>
</ul>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>\(S_T&gt;K\)</th>
      <th>\(S_T&lt;K\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>call + bond</td>
      <td>\((S_T-K)+K=S_T\)</td>
      <td>\(0+K=K\)</td>
    </tr>
    <tr>
      <td>put + stock</td>
      <td>\(0+S_T=S_T\)</td>
      <td>\((K-S_T)+S_T=K\)</td>
    </tr>
  </tbody>
</table>

<p>The payoff of both portfolios is \(\max(S_T,K)\), so
\(c+Ke^{-rT}=p+S_0\)</p>
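<p>The table above can be checked mechanically (a sketch with \(K=100\) and illustrative terminal prices):</p>

```python
# Verify the payoff table: both portfolios pay max(S_T, K) at maturity.
def call_plus_bond(S_T, K):
    return max(S_T - K, 0) + K        # call payoff + bond face value

def put_plus_stock(S_T, K):
    return max(K - S_T, 0) + S_T      # put payoff + one share

payoffs = [(call_plus_bond(S, 100), put_plus_stock(S, 100)) for S in (80, 100, 120)]
```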

<h3 id="115---calls-on-a-non-dividend-paying-stock">11.5 - Calls on a non-dividend-paying stock</h3>
<p>It is never optimal to exercise an American call on a non-dividend-paying stock early, so \(c=C\)</p>

<h3 id="116---puts-on-a-non-dividend-paying-stock">11.6 - Puts on a non-dividend-paying stock</h3>
<p>It can be optimal to exercise an American put on a non-dividend-paying stock early. Early exercise becomes more attractive as \(S_0\) decreases, \(r\) increases, or \(\sigma\) decreases. In general \(P\geq\max(K-S_0,0)\)</p>

<h3 id="117---effect-of-dividends">11.7 - Effect of dividends</h3>

<p>\(c\geq\max(S_0-D-Ke^{-rT},0)\quad p\geq\max(D+Ke^{-rT}-S_0,0)\)
the put-call parity becomes
\(c+D+Ke^{-rT}=p+S_0\)
\(S_0-D-K\leq C-P\leq S_0-Ke^{-rT}\)</p>

<h3 id="practice-questions-1">Practice questions</h3>
<h4 id="q1123">Q11.23</h4>
<p>For american options with no dividends, we have
\(C+Ke^{-rT}\leq P+S_0\leq C+K\)
Which is equivalent to
\(S_0-K\leq C-P\leq S_0-Ke^{-rT}\)</p>

<h2 id="chapter-12---trading-strategies-involving-options">Chapter 12 - Trading strategies involving options</h2>

<h2 id="chapter-13---binomial-trees">Chapter 13 - Binomial trees</h2>

<h2 id="chapter-14---wiener-processes-and-itos-lemma">Chapter 14 - Wiener processes and Ito’s lemma</h2>

<h3 id="141---the-markov-property">14.1 - The markov property</h3>
<ul>
  <li>markov process is a stochastic process that only depends on the current value and time, not the history</li>
</ul>

<h3 id="142---continuous-time-stochastic-processes">14.2 - Continuous-time stochastic processes</h3>
<h4 id="wiener-process-or-brownian-motion">wiener process or brownian motion</h4>
<p>Suppose \(\mathcal N(\mu,\sigma^2)\) stands for the normal distribution of mean \(\mu\) and variance \(\sigma^2\). \(z\) is a wiener process if it has the following two properties</p>
<ul>
  <li>Property 1. The change of \(z\) in a small period of time \(\Delta t\) is \(\Delta z\sim\mathcal N(0,\Delta t)\)</li>
  <li>Property 2. The values of \(\Delta z\) for two different short intervals of time are independent</li>
</ul>

<p>Property 2 implies that \(z\) follows a markov process. Together with Property 1, we can deduce \(z(t_2)-z(t_1)\sim\mathcal N(0,t_2-t_1)\)
When \(\Delta t\) is small, \(\sqrt{\Delta t}\) is much bigger than \(\Delta t\), so</p>
<ol>
  <li>The expected length of path followed by \(z\) in any time interval is infinite</li>
  <li>The expected number of times \(z\) equals any particular value in any time interval is infinite</li>
</ol>

<h4 id="generalized-wiener-process">generalized wiener process</h4>
<p>Suppose \(dz\sim\mathcal N(0,dt)\), the generalized wiener process \(x\) is such that \(dx=adt+bdz\) Where \(a\) is the drift rate and \(b\) is the variance rate. We can deduce
\(\Delta x=a\Delta t+b\Delta z,\quad \Delta z\sim\mathcal N(0,\Delta t)\)
Thus \(\Delta x\sim\mathcal N(a\Delta t,b^2\Delta t)\) and \(x(t_2)-x(t_1)\sim\mathcal N(a(t_2-t_1), b^2(t_2-t_1))\)</p>

<h4 id="ito-process">ito process</h4>
<p>\(dx=a(x,t)dt+b(x,t)dz\)
So for small \(\Delta t\), \(\Delta x=a(x,t)\Delta t+b(x,t)\Delta z\) with \(\Delta z\sim\mathcal N(0,\Delta t)\). This is still a markov process</p>

<h3 id="143---the-process-for-a-stock-price">14.3 - The process for a stock price</h3>
<p>Assume the expected return and the volatility are constant. Then the stock price satisfies
\(\frac{dS}{S}=\mu dt+\sigma dz\Rightarrow dS=\mu Sdt+\sigma Sdz\)
where \(\mu\) is the stock’s expected rate of return and \(\sigma\) is the volatility of the stock price. \(\sigma^2\) is referred to as its variance rate.</p>

<p>In a risk-neutral world, \(\mu\) equals the risk-free rate \(r\)</p>

<p>This is known as geometric brownian motion, and the discrete-time version of the model is
\(\frac{\Delta S}{S}=\mu\Delta t+\sigma\Delta z,\quad \Delta z\sim\mathcal N(0,\Delta t)\)
\(\frac{\Delta S}{S}\sim\mathcal N(\mu\Delta t,\sigma^2\Delta t)\) is the return in a period of time \(\Delta t\)</p>
<h4 id="monte-carlo-simulation">monte carlo simulation</h4>
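<p>A minimal Monte Carlo sketch of the discrete-time model \(\Delta S=S(\mu\Delta t+\sigma\Delta z)\) (the parameter values below are illustrative):</p>

```python
import math
import random

# Simulate one terminal stock price with Euler steps of the discrete-time model.
def simulate_gbm_terminal(S0, mu, sigma, T, n_steps, rng):
    dt = T / n_steps
    S = S0
    for _ in range(n_steps):
        dz = rng.gauss(0.0, math.sqrt(dt))   # Delta z ~ N(0, dt)
        S += S * (mu * dt + sigma * dz)
    return S

rng = random.Random(0)
terminal = [simulate_gbm_terminal(100.0, 0.08, 0.2, 1.0, 252, rng) for _ in range(5000)]
mean_ST = sum(terminal) / len(terminal)   # should be close to S0 * e^{mu T}
```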

<h3 id="144---the-parameters">14.4 - The parameters</h3>
<ul>
  <li>\(\mu\) should increase if the risk is higher or the interest rates are higher</li>
  <li>\(\sigma\) should be approximately the standard deviation of the stock price in 1 year</li>
  <li>\(\sigma\) is critically important to the determination of value of many derivatives</li>
</ul>

<h3 id="145---correlated-processes">14.5 - Correlated processes</h3>
<p>Suppose
\(dx_1=a_1dt+b_1dz_1,\quad dx_2=a_2dt+b_2dz_2\)
And \(dz_1,dz_2\) have correlation \(\rho\). In practice we can set
\(dz_1=u\sqrt{dt},\quad dz_2=(\rho u+\sqrt{1-\rho^2}v)\sqrt{dt}\)
where \(u,v\sim\mathcal N(0,1)\) are independent</p>
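<p>This construction can be checked by simulation: with \(u,v\) independent standard normals, the sample correlation of the increments should approach \(\rho\) (a sketch; parameters are illustrative):</p>

```python
import math
import random

# Generate correlated Wiener increments via dz2 = (rho*u + sqrt(1-rho^2)*v)*sqrt(dt).
def correlated_increments(rho, dt, n, rng):
    dz1, dz2 = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        dz1.append(u * math.sqrt(dt))
        dz2.append((rho * u + math.sqrt(1 - rho**2) * v) * math.sqrt(dt))
    return dz1, dz2

rng = random.Random(1)
dz1, dz2 = correlated_increments(0.7, 1 / 252, 50000, rng)
m1, m2 = sum(dz1) / len(dz1), sum(dz2) / len(dz2)
cov = sum((a - m1) * (b - m2) for a, b in zip(dz1, dz2)) / len(dz1)
var1 = sum((a - m1) ** 2 for a in dz1) / len(dz1)
var2 = sum((b - m2) ** 2 for b in dz2) / len(dz2)
sample_rho = cov / math.sqrt(var1 * var2)   # should be near 0.7
```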

<h3 id="146---itos-lemma">14.6 - Ito’s lemma</h3>
<p>Suppose \(x\) follows ito process
\(dx=a(x,t)dt+b(x,t)dz,\quad dz\sim\mathcal N(0,dt)\)
Then \(G=G(x,t)\) as a function of \(x,t\) follows the ito process
\(dG=\left(\frac{\partial G}{\partial x}a+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2\right)dt+\frac{\partial G}{\partial x}bdz\)
So if \(G\) is a function of \(S, t\), then
\(dG=\left(\frac{\partial G}{\partial S}\mu S+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial S^2}\sigma^2S^2\right)dt+\frac{\partial G}{\partial S}\sigma Sdz\)</p>

<h4 id="application-to-forward-contracts">Application to forward contracts</h4>
<p>Consider a forward contract on a non-dividend-paying stock. \(F\) is
the forward price at a general time \(t\), \(S\) is the stock price at time \(t\), with \(t&lt;T\). Then \(F = Se^{r(T-t)}\), and
\(dF=\left(e^{r(T-t)}\mu S-Sre^{r(T-t)}\right)dt+e^{r(T-t)}\sigma Sdz=(\mu-r)Fdt+\sigma Fdz\)
is a geometric brownian motion. It has the same volatility as \(S\) and an expected growth rate of \(\mu - r\) rather than \(\mu\).</p>

<h3 id="147---the-lognormal-property-">14.7 - The lognormal property <a name="14.7LogNormalProperty"></a></h3>
<p>Consider \(G=\ln S\), then we have
\(dG=\left(\mu-\frac{\sigma^2}{2}\right)dt+\sigma dz\)
So
\(\ln S_T-\ln S_0\sim\mathcal N\left(\left(\mu-\frac{\sigma^2}{2}\right)T, \sigma^2T\right)\Rightarrow\ln S_T\sim\mathcal N\left(\ln S_0+\left(\mu-\frac{\sigma^2}{2}\right)T, \sigma^2T\right)\)</p>

<h3 id="148---fractional-brownian-motion">14.8 - Fractional brownian motion</h3>
<p>Suppose \(dx=\sigma dz\) with \(x_0=0\), where \(z\) is a Wiener process, then \(E(x_t-x_s)=0\) and
\(\begin{align*}
E((x_t-x_s)^2)&amp;=E(x_t^2)+E(x_s^2)-2E(x_tx_s)\\
&amp;=E(x_t^2)+E(x_s^2)-2E((x_t-x_s+x_s)x_s)\\
&amp;=E(x_t^2)+E(x_s^2)-2E((x_t-x_s)x_s)-2E(x_s^2)\\
&amp;=E(x_t^2)-E(x_s^2)\\
&amp;=\sigma^2t-\sigma^2s\\
&amp;=\sigma^2(t-s)
\end{align*}\)
here \(x_t-x_s\) and \(x_s\) are uncorrelated. In a fractional or fractal brownian motion, we assume
\(E((x_t-x_s)^2)=\sigma^2(t-s)^{2H}\)
\(H\) is the Hurst exponent. When \(H=0.5\), it becomes a regular brownian motion. Also
\(E(x_tx_s)=\frac{1}{2}\left(E(x_t^2)+E(x_s^2)-E((x_t-x_s)^2)\right)=\frac{1}{2}\sigma^2[t^{2H}+s^{2H}-(t-s)^{2H}]\)
So the correlation between \(x_t\) and \(x_s\) is
\(\frac{t^{2H}+s^{2H}-(t-s)^{2H}}{2s^Ht^H}\)
Fractional brownian motion is non-markov. If \(t&gt;s&gt;u\)
\(\begin{align*}
E[(x_t-x_s)(x_s-x_u)]&amp;=E[x_tx_s-x_s^2-x_tx_u+x_sx_u]\\
&amp;=\frac{\sigma^2}{2}[(t-u)^{2H}-(t-s)^{2H}-(s-u)^{2H}]
\end{align*}\)
When \(H\) decreases or the time step decreases, the volatility increases, so the process becomes noisier</p>
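<p>One way to simulate fractional brownian motion exactly on a grid is to Cholesky-factor the covariance \(E(x_tx_s)\) derived above (a sketch; the grid size, seed, and parameters are illustrative):</p>

```python
import numpy as np

# Exact simulation of fBM on a time grid via Cholesky of the covariance
# E(x_t x_s) = (sigma^2/2) * (t^{2H} + s^{2H} - |t-s|^{2H}).
def fbm_paths(H, sigma, T, n, m, rng):
    t = np.linspace(T / n, T, n)                      # avoid t = 0 (degenerate row)
    tt, ss = np.meshgrid(t, t, indexing="ij")
    cov = 0.5 * sigma**2 * (tt**(2 * H) + ss**(2 * H) - np.abs(tt - ss)**(2 * H))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))   # tiny jitter for stability
    return t, L @ rng.standard_normal((n, m))         # each column is one path

rng = np.random.default_rng(0)
t, X = fbm_paths(H=0.3, sigma=1.0, T=1.0, n=50, m=20000, rng=rng)
var_T = X[-1].var()   # should be near sigma^2 * T^{2H} = 1
```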

<h3 id="appendix---a-nonrigorous-derivation-of-itos-lemma">Appendix - A nonrigorous derivation of ito’s lemma</h3>
<p>\(\Delta G=\frac{\partial G}{\partial x}\Delta x+\frac{\partial G}{\partial t}\Delta t+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}\Delta x^2+\frac{\partial^2 G}{\partial x\partial t}\Delta x\Delta t+\frac{1}{2}\frac{\partial^2 G}{\partial t^2}\Delta t^2\)
and
\(\Delta x=a(x,t)\Delta t+b(x,t)\Delta z,\quad \Delta z\sim\mathcal N(0,\Delta t)\)
so \(\Delta x^2=b^2\Delta z^2+a^2\Delta t^2+2ab\Delta z\Delta t=b^2\Delta z^2+O(\Delta t)\), substitute to get
\(\Delta G=\frac{\partial G}{\partial x}\Delta x+\frac{\partial G}{\partial t}\Delta t+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2\Delta z^2+O(\Delta t)\)
And we know that \(E(\Delta z^2)=\Delta t\), \(\mathrm{Var}(\Delta z^2)=2\Delta t^2\). As \(\Delta t\to0\), the variance is of higher order than the mean, so \(dz^2=dt\) becomes nonstochastic, therefore we have
\(\begin{align*}
dG&amp;=\frac{\partial G}{\partial x}dx+\frac{\partial G}{\partial t}dt+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2dt\\
&amp;=\frac{\partial G}{\partial x}(adt+bdz)+\frac{\partial G}{\partial t}dt+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2dt\\
&amp;=\left(\frac{\partial G}{\partial x}a+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}b^2\right)dt+\frac{\partial G}{\partial x}bdz\\
\end{align*}\)
When \(dx=adt+\sum_{i=1}^mb_idz_i\), we have
\(dG=\left(\frac{\partial G}{\partial x}a+\frac{\partial G}{\partial t}+\frac{1}{2}\frac{\partial^2 G}{\partial x^2}\sum_{i=1}^m\sum_{j=1}^mb_ib_j\rho_{ij}\right)dt+\frac{\partial G}{\partial x}\sum_{i=1}^mb_idz_i\)
Here \(\rho_{ij}\) is the correlation coefficient between \(dz_i,dz_j\). When \(G\) is a function of variables \(x_1,x_2,\dots,x_n,t\) and \(dx_i = a_i dt + b_i dz_i\), we have
\(dG=\left(\sum_{i=1}^n\frac{\partial G}{\partial x_i}a_i+\frac{\partial G}{\partial t}+\sum_{i=1}^n\sum_{j=1}^n\frac{1}{2}\frac{\partial^2 G}{\partial x_i\partial x_j}b_ib_j\rho_{ij}\right)dt+\sum_{i=1}^n\frac{\partial G}{\partial x_i}b_idz_i\)</p>

<h2 id="chapter-15---the-black-scholes-merton-model">Chapter 15 - The Black-Scholes-Merton model</h2>

<h3 id="151---lognormal-property-of-stock-prices">15.1 - Lognormal property of stock prices</h3>
<p>Recall from <a href="#14.7LogNormalProperty">14.7</a>
\(\ln S_T\sim\mathcal N\left(\ln S_0+\left(\mu-\frac{\sigma^2}{2}\right)T, \sigma^2T\right)\)
so that
\(\mathbb E(S_T)=S_0e^{\mu T}\)
\(\mathrm{Var}(S_T) = S_0^2e^{2\mu T}(e^{\sigma^2T}-1)\)</p>

<h3 id="152---the-distribution-of-the-rate-of-return">15.2 - The distribution of the rate of return</h3>
<p>Suppose the continuously compounded rate of return per annum realized between time 0 and \(T\) is \(x\), then \(S_T=S_0e^{xT}\Rightarrow x=\frac{1}{T}\ln\frac{S_T}{S_0}\), hence
\(x\sim\mathcal N\left(\mu-\frac{\sigma^2}{2},\frac{\sigma^2}{T}\right)\)</p>

<h3 id="153---the-expected-return">15.3 - The expected return</h3>
<p>Greater risk or higher level of interest rate would mean higher expected return \(\mu\).</p>

<p>Note that \(\mathbb E(x)=\mu-\frac{\sigma^2}{2}&lt;\mu\). This is because of the ambiguity about what “expected return” means. In the sense of arithmetic mean we have
\(\ln[\mathbb E(S_T)]=\ln(S_0)+\mu T\)
however in the sense of geometric mean we have
\(\mathbb E[\ln(S_T)]=\ln(S_0)+E(x)T\)
In fact \(\ln[\mathbb E(S_T)]&gt;\mathbb E[\ln(S_T)]\) since \(\ln\) is concave (Jensen’s inequality), so \(\mathbb E(x)&lt;\mu\)</p>

<h3 id="154---volatility">15.4 - Volatility</h3>
<p>To estimate the volatility of a stock price empirically, we observe \(n+1\) prices \(S_0,S_1,\cdots,S_n\) at fixed intervals and suppose \(\tau\) is the length of each interval in years. Then the estimate
\(s=\sqrt{\frac{\sum_{i=1}^n(u_i-\bar u)^2}{n-1}}\)
is the sample standard deviation of \(u_i=\ln(S_i/S_{i-1})\)</p>

<p>The standard deviation of \(u_i\) is \(\sigma\sqrt\tau\), and \(\hat\sigma=s/\sqrt\tau\) is an estimator of \(\sigma\) with standard error
\(\frac{\sigma}{\sqrt{n-1}}\sqrt{(n-1)-\frac{2\Gamma(\frac{n}{2})^2}{\Gamma(\frac{n-1}{2})^2}}\approx\frac{\sigma}{\sqrt{2(n-1)}}\approx\hat\sigma/\sqrt{2n}\)
Which uses
\(\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}=\sqrt{\frac{n-1}{2}}\left(1-\frac{1}{4(n-1)}+O\left(\frac{1}{n^2}\right)\right)\)
If \(D\) is the amount of the dividend in some period, then \(u_i=\ln\frac{S_i+D}{S_{i-1}}\)</p>

<p>Research shows that volatility is much higher when the exchange is open for trading than when it is closed. We calculate
\(\text{Volatility per annum}=\text{Volatility per trading day}\times\sqrt{\text{Number of trading days per annum}}\)
The life of an option is
\(T=\frac{\text{Number of trading days until option maturity}}{252}\)
Where the number of trading days in a year is usually assumed to be 252</p>
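<p>Putting this together, a minimal sketch of the estimator \(\hat\sigma=s/\sqrt\tau\) on synthetic daily prices (true \(\sigma=0.2\), drift ignored, seed illustrative):</p>

```python
import math
import random

# Annualized volatility estimate from daily closes: sigma_hat = s / sqrt(tau),
# with u_i = ln(S_i / S_{i-1}) and tau = 1/252.
def annualized_vol(prices, trading_days=252):
    u = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    n = len(u)
    u_bar = sum(u) / n
    s = math.sqrt(sum((x - u_bar) ** 2 for x in u) / (n - 1))   # sample std of u_i
    return s * math.sqrt(trading_days)

rng = random.Random(2)
true_sigma = 0.2
prices = [100.0]
for _ in range(5000):
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, true_sigma * math.sqrt(1 / 252))))
sigma_hat = annualized_vol(prices)   # should be close to 0.2
```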

<h3 id="155-the-idea-underlying-black-scholes-merton-differential-equation">15.5 The idea underlying Black-Scholes-Merton differential equation</h3>

<p>Assumptions</p>
<ol>
  <li>The stock price follows the process developed in Chapter 14 with \(\mu\) and \(\sigma\) constant.</li>
  <li>The short selling of securities with full use of proceeds is permitted.</li>
  <li>There are no transaction costs or taxes. All securities are perfectly divisible.</li>
  <li>There are no dividends during the life of the derivative.</li>
  <li>There are no riskless arbitrage opportunities.</li>
  <li>Security trading is continuous.</li>
  <li>The risk-free rate of interest, \(r\), is constant and the same for all maturities.</li>
</ol>

<h3 id="156-derivation-of-black-scholes-merton-differential-equation">15.6 Derivation of Black-Scholes-Merton differential equation</h3>

<p>The idea is to use a portfolio of the stock and the derivative to eliminate the Wiener process</p>

<h2 id="chapter-16---employee-stock-options">Chapter 16 - Employee stock options</h2>

<h2 id="chapter-17---options-on-stock-indices-and-currencies">Chapter 17 - Options on stock indices and currencies</h2>

<h2 id="chapter-18---futures-options-and-blacks-model">Chapter 18 - Futures options and Black’s model</h2>

<h2 id="chapter-19---the-greek-letters">Chapter 19 - The Greek letters</h2>

<h2 id="chapter-20---volatility-smiles-and-volatility-surfaces">Chapter 20 - Volatility smiles and volatility surfaces</h2>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[A study note taken for the book Options, Futures and Other Derivatives by John Hull]]></summary></entry><entry><title type="html">Song Lyrics</title><link href="https://lihaoranicefire.github.io/JapaneseSongLyrics/" rel="alternate" type="text/html" title="Song Lyrics" /><published>2024-06-01T00:00:00+00:00</published><updated>2024-06-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/JapaneseSongLyrics</id><content type="html" xml:base="https://lihaoranicefire.github.io/JapaneseSongLyrics/"><![CDATA[<p>A list of song lyrics</p>

<h2 id="ヒロイン">ヒロイン</h2>

\[君の毎日に\quad 僕は\;\overset{にあ}{似合}わない\;かな\]

\[白い空から\quad 雪が\;\overset{お}{落}ちた\]

\[\overset{べつ}{別}に\;\overset{い}{良}いさ\;と\quad \overset{は}{吐}き\overset{だ}{出}した\;ため\overset{いき}{息}\;が\]

\[\overset{すこ}{少}し\;\overset{のこ}{残}って\quad\overset{さび}{寂}しそう\;に\;\overset{き}{消}えた\]

\[君の街にも\quad 降っている\;かな\]

\[ああ\;\overset{いま}{今}\;隣で\]

\[雪が\;\overset{きれい}{綺麗}\;と\;\overset{わら}{笑}うのは君が\;いい\]

\[でも\;\overset{さむ}{寒}い\;ね\;って\;\overset{うれ}{嬉}しそう\;なの\;も\]

\[\overset{ころ}{転}びそう\;になって\;\overset{つか}{掴}んだ\;\overset{て}手のその \overset{さき}先で\]

\[ありがとう\;って\;\overset{たの}{楽}しそう\;なのも\]

\[それも君がいい\]]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[A list of song lyrics]]></summary></entry><entry><title type="html">Machine learning notes</title><link href="https://lihaoranicefire.github.io/MLNotes/" rel="alternate" type="text/html" title="Machine learning notes" /><published>2023-03-15T00:00:00+00:00</published><updated>2023-03-15T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/MLNotes</id><content type="html" xml:base="https://lihaoranicefire.github.io/MLNotes/"><![CDATA[<h2 id="data-preprocessing">Data Preprocessing</h2>

<h3 id="data-scaling-and-standardizing">Data scaling And standardizing</h3>
<p>\(x_i\leftarrow\dfrac{x_i-\mu}{\sigma}\)</p>

<h4 id="pros--cons">Pros &amp; Cons</h4>
<p>Features are brought to a common scale, which prevents features with large magnitudes from dominating distance-based or gradient-based methods</p>

<h4 id="usage">Usage</h4>
<p><code class="language-plaintext highlighter-rouge">sklearn.preprocessing.StandardScaler</code></p>

<h3 id="imputation">Imputation</h3>
<p>The process of replacing missing values is known as data <em>imputation</em></p>

<h4 id="examples">Examples</h4>
<ul>
  <li>Constant imputation: replace with constants</li>
  <li>Linear interpolation/regression imputation: replace using a regression model</li>
  <li>Median/mean/mode/(sample statistic) imputation: replace with the median/mean/mode/(sample statistic)</li>
  <li>Forward/backward fill: replace with the previous/next value</li>
  <li>KNN: replace using the mode of the closest \(k\) neighbors</li>
</ul>
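<p>A few of these strategies in pandas (a sketch; the series values are illustrative):</p>

```python
import numpy as np
import pandas as pd

# A small series with two missing values to impute.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

constant = s.fillna(0.0)        # constant imputation
mean_imp = s.fillna(s.mean())   # mean imputation (mean of observed values = 3.0)
ffill = s.ffill()               # forward fill with the previous value
interp = s.interpolate()        # linear interpolation between neighbors
```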

<h6 id="remark">Remark</h6>
<p>You should not use the labels or the test set to impute the training set</p>

<h3 id="pipelines">Pipelines</h3>

<h5 id="definition">Definition</h5>
<p>A <em>pipeline</em> is a series of data processing components arranged sequentially, each component in the pipeline performs a specific task.</p>

<h6 id="pros--cons-1">Pros &amp; Cons</h6>
<p>This process streamlines the workflow and makes it easier to combine and experiment with different algorithms and models.</p>

<h6 id="example">Example</h6>
<p>Learning cubic polynomial \(y=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\epsilon\)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ('poly', PolynomialFeatures(3, interaction_only=False, include_bias=False)),
    ('reg', LinearRegression(copy_X=True))
])
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">PolynomialFeatures</code> generates higher powers \(x^n\) from \(x\).</p>
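<p>A self-contained fit of this pipeline on synthetic cubic data \(y=1+2x-x^3+\epsilon\) (the data, seed, and coefficients below are illustrative) recovers the coefficients:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y = 1 + 2x - x^3 + small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(500, 1))
y = 1 + 2 * x[:, 0] - x[:, 0] ** 3 + rng.normal(0, 0.1, size=500)

pipe = Pipeline([
    ('poly', PolynomialFeatures(3, include_bias=False)),  # features: x, x^2, x^3
    ('reg', LinearRegression()),
])
pipe.fit(x, y)
coefs = pipe.named_steps['reg'].coef_            # approx [2, 0, -1]
intercept = pipe.named_steps['reg'].intercept_   # approx 1
```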

<h2 id="supervised-learning">Supervised Learning</h2>

<p>Given Dataset \(D=\{(\mathbf x^{(i)},\mathbf y^{(i)})\}_{i=1}^N\), where \(\mathbf x^{(i)}\) are <em>feature vectors</em>, its entries are called <em>features</em>, and \(\mathbf y^{(i)}\) are <em>labels</em> or <em>predictions</em>. Assume \(\mathbf y=f(\mathbf x)+\boldsymbol\epsilon\) is the true relation, where \(f\) is a (typically continuous) function and \(\boldsymbol\epsilon\) is a random noise (typically \(\mathbb E(\boldsymbol\epsilon)=\mathbf 0\) and independent). <em>Supervised learning</em> is to “learn” a <em>model</em> \(\hat f\) of \(f\) and make predictions \(\hat{\mathbf y}=\hat f(\mathbf x)\).</p>

<h3 id="bias-variance-trade-off">Bias-variance trade-off</h3>

<p>Suppose \(\mathbb E[\boldsymbol\epsilon]=\mathbf 0\), \(\hat f(\mathbf x)=\hat f(\mathbf x;D)\) with \(D\) sampled from joint probability distribution of \((\mathbf X,\boldsymbol\epsilon)\). Consider the expected total error at a fixed test input \(\mathbf x\) (so \(\mathbb E=\mathbb E_{\mathbf X,\boldsymbol\epsilon|\mathbf X=\mathbf x}\))
\(\begin{align*}
\mathbb E[\|\mathbf y-\hat{\mathbf y}\|^2] &amp;= \mathbb E[\|f+\boldsymbol\epsilon-\hat f\|^2]\\
&amp;=\mathbb E[\|f-\hat f\|^2]+2\mathbb E[(f-\hat f)\cdot\boldsymbol\epsilon]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\mathbb E[\|f-\hat f\|^2]+2\mathbb E[f-\hat f]\cdot\mathbb E[\boldsymbol\epsilon]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\mathbb E[\|f-\mathbb E\hat f+\mathbb E\hat f-\hat f\|^2]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\mathbb E[\|f-\mathbb E\hat f\|^2]+2\mathbb E[(f-\mathbb E\hat f)\cdot(\mathbb E\hat f-\hat f)]+\mathbb E[\|\mathbb E\hat f-\hat f\|^2]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\|f-\mathbb E\hat f\|^2+2(f-\mathbb E\hat f)\cdot\mathbb E[\mathbb E\hat f-\hat f]+\mathbb E[\|\mathbb E\hat f\|^2-2\mathbb E\hat f\cdot\hat f+\|\hat f\|^2]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\|f-\mathbb E\hat f\|^2+\mathbb E[\|\hat f\|^2]-\|\mathbb E\hat f\|^2+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
&amp;=\|f-\mathbb E\hat f\|^2+\text{Var}[\hat f]+\mathbb E[\|\boldsymbol\epsilon\|^2]\\
\end{align*}\)
Here \(\mathbb E[\|\boldsymbol\epsilon\|^2]\) is referred to as the <em>irreducible error</em>, so we have the simplified version
\(\text{total error = bias}^2 + \text{Variance + irreducible error}\)</p>

<p>When underfitting, the model is too simple, so the bias is huge (e.g., using a linear equation to approximate a quadratic). When overfitting, the model is too complex, so the variance is great (e.g., using a high-degree polynomial to approximate a linear relation with small random noise). One has to trade off bias against variance so that neither is significant.</p>

<h3 id="objective--loss-function">Objective &amp; Loss function</h3>

<p>To improve the model, we need loss functions</p>
<ul>
  <li><em>Mean squared error (MSE)</em>: \(\displaystyle\frac{1}{N}\sum_{i=1}^N\|\mathbf y^{(i)}-\hat{\mathbf y}^{(i)}\|^2\). Used in regression</li>
  <li><em>Mean Absolute Error (MAE)</em>: \(\displaystyle\frac{1}{N}\sum_{i=1}^N\|\mathbf y^{(i)}-\hat{\mathbf y}^{(i)}\|\).</li>
  <li><em>Root Mean Square Error (RMSE)</em>: \(\displaystyle\sqrt{\frac{1}{N}\sum_{i=1}^N\|\mathbf y^{(i)}-\hat{\mathbf y}^{(i)}\|^2}\).</li>
  <li><em>Logistic Loss or Cross-Entropy Loss</em>: \(\displaystyle-\sum_i[y_i\log(\hat p_i)+(1-y_i)\log(1-\hat p_i)]\). Used in binary classification. Or \(\displaystyle-\sum_{c=1}^C\sum_{i}y_{i,c}\log(\hat p_{i,c})\). Used in multiclass classification</li>
</ul>

<p>But to prevent overfitting, we also include a regularization term \(\Omega(\theta)\). The objective function is then the sum of the loss \(L(\theta)\) and \(\Omega(\theta)\)</p>

<h3 id="k-fold-cross-validation--grid-search">\(k\)-fold cross validation &amp; grid search</h3>

<ol>
  <li><em>\(k\)-fold cross validation</em> divides the dataset into \(k\) subsets. The model is trained \(k\) times, each time singling out one subset as the validation set and training on the remaining \(k-1\); the \(k\) validation scores are then averaged.</li>
  <li><em>Grid search</em> provides an array of candidate values for each hyperparameter, tests the model with every combination, and chooses the best one.</li>
</ol>

<h4 id="usage-1">Usage</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error

GridSearchCV(
    cv = KFold(n_splits=5, random_state=30293, shuffle=True),
    estimator = KNeighborsRegressor(),
    param_grid = {
        'n_neighbors': range(1, 50),
        'weights': ['uniform', 'distance']
    },
    scoring = 'neg_mean_squared_error'
)
</code></pre></div></div>

<h6 id="remark-1">Remark</h6>
<p>When should you not use cross-validation, and use simple validation instead?</p>
<ol>
  <li>Dataset size is too small. This can lead to deficiencies in both model fitting and estimation.</li>
  <li>Model training time is too long. The extra fits might not be worth the time.</li>
</ol>

<h3 id="gradient-descent">Gradient descent</h3>
<p>The method of <em>gradient descent</em> decreases the loss function \(\ell\) via the update \(\beta\leftarrow\beta-\alpha\nabla\ell(\beta)\), where \(\alpha\) is the learning rate.
Some common adjustments are</p>
<ol>
  <li><em>Mini-batch gradient descent</em>: instead of using the entire dataset, cycle through mini-batches to compute gradients.</li>
  <li><em>Stochastic gradient descent</em>: compute the gradient from a single randomly chosen sample (or mini-batch) each step.
    <ul>
      <li>Pros: the gradient noise helps avoid getting stuck in a local minimum.</li>
    </ul>
  </li>
</ol>

<h4 id="comparisons-of-common-gradient-descent-methods">Comparisons of common gradient descent methods</h4>
<ul>
  <li><em>Stochastic gradient descent(SGD)</em> updates the parameter using the gradient of a single sample.
  Gradient descent is
  \(\theta_{t+1}=\theta_t-\lambda\cdot\nabla L(\theta_t)\)
  When \(L(\theta)=\dfrac{1}{N}\sum_iL_i(\theta)\), SGD replaces \(\nabla L\) with \(\nabla L_i\) for a randomly chosen \(i\)</li>
  <li><em>Momentum</em> \(v\) is defined by
  \(\begin{cases}
  v_{t+1}=\beta\cdot v_t+(1-\beta)\cdot\nabla L(\theta_t)\\
  \theta_{t+1}=\theta_t-\lambda\cdot v_{t+1}
  \end{cases}\)
  This includes the “inertia” from the previous momentums and gradients, it helps accelerate convergence in the direction of persistent gradient, and reduce oscillations.</li>
  <li><em>Adaptive gradient(Adagrad)</em> is mathematically described by
  \(\begin{cases}
  G_{t+1}=G_t+|\nabla L(\theta_t)|^2\\
  \theta_{t+1}=\theta_t-\dfrac{\lambda}{\sqrt{G_{t+1}+\epsilon}}\cdot \nabla L(\theta_t)
  \end{cases}\)
  \(G_t\) is the accumulated squared gradient (like a second moment). Dividing by \(\sqrt{G_{t+1}+\epsilon}\) keeps the effective step bounded when gradients are large. This method works well with sparse data but might overly reduce the learning rate when some frequently occurring features have large gradients.</li>
  <li><em>Root mean squared propagation(RMSprop)</em> slightly changes Adagrad
  \(\begin{cases}
  G_{t+1}=\beta\cdot G_t+(1-\beta)|\nabla L(\theta_t)|^2\\
  \theta_{t+1}=\theta_t-\dfrac{\lambda}{\sqrt{G_{t+1}+\epsilon}}\cdot \nabla L(\theta_t)
  \end{cases}\)
  The exponential decay of \(G_t\) helps mitigate the problem of the diminishing learning rate</li>
  <li><em>Adaptive moment estimate(Adam)</em> combines momentum and RMSProp
  \(\begin{cases}
  m_{t+1}=\beta_1\cdot m_t+(1-\beta_1)\cdot\nabla L(\theta_t)\\
  v_{t+1}=\beta_2\cdot v_t+(1-\beta_2)|\nabla L(\theta_t)|^2\\
  \hat m_{t+1} = \dfrac{m_{t+1}}{1-\beta_1^{t+1}}\\
  \hat v_{t+1} = \dfrac{v_{t+1}}{1-\beta_2^{t+1}}\\
  \theta_{t+1}=\theta_t-\dfrac{\lambda}{\sqrt{\hat v_{t+1}+\epsilon}}\cdot \hat m_{t+1}
  \end{cases}\)
  The normalization prevents bias from zero initialization (\(m_0=v_0=0\)): dividing by \(1-\beta_1^{t+1},1-\beta_2^{t+1}\) de-biases the early estimates, and as \(t\) grows, \(\beta^t\) decays exponentially to 0 so the correction vanishes.</li>
</ul>
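<p>The Adam update above can be sketched in NumPy (the quadratic test function, seedless setup, and hyperparameters are illustrative; \(\epsilon\) is placed outside the square root here, a common implementation choice):</p>

```python
import numpy as np

# Minimal Adam loop: minimize f(theta) = ||theta - 1||^2, whose gradient is 2(theta - 1).
def adam_minimize(grad, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=3000):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # first-moment estimate
    v = np.zeros_like(theta)   # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta = adam_minimize(lambda th: 2.0 * (th - 1.0), [5.0, -3.0])   # should approach [1, 1]
```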

<h3 id="regularization">Regularization</h3>
<p>Regularization adds penalty terms to the loss function to discourage overly complex models. It controls the magnitude of the coefficient vector \(\beta\)</p>
<ol>
  <li><em>Ridge regularization</em> is to add \(\lambda\|\beta\|_2^2\)</li>
  <li><em>Lasso regularization</em> is to add \(\lambda\|\beta\|_1\)</li>
</ol>

<h4 id="pros--cons-2">Pros &amp; Cons</h4>
<ol>
  <li>Lasso works better for feature selection, so it is better if there are a large amount of features. But it might only randomly choose some of highly correlated features (colinearity).</li>
  <li>Ridge is better if it depends on almost all the features, because it handles colinearity better. However it is computationally costly with a large number of predictors</li>
</ol>
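<p>For intuition, ridge has the closed form \(\beta=(X^TX+\lambda I)^{-1}X^Ty\); a minimal NumPy sketch (synthetic data and illustrative \(\lambda\) values) shows the shrinkage:</p>

```python
import numpy as np

# Closed-form ridge regression: beta = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

beta_small = ridge(X, y, lam=0.01)    # close to the true coefficients
beta_large = ridge(X, y, lam=1000.0)  # heavily shrunk toward zero
```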

<h4 id="elastic-net">Elastic net</h4>
<p>Sometimes it might be better to simply use the <em>elastic net</em> regularization, which adds \(\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2\)</p>

<h3 id="confusion-matrix">Confusion matrix</h3>
<p>The <em>confusion matrix</em> is the \(2\times2\) contingency table, where the rows are the predicted values, and columns are the actual values.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Positive</th>
      <th>Negative</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Positive</td>
      <td>TP</td>
      <td>FP</td>
    </tr>
    <tr>
      <td>Negative</td>
      <td>FN</td>
      <td>TN</td>
    </tr>
  </tbody>
</table>

<p>We define</p>
<ul>
  <li><em>Accuracy</em> = \(\dfrac{TP+TN}{TP+FP+FN+TN}\)
Accuracy is used if the dataset is balanced and equally distributed, e.g. spam detection</li>
  <li><em>Precision</em> = \(\dfrac{TP}{TP+FP}\)
Precision is used if the cost for false positive is high, e.g. Fraud detection</li>
  <li><em>Recall(Sensitivity)</em> = \(\dfrac{TP}{TP+FN}\)
Recall is used if the cost for false negative is high, e.g. disease detection</li>
  <li><em>Specificity</em> = \(\dfrac{TN}{TN+FP}\)</li>
  <li><em>F1 score</em> = harmonic mean of Precision and Recall, i.e.
\(\text{F1 score} = \dfrac{2}{\dfrac{1}{\text{Precision}}+\dfrac{1}{\text{Recall}}} = 2\dfrac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\)
F1 score is the single metric of both the precision and recall which balances the Precision-Recall tradeoff by taking both into account, especially if there is an uneven class distribution, e.g. search engine ranking for relevance.</li>
</ul>
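<p>These metrics can be computed directly from the four counts (the counts below are illustrative):</p>

```python
# Compute the metrics above from confusion-matrix counts.
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, specificity, f1

acc, prec, rec, spec, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```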

<h3 id="roc">ROC</h3>

<p>In binary classification, \(\hat Y\) is usually a continuous score. The <em>receiver operating characteristic curve (ROC)</em> is the parametrized curve \((\text{fpr}(t),\text{tpr}(t))\), \(t\in\mathbb R\), where</p>
<ul>
  <li>\(\displaystyle\text{tpr}(t)=\frac{TP}{TP+FN}=\mathbb P(\hat Y\geq t\mid Y=1)\) is the true positive rate (recall)</li>
  <li>\(\displaystyle\text{fpr}(t)=\frac{FP}{FP+TN}=\mathbb P(\hat Y\geq t\mid Y=0)\) is the false positive rate (1 − specificity)</li>
  <li>\(t\) is the cut-off</li>
</ul>

<p>It is not hard to conclude</p>
<ul>
  <li>A totally random model corresponds to the diagonal line, where \(\hat Y\) is independent of \(Y\) and thus \(\text{tpr}(t)=\text{fpr}(t)=\mathbb P(\hat Y\geq t)\)</li>
  <li>The perfect model corresponds to the two segments \((0,0)\to(0,1)\) and \((0,1)\to(1,1)\), where \(\mathbb P(\hat Y\geq t_0)=\mathbb P(Y=1)\) for some \(t_0\), and for every \(t\) either \(\text{tpr}(t)=1\) or \(\text{fpr}(t)=0\)</li>
  <li>\(\text{tpr}(-\infty)=\text{fpr}(-\infty)=1\), \(\text{tpr}(\infty)=\text{fpr}(\infty)=0\), and \(\text{tpr},\text{fpr}\) are non-increasing</li>
</ul>

<p>The <em>Area under ROC (AUROC/AUC)</em> is a single-number summary of classifier performance: \(\frac{1}{2}\) corresponds to random guessing, while \(1\) corresponds to outstanding discrimination. The AUC equals the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. Indeed, suppose \(Z_1\sim\hat Y|Y=1\) (with cdf \(1-\text{tpr}\)) and \(Z_0\sim\hat Y|Y=0\) (with cdf \(1-\text{fpr}\)) are independent; then
\(\begin{align*}
\text{AUC}&amp;=\int_0^1ydx=\int_{+\infty}^{-\infty}\text{tpr}(t)d\text{fpr}(t)\\
&amp;=\int_{+\infty}^{-\infty}\text{tpr}(t)\text{fpr}'(t)dt\\
&amp;=\int_{+\infty}^{-\infty}\left(-\int_t^\infty\text{tpr}'(s)ds\right)\text{fpr}'(t)dt\\
&amp;=\int_{-\infty}^\infty\int_{-\infty}^\infty\text{tpr}'(s)\text{fpr}'(t)\mathbf 1_{s\geq t}(s,t)dsdt\\
&amp;=\mathbb P(Z_1\geq Z_0)
\end{align*}\)
Suppose \(\{Z_0^{(i)}\}_{i=1}^{n_0}\sim\hat Y|Y=0\), \(\{Z_1^{(j)}\}_{j=1}^{n_1}\sim\hat Y|Y=1\) are (independent) samples, then an unbiased estimator of AUC is \(\dfrac{U}{n_0n_1}\), where
\(U=\sum_{i=1}^{n_0}\sum_{j=1}^{n_1}\mathbf 1_{Z_1^{(j)}\geq Z_0^{(i)}}=n_0n_1+\frac{n_0(n_0+1)}{2}-R_0\)
and \(R_0\) is the sum of the ranks of the \(Z_0^{(i)}\) among all the samples. This is precisely the <em>Wilcoxon-Mann-Whitney (WMW)</em> \(U\)-statistic.</p>

<ul>
  <li>Proof: Suppose the ranks of \(Z_0^{(i)}\) are \(r_1&lt;\cdots&lt;r_{n_0}\), then \(R_0=r_1+\cdots+ r_{n_0}\) and 
  \(\begin{align*}
  U&amp;=(n_0+n_1-r_{n_0})+(n_0+n_1-1-r_{n_0-1})+\cdots+(n_0+n_1-(n_0-1)-r_1)\\
  &amp;=n_0(n_0+n_1)-R_0-\frac{n_0(n_0-1)}{2}\\
  &amp;=n_0n_1+\frac{n_0(n_0+1)}{2}-R_0
  \end{align*}\)</li>
</ul>
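<p>The rank identity above is easy to verify numerically. The sketch below (illustrative, with simulated Gaussian scores) computes \(U\) both as the raw pairwise count and via \(n_0n_1+\frac{n_0(n_0+1)}{2}-R_0\).</p>

```python
import random

random.seed(7)
z0 = [random.gauss(0.0, 1.0) for _ in range(50)]   # scores given Y = 0
z1 = [random.gauss(1.0, 1.0) for _ in range(60)]   # scores given Y = 1
n0, n1 = len(z0), len(z1)

# Direct pairwise count of concordant pairs.
u_pairs = sum(1 for a in z0 for b in z1 if b >= a)

# Rank-based formula: R0 is the sum of the ranks of the z0's among all samples.
ranks = {v: k + 1 for k, v in enumerate(sorted(z0 + z1))}  # ties a.s. absent
r0 = sum(ranks[v] for v in z0)
u_ranks = n0 * n1 + n0 * (n0 + 1) // 2 - r0

auc_estimate = u_pairs / (n0 * n1)
```

<p>The two computations agree exactly, and the resulting AUC estimate is well above \(\frac12\) since the positive scores were simulated with a higher mean.</p>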

<h3 id="r-squared">R Squared</h3>

<p><em>Total sum of squares</em> \(SS_{tot}=\sum_i(y_i-\bar y)^2\)</p>

<p><em>Residual sum of squares</em> \(SS_{res}=\sum_i(y_i-\hat y_i)^2\)</p>

<p><em>Coefficient of determination (\(R^2\))</em> is defined to be \(1-\dfrac{SS_{res}}{SS_{tot}}\)</p>

<p>If \(R^2=0\), the model does no better than always predicting the constant mean \(\bar y\); if \(R^2=1\), the predictions are exact. Note that \(R^2\) can even be negative when the model fits worse than the constant mean.</p>
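<p>A direct implementation of the definition (a minimal sketch, with illustrative names):</p>

```python
def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, [1.0, 2.0, 3.0, 4.0])  # exact predictions
constant = r_squared(y, [2.5] * 4)            # always predict the mean
```
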

<p>GAIN/LIFT charts</p>

<h3 id="k-nearest-neighbors">\(k\)-nearest neighbors</h3>
<p>The <em>\(k\)-nearest neighbors</em> algorithm assigns to a point the majority label among its \(k\) nearest neighbors.</p>

<h3 id="linear-and-logistic-regression">Linear and logistic regression</h3>
<p><em>Logistic regression</em> is used for binary classification, modeling \(p(x)=\dfrac{1}{1+e^{-\beta x}}\).</p>

<h4 id="interaction-terms-in-linear-regression">Interaction terms in linear regression</h4>
<p>When you have categorical variables, consider adding interaction terms, since a categorical variable may change the effect of other variables on the response.</p>

<h4 id="residual-plots">Residual plots</h4>
<ul>
  <li>Residual vs. features. It helps find missing signals and identify missing interaction terms.</li>
  <li>Residual vs. predicted values. It helps detect heteroscedasticity and remaining nonlinearity.</li>
</ul>

<h4 id="feature-selection">Feature selection</h4>
<p>The <em>best subsets selection</em> tries every possible subset of features and then chooses the best one. This is computationally very costly. Instead we could do</p>
<ul>
  <li><em>Forward selection</em>: Start with the baseline model (no features selected). At each step, try adding each of the remaining features to the current model, keep the best performing one (minimal MSE), discard the others, and iterate; if none is better than the current model, stop and use the current model.</li>
  <li><em>Backward selection</em>: Start with a model that includes all features, then remove features one at a time. If removing any feature makes the model worse, stop and use the current model.</li>
</ul>

<p>We could also simply try lasso regularization, which shrinks the coefficients of uninformative features toward zero.</p>
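<p>The forward-selection loop is short enough to write out. Below is a self-contained sketch (all names are illustrative): the model is ordinary least squares solved via the normal equations, the score is in-sample MSE, and the toy response depends on two of the three candidate features.</p>

```python
import random

def ols_fit(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    p, n = len(X[0]), len(X)
    A = [[sum(X[i][q] * X[i][r] for i in range(n)) for r in range(p)] for q in range(p)]
    b = [sum(X[i][q] * y[i] for i in range(n)) for q in range(p)]
    for c in range(p):                       # forward elimination with pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for c in reversed(range(p)):             # back substitution
        beta[c] = (b[c] - sum(A[c][k] * beta[k] for k in range(c + 1, p))) / A[c][c]
    return beta

def mse(X, y, beta):
    return sum((y[i] - sum(bk * xk for bk, xk in zip(beta, X[i]))) ** 2
               for i in range(len(y))) / len(y)

def forward_selection(features, y):
    """features: dict name -> column of values; returns greedily selected names."""
    y_bar = sum(y) / len(y)
    selected, best_mse = [], sum((v - y_bar) ** 2 for v in y) / len(y)
    while True:
        best = None
        for name in features:
            if name in selected:
                continue
            cols = selected + [name]
            X = [[features[c][i] for c in cols] for i in range(len(y))]
            m = mse(X, y, ols_fit(X, y))
            if m < best_mse - 1e-12:         # keep only strict improvements
                best, best_mse = name, m
        if best is None:                     # nothing improves: stop
            return selected
        selected.append(best)

random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]     # pure noise feature
y = [2 * a - b for a, b in zip(x1, x2)]            # depends only on x1, x2
chosen = forward_selection({"x1": x1, "x2": x2, "noise": noise}, y)
```

<p>On this toy data the procedure picks the two informative features and stops, since adding the noise feature no longer improves the MSE.</p>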

<h4 id="regressino-version-of-classification-algorithms">Regression versions of classification algorithms</h4>
<ol>
  <li><em>\(k\)-nearest neighbors regression</em> takes the average of the \(k\) nearest values.</li>
  <li><em>Tree regression</em> uses MSE as the loss function</li>
  <li><em>Support vector regression</em></li>
</ol>

<h3 id="supported-vector-machine">Support vector machine</h3>
<p>In binary classification, we are given a dataset \(\{(\mathbf x_i,y_i)\}_{i=1}^N\), where \(y_i=\pm1\) is the label. Naively, a <em>support vector machine</em> finds a decision boundary that maximizes the margin between the two classes.</p>

<h4 id="hard-margin">Hard margin</h4>
<p>If the data is linearly separable, we wish to find a hyperplane \(\mathbf w\cdot\mathbf x-b=\mathbf0\) that separates the two classes with maximal margin. Equivalently, we solve the following problem: find \(\mathbf w\) and \(b\) minimizing \(\|\mathbf w\|_2^2\) subject to
\(y_i(\mathbf w\cdot\mathbf x_i-b)\geq 1\)
The geometric interpretation rests on the fact:
\(\text{The distance between the origin and the plane }\mathbf w\cdot\mathbf x-b=0\text{ is }\frac{|b|}{\|\mathbf w\|_2}\)
We choose \(\mathbf w,b\) such that the hyperplanes \(\mathbf w\cdot\mathbf x-b=1\) and \(\mathbf w\cdot\mathbf x-b=-1\) barely touch the two classes. The margin between each class and the decision boundary is then \(\frac{1}{\|\mathbf w\|_2}\). Note that this max-margin hyperplane is completely determined by the \(\mathbf x_i\) that lie nearest to it; they are called <em>support vectors</em>.</p>

<h4 id="hinge-loss">Hinge loss</h4>
<p>The <em>hinge loss</em> is the function \(\ell(y)=\max(0,1-t\cdot y)\), where \(t=\pm1\) is the true label and \(y\) is the model's score.</p>

<h4 id="soft-margin">Soft margin</h4>
<p>If the dataset is not linearly separable, we introduce the <em>hinge function</em> \(\max(0,1-y_i(\mathbf w\cdot\mathbf x_i-b))\), which penalizes data on the wrong side of the margin. We can define the loss function
\(\lambda\|\mathbf w\|_2^2+\frac{1}{N}\sum_{i=1}^N\max(0,1-y_i(\mathbf w\cdot\mathbf x_i-b))\)
If \(\lambda\) is small, this behaves essentially like the hard-margin SVM. Equivalently, the soft-margin problem can be written with slack variables: minimize \(\lambda\|\mathbf w\|_2^2+\frac{1}{N}\sum_{i=1}^N\zeta_i\) subject to
\(y_i(\mathbf w\cdot\mathbf x_i-b)\geq 1-\zeta_i,\quad \zeta_i\geq0\)</p>

<h4 id="nonlinear-kernels">Nonlinear kernels</h4>
<p>Sometimes the data is very hard to separate; we then consider transformations \(\varphi\) that map \(\mathbf x_i\) into higher dimensional spaces (even infinite dimensional ones!). With sufficiently good choices, we never need to evaluate \(\varphi\) itself; we only need to know \(\kappa(\mathbf x_i,\mathbf x_j)=\varphi(\mathbf x_i)\cdot\varphi(\mathbf x_j)\). The function \(\kappa\) is called a <em>kernel</em>. Common examples are</p>
<ol>
  <li>Linear: \(\kappa(\mathbf x_i,\mathbf x_j)=\mathbf x_i\cdot\mathbf x_j\).</li>
  <li>Polynomial: \(\kappa(\mathbf x_i,\mathbf x_j)=(\mathbf x_i\cdot\mathbf x_j+r)^d\).
 Note that, for example, if we choose \(\varphi(x_1,x_2)=(x_1^2,\sqrt2x_1x_2,x_2^2)\), then
 \(\varphi(\mathbf x)\cdot\varphi(\mathbf y)=x_1^2y_1^2+2x_1x_2y_1y_2+x_2^2y_2^2=(x_1y_1+x_2y_2)^2=(\mathbf x\cdot\mathbf y)^2\)</li>
  <li>Gaussian Radial Kernel: \(\kappa(\mathbf x_i,\mathbf x_j)=\exp(-\gamma\|\mathbf x_i-\mathbf x_j\|_2^2)\).</li>
  <li>Sigmoid: \(\kappa(\mathbf x_i,\mathbf x_j)=\tanh(\gamma\mathbf x_i\cdot\mathbf x_j+r)\).</li>
</ol>
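<p>The polynomial-kernel identity above can be checked numerically. This little sketch (illustrative) confirms that the explicit feature map \(\varphi(x_1,x_2)=(x_1^2,\sqrt2x_1x_2,x_2^2)\) reproduces \((\mathbf x\cdot\mathbf y)^2\), so the kernel never has to work in the 3-dimensional feature space.</p>

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel with r = 0."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(y))    # inner product in feature space
rhs = dot(x, y) ** 2         # kernel evaluated in the original space
```
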

<p>We can solve the dual optimization problem.</p>

<h3 id="bayes-based-classifiers">Bayes’ based classifiers</h3>

<h4 id="linear-discriminant-analysis-lda">Linear discriminant analysis (LDA)</h4>
<p>Assume \(X|y=c\sim\mathcal N(\mu_c,\sigma^2)\), in the case where \(X\) has one feature, we have
\(f_c(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_c)^2}{2\sigma^2}\right)\)
Then Bayes’ rule tells us
\(P(y=c|X)=\frac{\pi_c\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_c)^2}{2\sigma^2}\right)}{\sum_{l=1}^C\pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_l)^2}{2\sigma^2}\right)}\)
Here \(\pi_c\) denotes \(P(y=c)\). So we could estimate
\(\hat\mu_c=\frac{1}{N_c}\sum_{y_i=c}X_i\)
\(\hat\sigma^2=\frac{1}{N-C}\sum_{c=1}^C\sum_{y_i=c}(X_i-\hat\mu_c)^2\)
We make predictions by choosing the \(c\) that maximizes the posterior \(P(y=c|X)\); this is equivalent to choosing the largest <em>discriminant function</em>
\(\delta_c(X)=X\frac{\mu_c}{\sigma^2}-\frac{\mu_c^2}{2\sigma^2}+\log(\pi_c)\)
Here we should use \(\hat\mu_c,\hat\sigma\). In the case where \(X\) has \(m\) features, we have \(X|y=c\sim\mathcal N(\mu_c,\Sigma)\), and
\(f_c(\mathbf x)=\frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf x-\mu_c)^T\Sigma^{-1}(\mathbf x-\mu_c)\right)\)
And the discriminant function becomes
\(\delta_c(X)=X^T\Sigma^{-1}\mu_c-\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c+\log(\pi_c)\)</p>

<h4 id="quadratic-discriminant-analysis-qda">Quadratic discriminant analysis (QDA)</h4>
<p>Assume \(X|y=c\sim\mathcal N(\mu_c,\Sigma_c)\), we get discriminant
\(\begin{align*}
\delta_c(X)&amp; = -\frac{1}{2} \left( X - \mu_c \right)^T \Sigma_c^{-1}  \left(X- \mu_c  \right) - \frac{1}{2}\log\left(|\Sigma_c| \right) + \log(\pi_c)\\
&amp;= -\frac{1}{2} X^{T} \Sigma^{-1}_c X + X^{T} \Sigma^{-1}_c \mu_c - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c - \frac{1}{2}\log\left(|\Sigma_c| \right) + \log(\pi_c)
\end{align*}\)</p>

<h4 id="naive-bayes-classifier">Naive Bayes classifier</h4>
<p>Assume that, for each given class \(c\), the \(m\) features are independent; we then have
\(f_c(X)=f^{(1)}_c(X_1)\cdots f^{(m)}_c(X_m)\)
Then by Bayes rule
\(P(y=c|X)=\frac{\pi_cf^{(1)}_c(X_1)\cdots f^{(m)}_c(X_m)}{\sum_{l=1}^C\pi_lf^{(1)}_l(X_1)\cdots f^{(m)}_l(X_m)}\)
To estimate \(f^{(i)}_c\) we assume some kind of distribution and then estimate its parameters</p>
<ul>
  <li>If \(X_i\) is quantitative, we assume a normal distribution</li>
  <li>If \(X_i\) is categorical, we assume a Bernoulli distribution</li>
</ul>

<h4 id="pros--cons-3">Pros &amp; Cons</h4>
<ol>
  <li>LDA works better for smaller datasets while QDA needs larger datasets, since QDA estimates a separate covariance matrix for each class and thus has more parameters</li>
  <li>LDA works better if the data can be mostly separated by linear decision boundaries. QDA works better if the decision boundaries are not linear.</li>
  <li>If we have a really small amount of data, we can use the naive Bayes model. It is in general a decent classifier.</li>
</ol>

<h3 id="decision-trees">Decision trees</h3>

<h4 id="pros--cons-4">Pros &amp; Cons</h4>
<ul>
  <li>Pros
    <ul>
      <li>Very fast and needs very little data preprocessing</li>
    </ul>
  </li>
  <li>Cons
    <ul>
      <li>The algorithm is greedy, so it might not create an optimal tree</li>
      <li>Decision trees have orthogonal boundaries, which might not be ideal</li>
      <li>Decision trees are sensitive to training data</li>
    </ul>
  </li>
</ul>

<h4 id="gini-impurity">Gini impurity</h4>
<p><em>Gini impurity</em> \(I_G\) is defined by
\(I_G(p)=\sum_ip_i(1-p_i)=\sum_i(p_i-p_i^2)=1-\sum_ip_i^2\)
For \(N\) classes, \(I_G(p)\) lies between \(0\) and \(1-\dfrac{1}{N}\): it is \(0\) when all samples belong to a single class, and it attains the maximum \(1-\dfrac{1}{N}\) when the classes are evenly distributed.</p>
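<p>A two-line implementation of the formula (illustrative names):</p>

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) from raw class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

pure = gini([10, 0])        # a single class
even = gini([5, 5, 5, 5])   # N = 4 classes evenly distributed: 1 - 1/4
```
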

<h4 id="cross-entropy">Cross entropy</h4>
<p>The <em>information content (surprisal)</em> of an event \(A\) is quantified as \(\log\left(\dfrac{1}{P(A)}\right)=-\log P(A)\). The expected surprisal of \(A\) is \(-P(A)\log P(A)\). The <em>Entropy</em> under \(P\) is the sum of expected surprisal
\(H(P)=-\mathbb E_P[\log P]=-\sum_ip_i\log(p_i)\)
The <em>Cross-entropy</em> of \(Q\) under \(P\) is
\(H(P,Q)=-\mathbb E_P[\log Q]=-\sum_ip_i\log(q_i)\)
which measures the discrepancy using \(Q\) as predictions given the actual distribution is \(P\).</p>

<p>The <em>KL divergence (relative entropy)</em> of \(P\) from \(Q\) is
\(D_{KL}(P||Q)=\sum_ip_i\log\left(\dfrac{p_i}{q_i}\right)=H(P,Q)-H(P)\)
This is always nonnegative (<em>Gibb’s inequality</em>) since
\(\begin{align*}
-D_{KL}(P||Q)&amp;=\sum_ip_i\ln\left(\dfrac{q_i}{p_i}\right)\\
&amp;\leq\sum_ip_i\left(\dfrac{q_i}{p_i}-1\right)\\
&amp;=\sum_iq_i-\sum_ip_i\\
&amp;=0
\end{align*}\)
So the minimum of the cross entropy \(H(P)\) is attained \(\iff P=Q\)</p>

<p>Note that minimizing the cross entropy and minimizing the relative entropy are equivalent, because the entropy of \(P\) is fixed.</p>
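<p>The identity \(D_{KL}(P||Q)=H(P,Q)-H(P)\) and Gibbs’ inequality are easy to check numerically; a minimal sketch (illustrative distributions):</p>

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL divergence as cross entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
```

<p>For any two distributions the divergence is nonnegative, and it vanishes exactly when \(P=Q\).</p>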

<h4 id="cart-algorithm">CART algorithm</h4>
<p>The <em>CART</em> (Classification and Regression Trees) algorithm is a decision tree-based algorithm that can be used for both classification and regression problems in machine learning. It works by recursively partitioning the training data into smaller subsets using binary splits.</p>

<h4 id="random-forest">Random Forest</h4>
<p>The <em>random forest</em> model is built from many different decision trees. These trees are made “different” through a variety of random perturbations (bootstrapped samples and random feature subsets). Finally, average the predictions of all trees (or take a majority vote for classification).</p>

<p><a href="https://xgboost.readthedocs.io/en/stable/tutorials/model.html">XGBoost Tutorials</a></p>

<h4 id="boosting">Boosting</h4>
<p>A statistical learning algorithm is said to be a</p>
<ul>
  <li><em>weak learner</em> if it does slightly better than random guessing.</li>
  <li><em>strong learner</em> if it can be made arbitrarily close to the true relationship</li>
</ul>

<p>Thanks to PAC (probably approximately correct) learnability, one can show that there exist <em>boosting</em> algorithms that can turn weak learners into strong learners.</p>

<p>For example, a decision tree with a single layer (a decision stump) is a weak learner, whereas a full decision tree is a strong learner.</p>

<h4 id="adaptive-boosting">Adaptive boosting</h4>
<p><em>Adaptive boosting</em> builds a strong learner iteratively by learning the weaknesses of the previous weak learners. Suppose we have iteratively built the first \(j\) weak learners and now construct the \(j+1\)-th. Suppose the prediction of \(y_i\) by the \(j\)-th weak learner is \(\hat y^{(j)}_i\), and the current weight assigned to \(y_i\) is \(w_i\). We calculate the <em>weighted error rate</em> = 1 - weighted accuracy
\(r_j=\frac{\sum_{\hat y^{(j)}_i\neq y_i}w_i}{\sum_{i=1}^Nw_i}\)
We then calculate the weight assigned to the \(j\)-th weak learner
\(\alpha_j=\eta\log\left(\frac{1-r_j}{r_j}\right)\)
where \(\eta\) is the learning rate. Finally we update the training sample weights for the \(j+1\)-th weak learner
\(w_i\leftarrow\begin{cases}
w_i, &amp; \hat y^{(j)}_i=y_i\\
w_i\exp(\alpha_j), &amp; \hat y^{(j)}_i\neq y_i
\end{cases}\)</p>
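<p>One round of the update above, written out as code (a sketch; the names and the tiny example are my own, with \(\eta=1\)):</p>

```python
import math

def adaboost_round(weights, correct, eta=1.0):
    """weights: current sample weights; correct: bool per sample."""
    total = sum(weights)
    r = sum(w for w, ok in zip(weights, correct) if not ok) / total  # error rate
    alpha = eta * math.log((1 - r) / r)                              # learner weight
    new_w = [w if ok else w * math.exp(alpha)                        # boost mistakes
             for w, ok in zip(weights, correct)]
    return r, alpha, new_w

w = [1.0, 1.0, 1.0, 1.0]
r, alpha, w2 = adaboost_round(w, [True, True, True, False])
```

<p>With one mistake out of four equally weighted samples, \(r_j=\frac14\), \(\alpha_j=\log 3\), and the misclassified sample's weight is tripled while the others stay at 1.</p>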

<h4 id="gradient-boosting">Gradient boosting</h4>
<p><em>Gradient boosting</em> iteratively builds an ensemble of weak learners, where each learner is trained directly to model the previous learners’ errors. Suppose we have built the first \(j\) weak learners. We build the \(j+1\)-th weak learner by training it to predict the residual \(r_j\) of the current ensemble, setting \(h_{j+1}(X)=\hat r_j\) as its estimate of the residual, and then computing the new residual \(r_{j+1}=r_j-h_{j+1}(X)\). In the end the strong learner \(h(X)\) is the sum of all the weak learners \(h_j(X)\).</p>

<p><em>XGBoost (extreme Gradient boosting)</em> is a specific implementation of gradient boosting that is optimized for performance, efficiency, and scalability. So it is very popular.</p>
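<p>The residual-fitting loop is easiest to see on a one-dimensional toy problem. The sketch below (illustrative, not how XGBoost is implemented) uses regression stumps as the weak learners under squared loss: each stump is fit to the current residual, and the residual is then updated.</p>

```python
def fit_stump(x, r):
    """Best single-split predictor of the residual r by squared error."""
    best = None
    for s in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not right:                      # split must leave both sides nonempty
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm

def boost(x, y, rounds=50):
    residual = list(y)                     # r_0 is y itself
    learners = []
    for _ in range(rounds):
        h = fit_stump(x, residual)
        learners.append(h)
        residual = [ri - h(xi) for xi, ri in zip(x, residual)]  # r_{j+1} = r_j - h_{j+1}(x)
    return lambda xi: sum(h(xi) for h in learners)   # strong learner = sum of stumps

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0]
model = boost(x, y)
train_mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(x)
```

<p>Because each stump removes part of the remaining residual, the training error shrinks geometrically over the rounds on this toy data.</p>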

<h3 id="time-series">Time series</h3>

<ol>
  <li>A <em>time series</em> is a sequence of data points \(\{(\mathbf x_t,y_t)\}\) where \(\mathbf x_t\) is a collection of features, \(y_t\) is a numeric variable of interest, and \(t\) stands for time.</li>
  <li>Given a time series
\(\{(\mathbf x_{t_i},y_{t_i})\}_{i=1}^n\), a <em>forecast</em> is
\(y_t=f(\mathbf x_t,t|\{y_\tau\}_{\tau&lt;t})+\epsilon_t\).</li>
  <li>A model for time series is a series of random variables \(\{y_t\}_{t\in T}\), where \(y_t\) only depends on \(\mathbf x_t,t\), and \(\mathbf x_t\) is a collection of features that only depends on \(t\).</li>
</ol>

<h4 id="baseline-forecasting-models">Baseline forecasting models</h4>
<ol>
  <li>without trend nor seasonality
    <ul>
      <li><em>Average forecast</em> assumes \(y_t\) are independent and identically distributed. The forecast \(y_t=\dfrac{1}{n}\sum\limits_{i=1}^ny_i+\epsilon\) takes the historical average.</li>
      <li><em>Naive forecast</em> assumes \(y_t\) is a random walk. The forecast \(y_{t}=y_n+\epsilon\) only uses the last observation.</li>
    </ul>
  </li>
  <li>with trend but not seasonality
    <ul>
      <li>Linear trend forecast assumes \(E(y_t)=\beta t\). The forecast is \(y_t=\hat\beta t+\epsilon\) with \(\hat\beta\) being the average of first differences \(y_{i+1}-y_i\). An intercept term can be added.</li>
      <li>Random walk with drift assumes \(y_{t+1}=y_t+\beta+\epsilon\). The forecast is \(y_t=y_n+\hat\beta(t-n)+\epsilon\) with \(\hat\beta\) being the average of first differences.</li>
    </ul>
  </li>
  <li>with seasonality but not trend
    <ul>
      <li>Seasonal average forecast assumes \(\{y_{r+km}\}_{k}\) are independent and identically distributed for each \(0\leq r&lt;m\). The forecast is
 \(y_t=\dfrac{1}{\lfloor n/m\rfloor+1}\sum\limits_{k=0}^{\lfloor n/m\rfloor}y_{r+km},\quad r=t\mod m\)</li>
      <li>Seasonal naive forecast assumes \(\{y_{r+km}\}_{k}\) are random walks. The forecast is
 \(y_t=y_\tau+\epsilon,\quad \tau=t-\left(\left\lfloor\frac{t-n}{m}\right\rfloor+1\right)m\)</li>
    </ul>
  </li>
</ol>
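<p>The non-seasonal baselines above amount to a few lines of code (a sketch with an illustrative toy history):</p>

```python
history = [10.0, 12.0, 11.0, 13.0, 12.0]

average_forecast = sum(history) / len(history)   # i.i.d. assumption
naive_forecast = history[-1]                     # random-walk assumption

# Random walk with drift: last value plus the average first difference.
drift = sum(b - a for a, b in zip(history, history[1:])) / (len(history) - 1)
drift_forecast_1step = naive_forecast + drift
```
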

<h4 id="stationary-series">Stationary series</h4>
<ol>
  <li>A time series is <em>strictly stationary</em> if \(y_{t_1},\cdots,y_{t_n}\) and \(y_{t_1+\tau},\cdots,y_{t_n+\tau}\) has the same joint probability distribution for any \(n,\tau,t_1,\cdots, t_n\). In particular, we would have
    <ul>
      <li>\(E(y_t)=\mu\) and \(\operatorname{Var}(y_t)=\sigma^2\).</li>
      <li>The joint distribution of \(y_{t_1},\cdots,y_{t_n}\) only depends on \(t_{i+1}-t_i\), these are referred to as the <em>lags</em>.</li>
    </ul>
  </li>
  <li>A time series is <em>stationary</em> if
 \(E(y_t)=\mu,\qquad\operatorname{Cov}(y_t,y_{t+\tau})=\gamma(\tau)\)
 here \(\gamma(\tau)\) is called the <em>autocovariance</em>, and note that \(\operatorname{Var}(y_t)=\gamma(0)=\sigma^2\).</li>
</ol>

<h5 id="examples-1">Examples</h5>
<ol>
  <li><em>White noise</em> is a stationary time series with zero mean, constant variance, and zero correlation between different times.</li>
  <li>The first differences \(y_{t+1}-y_t\) of a random walk \(y_{t+1}=y_t+\epsilon\) (they are white noise).</li>
  <li>A moving average process \(y_t=\beta_0\epsilon_t+\beta_1\epsilon_{t-1}+\cdots+\beta_q\epsilon_{t-q}\).</li>
</ol>

<h5 id="differencing">Differencing</h5>
<p>The \(d\)-th differences \(\nabla^{d}y_t=\nabla^{d-1}y_t-\nabla^{d-1}y_{t-1}\) often produce a stationary series from a non-stationary one.</p>
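<p>A quick illustration (not from the notes): first differences remove a linear trend and second differences remove a quadratic one.</p>

```python
def diff(seq):
    """First differences of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

linear = [2.0 * t for t in range(6)]          # y_t = 2t
quadratic = [float(t * t) for t in range(6)]  # y_t = t^2

d1_linear = diff(linear)              # constant: the trend is removed
d2_quadratic = diff(diff(quadratic))  # constant after two differences
```
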

<h4 id="arima">ARIMA</h4>
<ol>
  <li>A time series is <em>autoregressive</em> (AR) of order \(p\) if
 \(y_t=\alpha_1y_{t-1}+\cdots+\alpha_py_{t-p}+\epsilon_t\)</li>
  <li>A time series is autoregressive of order \(p\) with moving average noise (ARMA) of order \(q\) if
 \(y_t=\alpha_1y_{t-1}+\cdots+\alpha_py_{t-p}+\beta_0\epsilon_t+\beta_1\epsilon_{t-1}+\cdots+\beta_q\epsilon_{t-q}\)</li>
  <li>An autoregressive integrated moving average model (ARIMA(\(p,d,q\))) is a time series whose \(d\)-th difference is an ARMA(\(p,q\)).</li>
</ol>

<h2 id="unsupervised-learning">Unsupervised Learning</h2>

<h3 id="principal-components-analysis-following-scikit-learn">Principal components analysis (Following scikit-learn)</h3>

<p><em>Principal components analysis (PCA)</em> is a <em>dimension reduction</em> algorithm. Its goal is to project the data onto a lower dimensional subspace along directions that maximize variance.</p>

<p>Suppose there are \(N\) observations
\(\{\mathbf x^{(i)}=(x^{(i)}_1,\cdots,x^{(i)}_p)\}_{i=1}^N\)
of \(p\) features \(\mathbf X=(X_1,\cdots,X_p)\), then
\(\mathbb EX_q=\frac{1}{N}\sum_{i=1}^Nx^{(i)}_q, \quad \text{Cov}(X_q,X_r)=\mathbb E[(X_q-\mathbb EX_q)(X_r-\mathbb EX_r)]\)
Denote \(A=\begin{bmatrix}\mathbf x^{(1)}\\\vdots\\\mathbf x^{(N)}\end{bmatrix}\), and \(\bar A\) whose \(q\)-th column consists of only \(\mathbb EX_q\), then the covariance matrix is
\(\Sigma=\text{Cov}(\mathbf X,\mathbf X)=\mathbb E[(\mathbf X-\mathbb E\mathbf X)^T(\mathbf X-\mathbb E\mathbf X)]=\frac{1}{N-1}(A-\bar A)^T(A-\bar A)\)
A heuristic algorithm could be</p>

<ol>
  <li>Center the dataset so that each feature has zero mean \(\iff A\leftarrow A-\bar A\)</li>
  <li>Induction on \(k\). Choose the \(k\)-th weight vector \(\mathbf w^{(k)}=(w^{(k)}_1,\cdots, w^{(k)}_p)^T\in\mathbb R^p\) such that
 \(\|\mathbf w^{(k)}\|=1, \qquad \mathbf w^{(k)}\perp\mathbf w^{(i)},\quad\forall i&lt;k\)
 that maximizes variance
 \(\text{Var}(\mathbf X^T\mathbf w^{(k)})=(\mathbf w^{(k)})^T\text{Var}(\mathbf X)\mathbf w^{(k)}=(\mathbf w^{(k)})^T\Sigma\mathbf w^{(k)}\)</li>
</ol>

<p>This is just <em>singular value decomposition</em> for \(A-\bar A\). Suppose
\(A-\bar A=V^TSW\)
is the singular decomposition, then
\(\Sigma=W^T\frac{S^2}{N-1}W\)</p>

<p>The \(k\)-th principal component of \(\mathbf x^{(i)}\) is \(\mathbf x^{(i)}\cdot\mathbf w^{(k)}\). The <em>explained variances</em> are the diagonal elements in \(\dfrac{S^2}{N-1}\).</p>
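<p>The first PCA step can be sketched by hand on 2-dimensional data (illustrative; real implementations use the SVD directly). Power iteration on the covariance matrix recovers the leading weight vector \(\mathbf w^{(1)}\) and its explained variance.</p>

```python
import math
import random

random.seed(0)
n = 500
# 2-D data stretched along the x-axis, so the first PC should be close to (1, 0).
data = [(random.gauss(0, 3), random.gauss(0, 0.5)) for _ in range(n)]

mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]     # step 1: zero-mean features

cxx = sum(x * x for x, _ in centered) / (n - 1)    # covariance matrix entries
cxy = sum(x * y for x, y in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)

v = (1.0, 1.0)
for _ in range(100):                               # power iteration on Sigma
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(w[0], w[1])
    v = (w[0] / norm, w[1] / norm)

# Variance explained by the leading direction: v^T Sigma v.
explained = cxx * v[0] ** 2 + 2 * cxy * v[0] * v[1] + cyy * v[1] ** 2
```
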

<h3 id="t-distributed-stochastic-neighbor-embedding">\(t\)-distributed stochastic neighbor embedding</h3>

<p><em>\(t\)-distributed stochastic neighbor embedding (tSNE)</em> typically reduces the dimension of the set of \(m\) features down to 2 or 3 for visualization. Suppose \(y_i\) is a low dimensional projection of \(x_i\); we define conditional probabilities</p>

\[p_{j|i}=\frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k\neq i}\exp(-\|x_i-x_k\|^2/2\sigma_i^2)}\]

\[q_{j|i}=\frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k\neq i}(1+\|y_i-y_k\|^2)^{-1}}\]

<p>We set
\(p_{i|i}=q_{i|i}=0\). The \(p_{j|i}\) and \(q_{j|i}\) are expected to be close, so we choose the cost function to be the KL divergence and minimize it over the \(y_i\)’s.</p>

<h4 id="pros--cons-5">Pros &amp; Cons</h4>
<ol>
  <li>Since it is stochastic, it generates slightly different results each time.</li>
  <li>Unlike PCA, it is not reusable for making predictions on new data.</li>
  <li>The magnitude of the distances between clusters shouldn’t be interpreted.</li>
  <li>tSNE results should not be used as statistical evidence or proof of something, and it sometimes can produce clusters that aren’t actually true. Thus it is always a good practice to run it a few times to ensure that the cluster persists.</li>
</ol>

<h3 id="k-means-clustering">\(K\) means clustering</h3>
<p><em>\(K\) means clustering</em> tries to divide a dataset into \(k\) clusters. Start with random guess of \(k\) centroids. Then group all points according to distance to the centroids. Recalculate the centroid as the average of each group. Repeat these steps until you see no change of groups.</p>
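<p>The loop above in a few lines (a bare-bones sketch; the initial centroids are passed in explicitly here, whereas real implementations use random restarts or k-means++):</p>

```python
import math
import random

def kmeans(points, init, iters=100):
    """points, init: lists of coordinate tuples; returns the final centroids."""
    centroids = list(init)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                         # assign to the nearest centroid
            j = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]  # recompute centroids as means
        if new == centroids:                     # no change of groups: stop
            return centroids
        centroids = new
    return centroids

rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.2), rng.gauss(0, 0.2)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.2), rng.gauss(5, 0.2)) for _ in range(50)]
centers = sorted(kmeans(blob_a + blob_b, init=[blob_a[0], blob_b[0]]))
```

<p>On two well-separated blobs the centroids converge to the blob means in a couple of iterations.</p>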

<h4 id="how-to-choose-the-best-k">How to choose the best \(K\)?</h4>
<p>Typically we run the algorithm multiple times, examining the behaviour of the model for different values of \(K\) according to some metric, and then choose the best.</p>
<ol>
  <li>The <em>elbow method</em>. We first calculate the <em>inertia</em> of the resulting clustering, defined to be
 \(\sum_{i=1}^n\operatorname{dist}(X^{(i)},c^{(i)})^2\)
 and then look for the “elbow” in the graph of inertia against \(K\).</li>
  <li>The <em>Silhouette method</em>. The <em>Silhouette score</em> for a data point \(x_i\) with \(i\) in cluster \(I\) is defined to be
 \(\dfrac{b-a}{\max(a,b)}=\begin{cases}
 1-a/b, &amp; a&lt;b\\
 0, &amp; a=b\\
 b/a-1, &amp; a&gt;b
 \end{cases}\)
 where \(a=\dfrac{1}{|I|-1}\sum_{j\in I,j\neq i}d(x_i,x_j)\) is the average distance between \(x_i\) and the other points in \(I\), and \(b=\min\limits_{J\neq I}\dfrac{1}{|J|}\sum_{j\in J}d(x_i,x_j)\) is the minimal average distance between \(x_i\) and the points of some other cluster \(J\neq I\). Note that this score ranges from -1 to 1. The higher the score, the better the clustering.</li>
</ol>

<p>We can use it to generate <em>silhouette plots</em>.</p>

<h3 id="hierarchical-clustering">Hierarchical clustering</h3>
<p><em>Hierarchical clustering</em> starts with each point as its own cluster and works its way up by merging clusters, generating a <em>dendrogram</em>. To decide when to merge clusters, we need a measure of cluster <em>linkage</em>:</p>
<ol>
  <li><em>single linkage</em>. The minimal distance between two points in two clusters.</li>
  <li><em>complete linkage</em>. The maximal distance between two points in two clusters.</li>
  <li><em>centroid linkage</em>. The distance between centroids.</li>
</ol>

<h2 id="neural-networks">Neural Networks</h2>
<p>Start with \(n\) observations with \(m\) features</p>

<h4 id="perceptron">Perceptron</h4>
<p>A <em>perceptron</em> is to a neuron as an artificial neural network is to an actual neural network. With a predefined <em>activation function</em> \(\sigma\) (some nonlinear function) it outputs
\(\hat y=\sigma(w_1x_1+\cdots+w_mx_m+b)=\sigma(\mathbf x\cdot\mathbf w)\)
where augmenting \(\mathbf x\) by 1 and \(\mathbf w\) by \(b\) adds a <em>bias</em> term. The decision boundary is still linear, which is not ideal, so we introduce multilayer neural networks.</p>
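<p>A single perceptron in code (a sketch; the sigmoid is used as the activation \(\sigma\) and the weights are illustrative):</p>

```python
import math

def perceptron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w . x + bias
    return 1 / (1 + math.exp(-z))                  # sigma = sigmoid

out = perceptron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

<p>Here \(z=0.5\cdot1-0.25\cdot2+0=0\), so the output is exactly \(\sigma(0)=0.5\), i.e. the point sits on the decision boundary.</p>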

<h4 id="feed-forward-network-architecture">Feed forward network architecture</h4>
<p>A <em>feed forward network architecture</em> is a multilayered neural network where each layer consists of many perceptrons. Feeding forward through each layer is a matrix multiplication followed by the activation. In terms of equations,
\(h_1=\sigma(W_1\mathbf x),\cdots,\hat y=\sigma(W_{k+1}h_k)\)</p>

<h4 id="backpropagation">Backpropagation</h4>
<p>To adjust the weights in the neural network, we need <em>backpropagation</em>. If we take the loss function to be \(\ell=(\hat y-y)^2\), then \(\nabla\ell\) can be computed using the chain rule, and we update the weights by \(\mathbf w\leftarrow\mathbf w-\eta\nabla\ell(\mathbf w)\). This process runs through the whole set of training points; a complete pass is referred to as an <em>epoch</em>.</p>

<h4 id="convolution-neural-network">Convolution neural network</h4>
<p><em>Convolution layers</em> perform convolutions over the image (a square matrix of data points). A <em>pooling layer</em> with a stride takes the maximum/minimum/average of the points in each window, which downsamples our observations or degrades the image. Then we feed into a fully connected layer. The reason for pooling is that otherwise the computation in the dense layer could be huge.</p>

<p><em>Padding</em> simply adds zeros or constants at the boundary of the image.</p>

<h4 id="recurrent-neural-network">Recurrent neural network</h4>
<p>The set up in equations is
\(h_1=\sigma(W_{xh}X^{(1)}),\cdots,h_t=\sigma(W_{xh}X^{(t)}+W_{hh}h_{t-1}),\hat y^{(t)}=\sigma'(W_{hy}h_t)\)</p>

<h3 id="long-short-term-memory">Long short-term memory</h3>
<p><em>Long short-term memory (LSTM)</em> is an improved RNN model that overcomes the issue of vanishing gradients and captures long term dependencies much better than a plain RNN.</p>

<h3 id="transformer-model">Transformer model</h3>
<ol>
  <li>Input embedding</li>
  <li>Positional embedding</li>
  <li>Multi-head Attention</li>
</ol>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Data Preprocessing]]></summary></entry><entry><title type="html">LLM &amp;amp; Gen AI Notes</title><link href="https://lihaoranicefire.github.io/LLM-GenAI/" rel="alternate" type="text/html" title="LLM &amp;amp; Gen AI Notes" /><published>2023-03-15T00:00:00+00:00</published><updated>2023-03-15T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/LLM-GenAI</id><content type="html" xml:base="https://lihaoranicefire.github.io/LLM-GenAI/"><![CDATA[<h1 id="large-language-model-and-generative-ai">Large Language Model and Generative AI</h1>
<p>A <em>language model</em> is a model that estimates the probability <code class="language-plaintext highlighter-rouge">p(s)</code> of occurrence of a sentence <code class="language-plaintext highlighter-rouge">s</code>.</p>

<h2 id="natural-language-processing">Natural Language Processing</h2>

<h3 id="word-sense-disambiguation-and-uncertainty-in-language">Word sense disambiguation and uncertainty in language</h3>
<ol>
  <li><em>Lexical ambiguity</em>. For example, “silver” can be a noun, an adjective, or a verb.</li>
  <li><em>Syntactic ambiguity</em>. For example, “The man saw the girl with the telescope”. It is ambiguous whether the man saw the girl carrying a telescope or he saw her through his telescope.</li>
  <li><em>Semantic ambiguity</em>. For example, “The car hit the dog while it was moving”. It is unclear whether the dog or the car is moving.</li>
</ol>

<p>There are several potential phases of natural language processing, summarized below.
<img src="https://www.tutorialspoint.com/natural_language_processing/images/phases_or_logical_steps.jpg" alt="NLP phases" /></p>
<ol>
  <li><em>Morphological processing</em> refers to the cognitive mechanisms involved in recognizing and understanding the structure and meaning of words based on their constituent <em>morphemes</em>. Morphemes are the smallest units of meaning in a language, including prefixes, suffixes, roots, and other meaningful elements. For example, a word like “uneasy” can be broken into two sub-word tokens as “un-easy”.</li>
  <li><em>Syntax analysis</em>, also known as <em>parsing</em>, is the process of analyzing the grammatical structure of sentences in natural language to determine their syntactic relationships and properties. It involves breaking down sentences into their constituent parts and representing them according to the rules of a formal grammar.</li>
  <li><em>Semantic analysis</em>, is the process of understanding the meaning of text or speech in natural language. It involves analyzing the semantics, or meaning, of words, phrases, sentences, and larger units of discourse to extract and represent the underlying information.</li>
  <li><em>Pragmatic analysis</em>, studies how language is used in context to convey meaning beyond the literal interpretation of words and sentences. It focuses on the social, cultural, and situational aspects of language use, as well as the intentions and beliefs of speakers and listeners.</li>
</ol>

<p><em>Part-of-speech (POS)</em> tagging is the process of assigning a specific part of speech (such as noun, verb, adjective, etc.) to each word in a given text corpus. POS tagging is essential for understanding the syntactic structure and meaning of a sentence. It helps disambiguate the meaning of words that may have multiple possible interpretations based on their context.</p>

<h3 id="text-normalization">Text normalization</h3>
<p><em>Text normalization</em> refers to the process of transforming text into a canonical, standardized form. Some common techniques are</p>
<ol>
  <li>Lowercasing</li>
  <li>Tokenization</li>
  <li>Removing punctuation, special characters, numbers, stop words</li>
  <li>Stemming</li>
  <li>Lemmatization</li>
  <li>Spelling corrections</li>
  <li>Handling contractions</li>
</ol>

<h2 id="recurrent-neural-network">Recurrent Neural Network</h2>
<p>A <em>recurrent neural network (RNN)</em> is a class of neural networks that discover the sequential nature of the input data. Inputs could be text, speech, time series, etc.</p>

<p>The architecture of the simplest RNN is
\(h_t=\tanh(W_{h}\cdot h_{t-1}+W_{x}\cdot x_{t}+b)\)</p>

<p>Types of RNN</p>
<ul>
  <li>one-to-one: traditional neural network</li>
  <li>one-to-many: music generation</li>
  <li>many-to-one: sentiment classification</li>
  <li>many-to-many (equal): name entity recognition</li>
  <li>many-to-many (unequal): machine translation</li>
</ul>

<p>The loss function is the sum of losses of all time steps.</p>

<p>Due to the number of layers in a deep neural network, the gradients, which by the chain rule are products of many matrices, will shrink exponentially if the factors are small (&lt;1) and will blow up if they are large (&gt;1). This is called the <em>vanishing or exploding gradient problem</em>.</p>

<h2 id="long-short-term-memory">Long Short Term Memory</h2>
<p><em>Long short term memory (LSTM)</em> is a special kind of RNN designed to avoid the long-term dependency problem. All RNNs have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a single <code class="language-plaintext highlighter-rouge">tanh</code> layer, whereas an LSTM has four, interacting as below
\(\begin{cases}
f_t=\sigma(W_{fh}\cdot h_{t-1}+W_{fx}\cdot x_t+b_f) \\
i_t=\sigma(W_{ih}\cdot h_{t-1}+W_{ix}\cdot x_t+b_i) \\
o_t=\sigma(W_{oh}\cdot h_{t-1}+W_{ox}\cdot x_t+b_o) \\
\tilde C_t=\tanh(W_{Ch}\cdot h_{t-1}+W_{Cx}\cdot x_t+b_C) \\
C_t=f_t\cdot C_{t-1}+i_t\cdot \tilde C_t \\
h_t=o_t\cdot \tanh(C_t)
\end{cases}\)
Here <code class="language-plaintext highlighter-rouge">\sigma</code> is the sigmoid function, <code class="language-plaintext highlighter-rouge">f_t, i_t, o_t</code> are the forget, input, output gates, <code class="language-plaintext highlighter-rouge">C_t</code> is the cell state, and <code class="language-plaintext highlighter-rouge">h_t</code> is the hidden state.</p>
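<p>The six equations above can be sketched directly in NumPy; the weight names, hidden size, and input size here are illustrative assumptions:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x_t, W, b):
    """One LSTM step; W holds the eight weight matrices, b the four biases (keys are illustrative)."""
    f = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])        # forget gate
    i = sigmoid(W["ih"] @ h_prev + W["ix"] @ x_t + b["i"])        # input gate
    o = sigmoid(W["oh"] @ h_prev + W["ox"] @ x_t + b["o"])        # output gate
    C_tilde = np.tanh(W["Ch"] @ h_prev + W["Cx"] @ x_t + b["C"])  # candidate cell state
    C = f * C_prev + i * C_tilde                                  # new cell state
    h = o * np.tanh(C)                                            # new hidden state
    return h, C

rng = np.random.default_rng(1)
d_h, d_x = 4, 3  # arbitrary hidden/input sizes
W = {k: rng.normal(size=(d_h, d_h if k.endswith("h") else d_x))
     for k in ["fh", "fx", "ih", "ix", "oh", "ox", "Ch", "Cx"]}
b = {k: np.zeros(d_h) for k in ["f", "i", "o", "C"]}
h, C = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_x), W, b)
```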

<h2 id="gated-recurrent-unit">Gated Recurrent Unit</h2>
<p><em>Gated recurrent unit (GRU)</em> is a variant of LSTM that has a simpler internal structure, and uses gating mechanisms to control and manage the flow of information between cells in the neural network.
\(\begin{cases}
z_t=\sigma(W_{zh}\cdot h_{t-1}+W_{zx}\cdot x_t) \\
r_t=\sigma(W_{rh}\cdot h_{t-1}+W_{rx}\cdot x_t) \\
\tilde h_t=\tanh(W_{hh}\cdot r_t h_{t-1}+W_{hx}\cdot x_t) \\
h_t=(1-z_t)\cdot h_{t-1}+z_t\cdot \tilde h_t \\
\end{cases}\)</p>

<p>Here <code class="language-plaintext highlighter-rouge">z_t</code> is the update gate, <code class="language-plaintext highlighter-rouge">r_t</code> is the reset (relevance) gate, and <code class="language-plaintext highlighter-rouge">\tilde h_t</code> is the candidate hidden state.</p>
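<p>A corresponding NumPy sketch of one GRU step, with illustrative weight names and sizes (biases omitted, as in the equations above):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W):
    """One GRU step following the equations above (weight names are illustrative)."""
    z = sigmoid(W["zh"] @ h_prev + W["zx"] @ x_t)              # update gate
    r = sigmoid(W["rh"] @ h_prev + W["rx"] @ x_t)              # reset (relevance) gate
    h_tilde = np.tanh(W["hh"] @ (r * h_prev) + W["hx"] @ x_t)  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(2)
d_h, d_x = 4, 3  # arbitrary hidden/input sizes
W = {k: rng.normal(size=(d_h, d_h if k.endswith("h") else d_x))
     for k in ["zh", "zx", "rh", "rx", "hh", "hx"]}
h = gru_step(np.zeros(d_h), rng.normal(size=d_x), W)
```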

<p>Other variants include</p>

<table>
  <thead>
    <tr>
      <th>Bidirectional (BRNN)</th>
      <th>Deep (DRNN)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/bidirectional-rnn-ltr.png?e3e66fae56ea500924825017917b464a" alt="BRNN" /></td>
      <td><img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/deep-rnn-ltr.png?f57da6de44ddd4709ad3b696cac6a912" alt="DRNN" /></td>
    </tr>
  </tbody>
</table>

<h2 id="word-representations">Word Representations</h2>
<p>There are two main ways of representing words</p>
<ol>
  <li>1-hot representation, denoted <code class="language-plaintext highlighter-rouge">o_w</code>.</li>
  <li>word embedding, denoted <code class="language-plaintext highlighter-rouge">e_w</code>.</li>
</ol>

<p>The <em>embedding matrix</em> <code class="language-plaintext highlighter-rouge">E</code> such that <code class="language-plaintext highlighter-rouge">e_w = Eo_w</code> can be learnt using target/context likelihood models by defining the conditional probability as
\(p(w_o|w_i)=\frac{\exp(e_{w_o}\cdot e_{w_i})}{\sum_{w\in V}\exp(e_w\cdot e_{w_i})}\)</p>

<h3 id="bow">BOW</h3>
<p>The <em>bag-of-words (BOW)</em> model treats a document as a collection of words, ignoring their order and structure; the representation is the sum of the 1-hot encodings. It is huge and sparse, it disregards semantic meaning, and it cannot handle out-of-vocabulary words.</p>

<p><em>TF-IDF</em> is the product of <em>term frequency (TF)</em> and <em>inverse document frequency (IDF)</em>. TF-IDF helps rank documents based on their relevance to a query. Documents containing rare or distinctive terms (with high TF-IDF scores) are considered more relevant.</p>
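<p>A minimal pure-Python TF-IDF sketch, assuming the common conventions tf = count / document length and idf = log(N / document frequency) (the text does not fix a particular variant):</p>

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF scores per document: tf(t, d) * log(N / df(t))."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()})
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" occurs in every document, so its idf (and TF-IDF) is 0; "cat" is more distinctive.
```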

<p><em><code class="language-plaintext highlighter-rouge">n</code>-grams</em> are contiguous sequences of <code class="language-plaintext highlighter-rouge">n</code> items from a given sequence of text or speech.</p>
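<p>Extracting <code class="language-plaintext highlighter-rouge">n</code>-grams from a token sequence is a one-liner:</p>

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["to", "be", "or", "not", "to", "be"], 2)
# → [("to", "be"), ("be", "or"), ("or", "not"), ("not", "to"), ("to", "be")]
```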

<h3 id="word2vec">Word2Vec</h3>
<p><em>word2vec</em> is a framework for learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include</p>

<ol>
  <li><em>skip-gram</em> maximizes
 \(\frac{1}{T}\sum_{t=1}^T\sum_{-c\leq j\leq c,j\neq0}\log p(w_{t+j}|w_t)\)</li>
  <li><em>continuous bag-of-words (CBOW)</em> maximizes
 \(\frac{1}{T}\sum_{t=1}^T\sum_{-c\leq j\leq c,j\neq0}\log p(w_t|w_{t+j})\)</li>
</ol>

<p>Computing softmax probabilities over the full vocabulary is computationally expensive. Two common ways around this are hierarchical softmax and negative sampling.</p>

<p><em>Hierarchical softmax</em> speeds up the sum in the denominator of the conditional probability with the help of a binary tree structure. The leaf nodes represent words, and internal nodes measure connections between their child nodes. Concretely, let <code class="language-plaintext highlighter-rouge">n(w_o,k)</code> be the node on the unique path from the root to <code class="language-plaintext highlighter-rouge">w_o</code> with <code class="language-plaintext highlighter-rouge">w_o</code> being its <code class="language-plaintext highlighter-rouge">k</code>-th generation descendant, and store with each internal node a weight vector <code class="language-plaintext highlighter-rouge">v_n</code>. We define
\(\begin{align*}
p(n\to\text{left}|w_i)&amp;=\sigma(v_n\cdot e_i)\\
p(n\to\text{right}|w_i)&amp;=1-p(n\to\text{left}|w_i)=\sigma(-v_n\cdot e_i)
\end{align*}\)
and
\(p(w_o|w_i)=\prod_{k}p(n(w_o,k+1)\to n(w_o,k))\)
The internal node embeddings are learned during model training. The tree structure reduces the complexity of estimating the denominator from <code class="language-plaintext highlighter-rouge">O(V)</code> to <code class="language-plaintext highlighter-rouge">O(\log V)</code>.</p>

<p><em>Negative sampling</em> transforms word prediction into a binary classification problem: the model is trained to distinguish between positive (actual context words, label <code class="language-plaintext highlighter-rouge">y=1</code>) and negative (randomly sampled noise, label <code class="language-plaintext highlighter-rouge">y=0</code>) examples. Concretely, we use probabilities
\(\begin{align*}
p(y=1|w_o,w_i)&amp;=\sigma(e_o\cdot e_i)\\
p(y=0|w_o,w_i)&amp;=1-\sigma(e_o\cdot e_i)=\sigma(-e_o\cdot e_i)
\end{align*}\)
Here <code class="language-plaintext highlighter-rouge">\sigma(x)=\dfrac{1}{1+e^{-x}}</code> is the sigmoid function. We define the loss to be
\(\mathcal L=-\sum_{i,o}\left[\log p(y=1|w_o,w_i)+\sum_{w\sim P_n}\log p(y=0|w,w_i)\right]\)
Here <code class="language-plaintext highlighter-rouge">w\sim P_n</code> is a negatively sampled noise word, drawn from the noise distribution <code class="language-plaintext highlighter-rouge">P_n(w)=U(w)^{3/4}/Z</code>, where <code class="language-plaintext highlighter-rouge">U</code> is the unigram (word-frequency) distribution, <code class="language-plaintext highlighter-rouge">Z</code> is a normalizing constant, and the <code class="language-plaintext highlighter-rouge">3/4</code> power flattens the Zipfian frequency distribution toward rarer words.</p>
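<p>A NumPy sketch of the negative-sampling loss for one (input, context) pair, with hypothetical embedding dimensions and word counts:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(e_i, e_o, e_negs):
    """Negative-sampling loss: -[log sigma(e_o . e_i) + sum_k log sigma(-e_k . e_i)]."""
    pos = np.log(sigmoid(e_o @ e_i))                          # label y = 1: true context word
    neg = sum(np.log(sigmoid(-e_n @ e_i)) for e_n in e_negs)  # label y = 0: sampled noise words
    return -(pos + neg)

rng = np.random.default_rng(3)
e_i, e_o = rng.normal(size=8), rng.normal(size=8)  # hypothetical embeddings, dimension 8
loss = neg_sampling_loss(e_i, e_o, rng.normal(size=(5, 8)))  # 5 noise samples

# Noise distribution P_n(w) = U(w)^{3/4} / Z over hypothetical word frequencies
counts = np.array([100.0, 10.0, 1.0])
P_n = counts ** 0.75 / np.sum(counts ** 0.75)
```

<p>Raising the counts to the 3/4 power gives rare words a larger sampling probability than their raw frequency would.</p>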

<h4 id="pros--cons">Pros &amp; Cons</h4>
<ol>
  <li>skip-gram is better suited for rare words because rare words often have unique contexts.</li>
  <li>skip-gram is known for capturing fine-grained semantic relationships between words. Since it learns separate embeddings for each word, which can represent subtle semantic nuances and capture relationships between words that may appear in diverse contexts.</li>
  <li>CBOW is faster to train. Since it aggregates context information from multiple words to predict a single target word. This approach tends to be computationally more efficient, especially for large vocabularies.</li>
  <li>CBOW performs better for frequent words because it averages context vectors. Frequent words tend to occur in various contexts, and CBOW can effectively aggregate this information to learn robust representations for them.</li>
  <li>skip-gram tends to perform better with larger datasets, while CBOW may perform better with smaller datasets.</li>
</ol>

<h3 id="glove">GloVe</h3>
<p><em>GloVe (global vectors)</em> is a word embedding technique that uses a co-occurrence matrix <code class="language-plaintext highlighter-rouge">X</code> where each <code class="language-plaintext highlighter-rouge">X_{ij}</code> denotes the number of times <code class="language-plaintext highlighter-rouge">w_j</code> occurs in the context of <code class="language-plaintext highlighter-rouge">w_i</code>. The co-occurrence probability is defined to be
\(P_{ij}=p(w_j|w_i)=\frac{X_{ij}}{X_i}\)
Here <code class="language-plaintext highlighter-rouge">X_i=\sum_kX_{ik}</code> is the number of occurrence of <code class="language-plaintext highlighter-rouge">w_i</code>. Define
\(F(w_i,w_j,w_k)=\frac{P_{ik}}{P_{jk}}\)
This ratio sheds some light on the correlation of the probe word <code class="language-plaintext highlighter-rouge">w_k</code> with the words <code class="language-plaintext highlighter-rouge">w_i</code> and <code class="language-plaintext highlighter-rouge">w_j</code>. If the ratio is large, then the probe word is related to <code class="language-plaintext highlighter-rouge">w_i</code> but not <code class="language-plaintext highlighter-rouge">w_j</code>, and vice versa; if it is close to 1, then <code class="language-plaintext highlighter-rouge">w_k</code> is likely related to both of <code class="language-plaintext highlighter-rouge">w_i,w_j</code> or to neither. Since we would like linearity in the word embeddings, we expect <code class="language-plaintext highlighter-rouge">F</code> to satisfy
\(F((e_{w_i}-e_{w_j})\cdot e_{w_k})=\frac{F(e_{w_i}\cdot e_{w_k})}{F(e_{w_j}\cdot e_{w_k})}=\frac{P_{ik}}{P_{jk}}\)
The solution would be <code class="language-plaintext highlighter-rouge">F=\exp</code>, so
\(F(e_{w_i}\cdot e_{w_k})=\exp(e_{w_i}\cdot e_{w_k})\)
Hence
\(e_{w_i}\cdot e_{w_k}=\log P_{ik}=\log X_{ik} - \log X_i\)
Since <code class="language-plaintext highlighter-rouge">\log X_i</code> is independent of <code class="language-plaintext highlighter-rouge">k</code> and breaks the symmetry between <code class="language-plaintext highlighter-rouge">i,k</code>, we can add a bias term <code class="language-plaintext highlighter-rouge">b_{w_i}</code> to <code class="language-plaintext highlighter-rouge">e_{w_i}</code> to absorb <code class="language-plaintext highlighter-rouge">-\log X_i</code>, and add <code class="language-plaintext highlighter-rouge">b_{w_k}</code> to <code class="language-plaintext highlighter-rouge">e_{w_k}</code> to restore the symmetry. The cost function can then be defined simply as
\(J(\theta)=\frac{1}{2}\sum_{i,j}f(X_{ij})(e_{w_i}\cdot e_{w_j}+b_{w_i}+b_{w_j}'-\log X_{ij})^2\)
where <code class="language-plaintext highlighter-rouge">f(c)</code> is a weighting function that is non-decreasing and goes to zero as <code class="language-plaintext highlighter-rouge">c\to 0</code>. For example, with some adjustable <code class="language-plaintext highlighter-rouge">c_{\max}</code>
\(f(c)=\begin{cases}
\left(c/c_{\max}\right)^\alpha, &amp;\text{if $c&lt;c_{\max}$}\\
1, &amp;\text{otherwise}
\end{cases}\)
Given the symmetry of <code class="language-plaintext highlighter-rouge">e_{w_i},e_{w_j}</code>, the final word embedding is <code class="language-plaintext highlighter-rouge">\dfrac{e_{w_i}+e_{w_j}}{2}</code>.</p>

<h3 id="perplexity">Perplexity</h3>
<p><em>Perplexity</em> quantifies how surprised the model is when it sees new data. Suppose <code class="language-plaintext highlighter-rouge">s_1,\cdots,s_N</code> are new sentences for testing, each with <code class="language-plaintext highlighter-rouge">m_i</code> words, then the perplexity is defined as
\(PP=\prod_{i=1}^N\left(\frac{1}{p(s_i)}\right)^{1/m_i}=\prod_{i=1}^N\left(\prod_{k=1}^{m_i}\frac{1}{p(w_k|w_0w_1\cdots w_{k-1})}\right)^{1/m_i}\)</p>
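<p>The definition translates directly to code; the sentence probabilities below are made-up numbers for illustration:</p>

```python
def perplexity(sentence_probs, lengths):
    """Perplexity over test sentences: product over i of (1 / p(s_i))^(1 / m_i)."""
    pp = 1.0
    for p, m in zip(sentence_probs, lengths):
        pp *= (1.0 / p) ** (1.0 / m)
    return pp

# Two hypothetical sentences: p(s_1) = 0.25 with 2 words, p(s_2) = 0.0625 with 4 words.
pp = perplexity([0.25, 0.0625], [2, 4])
# → 2 * 2 = 4: the model is as "surprised" as a uniform choice among 4 words per step.
```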

<h2 id="machine-translation">Machine Translation</h2>
<p>A naive approach is to do a <em>greedy search</em>, meaning choosing the most likely next word each time, until the token <code class="language-plaintext highlighter-rouge">&lt;EOS&gt;</code> is selected. This doesn’t necessarily give the best outcome.</p>

<p><em>Beam search</em> keeps the top <code class="language-plaintext highlighter-rouge">N</code> most likely sequences at each step; <code class="language-plaintext highlighter-rouge">N</code> is known as the <em>beam width</em>. The process ends when it meets some stopping criterion, for example when the token <code class="language-plaintext highlighter-rouge">&lt;EOS&gt;</code> is selected.</p>

<p>Beam search tends to favor shorter sequences; to avoid this, <em>length normalization</em> uses the loss function
\(\frac{1}{T^\alpha}\sum_{t=1}^{T}\log p(w_t|w_0,w_1,\cdots,w_{t-1})\)
on the sentence <code class="language-plaintext highlighter-rouge">s=w_0w_1\cdots w_T</code>.</p>
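<p>A self-contained sketch of beam search with length normalization over a hypothetical toy language model (the <code class="language-plaintext highlighter-rouge">next_probs</code> function and its vocabulary are invented for illustration):</p>

```python
import math

def beam_search(next_probs, beam_width, max_len, alpha=0.7):
    """Beam search over a model next_probs(prefix) -> {word: prob};
    finished sequences are ranked by log-prob divided by T^alpha."""
    beams = [((), 0.0)]  # (sequence, total log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for w, p in next_probs(seq).items():
                candidates.append((seq + (w,), lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:beam_width]:  # keep only the top beam_width sequences
            (completed if seq[-1] == "<EOS>" else beams).append((seq, lp))
        if not beams:
            break
    # Length normalization: divide the log-probability by T^alpha.
    return max(completed + beams, key=lambda c: c[1] / len(c[0]) ** alpha)

def next_probs(prefix):  # hypothetical model that can emit at most two "a" tokens
    return {"a": 0.6, "<EOS>": 0.4} if len(prefix) < 2 else {"<EOS>": 1.0}

best, score = beam_search(next_probs, beam_width=2, max_len=5)
# → best is ("a", "a", "<EOS>"): normalization lets the longer sequence win.
```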

<p>When the predicted translation <code class="language-plaintext highlighter-rouge">\hat y</code> is bad, we can perform <em>error analysis</em> against a good translation <code class="language-plaintext highlighter-rouge">y^*</code></p>

<table>
  <thead>
    <tr>
      <th>Case</th>
      <th><code class="language-plaintext highlighter-rouge">p(y^*\|x)&gt;p(\hat y\|x)</code></th>
      <th><code class="language-plaintext highlighter-rouge">p(y^*\|x)\leq p(\hat y\|x)</code></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Root cause</td>
      <td>Beam search faulty</td>
      <td>RNN faulty</td>
    </tr>
    <tr>
      <td>Remedies</td>
      <td>Increase beam width</td>
      <td><ul><li>Try a different architecture</li><li>Regularize</li><li>Get more data</li></ul></td>
    </tr>
  </tbody>
</table>

<p>The <em>bilingual evaluation understudy (bleu)</em> score is a metric that quantifies how good a machine translation is by computing a similarity score
\(\text{bleu score}=(\text{brevity penalty})\times\exp\left(\frac{1}{n}\sum_{k=1}^n\log p_k\right)\)
based on <code class="language-plaintext highlighter-rouge">n</code>-gram precision
\(p_n=\dfrac{\text{number of matching $n$-grams}}{\text{total number of $n$-grams}}\)
and the brevity penalty is a factor that penalizes short translations; for example, it could be
\(\text{brevity penalty}=\begin{cases}
1,&amp;\operatorname{len}(\hat y)\geq \operatorname{len}(y^*)\\
e^{1-\operatorname{len}(y^*)/\operatorname{len}(\hat y)},&amp;\operatorname{len}(\hat y)&lt;\operatorname{len}(y^*)
\end{cases}\)</p>
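<p>A simplified single-reference BLEU sketch following the formulas above (real BLEU clips counts against multiple references and smooths zero matches; this sketch assumes at least one matching <code class="language-plaintext highlighter-rouge">n</code>-gram per order):</p>

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        matches = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        log_precisions.append(math.log(matches / sum(cand.values())))
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu(["the", "cat", "sat"], ["the", "cat", "sat"])
# → 1.0 for a perfect match
```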

<h2 id="seq2seq">Seq2Seq</h2>
<p><em>Sequence to sequence</em> is a popular model used in tasks like machine translation, video captioning, question answering, speech recognition, etc.</p>

<p>It employs an encoder-decoder architecture. Both the encoder and the decoder are LSTM/GRU models. The encoder reads the input sequence and summarizes the information into a <em>context vector</em>. The decoder is initialized with the final state of the encoder, i.e. the context vector of the encoder’s final cell is the input to the first cell of the decoder network.
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*3q4gK4QGQNEkUC3bH5vOqA.jpeg" alt="Encoder-Decoder Architecture" /></p>

<h2 id="transformer">Transformer</h2>

<h3 id="attention">Attention</h3>
<p><img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*EC04ZMiCnLBT3IG0tdU33g.jpeg" alt="QueryKeyValue" /></p>

<ol>
  <li><em>Query</em></li>
  <li><em>Key</em></li>
  <li><em>Value</em></li>
</ol>

<p>An <em>attention</em> function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.</p>
<ol>
  <li>
    <p><em>Scaled dot-product attention</em> computes attention score as
 \(\text{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
 Here queries and keys are of dimension <code class="language-plaintext highlighter-rouge">d_k</code> and values are of dimension <code class="language-plaintext highlighter-rouge">d_v</code>.</p>

    <p>Another most commonly used attention function is <em>additive</em>
 \(\text{Attention}(Q,K,V)=\operatorname{softmax}(V^T\tanh(W[Q,K]+b))\)
 Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.</p>
  </li>
  <li>
    <p><em>Multi-head attention</em>: instead of performing a single attention function with <code class="language-plaintext highlighter-rouge">d_{\text{model}}</code>-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values <code class="language-plaintext highlighter-rouge">h</code> times with different, learned linear projections to <code class="language-plaintext highlighter-rouge">d_k</code>, <code class="language-plaintext highlighter-rouge">d_k</code> and <code class="language-plaintext highlighter-rouge">d_v</code> dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding <code class="language-plaintext highlighter-rouge">d_v</code>-dimensional output values. These are concatenated and once again projected, resulting in the final values.</p>

    <p>Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
 \(\begin{align*}
 \text{MultiHead}(Q,K,V)&amp;=\operatorname{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O\\
 \text{head}_i&amp;=\text{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})
 \end{align*}\)
 Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.</p>
  </li>
</ol>
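<p>Scaled dot-product attention is a few lines of NumPy; the query/key/value shapes below are arbitrary illustrative choices:</p>

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one probability row per query
    return weights @ V, weights

rng = np.random.default_rng(5)
Q = rng.normal(size=(2, 4))  # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))  # 3 keys,    d_k = 4
V = rng.normal(size=(3, 6))  # 3 values,  d_v = 6
out, w = attention(Q, K, V)  # out: one d_v-dimensional vector per query
```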

<p><img src="https://machinelearningmastery.com/wp-content/uploads/2022/03/dotproduct_1.png" alt="AttentionMechanism" /></p>
<h3 id="positional-encoding">Positional encoding</h3>
<p>We want to inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add <em>positional encodings</em> to the input embeddings at the bottoms of the encoder and decoder stacks.
\(\begin{align*}
\text{PE}_{(\text{pos},2i)}&amp;=\sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)\\
\text{PE}_{(\text{pos},2i+1)}&amp;=\cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)
\end{align*}\)
The model easily learns to attend by relative positions, since for any fixed offset <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">\text{PE}_{\text{pos}+k}</code> can be represented as a linear function of <code class="language-plaintext highlighter-rouge">\text{PE}_{\text{pos}}</code>.</p>
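<p>The sinusoidal encoding above can be built in vectorized NumPy (this sketch assumes an even <code class="language-plaintext highlighter-rouge">d_{\text{model}}</code>):</p>

```python
import numpy as np

def positional_encoding(max_pos, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_pos)[:, None]            # (max_pos, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    PE = np.empty((max_pos, d_model))
    PE[:, 0::2] = np.sin(angles)                 # even dimensions
    PE[:, 1::2] = np.cos(angles)                 # odd dimensions
    return PE

PE = positional_encoding(50, 8)
# Row 0 (position 0) is [0, 1, 0, 1, ...] since sin(0) = 0 and cos(0) = 1.
```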

<p>The architecture of the transformer model looks as follows</p>

<p><img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="TransformerModelArchitecture" /></p>

<h2 id="bert">BERT</h2>
<p><em>Bidirectional encoder representations from transformers (BERT)</em> is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create models for a wide range of tasks.
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*cDlhkuE8b8IBadV9vONOmg.png" alt="BERTPreTraining&amp;FineTuning" /></p>

<h3 id="inputoutput-representation">Input/Output representation</h3>
<p>The input, a pair of sentences, is represented as one token sequence. The first token is always <code class="language-plaintext highlighter-rouge">[CLS]</code>, and the final hidden state vector <code class="language-plaintext highlighter-rouge">C</code> corresponding to this token is used as the aggregate sequence representation for classification tasks. We separate the pair with a <code class="language-plaintext highlighter-rouge">[SEP]</code> token. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*D0_sVWpmOSaGCvm6gk9aHA.jpeg" alt="BERTInputEmbedding" /></p>

<h3 id="pre-training">Pre-training</h3>
<ol>
  <li><em>Masked LM (MLM)</em> masks some percentage of the input tokens at random and then predicts those masked tokens. This is more powerful than shallowly concatenating a left-to-right and a right-to-left model.
 The masking percentage is usually taken to be 15%: with too little masking, training is too expensive; with too much masking, there is not enough context.</li>
  <li><em>Next sentence prediction (NSP)</em> trains a binary classifier to predict whether sentence <code class="language-plaintext highlighter-rouge">B</code> is the next sentence after sentence <code class="language-plaintext highlighter-rouge">A</code>. Only the aggregate representation <code class="language-plaintext highlighter-rouge">C</code> is used for this task.</li>
</ol>

<p><img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*TT1uyr3LF0HBW71dA5516g.png" alt="BERTVariousTasks" /></p>

<p><em>Domain adaptation</em> refers to the process of adapting a model trained on data from one domain (source domain) to perform well on data from a different domain (target domain).</p>

<h2 id="gan">GAN</h2>
<p><em>Generative Adversarial Networks (GAN)</em> consist of two neural networks, a <em>generator</em> <code class="language-plaintext highlighter-rouge">G</code> and a <em>discriminator</em> <code class="language-plaintext highlighter-rouge">D</code>, where <code class="language-plaintext highlighter-rouge">D(x)</code> represents the probability that <code class="language-plaintext highlighter-rouge">x</code> comes from the data rather than the generator. Let <code class="language-plaintext highlighter-rouge">z</code> be a random noise vector. We simultaneously train <code class="language-plaintext highlighter-rouge">D</code> to maximize the probability of assigning the correct label to both training samples and samples from <code class="language-plaintext highlighter-rouge">G</code>, and train <code class="language-plaintext highlighter-rouge">G</code> to minimize <code class="language-plaintext highlighter-rouge">\log(1-D(G(z)))</code>. In other words, <code class="language-plaintext highlighter-rouge">D</code> and <code class="language-plaintext highlighter-rouge">G</code> play the following minimax game with value function
\(\min_{G}\max_{D} V(D,G)=E_{x}[\log D(x)] + E_z[\log(1-D(G(z)))]\)</p>

<h2 id="vae">VAE</h2>
<p><em>Variational Autoencoders (VAE)</em></p>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Large Language Model and Generative AI A language model is model that estimating the probability p(s) of occurrence of a sentence s.]]></summary></entry><entry><title type="html">MATH868C 2020Fall Several Complex Variables</title><link href="https://lihaoranicefire.github.io/math868C/" rel="alternate" type="text/html" title="MATH868C 2020Fall Several Complex Variables" /><published>2020-09-01T00:00:00+00:00</published><updated>2020-09-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/math868C</id><content type="html" xml:base="https://lihaoranicefire.github.io/math868C/"><![CDATA[<p>Please see <a href="/files/math868C_2020Fall/math868C_2020Fall.pdf">PDF</a></p>

<iframe src="/files/math868C_2020Fall/math868C_2020Fall.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[Please see PDF]]></summary></entry><entry><title type="html">Some LaTeXified old paper</title><link href="https://lihaoranicefire.github.io/posts/2012/08/blog-post-1/" rel="alternate" type="text/html" title="Some LaTeXified old paper" /><published>2020-02-01T00:00:00+00:00</published><updated>2020-02-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/posts/2012/08/latexify-A_naive_guide_to_mixed_Hodge_theory</id><content type="html" xml:base="https://lihaoranicefire.github.io/posts/2012/08/blog-post-1/"><![CDATA[<p>A LaTeXified paper: <em>A naive guide to mixed Hodge theory</em> by Alan H. Durfee</p>

<p>please see <a href="/files/A_naive_guide_to_mixed_Hodge_theory.pdf">PDF</a></p>

<iframe src="/files/A_naive_guide_to_mixed_Hodge_theory.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="notes" /><summary type="html"><![CDATA[A LaTeXified paper: A naive guide to mixed Hodge theory by Alan H. Durfee]]></summary></entry><entry><title type="html">一元五次方程不可解之证明</title><link href="https://lihaoranicefire.github.io/quintic-equation-galois/" rel="alternate" type="text/html" title="一元五次方程不可解之证明" /><published>2020-01-01T00:00:00+00:00</published><updated>2020-01-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/quintic-equation-galois</id><content type="html" xml:base="https://lihaoranicefire.github.io/quintic-equation-galois/"><![CDATA[<p>An exposition of insolvability of the quintic equations, please see <a href="/files/quintic_equation_galois.pdf">PDF</a></p>

<iframe src="/files/quintic_equation_galois.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="exposition" /><summary type="html"><![CDATA[An exposition of insolvability of the quintic equations, please see PDF]]></summary></entry><entry><title type="html">Qualification Examination Solutions</title><link href="https://lihaoranicefire.github.io/umd-quals/" rel="alternate" type="text/html" title="Qualification Examination Solutions" /><published>2019-08-01T00:00:00+00:00</published><updated>2019-08-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/umd-quals</id><content type="html" xml:base="https://lihaoranicefire.github.io/umd-quals/"><![CDATA[<p>A partial solution set to UMD Math Qualification Examinations, please see <a href="/files/umd_quals.pdf">PDF</a></p>

<iframe src="/files/umd_quals.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="exercise" /><summary type="html"><![CDATA[A partial solution set to UMD Math Qualification Examinations, please see PDF]]></summary></entry><entry><title type="html">Undergraduate thesis</title><link href="https://lihaoranicefire.github.io/undergrad-thesis/" rel="alternate" type="text/html" title="Undergraduate thesis" /><published>2018-07-01T00:00:00+00:00</published><updated>2018-07-01T00:00:00+00:00</updated><id>https://lihaoranicefire.github.io/undergrad-thesis</id><content type="html" xml:base="https://lihaoranicefire.github.io/undergrad-thesis/"><![CDATA[<p>My undergraduate thesis, please see <a href="/files/Polynomials_Disguised_in_Different_Senarios.pdf">PDF</a></p>

<iframe src="/files/Polynomials_Disguised_in_Different_Senarios.pdf" width="1000" height="1000"></iframe>]]></content><author><name>Haoran Li</name><email>haoran.li2018@gmail.com</email></author><category term="exercise" /><summary type="html"><![CDATA[My undergraduate thesis, please see PDF]]></summary></entry></feed>