
A data science team at the market-making division of a global financial institution is working to improve the division's risk-adjusted returns on options trading using machine-learning methods. The team creates a reinforcement learning (RL) algorithm that lets the learning agent develop price setting strategies. The strategies are intended to maximize profit and minimize inventory risk by dynamically adjusting bid and ask prices in the market maker's electronic limit order book.
The RL algorithm defines the following market states and market actions:
The team considers using either the Monte Carlo (MC) approach, formulated as

Q_new(S, A) = Q_old(S, A) + α × [G − Q_old(S, A)],

where G is the total subsequent reward sampled over the run and α is the learning-rate parameter, or the temporal difference (TD) learning approach, to update estimates of the Q-values, Q(S, A), the value of taking an action A in a particular state S, which are used to decide on the optimal strategy.
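For intuition, a minimal Python sketch of the two update rules is shown below. The function names, the form of the TD target, and the choice of gamma = 1.0 (no discounting) are assumptions for illustration only, not part of the question.

```python
# Minimal sketch of the two Q-value update rules discussed above.
# Function names, the TD target form, and gamma = 1.0 are assumptions.

def mc_update(q_old: float, total_reward: float, alpha: float) -> float:
    """Monte Carlo: nudge Q(S, A) toward the sampled total subsequent reward G."""
    return q_old + alpha * (total_reward - q_old)

def td_update(q_old: float, reward: float, q_next: float, alpha: float,
              gamma: float = 1.0) -> float:
    """Temporal difference: nudge Q(S, A) toward the one-step target
    R + gamma * Q(next state)."""
    return q_old + alpha * (reward + gamma * q_next - q_old)
```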
The current Q-values for the 2-state, 2-action scenario are given in the table below:
|  | State 1 (S1) | State 2 (S2) |
|---|---|---|
| Action 1 (A1) | 0.7 | 0.8 |
| Action 2 (A2) | 0.5 | 0.9 |
In the next trial of the RL algorithm, the team notes that, with an α parameter of 0.05, the MC approach samples a run taking A2 in S2 and estimates the total subsequent reward to be 1.2. Furthermore, the next decision in the trial is made when the learning agent is in S1, with the transition from S2 to S1 generating a reward of 0.3. S1 is the state in which the TD approach is initialized.
In using the two approaches, which of the following provides the correct update of the Q-value?
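As an illustrative check only, the scenario's figures could be plugged into the assumed update rules sketched above. Valuing the next state S1 at the larger of its two current Q-values (a Q-learning-style target) and applying no discounting are assumptions, not statements of the question's intended answer.

```python
# Illustrative only: applying the assumed update rules to the stated figures.
alpha = 0.05          # learning-rate parameter given in the question
q_old_s2_a2 = 0.9     # current Q(S2, A2) from the table

# MC approach: sampled total subsequent reward G = 1.2.
mc_q = q_old_s2_a2 + alpha * (1.2 - q_old_s2_a2)

# TD approach: transition reward R = 0.3; the next state S1 is valued here as
# max(Q(S1, A1), Q(S1, A2)) = max(0.7, 0.5), with no discounting (assumptions).
td_q = q_old_s2_a2 + alpha * (0.3 + max(0.7, 0.5) - q_old_s2_a2)

print(f"MC update of Q(S2, A2): {mc_q:.3f}")
print(f"TD update of Q(S2, A2): {td_q:.3f}")
```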