
A data science team at the market-making division of a global financial institution is working to improve the division's risk-adjusted returns on options trading using machine-learning methods. The team creates a reinforcement learning (RL) algorithm that lets the learning agent develop price setting strategies. The strategies are intended to maximize profit and minimize inventory risk by dynamically adjusting bid and ask prices in the market maker's electronic limit order book.
The RL algorithm defines the following market states and market actions:
The team considers using either the Monte Carlo (MC) approach, formulated as

Q_new(S, A) = Q_old(S, A) + α × [G − Q_old(S, A)],

where G is the total subsequent reward sampled over the run and α is the learning-rate parameter, or the temporal difference (TD) learning approach, to update estimates of the Q-values, Q(S, A), the value of taking an action A in a particular state S, which are used to decide on the optimal strategy.
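For intuition, a minimal Python sketch of the two update rules is shown below. The function names, the form of the TD target, and the choice of gamma = 1.0 (no discounting) are assumptions for illustration only, not part of the question.

```python
# Minimal sketch of the two Q-value update rules discussed above.
# Function names, the TD target form, and gamma = 1.0 are assumptions.

def mc_update(q_old: float, total_reward: float, alpha: float) -> float:
    """Monte Carlo: nudge Q(S, A) toward the sampled total subsequent reward G."""
    return q_old + alpha * (total_reward - q_old)

def td_update(q_old: float, reward: float, q_next: float, alpha: float,
              gamma: float = 1.0) -> float:
    """Temporal difference: nudge Q(S, A) toward the one-step target
    R + gamma * Q(next state)."""
    return q_old + alpha * (reward + gamma * q_next - q_old)
```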
The current Q-values for the 2-state, 2-action scenario are given in the table below:
|  | State 1 (S1) | State 2 (S2) |
|---|---|---|
| Action 1 (A1) | 0.7 | 0.8 |
| Action 2 (A2) | 0.5 | 0.9 |
In the next trial of the RL algorithm, the team notes that, with an α parameter of 0.05, the MC approach samples a run taking A2 in S2 and estimates the total subsequent reward to be 1.2. Furthermore, the next decision in the trial is made when the learning agent is in S1, with the transition from S2 to S1 generating a reward of 0.3. S1 is the state in which the TD approach is initialized.
In using the two approaches, which of the following provides the correct update of the Q-value?
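As an illustrative check only, the scenario's figures could be plugged into the assumed update rules sketched above. Valuing the next state S1 at the larger of its two current Q-values (a Q-learning-style target) and applying no discounting are assumptions, not statements of the question's intended answer.

```python
# Illustrative only: applying the assumed update rules to the stated figures.
alpha = 0.05          # learning-rate parameter given in the question
q_old_s2_a2 = 0.9     # current Q(S2, A2) from the table

# MC approach: sampled total subsequent reward G = 1.2.
mc_q = q_old_s2_a2 + alpha * (1.2 - q_old_s2_a2)

# TD approach: transition reward R = 0.3; the next state S1 is valued here as
# max(Q(S1, A1), Q(S1, A2)) = max(0.7, 0.5), with no discounting (assumptions).
td_q = q_old_s2_a2 + alpha * (0.3 + max(0.7, 0.5) - q_old_s2_a2)

print(f"MC update of Q(S2, A2): {mc_q:.3f}")
print(f"TD update of Q(S2, A2): {td_q:.3f}")
```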