January 26, 2023 - 21 min read
The following blog post is the DORA litepaper. For a more in-depth description of our distributed oracle protocol, download the full whitepaper here.
Simply speaking, an oracle is a mechanism that connects a blockchain to the world outside of that chain. A blockchain can be thought of as an immutable sequence of state transitions emerging as a result of an execution of a deterministic computer. By deterministic, we mean that every instruction of this computer produces the same output when it is given the same input. This deterministic computer is an abstraction. Under the hood, it is implemented by a network of many computers running a consensus protocol to determine the sequence of instructions and their outcomes.
Can we have an instruction getCurrentPrice("BTC")
in such a computer? NO. Why? Because the output of this instruction would be different depending upon when this instruction is executed and from where the prices are being fetched. This is a deal breaker, since it violates the fundamental property of our computer being deterministic.
So what can one do to address this? One of the ways to fix this issue, is to get the entire context for the instruction as an input. For example, let us have an instruction like getPrice("BTC","SomeExchange","11:59:59 UTC 7 Dec 2022")
. In this instruction we are explicitly specifying that we should fetch the price at which time and from where. However, we still have a few problems to solve.
We are assuming that irrespective of when we execute this instruction, we would get the same output. What if we try to execute this instruction ten years from now and "SomeExchange"
does not have this data available? What if, hypothetically, "SomeExchange"
had kept only five years of data due to their data retention policy at that time? What if some data-corruption had happened to this piece of data? Essentially, what we need is a reliable and immutable archive so that we get exactly the same output even if we execute this instruction ten years from now. Well, we already know one such mechanism. A blockchain itself! We can have an instruction putPrice(CommodityCode,source,time,value)
to store the value
of CommodityCode
obtained from the source
at time
in the blockchain itself, which can be safely retrieved later.
However, we still have a problem. How do we know that the value
is indeed from the source
obtained at time
? If the source
itself digitally signs and certifies that at time
, the price of CommodityCode
is value
, then we are done. This is indeed how some oracles work. Alas, not all sources of information digitally sign their data and put it in the blockchain. Moreover, just because a data source signs their data doesn’t mean that the signed data is actually accurate. In such a case, we need a set of entities that can put a stamp of authenticity and certify that value
is indeed the price of CommodityCode
at source
and at time
.
When there are multiple exchanges and sources of information, there is no single true value of a commodity. Commodities may get traded at different prices at different exchanges. Many blockchain applications and smart-contracts requires a single value of a commodity price. In such a case, we need to aggregate prices from multiple sources of information into a single representative price. So what is an oracle? In short, an oracle is a mechanism that takes information from multiple sources and puts certified authentic representative information into a blockchain in a deterministic fashion so that modern day blockchain applications and smart-contracts can use this information.
DOWNLOAD WHITEPAPER PDF | VIEW PRESENTATION DECK
When there are multiple sources of information that do not sign their data, we need a mechanism that aggregates all these information into a single representative value and put it in the blockchain while making this Byzantine fault tolerant, at least to some degree. A Byzantine fault refers to a fault with an entity where the entity is either non-responsive, or can deviate from an agreed upon protocol arbitrarily, possibly in a malicious fashion.
In the absence of any Byzantine faults, only a single node gathering the information from multiple sources is sufficient. However, in the presence of Byzantine faults, we would need multiple nodes. When all of the nodes can provide one value as an input and some of the nodes are Byzantine, the challenge for the remaining honest nodes to agree upon a single representative value is called the DORA (Distributed Oracle Agreement) problem. Formally speaking:
A DORA protocol among \(n\) nodes \(p_1,p_2,\dots,p_n\) with each node having inputs \(v_i\), for a given agreement distance \(\mathcal{D}\), guarantees that:
Note that, not only do we require that all the nodes converge to a single value, but it needs to be a representative value. The validity property mentioned above states that \(\mathcal{S}\) should be somewhere between the minimum and maximum values from honest nodes. However, this does not seem representative. How can any value between the minimum and maximum values from honest nodes be termed as a representative of values from all the honest nodes? That is a valid question. One may think that taking the average of all the values could be a good candidate to create a representative value. This line of thinking is completely valid, and yet, naive.
Not only do we have Byzantine nodes, we do not even know which ones are Byzantine and which ones are honest. During the computation of the average, even if a single value from a Byzantine node creeps in, the output can arbitrarily deviate from the average of all the honest values. The median offers a robust alternative as a representative value since it is a statistical aggregator which can tolerate data corruption up to 50%. Unfortunately, in the presence of up to \(\frac{1}{3}\) Byzantine nodes, the guarantee given in the validity property above is the best possible guarantee that anyone can provide. Please refer to the DORA whitepaper for a formal proof of this.
Supra’s Oracle service assumes that an SMR (State Machine Replication) a.k.a. blockchain service is available.
Informally speaking, SMR (State Machine Replication) offers an ordering service such that any message sent to this ordering service becomes part of a total order, and every node reading messages from this service observes the same set of messages in the exact same total order.
A tribe is a set of nodes. A tribe can continue to function safely and make progress on the services it offers as long as the number of nodes that turn Byzantine are less than \(\frac{1}{3}\) of all the nodes.
From a tribe, we uniformly randomly draw subsets of nodes. We call each such subset a clan. Each clan is drawn in a fashion such that the probability of having half or more Byzantine nodes in a clan is negligible. Our DORA protocol is designed such that under normal circumstances, we only require a clan to compute and agree on an \(S\)-value. This means that DORA can run with a smaller number of nodes and it can withstand a larger fraction of the nodes turning Byzantine.
An oracle service provider may need to provide a representative value for many different commodities, such as BTC/USD
, ETH/USD
etc. Since we only require a clan to run DORA protocol, we divide the responsibility of various commodities equally across different clans. For example, if we need to aggregate and provide \(S\)-values for 100 different commodities and if we had 4 clans, then each clan may be given a responsibility to compute \(S\)-values for 25 commodities.
Now, let’s take a look at how one round of the the DORA protocol works.
Every node of the clan is assigned a set of data sources to obtain the value of a commodity price. As you can see in the image above, every node obtains data from multiple data sources. Why is that? Well, a clan can only work correctly if less than \(\frac{1}{2}\) of the clan nodes are Byzantine. As an example, imagine a scenario where we have 51 nodes in our clan, out of which 25 of them already turned Byzantine. So far, so good. Now, imagine that we have a design where every node gets its data from only one data source.
Now what if the data source turns Byzantine? Even an honest node, if it receives data from a single Byzantine source, exhibits Byzantine behaviour. Garbage in, garbage out. So now, we have 26 out of 51 nodes exhibiting Byzantine behaviour. There is now no guarantee whether the clan can produce any value at all, and if it does, whether that value would be anywhere near an ideal representative price of a commodity. This renders the oracle effectively useless.
It is true that prices of assets on some exchanges deviate significantly due to technical glitches or other reasons. So it is wise not to rely only on a single source of information. In our design, if we want to withstand failure of \(f_{d}\) data sources, then we mandate that every node listens to at least \(2f_{d} + 1\) data sources to obtain the price of a commodity.
Also note that there are multiple nodes fetching data from every data source. Why is that? Well, imagine that we are now in a situation where the data source is honest but the node is Byzantine. The price provided by the honest data source may not be considered at all during the computation of the \(S\)-value. That is why we ensure that there are multiple paths by which this information can make its way in the computation. Now, we need to look at one more subtlety.
For example, if we have a total of 10 data sources, and every node is required to listen to at least 3 data sources, how can we be sure that prices from every data source make their way in the final computation? What if all the nodes are listening to the first 3 data sources and no one is listening to the remaining 7? To avoid this situation, the assignment of data sources to nodes is done via a VRF (Verifiable Random Function). This way we can ensure that all the data sources are feeding their data to roughly equal number of nodes in the clan.
Once a node has received prices from multiple data sources, it computes its median. This median value is guaranteed to fall between the minimum and maximum values from honest data sources. So, as long as a node receives a majority of the prices from honest data sources, it would compute a value that is bounded by honest values.
Once a node has computed a value, now we need to figure out a way for the entire clan to agree on one single value. Since different nodes are listening to different sets of data sources, the values computed by different nodes may be different. That is where an aggregator comes in. We designate a set of nodes from the tribe as aggregators.
Clan nodes digitally sign their values and send it to the aggregators. We actually only need one aggregator to combine all the values from the clan nodes into one single value. Since we do not know which nodes are Byzantine, we randomly select a family of aggregators from the tribe such that we are guaranteed with extremely high probability that there is at least one honest aggregator within the family.
So what does an aggregator do? It is supposed to combine values from all the nodes into one single representative value for the price of a given commodity. Note that almost half of the clan nodes could be Byzantine and they may never send their values to the aggregator. So we know that an honest aggregator can expect to receive at least \(f_{c} + 1\) values, where \(f_{c}\) is the maximum number of Byzantine nodes in a clan of size \(2f_{c} + 1\).
Let’s say it waits for the first \(f_{c} + 1\) values to arrive. Can it propose the arithmetic mean of this value as the \(S\)-value? The problem is that the aggregator can not be sure that all of these \(f_{c} + 1\) values are coming from honest nodes. A Byzantine node, wishing to manipulate the \(S\)-value, can send an arbitrarily large value. Since network communication delays are unpredictable, this value can arrive earlier, therefore, gets included in the set of \(f_{c} + 1\) values. The arithmetic mean can now become arbitrarily large due to the inclusion of this single dishonest value.
Does the median work instead? Apparently, No. In a clan of size \(2f_{c} + 1\), up to \(f_{c}\) nodes could turn Byzantine. Therefore, it is possible that if the aggregator waits for the first \(f_{c} + 1\) value to compute the \(S\)-value, then as many as \(\lceil \frac{f_{c} + 1}{2} \rceil\) dishonest values can make its way in this set of \(f_{c} + 1\) value.
Is all hope lost now? Fortunately, no; there is a silver lining here. Even in the worst case, when as many as \(f_{c}\) out of \(f_{c} + 1\) values could be Byzantine, we are guaranteed at least one honest value. The magic lies in leveraging this fact. The problem of computing the mean or the median out of the first set of \(f_{c} + 1\) values received, is that the dishonest values can be arbitrarily large or arbitrarily small as compared to honest values. We can solve this problem by mandating that even in the worst case of a single honest value, that value acts as an anchor to other values so that other values, even if dishonest, can not deviate arbitrarily from the honest value.
That is where the agreement distance \(\mathcal{D}\) comes in play.
Given an agreement distance \(\mathcal{D}\), we say that two values \(v_1\) and \(v_2\) agree with each other, if \(|v_1 – v_2| \leq \mathcal{D}\). That is, if two values differ at most by the agreement distance, then they are said to agree with each other.
A set of values \(CC\) is said to form a coherent cluster if \(\forall v_1, v_2 \in CC: |v_1 – v_2| \leq \mathcal{D}\). In other words, a coherent cluster is a set of values where all the values in that set agrees amongst themselves.
Armed with the agreement distance \(\mathcal{D}\), we can anchor all the \(f_{c} + 1\) values such that they are guaranteed not to be too far from one another. Instead of asking the aggregator to wait for the first set of \(f_{c} + 1\) values it receives, we mandate that aggregator waits until it sees a coherent cluster of values of size \(f_{c} + 1\).
This variation of DORA is called DORA-CC (Distributed Oracle Agreement with Coherent Cluster) is formally defined as follows:
A DORA-CC protocol among \(n\) nodes \(p_1,p_2,\dots,p_n\) with each node having inputs \(v_i\), for a given agreement distance \(\mathcal{D}\), guarantees that:
Wait a minute! What guarantee do we have that we would always be able to form a coherent cluster of size \(f_{c} + 1\)? Well, there is no guarantee. However, we observe that under normal circumstances, the prices of a given commodity from different data sources are very close. So, if we set our agreement distance \(\mathcal{D}\) large enough so that values from honest data sources agree within themselves under normal circumstances, a coherent cluster of \(f_{c} + 1\) values can be formed. As for abnormal circumstances, we will come back to that later.
Once a coherent cluster is formed, the remaining parts are easy. Since we know that any deviation is bounded by the agreement distance, it is safe to compute arithmetic mean of the cluster. In fact, the mean ensures that even if there is a single honest value within the cluster, it is still counted.
The aggregator sends the mean, along with the values that were signed by the nodes, to all the clan nodes for their validation and approval. It is important for the aggregator to send the cluster of values and the mean to all of the nodes in the clan and not just the \(f_{c} + 1\) nodes which contributed to the formation of the cluster. Otherwise, if there is a Byzantine node, which contributed to the cluster, it can withhold its vote, thus, stopping the protocol from making progress.
Every node can now validate that the values have been contributed by the nodes of the clan due to the digital signatures. They can also validate that the set of values indeed form a coherent cluster within the agreement distance, and that the mean sent by the aggregator is indeed the mean of the cluster. Since the clan has at least \(2f_{c} + 1\) nodes, we are ensured that we would receive at least \(f_{c} + 1\) votes of approval, thus allowing the aggregator to form a quorum certificate. Once the quorum certificate is formed, the mean is posted as the \(S\)-value for the given round of agreement on the SMR. All the honest nodes would see the \(S\)-value being posted for the round and conclude the round.
Can we go home now? Not yet. Remember that all of this works only if we have normal circumstances, where the values from most of the data sources are close to one another. When things are wild, it is possible that values from even honest nodes are not close enough to form a coherent cluster.
What do we do then? Well, when you think you can’t win a battle, you fall back.
So which nodes decided when to fall back? Well, all of them. All the clan nodes start their fallback timer as soon as they send their values to aggregators. When the situation is wild, it may be impossible for any aggregator to form a coherent cluster of size \(f_{c} + 1\).
Eventually, the fallback timer runs out. When a node’s timer runs out it sends a fallback vote to all the aggregators. Eventually, we would have an aggregator with at least \(f_{c} + 1\) fallback votes allowing it to form a quorum certificate for the fallback, and it would post it on SMR. Any node observing the fallback message for the round on the SMR switches to the fallback protocol.
We know that things are kind of flaky and that is why we have decided to fall back. So what do we do? We bring out the heavyweights! At this point, we employ the entire tribe to participate in computing the \(S\)-value.
The fallback protocol is very similar to the earlier protocol. The tribe nodes gather data from the data sources, compute the median from the set of prices they received, digitally sign and send it off to aggregators.
This is where things differ a bit. We know that we are already in a situation where we won’t be able to form a coherent cluster. So the aggregator now waits for the first \(2f_{t} + 1\) tribe nodes to send their values. Here, we assume the tribe size to be \(3f_{t} + 1\) with at most \(f_{t}\) nodes that may turn Byzantine. Out of \(2f_{t} + 1\) values, at least \(f_{t} + 1\) of these values must be from honest nodes. Therefore, we now use the median of these values, since we know that the median would be bound by honest values.
The remaining part follows the familiar trajectory. The aggregator proposes this median as the \(S\)-value for this round, along with the set of \(2f_{t} + 1\) digitally signed values it received. The tribe nodes validate and send their approval votes back to the aggregator, the aggregator forms a quorum certificate with \(2f_{t} + 1\) votes, and then posts it on the SMR.
Let us consider a model where we have a single aggregator. Consider a scenario where this aggregator happens to be Byzantine. This aggregator can not post an incorrect \(S\)-value due to the requirement that a majority of the clan nodes validate and vote for the \(S\)-value computed by the aggregator. However, the aggregator can cause liveness issues, either by not proposing an \(S\)-value, or by not submitting the agreed \(S\)-value to the SMR. The multi-aggregator model solves this issue by ensuring that there is at least one honest aggregator within the set of aggregators.
This design choice allows us to avoid any aggregator change protocol that may be needed for a single aggregator model, in case the aggregator turns out to be Byzantine.
But now, we have multiple aggregators that may post \(S\)-values on the SMR. Which one would be counted? It’s easy, the first one! What if the value is posted by a Byzantine aggregator? It doesn’t matter. Aggregators can not forge the signature of any of the honest nodes. So the \(S\)-value posted would still (i) either be at most agreement distance away from an honest value, or (ii) be between two honest values.
Note that a larger agreement distance may allow a larger deviation from an honest value. In practice, the agreement distance can be quite small. It can be kept less or equal to the deviation observed among honest data sources to ensure that we can form a coherent cluster under normal circumstances.
Did you notice that one round of agreement does not depend on any of the earlier rounds? We can leverage this fact to our advantage.
We can start a new round of DORA at a pre-determined tick. Time between two ticks can be set based on how frequently we wish to publish \(S\)-values, or how frequently we want to observe the market conditions, etc. In practice, it is possible to start a new round every few seconds, since a new round can be started without waiting for earlier rounds to finish.
Network delays, like weather, are unpredictable. Due to the asynchronous nature of communication within clan nodes, it is possible that the values from rounds are posted out of order. For example, the round 4 value may appear on the SMR before the round 3 value. Practically, round 4 would carry a fresh value and therefore information that comes out of round 3 after the round 4 value is made available may not serve any purpose. Based on this assumption, we can save on some unnecessary computation and communication.
If for some reason, older rounds are taking longer time to conclude, one can terminate those rounds if the \(S\)-value from a later round becomes available on the SMR. This way, the nodes do not have to waste their resources and bandwidth trying to conclude older rounds which may not offer any value.
Many stock exchanges as well as cryptocurrency exchanges employ mechanisms to freeze trades when prices fluctuate in a crazy manner.
One of the popular mechanisms is a circuit-breaker. The underlying function is very simple.
Let \(S_r\) denote the \(S\)-value of round \(r\). A circuit-breaker function \(\frac{|S_r – S_{r-1}|}{S_{r-1}} \geq thr\) triggers and breaks the circuit (or halts the trade) when \(S_r\) deviates from \(S_{r-1}\) by more than some percentage threshold defined by \(thr\).
One downside is that the circuit breaker function mentioned earlier, only looks at the previous \(S\)-value to determine if the deviation is kind of out of control.
Maybe we should look further back in the past to determine whether the current trends are crazier than in the past.
If we look at all the \(S\)-values in a given history window \(W_{h}\), we can come up with a generalized formula as mentioned in the image above. Just like \(thr\) is a parameter for circuit-breaker, \(C_{h}\) is a constant parameter that controls how much deviation is permissible.
A good thing about the history consistency check function mentioned above is that it incorporates the standard deviation of the history \(\sigma_{h}\). Imagine that for some reason, the \(S\)-value continues to climb very fast. While for initial rounds one may think that this is abnormal, if such a climb persists, \(\sigma_{h}\) automatically increases to classify this as normal. Similarly, when the \(S\)-value exhibits milder fluctuations, \(\sigma_{h}\) automatically narrows the interval. Such generic and flexible functions can be used across a wide variety of commodities without requiring fine-tuning of the parameters.
For example, popular cryptocurrencies tend to remain relatively stable, while not-so-popular cryptocurrencies tend to fluctuate wildly. That means that, in the case of using the circuit-breaker function mentioned above, the threshold parameter may need fine tuning for different currencies accordingly, otherwise, one may end up freezing the trade even though it may be a normal situation.
We do not have to restrict ourselves with just these two functions. Once the stream of \(S\)-values are available on the SMR, there are a wide variety of anomaly detection techniques one can employ such as using a chi-square test or other statistical tests to detect a statistical anomaly. If we are to use these functions during off-chain processing, we have plenty of choices for the detection of anomalies. However, if such a detection is going to happen on-chain via a smart contract, then we must consider the computation and storage cost required to perform any anomaly detection tests.
Many market commodities are correlated, and therefore any variations in one commodity may affect the price of correlated commodities. For example, let \(\mathcal{x}\) be BTC/USD
, \(\mathcal{y}\) be ETH/USD
and \(\mathcal{z}\) be BTC/ETH
. It is easy to see that ideally, we would have \(\mathcal{x} – \mathcal{y} = 0\).
Of course, in practice we would have \(\mathcal{x} – \mathcal{yz} \leq \mathcal{D}_{r}\) for some correlation distance \(\mathcal{D}_{r}\). Then, if we detect \(\mathcal{x} – \mathcal{yz}\) > \(\mathcal{D}_{r}\), we know that something is amiss. Such a multi-variate analysis is a great weapon to have in our arsenal for forensic analysis. If we assign all three commodities to different clans, an adversary has to be mighty enough to control three clans to evade any detection.
In future, our plan is to leverage such cross-correlations that exist among different commodities for a real-time anomaly detection.
If the configuration of the oracle network remains unchanged for a prolonged period of time, it may allow nodes to collude and hatch a plot to attack our oracle service.
However, our design makes this as difficult as reasonably possible. As a mitigation technique, at Supra, we propose various randomization techniques to keep the adversary in the dark.
We have already discussed that the assignment of data sources to clan nodes is done via a VRF. We also propose to randomize mapping of commodities to clans via a VRF. Every clan is given responsibility to track the prices of a set of commodities which are chosen uniformly randomly from the entire set of commodities.
At the end of some pre-determined period that we call a cycle, this mapping from commodities to clan is redrawn. This reduces the chance and the time-period available for an adversary to target a commodity of its choice for a rogue manipulation.
While the rotation changes the responsibility of a clan periodically, the nodes that make up the clan do not change. Allowing a clan configuration to remain static for a long period may be favourable to an adversary. Therefore, we propose to completely reorganize the entire tribe and redraw the clans from the tribe at the end of a pre-determined period, we call an epoch. Performing a shuffle on the nodes mitigates the abilities of an adversary to control a clan at its will.
At each shuffle, the clans are drawn randomly from the tribe using a uniform distribution. Since the tribe may have up to \(\frac{1}{3}\) Byzantine nodes, the probability of having at least one clan having \(\frac{1}{2}\) or more Byzantine nodes is non-zero. We need to fix the size of the tribe and the number of clans in the tribe in such a way that the probability of any clan having a Byzantine majority is negligible.
For example, if we have 625 nodes in the tribe, and 5 clans in the tribe, each having 125 nodes, the probability of any of the clans having a Byzantine majority is less than 35 in a million. If the duration of an epoch is 2 days, the failure frequency would be about one in 156 years.
As the size of the tribe increases, the failure probability decreases for a fixed number of clans.
At each epoch, the set of aggregators are also drawn uniformly randomly from the tribe. Having a sufficiently large set of aggregators ensures that there is a very high probability of having at least one honest aggregator.
For example, if we choose 12 aggregators from the tribe of 502 nodes, the probability of not having a single honest aggregator would be less than 1.5 in a million. Even if an epoch lasts for a day, the frequency of failure would be less than one in 1,800 years!
As you increase the number of aggregators, the failure probability drops exponentially.
Note that the sizes of the tribe, the clans, and the family of aggregators are parametric to our design. The actual numbers mentioned above are merely given for an illustrative purpose.
If you want to unravel the technical mysteries of our oracle design via a formal write-up, theorems and proofs, please check out the full DORA whitepaper.
RECENT POSTS
Suscríbete al boletín de Supra para recibir noticias, actualizaciones, análisis de la industria y más.
©2024 Supra | Entropy Foundation (Suiza: CHE.383.364.961). Todos los derechos reservados