By Nathan Brannen on 29 August 2022
It seems like every time I speak to an AVM provider, they claim their AVM is the best in the country.
How can that be?
What does it even mean to be the best?
It turns out that there are a variety of ways to measure an AVM’s performance, and certain metrics may be more or less relevant depending on what’s important to you.
If you’re a marketer, you may prefer a model that slightly overestimates property values as a way to generate more leads. If you’re a lender, you may be most concerned about extreme overvaluations in the tail of the error distribution, since you want to minimize your downside exposure.
In this article, we’ll cover the main ways AVM performance is measured and touch on the considerations to keep in mind when evaluating any AVM metric.
What are the relevant metrics for evaluating AVMs?
There are a variety of terms and statistics you’ll see when evaluating AVMs. Let’s walk through how things have evolved over the years.
Twenty years ago, Hit rate (HR), or the percentage of properties an AVM has a value for in a given market, was the dominant metric. Users simply wanted the greatest coverage. While this seems to make sense, there is one big problem - HR doesn’t tell you anything about the accuracy of the values!
This led AVMs to measure their average signed error, aka Mean Error. A positive Mean Error indicates an AVM is more likely to overestimate a property’s value, while a negative Mean Error indicates that properties tend to be underestimated. Ideally, AVMs want their Mean Error to be as close to zero as possible, but you have to be careful with Mean Error values. For example, if an AVM estimates one property at 50% over its sale price and another property at 50% under its sale price, then the errors would cancel each other out and the Mean Error would be zero!
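The cancellation problem is easy to see with a small sketch. The function name and the two sale prices below are made up for illustration, and errors are expressed as a fraction of the sale price, which is an assumption about how the percentages are computed.

```python
def mean_error(estimates, sale_prices):
    """Average signed error as a fraction of sale price.

    Positive means the model overestimates on average.
    """
    errors = [(est - sale) / sale for est, sale in zip(estimates, sale_prices)]
    return sum(errors) / len(errors)


# Two hypothetical homes that each sold for $400,000:
# one estimated 50% too high, the other 50% too low.
sale_prices = [400_000, 400_000]
estimates = [600_000, 200_000]

print(mean_error(estimates, sale_prices))  # 0.0, even though both estimates are far off
```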
For this reason, Mean Absolute Error (MAE) is now commonly used. MAE measures the average magnitude or size of the error between an AVM’s prediction and a property’s actual sale price, regardless of whether it is over or underestimated.
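As a rough sketch, MAE can be computed by taking the absolute value of each error before averaging. The function name and sample numbers are illustrative, and errors are again assumed to be fractions of the sale price.

```python
def mean_absolute_error(estimates, sale_prices):
    """Average size of the error as a fraction of sale price, ignoring direction."""
    errors = [abs(est - sale) / sale for est, sale in zip(estimates, sale_prices)]
    return sum(errors) / len(errors)


# One hypothetical estimate 50% too high and one 50% too low no longer cancel.
print(mean_absolute_error([600_000, 200_000], [400_000, 400_000]))  # 0.5
```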
So when evaluating an AVM, make sure you consider a model’s overall MAE as well as its MAE in the market you’re most interested in. In newer markets with many recently built homes, like Phoenix, AZ, AVMs often have MAEs that are significantly lower than those in older markets, like many areas in the Northeast US, where homes vary widely in condition and style.
Are all AVM errors created equal?
Beyond looking at the raw error rates of models, many clients find it helpful to understand the percentage of property valuations that are “usable.” For example, if an estimated value needs to be “reasonably close” to be useful, then you might prioritize a model with the highest percentage of valuation estimates scored within +/- 10% of their actual value. The percent threshold you’re willing to tolerate is known as an “Error Bucket” and the percentage of properties that fall within that range is referred to as the “Percent Predicted Error,” or PPE. A common measure reported by AVMs is the percentage of homes that fall within +/- 10%, also known as PPE10.
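A minimal sketch of the PPE calculation follows. The function name, the `threshold` parameter, and the four sample homes are hypothetical; the threshold is the Error Bucket, expressed as a fraction of the sale price.

```python
def ppe(estimates, sale_prices, threshold=0.10):
    """Share of estimates that land within +/- threshold of the sale price."""
    within = sum(
        1
        for est, sale in zip(estimates, sale_prices)
        if abs(est - sale) / sale <= threshold
    )
    return within / len(sale_prices)


sale_prices = [300_000, 450_000, 500_000, 250_000]
estimates = [315_000, 400_000, 540_000, 330_000]  # errors: +5%, -11.1%, +8%, +32%

print(ppe(estimates, sale_prices))                  # 0.5  (PPE10)
print(ppe(estimates, sale_prices, threshold=0.20))  # 0.75 (PPE20)
```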
In certain situations, extreme overvaluations are particularly undesirable. In these cases, you would want to focus on a concept called “Right Tail Errors >20%,” or RT20: the share of estimates that overvalue a property by more than 20%. For example, if you’re a risk-sensitive lender who wants to leverage an AVM to evaluate HELOC loans, the difference between thinking you’re making an 80% CLTV (Combined Loan-to-Value) loan and actually making a 105% CLTV loan is a big deal. So you would want to prioritize models with a low RT20 score, allowing you to minimize your risk.
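To make the lender example concrete, here is a small sketch of both calculations. It assumes RT20 is the share of estimates that exceed the sale price by more than 20%, and that CLTV is the total loan balance divided by the property value; the function names and dollar amounts are invented for illustration.

```python
def rt20(estimates, sale_prices, threshold=0.20):
    """Share of estimates that overvalue the property by more than threshold."""
    over = sum(
        1
        for est, sale in zip(estimates, sale_prices)
        if (est - sale) / sale > threshold
    )
    return over / len(sale_prices)


def cltv(total_loan_balance, property_value):
    """Combined loan-to-value: all loans on the property divided by its value."""
    return total_loan_balance / property_value


# Four hypothetical homes that each sold for $400,000.
# Only the first is overvalued by more than 20%; the 25% undervaluation does not count.
estimates = [525_000, 410_000, 300_000, 390_000]
sale_prices = [400_000, 400_000, 400_000, 400_000]
print(rt20(estimates, sale_prices))  # 0.25

# $420,000 of total debt looks safe against the $525,000 estimate...
print(cltv(420_000, 525_000))  # 0.8
# ...but is underwater against the $400,000 the home is actually worth.
print(cltv(420_000, 400_000))  # 1.05
```

In this example, a 31.25% overvaluation is what turns an apparent 80% CLTV loan into a 105% one.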
Lastly, the idea that models may have biases baked into their algorithms is a growing concern. While the Mean Error can help indicate whether a model tends to over or undervalue a property, it is also important to understand whether that bias, or tendency to over/undervalue a property, is consistent across all properties. For example, the accuracy of many AVMs differs between entry-level homes and luxury properties.
Price-Related Bias (PRB) is a metric designed to quantify this bias. An upward-sloping PRB indicates that an AVM undervalues low-priced homes and overvalues high-priced homes. While an AVM’s PRB isn’t strictly a test for racial bias, it recently has been used to explore whether AVMs are having a disparate impact on protected minorities.
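The article does not spell out how PRB is calculated. One formulation used in property assessment ratio studies regresses each home’s relative estimate-to-price ratio on the log of a value proxy, and reports the slope. The sketch below follows that formulation as an assumption, not as any particular AVM provider’s method; the function name and sample data are hypothetical.

```python
import math
from statistics import mean, median


def price_related_bias(estimates, sale_prices):
    """Slope of relative ratio vs. log2(value proxy); positive = upward sloping."""
    ratios = [est / sale for est, sale in zip(estimates, sale_prices)]
    med = median(ratios)
    # Value proxy: average of the sale price and the ratio-adjusted estimate.
    values = [0.5 * sale + 0.5 * est / med for est, sale in zip(estimates, sale_prices)]
    x = [math.log(v) / math.log(2) for v in values]  # +1 in x means value doubles
    y = [(r - med) / med for r in ratios]            # relative deviation from median ratio
    x_bar, y_bar = mean(x), mean(y)
    # Ordinary least squares slope of y on x.
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator


# Hypothetical AVM that undervalues cheaper homes and overvalues expensive ones.
sale_prices = [100_000, 200_000, 400_000, 800_000]
estimates = [90_000, 195_000, 410_000, 880_000]
print(price_related_bias(estimates, sale_prices) > 0)  # True: upward sloping
```

Under this formulation, the slope approximates how much the estimate-to-price ratio changes each time the property value doubles.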
Gaming the Statistics
Now that you know the relevant metrics to evaluate an AVM, you may be thinking that evaluating AVMs against each other is straightforward. You simply take price predictions from an AVM and compare them to the actual sales prices. However, there are many challenges to keep in mind.
- Many AVMs use the list price and sale price of properties as inputs. Once a listing is posted or a sale closes, many models quickly factor that new information into their price predictions. As a result, to conduct a valid test, it is necessary to get the AVM’s value estimate before the property is listed or sold.
- Some AVMs are designed to avoid providing estimates for homes where the model has low confidence in its price prediction. Companies may choose to improve their model’s accuracy metrics at the expense of having a lower hit rate or vice versa.
- The housing market is constantly changing and AVMs are always trying to adapt. The best AVM in 2021 may not be the most accurate in 2022, particularly as AVM providers adopt newer technologies like AI and computer vision, and implement new datasets to improve their models. If you look at the leading AVMs by market over the past 2 years, you’ll see that the market is incredibly competitive and it is difficult for an AVM to consistently outperform its peers.
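The trade-off between hit rate and accuracy described in the second point can be sketched as follows. The record layout (estimate, sale price, confidence score) and the `min_confidence` cutoff are hypothetical; real AVMs expose confidence in different ways.

```python
def hit_rate_and_mae(records, min_confidence=0.0):
    """Return (hit rate, MAE) when estimates below min_confidence are withheld.

    records: list of (estimate, sale_price, confidence) tuples (hypothetical schema).
    """
    returned = [(est, sale) for est, sale, score in records if score >= min_confidence]
    hit_rate = len(returned) / len(records)
    if not returned:
        return hit_rate, None  # no estimates returned, so MAE is undefined
    mae = sum(abs(est - sale) / sale for est, sale in returned) / len(returned)
    return hit_rate, mae


records = [
    (410_000, 400_000, 0.9),  # small error, high confidence
    (290_000, 300_000, 0.8),  # small error, high confidence
    (650_000, 500_000, 0.3),  # large error, low confidence
    (150_000, 250_000, 0.2),  # large error, low confidence
]

hit_all, mae_all = hit_rate_and_mae(records)
hit_filtered, mae_filtered = hit_rate_and_mae(records, min_confidence=0.5)

print(hit_all, hit_filtered)   # 1.0 0.5
print(mae_filtered < mae_all)  # True: accuracy looks better, coverage is worse
```

This is why an MAE figure is hard to interpret without the hit rate that goes with it.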
Given these challenges, how do you know how much to trust an AVM’s valuation?
About 20 years ago, a researcher for Freddie Mac named Douglas Gordon described a method to address model uncertainty. Forecast Standard Deviation (FSD) is an AVM’s own estimate of its accuracy for a particular property. It’s a statistical measure that scores the likelihood that a particular AVM value estimate is accurate. For example, if an AVM returns an FSD score of 10% for a given home, (without getting into a lot of math) it basically means that if you took 100 estimates made with the same confidence, ~68% would be within +/- 10% of the correct value.
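The ~68% figure comes from treating the FSD as one standard deviation of a bell curve. As a sketch, assuming errors are normally distributed and centered on zero (an assumption that real AVM errors may not satisfy), Python’s standard library can reproduce the number. The helper name `share_within` is made up for this example.

```python
from statistics import NormalDist


def share_within(tolerance, fsd):
    """Expected share of estimates within +/- tolerance of the true value.

    Assumes errors are normal with mean zero and standard deviation fsd.
    """
    dist = NormalDist(mu=0.0, sigma=fsd)
    return dist.cdf(tolerance) - dist.cdf(-tolerance)


print(round(share_within(0.10, fsd=0.10), 3))  # 0.683: within one FSD
print(round(share_within(0.20, fsd=0.10), 3))  # 0.954: within two FSDs
print(round(share_within(0.10, fsd=0.05), 3))  # 0.954: a lower FSD means a tighter range
```

Under the same assumption, a $400,000 estimate with a 10% FSD would land between $360,000 and $440,000 of the true value about 68% of the time.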
By leveraging the FSD, you can compare and evaluate different AVM value estimates that have the same FSD score. This approach helps normalize the way various metrics may have been gamed, or manipulated, and provides a better framework to determine which model is truly more accurate.
For those that have made it this far, hopefully, you’ve learned a lot about the varied approaches to evaluate AVMs. It may not always be simple, but armed with your newfound knowledge you can better understand how to compare different models for your particular use case. The trick is knowing what each of these metrics tells you and how to read between the lines.
Have any additional thoughts on AVMs you’d like to share? Please reach out and let us know what you think!