GitHub - rwmyers46/Rural-Land-Valuation: Multivariate Regression Model for Farm & Ranch Land Pricing

Rural Land Valuation:

Rural land is a notoriously difficult asset to price. Low transaction volume, coupled with nebulous land features, creates a market with few comparable properties. This project focused on quantifying land features in order to improve the accuracy of investment decisions.

Photo by Christine Mendoza on Unsplash

Data Collection:

To control for geographic and legislative variation, such as topography, climate, and taxes, the sample set was chosen from a region of contiguous counties comprising the Blacklands North Texas Region. Data was sourced from landsoftexas.com property listings and processed using Beautiful Soup and Selenium. To ensure that the data accurately represented market value, the dataset included only transacted properties.

Blacklands North Texas Region

Text Processing:

Text processing was a bifurcated flow, with branches for structured and unstructured data. Structured property listing features with associated HTML tags, such as size or price, were stripped of their tags and saved to NumPy arrays. Unstructured text from a listing's Property Description section was processed with featureCounts, a custom Natural Language Processing function that creates booleanized features from the presence of an approximate word. The data were then recombined into a dataframe with the ArrayMaker function.
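The idea of booleanized features from approximate word matches can be sketched as follows. This is a minimal illustration and not the project's featureCounts function: the name keyword_flags, the keyword list, and the 0.8 similarity threshold are all assumptions, and the standard library's difflib stands in for whatever matching the project used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword list; the project's actual vocabulary is not given here.
KEYWORDS = ["pond", "creek", "barn", "cattle", "minerals"]


def keyword_flags(description, keywords=KEYWORDS, threshold=0.8):
    """Return {keyword: bool}, True when any word approximately matches the keyword."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", description.lower())
    flags = {}
    for keyword in keywords:
        # A similarity ratio at or above the threshold tolerates plurals
        # and small typos (threshold value is an assumption).
        flags[keyword] = any(
            SequenceMatcher(None, keyword, token).ratio() >= threshold
            for token in tokens
        )
    return flags


listing = "Rolling pasture with a stocked pnd, hay barns, and a seasonal creek."
print(keyword_flags(listing))
```

Each listing yields one row of boolean columns, which can then be joined to the structured features. A lower threshold catches more misspellings at the cost of more false positives.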

Feature Engineering:

Next to water features, elevation is perhaps the region's most sought-after land feature. Properties with higher elevations are more likely to provide vistas for home sites, varied topography, and better drainage for agriculture. The Google Cloud Platform has several useful APIs for basic Geographic Information System (GIS) processing.

To find property elevation, the Google Maps API was employed. The GCP_Features function from the Add_GIS_Features utility file takes a dataframe argument and returns a dataframe with elevation and driving time from the nearest major metropolis, Dallas. For each address, GCP_Features calls the get_GIS function, which first uses the Google Maps Geocoding API to get the latitude & longitude for a property address, and then sends these values to the Google Maps Elevation API to fetch the elevation.
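The geocode-then-elevation sequence described above can be sketched with the standard library alone. This is not the project's get_GIS function: lookup_elevation and fetch_json are hypothetical names, the error handling is an assumption, and a valid API key is required for real calls. The fetch parameter exists only so the function can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
ELEVATION_URL = "https://maps.googleapis.com/maps/api/elevation/json"


def fetch_json(url, params):
    """GET a URL with query parameters and decode the JSON body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=10) as response:
        return json.load(response)


def lookup_elevation(address, api_key, fetch=fetch_json):
    """Return (lat, lng, elevation_in_meters), or None if either lookup fails."""
    # Step 1: address -> latitude and longitude.
    geocode = fetch(GEOCODE_URL, {"address": address, "key": api_key})
    if geocode.get("status") != "OK" or not geocode.get("results"):
        return None
    location = geocode["results"][0]["geometry"]["location"]
    lat, lng = location["lat"], location["lng"]

    # Step 2: latitude and longitude -> elevation.
    elevation = fetch(ELEVATION_URL, {"locations": f"{lat},{lng}", "key": api_key})
    if elevation.get("status") != "OK" or not elevation.get("results"):
        return None
    return lat, lng, elevation["results"][0]["elevation"]
```

Applied row by row to a dataframe's address column, the returned elevation becomes a numeric feature. Listings that fail to geocode return None and would need to be dropped or imputed.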

Model:

The following regression model classes were evaluated with cross validation:

Linear Regression
Ridge
Lasso
Random Forest
XGBoost
K-Nearest Neighbors
Multilayer Perceptron
Polynomial
Elastic Net
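To make the evaluation procedure concrete, here is a minimal sketch of k-fold cross validation scored by R^2. It is an illustration under assumptions, not the project's code: a one-feature least squares line stands in for the model classes listed above, the data are synthetic, and the function names are hypothetical. In practice a library routine such as scikit-learn's cross_val_score would normally replace the hand-written loop.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def cross_val_r2(xs, ys, k=5, seed=0):
    """Mean held-out R^2 over k shuffled folds."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint index sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in held_out]
        scores.append(r_squared([ys[i] for i in held_out], preds))
    return sum(scores) / k


if __name__ == "__main__":
    # Synthetic data for illustration only: price per acre vs. parcel size.
    rng = random.Random(42)
    acres = [rng.uniform(10, 500) for _ in range(100)]
    price_per_acre = [4000 - 2.0 * a + rng.gauss(0, 800) for a in acres]
    print(f"mean held-out R^2: {cross_val_r2(acres, price_per_acre):.3f}")
```

Because every score comes from rows the model never saw during fitting, the mean is a fairer basis for comparing model classes than an in-sample R^2.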

Lasso, Ridge, and Simple Linear Regression demonstrated the best results in initial testing. These classes were optimized with RidgeCV & LassoCV across a range of Alpha values and visualized with Yellowbrick.
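For reference, the role of Alpha can be written out, assuming the scikit-learn formulations that RidgeCV and LassoCV optimize. The symbols are notation introduced here, not taken from the project: X is the feature matrix, y the sale prices, w the coefficient vector, and n the number of listings; the intercept is omitted for brevity.

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

At Alpha = 0 both objectives reduce to ordinary least squares, so larger Alpha values trade fit for smaller coefficients, and Lasso can drive some coefficients exactly to zero.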

To determine whether a higher-order polynomial would yield better accuracy, a learning curve was plotted for degrees 0 to 5. In this test, the best R^2 occurred at degree n = 1, that is, a linear fit.

Simple Linear Regression produced the best results with an R^2 of 25.34%. The low coefficient of determination most likely resulted from inaccuracy and variance in the property descriptions. Although most of the variance is unexplained, the feature impact on valuation is consistent with domain knowledge:

Water features uniformly showed the greatest premium, $100 to $250 per acre
Bosque & McLennan counties are considered the region's two most desirable counties
Agricultural features such as barn & cattle, which suggest flatter, more open land, were discounted accordingly

Assumptions & Error Sources:

Most of a property listing's data was in the Description section. As a result, the data's accuracy depends wholly on the thoroughness and veracity of the listing broker. Since landsoftexas.com owns the category for land sales in the United States, the associated error was unavoidable without employing advanced GIS and satellite image processing techniques.

Sale price seasonality
Topography feature mentions are proportional to reality
Mineral rights ownership claims are accurate

Results:

Ultimately, it proved difficult to beat Simple Linear Regression. Even the best models were unable to exceed a 30% coefficient of determination, largely because of inaccuracy and variance in the property descriptions. Although the models leave roughly 70% of the variance unexplained, the feature impact on valuation is consistent with domain knowledge:

Water features uniformly showed the greatest premium, $100 to $250 per acre
Bosque & McLennan counties are considered the region's best counties
Agricultural features such as Barn & Cattle, which suggest land that is flat and open, were discounted accordingly

Surprises:

Oil & gas rights ownership ("minerals") negatively impacted valuations. Since mineral rights are an asset sellers prefer to retain, this finding would only make sense if it could be demonstrated that brokers advertise minerals to increase the appeal of less desirable land. Note: oil has not been discovered in the region.
Properties with a greater travel time from the Dallas CBD were generally priced higher. This relationship may be due to the desire for privacy and quiet. One tradeoff for a more remote property, free of highway noise, is a less navigable ingress. In the country, this usually entails a single-lane, unpaved road safe only at lower speeds.

Future Work:

Data - Collecting more data would have been the simplest way to improve model accuracy, but the source site changed design mid-project, making this untenable within the given timeline.
GIS - Results could also be improved using more advanced GIS processing - not only to extract property features consistently across listings, but also to collect numerical measurements of said features: water feature size, wooded to open land ratio, and presence of high-voltage transmission lines.
Curb Appeal - A lot can be gleaned from a property's front gate. Features accessible from Google's Street View API, such as fence condition, grandness of an entrance, and visible trash, all impact valuation, and this imagery can be used to train a neural network.
