A couple of weeks back I wrote about an idea I had:
I’d like to train a model using Decision Tree Regressor to predict Walkability in New York City, with results that closely match the BEH Neighborhood Walkability Index using OpenStreetMap derived variables, including POI feature tags, street form and density. ~ Part 1: Training a Walkability Index Machine Learning Model from OpenStreetMap derived data.
I started to execute on it, knowing that the most difficult part would likely be the feature engineering.
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. ~ Wikipedia - Feature engineering
As with any project, the data work (management, munging, carpentry, processing) is the most time-consuming and painstaking step. As Andrew Ng says:
Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering. — Andrew Ng, “Machine Learning and AI via Brain simulations” (PDF). Stanford University. Retrieved 2017-08-03.
Quick OSM point sample
Let’s do a quick sample of new-york_new-york_points.csv, which I quickly converted in QGIS from the Mapzen Metro Extract:
import pandas as pd

inCSV = 'data/input/new-york_new-york_points.csv'
df = pd.read_csv(inCSV)
len(df.index)
# Out: 234942

df = df.sample(frac=0.1)
len(df.index)
# Out: 23494

df.to_csv('data/input/new-york_new-york_points_samp_1000.csv', index=False)
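One caveat with the sampling step: without a seed, df.sample draws a different random subset on every run, so the sampled CSV changes each time the notebook is re-executed. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable here. The helper name sample_points, the seed value and the toy columns are made up for illustration and are not part of the original workflow.

```python
import pandas as pd


def sample_points(df: pd.DataFrame, frac: float = 0.1, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample of rows (hypothetical helper)."""
    # random_state pins the random draw, so repeated calls pick the same rows.
    return df.sample(frac=frac, random_state=seed)


# Tiny stand-in for the OSM points table; the column names are assumptions.
points = pd.DataFrame({
    "osm_id": range(1000),
    "amenity": ["cafe", "bench", "school", "bank"] * 250,
})

sample_a = sample_points(points)
sample_b = sample_points(points)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # whether both draws picked the same rows
```

Passing the same random_state to the df.sample call on the real points file should likewise give the same 10% of points on every run.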