A couple of weeks back I wrote about an idea I had:

I’d like to train a model using Decision Tree Regressor to predict Walkability in New York City, with results that closely match the BEH Neighborhood Walkability Index using OpenStreetMap derived variables, including POI feature tags, street form and density. ~ Part 1: Training a Walkability Index Machine Learning Model from OpenStreetMap derived data.

I started to execute on it, knowing that the most difficult part would likely be the feature engineering.

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. ~ Wikipedia - Feature engineering

As with any project, the data work (data management, munging, carpentry, processing) is the most time-consuming and painstaking step. As Andrew Ng says:

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering. — Andrew Ng, “Machine Learning and AI via Brain simulations” (PDF). Stanford University. Retrieved 2017-08-03.
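To make "creating features from domain knowledge" concrete, here is a minimal sketch of one kind of feature this project could use: counts of points of interest per area, by amenity type. The toy data and the column names `area_id` and `amenity` are assumptions for illustration only, not the schema of the real extract.

```python
import pandas as pd

# Hypothetical toy data: each row is an OSM point with an amenity tag,
# already assigned to an area. Column names are illustrative assumptions.
points = pd.DataFrame({
    "area_id": ["A", "A", "A", "B", "B", "C"],
    "amenity": ["cafe", "cafe", "school", "cafe", "bank", None],
})

# One row per area, one column per amenity type, values are POI counts.
features = (
    points.dropna(subset=["amenity"])
    .groupby(["area_id", "amenity"])
    .size()
    .unstack(fill_value=0)
)

# Areas with no tagged points drop out above; reindex keeps them as all-zero rows.
features = features.reindex(points["area_id"].unique(), fill_value=0)
print(features)
```

A table like this, with one row per area, is the shape a regressor expects as input. Which tags are worth counting is where the domain knowledge comes in.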

Quick OSM point sample

Let’s take a quick sample of new-york_new-york_points.csv (which I converted in QGIS from the Mapzen Metro Extract new-york_new-york.osm.pbf).

import pandas as pd

# Points layer exported from QGIS as CSV
inCSV = 'data/input/new-york_new-york_points.csv'
df = pd.read_csv(inCSV)

len(df)

Out[3]: 234942

# Keep a random 10% of the rows
df = df.sample(frac=0.1)

len(df)

Out[5]: 23494

df.to_csv('data/input/new-york_new-york_points_samp_1000.csv', index=False)
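One caveat with the sampling step: `df.sample(frac=0.1)` draws different rows on every run. If the sample needs to be reproducible, passing `random_state` fixes the draw. A minimal sketch on a stand-in DataFrame (the `osm_id` column and the seed value 42 are assumptions for illustration):

```python
import pandas as pd

# Stand-in for the real points table, which would come from pd.read_csv.
# The osm_id column is a hypothetical placeholder.
df = pd.DataFrame({"osm_id": range(1000)})

# random_state fixes the random draw, so re-running selects the same rows.
sample_a = df.sample(frac=0.1, random_state=42)
sample_b = df.sample(frac=0.1, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```

With a fixed seed, anyone re-running the notebook works from the same sampled rows, which makes later feature counts comparable between runs.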