This is a series of articles about my ongoing journey into the dark forest of Kaggle competitions as a .NET developer.

I will be focusing on (almost) pure neural networks in this and the following articles. This means that most of the boring parts of dataset preparation, such as filling in missing values, feature selection, and outlier analysis, will be intentionally skipped.

The tech stack will be C# + the TensorFlow tf.keras API. As of today it also requires Windows. Larger models in future articles may need a suitable GPU for their training time to remain sane.

Let's predict real estate prices!


House Prices is a great competition for novices to start with. Its dataset is small, there are no special rules, the public leaderboard has many participants, and you can submit up to 4 entries a day.

Register on Kaggle if you have not done so yet, join the competition, and download the data. The goal is to predict the sale price (the SalePrice column) for the entries in test.csv. The archive also contains train.csv, which has about 1500 entries with known sale prices to train on. We'll begin by loading that dataset and exploring it a little before getting into neural networks.

Analyze training data


Did I say we will skip the dataset preparation? I lied! You have to take a look at least once.

To my surprise, I did not find an easy way to load a .csv file in the .NET standard class library, so I installed a NuGet package called CsvHelper. To simplify data manipulation, I also got my new favorite LINQ extension package, MoreLinq.

Loading .csv data into DataTable
static DataTable LoadData(string csvFilePath) {
  var result = new DataTable();
  using (var reader = new CsvDataReader(new CsvReader(new StreamReader(csvFilePath)))) {
    result.Load(reader);
  }
  return result;
}


ML.NET
Using DataTable for training data manipulation is actually a bad idea.

ML.NET is supposed to provide .csv loading and many of the data preparation and exploration operations. However, it was not ready for this particular purpose yet when I first entered the House Prices competition.


The data looks like this (only a few rows and columns):

Id  MSSubClass  MSZoning  LotFrontage  LotArea
1   60          RL        65           8450
2   20          RL        80           9600
3   60          RL        68           11250
4   70          RL        60           9550


After loading data, we need to remove the Id column, as it is actually unrelated to the house prices:

var trainData = LoadData("train.csv");
trainData.Columns.Remove("Id");

Analyzing the column data types


DataTable does not automatically infer the data types of the columns, and assumes they are all strings. So the next step is to determine what we actually have. For each column I computed the following statistics: the number of distinct values, what fraction of them parse as integers, and what fraction parse as floating-point numbers (the source code with all helper methods will be linked at the end of the article):

var values = rows.Select(row => (string)row[column]);
double floats = values.Percentage(v => double.TryParse(v, out _));
double ints = values.Percentage(v => int.TryParse(v, out _));
int distincts = values.Distinct().Count();
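The Percentage extension used here is one of the helpers from the source linked at the end. A minimal sketch of it, with semantics inferred purely from its usage (shown as a plain function rather than an extension method):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the Percentage helper: the fraction of items in a sequence
// that satisfy a predicate. The real implementation is in the linked source.
double Percentage<T>(IEnumerable<T> source, Func<T, bool> predicate) {
    var items = source.ToList();
    return items.Count == 0 ? 0 : (double)items.Count(predicate) / items.Count;
}

// "NA" does not parse as an int, so 3 out of 4 values count.
var values = new[] { "1", "2", "NA", "4" };
double ints = Percentage(values, v => int.TryParse(v, out _));
Console.WriteLine(ints); // 0.75
```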

Numeric columns


It turns out most columns are actually ints, but since neural networks mostly operate on floating-point numbers, we will convert them to doubles anyway.

Categorical columns


The other columns describe categories the property on sale belongs to. None of them has too many distinct values, which is good. To use them as an input for our future neural network, they have to be converted to double too.

Initially, I simply assigned them numbers from 0 to distinctValueCount - 1, but that does not make much sense, as there is actually no progression from "Facade: Blue" through "Facade: Green" to "Facade: White". So early on I changed that to what's called one-hot encoding, where each distinct value gets a separate input column. E.g. "Facade: Blue" becomes [1,0,0], and "Facade: White" becomes [0,0,1].
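A minimal sketch of that encoding (the facade-color domain is just an illustration, not a real dataset column):

```csharp
using System;

// One-hot encoding: each distinct category value maps to a vector
// with a single 1 at that value's position.
string[] domain = { "Blue", "Green", "White" }; // sorted distinct values
double[] OneHot(string value) {
    var encoded = new double[domain.Length];
    int index = Array.IndexOf(domain, value);
    if (index >= 0) encoded[index] = 1; // unknown values encode as all zeros
    return encoded;
}

Console.WriteLine(string.Join(",", OneHot("Blue")));  // 1,0,0
Console.WriteLine(string.Join(",", OneHot("White"))); // 0,0,1
```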

Putting it all together


Large output of data exploration
CentralAir: 2 values, ints: 0.00%, floats: 0.00%
Street: 2 values, ints: 0.00%, floats: 0.00%
Utilities: 2 values, ints: 0.00%, floats: 0.00%
Alley: 3 values, ints: 0.00%, floats: 0.00%
BsmtHalfBath: 3 values, ints: 100.00%, floats: 100.00%
HalfBath: 3 values, ints: 100.00%, floats: 100.00%
LandSlope: 3 values, ints: 0.00%, floats: 0.00%
PavedDrive: 3 values, ints: 0.00%, floats: 0.00%
BsmtFullBath: 4 values, ints: 100.00%, floats: 100.00%
ExterQual: 4 values, ints: 0.00%, floats: 0.00%
Fireplaces: 4 values, ints: 100.00%, floats: 100.00%
FullBath: 4 values, ints: 100.00%, floats: 100.00%
GarageFinish: 4 values, ints: 0.00%, floats: 0.00%
KitchenAbvGr: 4 values, ints: 100.00%, floats: 100.00%
KitchenQual: 4 values, ints: 0.00%, floats: 0.00%
LandContour: 4 values, ints: 0.00%, floats: 0.00%
LotShape: 4 values, ints: 0.00%, floats: 0.00%
PoolQC: 4 values, ints: 0.00%, floats: 0.00%
BldgType: 5 values, ints: 0.00%, floats: 0.00%
BsmtCond: 5 values, ints: 0.00%, floats: 0.00%
BsmtExposure: 5 values, ints: 0.00%, floats: 0.00%
BsmtQual: 5 values, ints: 0.00%, floats: 0.00%
ExterCond: 5 values, ints: 0.00%, floats: 0.00%
Fence: 5 values, ints: 0.00%, floats: 0.00%
GarageCars: 5 values, ints: 100.00%, floats: 100.00%
HeatingQC: 5 values, ints: 0.00%, floats: 0.00%
LotConfig: 5 values, ints: 0.00%, floats: 0.00%
MasVnrType: 5 values, ints: 0.00%, floats: 0.00%
MiscFeature: 5 values, ints: 0.00%, floats: 0.00%
MSZoning: 5 values, ints: 0.00%, floats: 0.00%
YrSold: 5 values, ints: 100.00%, floats: 100.00%
Electrical: 6 values, ints: 0.00%, floats: 0.00%
FireplaceQu: 6 values, ints: 0.00%, floats: 0.00%
Foundation: 6 values, ints: 0.00%, floats: 0.00%
GarageCond: 6 values, ints: 0.00%, floats: 0.00%
GarageQual: 6 values, ints: 0.00%, floats: 0.00%
Heating: 6 values, ints: 0.00%, floats: 0.00%
RoofStyle: 6 values, ints: 0.00%, floats: 0.00%
SaleCondition: 6 values, ints: 0.00%, floats: 0.00%
BsmtFinType1: 7 values, ints: 0.00%, floats: 0.00%
BsmtFinType2: 7 values, ints: 0.00%, floats: 0.00%
Functional: 7 values, ints: 0.00%, floats: 0.00%
GarageType: 7 values, ints: 0.00%, floats: 0.00%
BedroomAbvGr: 8 values, ints: 100.00%, floats: 100.00%
Condition2: 8 values, ints: 0.00%, floats: 0.00%
HouseStyle: 8 values, ints: 0.00%, floats: 0.00%
PoolArea: 8 values, ints: 100.00%, floats: 100.00%
RoofMatl: 8 values, ints: 0.00%, floats: 0.00%
Condition1: 9 values, ints: 0.00%, floats: 0.00%
OverallCond: 9 values, ints: 100.00%, floats: 100.00%
SaleType: 9 values, ints: 0.00%, floats: 0.00%
OverallQual: 10 values, ints: 100.00%, floats: 100.00%
MoSold: 12 values, ints: 100.00%, floats: 100.00%
TotRmsAbvGrd: 12 values, ints: 100.00%, floats: 100.00%
Exterior1st: 15 values, ints: 0.00%, floats: 0.00%
MSSubClass: 15 values, ints: 100.00%, floats: 100.00%
Exterior2nd: 16 values, ints: 0.00%, floats: 0.00%
3SsnPorch: 20 values, ints: 100.00%, floats: 100.00%
MiscVal: 21 values, ints: 100.00%, floats: 100.00%
LowQualFinSF: 24 values, ints: 100.00%, floats: 100.00%
Neighborhood: 25 values, ints: 0.00%, floats: 0.00%
YearRemodAdd: 61 values, ints: 100.00%, floats: 100.00%
ScreenPorch: 76 values, ints: 100.00%, floats: 100.00%
GarageYrBlt: 98 values, ints: 94.45%, floats: 94.45%
LotFrontage: 111 values, ints: 82.26%, floats: 82.26%
YearBuilt: 112 values, ints: 100.00%, floats: 100.00%
EnclosedPorch: 120 values, ints: 100.00%, floats: 100.00%
BsmtFinSF2: 144 values, ints: 100.00%, floats: 100.00%
OpenPorchSF: 202 values, ints: 100.00%, floats: 100.00%
WoodDeckSF: 274 values, ints: 100.00%, floats: 100.00%
MasVnrArea: 328 values, ints: 99.45%, floats: 99.45%
2ndFlrSF: 417 values, ints: 100.00%, floats: 100.00%
GarageArea: 441 values, ints: 100.00%, floats: 100.00%
BsmtFinSF1: 637 values, ints: 100.00%, floats: 100.00%
SalePrice: 663 values, ints: 100.00%, floats: 100.00%
TotalBsmtSF: 721 values, ints: 100.00%, floats: 100.00%
1stFlrSF: 753 values, ints: 100.00%, floats: 100.00%
BsmtUnfSF: 780 values, ints: 100.00%, floats: 100.00%
GrLivArea: 861 values, ints: 100.00%, floats: 100.00%
LotArea: 1073 values, ints: 100.00%, floats: 100.00%

Columns with many values:
Exterior1st: AsbShng, AsphShn, BrkComm, BrkFace, CBlock, CemntBd, HdBoard, ImStucc, MetalSd, Plywood, Stone, Stucco, VinylSd, Wd Sdng, WdShing
Exterior2nd: AsbShng, AsphShn, Brk Cmn, BrkFace, CBlock, CmentBd, HdBoard, ImStucc, MetalSd, Other, Plywood, Stone, Stucco, VinylSd, Wd Sdng, Wd Shng
Neighborhood: Blmngtn, Blueste, BrDale, BrkSide, ClearCr, CollgCr, Crawfor, Edwards, Gilbert, IDOTRR, MeadowV, Mitchel, NAmes, NoRidge, NPkVill, NridgHt, NWAmes, OldTown, Sawyer, SawyerW, Somerst, StoneBr, SWISU, Timber, Veenker

non-parsable floats
GarageYrBlt: NA
LotFrontage: NA
MasVnrArea: NA

float ranges:
BsmtHalfBath: 0...2
HalfBath: 0...2
BsmtFullBath: 0...3
Fireplaces: 0...3
FullBath: 0...3
KitchenAbvGr: 0...3
GarageCars: 0...4
YrSold: 2006...2010
BedroomAbvGr: 0...8
PoolArea: 0...738
OverallCond: 1...9
OverallQual: 1...10
MoSold: 1...12
TotRmsAbvGrd: 2...14
MSSubClass: 20...190
3SsnPorch: 0...508
MiscVal: 0...15500
LowQualFinSF: 0...572
YearRemodAdd: 1950...2010
ScreenPorch: 0...480
GarageYrBlt: 1900...2010
LotFrontage: 21...313
YearBuilt: 1872...2010
EnclosedPorch: 0...552
BsmtFinSF2: 0...1474
OpenPorchSF: 0...547
WoodDeckSF: 0...857
MasVnrArea: 0...1600
2ndFlrSF: 0...2065
GarageArea: 0...1418
BsmtFinSF1: 0...5644
SalePrice: 34900...755000
TotalBsmtSF: 0...6110
1stFlrSF: 334...4692
BsmtUnfSF: 0...2336
GrLivArea: 334...5642
LotArea: 1300...215245


With that in mind, I built the following ValueNormalizer, which takes some info about the values in a column and returns a function that transforms a value (a string) into a numeric feature vector for the neural network (double[]):

ValueNormalizer
static Func<string, double[]> ValueNormalizer(
    double floats, IEnumerable<string> values) {
  if (floats > 0.01) {
    // numeric column: scale to [0..1] by the maximum; -1 marks unparsable values like "NA"
    double max = values.AsDouble().Max().Value;
    return s => new[] { double.TryParse(s, out double v) ? v / max : -1 };
  } else {
    // categorical column: one-hot encode; slot 0 is reserved for unseen values
    string[] domain = values.Distinct().OrderBy(v => v).ToArray();
    return s => new double[domain.Length + 1]
                .Set(Array.IndexOf(domain, s) + 1, 1);
  }
}
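ValueNormalizer leans on two small helper extensions, AsDouble and Set, which live in the linked source. A sketch of what they might look like, with signatures inferred purely from their usage above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var parsed = new[] { "42", "NA" }.AsDouble().ToArray(); // [42, null]
var oneHot = new double[4].Set(2, 1);                   // [0, 0, 1, 0]
Console.WriteLine($"{parsed[0]} {parsed[1] == null} {oneHot[2]}");

// Helper extensions assumed by ValueNormalizer; the real implementations
// are in the linked source, these are inferred from usage.
static class Extensions {
    // Parse each string to a nullable double, yielding null for values
    // that do not parse, such as the "NA" placeholders in this dataset.
    public static IEnumerable<double?> AsDouble(this IEnumerable<string> values)
        => values.Select(v => double.TryParse(v, out double d) ? d : (double?)null);

    // Set one element and return the same array, so it can be used inline.
    public static double[] Set(this double[] array, int index, double value) {
        array[index] = value;
        return array;
    }
}
```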


Now we've got the data converted into a format suitable for a neural network. It is time to build one.

Build a neural network


As of today, you would need to use a Windows machine for that.

If you already have Python and TensorFlow 1.1x installed, all you need is

<PackageReference Include="Gradient" Version="0.1.10-tech-preview3" />

in your modern .csproj file. Otherwise, refer to the Gradient manual to do the initial setup.

Once the package is up and running, we can create our first shallow deep network.

using tensorflow;
using tensorflow.keras;
using tensorflow.keras.layers;
using tensorflow.train;

...

var model = new Sequential(new Layer[] {
  new Dense(units: 16, activation: tf.nn.relu_fn),
  new Dropout(rate: 0.1),
  new Dense(units: 10, activation: tf.nn.relu_fn),
  new Dense(units: 1, activation: tf.nn.relu_fn),
});

model.compile(optimizer: new AdamOptimizer(), loss: "mean_squared_error");

This will create an untrained neural network with 3 layers of neurons and a dropout layer that helps prevent overfitting.

tf.nn.relu_fn
tf.nn.relu_fn is the activation function for our neurons. ReLU is known to work well in deep networks, because it mitigates the vanishing gradient problem: the derivatives of the older non-linear activation functions tended to become very small as the error propagated back from the output layer of a deep network. That meant the layers closer to the input adjusted only very slightly, which slowed training of deep networks significantly.
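The function itself could hardly be simpler; a quick sketch:

```csharp
using System;

// ReLU clamps negative inputs to zero: relu(x) = max(0, x). Its derivative
// is exactly 1 for positive inputs, so gradients pass through the layer
// without shrinking, which is what mitigates vanishing gradients.
double Relu(double x) => Math.Max(0, x);

Console.WriteLine(Relu(-3.0)); // 0
Console.WriteLine(Relu(2.5));  // 2.5
```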

Dropout
Dropout is a special-function layer in neural networks, which actually does not contain neurons as such. Instead, it operates by taking each individual input and randomly replacing it with 0 in its output (otherwise it passes the original value along). By doing so it helps prevent overfitting to less relevant features in a small dataset. For example, if we did not remove the Id column, the network could have potentially memorized the <Id>-><SalePrice> mapping exactly, which would give us 100% accuracy on the training set, but completely unrelated numbers on any other data.

Why do we need dropout? Our training data only has ~1500 examples, and the tiny neural network we've built has more than 1800 tunable weights. If it were a simple polynomial, it could match the price function we are trying to approximate exactly. But then it would produce enormous values on any inputs outside the original training set.
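Where does that 1800 figure come from? A Dense layer with n inputs and m units has n*m weights plus m biases. A back-of-the-envelope check, assuming roughly 100 input features after encoding (the exact count depends on how the categorical columns get expanded):

```csharp
using System;

// Parameter count of a Dense layer: one weight per input-unit pair,
// plus one bias per unit.
int DenseParams(int inputs, int units) => inputs * units + units;

int inputFeatures = 100; // assumed; depends on the one-hot expansion
int total = DenseParams(inputFeatures, 16) // first hidden layer
          + DenseParams(16, 10)            // second hidden layer
          + DenseParams(10, 1);            // output layer
Console.WriteLine(total); // 1797, i.e. on the order of 1800
```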

Feed the data


TensorFlow expects its data either in NumPy arrays or in existing tensors. I am converting DataRows into NumPy arrays:

using numpy;

...

const string predict = "SalePrice";

ndarray GetInputs(IEnumerable<DataRow> rowSeq) {
  return np.array(rowSeq.Select(row => np.array(
      columnTypes
      .Where(c => c.column.ColumnName != predict)
      .SelectMany(column => column.normalizer(
        row.Table.Columns.Contains(column.column.ColumnName)
        ? (string)row[column.column.ColumnName]
        : "-1"))
      .ToArray()))
    .ToArray()
  );
}

var predictColumn = columnTypes.Single(c => c.column.ColumnName == predict);
ndarray trainOutputs = np.array(predictColumn.trainValues
                                             .AsDouble()
                                             .Select(v => v ?? -1)
                                             .ToArray());
ndarray trainInputs = GetInputs(trainRows);

In the code above we convert each DataRow into an ndarray by taking every cell in it and applying the ValueNormalizer corresponding to its column. Then we put all rows into another ndarray, getting an array of arrays.

No such transform is needed for outputs, where we just convert train values to another ndarray.

Time to get down the gradient


With this setup, all we need to do to train our network is to call the model's fit function:

model.fit(trainInputs, trainOutputs,
          epochs: 2000,
          validation_split: 0.075,
          verbose: 2);

This call will actually set aside the last 7.5% of the training set for validation, then repeat the following 2000 times:

  1. split the rest of trainInputs into batches
  2. feed these batches one by one into the neural network
  3. compute error using the loss function we defined above
  4. backpropagate the error through the gradients of individual neuron connections, adjusting weights
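Step 4 is the heart of the whole process. Stripped of batching and backpropagation machinery, the idea fits in a few lines; a toy sketch (not the actual Keras internals) that minimizes a one-dimensional squared error by repeatedly stepping against the gradient:

```csharp
using System;

// Toy gradient descent on a single weight: minimize loss(w) = (w - 3)^2,
// whose gradient is 2 * (w - 3). Each step nudges the weight against the
// gradient; Keras does the same simultaneously for all tunable weights.
double w = 0, learningRate = 0.1;
for (int step = 0; step < 100; step++)
    w -= learningRate * 2 * (w - 3);

Console.WriteLine(w); // converges to ~3
```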

While training, it will output the network's error on the data it set aside for validation as val_loss, and the error on the training data itself as just loss. Generally, if val_loss becomes much greater than loss, it means the network has started overfitting. I will address that in more detail in the following articles.

If you did everything correctly, the square root of one of your losses should be on the order of 20,000.



Submission


I won't talk much about generating the file to submit here. The code to compute outputs is simple:

const string SubmissionInputFile = "test.csv";
DataTable submissionData = LoadData(SubmissionInputFile);
var submissionRows = submissionData.Rows.Cast<DataRow>();
ndarray submissionInputs = GetInputs(submissionRows);
ndarray submissionOutputs = model.predict(submissionInputs);

which mostly uses functions that were defined earlier.

Then you need to write them into a .csv file, which is simply a list of Id,predicted_value pairs.
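A minimal sketch of writing that file (the predictions here are a stand-in double[]; in the real code you would first copy the values out of the predict() result):

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

// Submission file: a header line followed by Id,SalePrice pairs.
int firstId = 1461; // ids in test.csv continue where train.csv left off
double[] predictions = { 169000.5, 187724.1, 215000.0 }; // stand-in values

File.WriteAllLines("submission.csv",
    new[] { "Id,SalePrice" }.Concat(predictions.Select(
        (p, i) => $"{firstId + i},{p.ToString(CultureInfo.InvariantCulture)}")));

Console.WriteLine(File.ReadAllText("submission.csv"));
```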

When you submit your result, you should get a score on the order of 0.17, which lands somewhere in the last quarter of the public leaderboard. But hey, if it were as simple as a 3-layer network with 27 neurons, those pesky data scientists would not be getting $300k+/y total compensation from the major US companies.

Wrapping up


The full source code for this entry (with all of the helpers, and some commented-out parts of my earlier exploration and experiments) is about 200 lines on PasteBin.

In the next article you will see my shenanigans trying to get into the top 50% of that public leaderboard. It's going to be an amateur journeyman's adventure, a fight with The Windmill of Overfitting using the only tool the wanderer has: a bigger model (e.g. a deep NN; remember, no manual feature engineering!). It will be less of a coding tutorial and more of a thought quest with really crooked math and a weird conclusion. Stay tuned!

Links


Kaggle
House Prices competition on Kaggle
TensorFlow regression tutorial
TensorFlow home page
TensorFlow API reference
Gradient (TensorFlow binding)

Comments (1)


  1. roryorangepants
    23.01.2019 11:56

    Its going to be an amateur journeyman's adventure, a fight with The Windmill of Overfitting with the only tool the wanderer has — a bigger model (e.g. deep NN, remember, no manual feature engineering!)

    It is more common to fight overfitting by using simpler models, not bigger ones.
    If you increase the size of your model, you move from underfitting toward overfitting on the bias-variance curve.