

During my 4th year as an undergraduate student in statistics, I took a Big Data class where we had to compete in a Kaggle competition. The competition was about imputing missing values in a time series weather data set acquired from 76 different weather stations in the United Kingdom. The data set consisted of 1,880,892 rows and 11 columns.

In total, there were 2,074,873 missing values, but we only had to make predictions for 29,000 of them. The final score was evaluated by calculating the mean absolute error. Our approaches to missing data imputation included linear interpolation, random forests, and linear regressions. In the end, however, the crucial part of this module was to figure out for which NA values we needed to use linear interpolation, and which ones we needed to predict with modeling.

Data Preparation

```r
# reading in training set
train <- train %>%
  # important for looking up missing values in the training set that are in the test set
  dplyr::mutate(ID = dplyr::row_number()) %>%
  # some values have abnormally large values
  # we decided that we are going to impute NA for outliers
  dplyr::mutate_at(vars(WINDDIR, WINDSPEED, TEMPERATURE, DEWPOINT, PRESSURE),
                   replace_outliers) # hypothetical helper name; the outlier rule is not shown in the extracted source
```

In the code above, we are replacing all outliers with NA values. Interpolation beats any model for small gaps of NA values; therefore, we are imputing values by interpolation for gaps of 3 or smaller.
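The gap-limited rule — linearly interpolate only runs of at most 3 consecutive missing values, and leave longer gaps for a model — can be sketched in plain Python (the function and variable names here are my own illustration, not the competition code, which was written in R):

```python
def interpolate_small_gaps(values, max_gap=3):
    """Linearly interpolate runs of None of length <= max_gap.

    Longer gaps, and gaps at either edge of the series, are left as
    None (in the write-up above, those would go to a model instead).
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            start = i
            while i < n and out[i] is None:
                i += 1
            gap = i - start
            # fill only if we have known neighbors on both sides
            # and the gap is small enough
            if 0 < start and i < n and gap <= max_gap:
                left, right = out[start - 1], out[i]
                step = (right - left) / (gap + 1)
                for k in range(gap):
                    out[start + k] = left + step * (k + 1)
        else:
            i += 1
    return out
```

A gap of two values between 1 and 4 is filled as 2.0 and 3.0, while a gap of four consecutive Nones is left untouched for the modeling stage.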

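For reference, the evaluation metric — mean absolute error — is just the average absolute difference between the submitted imputations and the held-out true values; a minimal sketch:

```python
def mean_absolute_error(y_true, y_pred):
    # average of |truth - prediction| over all scored missing values
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```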