When working with features in machine learning and representing them in a matrix, what's the recommended way to represent hour of day and day of week as features for value prediction models?

Is a one-hot style encoding, with 1 for the hour in question and 0 for all other hours, the preferred way to represent these attributes as features? And the same for day of week?

Thanks

There are 2 answers

Tushar Gupta On

In this case there is a periodic weekly trend and a long-term upward trend, so you would want to encode two time variables:

  • day_of_week
  • absolute_time

In general

There are several common time frames that trends occur over:

  • absolute_time
  • day_of_year
  • day_of_week
  • month_of_year
  • hour_of_day
  • minute_of_hour

Look for trends in all of these.
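
As a rough illustration, the time frames listed above can all be derived from a single timestamp with Python's standard library. This is only a sketch: the function name time_features and the dictionary layout are assumptions made for this example, not part of the original answer.

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Return the common calendar features for one timestamp (hypothetical helper)."""
    return {
        # Seconds since the Unix epoch; a naive datetime is read as local time.
        "absolute_time": ts.timestamp(),
        "day_of_year": ts.timetuple().tm_yday,  # 1..366
        "day_of_week": ts.weekday(),            # Monday=0 .. Sunday=6
        "month_of_year": ts.month,              # 1..12
        "hour_of_day": ts.hour,                 # 0..23
        "minute_of_hour": ts.minute,            # 0..59
    }

features = time_features(datetime(2015, 3, 1, 23, 30))
```

Each key becomes one column of the feature matrix, which makes it easy to plot the target against every time frame and see which ones actually carry a trend.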

Weird trends

Look for weird trends too. For example, you may see rare but persistent time-based trends:

  • is_easter
  • is_superbowl
  • is_national_emergency etc.

These often require that you cross reference your data against some external source that maps events to time.
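
One possible way to do that cross-referencing is a lookup table from event name to dates, turned into 0/1 flags per row. The sketch below assumes a small hand-written table called SPECIAL_EVENTS and a helper called event_flags; both are hypothetical, and in practice the dates would come from an external calendar source.

```python
from datetime import date

# Hypothetical lookup table for illustration; real data would come from
# an external source that maps events to dates.
SPECIAL_EVENTS = {
    "is_easter": {date(2015, 4, 5), date(2016, 3, 27)},
    "is_superbowl": {date(2015, 2, 1), date(2016, 2, 7)},
}

def event_flags(day: date) -> dict:
    """Return one 0/1 flag per known event for the given day."""
    return {name: int(day in days) for name, days in SPECIAL_EVENTS.items()}

flags = event_flags(date(2015, 4, 5))
```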

Why graph?

There are two reasons that I think graphing is so important.

Weird trends: While the general trends can be automated pretty easily (just add them every time), weird trends will often require a human eye and knowledge of the world to find. This is one reason that graphing is so important.

Data errors: All too often data has serious errors in it. For example, you may find that the dates were encoded in two formats and only one of them has been correctly loaded into your program. There are a myriad of such problems and they are surprisingly common. This is the other reason I think graphing is important, not just for time series, but for any data.
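
Alongside graphing, a quick programmatic check can surface the two-format problem described above. This is a minimal sketch: the two formats in KNOWN_FORMATS and the helper parse_date are assumptions for illustration, and real data may need other formats.

```python
from datetime import datetime

# Assumed formats for this example only.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

raw = ["2015-03-01", "01/03/2015", "March 1st"]
parsed = [parse_date(s) for s in raw]

# Values that failed to parse are worth inspecting by hand.
unparsed = [s for s, d in zip(raw, parsed) if d is None]
```

Counting how many rows end up in unparsed gives an early warning that part of the data was silently dropped or misread.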

Answer from https://datascience.stackexchange.com/questions/2368/machine-learning-features-engineering-from-date-time-data

Nadjmeddine Boudjellal On

No, that choice isn't ideal, because it loses the cyclic structure of the data. For hours, the model needs to know that 23:00 is close to 00:00, and the same holds for weekdays, which are usually numbered from Monday as 0 to Sunday as 6. With a separate 0/1 indicator per value, the model treats every hour or day as an independent entity with no relation to the others, so that adjacency is lost.

A better way to represent this type of data is to replace each feature (hour, day of week, ...) with two features: the sine and cosine of its angle on the cycle. For hours, for example, you create hours_sin and hours_cos and compute both values for each hour. Before applying sin and cos you need to calculate theta, scaling the value by the length of the cycle (24 for hours, 7 for days of the week) so that one full cycle maps to 2 * pi. In Python, import pi from math, then:

theta = 2 * pi * hour / 24

Then also import sin and cos from math, and calculate sin(theta) and cos(theta).
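
The sin/cos encoding can be sketched as follows. Note that the value is divided by the cycle length (24 for hours, 7 for weekdays) so that one full cycle corresponds to an angle of 2 * pi; without that division every integer hour would map to the same point. The names encode_cyclic and cyclic_distance are hypothetical helpers introduced for this example.

```python
from math import cos, pi, sin

def encode_cyclic(value: float, period: float) -> tuple:
    """Map a value on a cycle of length `period` to (sin, cos) of its angle."""
    theta = 2 * pi * value / period
    return sin(theta), cos(theta)

def cyclic_distance(a: float, b: float, period: float) -> float:
    """Euclidean distance between two values after the sin/cos encoding."""
    sin_a, cos_a = encode_cyclic(a, period)
    sin_b, cos_b = encode_cyclic(b, period)
    return ((sin_a - sin_b) ** 2 + (cos_a - cos_b) ** 2) ** 0.5

hours_sin, hours_cos = encode_cyclic(23, 24)  # hour of day
days_sin, days_cos = encode_cyclic(6, 7)      # Sunday, with Monday=0

# 23:00 and 00:00 end up as close together as 00:00 and 01:00.
late_to_midnight = cyclic_distance(23, 0, 24)
midnight_to_one = cyclic_distance(0, 1, 24)
```

Using both sin and cos matters: either one alone maps two different hours to the same value, while the pair identifies each point on the circle uniquely.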