u/Virtual-Current6295

How to apply normalization to cross-sectional time series data?

I can't convince myself to settle on one method.
Some methods I thought of:

  1. Normalize each subject separately, using that subject's full training data, for every feature. This introduces a kind of lookahead bias, and it also throws away some information that could have been valuable. And if I want to use one model (say, regression with gradient descent) for all subjects combined, I can't judge whether this is a good method.
  2. Ignore the subjects and normalize each feature over the full dataset. This just feels wrong to me.
  3. I was reading about cross-sectional normalization, which ranks the subjects against each other and applies some kind of normalization. But I am unsure how that would be useful.
  4. Use a rolling window, so that I normalize over a window of past data instead of the full data. This seems better, but it leaves a lot of open questions, such as which window length to choose.
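
For reference, methods 3 and 4 could be sketched roughly as below. This is only an illustration under assumptions: the data is taken to be a long-format pandas DataFrame with hypothetical columns `subject` and `time`, and the window length and `min_periods` values are arbitrary placeholders, not recommendations.

```python
import numpy as np
import pandas as pd


def rolling_zscore(df, cols, window=60, min_periods=20):
    """Method 4: per-subject z-score from a window of strictly past rows.

    Column names "subject" and "time" are assumptions for this sketch.
    """
    out = df.sort_values(["subject", "time"]).copy()

    def _past_z(s):
        # shift(1) drops the current row, so the statistics never see
        # the value being normalized or anything after it.
        past = s.shift(1)
        mean = past.rolling(window, min_periods=min_periods).mean()
        std = past.rolling(window, min_periods=min_periods).std()
        return (s - mean) / std.replace(0.0, np.nan)

    out[cols] = out.groupby("subject")[cols].transform(_past_z)
    return out


def cross_sectional_rank(df, cols):
    """Method 3: rank subjects against each other at each time step.

    Percentile ranks lie in (0, 1]; subtracting 0.5 roughly centers them.
    """
    out = df.copy()
    out[cols] = out.groupby("time")[cols].rank(pct=True) - 0.5
    return out
```

The rolling version leaves NaN in the first rows of each subject until `min_periods` past observations exist, so those rows would have to be dropped or handled separately. The rank version uses only values from the same time step, so it cannot leak future information, but it discards the absolute level of a feature and keeps only the ordering across subjects.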

And the bigger problem with all of these is the time series aspect. I would lose quite a lot of information if I don't take it into account (although it isn't a big factor for every feature).

reddit.com
u/Virtual-Current6295 — 9 days ago

How do I start with regression on a huge dataset with a huge number of features?

The full dataset is about 80 GB; my laptop has just 16 GB of RAM. The good thing is that I have already split the data into separate feather files of around 500 MB each.
Besides the huge file size, I have a huge number of features (around 1500), and it's a complex problem. I know linear regression is not a great choice, but I am trying it first to establish some initial bounds / baselines.

I read up on how to reduce the number of features, and something like a covariance matrix or PCA would help me remove correlated features, but computing that is itself a big challenge. I read up on stream / map / reduce, which I might be able to use in Python, but it is still very slow. How can I do this faster?

Basically, how do I load the data faster and perform operations on it?
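
One possible way to get the covariance matrix without holding all 80 GB in memory is to read one file at a time and accumulate running sums. The sketch below assumes every file has the same numeric columns with no missing values; the function names are made up for illustration, and `iter_feather` needs pandas with pyarrow installed.

```python
import numpy as np


def iter_feather(paths, columns=None):
    """Yield each feather file as a float64 array, one file at a time."""
    import pandas as pd  # read_feather requires pyarrow to be installed

    for path in paths:
        yield pd.read_feather(path, columns=columns).to_numpy(dtype=np.float64)


def streaming_covariance(chunks):
    """Accumulate the row count, column sums and X^T X over all chunks."""
    n = 0
    total = None
    gram = None
    for x in chunks:
        if total is None:
            total = np.zeros(x.shape[1])
            gram = np.zeros((x.shape[1], x.shape[1]))
        n += x.shape[0]
        total += x.sum(axis=0)
        gram += x.T @ x  # matrix product runs in BLAS, not a Python loop
    mean = total / n
    # Sample covariance: (X^T X - n * mean mean^T) / (n - 1)
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    return mean, cov
```

With around 1500 features the accumulated matrix is 1500 x 1500 float64 values, about 18 MB, so only the current file has to fit in RAM. Passing `columns=` to `read_feather` loads just a subset of features. One caveat: this formula can lose precision when feature means are large relative to their spread, so subtracting a rough per-feature mean first may be safer.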

u/Virtual-Current6295 — 10 days ago

How to apply linear regression to a huge dataset with a large number of features?

The full dataset is about 80 GB; my laptop has just 16 GB of RAM. The good thing is that I have already split the data into separate feather files of around 500 MB each.
Besides the huge file size, I have a huge number of features (around 1500), and it's a complex problem. I know linear regression is not a great choice, but I am trying it first to establish some initial bounds / baselines.

I read up on how to reduce the number of features, and something like a covariance matrix or PCA would help me remove correlated features, but computing that is itself a big challenge. I read up on stream / map / reduce, which I might be able to use in Python, but it is still very slow.

My plan right now is to use the covariance matrix and PCA to reduce the features first, and then try linear regression.
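
A rough sketch of that plan, assuming the feature means and covariance matrix have already been computed somehow: PCA comes from an eigendecomposition of the covariance matrix, and the linear regression is fitted by accumulating the normal equations chunk by chunk. The function names, the 95% variance threshold and the small ridge term are assumptions for illustration.

```python
import numpy as np


def pca_components(cov, var_keep=0.95):
    """Return the top eigenvectors that explain `var_keep` of the variance."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, var_keep)) + 1
    k = min(k, len(eigvals))
    return eigvecs[:, :k]


def fit_linear_streaming(chunks, mean, components, ridge=1e-6):
    """Least squares on PCA-projected features, one (x, y) chunk at a time."""
    k = components.shape[1]
    a = np.zeros((k + 1, k + 1))  # accumulates Z^T Z
    b = np.zeros(k + 1)           # accumulates Z^T y
    for x, y in chunks:
        z = (x - mean) @ components
        z = np.hstack([z, np.ones((z.shape[0], 1))])  # intercept column
        a += z.T @ z
        b += z.T @ y
    reg = ridge * np.eye(k + 1)
    reg[-1, -1] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(a + reg, b)  # last entry is the intercept
```

Because only a (k+1) x (k+1) matrix and a vector are kept between chunks, memory use does not grow with the number of rows. If the features are on very different scales, it may make sense to standardize them before PCA, since otherwise the largest-variance features dominate the components.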

Are there better ways, or in general some steps I should follow, to reduce this dataset? Sampling seems to be a good option for an approximation.
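
The sampling idea could look roughly like this: keep a random fraction of rows from each chunk so that the combined sample fits in memory. The function name, the 5% default and the fixed seed are placeholders for illustration, and the chunks are assumed to be NumPy arrays with the same columns.

```python
import numpy as np


def sample_rows(chunks, fraction=0.05, seed=0):
    """Keep each row independently with probability `fraction`."""
    rng = np.random.default_rng(seed)
    kept = []
    for x in chunks:
        mask = rng.random(x.shape[0]) < fraction
        kept.append(x[mask])
    return np.concatenate(kept, axis=0)
```

At 5%, an 80 GB dataset shrinks to roughly 4 GB on disk, which should be workable on a 16 GB machine, though the in-memory size depends on the dtypes. Uniform row sampling ignores any ordering in the data, so it would not suit steps that depend on consecutive rows.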

In general, if someone has experience with this, how should I approach the problem? What steps should I follow to reduce noise and find which features are relevant?
And after this, how do I proceed to deep learning?

u/Virtual-Current6295 — 10 days ago
