During my Bachelor’s Degree, my favorite professor told me this:
Once something works well enough, nobody calls it "AI" anymore
This concept goes in the same direction as Larry Tesler, who said "AI is whatever hasn’t been done yet." The first example of Artificial Intelligence was the calculator, which was (and is) able to do very complex mathematical computations in a fraction of a second, while it would take minutes or hours for a human being. Nonetheless, when we talk about AI today we don’t think about a calculator. We don’t think of it because it simply works incredibly well, and we take it for granted. The Google Search algorithm, which is in many ways much more complex than a calculator, is a form of AI that we use in our everyday lives, yet we don’t even think about it.
So what is really "AI"? When do we stop defining something as AI? The question is pretty complex as, if we really think about it, AI has multiple layers and domains.
It surely has multiple layers of complexity. For example, ChatGPT is more complex than the "digit recognition" 2D CNN proposed by LeCun, both conceptually and computationally, and a simple regression algorithm is in turn far less complex than that same CNN (more on this later).
It surely also has multiple domains. Every time we solve a CAPTCHA, we are actively creating the input for a Neural Network that processes images. Every time we interact with a GPT, we are building the text input for an NLP algorithm. Every time we say "Alexa, turn on the light in the kitchen", we are feeding audio input to a Neural Network. And while it is true that, at the end of the day, everything is nothing but a 0/1 signal for a computer, it is also true that, practically, the Neural Network that processes an image has a completely different philosophy, implementation, and complexity from the one that processes audio or text.
This is why companies are looking more and more for specialized Machine Learning Engineers who know how to treat one specific kind of data better than others. For example, in my professional career, I have worked in the time series domain more than in anything else. This blog post aims to give the reader an idea of how to use Neural Networks in the time series domain, at different levels of complexity. We will start from the simplest Neural Network we have, the Feed Forward Neural Network, and work our way up to the fanciest, most modern, and most complex structure: the Transformer.
I don’t want to bore any of the readers, and I also know that a lot of you wouldn’t find this article useful without any coding, so we are going to translate everything from English to Python.
Let’s get started! 🚀
1. Feed Forward Neural Network
1.1 Explanation
I want to start this section with a brief formalization of the problem. A time series is a sequence of observations recorded over time, where each observation corresponds to a specific point in time. It kind of works like this:

For every time step t_i there is a corresponding y_i. Now, it is obvious that y_i is correlated with y_{i+1}. The data are, as smart people say, "sequential". Nonetheless, a very simple approach is to ignore the time axis completely and consider the sequence y_1, …, y_T as the units of the input layer of our Feed Forward Neural Network:

Once we have our input layer, we process it from left to right in a Feed Forward Neural Network. For example, if we want to do a binary classification or regression task, we can build the following FFNN:

Now, each arrow (or line) that you see in this network is a weight, that is, a real number that multiplies its input; the weighted inputs are then added together (plus a bias) inside the units (the white circles).
You might already see the limitation of this method and why we call it simple: all the units are mixed together, so the "sequentiality" is completely lost. Sure, if the model is trained well it might be able to partially recover it by adjusting the weights accordingly, but we are not enforcing it, which is not ideal. The negative thing about this model is its simplicity, and the positive thing is also its simplicity: for very simple tasks an FFNN works just fine, at a very limited computational cost.
1.2 Code
Now, let’s consider a sine wave. A very simple one, like this one:
Now we’ll make this task very simple. First off, we’ll do a simple regression: one real number out, easy and sweet. The input will be a piece of this sine wave and the output will be the next point. For example, you give me the sine from t = -2.9 s to t = -2.7 s and I give you what happens at t = -2.6 s. It’s easier done than said; this is how it looks:
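The exact snippet I used is not essential; a minimal sketch of the idea could look like this (the sine period, the 0.1 s sampling step, and the window length N are illustrative choices):

```python
import numpy as np

# Sample a simple sine wave between t = -3 s and t = 3 s, every 0.1 s
t = np.arange(-3, 3, 0.1)
y = np.sin(2 * np.pi * t / 3)

# Build (window of N points) -> (next point) pairs
N = 20
X, Y = [], []
for i in range(len(y) - N):
    X.append(y[i:i + N])     # N consecutive points...
    Y.append(y[i + N])       # ...and the point right after them
X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)      # (40, 20) and (40,)
```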
Then we train our Neural Network to go from a sequence of N points to 1 value, where that value is basically the next point. I made it extremely simple; you can add layers by adding Dense layers, changing the number of units, changing the activation function, changing the optimizer, and so on and so forth.
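Something like this, in Keras (the layer sizes, optimizer, and number of epochs are illustrative, not necessarily the exact ones I used):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# A minimal FFNN: the N-point window goes in, one real number comes out
model = Sequential([
    Dense(32, activation='relu', input_shape=(X.shape[1],)),
    Dense(16, activation='relu'),
    Dense(1)                                   # regression output: the next point
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, Y, epochs=100, batch_size=16, validation_split=0.2, verbose=0)
```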
And we can test the results with this code:
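Continuing from the snippets above, a quick comparison between the true next points and the predicted ones might look like this:

```python
import matplotlib.pyplot as plt

# Predict the next point for every window and compare with the truth
Y_pred = model.predict(X, verbose=0).flatten()

plt.plot(t[N:], Y, label='true next point')
plt.plot(t[N:], Y_pred, '--', label='FFNN prediction')
plt.xlabel('t [s]')
plt.legend()
plt.show()
```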
Pretty good, as we expected: we created a very simple task and we asked a very simple Neural Network to do the job.
I don’t want to leave you with just the made-up example, so I will say this: consider using the FFNN when:
- The "sequentiality" of the time series is not that relevant. For example, it might be that you have a signal versus another quantity that is also time-dependent. In that case, maybe your time series is not really a "time" series, and an FFNN could do a good job.
- You want to keep things simple. This is far more common. Models that consider the sequential inputs are usually more computationally intensive. FFNNs are a very good alternative, as they allow you to work with very, very simple structures.
Arguably, unless you have a lot of instances (i.e. a big dataset), it’s good practice to start with an FFNN, because it is the simplest place to start.
2. (1D) Convolutional Neural Networks
2.1 Explanation
So the first way to make things more complex is to actually consider the sequentiality of the input. A way to do that is to use a model that takes care of this sequentiality by "running" a kernel (a small set of weights) along the signal, as shown in the picture below:
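To make the "running kernel" idea concrete, this is what a single 1D convolution boils down to in plain NumPy (a toy signal and a toy kernel, nothing model-specific):

```python
import numpy as np

signal = np.array([0., 1., 2., 3., 2., 1., 0.])
kernel = np.array([0.5, 1.0, 0.5])   # a small set of weights

# Slide the kernel along the signal: at each position, multiply and sum
output = np.array([
    np.dot(signal[i:i + len(kernel)], kernel)
    for i in range(len(signal) - len(kernel) + 1)
])
print(output)   # one value per kernel position: [2. 4. 5. 4. 2.]
```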

This operation is called convolution, and the corresponding network is known as a Convolutional Neural Network (CNN). When someone talks about CNNs, they are probably referring to the structure built by the brilliant mind of Yann LeCun, who developed this network to classify handwritten digits (0–9).
What I showed you above came a little later: it was developed by Serkan Kiranyaz et al. in 2015, in a paper known as "Convolutional Neural Networks for Patient-Specific ECG Classification". The original paper performs a classification task, which is actually a great way to use 1D CNNs (and CNNs in general). For this reason, we will build a very simple 1D CNN classification algorithm to distinguish square waves from sine waves.
2.2 Code
This is how we build num × 2 time series, where num is an integer. By default it is set to 1000, so we will generate 2000 time series: 1000 square waves and 1000 sine waves. This is how:
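Here is a minimal sketch of the generation code (the frequency, phase, and amplitude ranges are illustrative choices, not the original values):

```python
import numpy as np

def build_dataset(num=1000, n_points=100):
    """Build 'num' sine waves (label 0) and 'num' square waves (label 1)."""
    t = np.linspace(0, 1, n_points)
    X, y = [], []
    for _ in range(num):
        freq = np.random.uniform(1, 10)
        phase = np.random.uniform(0, 2 * np.pi)
        amp = np.random.uniform(0.5, 2.0)
        sine = amp * np.sin(2 * np.pi * freq * t + phase)
        square = amp * np.sign(np.sin(2 * np.pi * freq * t + phase))
        X.append(sine)
        y.append(0)
        X.append(square)
        y.append(1)
    return np.array(X), np.array(y)

X, y = build_dataset(num=1000)
print(X.shape, y.shape)   # (2000, 100) and (2000,)
```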
These sine waves have random frequencies, phases, and amplitudes. The square waves are the result of the sign operation applied to random sine waves. Let’s take a look to make sure they look OK:
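One quick plot per class is enough for the sanity check:

```python
import matplotlib.pyplot as plt

# Plot one example of each class to sanity-check the dataset
fig, ax = plt.subplots(1, 2, figsize=(10, 3))
ax[0].plot(X[y == 0][0])
ax[0].set_title('Sine wave')
ax[1].plot(X[y == 1][0])
ax[1].set_title('Square wave')
plt.show()
```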
Yep. They look very fine. 🙂
Now we build a 1D CNN with 32 filters and a kernel of 3 units with ReLU activation. We also do some max pooling, followed by 2 fully connected layers. We compile it and train the model on the training set. A lot of words, but it’s this simple:
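A sketch of that architecture in Keras (the train/test split and the hyperparameters are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from sklearn.model_selection import train_test_split

# Conv1D expects (n_samples, timesteps, channels), so add a channel axis
X_cnn = X[..., np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(
    X_cnn, y, test_size=0.2, random_state=42
)

model = Sequential([
    Conv1D(32, kernel_size=3, activation='relu', input_shape=(X.shape[1], 1)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')   # binary output: sine (0) vs square (1)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1, verbose=0)
```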
Beautiful. And it performs very well too, as we can see from this:
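For example, by checking the accuracy on the held-out test set:

```python
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Test accuracy: {accuracy:.3f}')
```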
A lot of times, 1D CNNs are a good choice, especially for classification. This is because they are simple Neural Networks (at the end of the day, it is "just" a convolution operation) that are nonetheless complex enough to perform complicated classification tasks. They are a step forward in complexity with respect to FFNNs, so you might need to be mindful of that, but, again, they are not monster-big Neural Networks.
3. Long Short Term Memory/Recurrent Neural Networks
3.1 Explanation
Do you remember Chapter 1, where we had to predict the next value of a sine wave, given the previous sequence of values?
The most common way to do that is not with FFNN but with Recurrent Neural Networks:

Now the input layer y_1, …, y_11 is processed through a hidden state sequentially, meaning that the input of the hidden state h_i is not only y_i but also the previous hidden state h_{i-1}. In a few words, the information of the previous units is preserved throughout the layer.
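To make this concrete, for a plain ("vanilla") RNN the hidden state update has roughly this shape, where W_h, W_y and b are learned weights (an LSTM cell adds gates on top of the same idea):

h_i = tanh(W_h · h_{i-1} + W_y · y_i + b)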
Now, the way this information is preserved depends on the specific RNN. In this specific example, we will use Long Short Term Memory (LSTM) cells. In particular, we will do a multistep forecasting task. It is basically the same thing as Chapter 1, but instead of predicting only the next step, we will predict the next k steps. We are generating a signal that has a quadratic dependency on time, plus a small, high-frequency, fixed-amplitude sine wave.
3.2 Code
When you want to use LSTMs you usually have a little bit of boring preprocessing to do, to split your input time series into chunks. I did it in one function and put the whole pipeline in one snippet:
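Something along these lines (the signal parameters, window sizes, and hyperparameters are illustrative choices, not the exact ones I used):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_signal(n_points=1000):
    """Quadratic trend plus a small, high-frequency, fixed-amplitude sine wave."""
    t = np.linspace(0, 10, n_points)
    return t, 0.5 * t**2 + 0.3 * np.sin(2 * np.pi * 5 * t)

def make_chunks(y, n_in=50, n_out=10):
    """Split the series into (past n_in points) -> (next n_out points) pairs."""
    X, Y = [], []
    for i in range(len(y) - n_in - n_out):
        X.append(y[i:i + n_in])
        Y.append(y[i + n_in:i + n_in + n_out])
    # LSTM layers expect (samples, timesteps, features)
    return np.array(X)[..., np.newaxis], np.array(Y)

t, y = make_signal()
X, Y = make_chunks(y)

model = Sequential([
    LSTM(64, input_shape=(X.shape[1], 1)),
    Dense(Y.shape[1])              # one output unit per forecasted step
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, Y, epochs=30, batch_size=32, validation_split=0.2, verbose=0)
```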
Let me zoom in on the predictions:

It’s pretty impressive that it captures both the quadratic dependency (even if it looks like there is a small bias) and the wiggly behavior of the sine wave.
When you use LSTMs you are getting into the serious business of complexity. They are a pretty big beast. They tend to be hard to train, and they should be used with caution. Nonetheless, they are well known for being the state of the art for forecasting in numerous case studies, so they do work very well if you have enough data and computational power. It’s almost indispensable to have a GPU during training if you don’t want to wait ages.
4. Transformers
4.1 Explanation
In 2017 a paper came out that, as of today (August 2024), has been cited 130,441 times. This paper is called "Attention Is All You Need" and it explains how sequence transduction models can be replaced with a Transformer, a model that replaces recurrence and convolutions with the attention mechanism.
The attention mechanism allows the model to focus (to put its attention, as the name suggests) on the relevant parts of the input sequence. This is because, when you translate from sequence A to sequence B, it is not necessarily true that the first unit of sequence A "corresponds" (or translates) to the first unit of sequence B. Think of translating from one language to another: just because the sentence in English starts with "I", it doesn’t mean that the translated sentence starts with the translation of "I".
The attention mechanism is a very fascinating one, and I strongly suggest reading the original paper, especially because it would take a long time to explain it from scratch here.
4.2 Code
Now. I’ll be honest with you. Training a transformer is hard. Like, HARD hard. Imagine that the T in GPT stands for Transformer. That’s how hard.
For this reason, we will do a very simple case: training a Neural Network that takes a sine wave as input and converts it into a cosine. This approach is called sequence to sequence. If you read "seq2seq" somewhere, you heard it from me first :).
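For the data, a sketch could look like this (the window length and sampling are illustrative choices):

```python
import numpy as np
import torch

# Windows of a sine wave as the source sequence,
# the corresponding windows of the cosine as the target sequence
t = np.linspace(0, 4 * np.pi, 400)
sine, cosine = np.sin(t), np.cos(t)

seq_len = 50
src, tgt = [], []
for i in range(len(t) - seq_len):
    src.append(sine[i:i + seq_len])
    tgt.append(cosine[i:i + seq_len])

# Shape for nn.Transformer with batch_first=True: (batch, seq_len, features)
src = torch.tensor(np.array(src), dtype=torch.float32).unsqueeze(-1)
tgt = torch.tensor(np.array(tgt), dtype=torch.float32).unsqueeze(-1)
print(src.shape, tgt.shape)   # torch.Size([350, 50, 1]) each
```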
We have used TensorFlow so far, but now we are betraying it and using PyTorch. This is because, if you want to train a Transformer using TensorFlow, you have to build your own class. The comfortable thing about TensorFlow is that you don’t have to build your own class: you can just call model.fit(), where the model is defined layer by layer (just like we did above). If we have to do the class business, in my opinion, PyTorch is superior because it is more intuitive.
PyTorch also has the nn.Transformer module, which is extremely cool, as it is very much in line with the paper.
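A minimal sketch of how we could wrap it for the sine-to-cosine task, using the src and tgt tensors prepared above (the SineToCosine class, the layer sizes, and the training schedule are illustrative, not the exact code I used):

```python
import torch
import torch.nn as nn

class SineToCosine(nn.Module):
    def __init__(self, d_model=32, nhead=4):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)      # 1 feature -> d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=64, batch_first=True,
        )
        self.output_proj = nn.Linear(d_model, 1)     # d_model -> 1 feature

    def forward(self, src, tgt, tgt_mask=None):
        out = self.transformer(
            self.input_proj(src), self.input_proj(tgt), tgt_mask=tgt_mask
        )
        return self.output_proj(out)

model = SineToCosine()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Teacher forcing: the decoder sees the target shifted by one step,
# with a causal mask so it cannot peek at future positions
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal_mask = model.transformer.generate_square_subsequent_mask(tgt_in.size(1))

for epoch in range(50):
    optimizer.zero_grad()
    pred = model(src, tgt_in, tgt_mask=causal_mask)
    loss = criterion(pred, tgt_out)
    loss.backward()
    optimizer.step()
```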
Now, this very simple task already takes a very long time to train, even with a GPU. Imagine multiple signals. It gets messy. What is the good thing? The good thing is that it’s very rare that you need to train your Transformer from scratch. The only case I can think of is when:
- You have a very large dataset
- For some reason, nobody has ever heard of this dataset and fine-tuning an already existing Transformer is not an option
As you can tell, very rare. Most of the time, you don’t need to use such complex technology. And almost every time you do need it, a simple fine-tuning of an existing Transformer is the way to go.
5. Conclusions
Thank you for spending your time with me. It means a lot. Let’s go through the article together and summarize it:
- We talked about Neural Networks and discussed how they have multiple domains (audio vs image vs text) and levels of complexity (FFNN vs CNN vs Transformer)
- We applied a Feed Forward Neural Network (FFNN) to a one-step forecasting (regression) task on a sine wave. We showed how an FFNN is a good option for very simple cases with limited computational power and dataset size.
- We applied a 1D Convolutional Neural Network (1DCNN) to a classification task: distinguishing sine waves from square waves. We showed very good accuracy and demonstrated how 1DCNNs can be used for classification tasks where a little more computational power is available.
- We talked about Long Short Term Memory (LSTM/RNN) Networks. We applied an LSTM to multiple-step forecasting. We talked about the complexity of this neural network and how it is important to be cautious, as it requires large computational power.
- We described Transformers using the "Attention Is All You Need" paper. We built a very simple seq2seq sine-to-cosine translation. We noticed how computationally expensive they are and stated that training one from scratch is very rare, as most of the time a simple fine-tuning is the way to go.
6. About me!
Thank you again for your time. It means a lot ❤
My name is Piero Paialunga and I’m this guy here:

Image made by author
I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI and Machine Learning in my blog posts and on LinkedIn. If you liked the article and want to know more about Machine Learning and follow my studies you can:
A. Follow me on Linkedin, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have.
C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.
D. Want to work with me? Check my rates and projects on Upwork!
If you want to ask me questions or start a collaboration, leave a message here or on Linkedin: