Introduction

Over the last decade, significant progress has been made in automated driving systems (ADS). Given the current momentum, ADS can be expected to continue to advance, as a variety of ADS products are expected to become commercially available within a decade1. It is envisioned that automated driving technology will lead to a paradigm shift in transportation systems in terms of user experience, mode choices and business models. An increasing number of industrial players are investing in self-driving car technologies and, more generally, in the automotive sector. ADS research and a growing number of industrial implementations have been catalyzed by the accumulated knowledge in vehicle dynamics, by breakthroughs in computer vision driven by the advent of deep learning2,3,4,5, and by the availability of new sensor modalities such as lidar6.

Deep Learning (DL) has been widely used for the implementation of ADSs. Starting from the work of Krizhevsky et al.2, different DL approaches have been proposed. Bojarski et al.3 proposed a deep neural network (PilotNet) for predicting steering angles from a set of images captured by the car. Building on Krizhevsky et al.2, Kocić et al.4 proposed the so-called J-net, a DL model suitable for end-to-end systems. All of the above-mentioned methods follow direct supervised training strategies; a review of these methods for ADS can be found in Yurtsever et al.7. In this context, a ground truth is required for training, normally consisting of the ego-action sequence of an expert human driver, while the network learns to imitate the driver. More precisely, the training is performed on a set of images taken from a front-facing camera (input), which is a raw signal (i.e. pixels), and, for example, the steering angles (output) used to control the car. The previously mentioned methods are thus trained on road images paired with the steering angles generated by a human tasked with driving a data-collection car. The prediction tasks are solved using Convolutional Neural Networks (CNNs)8. Generally, CNNs use layers that convolve the inputs with filters and compress them; as a result, CNNs can find features and transform them into simpler representations. This ability has made them useful in many different applications (e.g. arts and image classification, identification of movements)9. However, CNNs predict one frame at a time and generate future images recursively, which makes them prone to focus on spatial appearance and relatively weak at capturing long-term motion10. A possible solution for capturing temporal relations is the use of Long Short Term Memory (LSTM)8 networks. The core idea behind the LSTM architecture is a memory cell which can maintain its state over time, together with non-linear gating units which regulate the information flow into and out of the cell11.

To overcome the limits of CNNs and exploit the advantages of LSTMs, Shi et al.12 proposed the convolutional LSTM (ConvLSTM). The ConvLSTM network models spatial and temporal structures simultaneously by explicitly encoding the spatial information into tensors, thus overcoming the limitation of the vector-variate representations in standard LSTM, where spatial information is lost. Similar approaches have also been used in the context of ADS13,14. In both works, the authors used 4 consecutive ConvLSTM layers as the first part of the deep network; Yu et al.13 then completed the network with a fully connected layer, while Bai et al.14 used a 3D convolutional (3DConv) layer15 followed by 2 fully connected layers.

DL has been successfully applied in different fields, gaining remarkable results. Nonetheless, a crucial issue remains when dealing with DL: the definition of the neural network architecture16. In fact, currently employed architectures and related hyperparameter settings have mostly been developed by human experts, a manual process which is both time-consuming and error-prone. For instance, in LSTM networks many parameters and variants can be used to solve the same problem, and performance is strongly influenced by the final version of the LSTM network that is used. Greff et al.11 called it a “...search space odyssey...”.

In the Machine Learning (ML) and DL community, Bayesian optimization (BO)17,18—sometimes also named Sequential Model Based Optimization (SMBO)—has recently become the standard strategy for Automated Machine Learning (AutoML)19 and Neural Architecture Search (NAS)16. BO is a general sample-efficient strategy for the global optimization of black-box, expensive and multi-extremal functions, traditionally constrained to a box-bounded search space. Furthermore, BO has also been extended to the case of unknown constraints and/or partially defined objective functions20,21,22,23,24, as well as to the case of weakly specified search spaces25,26. A recent review of automated ML techniques for DL approaches can be found in He et al.27; a recent review specifically devoted to NAS is provided in28. Evolutionary approaches have also been proposed recently, often in combination with BO29,30,31.

This work proposes the application of Bayesian optimization (BO) combined with a Spatiotemporal-Long Short Term Memory (ST-LSTM) network for the prediction of steering angles in automated driving systems, based on camera images (i.e. raw pixels) taken from a front-facing camera (the resulting model is called BO_ST-LSTM from now on). Unlike previous works13,14, we have added a MaxPooling layer to reduce the complexity of the final network, and we have applied BO to optimize the hyperparameters and obtain a model better suited to the experiment under consideration. More precisely, the main contributions of this study can be summarized as follows.

  1. We propose a sample-efficient approach, based on BO, for optimally tuning an ST-LSTM based network to enhance steering wheel angle prediction; we named the resulting final model BO_ST-LSTM. BO_ST-LSTM has demonstrated higher accuracy than its competitors and better generalization performance between validation and test sets.

  2. We propose a general hyperparameter optimization framework, based on BO, for DL methods in the context of automated driving systems (ADS). This hybrid approach can improve the predictive power of DL architectures by finding the most suitable configuration for the problem under study.

Our results can be adapted and used in the development of ADSs to help obtain a more precise prediction of the steering wheel angle, leading to safer cars for drivers and other traffic participants.

The remainder of the paper is organized as follows. “Related work” section discusses related works in the field of automated driving systems (ADS). “Proposed method” section presents our method for the prediction of steering angles by coupling BO and ST-LSTM, while “Experiments” section describes the experiments and the performance of our method. Finally, “Conclusion and future works” section presents the discussion and conclusions of the paper.

Related work

Object recognition is an important task in different fields and it is often solved by using machine learning methods and large datasets. An example is the so-called AlexNet2. In this work, the authors trained a large convolutional neural network on the subsets of the ImageNet dataset32 used in the ILSVRC-2010 and ILSVRC-2012 competitions33, obtaining the best results reported on these datasets at the time (2012).

In the context of ADS, Deep Learning (DL) has been widely used34. Bojarski et al.3 proposed a neural-network-based system known as PilotNet, which outputs steering angles given images of the road ahead. The PilotNet training data contain single images sampled from videos recorded by a front-facing camera in the car, each paired with the corresponding steering command. The PilotNet architecture consists of a set of normalization layers, convolutional layers and fully connected layers. The proposed network was tested mainly to assess its ability to recognize objects that can affect steering. In the same year, Kim and Canny35 proposed a visual attention model to train a convolutional network end-to-end, from images to steering angle. The attention model highlights image regions that potentially influence the network’s output; subsequently, a causal filtering step is applied to determine which input regions actually influence the output.

More recently, Kocić et al.4 developed an end-to-end deep neural network (called J-net) suitable for deployment on an embedded automotive platform, obtained by modifying the AlexNet2 solution, a convolutional neural network for image classification. Differently from AlexNet and PilotNet, J-Net is based on a set of convolutional layers and a set of max-pooling layers used to reduce the number of parameters. Compared with AlexNet and PilotNet, J-Net achieved comparable results in terms of predictive power, but with lower overall complexity.

Motion planning is a fundamental technology for autonomous driving vehicles and, over the past years, novel deep learning approaches have been demonstrated to be powerful techniques. Two main contributions can be found in Yu et al.13. Firstly, the authors introduced the Baidu Driving Dataset (DBB), a new driving dataset which contains 10,000 km of frontal camera images and vehicle motion attitude data collected under real road conditions. Secondly, the authors proposed an end-to-end reactive control model for lateral and longitudinal control, the former referring to steering angle prediction and the latter to the optimization of accelerating and braking commands. Similarly to other works3,4, a set of pre-processing layers, convolutional layers and fully connected layers was used for steering angle prediction. The accelerating and braking problem was solved with a Convolutional Long Short Term Memory (Conv-LSTM) architecture12 using 5 frames, taken from the front camera of the car, as input. The network architecture was composed of 4 Conv-LSTM layers and a fully connected layer.

The Conv-LSTM network combines the image feature extraction capabilities of convolutional networks with the memory ability of LSTM networks. Starting from the work of Yu et al.13, Bai et al.14 proposed the so-called Spatiotemporal-Long Short Term Memory (ST-LSTM) network for motion planning value prediction in the context of autonomous vehicles. The ST-LSTM network is composed of 4 ConvLSTM layers followed by a 3D convolutional (3DConv) layer15 and 2 fully connected layers. As in Yu et al.13, the authors do not consider a single image as input but 3 consecutive frames, to ensure the real-time performance of the motion result.

Recently, AutoML and NAS have become standard solutions for optimizing the hyperparameters and the architecture of neural networks, respectively, with examples in different application domains. For instance, in Nguyen et al.36 an accurate and reliable multi-step-ahead prediction model based on LSTM, whose hyperparameters were optimized through BO, was validated on steam generator data acquired from different French nuclear power plants for prognostic and health management of the plants. In Osmani et al.37, the expensive and time-consuming design of a deep neural network for human activity recognition was addressed via BO in order to optimally and efficiently tune the architecture’s hyperparameters. With respect to ADS, a recent and interesting application of BO is the generation of simulation scenarios to improve the accuracy and “safety” of the ADS38,39,40. In Kong et al.41, the authors proposed a deep Q-learning (DQL)-based energy management strategy (EMS) for an electric vehicle, with BO used to optimize the hyperparameter configuration of the DQL-based EMS. Another interesting work is Alizadeh et al.42, in which a novel attention-based LSTM cell, optimized by Bayesian optimization, was proposed for streamflow postprocessing and outperformed a simple LSTM, a GRU, a machine learning algorithm and two statistical models.

Proposed method

Firstly, the general architecture of the ADS for steering angle prediction is presented, followed by a short description of its two main components: BO and the ST-LSTM.

Figure 1 shows the general design of the BO_ST-LSTM system for steering angle prediction. The first part, the data layer, is composed of two procedures, namely data preprocessing and data separation. In the first procedure, several preprocessing operations are applied: the upper and lower sections of the images are cropped to eliminate unnecessary information; then, the resolution of each frame is reduced for computational reasons, from 455 × 256 pixels to 200 × 66 pixels, the same input size used by PilotNet3. Finally, three consecutive images are stacked and used as the input tensor for the deep neural network. The steering angle associated with each tensor is the one related to the last frame.
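As an illustration, a minimal preprocessing sketch in Python (using OpenCV and NumPy) is reported below. The exact crop boundaries are not specified in the paper, so crop_top and crop_bottom are illustrative assumptions; only the target size (200 × 66) and the 3-frame stacking follow the description above.

```python
import cv2
import numpy as np

def preprocess_frame(frame, crop_top=100, crop_bottom=26):
    """Crop the upper/lower image sections and downscale to 200 x 66 pixels."""
    # crop_top / crop_bottom are illustrative: the paper does not report
    # the exact crop boundaries, only that top and bottom are removed.
    cropped = frame[crop_top:frame.shape[0] - crop_bottom, :, :]
    return cv2.resize(cropped, (200, 66))  # cv2.resize expects (width, height)

def stack_frames(frames, angles):
    """Stack 3 consecutive frames; the label is the angle of the last frame."""
    X, y = [], []
    for t in range(2, len(frames)):
        X.append(np.stack([preprocess_frame(f) for f in frames[t - 2:t + 1]]))
        y.append(angles[t])
    return np.asarray(X), np.asarray(y)  # X has shape (n, 3, 66, 200, 3)
```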

The second procedure, called data separation, splits the dataset following the usual machine learning practice: the dataset is divided into training, validation and test sets in order to learn, optimize and test the final intelligent system.

The BO_ST-LSTM layer is responsible for the optimization and learning phases of the system. The training and validation sets are used to optimize the ST-LSTM network by means of Bayesian optimization17; this procedure helps develop a more robust architecture for this specific task. During the optimization phase, early stopping with patience equal to 5 is used to avoid overfitting and reduce the overall computation time. Given the optimized setting, the ST-LSTM is trained and its performance is measured on the test set.

Figure 1

Overview of the BO_ST-LSTM system. Images in the “Input: camera images” box come from the SullyChen Dataset43.

Bayesian optimization

Bayesian optimization is a sample-efficient strategy for the global optimization of black-box, expensive and multi-extremal functions, traditionally constrained over a box-bounded search space \(\Omega\):

$$\begin{aligned} \min \limits _{\theta \in \Omega } g(\theta ) \end{aligned}$$
(1)

To solve problem (1), BO uses two key components: a probabilistic surrogate model of the objective function \(g(\theta )\) and an acquisition function (also called infill criterion or utility function) based on the current approximation of \(g(\theta )\). The optimization of the acquisition function selects the next promising \(\theta '\) at which to evaluate the objective function. The observed value \(g(\theta ')\) (or \(g(\theta ')+\varepsilon\) in case the objective function is also noisy) is then used to update the probabilistic model approximating \(g(\theta )\), and the process is iterated until a given termination criterion is reached (e.g., a maximum number of function evaluations).

A Gaussian Process (GP)44 is the most common choice for the probabilistic surrogate model. An alternative is offered by Random Forest (RF)45, an ensemble learning method which, contrary to GP, is able—by construction—to deal with complex search spaces \(\Omega\) spanned by mixed, categorical and conditional components of \(\theta\). Conditional means that the value of a component \(\theta _{[i]}\) of the solution vector depends on the value of at least one other component \(\theta _{[j]}\), with \(i \ne j\).

Regardless of its specific implementation, the aim of the probabilistic surrogate model is to provide an estimate (aka prediction) of \(g(\theta ), \forall \theta \in \Omega\), along with a measure of uncertainty about such an estimate. These two elements are usually the mean and standard deviation of the prediction provided by the probabilistic surrogate model, denoted by \(\mu (\theta )\) and \(\sigma (\theta )\) respectively.

The acquisition function drives the selection of the next \(\theta '\) to be evaluated on the objective function, balancing between exploitation—that is choosing \(\theta '\) whose associated prediction is not worse than the best function value observed so far—and exploration—that is choosing \(\theta '\) whose prediction is largely uncertain. While exploitation is associated with local search, exploration is associated with global search: the former is driven mainly by \(\mu (\theta )\), the latter by \(\sigma (\theta )\). Several acquisition functions have been proposed—an overview is provided in Frazier17 and Archetti et al.18—each offering a different mechanism to balance the exploitation-exploration trade-off. The most widely used acquisition functions are the lower confidence bound (LCB), expected improvement (EI) and maximum probability of improvement (MPI).

Lower confidence bound (LCB) is an acquisition function that manages exploration-exploitation by being optimistic in the face of uncertainty:

$$\begin{aligned} LCB(\theta ) = \mu (\theta ) - \xi \sigma (\theta ), \end{aligned}$$
(2)

where \(\mu (\theta )\) and \(\sigma (\theta )\) are the mean and standard deviation of the probabilistic surrogate model’s prediction, and \(\xi \ge 0\) is the parameter managing the trade-off between exploration and exploitation. More precisely, \(\xi = 0\) yields pure exploitation, while higher values of \(\xi\) emphasize exploration. In Srinivas et al.46 a schedule for \(\xi\) is proposed, with a convergence proof.

Expected improvement (EI) measures the expectation of the improvement on \(g(\theta )\) with respect to the predictive distribution of the probabilistic surrogate model:

$$\begin{aligned} EI(\theta ) = {\left\{ \begin{array}{ll} (g(\theta ^+)-\mu (\theta ) - \xi ) \Phi (Z)+\sigma (\theta )\phi (Z) \quad \text {if} \quad \sigma (\theta ) > 0 \\ 0 \quad \text {if} \quad \sigma (\theta ) = 0 \end{array}\right. } \end{aligned}$$
(3)

\(g(\theta ^+)\) is the best value of the objective function observed so far, \(\xi\) balances exploration and exploitation, and \(\phi (Z)\) and \(\Phi (Z)\) are the probability density function and the cumulative distribution function of the standard normal distribution, respectively, with Z defined as follows:

$$\begin{aligned} Z = {\left\{ \begin{array}{ll} \frac{g(\theta ^+) - \mu (\theta )}{\sigma (\theta )} \quad \text {if} \quad \sigma (\theta ) > 0 \\ 0 \quad \text {if} \quad \sigma (\theta ) = 0 \end{array}\right. } \end{aligned}$$
(4)

Maximum probability of improvement (MPI) was the first acquisition function proposed in the literature. As it is basically biased towards exploitation, it has been modified by including a parameter \(\xi\) that allows for a better exploration-exploitation trade-off:

$$\begin{aligned} MPI(\theta ) = P(g(\theta ) \le g(\theta ^+) + \xi ) = \Phi \bigg (\frac{g(\theta ^+) - \mu (\theta ) + \xi }{\sigma (\theta )}\bigg ) \end{aligned}$$
(5)

Finally, selecting \(\theta '\) requires solving an auxiliary optimization problem on the same search space \(\Omega\), which is computationally cheaper than (1): minimizing \(LCB(\theta )\), or maximizing \(EI(\theta )\) or \(MPI(\theta )\).
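The three acquisition functions can be written compactly from \(\mu (\theta )\) and \(\sigma (\theta )\); the Python sketch below follows Eqs. (2)–(5) in the minimization setting, with illustrative default values for \(\xi\):

```python
import numpy as np
from scipy.stats import norm

def lcb(mu, sigma, xi=1.0):
    # Eq. (2): to be minimized; larger xi emphasizes exploration
    return mu - xi * sigma

def ei(mu, sigma, g_best, xi=0.01):
    # Eqs. (3)-(4): expected improvement below the best value observed so far
    sigma = np.maximum(sigma, 1e-12)  # guard for the sigma = 0 case
    z = (g_best - mu) / sigma
    return (g_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def mpi(mu, sigma, g_best, xi=0.01):
    # Eq. (5): probability of observing a value below g_best + xi
    sigma = np.maximum(sigma, 1e-12)
    return norm.cdf((g_best - mu + xi) / sigma)
```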

Denote with \(D_{1:n}\) a set of initial solutions (i.e., the initial design), for example sampled by using the Latin Hypercube Sampling (LHS) technique. Each element \(D_i\) is \((\theta _i,g(\theta _i))\), with \(i=1,\ldots ,n\) (i.e., we consider the noise-free setting, without loss of generality). Furthermore, let N be the maximum number of function evaluations. The general BO algorithm is then summarized as follows:

figure a (pseudo-code of the general BO algorithm)
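A minimal sketch of this loop is given below, using scikit-learn’s GaussianProcessRegressor as the surrogate and the ei function from the previous sketch. Two simplifications are assumptions: the initial design is plain uniform sampling instead of LHS, and the acquisition function is optimized over a random candidate pool rather than with a dedicated solver.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayesian_optimization(g, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T                      # box-bounded search space
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))  # initial design D_{1:n}
    y = np.array([g(x) for x in X])
    gp = GaussianProcessRegressor(normalize_y=True)  # probabilistic surrogate
    for _ in range(n_iter):                          # remaining evaluations up to N
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, size=(2048, len(bounds)))  # candidate pool
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(ei(mu, sigma, y.min()))]  # maximize acquisition
        X, y = np.vstack([X, x_next]), np.append(y, g(x_next))
    return X[np.argmin(y)], y.min()                  # best seen configuration
```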

Spatiotemporal-long short term memory network

The ST-LSTM network14 is based on Convolutional LSTM (ConvLSTM) layers, 3D convolutional (3DConv) layers and fully connected (FC) layers, with the aim of extracting the spatiotemporal features of time-serial image segments. Bai et al.14 proposed an architecture with 4 ConvLSTM layers, each combined with a batch normalization step so as to prevent the low training efficiency that can be caused by data distribution offset in deep networks.

ConvLSTM uses multi-frame image segments as input. In this way, spatial features can be extracted as in a convolutional layer while the temporal relationships are captured at the same time.

The four ConvLSTM layers are thus used to extract the temporal hidden feature information and help the model learn more effectively from the data. Following the definition of Shi et al.12, all the inputs \(\mathcal {X}_1, \ldots , \mathcal {X}_t\), cell outputs \(\mathcal {C}_1, \ldots , \mathcal {C}_t\), hidden states \(\mathcal {H}_1, \ldots , \mathcal {H}_t\), and input, forget and output gates (\(i_t, f_t, o_t\)) are 3D tensors whose last two dimensions are spatial. The key equations of ConvLSTM are shown in Eq. (6), where \(*\) denotes the convolution operator and \(\circ\) denotes the Hadamard product:

$$\begin{aligned} \begin{aligned} i_t&= \sigma (W_{xi} *\mathcal {X}_t + W_{hi} *\mathcal {H}_{t-1} + W_{ci} \circ \mathcal {C}_{t-1} + b_i) \\ f_t&= \sigma (W_{xf} *\mathcal {X}_t + W_{hf} *\mathcal {H}_{t-1} + W_{cf} \circ \mathcal {C}_{t-1} + b_f) \\ \mathcal {C}_t&= f_t \circ \mathcal {C}_{t-1} + i_t \circ \tanh (W_{xc} *\mathcal {X}_t + W_{hc} *\mathcal {H}_{t-1} + b_c) \\ o_t&= \sigma (W_{xo} *\mathcal {X}_t + W_{ho} *\mathcal {H}_{t-1} + W_{co} \circ \mathcal {C}_{t-1} + b_o) \\ \mathcal {H}_t&= o_t \circ \tanh (\mathcal {C}_t) \end{aligned} \end{aligned}$$
(6)

where \(\sigma\) is the activation function (i.e. the sigmoid function) and \(b_i, b_f, b_c\) and \(b_o\) are the biases related to \(i_t, f_t, \mathcal {C}_t\) and \(o_t\), respectively.

If we view the states as the hidden representations of moving objects, a ConvLSTM with a larger transitional kernel should be able to capture faster motions while one with a smaller kernel can capture slower motions12.

3DConv is used to better capture the temporal and spatial characteristics of videos. In fact, the traditional 2D convolutional layer operates on time-serial images by processing each frame of the video independently, and therefore does not take into account the inter-frame motion information along the time dimension.

The convolution operation in 3DConv has a time dimension of three, meaning that the input is composed of three consecutive frames, consistently with the data pre-processing adopted in Bai et al.14. In this structure, each feature map in the convolutional layer is connected to multiple adjacent successive frames in the previous layer, thus capturing motion information.

Experiments

The proposed methodology is evaluated and compared with classical end-to-end driving models on the public SullyChen Dataset43. The data were recorded around Rancho Palos Verdes and San Pedro, California, using a 2014 Honda Civic; the temporal resolution of the images is 0.05 s. A first version of the dataset was used by Qian et al.47 in their work. Figure 2 shows some examples. The dataset contains both straight and mixed roads; furthermore, pictures are taken at junctions and on primary and secondary roads. For more examples, see also Qian et al.47.

In this work, we used the updated version, which includes 63,000 images of the road ahead together with the associated steering angles; for computational reasons, we used around 62% (39,000) of the total number of images. The considered dataset was then split into 80% (31,200 images) and 20% (7,800 images); the first subset was further divided following the same proportions (80–20%) in order to obtain the training and validation sets. Images are shuffled during the training phase. The original images have a size of \(455 \times 256\) pixels; in this study, however, the size has been reduced to \(200 \times 66\) pixels.

To ensure reproducibility of experiments, we shared both data and our code as a public GitLab project, at the following link: https://gitlab.com/ub-dems-public/cs-labs/user-ariboni/csp-drive-dl.

Figure 2

Images from the SullyChen Dataset43: (a) straight road with trees and shadow, (b) road with a left curve and an obstacle (car), (c) road with a right curve, (d) road with a gradual curve to the left and a traffic island.

Methodology

Three different architectures were implemented, trained for 15 epochs with a batch size of 50, and validated with different settings according to their definitions: PilotNet3, J-net4 and a modified version of the ST-LSTM proposed by Bai et al.14. Table 1 summarizes the settings for all networks.

Table 1 Settings for all networks.

PilotNet is a deep learning network mainly based on convolutional layers. More precisely, the network consists of 9 layers: a normalization layer, 5 convolutional layers and 3 fully connected layers. The convolutional layers were designed to perform feature extraction and were chosen empirically through a series of experiments varying the layer configurations (see Bojarski et al.3 for more details). Strided convolutions with a 2 \(\times\) 2 stride and a 5 \(\times\) 5 kernel were used in the first three convolutional layers, and non-strided convolutions with a 3 \(\times\) 3 kernel in the last two. The five convolutional layers are followed by three fully connected layers leading to a single output control value (the steering angle).
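A sketch of this architecture in Keras is shown below. The filter counts (24, 36, 48, 64, 64) and fully connected sizes (100, 50, 10) follow Bojarski et al.3; the Rescaling layer stands in for their normalization layer, and the ReLU activations are an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_pilotnet(input_shape=(66, 200, 3)):
    return keras.Sequential([
        layers.Rescaling(1.0 / 127.5, offset=-1.0,           # normalization layer
                         input_shape=input_shape),
        layers.Conv2D(24, 5, strides=2, activation="relu"),  # strided 5x5 convs
        layers.Conv2D(36, 5, strides=2, activation="relu"),
        layers.Conv2D(48, 5, strides=2, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),             # non-strided 3x3 convs
        layers.Conv2D(64, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(100, activation="relu"),
        layers.Dense(50, activation="relu"),
        layers.Dense(10, activation="relu"),
        layers.Dense(1),                                     # steering angle
    ])
```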

Like the previous network, J-Net is basically a convolutional neural network. In this case, the network consists of 5 layers: a normalization layer, 3 convolutional layers and one fully connected layer. In order to reduce the size of the deep neural network layers, the authors applied three max-pooling operators, one after each convolutional layer; the size of all max-pooling operators is 2 \(\times\) 2. Finally, the last layer of J-net is a fully connected layer composed of ten nodes, followed by a simple output layer for the steering angle prediction.

Differently from PilotNet and J-Net, the ST-LSTM network is composed of a series of ConvLSTM layers and a 3DConv layer. The deep network proposed in Bai et al.14 consists of 6 layers: 4 ConvLSTM layers, one 3DConv layer and one fully connected layer followed by the final output layer; each ConvLSTM layer is followed by a batch normalization step. In order to reduce dimensionality, in this work we introduced a max-pooling layer of size 2 \(\times\) 2 \(\times\) 2 operating on the 3D input. Furthermore, we added dropout regularization before the output prediction to avoid overfitting. This latter architecture is then improved by means of Bayesian optimization17 to propose a more robust solution. This procedure is performed with three different acquisition functions and, in order to limit the impact of the random component, we executed 10 runs for each function. Details of this process are provided in the next sections.
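The Keras sketch below illustrates the modified ST-LSTM described above (4 ConvLSTM layers with batch normalization, a 3DConv layer, our added 2 × 2 × 2 max-pooling, a fully connected layer and our added dropout). The filter counts, kernel sizes, dense units and dropout rate are placeholders: the actual values are among the hyperparameters tuned by BO (see Table 2).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_st_lstm(input_shape=(3, 66, 200, 3), filters=(16, 16, 32, 32),
                  dense_units=128, dropout_rate=0.3):
    # filters, dense_units and dropout_rate are placeholders for the
    # hyperparameters optimized by BO (see Table 2).
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))  # 3 stacked frames of 66 x 200 x 3
    for f in filters:  # 4 ConvLSTM layers, each followed by batch normalization
        model.add(layers.ConvLSTM2D(f, kernel_size=3, padding="same",
                                    return_sequences=True))
        model.add(layers.BatchNormalization())
    model.add(layers.Conv3D(32, kernel_size=3, padding="same",
                            activation="relu"))          # 3DConv layer
    model.add(layers.MaxPooling3D(pool_size=(2, 2, 2)))  # our added max-pooling
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dropout(dropout_rate))              # our added dropout
    model.add(layers.Dense(1))                           # steering angle
    return model
```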

The developed architectures are trained using stochastic gradient descent, which updates the weights of the model to reduce the error between actual and estimated values. Computing this error requires a suitable loss function; since we are dealing with a regression problem, we used the mean squared error (MSE). As reported in Eq. (7), this measure is the average of the squared differences between the actual values \(y_i\) and the estimated values \(\hat{y}_i\).

$$\begin{aligned} MSE = \displaystyle \frac{1}{n}\sum _{i=1}^{n}(y_i - \hat{y}_i)^2 \end{aligned}$$
(7)
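Under the settings above (15 epochs, batch size 50, shuffled images), training reduces to a standard compile/fit, sketched here assuming the build_st_lstm function from the previous sketch and preprocessed arrays X_train, y_train, X_val, y_val; the learning rate is an assumption:

```python
model = build_st_lstm()
model.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-3),  # lr assumed
              loss="mse")  # Eq. (7)
model.fit(X_train, y_train, epochs=15, batch_size=50, shuffle=True,
          validation_data=(X_val, y_val))
```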

BO_ST-LSTM: optimizing the hyperparameters of an ST-LSTM

As described in “Proposed method” section, Bayesian optimization is a sample-efficient strategy for the global optimization of black-box functions: BO uses a probabilistic model to find the minimum of an objective function, obtaining better performance in the testing phase while reducing the optimization time. For this specific task, we used the mean squared error on the validation set as the objective function to be minimized. The procedure is executed over eight different hyperparameters of the ST-LSTM architecture; Table 2 shows the parameters considered, with their respective domains.

Table 2 Parameters of the ST-LSTM architecture for Bayesian optimization.

The BO process is performed through the Python suite named GPyOpt, using a Gaussian Process (GP) as the probabilistic surrogate model and three different acquisition functions: lower confidence bound (LCB), expected improvement (EI) and maximum probability of improvement (MPI). As far as the acquisition functions’ hyperparameter \(\xi\) is concerned, we adopted the default values suggested by the GPyOpt library.

We compared the obtained results in order to identify the most suitable acquisition function for the optimization process. A set of five initial random configurations is used to initialize the surrogate model at each execution; the optimization process starts from these configurations and lasts 20 iterations, with the acquisition function selecting the new configuration to be evaluated at each iteration by managing the trade-off between exploration and exploitation. Therefore, in order to limit the impact of randomness in the choice of the initial solutions, we performed each optimization process, with the relative acquisition function, ten times with different seeds.
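A sketch of such a run with GPyOpt is shown below. The domain entries and the validation_mse helper (which would wrap the training with early stopping, patience = 5, and the evaluation on the validation set) are illustrative: the actual eight hyperparameters and their domains are listed in Table 2.

```python
import GPyOpt

# Illustrative domain: Table 2 lists the actual eight hyperparameters.
domain = [
    {"name": "dense_units",  "type": "discrete",   "domain": (64, 128, 256)},
    {"name": "dropout_rate", "type": "continuous", "domain": (0.0, 0.5)},
]

def objective(x):
    # GPyOpt passes a 2D array; each row is one candidate configuration.
    return validation_mse(x[0])  # hypothetical train-and-evaluate helper

optimizer = GPyOpt.methods.BayesianOptimization(
    f=objective, domain=domain, model_type="GP",
    acquisition_type="LCB",       # repeated with "EI" and "MPI"
    initial_design_numdata=5)     # 5 initial random configurations
optimizer.run_optimization(max_iter=20)   # 20 BO iterations
print(optimizer.x_opt, optimizer.fx_opt)  # best seen configuration and value
```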

Figure 3 shows the evolution of the so-called best seen, that is the minimum validation error value observed during the BO iterations. Solid lines represent the average, while shaded areas represent the standard deviation over the ten experiments. At the first iteration, the best seen is the minimum validation error observed on the initial random configurations; the best seen over the 20 subsequently evaluated configurations is then reported.

Figure 3

Comparison among GP-based BO processes using three different acquisition functions.

All three acquisition functions reduce the error, on average, by about two percentage points during the BO process, and all of them lead to identifying configurations with similar error values. Since the average results are similar, the standard deviation of the best seen values at the end of BO was considered to identify the best-performing acquisition function. As shown in Fig. 4, the best seen value determined by LCB has a lower standard deviation compared to the other two acquisition functions. For this reason, the ST-LSTM was re-trained using the best configuration obtained through BO with the LCB acquisition function.

Figure 4

Comparison between the distributions of the best seen obtained with the different acquisition functions.

Figure 5 displays a graphical representation of the final optimized ST-LSTM architecture.

Figure 5

Structure of the BO_ST-LSTM architecture. Images used at the beginning of the structure come from the SullyChen Dataset43.

Results

This section summarizes the most relevant results of the study. The predictive performance indicators selected to compare the developed architectures are the MSE (Eq. 7), the mean absolute error (MAE) and the standard deviation of the absolute error (St. AE):

$$\begin{aligned}{}&AE = \Bigl | \hat{y_i} - y_i\Bigr |, \; \; \; \; \; \; i = 1, \ldots , n \end{aligned}$$
(8)
$$\begin{aligned}{}&MAE = \frac{1}{n} \sum _{i = 1}^{n} AE_i, \end{aligned}$$
(9)
$$\begin{aligned}{}&St.\;AE = \sqrt{\frac{1}{n - 1} \sum _{i = 1}^{n} \Bigl ( AE_i - MAE\Bigr )^2} \end{aligned}$$
(10)

where \(\hat{y_i}\) is the \(i\)th predicted value of the response system, \(y_i\) is the \(i\)th actual steering wheel angle and n is the number of observations in the evaluated set (validation or test set). Table 3 summarizes the overall results on the training and validation sets. BO_ST-LSTM is the best approach on nearly all the indicators on the validation set. All approaches show a possible overfitting problem: all of them obtain very good results on the training set which are not confirmed on the validation set. For this reason, the bias-variance trade-off is investigated in what follows.
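For completeness, the indicators in Eqs. (7)–(10) can be computed with a few lines of NumPy; a minimal sketch:

```python
import numpy as np

def prediction_indicators(y_true, y_pred):
    ae = np.abs(y_pred - y_true)                       # AE, Eq. (8)
    return {"MSE": np.mean((y_true - y_pred) ** 2),    # Eq. (7)
            "MAE": ae.mean(),                          # Eq. (9)
            "St.AE": ae.std(ddof=1)}                   # Eq. (10)
```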

Table 3 Average values of the prediction performance indicators on the training and validation sets for the four architectures.

Table 4 reports the MSE on training and validation separately, for each of the four Deep Neural Network (DNN) models considered. We then decomposed the MSE into bias and variance in order to investigate possible differences in the balance offered by the four models. More precisely, bias and variance are estimated as follows, under the noise-free assumption (i.e., if \(x_i=x_j\) then \(y_i=y_j\)):

$$\begin{aligned}{}&Bias^2 = \Bigg (\frac{1}{N} \sum _{i=1}^N \hat{y}_i - \frac{1}{N} \sum _{i=1}^N y_i\Bigg )^2 \end{aligned}$$
(11)
$$\begin{aligned}{}&Variance = \frac{1}{N} \sum _{i=1}^N \Big ((y_i-\hat{y}_i) - \overline{err}\Big )^2 \end{aligned}$$
(12)

with \(\overline{err}=\frac{1}{N}\sum _{i=1}^N (y_i- \hat{y}_i)\). Finally, \(MSE=Bias^2+Variance\).
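A minimal NumPy sketch of this decomposition is reported below; with the estimators of Eqs. (11)–(12), the identity \(MSE=Bias^2+Variance\) holds exactly:

```python
import numpy as np

def bias_variance(y_true, y_pred):
    err = y_true - y_pred                         # prediction errors
    bias2 = err.mean() ** 2                       # Eq. (11)
    variance = ((err - err.mean()) ** 2).mean()   # Eq. (12)
    return bias2, variance                        # bias2 + variance == MSE
```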

A DNN model should simultaneously achieve low variance and low bias in order to minimize the MSE. Generally speaking, if we denote by f the unknown real function and by \(\hat{f}\) the function estimated on a training sample, then variance refers to the amount by which \(\hat{f}\) would change if estimated on a different training set, while bias is related to the error introduced by approximating f with a simpler model. The considered DNN models share a common pattern in terms of bias and variance: all of them have low bias and high variance. This is evidence that we are considering a set of models that tend to overfit the training set; as a consequence, small changes in the training set can result in significant changes in \(\hat{f}\).

Table 4 Bias-variance tradeoff decomposition.

Focusing on the MSE indicator, BO_ST-LSTM is the most promising model, as it provides the lowest MSE value on validation. This was expected, because the goal of BO was to minimize this indicator. Moreover, BO_ST-LSTM resulted in the highest MSE on the training set (one order of magnitude greater than the other three DNN models), making it less prone to overfitting. Finally, as already pointed out, the MSE is basically made up of the variance component for all the DNN models, both in training and validation; however, BO_ST-LSTM has the lowest variance component on the validation set. In addition, both bias and variance are reduced by the BO strategy with respect to the other approaches.

Assuming that we could select only one DNN model to be deployed in a real-life application, our choice would be BO_ST-LSTM. This conclusion is also motivated by a pairwise Mann-Whitney U test performed on the prediction errors of the four models on the validation set. More specifically, the prediction error of BO_ST-LSTM is significantly lower than that of the other three models (\(p\text {-value}<0.001\)); ST-LSTM and PilotNet are statistically indistinguishable in terms of prediction error on the validation set (\(p\text {-value}=0.866\)), as are PilotNet and J-Net (\(p\text {-value}=0.054\)). Finally, ST-LSTM and J-Net resulted significantly different in terms of prediction error on the validation set (\(p\text {-value}=0.004\)).

All the approaches were then re-trained on the dataset consisting of both the training and validation data, with 15 learning epochs. Results are summarized in Table 5: the inclusion of the validation data leads to models with small MSE values on the test set, lower than the MSE on validation (i.e., a reduction of around 50%). The results on the test set confirm that BO_ST-LSTM is the best-performing model, again providing the smallest MSE. Thus, selecting the single DNN model with the lowest MSE on the validation set, namely BO_ST-LSTM, would also be a sound choice on the test set. It is important to remark that, although BO_ST-LSTM, ST-LSTM and PilotNet resulted in very close MSE values on the test set, ST-LSTM and PilotNet would be our third and fourth choices respectively.

Table 5 Average values of the prediction performance indicators on the training and test sets.

Conclusion and future works

This paper addresses the prediction of the steering wheel angle in a self-driving system via Deep Learning. We started from an ST-LSTM, which can be considered the most suitable model for the specific application targeted, as resulted from the empirical comparison against PilotNet and J-Net. Moreover, the hyperparameter tuning of the ST-LSTM, performed via BO, allowed to further decrease the prediction error of this specific DNN architecture, both on the validation and the test set. More specifically, GP-based BO with the LCB acquisition function proved to be the best hyperparameter optimization strategy.

Nevertheless, some limitations ought to be considered. Although BO is a sample-efficient global optimization strategy, running a single hyperparameter optimization process required—in our experimental setting—training and evaluating 25 different ST-LSTM models (5 sampled via LHS plus 20 via BO). While this “cost” is compensated by an MSE on validation which is significantly lower than that of the other DNN models, the difference between the MSE of BO_ST-LSTM and ST-LSTM on the test set is quite negligible. Unfortunately, a principled comparison in terms of computational costs is not possible, because this information is not reported in the papers related to the other DL models considered in the study, and reproducing all the experiments of the other research groups would be expensive and—potentially—unfair; in any case, it is out of the scope of this paper.

Moreover, it is important to remark that the results do not necessarily indicate that ST-LSTM is better than J-Net or PilotNet: these two models were not optimized, and their optimized versions might outperform the BO_ST-LSTM model.

Future works will address (a) the possibility to also optimize the architecture of the ST-LSTM, moving from hyperparameter optimization towards Neural Architecture Search (NAS), and (b) the formulation of the problem as a multi-objective or constrained optimization problem, considering not only the MSE but also other requirements, such as jointly minimizing the inference time (multi-objective) or keeping it below a fixed threshold (constrained). Furthermore, comparing other models, such as 2D-CNN-LSTM or 3D U-Net, along with ST-LSTM models could provide better insights on which model performs best. Besides, the usefulness of other modules, such as a spatiotemporal attention module, could be assessed.