[Part I] End to End Guide for Heart Disease Prediction : Data Collection and Preprocessing

"Structured data classification with deep learning offers groundbreaking potential for heart disease prediction. By harnessing the capabilities of neural networks, this approach can process medical data efficiently and deliver precise diagnoses, leading to improved patient care and better health outcomes."

This series of five blogs will guide you through a comprehensive Structured Data Classification project, covering every step from Data Collection to Model Deployment. In this blog we will explore Data Collection and Pre-processing.


In the realm of heart disease prediction, data collection involves gathering diverse medical records and relevant patient information. Preprocessing encompasses cleansing, transforming, and normalizing the data to ensure accuracy and remove inconsistencies. These crucial steps lay the foundation for effective and reliable predictive models.

 Importing libraries

import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

First, we import TensorFlow, the machine learning framework used to build the deep learning model. NumPy and Pandas are imported for numerical computation and data manipulation respectively; Pandas also provides the DataFrame, which is particularly useful for handling structured data. Finally, the Keras API and its layers module are imported, which give a user-friendly interface for creating neural networks and deep learning architectures.

 The Dataset

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

The code above loads the data from a CSV file at the given URL into a Pandas DataFrame. After it runs, the dataframe variable holds the heart disease data, ready for preprocessing and modelling.

dataframe.shape
(303, 14)

The dataset contains records for 303 patients. Each record has 13 features: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca and thal. The fourteenth column is the target.

The target will be 1 or 0, based on whether the patient has heart disease or not.

 Data Splitting

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)
print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)
Using 242 samples for training and 61 for validation

 The code splits the original DataFrame into a training DataFrame and a validation DataFrame by randomly sampling 20% of the rows for validation and keeping the remaining rows for training. Fixing random_state makes the split reproducible. It then prints the sizes of the training and validation sets.
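The split arithmetic can be checked without downloading anything. The sketch below assumes a synthetic 303-row table (the `toy` frame and its values are made up for illustration and are not the heart dataset) and applies the same `sample` and `drop` calls.

```python
import pandas as pd

# Synthetic stand-in with the same number of rows as the heart dataset.
toy = pd.DataFrame({"age": range(303), "target": [i % 2 for i in range(303)]})

# Randomly pick 20% of the rows for validation (reproducible via random_state).
val_toy = toy.sample(frac=0.2, random_state=1337)

# Everything that was not sampled becomes the training set.
train_toy = toy.drop(val_toy.index)

print(len(train_toy), len(val_toy))

# The two sets share no rows.
overlap = set(train_toy.index) & set(val_toy.index)
print("overlapping rows:", len(overlap))
```

Because the training set is built by dropping the validation indices, no patient can appear in both sets.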

 Data Preprocessing

Now a function named dataframe_to_dataset is defined, which converts a Pandas DataFrame into a TensorFlow Dataset. It separates the target column from the features, builds a dataset of (feature dictionary, label) pairs, shuffles it, and returns it. The same function is used for both the training and the validation DataFrame.

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

To get an insight into the structure and content of the data the model will see during training, the code above takes one element from train_ds and prints it. Because this happens before batching, the output shows the input features and target label of a single sample: a dictionary of feature values and one label. Batching is then applied so that the model can process multiple samples in parallel.
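To make the element structure concrete, here is a minimal sketch on a made-up five-row frame. The `toy` frame, its values, and the batch size of 2 are assumptions chosen for illustration; the conversion function mirrors the one defined above.

```python
import pandas as pd
import tensorflow as tf

# Made-up rows for illustration only (not taken from the heart dataset).
toy = pd.DataFrame(
    {"age": [63, 45, 58, 50, 39], "sex": [1, 0, 1, 1, 0], "target": [1, 0, 1, 0, 0]}
)

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")  # separate the label column
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    return ds.shuffle(buffer_size=len(dataframe))

ds = dataframe_to_dataset(toy)

# Before batching: each element is (dict of scalar features, scalar label).
features, label = next(iter(ds))
print(sorted(features.keys()), tuple(features["age"].shape))

# After batching: each feature tensor gains a leading batch dimension.
batched = ds.batch(2)
batch_features, batch_labels = next(iter(batched))
print(tuple(batch_features["age"].shape), tuple(batch_labels.shape))

# Five rows in batches of two give three batches (the last one is smaller).
n_batches = sum(1 for _ in batched)
print(n_batches)
```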

from tensorflow.keras.layers import IntegerLookup
from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers import StringLookup

def encode_numerical_feature(feature, name, dataset):
    normalizer = Normalization()
    # Build a dataset that yields only this feature, with shape (batch, 1)
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    # Learn the mean and variance of the feature from the data
    normalizer.adapt(feature_ds)
    encoded_feature = normalizer(feature)
    return encoded_feature

def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = StringLookup if is_string else IntegerLookup
    # "binary" turns each value into a one-hot vector
    # (some newer TensorFlow releases name this mode "multi_hot")
    lookup = lookup_class(output_mode="binary")
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    # Learn the set of possible values and assign each a fixed index
    lookup.adapt(feature_ds)
    encoded_feature = lookup(feature)
    return encoded_feature

# Integer-coded categorical features get an int64 input like this:
sex = keras.Input(shape=(1,), name="sex", dtype="int64")
# Categorical feature encoded as string
thal = keras.Input(shape=(1,), name="thal", dtype="string")
# Numerical features get a default (float) input like this:
age = keras.Input(shape=(1,), name="age")

# Integer categorical features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, False)
# String categorical features
thal_encoded = encode_categorical_feature(thal, "thal", train_ds, True)
# Numerical features
age_encoded = encode_numerical_feature(age, "age", train_ds)

Two functions, encode_numerical_feature and encode_categorical_feature, are defined which are used for encoding numerical and categorical features respectively. These functions facilitate the data preprocessing steps required before feeding the data into a deep learning model.

  1. encode_numerical_feature(feature, name, dataset): This function encodes a numerical feature. It performs the following steps:
  • Creates a Normalization layer (normalizer) to normalize the feature.
  • Prepares a dataset that contains only the numerical feature, as it will be used to learn the feature's statistics for normalization.
  • Adapts the normalizer to the data in the dataset to learn the mean and standard deviation of the feature.
  • Normalizes the input numerical feature using the learned statistics and returns the normalized feature.
  2. encode_categorical_feature(feature, name, dataset, is_string): This function encodes a categorical feature. It handles both string and integer categorical features. The steps are as follows:
  • Chooses the appropriate lookup class: StringLookup for string values, IntegerLookup for integer values.
  • Creates a lookup layer that maps each categorical string or integer to a fixed integer index and, because output_mode="binary" is set, outputs that index as a one-hot vector.
  • Prepares a dataset that contains only the categorical feature.
  • Adapts the lookup layer to the data in the dataset to learn the mapping between categorical values and integer indices.
  • Encodes the input categorical feature (feature) with the lookup layer and returns the one-hot encoded feature.
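The idea behind the lookup step can be sketched in plain NumPy. This is a simplified illustration and not the Keras implementation: the sample values are made up, the helper name `encode` is hypothetical, and the vocabulary is ordered alphabetically here for simplicity, whereas the Keras layers choose their own ordering. Index 0 is reserved for values that were not seen while building the vocabulary, which is how the lookup layers behave with their default settings.

```python
import numpy as np

# Made-up training values for a string categorical feature such as thal.
train_values = ["normal", "fixed", "reversible", "normal", "normal"]

# "Adapt": collect the distinct values and give each a fixed index.
# Index 0 is kept for out-of-vocabulary values.
vocab = sorted(set(train_values))
index = {value: i + 1 for i, value in enumerate(vocab)}

def encode(value):
    """Return a one-hot vector for a single categorical value."""
    vec = np.zeros(len(vocab) + 1, dtype="float32")
    vec[index.get(value, 0)] = 1.0
    return vec

print(index)
print(encode("normal"))
print(encode("never-seen"))  # falls into the out-of-vocabulary slot
```

Each value becomes a vector with a single 1, so the model does not treat the category codes as ordered quantities.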

 The provided code prepares the input features for a deep learning model by encoding them appropriately. It distinguishes between categorical features encoded as integers and a categorical feature encoded as a string. It also handles numerical features.

1. Encoding categorical features:

  • Categorical features encoded as integers are passed through the encode_categorical_feature function, which internally uses IntegerLookup to map them to indices and one-hot encode them.
  • The categorical feature encoded as a string (thal) is passed through the same function, but with is_string=True. This uses StringLookup to map the string values to indices and one-hot encode them.

2. Encoding numerical features:

  • Each numerical feature (e.g., age, trestbps, etc.) is passed through the encode_numerical_feature function, which uses Normalization to normalize the numerical features.
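What the normalization step computes can be written out with NumPy. This is a minimal sketch under stated assumptions: the age values are made up, and the arithmetic follows the usual definition (subtract the mean, divide by the standard deviation) rather than reproducing the Keras layer internals.

```python
import numpy as np

# Made-up training and validation values for a numerical feature such as age.
train_ages = np.array([63.0, 45.0, 58.0, 50.0, 39.0])
val_ages = np.array([51.0, 70.0])

# "Adapt": learn the statistics from the training data only.
mean = train_ages.mean()
std = np.sqrt(train_ages.var())

# Apply the same learned statistics to any data passed through afterwards.
train_normalized = (train_ages - mean) / std
val_normalized = (val_ages - mean) / std

print(mean, std)
print(train_normalized)
print(val_normalized)
```

The statistics come from the training set alone, which matches the code above where train_ds is the dataset passed to the encoding functions. Validation values are scaled with those same numbers, so they need not have zero mean themselves.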

The result of these encoding steps is that all features are now in a format suitable for training a deep learning model. These encoded features can be combined and used as input data while building and training the heart disease prediction model.


We have gone through the data collection and preprocessing steps. We looked at the dataset and the features we will use for the task, split the data into training and validation sets, and converted the Pandas DataFrames into TensorFlow Datasets. We then saw how to normalize numerical features and encode categorical features for use in a TensorFlow model. The resulting encoded features can be used as inputs to a model that predicts heart disease from the provided dataset.

In the next article, we will discuss how to create and train the model on the prepared dataset.



  Maddula Syam Pavan