Feature Extraction using Sweetviz Library

Kanishk Barhanpurkar

Published in

Analytics Vidhya

4 min read · Aug 7, 2020

A very efficient tool for Feature Extraction.

Introduction

Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. The system is built around quickly visualizing target values and comparing datasets. Its goal is to help with quick analysis of target characteristics, training vs. testing data, and other such data characterization tasks. According to pypi.org, the main uses of the Sweetviz library are as follows:

Target Analysis
Visualize and Compare
Mixed-type associations
Summary Information (related to missing values)
Numerical Analysis

Sweetviz is very efficient for extracting features from a data-set. In this blog, we will explore different functionalities of the Sweetviz library on a well-known disease detection dataset, the PIMA Indian Diabetes Data-set (link).

Some insights on Feature Extraction-

Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing. A characteristic of large data sets is a large number of variables that require a lot of computing resources to process. Feature extraction is the name for methods that select and/or combine variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the original data set.
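As a rough illustration of the "combine variables into features" idea, the sketch below uses principal component analysis (PCA), one common feature extraction technique. PCA is not part of the Sweetviz workflow followed in this blog; the function name `pca_reduce` and the toy data are assumptions made purely for this example.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    # Centre each column so the components describe variance around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance carried by each kept component.
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return Xc @ components.T, explained


# Toy data (hypothetical): three columns that are all multiples of one signal.
signal = np.linspace(0.0, 1.0, 50)
X_toy = np.column_stack([signal, 2 * signal, 3 * signal])
Z, ratio = pca_reduce(X_toy, n_components=1)
```

Here the three original columns collapse into a single derived column `Z`, and `ratio` reports how much of the original variance that column retains.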

Methodology

We will apply the Sweetviz library to the above-mentioned data-set and discuss how efficiently features can be extracted. Let’s go through the steps of the methodology:

Step 1: Create the report in HTML format for the chosen data-set.

Step 2: For every attribute, check for Numerical Associations and Categorical Associations.

Step 3: Check for the highest value for the numerical association for each attribute. Make a list of top three attributes in decreasing order of numerical associations. Observe the categorical association (correlation ratio) for the selected attributes.

Step 4: The selected attribute with the highest value will be considered as a feature and can be used for further data analysis.

Step 5: Use a heat-map to cross-verify the attribute selected as a feature.

Implementation

Let’s implement this method on the PIMA Indian Diabetes Data-set.

Step 1: Report creation with the Sweetviz library. In this step, we will import libraries such as NumPy, Pandas, Seaborn and Sweetviz.
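A minimal sketch of this step is shown below. It relies on Sweetviz's `analyze` and `show_html` calls; the helper name `make_report`, the target column name `Outcome`, and the output file name are assumptions, so adjust them to match your copy of the data-set.

```python
import pandas as pd


def make_report(df: pd.DataFrame, target="Outcome", out_path="pima_report.html"):
    """Analyze df with Sweetviz and write the HTML report to out_path."""
    # Imported inside the function so this sketch can be loaded
    # even where Sweetviz is not installed.
    import sweetviz as sv

    # target_feat tells Sweetviz which column the associations are shown against.
    report = sv.analyze(df, target_feat=target)
    # open_browser=False only writes the file instead of opening a browser tab.
    report.show_html(out_path, open_browser=False)
    return out_path
```

With the data loaded through something like `pd.read_csv("diabetes.csv")` (a hypothetical file name), calling `make_report(df)` should leave an HTML report next to the script.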

Step 2: Check for Numerical Association and Categorical Association for all attributes of the dataset.

**Numerical Association and Correlation Ratio analysis.**

Step 3: After checking the numerical and categorical associations, select the three attributes with the highest categorical association values. The three attributes obtained for the PIMA Indian Data-set are as follows:

Glucose- 0.47
BMI- 0.29
Age- 0.24
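To make the ranking above more concrete, here is a hedged sketch of how a correlation ratio between a categorical target and each numeric attribute could be computed and sorted by hand. It is not Sweetviz's internal code; the function names `correlation_ratio` and `top_associations` are hypothetical, and the sketch assumes the target column holds the class labels.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) between a categorical and a numeric series."""
    values = values.astype(float)
    grand_mean = values.mean()
    groups = values.groupby(categories)
    # Variation of the group means around the overall mean, weighted by group size.
    between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Total variation of the values around the overall mean.
    total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0


def top_associations(df: pd.DataFrame, target: str, k: int = 3) -> pd.Series:
    """Return the k numeric columns with the highest correlation ratio to target."""
    numeric_cols = df.select_dtypes("number").columns
    scores = {
        col: correlation_ratio(df[target], df[col])
        for col in numeric_cols
        if col != target
    }
    return pd.Series(scores).sort_values(ascending=False).head(k)
```

Calling `top_associations(df, "Outcome")` on the diabetes DataFrame (again assuming that column name) gives a three-row ranking that can be compared against the values read off the report.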

Step 4: The highest categorical association value obtained is 0.47, for Glucose. Therefore, the attribute “Glucose” will be used as a feature, and the process of feature extraction is complete.

Step 5: Finally, a heat-map can be used to cross-check the entire procedure. In the heat-map, light shades of red correspond to values in the range of 0.40–0.60, which is consistent with the value of 0.47 obtained for Glucose.
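One possible way to reproduce such a cross-check outside the Sweetviz report is a Seaborn heat-map of the Pearson correlation matrix, sketched below. The helper names, the `Outcome` column, the `coolwarm` colour map and the output file name are all assumptions for illustration, so the colours will not necessarily match the report's own heat-map.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target="Outcome") -> pd.Series:
    """Pearson correlation of every numeric column with the target, sorted."""
    corr = df.select_dtypes("number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


def plot_heatmap(df: pd.DataFrame, out_path="pima_heatmap.png"):
    """Save an annotated correlation heat-map of the numeric columns."""
    # Plotting libraries are imported here so the numeric helper above
    # can be used without them.
    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.select_dtypes("number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    # With the coolwarm palette, positive correlations are drawn in shades of red.
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

`target_correlations(df)` gives the same numbers as the target row of the heat-map in sorted form, which makes it easy to confirm which attribute sits at the top.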

Conclusion

We have learned that Sweetviz is a great tool that can be used for feature extraction and exploratory data analysis. It can save valuable time, which can then be spent on data pre-processing. It provides a robust HTML report file that contains several visualizations such as count plots, bar graphs and heat-maps. Report generation is based on two principles: Categorical Association (Uncertainty Coefficient) and Numerical Association (Correlation Ratio). Hence, it can be concluded that the Sweetviz library is a comprehensive tool that can be used for the feature extraction process.

References

[1] Sweetviz Official Link- https://pypi.org/project/sweetviz/

[2] Research Paper- Wu, H., Yang, S., Huang, Z., He, J., & Wang, X. (2018). Type 2 diabetes mellitus prediction model based on data mining. Informatics in Medicine Unlocked, 10, 100–107. doi:10.1016/j.imu.2017.12.006

[3] Research Paper- Zhi-Hua Zhou and Yuan Jiang. NeC4.5: Neural Ensemble Based C4.5. IEEE Trans. Knowl. Data Eng, 16. 2004.

[4] Feature extraction blog- https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be