How Do You Handle Missing Values, Categorical Data And Feature Scaling In Machine Learning

Pianalytix
Dec 4, 2020

Handling Missing Values In Machine Learning

In real-world data, there are instances where a particular element is absent for various reasons, such as corrupt data, failure to load the information, or incomplete extraction. Missing data is a major problem for analysts in practice, because deciding how to handle it correctly is what produces robust data models. In pandas, missing values are represented as NA (Not Available) or NaN (Not a Number). For example, some users being surveyed may choose not to share their contact details and others may choose not to share personal details, so many fields in the dataset end up missing.

Here are some ways to handle missing values in a dataset:

Deleting Rows

If a column has more than 70% to 75% of its rows as null, the complete column can be dropped. Rows that have null values in one or more columns can also be dropped. Dropping rows or columns is advisable only if there are enough samples in the dataset. Keep in mind that deleting data causes a loss of information, which may prevent the model from giving the expected results when predicting the output.

Syntax : data.drop(['Cabin'], axis=1, inplace=True)

In this dataset, which has 891 rows and 12 columns, there are 687 null values in the ['Cabin'] column, approximately 77%, so we delete this column.
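The rule of thumb above can be sketched in pandas as follows. This is a minimal illustration, not the original notebook: the small frame and the 0.70 threshold are assumptions chosen for the example, and only the column names echo the dataset discussed here.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real dataset (values are illustrative only).
data = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Fraction of missing values in each column.
null_fraction = data.isnull().mean()

# Drop every column whose missing fraction reaches the chosen threshold.
threshold = 0.70  # assumed cut-off, based on the 70% to 75% rule of thumb
columns_to_drop = null_fraction[null_fraction >= threshold].index.tolist()
reduced = data.drop(columns=columns_to_drop)

# Alternatively, drop the rows that still contain any missing value.
complete_rows = reduced.dropna(axis=0)
```

Computing the missing fraction first makes the decision explicit, instead of hard-coding which column to delete.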

Replacing With Mean / Median

This method calculates the mean, median, or mode of a feature and uses it to replace the missing values. It can be applied to a feature that holds numeric data, such as the age of a person or the ticket fare. Because no rows or columns are discarded, the loss of information is avoided, and this often gives better results than deleting rows and columns.

Syntax for mean : data["Age"] = data["Age"].replace(np.nan, data["Age"].mean())

Syntax for median : data["Age"] = data["Age"].replace(np.nan, data["Age"].median())
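A small runnable sketch of the same idea, using `fillna`, which is the more common pandas spelling of this replacement. The four age values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry.
data = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 70.0]})

# pandas skips NaN by default when computing these statistics.
age_mean = data["Age"].mean()
age_median = data["Age"].median()

# Two alternative imputations of the same column.
filled_with_mean = data["Age"].fillna(age_mean)
filled_with_median = data["Age"].fillna(age_median)
```

The median is less sensitive to outliers than the mean, so it may be the safer choice for skewed features such as fares.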

Replacing Missing Data With The Most Frequent Values

When the missing values are in a categorical column (string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can instead be replaced with a new category.

Syntax : data['Cabin'].fillna('Unknown')[:10]
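Both options can be sketched as below. The column name `Embarked` and its values are hypothetical, chosen only to give a short categorical column with gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
data = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", np.nan]})

# Option 1: fill with the most frequent category.
# mode() ignores NaN and returns a Series, so take its first entry.
most_frequent = data["Embarked"].mode()[0]
filled_with_mode = data["Embarked"].fillna(most_frequent)

# Option 2: treat "missing" as a category of its own.
filled_with_unknown = data["Embarked"].fillna("Unknown")
```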

Imputation Using K-NN ( K-Nearest Neighbors )

K-nearest neighbours (K-NN) is an algorithm used for classification and regression. It can also be used when missing values are present in the dataset. K-NN works by finding the distances between a query and all the examples in the data; if NaN values are found, they can be replaced with values estimated from the nearest neighbours.

As shown in Fig. 1, some data points (red, blue) have values, while the purple data point has a null value. In Fig. 2 we find the nearest neighbours of the purple point. Finally, in Fig. 3 we find that its nearest neighbours are blue, so we assign the value blue to the purple point.
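For numeric features, one way to try this is scikit-learn's `KNNImputer`. The sketch below is an assumption about tooling, since the article does not name a library, and the small matrix is made up. Unlike the colour example in the figures, this imputer fills each gap with the average of that feature over the nearest neighbours rather than a majority label.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numeric data: rows are samples, columns are features.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The choice of `n_neighbors=2` is arbitrary here; on real data it is worth trying several values.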

Handling Categorical Data In Machine Learning

Categorical data is data that is grouped into categories; it can be defined as “a collection of information that is divided into groups”. For example, if a school or college collects details of its students, the data may be grouped according to variables such as branch, section, and sex. Data of this kind is called categorical data.

There are two subcategories of categorical data:

Nominal Data:

Nominal data is used to name variables without providing any numerical value. It is also called labelled or named data. Grouping observations by these labels helps when drawing conclusions from the data. Examples of nominal data include division and gender.

Ordinal Data:

Ordinal data has variables with natural, ordered categories, where the distances between the categories are not known. The data has an order or scale to it, but that order has no standard scale on which the difference between one level and the next can be measured.

Examples of ordinal data include rating scales (like, dislike) and customer satisfaction survey data. Each of these examples may have different collection and analysis techniques, but they are all ordinal data.
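The difference between the two kinds can be illustrated with pandas' `Categorical` type. This is only a sketch: the branch names and satisfaction levels are made-up values, not data from the article.

```python
import pandas as pd

# Nominal: categories are just names, with no order between them.
branch = pd.Categorical(["CSE", "ECE", "CSE", "MECH"])

# Ordinal: categories have a natural order, but the size of the gap
# between "low" and "medium" or "medium" and "high" is not defined.
satisfaction = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# Position of each value within the declared ordering.
satisfaction_codes = list(satisfaction.codes)

# Order-based operations are meaningful only for the ordinal column.
highest = satisfaction.max()
lowest = satisfaction.min()
```

Declaring the order explicitly keeps pandas from falling back to alphabetical order, which would rank "high" below "low".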

Feature Scaling In Machine Learning

Feature scaling is a method used to normalize the range of the independent variables, or features, of the data.

Normalization is performed during data preparation in machine learning. It is a technique used to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information.

Some common methods to perform feature scaling:

Standardization:

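In standardization, each value is replaced by its z-score: z = (x - mean) / standard deviation, so the rescaled feature has a mean of 0 and a standard deviation of 1.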
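A minimal NumPy sketch of the z-score computation, using made-up fare values:

```python
import numpy as np

# Made-up feature values.
fare = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# z = (x - mean) / standard deviation.
# np.std uses the population formula (ddof=0) by default.
z_scores = (fare - fare.mean()) / fare.std()
```

Note that pandas' `Series.std()` defaults to the sample formula (`ddof=1`), so the two libraries give slightly different z-scores unless `ddof` is set explicitly.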

For algorithms that assume zero-centred data, such as Principal Component Analysis (PCA), mean normalization is used. If the distribution is not Gaussian or the standard deviation is very small, normalization works better.
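The sketch below shows both rescalings on made-up values. It assumes the usual definitions, which the article does not spell out: mean normalization as (x - mean) / (max - min), and "normalization" as min-max rescaling, (x - min) / (max - min).

```python
import numpy as np

# Made-up feature values.
fare = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
value_range = fare.max() - fare.min()

# Mean normalization: centred on 0, spread over a range of width 1.
mean_normalized = (fare - fare.mean()) / value_range

# Min-max rescaling (assumed meaning of "normalization"): values in [0, 1].
min_max_scaled = (fare - fare.min()) / value_range
```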

Unit Vector:

Scaling is done by treating the whole feature vector as having unit length. That means dividing each component by the Euclidean length of the vector: x' = x / ||x||
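As a small sketch with made-up numbers, each row below is treated as one feature vector and divided by its own Euclidean length:

```python
import numpy as np

# Two made-up feature vectors, one per row.
X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
])

# Euclidean length of each row; keepdims=True keeps the shape (2, 1)
# so the division broadcasts across the columns.
lengths = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / lengths
```

A row of all zeros would have length 0 and needs special handling before dividing.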

Check the original blog here: https://pianalytix.com/how-do-you-handle-missing-values-categorical-data-and-feature-scaling-in-machine-learning/

