Exploratory Data Analysis (EDA) can be an essential part of your data science process. I want to emphasize the word “can”. I’ve seen many people expand their EDA process to the point of overkill. Of course there are always more patterns to be found, but you need to develop a sense for when your EDA has gone on long enough and you have a good feel for the data. The goal (in most cases) is not to explore the data; it is to analyze the data in some way, often through a model.
In an effort to make your EDA processes more efficient, here are 9 functions I use for quick EDA!
You can also check out Mito for easy EDA in a spreadsheet that writes Python for you.
Note: these functions require pandas and NumPy.
For any data frame, the .info() function will tell you how many entries you have, the names of each column, the data type of each column, and how many non-null values you have in each column. You can compare the number of non-null values to the total number of entries to find which columns have null values.
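As a minimal sketch of that comparison, the snippet below calls .info() on a small data frame and then lists the columns whose non-null count falls short of the row count. The sample data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are invented for this example.
df = pd.DataFrame({
    "city": ["Austin", "Boston", None, "Denver"],
    "price": [120.0, np.nan, 95.5, 210.0],
    "quantity": [3, 1, 4, 2],
})

# Prints the number of entries, column names, dtypes, and non-null counts.
df.info()

# A column has nulls when its non-null count is below the number of rows.
non_null_counts = df.count()
columns_with_nulls = non_null_counts[non_null_counts < len(df)].index.tolist()
print(columns_with_nulls)
```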
There are multiple ways to find duplicate rows in your dataset. The function above is the easiest, as it will find all the duplicate entries and print how many there are. If it prints “0”, there are no duplicates and you are good to go!
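One way to get that count, sketched here with .duplicated() on made-up data, is to flag every row that repeats an earlier row and sum the flags. This is an assumption about the approach, not necessarily the exact call used in the original snippet.

```python
import pandas as pd

# Hypothetical sample data with one repeated row.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "score": [90, 85, 90, 70],
})

# .duplicated() marks each row that is an exact repeat of an earlier row.
num_duplicates = int(df.duplicated().sum())
print(num_duplicates)
```

If the count is above zero, df.drop_duplicates() returns a copy with the repeated rows removed.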
Find Unique Values in a Column
In much of your EDA, you are focused on a few key columns. This function quickly prints all the unique values of that column, so you can understand the breadth and range of the values. Below is what the output looks like:
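A minimal sketch using .unique() on a single column; the data frame and the column name "name" are placeholders for your own.

```python
import pandas as pd

# Hypothetical sample data; "name" stands in for the column you care about.
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cy"]})

# .unique() returns the distinct values in order of first appearance.
unique_values = df["name"].unique()
print(unique_values)
```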
Find the Counts of Unique Values in a Column
This function builds upon the previous one by showing you which unique values in that column have the largest and smallest frequencies. This is a great way to look for outliers.
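A short sketch with .value_counts(), which sorts the unique values from most to least frequent. The sample column is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Counts per unique value, sorted from most frequent to least frequent.
counts = df["color"].value_counts()

print(counts.head(3))  # the most common values
print(counts.tail(3))  # the rarest values, often where outliers hide
```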
Find all the Null Values in a Dataframe
This function combines .isnull() and .sum() and will return each column in the data frame along with the number of null values in that column. Finding null values is an important part of EDA and data cleaning. Here is the output of the function call:
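As a quick sketch on made-up data, chaining the two calls gives one null count per column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing values.
df = pd.DataFrame({
    "city": ["Austin", "Boston", None, "Denver"],
    "price": [120.0, np.nan, 95.5, 210.0],
    "quantity": [3, 1, 4, 2],
})

# .isnull() marks each missing cell; .sum() then counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```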
Fill Null Values with Zeros (or any filler)
df.replace(np.nan, 0, inplace=True)  # numeric 0 rather than the string "0", so numeric columns stay numeric
This function takes your entire data frame and fills the null values with zeros, or with whatever value you pass as the second argument. It is certainly the fastest way to get rid of your null values, which helps you avoid errors and dead ends later in your analysis. If you are not sure whether null values will impact your analysis, I advise you to either fill them or delete the entries that hold them.
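The two options, filling or deleting, can be sketched as follows. Note that this uses .fillna() and .dropna(), which are alternatives to the .replace() call shown earlier, and the sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values.
df = pd.DataFrame({
    "city": ["Austin", None, "Denver"],
    "price": [120.0, np.nan, 95.5],
})

# Option 1: fill every null with the same value.
filled = df.fillna(0)

# Option 1b: choose a different filler per column by passing a dict.
filled_per_column = df.fillna({"city": "unknown", "price": 0})

# Option 2: delete every row that holds at least one null.
dropped = df.dropna()
```

Each call returns a new data frame and leaves df untouched, so you can compare the results before committing to one.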
Filter Rows in your Dataframe
df2 = df[df["column_name"] > 100]
The line of code above creates a new data frame that holds all the rows where “column_name” is greater than some value (100 in this case). You can, of course, filter on other conditions such as “less than” or “equal to”, and on more complex expressions that combine multiple conditions.
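Here is a small sketch of those other conditions on made-up data. When combining conditions, wrap each one in parentheses and use & for "and" or | for "or".

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "price": [120.0, 80.0, 95.5, 210.0],
    "quantity": [3, 1, 4, 2],
})

# Equality filter.
df_equal = df[df["city"] == "Austin"]

# Both conditions must hold.
df_both = df[(df["price"] > 100) & (df["quantity"] < 3)]

# Either condition may hold.
df_either = df[(df["price"] > 100) | (df["city"] == "Boston")]
```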
Create a Box Plot for Any Column
The function above will return box plots for all the numerical columns in the dataset.
To specify that the box plot only be created for a certain column, use this function:
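A minimal sketch of both variants using pandas' .boxplot(), which relies on matplotlib being installed. The sample data and column names are made up, and the two plots are drawn side by side only so they fit in one figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data; "city" is non-numeric and gets skipped.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "price": [120.0, 80.0, 95.5, 210.0],
    "quantity": [3, 1, 4, 2],
})

fig, (ax_all, ax_single) = plt.subplots(1, 2, figsize=(8, 4))

# With no column argument, every numeric column gets its own box.
df.boxplot(ax=ax_all)

# The column argument restricts the plot to a single column.
df.boxplot(column="price", ax=ax_single)

fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure for you.
```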
Create a Correlation Matrix
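This pandas function returns correlations only for pairs of numeric columns. In pandas 2.0 and later, a data frame that also contains non-numeric columns needs numeric_only=True, otherwise the call raises an error.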
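A short sketch with .corr() on made-up data. Passing numeric_only=True (available from pandas 1.5) keeps the call working when the data frame also has text columns.

```python
import pandas as pd

# Hypothetical sample data; "name" is a text column and is left out of the result.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "height": [160.0, 175.0, 168.0, 182.0],
    "weight": [55.0, 72.0, 64.0, 85.0],
})

# Pairwise Pearson correlations between the numeric columns.
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)
```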
To see all 9 of these functions in action, here is a quick tutorial video: