Find & Drop duplicate columns in a DataFrame | Python Pandas
In this article we will learn how to find duplicate columns in a Pandas DataFrame and drop them.
The Pandas library provides direct APIs for finding duplicate rows, but it has no direct API for finding duplicate columns, so we have to build one ourselves. First, let's create a DataFrame with duplicate columns.
import pandas as pd

# List of tuples
players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24),
           ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
           ('Trevor', 28, 'England', 28, 'England', 28),
           ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
           ]

# Create a DataFrame object
PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])

print("Original Dataframe is:")
print(PlayerObj)
Output:

Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
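For comparison, the row-level APIs mentioned above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (rows_df and its contents are hypothetical, not part of the players example). DataFrame.duplicated() and DataFrame.drop_duplicates() are the pandas methods that work on duplicate rows.

```python
import pandas as pd

# Hypothetical sample with one repeated row (not the players data above)
rows_df = pd.DataFrame(
    [("Nathan", 35), ("Vishal", 24), ("Nathan", 35)],
    columns=["Name", "Age"],
)

# duplicated() marks each row that repeats an earlier row
row_mask = rows_df.duplicated()
print(row_mask.tolist())

# drop_duplicates() keeps the first occurrence of each row
unique_rows = rows_df.drop_duplicates()
print(unique_rows)
```

Both methods compare rows, not columns, which is why a separate helper is needed for the column case.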
Find duplicate columns in a DataFrame:
To find the duplicate columns in a DataFrame, we iterate over each column and check whether any later column has the same contents. If so, the name of that later column is stored in a set of duplicate column names, and at the end our function returns the list of duplicate columns.
import pandas as pd

def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It iterates over all the columns and finds the duplicate columns in the dataframe.
    :param df: Dataframe object
    :return: List of names of columns whose contents repeat an earlier column
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns
    for x in range(df.shape[1]):
        # Select column at xth index of dataframe.
        col = df.iloc[:, x]
        # Iterate over all the columns from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index of dataframe.
            otherCol = df.iloc[:, y]
            # Check if two columns x & y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

def main():
    # List of tuples
    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
               ('Vishal', 24, 'India', 24, 'India', 24),
               ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
               ('Trevor', 28, 'England', 28, 'England', 28),
               ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
               ]

    # Creation of DataFrame object
    PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])
    print("Original Dataframe is:")
    print(PlayerObj)

    # To get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(PlayerObj)
    print('Duplicate Columns are: ')
    for ele in duplicateColumnNames:
        print('Column name is : ', ele)

if __name__ == '__main__':
    main()
Output:

Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
Duplicate Columns are: 
Column name is :  Citizen
Column name is :  Jersey
Column name is :  Address

The names are collected in a set, so the order in which they are printed may vary between runs.
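The comparison in getDuplicateColumns relies on Series.equals, so it may help to see what that method treats as "equal". The sketch below uses small made-up Series (all variable names here are hypothetical). It illustrates that the column name is not part of the comparison, that the dtype is, and that NaN values in the same position count as equal.

```python
import numpy as np
import pandas as pd

ints = pd.Series([35, 24, 34], name="Age")
same_ints = pd.Series([35, 24, 34], name="Jersey")
floats = pd.Series([35.0, 24.0, 34.0], name="AgeFloat")

# Different names, same values and dtype: treated as equal
names_differ = ints.equals(same_ints)

# Same numbers, but int64 vs float64: not treated as equal
dtypes_differ = ints.equals(floats)

# NaN values in matching positions are treated as equal
with_nan = pd.Series([1.0, np.nan])
nan_match = with_nan.equals(pd.Series([1.0, np.nan]))

print(names_differ, dtypes_differ, nan_match)
```

One consequence is that two columns holding the same numbers with different dtypes, for example integers and floats, would not be reported as duplicates by this approach.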
Drop duplicate columns in a DataFrame:
To drop the duplicate columns, we pass the list of duplicate column names returned by our function to DataFrame.drop().
import pandas as pd

def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It iterates over all the columns and finds the duplicate columns in the dataframe.
    :param df: Dataframe object
    :return: List of names of columns whose contents repeat an earlier column
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns
    for x in range(df.shape[1]):
        # Select column at xth index of dataframe.
        col = df.iloc[:, x]
        # Iterate over all the columns from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index of dataframe.
            otherCol = df.iloc[:, y]
            # Check if two columns x & y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

def main():
    # List of tuples
    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
               ('Vishal', 24, 'India', 24, 'India', 24),
               ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
               ('Trevor', 28, 'England', 28, 'England', 28),
               ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
               ]

    # Creation of DataFrame object
    PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])
    print("Original Dataframe is:")
    print(PlayerObj)

    # To get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(PlayerObj)
    print('Duplicate Columns are: ')
    for ele in duplicateColumnNames:
        print('Column name is : ', ele)

    # Delete duplicate columns, reusing the list computed above
    print('After removing duplicate columns new data frame becomes: ')
    newDf = PlayerObj.drop(columns=duplicateColumnNames)
    print("Modified Dataframe is:")
    print(newDf)

if __name__ == '__main__':
    main()

Output:

Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
Duplicate Columns are: 
Column name is :  Jersey
Column name is :  Citizen
Column name is :  Address
After removing duplicate columns new data frame becomes: 
Modified Dataframe is:
      Name  Age       Country
0   Nathan   35     Australia
1   Vishal   24         India
2  Abraham   34  South Africa
3   Trevor   28       England
4    Kumar   42      SriLanka
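As an alternative to the pairwise loop, the row-level duplicated() method can be applied to the transposed DataFrame. The following is only a sketch of that commonly used idiom, not the method this article builds. The names player_df, is_duplicate, and deduped are introduced here for illustration. Transposing converts mixed columns to the object dtype, so the comparison may behave differently from Series.equals when two columns hold equal values with different dtypes.

```python
import pandas as pd

players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24),
           ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
           ('Trevor', 28, 'England', 28, 'England', 28),
           ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42)]

player_df = pd.DataFrame(
    players,
    columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])

# Transpose so that each original column becomes a row, then flag
# every row that repeats an earlier one
is_duplicate = player_df.T.duplicated().to_numpy()

# Select the non-duplicate columns from the original (untransposed) frame
deduped = player_df.loc[:, ~is_duplicate]
print(deduped.columns.tolist())
```

Applying the boolean mask to the original DataFrame, instead of transposing the deduplicated result back, keeps the original column dtypes. The transpose itself copies the data, so this may be costly for very large DataFrames.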