Python Program to Write to an Excel File Using openpyxl Module


In this article, we are going to see how we can write to Excel sheets using the Python language. To do this, we will use the openpyxl library, which can both read from and write to Excel files.

Python Program to Write to an Excel File Using openpyxl Module

In Python, we have the openpyxl library, which is used to create, modify, read, and write different types of Excel files such as xlsx, xlsm, xltx, and xltm. When a user works with thousands of records in an Excel file and wants to pick out a few useful pieces of information or change a few records, the openpyxl library makes this very easy.

To use openpyxl library we first have to install it using pip.

Command : pip install openpyxl

After installation we can use the library to create and modify excel files.

Let’s see different programs to understand it more clearly.

Program-1: Python Program to Print the Name of an Active Sheet Using openpyxl Module

Approach:

  • First of all we have to import the openpyxl module
  • Then we use the Workbook() constructor to create a workbook object
  • From the object we created, we extract the active sheet from the active attribute
  • Then we use the title attribute of the active sheet to fetch its title and print it.

Program:

# Import the openpyxl library
import openpyxl as opxl

# We will need to create a blank workbook first
# For that we can use the Workbook() constructor available in the openpyxl library
workb = opxl.Workbook()

# Get the active workbook sheet from the active attribute
activeSheet = workb.active

# Extracting the sheet title from the activeSheet object
sheetTitle = activeSheet.title

# Printing the sheet title
print("The active sheet name is : " + sheetTitle)

Output:

The active sheet name is : Sheet

Program-2: Python Program to Update the Name of an Active Sheet Using openpyxl Module

Approach:

  • First of all we have to import the openpyxl module.
  • Then we use the Workbook() constructor to create a workbook object.
  • From the object we created, we extract the active sheet from the active attribute.
  • Then we use the title attribute of the active sheet to fetch its title and print it.
  • Now we store the new name in the title attribute of the active sheet.
  • Finally we fetch its title and print it.

Program:

# Import the openpyxl library
import openpyxl as opxl

# We will need to create a blank workbook first
# For that we can use the Workbook() constructor available in the openpyxl library
workb = opxl.Workbook()

# Get the active workbook sheet from the active attribute
activeSheet = workb.active

# Fetching the sheet title from the activeSheet object
sheetTitle = activeSheet.title

# Printing the sheet title
print("The active sheet name is : " + sheetTitle)

# Updating the active sheet name
activeSheet.title = "New_Sheet_Name"

# Fetching the sheet title from the activeSheet object
sheetTitle = activeSheet.title

# Printing the new sheet title
print("The active sheet name after updation is : " + sheetTitle)

Output:

The active sheet name is : Sheet

The active sheet name after updating is : New_Sheet_Name

Program-3: Python Program to Write into an Excel Sheet Using openpyxl Module

Approach:

  • First of all we have to import the openpyxl module.
  • Then we use the Workbook() constructor to create a workbook object.
  • From the object we created, we extract the active sheet from the active attribute.
  • Then we create cell objects from the active sheet object that store the row and column coordinates. These cells can be accessed using row and column values or just the cell name, like A1 (for row = 1 and column = 1).
  • Store some values in those cells.
  • Save the file using save( ) to make the changes permanent.
  • Now open the excel sheet to find the changes.

Program:

# Import the openpyxl library
import openpyxl as opxl

# We will need to create a blank workbook first
# For that we can use the Workbook() constructor available in the openpyxl library
workb = opxl.Workbook()

# Get the active workbook sheet from the active attribute
activeSheet = workb.active

# Creating a cell object that contains attributes about rows, columns
# and coordinates to locate the cell
cell1 = activeSheet.cell(row=1, column=1)
cell2 = activeSheet.cell(row=2, column=1)

# Adding values to the cells
cell1.value = "Hi"
cell2.value = "Hello"

# Rather than writing the row and column number,
# we can also access the cells by their individual names
# C1 means third column, first row
cell3 = activeSheet['C1']
cell3.value = "Gracias"

# C2 means third column, second row
cell4 = activeSheet['C2']
cell4.value = "Bye"

# Finally we have to save the file to keep our changes
workb.save("E:\\Article\\Python\\file1.xlsx")

Output:

Opening file1.xlsx shows "Hi" in A1, "Hello" in A2, "Gracias" in C1, and "Bye" in C2.

Program-4: Python Program to Add more Sheets to the Active Workbook Using openpyxl Module

Approach:

  • First of all we have to import the openpyxl module.
  • Then we use the Workbook() constructor to create a workbook object.
  • From the object we created, we extract the active sheet from the active attribute.
  • Then create a new sheet by using create_sheet() method.
  • Save the file by specifying the path.

Program:

# Import the openpyxl library
import openpyxl as opxl

# We will need to create a blank workbook first
# For that we can use the Workbook() constructor available in the openpyxl library
workb = opxl.Workbook()

# Get the active workbook sheet from the active attribute
activeSheet = workb.active

# To add more sheets into the workbook we have to use the create_sheet() method
workb.create_sheet(index=1, title="2nd sheet")

# Finally we have to save the file to keep our changes
workb.save("E:\\Article\\Python\\file1.xlsx")

Output:

Opening file1.xlsx shows a new sheet named "2nd sheet" next to the default sheet.

Python Program to Read an Excel File Using Openpyxl Module


In this article, we are going to see how we can read Excel sheets using the Python language. To do this, we will use the openpyxl library, which can both read from and write to Excel files.

Python Program to Read an Excel File Using Openpyxl Module

In Python, we have the openpyxl library, which is used to create, modify, read, and write different types of Excel files such as xlsx, xlsm, xltx, and xltm. When a user works with thousands of records in an Excel file and wants to pick out a few useful pieces of information or change a few records, the openpyxl library makes this very easy.

To use openpyxl library we first have to install it using pip.

Command : pip install openpyxl

After installation we can use the library to create and modify excel files.

Let’s see different programs to understand it more clearly.

Input File:

The input file file1.xlsx contains two columns, Name and Regd. No.

Program-1: Python Program to Print a Particular Cell Value of Excel File Using Openpyxl Module

Approach:

  • First of all we have to import the openpyxl module
  • Store the path to the excel workbook in a variable
  • Load the workbook using the load_workbook( ) function passing the path as a parameter
  • From the workbook object we created, we extract the active sheet from the active attribute
  • Then we create cell objects from the active sheet object
  • Print the value from the cell using the value attribute of the cell object

Program:

# Import the openpyxl library
import openpyxl as opxl

# Path to the excel file
path = "E:\\Article\\Python\\file1.xlsx"

# Created a workbook object that loads the workbook present
# at the path provided
wb = opxl.load_workbook(path)

# Getting the active workbook sheet from the active attribute
activeSheet = wb.active

# Created a cell object from the active sheet using the cell name
cell1 = activeSheet['A2']

# Printing the cell value
print(cell1.value)

Output:

Sejal

Program-2: Python Program to Print Total Number of Rows in Excel File Using Openpyxl Module

Approach:

  • First of all we have to import the openpyxl module
  • Store the path to the excel workbook in a variable
  • Load the workbook using the load_workbook( ) function passing the path as a parameter
  • From the workbook object we created, we extract the active sheet from the active attribute
  • Then we print the number of rows using the max_row attribute of the sheet object

Program:

# Import the openpyxl library
import openpyxl as opxl

# Path to the excel file
path = "E:\\Article\\Python\\file1.xlsx"

# Created a workbook object that loads the workbook present
# at the path provided
wb = opxl.load_workbook(path)

# Getting the active workbook sheet from the active attribute
activeSheet = wb.active

# Printing the number of rows in the sheet
print("Number of rows : ", activeSheet.max_row)

Output:

Number of rows :  7

Program-3: Python Program to Print Total Number of Columns in Excel File Using Openpyxl Module

Approach:

  • First of all we have to import the openpyxl module
  • Store the path to the excel workbook in a variable
  • Load the workbook using the load_workbook( ) function passing the path as a parameter
  • From the workbook object we created, we extract the active sheet from the active attribute
  • Then we print the number of columns using the max_column attribute of the sheet object

Program:

# Import the openpyxl library
import openpyxl as opxl

# Path to the excel file
path = "E:\\Article\\Python\\file1.xlsx"

# Created a workbook object that loads the workbook present
# at the path provided
wb = opxl.load_workbook(path)

# Getting the active workbook sheet from the active attribute
activeSheet = wb.active

# Printing the number of columns in the sheet
print("Number of columns : ", activeSheet.max_column)

Output:

Number of columns :  2

Program-4: Python Program to Print All Column Names of Excel File Using Openpyxl Module

Approach:

  • First of all we have to import the openpyxl module
  • Store the path to the excel workbook in a variable
  • Load the workbook using the load_workbook( ) function passing the path as a parameter
  • From the workbook object we created, we extract the active sheet from the active attribute
  • Then we find and store the number of columns in a variable cols
  • We run a for loop from 1 to cols+1 that creates cell objects and prints their value

Program:

# Import the openpyxl library
import openpyxl as opxl

# Path to the excel file
path = "E:\\Article\\Python\\file1.xlsx"

# Created a workbook object that loads the workbook present
# at the path provided
wb = opxl.load_workbook(path)

# Getting the active workbook sheet from the active attribute
activeSheet = wb.active

# Number of columns
cols = activeSheet.max_column

# Printing the column names using a for loop
for i in range(1, cols + 1):
    currCell = activeSheet.cell(row=1, column=i)
    print(currCell.value)

Output:

Name
Regd. No

Program-5: Python Program to Print First Column Value of Excel File Using Openpyxl Module

Approach:

  • First of all we have to import the openpyxl module.
  • Store the path to the excel workbook in a variable.
  • Load the workbook using the load_workbook( ) function passing the path as a parameter.
  • From the workbook object we created, we extract the active sheet from the active attribute.
  • Then we find and store the number of rows in a variable rows.
  • We run a for loop from 1 to rows+1 that creates cell objects and prints their value.

Program:

# Import the openpyxl library
import openpyxl as opxl

# Path to the excel file
path = "E:\\Article\\Python\\file1.xlsx"

# Created a workbook object that loads the workbook present
# at the path provided
wb = opxl.load_workbook(path)

# Getting the active workbook sheet from the active attribute
activeSheet = wb.active

# Number of rows
rows = activeSheet.max_row

# Printing the first column values using for loop
for i in range(1, rows + 1):
    currCell = activeSheet.cell(row=i, column=1)
    print(currCell.value)

Output:

Name
Sejal
Abhijit
Ruhani
Rahim
Anil
Satyam
Pushpa

Program-6: Python Program to Print a Particular Row Value of Excel File Using Openpyxl Module

Approach:

  • First of all we have to import the openpyxl module.
  • Store the path to the excel workbook in a variable.
  • Load the workbook using the load_workbook( ) function passing the path as a parameter.
  • From the workbook object we created, we extract the active sheet from the active attribute.
  • We use a variable rowNum to store the row number we want to read values from and a cols variable that stores the total number of columns.
  • We run a for loop from 1 to cols+1 that creates cell objects of the specified rows and prints their value.

Program:

# Import the openpyxl library
import openpyxl as opxl

# Path to the excel file
path = "E:\\Article\\Python\\file1.xlsx"

# Created a workbook object that loads the workbook present
# at the path provided
wb = opxl.load_workbook(path)

# Getting the active workbook sheet from the active attribute
activeSheet = wb.active

# Number of columns
cols = activeSheet.max_column

# The row number we want to print from
rowNum = 2

# Printing the row
for i in range(1, cols + 1):
    currCell = activeSheet.cell(row=rowNum, column=i)
    print(currCell.value)

Output:

Sejal
19012099

Python – Variables


Python is not a “statically typed” language. We do not need to declare variables or their types before using them: a variable is created the moment we first assign a value to it. A variable is a name that is assigned to a memory location; it is the fundamental storage unit in a program.

In this post, we’ll go over what you need to know about variables in Python.

Variables in Python Language

1) Variable

Variables are simply reserved memory locations for storing values. This means that when you construct a variable, you reserve memory space.

The interpreter allocates memory and specifies what can be stored in reserved memory based on the data type of a variable. As a result, you can store integers, decimals, or characters in variables by assigning various data types to them.

2) Important points about variables

  • In Python, unlike other programming languages (such as C++ or Java), we do not have to specify a type when defining a variable. Python infers the type implicitly from the value assigned to the variable.
  • During program execution, the value stored in a variable may be modified.
  • A variable is simply the name given to a memory location; all operations performed on the variable affect that memory location.

3) Initializing the value of the variable

There is no explicit statement to reserve memory space for Python variables. The declaration happens automatically when you assign a value to a variable. The equals sign (=) is used to assign values to variables.

The operand to the left of the = operator is the variable name, and the operand to the right is the value stored in that variable.

Examples:

A=100
b="Hello"
c=4.5

4) Memory and reference

A variable in Python resembles a tag or a reference that points to a memory object.

As an example,

k = "BTechGeeks"

'BTechGeeks' is a string object in memory, and k is a reference or tag that points to that memory object.

5) Modifying the variable value

Let us try this:

p=4.5
p="Cirus"

Initially, p pointed to a float object, but now it points to a string object in memory. The variable's type also changed: originally it was a decimal (float), but when we assigned a string object to it, the type of p changed to str, i.e., a string.

If there is an object in memory but no variable pointing to it, the garbage collector can automatically free it. We forced the variable p to point to a string object, as in the preceding example, and the float 4.5 was left in memory with no variable pointing to it. The object was then immediately released by the garbage collector.

6) Assigning one variable to another variable

We can assign the value of one variable to another variable, like this:

p="BtechGeeks"
q=p

Both the p and q variables now point to the same string object, namely 'BTechGeeks'.

Below is the implementation:

p = "BTechGeeks"
# assign variable q with p
q = p
# print the values
print("The value of p :", p)
print("The value of q :", q)

Output:

The value of p : BTechGeeks
The value of q : BTechGeeks

7) Rules for creating variables in Python

  • A variable name must begin with a letter or an underscore.
  • A number cannot be the first character in a variable name.
  • Variable names can only contain alphanumeric characters and underscores (A-z, 0-9, and _ ).
  • Case matters when it comes to variable names (flag, Flag, and FLAG are three different variables).
  • The reserved terms (keywords) are not permitted to be used in naming a variable; see the examples below.
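A few examples that follow directly from these rules:

# Valid variable names
age = 25
_name = "BTechGeeks"
user2 = "Alice"

# Invalid variable names (each would raise a SyntaxError)
# 2user = "Bob"       # starts with a number
# user-name = "Eve"   # contains a character other than letters, digits, and _
# for = 10            # 'for' is a reserved keyword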


Compare and get Differences between two Lists in Python

Lists in Python:

Lists are the most versatile ordered list object type in Python. It’s also known as a sequence, which is an ordered group of objects that can contain objects of any data form, including Python Numbers, Python Strings, and nested lists. One of the most widely used and flexible Python Data Types is the list.

You can also check whether two lists are equal in Python.

Examples:

Input:

list1 = ["Hello", "Geeks", "this", "is", "BTechGeeks", "online", "Platform"]
list2 = ["Hello", "world", "this", "Python", "Coding", "Language"]

Output:

Printing the Differences between the lists : 
['Geeks', 'is', 'BTechGeeks', 'online', 'Platform', 'world', 'Python', 'Coding', 'Language']

Compare and get Differences between two Lists in Python

Let’s say we have two lists

There may be certain items in the first list that are not present in the second list. There are also several items that are present in the second list but not in the first list. We’d like to compare our two lists to figure out what the variations are.

There are several ways to compare and get differences between two lists some of them are:

Method #1: Using the union() function on sets

When we make a set from a list, it only includes the list's unique elements. So, let's transform our lists into sets and then subtract these sets to find the differences, i.e., elements that are present in one list but not in the other.

Below is the implementation:

# given two lists
list1 = ["Hello", "Geeks", "this", "is", "BTechGeeks", "online", "Platform"]
list2 = ["Hello", "world", "this", "Python", "Coding", "Language"]
# converting two lists to sets
setlist1 = set(list1)
setlist2 = set(list2)
# getting the differences in both lists
listDif = (setlist1 - setlist2).union(setlist2 - setlist1)
print('Printing the Differences between the lists : ')
print(listDif)

Output:

Printing the Differences between the lists : 
{'Geeks', 'is', 'BTechGeeks', 'online', 'Platform', 'world', 'Python', 'Coding', 'Language'}

(Since sets are unordered, the exact ordering of the elements may vary.)

Method #2: Using set.difference()

Instead of subtracting two sets with the - operator as in the previous solution, we can get the differences by using the set.difference() method.

So, let’s convert our lists to sets, and then use the difference() function to find the differences between two lists.

Below is the implementation:

# given two lists
list1 = ["Hello", "Geeks", "this", "is", "BTechGeeks", "online", "Platform"]
list2 = ["Hello", "world", "this", "Python", "Coding", "Language"]
# converting two lists to sets
setlist1 = set(list1)
setlist2 = set(list2)
# getting elements in first list which are not in second list
difference1 = setlist1.difference(setlist2)
# getting elements in second list which are not in first list
difference2 = setlist2.difference(setlist1)
listDif = difference1.union(difference2)
print('Printing the Differences between the lists : ')
print(listDif)

Output:

Printing the Differences between the lists : 
{'Geeks', 'is', 'BTechGeeks', 'online', 'Platform', 'world', 'Python', 'Coding', 'Language'}

(Since sets are unordered, the exact ordering of the elements may vary.)

Method #3: Using list comprehension

To find the differences, we can iterate over both lists and check whether each element is present in the other list. We can use list comprehension for this iteration.

Below is the implementation:

# given two lists
list1 = ["Hello", "Geeks", "this", "is", "BTechGeeks", "online", "Platform"]
list2 = ["Hello", "world", "this", "Python", "Coding", "Language"]

# getting elements in first list which are not in second list
difference1 = [element for element in list1 if element not in list2]
# getting elements in second list which are not in first list
difference2 = [element for element in list2 if element not in list1]
listDif = difference1+difference2
print('Printing the Differences between the lists : ')
print(listDif)

Output:

Printing the Differences between the lists : 
['Geeks', 'is', 'BTechGeeks', 'online', 'Platform', 'world', 'Python', 'Coding', 'Language']

Method #4: Using set.symmetric_difference()

In all of the previous solutions, we obtained the differences between the two lists in two steps. Using symmetric_difference(), however, we can accomplish this in a single step.
Sets have a member function called symmetric_difference() that takes another sequence as an argument. It returns a new set containing elements that are in either the calling set object or the sequence argument, but not in both. In other words, it returns the differences between the set and the list. Let's use this to determine the differences between two lists.

Below is the implementation:

# given two lists
list1 = ["Hello", "Geeks", "this", "is", "BTechGeeks", "online", "Platform"]
list2 = ["Hello", "world", "this", "Python", "Coding", "Language"]

listDif = set(list1).symmetric_difference(list2)
print('Printing the Differences between the lists : ')
print(listDif)

Output:

Printing the Differences between the lists : 
{'is', 'online', 'world', 'BTechGeeks', 'Python', 'Language', 'Coding', 'Geeks', 'Platform'}

Method #5: Using sets and the ^ operator

A quick approach to this problem is to use sets with the ^ (symmetric difference) operator.

Below is the implementation:

# given two lists
list1 = ["Hello", "Geeks", "this", "is", "BTechGeeks", "online", "Platform"]
list2 = ["Hello", "world", "this", "Python", "Coding", "Language"]

listDif = set(list1) ^ set(list2)
print('Printing the Differences between the lists : ')
print(listDif)

Output:

Printing the Differences between the lists : 
{'is', 'online', 'world', 'BTechGeeks', 'Python', 'Language', 'Coding', 'Geeks', 'Platform'}

Python Data Persistence – Using range


Python's built-in range() function returns an immutable sequence of numbers that can be iterated over by a for loop. The sequence generated by the range() function depends on three parameters: start, stop, and step.

The start and step parameters are optional. If they are not given, start defaults to 0 and step to 1. The range contains numbers from start to stop-1, separated by step. Consider example 2.15:

Example

range(10) generates 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

range(1, 5) results in 1, 2, 3, 4

range(20, 30, 2) returns 20, 22, 24, 26, 28

We can use this range object as an iterable, as in example 2.16, which displays the squares of all odd numbers between 11 and 20. Remember that the last number in the range is one less than the stop parameter.

Example

#for-3.py
for num in range(11, 21, 2):
    sqr = num * num
    print('square of {} is {}'.format(num, sqr))

Output:

E:\python37>python for-3.py 
square of 11 is 121 
square of 13 is 169 
square of 15 is 225 
square of 17 is 289 
square of 19 is 361

In the previous chapter, you used the len() function, which returns the number of items in a sequence object. In the next example, we use len() to construct a range of indices of items in a list, and traverse the list with the help of the index.

Example

#for-4.py
numbers = [4, 7, 2, 5, 8]
for indx in range(len(numbers)):
    sqr = numbers[indx] * numbers[indx]
    print('square of {} is {}'.format(numbers[indx], sqr))

Output:

E:\python37>python for-4.py 
square of 4 is 16 
square of 7 is 49 
square of 2 is 4 
square of 5 is 25 
square of 8 is 64 

E:\python37>

Have a look at another example of employing a for loop over a range. The following script calculates the factorial of a number. Note that the factorial of n (mathematical notation: n!) is the product of all integers from 1 to n.

Example

#factorial.py
n = int(input("enter number.."))
# calculating factorial of n
f = 1
for i in range(1, n + 1):
    f = f * i
print('factorial of {} = {}'.format(n, f))

Output:

E:\python37>python factorial.py 
enter number..5 
factorial of 5 = 120

How To Scrape LinkedIn Public Company Data – Beginners Guide


Nowadays everybody is familiar with how big the LinkedIn community is. LinkedIn is one of the largest professional social networking sites in the world which holds a wealth of information about industry insights, data on professionals, and job data.

Now, the only way to get the entire data out of LinkedIn is through Web Scraping.

Why Scrape LinkedIn public data?

There are multiple reasons why one might want to scrape data out of LinkedIn. The scraped data can be useful when you are working on a related project, or when hiring multiple people based on their profiles: you can look through their data and select the candidates who fit the company best.

This scraping task is less time-consuming and automates the process, gathering millions of records into a single file and making the task easy.

Another benefit of scraping is automating your job search. Every online site has thousands of job openings for different kinds of jobs, which can be hectic for people looking for a job in their field only. Scraping can help them automate their job search by applying filters and extracting all the information on a single page.

In this tutorial, we will be scraping the data from LinkedIn using Python.

Prerequisites:

In this tutorial, we will use basic Python programming as well as some Python packages: lxml and requests.

But first, you need to install the following things:

  1. Python accessible here (https://www.python.org/downloads/)
  2. Python requests accessible here(http://docs.python-requests.org/en/master/user/install/)
  3. Python LXML( Study how to install it here: http://lxml.de/installation.html)

Once you are done with the installation, we will write the Python code to extract the LinkedIn public data from company pages.

The code below only runs on Python 2, not Python 3, because sys.setdefaultencoding() is not available in Python 3.

import json
import re
from importlib import reload

import lxml.html
import requests
import sys

reload(sys)
sys.setdefaultencoding('cp1251')

HEADERS = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'accept-encoding': 'gzip, deflate, sdch',
           'accept-language': 'en-US,en;q=0.8',
           'upgrade-insecure-requests': '1',
           'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}

file = open('company_data.json', 'w')
file.write('[')
file.close()

COUNT = 0


def increment():
    global COUNT
    COUNT = COUNT + 1


def fetch_request(url):
    try:
        fetch_url = requests.get(url, headers=HEADERS)
    except:
        try:
            fetch_url = requests.get(url, headers=HEADERS)
        except:
            try:
                fetch_url = requests.get(url, headers=HEADERS)
            except:
                fetch_url = ''
    return fetch_url


def parse_company_urls(company_url):
    if company_url:
        if '/company/' in company_url:
            parse_company_data(company_url)
        else:
            parent_url = company_url
            fetch_company_url = fetch_request(company_url)
            if fetch_company_url:
                sel = lxml.html.fromstring(fetch_company_url.content)
                COMPANIES_XPATH = '//div[@class="section last"]/div/ul/li/a/@href'
                companies_urls = sel.xpath(COMPANIES_XPATH)
                if companies_urls:
                    if '/company/' in companies_urls[0]:
                        print('Parsing From Category ', parent_url)
                        print('-------------------------------------------------------------------------------------')
                    for company_url in companies_urls:
                        parse_company_urls(company_url)
            else:
                pass


def parse_company_data(company_data_url):
    if company_data_url:
        fetch_company_data = fetch_request(company_data_url)
        if fetch_company_data.status_code == 200:
            try:
                source = fetch_company_data.content.decode('utf-8')
                sel = lxml.html.fromstring(source)
                # CODE_XPATH = '//code[@id="stream-promo-top-bar-embed-id-content"]'
                # code_text = sel.xpath(CODE_XPATH).re(r'<!--(.*)-->')
                code_text = sel.get_element_by_id(
                    'stream-promo-top-bar-embed-id-content')
                if len(code_text) > 0:
                    code_text = str(code_text[0])
                    code_text = re.findall(r'<!--(.*)-->', str(code_text))
                    code_text = code_text[0].strip() if code_text else '{}'
                    json_data = json.loads(code_text)
                    if json_data.get('squareLogo', ''):
                        company_pic = 'https://media.licdn.com/mpr/mpr/shrink_200_200' + \
                                      json_data.get('squareLogo', '')
                    elif json_data.get('legacyLogo', ''):
                        company_pic = 'https://media.licdn.com/media' + \
                                      json_data.get('legacyLogo', '')
                    else:
                        company_pic = ''
                    company_name = json_data.get('companyName', '')
                    followers = str(json_data.get('followerCount', ''))

                    # CODE_XPATH = '//code[@id="stream-about-section-embed-id-content"]'
                    # code_text = sel.xpath(CODE_XPATH).re(r'<!--(.*)-->')
                    code_text = sel.get_element_by_id(
                        'stream-about-section-embed-id-content')
                if len(code_text) > 0:
                    code_text = str(code_text[0]).encode('utf-8')
                    code_text = re.findall(r'<!--(.*)-->', str(code_text))
                    code_text = code_text[0].strip() if code_text else '{}'
                    json_data = json.loads(code_text)
                    company_industry = json_data.get('industry', '')
                    item = {'company_name': str(company_name.encode('utf-8')),
                            'followers': str(followers),
                            'company_industry': str(company_industry.encode('utf-8')),
                            'logo_url': str(company_pic),
                            'url': str(company_data_url.encode('utf-8')), }
                    increment()
                    print(item)
                    file = open('company_data.json', 'a')
                    file.write(str(item) + ',\n')
                    file.close()
            except:
                pass
        else:
            pass


fetch_company_dir = fetch_request('https://www.linkedin.com/directory/companies/')

if fetch_company_dir:
    print('Starting Company Url Scraping')
    print('-----------------------------')
    sel = lxml.html.fromstring(fetch_company_dir.content)
    SUB_PAGES_XPATH = '//div[@class="bucket-list-container"]/ol/li/a/@href'
    sub_pages = sel.xpath(SUB_PAGES_XPATH)
    print('Company Category URL list')
    print('--------------------------')
    print(sub_pages)
    if sub_pages:
        for sub_page in sub_pages:
            parse_company_urls(sub_page)
else:
    pass

How To Scrape Amazon Data Using Python Scrapy


Wouldn't it be good if all the information related to a product were placed in a single table? It would be really convenient and accessible if we could get all that information in one place.

Since Amazon is a huge website containing millions of records, scraping its data is quite challenging. Amazon is a tough website for beginners to scrape, and people often get blocked by Amazon's anti-scraping technology.

In this blog, we aim to provide information about Scrapy and how to scrape the Amazon website using it.

What is Scrapy?

Scrapy is a free and open-source web-crawling framework written in Python. It was originally designed for web scraping: extracting data using APIs, or acting as a general-purpose web crawler.

This framework is used in data mining, information processing, and historical archival. Its applications are widely used in different industries and have proven very useful. It not only scrapes data from websites, but can also scrape data from web services, for example the Amazon API, the Facebook API, and many more.

How to install Scrapy?

Firstly, there is some third-party software that needs to be installed in order to install the Scrapy module.

  • Python: As Scrapy is based on the Python language, one has to install it first.
  • pip: pip is a python package manager tool which maintains a package repository and installs python libraries, and its dependencies automatically. It is better to install pip according to system OS, and then try to follow the standard way of installing Scrapy.

There are different ways in which we can download Scrapy globally as well as locally but the most standard way of downloading it is by using pip.

Run the below command to install Scrapy using pip:

pip install scrapy

How to get started with Scrapy?

Since Scrapy is an application framework, it provides multiple commands to create an application and work with it. Before everything else, we have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:

scrapy startproject new_project

This will create a project directory, as sketched below.
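The generated layout typically looks like this (shown for a project named new_project; the exact files can vary slightly between Scrapy versions):

new_project/
    scrapy.cfg            # deploy configuration file
    new_project/          # project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py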

Scrapy is an application framework that follows an object-oriented programming style for the definition of items and spiders for the overall application.

The project structure contains the following files:

  1. scrapy.cfg : This file lives in the root directory of the project and includes the project name along with the project settings.
  2. test_project : The application directory, with the files that actually make running and scraping from the web URLs possible.
  3. items.py : Items are containers that will be loaded with the scraped data; they work like simple Python dictionaries. Items provide additional protection against typos and populating undeclared fields.
  4. pipelines.py : After an item has been scraped by the spider, it is sent to the item pipeline, which processes it through several components. Each class has to implement a method called process_item for the processing of scraped items.
  5. settings.py : It allows customization of the behaviour of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
  6. spiders : A directory which contains all spiders/crawlers as Python classes.

Scrape Amazon Data: How to Scrape an Amazon Web Page

For a better understanding of how Scrapy works, we will scrape the product name, price, category, and availability from the Amazon.com website.

Let’s name this project amazon_pro. You can use the project name according to your choice.

Start by writing the below code:

scrapy startproject amazon_pro

The directory will be created in the local folder by the name mentioned above.

Now we need three things which will help in the scraping process.

  1. Update items.py with the fields we want to scrape, for example name, price, availability, and so on.
  2. Create a new spider with all the necessary elements in it, like allowed domains, start_urls, and a parse method.
  3. For data processing, update the pipelines.py file.

Now, to generate the spider, run the following command in the terminal:

scrapy genspider amazon amazon.com

Now, we need to define the name, URLs, and possible domains to scrape the data.


An item object is defined in the parse method and is filled with the required information using the XPath utility of the response object. XPath is a search function that is used to find elements in the HTML tree structure. Lastly, we yield the item object so that Scrapy can do further processing on it.
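As a rough illustration only (the start URL and every XPath below are placeholder assumptions, not selectors taken from Amazon's live markup), the spider could look something like this:

import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    # Hypothetical product page; replace with real product URLs
    start_urls = ['https://www.amazon.com/dp/EXAMPLE']

    def parse(self, response):
        # Fill an item dict with the fields declared in items.py;
        # the XPaths are illustrative and must be adapted to the real page
        item = {
            'product_name': response.xpath('//span[@id="productTitle"]/text()').get(),
            'product_sale_price': response.xpath('//span[@id="priceblock_ourprice"]/text()').get(),
            'product_category': ','.join(response.xpath('//div[@id="wayfinding-breadcrumbs_feature_div"]//a/text()').getall()),
            'product_availability': response.xpath('//div[@id="availability"]//text()').get(),
        }
        yield item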

Next, after scraping data, Scrapy calls item pipelines to process it. These are called pipeline classes, and we can use these classes to store the data in a file, a database, or in any other way. It is a default class, like Items, that Scrapy generates for users.


The pipeline class implements the process_item method, which is called for each item yielded by the spider. It takes the item and the spider as arguments and returns a dict object. So for this example, we are just returning the item dict as it is.
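A minimal sketch of the corresponding pipelines.py (the class name is an assumption derived from the project name):

class AmazonProPipeline:
    def process_item(self, item, spider):
        # No transformation in this example; the scraped item passes through unchanged
        return item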

Now, we have to enable ITEM_PIPELINES in the settings.py file.
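In settings.py this looks roughly like the following (the dotted path depends on your project and class names):

ITEM_PIPELINES = {
    'amazon_pro.pipelines.AmazonProPipeline': 300,
}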


Now, after completing the entire code, we need to scrape the item by sending requests and accepting response objects.

We will call the spider by its unique name, and Scrapy will easily find it.

scrapy crawl amazon

Now, after the items have been scraped, we can save them in different formats by using file extensions, for example .json, .csv, and many more.

scrapy crawl amazon -o data.csv

The above command will save the scraped data in CSV format in the data.csv file.

Here is the output of the above code:

[ 
{"product_category": "Electronics,Computers & Accessories,Data Storage,External Hard Drives", "product_sale_price": "$949.95", "product_name": "G-Technology G-SPEED eS PRO High-Performance Fail-Safe RAID Solution for HD/2K Production 8TB (0G01873)", "product_availability": "Only 1 left in stock."},
{"product_category": "Electronics,Computers & Accessories,Data Storage,USB Flash Drives", "product_sale_price": "", "product_name": "G-Technology G-RAID with Removable Drives High-Performance Storage System 4TB (Gen7) (0G03240)", "product_availability": "Available from these sellers."},
{"product_category": "Electronics,Computers & Accessories,Data Storage,USB Flash Drives", "product_sale_price": "$549.95", "product_name": "G-Technology G-RAID USB Removable Dual Drive Storage System 8TB (0G04069)", "product_availability": "Only 1 left in stock."},
{"product_category": "Electronics,Computers & Accessories,Data Storage,External Hard Drives", "product_sale_price": "$89.95", "product_name": "G-Technology G-DRIVE ev USB 3.0 Hard Drive 500GB (0G02727)", "product_availability": "Only 1 left in stock."}
]

We have successfully scraped the data from Amazon.com using Scrapy.

How to Code a Scraping Bot with Selenium and Python


Selenium is a powerful tool for controlling web browsers through programs and performing browser automation. Selenium is also used in Python for scraping data. It is also useful for interacting with a page before collecting the data, which is the case we will discuss in this article.

In this article, we will be scraping investing.com to extract the historical data of dollar exchange rates against one or more currencies.

There are other tools in Python with which we can extract financial information. However, here we want to explore how Selenium helps with data extraction.

The Website we are going to Scrape:

Understanding the website is the initial step before moving on to further things. The website consists of historical data for the exchange rate of the dollar against the euro. On this page, we will find a table in which we can set the date range we want; that is what we will be using.

We only want the currency exchange rates against the dollar. If that's not the case, replace the "usd" in the URL.

The Scraper’s Code:

The initial step is the imports from Selenium, the sleep function to pause the code for some time, and pandas to manipulate the data whenever necessary.
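Those imports, all of which are used in the full loop later in this article, look like this:

from time import sleep
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait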


Now, we will write the scraping function. The function will consist of:

  • A list of currency codes.
  • A start date.
  • An End date.
  • A boolean flag to export the data into a .csv file. We will use False as the default.

We want to make a scraper that scrapes the data for multiple currencies, so the function takes a list of currencies. We also have to initialise an empty list to store the scraped data, as sketched below.
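A sketch of the function's skeleton (the names get_currencies and frames are assumptions; the body is filled in by the loop shown later):

def get_currencies(currencies, start, end, export_csv=False):
    frames = []  # one DataFrame per currency will be collected here
    # ... the scraping loop shown below goes here ...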


As we can see, the function receives the list of currencies, and our plan is to iterate over this list and get the data. For each currency we will create a URL, instantiate the driver object, and use it to get the page. Then the window will be maximized, but it will only be visible if we keep option.headless as False; otherwise, Selenium will do all the work without showing you anything.


Now, we want to get the data for any time period.

Selenium provides some awesome functionalities for getting connected to the website.

We will click on the date and fill the start date and end dates with the dates we want and then we will hit apply.

We will use WebDriverWait, ExpectedConditions, and By to make sure that the driver will wait for the elements we want to interact with.

The waiting time is 20 seconds, but it is up to you to set it however you want.

We have to select the date button using its XPath.

The same process will be followed by the start_bar, end_bar, and apply_button.

The start_date field will take in the date from which we want the data.

end_bar will select the date until which we want the data.

When we will be done with this, then the apply_button will come into work.


Now, we will use the pandas.read_html() function to get all the tables from the page source, and then finally we will quit the driver.


How to handle Exceptions In Selenium:

The data collecting process is done. But Selenium is sometimes a little unstable and may fail to perform the functions we are performing here.

To prevent this, we have to put the code in a try/except block so that every time it faces a problem, the except block will be executed.

So, the code will be like:

for currency in currencies:
    while True:
        try:
            # Opening the connection and grabbing the page
            my_url = f'https://br.investing.com/currencies/usd-{currency.lower()}-historical-data'
            option = Options()
            option.headless = False
            driver = webdriver.Chrome(options=option)
            driver.get(my_url)
            driver.maximize_window()

            # Clicking on the date button
            date_button = WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.XPATH,
                    "/html/body/div[5]/section/div[8]/div[3]/div/div[2]/span")))
            date_button.click()

            # Sending the start date
            start_bar = WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.XPATH,
                    "/html/body/div[7]/div[1]/input[1]")))
            start_bar.clear()
            start_bar.send_keys(start)

            # Sending the end date
            end_bar = WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.XPATH,
                    "/html/body/div[7]/div[1]/input[2]")))
            end_bar.clear()
            end_bar.send_keys(end)

            # Clicking on the apply button
            apply_button = WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.XPATH,
                    "/html/body/div[7]/div[5]/a")))
            apply_button.click()
            sleep(5)

            # Getting the tables on the page and quitting
            dataframes = pd.read_html(driver.page_source)
            driver.quit()
            print(f'{currency} scraped.')
            break

        except:
            driver.quit()
            print(f'Failed to scrape {currency}. Trying again in 30 seconds.')
            sleep(30)
            continue

For each DataFrame in this dataframes list, we will check whether it is the table we want, and then append it to the list we assigned at the beginning.

Then we will need to export a .csv file. This will be the last step, and then we will be done with the extraction; a sketch follows.
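A sketch of those final steps, assuming the accumulator list is called frames and that the historical-data table can be recognized by a 'Date' column (a hypothetical check; adapt it to the actual table headers):

# Inside the scraping function, after the page has been read:
for df in dataframes:
    if 'Date' in df.columns:  # keep only the historical-data table
        frames.append(df)
        if export_csv:
            df.to_csv(f'{currency.lower()}.csv', index=False)

# Once all currencies are processed, the function returns the list:
# return frames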


Wrapping up:

This is all about extracting the data from the website. So far, this code gets the historical data of the exchange rate of a list of currencies against the dollar and returns a list of DataFrames and several .csv files.

https://www.investing.com/currencies/usd-eur-historical-data

How to Scrape Wikipedia Articles with Python


We are going to make a scraper that will scrape Wikipedia pages. The scraper will start at a Wikipedia page and then follow a random link to another article. It will be fun to see which pages the scraper visits.

Setting up the scraper:

Here, I will be using Google Colaboratory, but you can use PyCharm or any other platform you want for your Python coding. I will make a Colab notebook named Wikipedia. If you use another Python platform, you need to create a .py file with any name you like.

To make the HTTP requests, we will be installing the requests module available in python.

pip install requests

We will be using a wiki page as the starting point.

import requests

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")

print(response.status_code)

When we run the above code, it will show 200 as the status code.


Okay! Now we are ready to move on to the next thing.

Extracting the data from the page:

We will be using Beautiful Soup to make our task easier. The initial step is to install it:

pip install beautifulsoup4

Beautiful Soup allows you to find an element by its ID tag.

title = soup.find(id="firstHeading")

Bringing everything together, our code will look like this:

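A minimal version of the combined code (assuming Beautiful Soup's built-in html.parser):

import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(response.content, 'html.parser')

# Wikipedia renders the article title in an element with id="firstHeading"
title = soup.find(id="firstHeading")
print(title.text)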

As we can see, when the program is run, the output is the title of the wiki article, i.e., Web scraping.

Scraping other links:

Other than scraping the title of the article, now we will be focusing on the rest of the things we want.

We will grab an <a> tag that links to another Wikipedia article and scrape that page.

To do this, we will scrape all the <a> tags within the article and then shuffle them.

Do not forget to import the random module.
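A sketch of that step; it assumes the article body sits in Wikipedia's bodyContent div and keeps only internal /wiki/ links:

import random
import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(response.content, 'html.parser')

allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)

linkToScrape = None
for link in allLinks:
    href = link.get('href', '')
    # Keep only internal links to other Wikipedia articles
    if href.startswith("/wiki/"):
        linkToScrape = link
        break

print(linkToScrape)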


As you can see, the chosen link points to some other Wikipedia article page, for example one named "IP address".

Creating an endless scraper:

Now, we have to make the scraper scrape the new links.

To do this, we have to move everything into a scrapeWikiArticle function.


The scrapeWikiArticle function will extract the link and the title. Then it will call itself with the new link, creating an endless cycle of scrapers that bounce around Wikipedia; a sketch follows.
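A sketch of that function, reusing the assumptions from the snippets above:

import random
import requests
from bs4 import BeautifulSoup

def scrapeWikiArticle(url):
    response = requests.get(url=url)
    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find(id="firstHeading")
    print(title.text)

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)

    for link in allLinks:
        href = link.get('href', '')
        if href.startswith("/wiki/"):
            # Follow the first random internal link and repeat
            scrapeWikiArticle("https://en.wikipedia.org" + href)
            return

scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")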

After running the program, the scraper prints the title of each article it visits.

Wonderful! In only a few steps, we got from "Web scraping" to "Wikipedia articles with NLK identifiers".

Conclusion:

We hope this article is useful to you and that you learned how to extract random Wikipedia pages. The scraper revolves around Wikipedia by following random links.

Python – Ways to remove duplicates from list

The list is an important container, used in almost every program in day-to-day programming as well as in web development; the more it is used, the greater the need to master it, so knowledge of its operations is necessary. This article focuses on one such operation: getting a unique list from a list that contains possible duplicates. Removing duplicates from a list has a large number of applications, so it is good to know how to do it.

How to Remove Duplicates From a Python List

Method 1 : Naive method

In the naive method, we simply traverse the list, append the first occurrence of each element to a new list, and ignore all other occurrences of that particular element.
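A minimal sketch of the naive method, using an assumed sample list:

# Using the naive method
sample_list = [1, 3, 5, 3, 7, 1, 9, 3]
result = []
for item in sample_list:
    if item not in result:
        result.append(item)
print("List after removing duplicates:", result)

Output :

List after removing duplicates: [1, 3, 5, 7, 9]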


Method 2 : Using list comprehension

List comprehensions are a Python construct used for creating new sequences (such as lists, tuples, etc.) from previously created sequences. This makes code more efficient and easy to understand. This method works like the one above, but it is a one-liner shorthand of the longer method, written with the help of a list comprehension.
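The same idea as a one-liner, using the same assumed sample list:

# Using list comprehension
sample_list = [1, 3, 5, 3, 7, 1, 9, 3]
result = []
[result.append(item) for item in sample_list if item not in result]
print("List after removing duplicates:", result)

Output :

List after removing duplicates: [1, 3, 5, 7, 9]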


Method 3 : Using set():

We can remove duplicates from a list using the inbuilt function set(). A set always contains distinct elements, so we use set() to remove duplicates. The main and notable drawback of this approach is that the ordering of the elements is lost.
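A sketch with the same assumed sample list (note that the resulting order is not guaranteed):

# Using set()
sample_list = [1, 3, 5, 3, 7, 1, 9, 3]
result = list(set(sample_list))
print("List after removing duplicates:", result)

Output :

List after removing duplicates: [1, 3, 5, 7, 9]
(The order of elements may differ, since sets do not preserve ordering.)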


Method 4 : Using list comprehension + enumerate():

enumerate() can also be used for removing duplicates when combined with a list comprehension. It basically looks for elements that have already occurred and skips adding them, and it preserves the list ordering.
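A sketch with the same assumed sample list; an element is kept only if it does not appear earlier in the list:

# Using list comprehension + enumerate()
sample_list = [1, 3, 5, 3, 7, 1, 9, 3]
result = [item for idx, item in enumerate(sample_list) if item not in sample_list[:idx]]
print("List after removing duplicates:", result)

Output :

List after removing duplicates: [1, 3, 5, 7, 9]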

Method 5 : Using collections.OrderedDict.fromkeys():

This is the fastest method to achieve this particular task. It first removes the duplicates and returns a dictionary, which has to be converted to a list. This also works well with strings.
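A sketch with the same assumed sample list; fromkeys() keeps the first occurrence of each element and preserves ordering:

# Using collections.OrderedDict.fromkeys()
from collections import OrderedDict

sample_list = [1, 3, 5, 3, 7, 1, 9, 3]
result = list(OrderedDict.fromkeys(sample_list))
print("List after removing duplicates:", result)

Output :

List after removing duplicates: [1, 3, 5, 7, 9]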


Conclusion :

In conclusion, now you know how to remove duplicates from a list in Python. There are different ways, but the collections.OrderedDict.fromkeys() method is the best in terms of programming efficiency.