Python Libraries-Applications

NumPy

NumPy brings array-oriented computing to Python, which forms the basis for most data processing tasks. It provides multidimensional arrays and operations on them, ranging from simple cleaning and merging to advanced tasks such as linear algebra and statistical analysis.

NumPy is the foundation for other libraries such as pandas, scikit-learn, and SciPy, and its functionality applies to many data processing tasks.
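
As a rough illustration of the tasks mentioned above, the following sketch uses small made-up arrays (the values and variable names are assumptions for demonstration only) to show cleaning, merging, a basic statistic, and a linear algebra call.

```python
import numpy as np

# Hypothetical measurements: 3 samples (rows) by 2 features (columns).
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Simple cleaning: replace a missing value (NaN) with the mean of the rest.
raw = np.array([1.0, np.nan, 3.0])
cleaned = np.where(np.isnan(raw), np.nanmean(raw), raw)

# Merging: append one more row to the 2-D array.
merged = np.vstack([data, [[7.0, 8.0]]])

# Statistics: mean of each column.
column_means = data.mean(axis=0)

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)

print(cleaned, merged.shape, column_means, x)
```

Each operation works on whole arrays at once, which is the array-oriented style the libraries below build on.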

Pandas

The pandas library in Python is used for data manipulation. The name is derived from "panel data", a term for multidimensional structured data sets.

pandas provides efficient ways to:

Format data into DataFrames, which makes it easy to identify and delete specific records, columns, or chunks of data using indexing
- Common file formats such as CSV, JSON, XML, and Excel can be read into DataFrames with a single function call.
- pandas provides the loc and iloc indexers to select slices, subsets, and chunks of data by label or by integer position.
Handle missing data
- Dropping rows or columns that contain missing values
- Replacing the missing values with default values, averages, or imputed values
Merge and combine data
- Data sets can be merged on a single key or on multiple keys
- Data sets can be joined using SQL-style joins such as left, right, inner, and outer joins. These are relatively easy to implement.
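
The selection and join operations listed above can be sketched as follows. The two tables, their column names, and their values are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd

# Hypothetical tables sharing the key column "customer_id".
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cara"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4],
                       "amount": [10.0, 25.0, 7.5]})

# loc selects by label or boolean mask; iloc selects by integer position.
by_label = customers.loc[customers["name"] == "Bob", ["customer_id", "name"]]
by_position = customers.iloc[0:2]

# SQL-style joins on a single key.
inner = customers.merge(orders, on="customer_id", how="inner")
left = customers.merge(orders, on="customer_id", how="left")
outer = customers.merge(orders, on="customer_id", how="outer")

print(len(inner), len(left), len(outer))
```

To join on several keys, pass a list of column names to on. The how argument also accepts "right".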

The accompanying example uses the pandas library to load a CSV file and manipulate the data set by handling its missing values.
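
This is not the original example, only a minimal sketch of that kind of workflow. To stay self-contained it reads hypothetical CSV text from memory instead of a file on disk; with a real file you would pass its path to pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical CSV content with two missing entries.
csv_text = """name,age,city
Ann,34,Paris
Bob,,London
Cara,29,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that has a missing value.
complete_rows = df.dropna()

# Option 2: fill gaps with a column average or a default value.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})

print(missing_per_column)
print(filled)
```

Whether to drop or fill depends on how much data is missing and how the result will be used.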

Matplotlib

Matplotlib is a data visualization package in Python. Matplotlib is built on NumPy and works on the multidimensional array representation of datasets.

Graphic routines in Matplotlib provide a wide range of options to customize visualizations by changing the various graphical attribute settings on axes, colors, bin size, titles, and so on.

The accompanying image shows a visualization of array elements that follow a normal distribution.
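
A plot of this kind could be produced roughly as follows. The sample size, seed, bin count, colors, and output file name are all assumptions chosen for illustration, not settings from the original figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# Draw 1000 samples from a standard normal distribution (fixed seed for repeatability).
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=0.0, scale=1.0, size=1000)

# Plot a histogram and customize bin count, colors, title, and axis labels.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=30, color="steelblue", edgecolor="white")
ax.set_title("Histogram of normally distributed values")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")

# Write the figure to an image file (hypothetical file name).
fig.savefig("normal_histogram.png")
```

In an interactive session, plt.show() can be used instead of saving the figure.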

NLTK

NLTK (Natural Language Toolkit) contains functions for a variety of natural language processing tasks such as tokenization, tagging, word counting, and semantic and sentiment analysis.

The accompanying image shows how the nltk library is used to tokenize a given sentence.

Tokenizing - breaking running text down into individual words, which are then stored in an array or a list.
POS tagging - Part-of-speech (POS) tagging identifies words as nouns, verbs, adjectives, and so on. This helps in understanding the role and importance of each word.
Word count - counting the occurrences of words, which can be analyzed for sources such as Twitter or message boards.
Semantic analysis - attempts to understand natural language the way people actually use it. It is a method to quantify nuance, subtext, and context.
Sentiment analysis - classifying the underlying attitude of a text as positive or negative. Comments in surveys, social media posts, and discussion boards can be mined for how people feel about a particular topic. This can help businesses gain insights into what matters to their customers.
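
The first and third tasks above, tokenizing and word counting, can be sketched in a few lines. This example assumes the regex-based wordpunct_tokenize function because it needs no extra data downloads; the sentence is made up. POS tagging and sentiment analysis typically require additional NLTK resources fetched with nltk.download and are not shown here.

```python
from nltk.probability import FreqDist
from nltk.tokenize import wordpunct_tokenize

sentence = "The cat sat on the mat. The mat was warm."

# Tokenizing: split the text into word and punctuation tokens.
tokens = wordpunct_tokenize(sentence)

# Word count: keep alphabetic tokens, lowercase them, and count occurrences.
words = [token.lower() for token in tokens if token.isalpha()]
word_counts = FreqDist(words)

print(tokens)
print(word_counts.most_common(2))
```

FreqDist behaves like a dictionary of counts, so word_counts["mat"] returns how often "mat" occurs.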

Seaborn

Seaborn is a Python data visualization library based on matplotlib.

Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

The accompanying image shows seaborn being used to visualize the letters 'SPAUDR'.
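
The original figure is not reproduced here. As a stand-in, the sketch below draws a simple bar chart to show what the high-level interface looks like; the data frame, its column names, and the output file name are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data: one score per group.
df = pd.DataFrame({"group": ["A", "B", "C", "D"],
                   "score": [3, 7, 5, 2]})

# Apply seaborn's default theme, then draw a bar chart straight from the DataFrame.
sns.set_theme()
ax = sns.barplot(data=df, x="group", y="score")
ax.set_title("Score by group")

# seaborn returns a Matplotlib Axes, so the figure is saved the Matplotlib way.
ax.figure.savefig("seaborn_barplot.png")
```

Because seaborn builds on matplotlib, any further customization can be done with the usual Axes methods.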

SciPy

The SciPy library provides efficient numerical routines for integration, interpolation, optimization, linear algebra, and statistics.

SciPy is built on NumPy and uses the same multidimensional array approach to datasets. SciPy offers more extensive linear algebra capabilities than NumPy, whose functions are comparatively basic and not intended to cover the full range of scientific computing.

SciPy functions provide capabilities such as curve fitting, random number generation, Fourier transforms, and Bessel functions.
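
A few of these capabilities can be sketched with small, made-up inputs. The model function line and the test signal are assumptions for illustration only.

```python
import numpy as np
from scipy import fft, integrate, optimize, special

# Integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Curve fitting: recover slope and intercept from noise-free points on a line.
def line(x, slope, intercept):
    return slope * x + intercept

x = np.linspace(0.0, 10.0, 20)
y = line(x, 2.0, 1.0)
params, covariance = optimize.curve_fit(line, x, y)

# Fourier transform: find the dominant frequency of a 5 Hz sine sampled at 100 Hz.
t = np.arange(100) / 100.0
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(fft.rfft(signal))
frequencies = fft.rfftfreq(len(signal), d=0.01)
peak_frequency = frequencies[np.argmax(spectrum)]

# Bessel function of the first kind, order 0, evaluated at 0 (equals 1).
j0_at_zero = special.jv(0, 0.0)

print(area, params, peak_frequency, j0_at_zero)
```

Every routine takes and returns NumPy arrays or plain numbers, so the results can be passed directly to the other libraries described here.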

Scikit-learn

scikit-learn is built on top of NumPy and SciPy and works well alongside pandas and Matplotlib, so it draws on the strengths of the Python scientific libraries to make model building easy and efficient.

It performs tasks such as:

Regression
- Linear, Ridge and Lasso
Classification
- Logistic regression, Support Vector Machines, decision trees, k-nearest neighbors
Clustering
- K-Means clustering, Spectral clustering
Dimensionality reduction
- Principal Component Analysis, factor analysis

scikit-learn includes a number of built-in datasets for evaluating the analytical routines. In addition, there are advanced functions for model selection, feature extraction, and normalization. Models can also be tuned using grid search and randomized parameter optimization.
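
To make this concrete, here is a minimal sketch that combines a built-in dataset, normalization, a classifier, and grid search. The choice of the iris dataset, logistic regression, the candidate values for C, and the split settings are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Normalize the features, then fit a logistic regression classifier.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Grid search over the regularization strength C with 5-fold cross-validation.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Swapping GridSearchCV for RandomizedSearchCV gives the randomized parameter optimization mentioned above, with candidates sampled rather than enumerated.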
