I enjoy hearing from readers who have used concepts from this blog to solve their own problems. It always amazes me when I see examples where only a few lines of Python code can solve a real business problem and save organizations a lot of time and money. I am also impressed when people figure out how to do this with no formal training - just with some hard work and willingness to persevere through the learning curve.
I have talked quite a bit about how pandas is a great alternative to Excel for many tasks. One of Excel’s benefits is that it offers an intuitive and powerful graphical interface for viewing your data. In contrast, pandas + a Jupyter notebook offers a lot of programmatic power but limited abilities to graphically display and manipulate a DataFrame view.
This article will review several tools for graphically viewing and manipulating DataFrames, in order to give you an idea of the landscape and help you evaluate which ones might be useful for your analysis process.
One of the most basic analysis functions is grouping and aggregating data. In some cases, this level of analysis may be sufficient to answer business questions. In other instances, it might be the first step in a more complex data science analysis. In pandas, the groupby function can be combined with one or more aggregation functions to quickly and easily summarize data. The concept is deceptively simple, and most new pandas users will grasp it quickly. However, they might be surprised at how useful complex aggregation functions can be for supporting sophisticated analysis.
This article will quickly summarize the basic pandas aggregation functions and show examples of more complex custom aggregations. Whether you are a new or more experienced pandas user, I think you will learn a few things from this article.
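As a minimal sketch of the groupby-plus-aggregation pattern described above (the sales data here is made up for illustration), named aggregation lets you apply several functions at once, and a quick follow-up computation shows the kind of custom summary the article hints at:

```python
import pandas as pd

# Hypothetical sales data, purely for illustration
df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "sales": [100, 150, 200, 50, 300],
    "units": [1, 2, 4, 1, 6],
})

# groupby combined with multiple aggregation functions via named aggregation
summary = df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    max_units=("units", "max"),
)
print(summary)

# A slightly more custom aggregation: each region's share of total sales
share = df.groupby("region")["sales"].sum() / df["sales"].sum()
print(share)
```

The named-aggregation form keeps the output columns clearly labeled, which pays off when the results feed a later analysis step.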
With pandas it is easy to read Excel files and convert the data into a DataFrame. Unfortunately, Excel files in the real world are often poorly constructed. In those cases where the data is scattered across the worksheet, you may need to customize the way you read it. This article will discuss how to use pandas and openpyxl to read these types of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.
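A minimal sketch of that idea, assuming you know which cell range holds the real data (the workbook and its layout below are fabricated stand-ins for a messy real-world file): openpyxl reads just the rectangle you care about, and pandas takes it from there.

```python
import openpyxl
import pandas as pd

# Build a small in-memory workbook that mimics a "messy" sheet:
# the table starts at row 3, column B, with junk elsewhere
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = "Quarterly Report"      # stray title cell we want to skip
ws["B3"] = "product"; ws["C3"] = "price"
ws["B4"] = "shirt";   ws["C4"] = 19.5
ws["B5"] = "hat";     ws["C5"] = 7.25

# Pull only the rectangle that holds the table, values only
rows = list(ws.iter_rows(min_row=3, max_row=5, min_col=2, max_col=3,
                         values_only=True))

# First row of the range is the header; the rest is data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

The same `iter_rows` call works unchanged on a workbook loaded with `openpyxl.load_workbook`, which is the more typical starting point.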
The main purpose of this blog is to show people how to use Python to solve real world problems. Over the years, I have been fortunate enough to hear from readers about how they have used tips and tricks from this site to solve their own problems. In this post, I am extremely delighted to present a real world case study. I hope it will give you some ideas about how you can apply these concepts to your own problems.
This example comes from Michael Biermann from Germany. He had the challenging task of trying to gather detailed historical weather data in order to do analysis on the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python programs to automate the process of collecting, cleaning and processing gigabytes of weather data in order to perform his analysis.
The pandas read_html() function is a quick and convenient way to turn an HTML
table into a pandas DataFrame. This function can be useful for quickly incorporating tables
from various websites without figuring out how to scrape the site’s HTML.
However, there can be some challenges in cleaning and formatting the data before analyzing
it. In this article, I will discuss how to use pandas
read_html() to read and
clean several Wikipedia HTML tables so that you can use them for further numeric analysis.
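To sketch the pattern (using a tiny inline HTML table as a stand-in for a Wikipedia page, including a made-up footnote marker of the kind that trips up numeric conversion):

```python
from io import StringIO

import pandas as pd

# A small stand-in for a Wikipedia-style table; "[1]" mimics a footnote marker
html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>A</td><td>1,234[1]</td></tr>
  <tr><td>B</td><td>567</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found in the page
df = pd.read_html(StringIO(html))[0]

# Strip footnote markers and thousands separators, then convert to numbers
df["Population"] = (
    df["Population"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.replace(",", "")
    .astype(int)
)
print(df)
```

Against a live page you would pass the URL to `read_html` instead of a string, but the cleanup steps are the same.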
I’ve written quite a bit about visualization in Python - partially because the landscape is always evolving. Plotly stands out as one of the tools that has undergone a significant amount of change since my first post in 2015. If you have not looked at using Plotly for Python data visualization lately, you might want to take it for a spin. This article will discuss some of the most recent changes with Plotly, what the benefits are and why Plotly is worth considering for your data visualization needs.
Today I am happy to announce the release of a new pandas utility library called sidetable. This library makes it easy to build a frequency table and simple summary of missing values in a DataFrame. I have found it to be a useful tool when starting data exploration on a new data set and I hope others find it useful as well.
This project is also an opportunity to illustrate how to use the pandas API for registering custom DataFrame accessors. This API allows you to build custom functions for working with pandas DataFrames and Series and could be really useful for building out your own library of custom pandas accessor functions.
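A minimal sketch of the accessor registration API mentioned above - note that the accessor name `summary` and its `missing` method are made up for illustration, not part of sidetable:

```python
import pandas as pd

# Register a hypothetical "summary" accessor on every DataFrame
@pd.api.extensions.register_dataframe_accessor("summary")
class SummaryAccessor:
    def __init__(self, pandas_obj):
        # pandas hands the accessor the DataFrame it is attached to
        self._obj = pandas_obj

    def missing(self):
        """Return the count of missing values per column."""
        return self._obj.isna().sum()

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

# The accessor now hangs off every DataFrame as df.summary
print(df.summary.missing())
```

Once registered, the accessor behaves like a built-in namespace (similar to `.str` or `.dt`), which is what makes this pattern attractive for packaging up your own helpers.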
Jupyter notebooks are an amazing tool for evaluating and exploring data. I have been using them as an integral part of my day to day analysis for several years and reach for them almost any time I need to do data analysis or exploration. Despite how much I like using Python in Jupyter notebooks, I do wish for the editor capabilities you can find in VS Code. I also would like my files to work better when versioning them with git.
Recently, I have started using a solution that supports the interactivity of the Jupyter notebook and the developer friendliness of plain .py text files. Visual Studio Code enables this approach through Jupyter code cells and the Python Interactive Window. Using this combination, you can visualize and explore your data in real time with a plain Python file that includes some lightweight markup. The resulting file works seamlessly with all VS Code editing features and supports clean git check-ins.
The rest of this article will discuss how to use this Python development workflow within VS Code and some of the primary reasons why you may or may not want to do so.
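The "lightweight markup" is just `# %%` comment lines marking cell boundaries, which VS Code's Python Interactive Window picks up. A minimal sketch of such a file (the data in it is invented for illustration); because the markers are ordinary comments, the file also runs unchanged as a normal script:

```python
# %% Load the data
# Each "# %%" line marks a runnable cell in VS Code's Interactive Window
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})

# %% Summarize
# Run this cell on its own while exploring, or the whole file as a script
total = df["value"].sum()
print(total)
```

This dual nature is the point: the same file supports cell-by-cell exploration in the editor and clean diffs in git.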