3-Minutes Pandas
What should we do to see the entire printed dataframe after the execution of a Python script?
Getting a Python script to run without errors is not the only goal of debugging. We also need to make sure each function does what we expect. A typical step in exploratory data analysis is to check what the data looks like before and after a specific processing step.
So, we need to print some data frames or essential variables while the script runs, in order to check whether they are “correct”. However, a plain print call sometimes shows only the top and bottom rows of the data frame (as in the example below), which makes checking unnecessarily hard.
Usually, the data frames are pandas.DataFrame objects, and if you print one directly, you might get something like this:
import pandas as pd
import numpy as np
data = np.random.randn(5000, 5)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
print(df.head(100))
You may have already noticed that the middle part of the data frame is hidden by three dots. What if we really need to check what the top 100 rows are? For example, we want to check the result of a specific step in the middle of a large Python script, in order to make sure the functions are executed as expected.
set_option()
One of the most straightforward solutions is to change the default number of rows that pandas shows:
pd.set_option('display.max_rows', 500)
print(df.head(100))
Here, set_option is a function that controls pandas' behavior, including the maximum number of rows or columns to display, as we did above. The first argument, display.max_rows, names the option that sets the maximum number of rows to display, and 500 is the value we assign to it.
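To see how the option changes, it can be read back with pd.get_option and restored with pd.reset_option. The following is only a minimal sketch; the variable names are made up for illustration, and the default value it prints depends on your pandas version (60 in the releases I am aware of).

```python
import pandas as pd

# Start from the library default so the sketch is repeatable.
pd.reset_option('display.max_rows')
default_rows = pd.get_option('display.max_rows')

# Raise the limit, as in the article.
pd.set_option('display.max_rows', 500)
raised_rows = pd.get_option('display.max_rows')

# Put the default back.
pd.reset_option('display.max_rows')
restored_rows = pd.get_option('display.max_rows')

print(default_rows, raised_rows, restored_rows)
```

Note that the setting is global: once changed, it applies to every data frame printed afterwards until it is changed or reset.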
Even though this method is widely used, it’s not ideal inside an executable Python file, especially if you print multiple data frames and want each one to display a different number of rows.
For example, I have a script structured as shown,
## Code Block 1 ##
...
print(df1.head(20))
...
## Code Block 2 ##
...
print(df2.head(100))
...
## Code Block N ##
...
print(df_n)
...
We show a different number of top rows at different points in the script. Sometimes we want to see the entire printed data frame, and sometimes we only care about its dimensions and structure and don't need to see all the data.
In such a case, we would probably need to call pd.set_option() to set the desired display, or pd.reset_option() to restore the default, every time before we print a data frame, which gets messy and troublesome.
## Code Block 1 ##
...
pd.set_option('display.max_rows', 20)
print(df1.head(20))
...
## Code Block 2 ##
...
pd.set_option('display.max_rows', 100)
print(df2.head(100))
...
## Code Block N ##
...
pd.reset_option('display.max_rows')
print(df_n)
...
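As an aside, pandas also provides pd.option_context, a context manager that applies an option only inside a with block and restores the previous value on exit. The sketch below is an illustration rather than part of the original workflow; df1 is a stand-in data frame and the variable names are hypothetical. It removes the need for reset_option, but you still have to pick a row limit for every block.

```python
import numpy as np
import pandas as pd

# Stand-in data frame for illustration only.
df1 = pd.DataFrame(np.random.randn(50, 5), columns=['A', 'B', 'C', 'D', 'E'])

before = pd.get_option('display.max_rows')

# The option is changed only inside the with block.
with pd.option_context('display.max_rows', 20):
    inside = pd.get_option('display.max_rows')
    print(df1.head(20))

# On exit, the previous value is restored automatically.
after = pd.get_option('display.max_rows')
```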
There’s actually a more flexible and effective way of showing the entire data frame without specifying the display options for Pandas.
to_string()
to_string() converts the pd.DataFrame object directly to a string, and when we print that string, the pandas display limit no longer applies.
pd.set_option('display.max_rows', 10)
print(df.head(100).to_string())
We can see above that even though I set the maximum number of rows to display to 10, to_string() still prints the entire data frame of 100 rows.
The method to_string() renders the entire data frame as a plain Python string, so all values and index labels are kept when it is printed. The display options set through set_option() only affect how pandas renders its own objects when they are printed, so our string is not limited by the maximum number of rows set earlier.
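The difference can be checked by comparing the two renderings as strings. This is a minimal sketch that assumes the same random data as in the article; the names truncated and full are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = np.random.randn(5000, 5)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

pd.set_option('display.max_rows', 10)

# What print(df.head(100)) would show: limited by display.max_rows.
truncated = repr(df.head(100))

# What print(df.head(100).to_string()) would show: an ordinary str.
full = df.head(100).to_string()

print(type(full))              # an ordinary Python string
print(len(full.splitlines()))  # one header line plus 100 data rows

pd.reset_option('display.max_rows')
```

Because full is an ordinary string, it can also be written to a log file or passed to any function that expects text.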
So, the strategy is simple: you don’t need to set anything via set_option(), and you only need to call to_string() to see the entire data frame. It saves you from thinking about which option to set in which part of the script.
Takeaways
- Use set_option('display.max_rows') when you want a consistent number of rows displayed across the entire script.
- Use to_string() if you want to print the entire pandas data frame no matter what pandas options have been set.
Thanks for reading! Hope you enjoy using the Pandas trick in your work!