Querying a whole table
We can dive right into it by looking at the classic SELECT ALL from a table.
Here’s the SQL:
SELECT * FROM df
And here’s the Pandas equivalent:
df
All you need to do is call the DataFrame in Pandas to return the whole table and all its columns.
You may also want to just look at a small subset of your table as a quick check before writing a more complicated query. In SQL, you’d use LIMIT 10 or something similar to get only a select number of rows. In Pandas, similarly, you can call df.head(10) or df.tail(10) to get the first or last 10 rows of the table.
Querying a table without null values
To add to our initial select query, in addition to just limiting the number of rows, you would put conditions to filter the table inside a WHERE clause in SQL. For example, if you wanted all rows in the table without any null values in the Order_ID
column, the SQL would look like this:
SELECT * FROM df WHERE Order_ID IS NOT NULL
In Pandas, you have two options:
# Option 1
df.dropna(subset="Order_ID")

# Option 2
df.loc[df["Order_ID"].notna()]
Now, the table we get back doesn’t have any null values in the Order_ID
column (which you can compare to the first output above). Both options will return a table without the null values, but they work slightly differently.
You can use the native dropna
method in Pandas to return the DataFrame without any null rows, specifying in the subset
parameter which columns you’d like to drop nulls from.
Alternatively, the loc method lets you pass a mask, a boolean Series you can use to filter the DataFrame. Here, we pass df["Order_ID"].notna(), which, if you called it on its own, would return a Series of True and False values mapping to the original DataFrame rows according to whether the Order_ID is null. When we pass it to the loc method, it instead returns the DataFrame where df["Order_ID"].notna() evaluates to True (so all rows where the Order_ID column isn’t null).
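To see the mask idea on its own (a quick sketch using the toy df from earlier), you can print the boolean Series before handing it to loc:
# The mask on its own: one True/False value per row of the DataFrame.
mask = df["Order_ID"].notna()
print(mask)

# Passing the mask to .loc keeps only the rows where it evaluates to True.
df.loc[mask]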
Querying specific columns from a table
Next, instead of selecting all columns from the table, let’s instead select just a few specific columns. In SQL, you’d write the column names in the SELECT part of the query like this:
SELECT Order_ID, Product, Quantity_Ordered FROM df
In Pandas, we’d write the code like this:
df[["Order_ID", "Product", "Quantity_Ordered"]]
To select a specific subset of columns, you can pass a list of the column names into the DataFrame in Pandas. You can also define the list separately like this for clarity:
target_cols = ["Order_ID", "Product", "Quantity_Ordered"]
df[target_cols]
Assigning a list of target columns that you can then pass into the DataFrame can make it a little easier to work with a table over time, when you need to make changes to your code. For example, you could have a function return the columns you need as a list, or append and remove columns from the list as needed, depending on what kind of output the user needs.
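As a minimal sketch of that idea (the function name get_target_cols is purely illustrative, not something from Pandas):
def get_target_cols(include_address: bool = False) -> list[str]:
    # Base columns every query needs.
    cols = ["Order_ID", "Product", "Quantity_Ordered"]
    # Extend the selection depending on what kind of output the user needs.
    if include_address:
        cols.append("Purchase_Address")
    return cols

df[get_target_cols()]
df[get_target_cols(include_address=True)]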
The GROUP BY in SQL and Pandas
We can now move on to aggregating data. In SQL, we do this by putting the column we want to group on in both the SELECT and GROUP BY clauses, and then wrapping another column in an aggregate measure like COUNT in the SELECT clause. As an example, doing so will let us group all the individual Order_ID rows in the original table for each Product and count how many there are. The query can look like this:
SELECT
Product,
COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
GROUP BY Product
In Pandas, it would look like this:
df[df["Order_ID"].notna()].groupby(["Product"])["Order_ID"].count()
The output is a Pandas Series where the table is grouped by the products and there’s a count of all the Order_ID values for each product. In addition to our previous query in Pandas where we included a filter, we now do three things:
- Add groupby and pass a column (or list of columns) that you want to group the DataFrame on;
- Pass the name of the column in square brackets on the raw grouped DataFrame;
- Call the count (or any other aggregate) method to perform the aggregation on the DataFrame for the target column.
For better readability, we can assign the condition to a variable (this will come in handy later) and format the query so it’s easier to read.
condition = df["Order_ID"].notna()
grouped_df = (
df.loc[condition]
.groupby("Product")
["Order_ID"] # select column to count
.count()
)
grouped_df
Now that we have most of the components of a complete SQL query, let’s take a look at a more complicated one and see what it would look like in Pandas.
SELECT
Product,
COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
AND Purchase_Address LIKE '%Los Angeles%'
AND Quantity_Ordered = 1
GROUP BY Product
ORDER BY COUNT(Order_ID) DESC
Here, we add a little to our previous query by including multiple filter conditions as well as an ORDER BY so that the table returned in our query is sorted by the measure we’re aggregating on. Since there are a few more components to this query, let’s take a look step by step at how we’d implement this in Pandas.
First, instead of passing multiple conditions when we call the loc
method, let’s instead define a list of conditions and assign them to a variable FILTER_CONDITIONS
.
FILTER_CONDITIONS = [
df["Order_ID"].notna(),
df["Purchase_Address"].str.contains("Los Angeles"),
df["Quantity_Ordered"] == "1",
]
As before, a condition passed into loc should be a Pandas mask, a boolean Series of True and False values. It’s possible to pass multiple conditions to loc, but the syntax should look like this:
df.loc[condition_1 & condition_2 & condition_3]
However, just passing a list of conditions like this won’t work:
df.loc[FILTER_CONDITIONS]
# doesn't work -> you can't just pass a list into loc
You’ll get an error if you try the above because each condition should be separated by the &
operator for “and” conditions (or the |
operator if you need “or” conditions). Instead, we can write some quick code to return the conditions in the correct format. We’ll make use of the functools.reduce
function to put the conditions together.
If you want to see what this looks like in a notebook, try combining some strings using the reduce function:
from functools import reduce

reduce(lambda x, y: f"{x} & {y}", ["condition_1", "condition_2", "condition_3"])
This outputs a string like this:
>>> 'condition_1 & condition_2 & condition_3'
Going back to our actual Pandas conditions, we can write this instead (without the string formatting and just using our defined list of conditions in the FILTER_CONDITIONS
variable).
reduce(lambda x, y: x & y, FILTER_CONDITIONS)
What reduce does is apply a function cumulatively to the elements of an iterable; in our case, it runs the lambda function over the items in our FILTER_CONDITIONS list and combines each of them with the & operator. This continues until there are no conditions left, so for all three conditions it would effectively return:
df["Order_ID"].notna() & df["Purchase_Address"].str.contains("Los Angeles") & (df["Quantity_Ordered"] == "1")
(Note the parentheses around the comparison: & binds more tightly than ==, so they’re needed if you write the combined expression out by hand. Since the conditions in FILTER_CONDITIONS are already evaluated Series, reduce doesn’t run into this issue.)
Finally, let’s add the list of conditions to create a final group by query in Pandas:
final_df = (
df
.loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
.groupby("Product")
.size()
.sort_values(ascending=False)
)
You’ll notice two additional differences from the previous query:
- Instead of specifying the specific column to count on, we can simply call the size method, which returns the number of rows in the DataFrame (this gives the same result as before, since every Order_ID value was unique and represented exactly one row when we counted on it);
- There are a few different ways to do the ORDER BY in Pandas; one way is to simply call sort_values and pass ascending=False to sort in descending order (see the sketch after this list for another option).
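For instance, another option (a sketch, not from the original article) is to call nlargest on the grouped Series, which sorts descending and truncates to the top n in one step:
top_products = (
    df.loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    .size()
    .nlargest(5)  # keep only the 5 largest counts, already sorted in descending order
)
top_products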
If you wanted to use the previous syntax for aggregating the data, it would look like this:
final_df = (
df
.loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
.groupby("Product")
["Order_ID"].count()
.sort_values(ascending=False)
)
The output of both methods will be the same as before, which is a Series with the column you’re grouping on and the counts for each product.
If you instead wanted to output a DataFrame, you can call the reset_index method on the Series to get back the original column names: the column you grouped on and the column you’re aggregating on (in this case, we grouped on “Product” and are counting “Order_ID”).
final_df.reset_index()
And there we have it! All the components of a full SQL query but finally written in Pandas. Some of the things we can do further to optimize this process for working with data over time include:
- Putting the different lists of columns to SELECT or GROUP BY into their own variables or functions (so you or a user can modify them over time);
- Moving the logic that combines the list of filter conditions into its own function, so the end user doesn’t need to puzzle over what the reduce logic is doing (see the sketch below);
- Renaming the output column (or columns, if we’re aggregating on multiple) after calling reset_index, for clarity, for example to “Count_Order_ID”.
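Here’s a minimal sketch of those last two points (the helper name combine_conditions and the “Count_Order_ID” label are illustrative, not part of any library):
from functools import reduce
import pandas as pd

def combine_conditions(conditions: list[pd.Series]) -> pd.Series:
    # Hide the reduce logic behind a small helper: AND all the boolean masks together.
    return reduce(lambda x, y: x & y, conditions)

final_df = (
    df.loc[combine_conditions(FILTER_CONDITIONS)]
    .groupby("Product")
    ["Order_ID"].count()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"Order_ID": "Count_Order_ID"})  # clearer name for the aggregated column
)
final_df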