One of the most common tasks you’ll encounter when analyzing Polars data is the need to summarize it. You can use the Polars .group_by()
method to create groupings based on column values. A related summarization technique is aggregation, where you take a sequence of related values and condense them into a single value.
By the end of this tutorial, you’ll understand that:
- You can summarize data using aggregation.
- You can use .filter() to view specific data.
- Using .group_by() allows you to summarize one or more columns of your data.
- Your time series data can be grouped using .group_by_dynamic().
- You can match summarized data with the original data using window functions.
- Pivot tables allow you to group and aggregate rows and columns of data.
In this tutorial, you’ll learn how to group data in several ways using the .group_by()
method in Polars and apply aggregation to each group.
Before you begin, you should be familiar with the basics of working with Polars DataFrames or another DataFrame library like pandas.
Note: If you’re familiar with Polars, you’ll know that in addition to DataFrames, Polars also supports LazyFrames. While this is an efficient tool when working with large datasets, and you’re certainly encouraged to learn how to use them, DataFrames are sufficient for the examples in this tutorial.
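In case you're curious, here's a minimal sketch of the lazy pattern, assuming the math.parquet file you'll download below. You scan the file instead of reading it eagerly, and nothing runs until you call .collect():

>>> import polars as pl
>>> (
...     pl.scan_parquet("math.parquet")  # Returns a LazyFrame; nothing is read yet
...     .select(pl.col("absences").max())  # Builds a query plan
...     .collect()  # Executes the plan and returns a DataFrame
... )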
In addition, you may find it helpful to use Jupyter Notebook as you work through many of the examples in this tutorial. Alternatively, JupyterLab will enhance your notebook experience, but any Python environment you’re comfortable with will work just fine.
To get started, you’ll need some data. For the central part of this tutorial, you’ll use the student performance dataset freely available from the UC Irvine Machine Learning Repository. The data in its original form comes in two files: student-mat.csv
and student-por.csv
. These contain data on students taking math and Portuguese language courses.
Both files have been altered to include student identification numbers and the subject indicators M
and P
, formatted as math.parquet
and portuguese.parquet
. All data has also been merged into a single course.parquet
file. These three files are available in your download bundle, which you can access by clicking the link below:
Get Your Code: Click here to download the free sample code that you’ll use to learn about grouping data with Polars .group_by().
The table below shows the fields used in this tutorial:
| Field Name | Description |
|---|---|
| absences | Number of absences |
| age | Student's age |
| failures | Number of failures |
| G1, G2, G3 | First, second, and final grade |
| internet | Student has home internet access |
| reason | Reason for taking course |
| school | School attended |
| sex | Student's gender (M, F) |
| student_id | Student's enrollment number |
| subject | Subject studied (M, P) |
The dataset is quite extensive, containing over thirty fields. If you want to experiment, the student.txt
file defines every field. This file is also available as part of the tutorial downloads.
To use Polars, you first need to install the Polars library into your Python environment. To install Polars from the command prompt, use the following command:
$ python -m pip install polars
In a Jupyter Notebook, the command is !python -m pip install polars
.
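To confirm that the installation worked, you can check the installed version from the Python REPL. The exact version number you see will depend on when you install:

>>> import polars as pl
>>> print(pl.__version__)  # Prints your installed Polars version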
With everything set up, it’s time to start using Polars to aggregate and group data.
Take the Quiz: Test your knowledge with our interactive “How to Group Data Using Polars .group_by()” quiz. You’ll receive a score upon completion to help you track your learning progress.
Aggregating Data
Whenever you want to aggregate data without grouping it, you apply an aggregation function directly to the column or columns you want to analyze. More often than not, you’ll also want to group your aggregated data. You’ll learn how to do that later—for now, you’ll focus on aggregation.
Learning the Basics
Suppose you wanted to find the highest number of absences among students in the math class. You can do this as follows:
>>> import polars as pl
>>> math_students = pl.read_parquet("math.parquet")
>>> (
... math_students
... .select(pl.col("absences").max())
... )
shape: (1, 1)
┌──────────┐
│ absences │
│ --- │
│ i64 │
╞══════════╡
│ 75 │
└──────────┘
After installing the Polars library earlier, you import
it into your code with the conventional alias pl
to access its functionality. This allows you to use the read_parquet()
function to read the contents of math.parquet
into a Polars DataFrame.
With the data now inside a DataFrame, you use its .select()
method to select the specific columns you want to work with. In this example, you need data from the absences
column, so you specify it using the pl.col("absences")
expression. You then use .max()
to locate the maximum absences
value in the column. As you can see, the highest number of absences is 75
.
Note: The .select()
method is also known as a context. A context is the name Polars uses for a method you pass a Polars expression to. The same expression will produce different results depending on the context you pass it to.
For example, passing pl.col("student_id")
to the .select()
context will include the student_id
column in the result, whereas using it in the .filter()
context will help define which values of student_id
you’ll see in the result.
The contexts you’ll use in this tutorial are .select()
, .filter()
, .group_by()
, and .with_columns()
.
Don’t confuse Polars contexts with the more widespread Python context manager. They’re not the same thing.
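As a quick, minimal illustration of this idea, the sketch below passes the same Boolean expression to two different contexts. The threshold of 10 is arbitrary, and the output is omitted here:

>>> expr = pl.col("absences") > 10
>>> math_students.select(expr)  # One Boolean column: is each value over 10?
>>> math_students.filter(expr)  # Only the rows where the absences value exceeds 10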
Although the previous example illustrates that aggregating data produces a single value, it doesn’t provide any background about where that value came from. Suppose, for example, you wanted to know which student, or students, have the worst attendance. You need to know more about where this 75
came from. To do this, you filter the DataFrame:
>>> (
... math_students
... .select(pl.col("student_id", "absences"))
... .filter(pl.col("absences") == pl.col("absences").max())
... )
shape: (1, 2)
┌────────────┬──────────┐
│ student_id ┆ absences │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════════╪══════════╡
│ 10277 ┆ 75 │
└────────────┴──────────┘
This time, you use .select()
to select both the student_id
and absences
columns. Although you’re most interested in the student_id
value or values, displaying the associated absences
column also shows the absence figures themselves.
To filter the DataFrame so that only information relating to the highest number of absences is displayed, you use .filter()
. As a filter condition, you pass in pl.col("absences") == pl.col("absences").max()
. This returns only the student identifier and absences value for those students whose number of absences is equal to the maximum among all math students. In this case, only the student with ID 10277
meets the filter criteria.
Note: Although the code you’ve written to select the maximum value in the absences
column is correct, it’s not the only way you’ll see computations like this written.
For example, pl.max("absences")
will do the same thing. This code is syntactic sugar for pl.col("absences").max()
to make it look as though you’re using a Polars function, when you’re actually calling a method on an expression object.
It’s also possible to use an attribute syntax. For example, pl.col.absences
creates an expression equal to pl.col("absences")
.
In this tutorial, you’ll use the pl.col("absences").max()
syntax form, which is more technically correct.
Next, you’ll gain an appreciation for some of the different aggregation calculations that Polars supports.
Investigating Other Aggregation Techniques
In the examples so far, you’ve limited your data aggregations to working out maximum values using the .max()
method of a Polars expression. But Polars supports a range of methods that can be used on expressions. It’s also possible to mix and match them within the same .select()
:
>>> (
... math_students
... .select(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... )
... )
shape: (1, 3)
┌─────┬─────┬──────────┐
│ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════╡
│ 0 ┆ 75 ┆ 5.708861 │
└─────┴─────┴──────────┘
Here, you’ve displayed the lowest, highest, and average absences of math students using the expression methods shown.
This time, you passed each expression to .select()
by keyword. The keyword names the column containing the result. For example, you use mean=pl.col("absences").mean()
to determine the average of absences
and assign it to the mean
column in the output. You can use any word you like as the keyword. It doesn’t have to match the method name, but it’s best to choose a name that’s descriptive of the column’s data.
Keywords are necessary in this example because each column name must be unique. The output columns would be named after the data they contain if you didn’t use keyword arguments. So the result of pl.col("absences").mean()
would be placed into a column named absences
. Without min
, max
, and mean
, Polars would raise a DuplicateError
because absences
would be used for more than one column name.
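As an aside, the keyword style is equivalent to naming each expression explicitly with .alias(). Here's a sketch of the same query in that style:

>>> (
...     math_students
...     .select(
...         pl.col("absences").min().alias("min"),
...         pl.col("absences").max().alias("max"),
...         pl.col("absences").mean().alias("mean"),
...     )
... )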
Note: It’s also possible to aggregate rows of data. Suppose you wanted to find the total and mean of each student’s exam results:
>>> math_students.select(
... "student_id", "G1", "G2", "G3",
... total=pl.sum_horizontal("G1", "G2", "G3"),
... mean=pl.mean_horizontal("G1", "G2", "G3"),
... )
shape: (395, 6)
┌────────────┬─────┬─────┬─────┬───────┬───────────┐
│ student_id ┆ G1 ┆ G2 ┆ G3 ┆ total ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 │
╞════════════╪═════╪═════╪═════╪═══════╪═══════════╡
│ 10001 ┆ 5 ┆ 6 ┆ 6 ┆ 17 ┆ 5.666667 │
│ 10002 ┆ 5 ┆ 5 ┆ 6 ┆ 16 ┆ 5.333333 │
│ 10003 ┆ 7 ┆ 8 ┆ 10 ┆ 25 ┆ 8.333333 │
│ 10004 ┆ 15 ┆ 14 ┆ 15 ┆ 44 ┆ 14.666667 │
│ 10005 ┆ 6 ┆ 10 ┆ 10 ┆ 26 ┆ 8.666667 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 10391 ┆ 9 ┆ 9 ┆ 9 ┆ 27 ┆ 9.0 │
│ 10392 ┆ 14 ┆ 16 ┆ 16 ┆ 46 ┆ 15.333333 │
│ 10393 ┆ 10 ┆ 8 ┆ 7 ┆ 25 ┆ 8.333333 │
│ 10394 ┆ 11 ┆ 12 ┆ 10 ┆ 33 ┆ 11.0 │
│ 10395 ┆ 8 ┆ 9 ┆ 9 ┆ 26 ┆ 8.666667 │
└────────────┴─────┴─────┴─────┴───────┴───────────┘
To compute the sum and mean of student grades, you use the sum_horizontal()
and mean_horizontal()
functions and pass the resulting expressions to .select().
To avoid column name clashes, you must pass both expressions as keyword arguments to clearly define the output column headers. For clarity, you also display the columns containing the original grade column data used in each calculation. The total
and mean
columns show the sum and average of the data in columns G1
, G2
, and G3
.
Just before you move on, it’s time to consolidate your learning.
Trying Out Your Data Aggregation Skills
The G3
column in math.parquet
contains the students’ final grades.
What’s the median, most frequent value, and variance of the G3
results?
One possible solution could be:
┌───────────────┬──────────────────────┬───────────┐
│ median_result ┆ most_frequent_result ┆ variance │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ f64 │
╞═══════════════╪══════════════════════╪═══════════╡
│ 11.0 ┆ 10 ┆ 20.989616 │
└───────────────┴──────────────────────┴───────────┘
You’ll find the code that produced this result in the solutions.ipynb
file in your downloads.
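If you want to compare approaches before opening the solutions, one possible sketch is shown below. It's untested here and may differ from the downloadable solution, but the .median(), .mode(), and .var() expression methods are the natural fit:

>>> math_students.select(
...     median_result=pl.col("G3").median(),
...     most_frequent_result=pl.col("G3").mode().first(),  # .mode() can return several values
...     variance=pl.col("G3").var(),
... )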
Now that you’ve been introduced to aggregating data, you’ll focus on grouping it.
Grouping Aggregated Data With Polars .group_by()
When you need to group data, you often perform the grouping before aggregation. To group data, you split it into separate categories and then aggregate each one. In this section, you’ll start with a basic example, then you’ll move to a more complex example that highlights the real power of the Polars expression syntax.
Learning the Basics
Suppose you want to find the minimum, maximum, and average number of absences for students in the math and Portuguese courses. One way would be to analyze both the math.parquet
and portuguese.parquet
files individually:
>>> import polars as pl
>>> math_students = pl.read_parquet("math.parquet")
>>> portuguese_students = pl.read_parquet("portuguese.parquet")
>>> (
... math_students
... .select(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... )
... )
shape: (1, 3)
┌─────┬─────┬──────────┐
│ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════╡
│ 0 ┆ 75 ┆ 5.708861 │
└─────┴─────┴──────────┘
>>> (
... portuguese_students
... .select(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... )
... )
shape: (1, 3)
┌─────┬─────┬──────────┐
│ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════╡
│ 0 ┆ 32 ┆ 3.659476 │
└─────┴─────┴──────────┘
This code is very similar to the earlier example, except this time, you’ve applied it to the contents of both your math.parquet
and portuguese.parquet
files. You’ll recognize the first set of results from before, but the additional analysis reveals higher absenteeism in mathematics than in Portuguese. However, some keen students in both classes have never missed a lesson.
While running the same code multiple times certainly gives you the correct answer, it only works because each file contains data for a single subject. Suppose that instead, the data for both classes was presented to you in a single file. The above code would still work, but the results you’d see would be those from the combined data. To analyze this composite file for each of the subjects within the file, you can use .group_by()
:
>>> all_students = pl.read_parquet("course.parquet")
>>> (
... all_students
... .group_by("subject")
... .agg(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... )
... )
shape: (2, 4)
┌─────────┬─────┬─────┬──────────┐
│ subject ┆ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═════╪═════╪══════════╡
│ M ┆ 0 ┆ 75 ┆ 5.708861 │
│ P ┆ 0 ┆ 32 ┆ 3.659476 │
└─────────┴─────┴─────┴──────────┘
This time, you’ve computed both result sets using .group_by()
. When you pass subject
into .group_by()
, Polars returns a GroupBy
object that groups the data according to the different values in the subject
column. There will be two groups: one for math students, denoted by the letter M
, and another for Portuguese students, denoted by P
.
Note: Polars .group_by()
doesn’t care about the order of the data it encounters. You can place the M
and P
data rows in any order you please within course.parquet
, and the output of the code will be the same. There’s no need to sort the data by subject
beforehand.
Once you have your GroupBy
object, you can use its .agg()
method to perform an aggregation. In this case, because you want the minimum, maximum, and average values for each of the two groups, you pass in the same expressions as before, again with keywords to avoid name conflicts.
This time, your results show you’ve applied .min()
, .max()
, and .mean()
to each of the two groups separately and not to the complete dataset.
Understanding the Power of Expressions
One of the powerful features of passing expressions into .agg()
is that they’re calculated based on the groups defined within the GroupBy
object created with .group_by()
. As long as you can formulate your aggregation into an expression, you can use it in .agg()
and apply it to groupings. This makes writing complex grouped aggregations more logical.
For example, suppose you wanted to analyze attendance a little deeper. You want to find out how many math students with above-average absences for their age still managed to pass the subject. A student has passed if their overall grade—G3
—is 13
or higher.
To begin with, you need to group the G3
data by age
. You decide to write some intermediate code to see how this looks:
>>> math_students = pl.read_parquet("math.parquet")
>>> (
... math_students
... .group_by("age")
... .agg(
... passes=pl.col("G3"),
... )
... )
shape: (8, 2)
┌─────┬───────────────┐
│ age ┆ passes │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═════╪═══════════════╡
│ 17 ┆ [6, 6, … 16] │
│ 21 ┆ [7] │
│ 20 ┆ [18, 15, 9] │
│ 15 ┆ [10, 15, … 7] │
│ 16 ┆ [10, 15, … 8] │
│ 22 ┆ [8] │
│ 19 ┆ [9, 0, … 9] │
│ 18 ┆ [6, 0, … 10] │
└─────┴───────────────┘
To group the math students by their age, you pass age
into .group_by()
. To specify the data you want aggregated, you pass the G3
column into .agg()
, since it contains the students’ overall grades. The aggregated data for each value in the age
column is output as a list because you haven’t specified any aggregation computations to be carried out.
Currently, the passes
column contains all the aggregated data. Since you’re only interested in passing students with above-average absences for their age, you’ll need to add some filters:
>>> (
... math_students
... .group_by("age")
... .agg(
... passes=pl.col("G3").filter(
... pl.col("absences") > pl.col("absences").mean(),
... pl.col("G3") >= 13,
... )
... )
... )
shape: (8, 2)
┌─────┬────────────────┐
│ age ┆ passes │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═════╪════════════════╡
│ 18 ┆ [14, 18, … 13] │
│ 20 ┆ [] │
│ 16 ┆ [15, 14, … 18] │
│ 15 ┆ [15, 13, … 15] │
│ 22 ┆ [] │
│ 21 ┆ [] │
│ 19 ┆ [13, 13, … 13] │
│ 17 ┆ [13, 18, … 15] │
└─────┴────────────────┘
The filters you apply ensure that only students with above-average absences and a G3
score of 13
or higher are included in the passes
column. As you can see, the lists now reflect this. However, the specific marks don’t interest you. Instead, you want to see a count of the successes. You can use the .count()
method for this:
>>> (
... math_students
... .group_by("age")
... .agg(
... passes=pl.col("G3").filter(
... pl.col("absences") > pl.col("absences").mean(),
... pl.col("G3") >= 13
... ).count(),
... poor_attenders=pl.col("G3").filter(
... pl.col("absences") > pl.col("absences").mean()
... ).count(),
... )
... .select(
... pl.col("age", "passes", "poor_attenders"),
... percentage=pl.col("passes") / pl.col("poor_attenders") * 100,
... )
... .with_columns(
... pl.col("percentage").replace(float("NaN"), 0)
... ).sort("age")
... )
shape: (8, 4)
┌─────┬────────┬────────────────┬────────────┐
│ age ┆ passes ┆ poor_attenders ┆ percentage │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 ┆ f64 │
╞═════╪════════╪════════════════╪════════════╡
│ 15 ┆ 15 ┆ 32 ┆ 46.875 │
│ 16 ┆ 11 ┆ 39 ┆ 28.205128 │
│ 17 ┆ 8 ┆ 29 ┆ 27.586207 │
│ 18 ┆ 11 ┆ 31 ┆ 35.483871 │
│ 19 ┆ 4 ┆ 10 ┆ 40.0 │
│ 20 ┆ 0 ┆ 1 ┆ 0.0 │
│ 21 ┆ 0 ┆ 0 ┆ 0.0 │
│ 22 ┆ 0 ┆ 0 ┆ 0.0 │
└─────┴────────┴────────────────┴────────────┘
In this final version, .count() collapses each filtered list in passes into the number of students it contains. Next, you add a column named poor_attenders to the aggregation. This column contains a count of all students whose absences were above the average for their age.
In addition to the age
, passes
, and poor_attenders
columns, you also define a column named percentage
that contains, for each age group, the percentage of poor attenders who passed. You use .select()
to make sure all four of these columns are displayed.
When poor_attenders
is zero, your percentage
column displays NaN
. To deal with this unsightly data, you pass pl.col("percentage").replace(float("NaN"), 0)
into .with_columns()
. This causes the original column to be replaced with a new one, also named percentage
, but without the NaN
values.
Finally, you use .sort()
to sort the rows by age. By default, the sort is in ascending order. If you wanted to reverse this, you’d add descending=True
.
You also conclude that students with a high number of absences don’t perform well on their exams. While many may have valid reasons for missing class and might be working independently, the overall pass rates are low.
Note: On rare occasions, you may want to display your output so that the order of the data groupings is the same as in the original data. To do this, you can pass maintain_order=True
to .group_by()
.
In the previous example, removing .sort()
and using .group_by("age", maintain_order=True)
would display the grouped data in its original age order as opposed to sorting it. By default, maintain_order
is set to False
because setting it to True
causes a performance hit. Fortunately, this isn’t something you’ll use very often, but it’s available if you need it.
In addition, setting maintain_order=True
also blocks the possibility of using streaming, which may cause resource problems if your dataset is large.
Now that you’ve practiced grouping by a single column, you’ll learn how to group multiple columns of data.
Grouping and Aggregating by Multiple Columns
You’ve grouped your data using a single column in the examples so far. While this is certainly a useful thing to do, Polars allows you to group data by multiple columns. This is often referred to as sub-grouping and can allow you to make more sense of the results.
Performing a Data Sub-Grouping
Suppose you want to find the lowest, highest, and average absence figures for both subjects again, but this time you want to analyze the students’ reasons for choosing the course. To do this, you can pass both columns into .group_by()
in a Python list:
>>> import polars as pl
>>> all_students = pl.read_parquet("course.parquet")
>>> (
... all_students
... .group_by(["subject", "reason"])
... .agg(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... ).sort("subject")
... )
shape: (8, 5)
┌─────────┬────────────┬─────┬─────┬──────────┐
│ subject ┆ reason ┆ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪════════════╪═════╪═════╪══════════╡
│ M ┆ course ┆ 0 ┆ 23 ┆ 3.972414 │
│ M ┆ other ┆ 0 ┆ 20 ┆ 5.611111 │
│ M ┆ home ┆ 0 ┆ 75 ┆ 7.146789 │
│ M ┆ reputation ┆ 0 ┆ 56 ┆ 6.647619 │
│ P ┆ home ┆ 0 ┆ 30 ┆ 4.456376 │
│ P ┆ other ┆ 0 ┆ 16 ┆ 2.777778 │
│ P ┆ reputation ┆ 0 ┆ 32 ┆ 3.811189 │
│ P ┆ course ┆ 0 ┆ 26 ┆ 3.389474 │
└─────────┴────────────┴─────┴─────┴──────────┘
This time, the data is first grouped by subject
and then, within each subject, it’s sub-grouped by reason
. Finally, the various calculations are performed for each subject-reason combination.
Interestingly, the highest mean absences for both subjects correspond to those students who chose the course because it was delivered closest to their home.
If you look closely at the output, you might notice that the values within the reason
column are in a different order for each subject. To address this, you can do some sub-sorting:
>>> (
... all_students
... .group_by(["subject", "reason"])
... .agg(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... ).sort("subject", "reason")
... )
shape: (8, 5)
┌─────────┬────────────┬─────┬─────┬──────────┐
│ subject ┆ reason ┆ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪════════════╪═════╪═════╪══════════╡
│ M ┆ course ┆ 0 ┆ 23 ┆ 3.972414 │
│ M ┆ home ┆ 0 ┆ 75 ┆ 7.146789 │
│ M ┆ other ┆ 0 ┆ 20 ┆ 5.611111 │
│ M ┆ reputation ┆ 0 ┆ 56 ┆ 6.647619 │
│ P ┆ course ┆ 0 ┆ 26 ┆ 3.389474 │
│ P ┆ home ┆ 0 ┆ 30 ┆ 4.456376 │
│ P ┆ other ┆ 0 ┆ 16 ┆ 2.777778 │
│ P ┆ reputation ┆ 0 ┆ 32 ┆ 3.811189 │
└─────────┴────────────┴─────┴─────┴──────────┘
This time, the data are sorted first by subject
, and then by reason
within each subject. To do this, you pass both the subject
and reason
columns into .sort()
. If you wanted to sort them the other way around, then you’d pass the columns in the reverse order.
Tweaking Your Sorting
By default, sorting is done in ascending order. In other words, alphabetically and in order from lowest to highest numerically. Suppose you wanted the various reasons to be sorted in descending order. You can do this by chaining two .sort()
calls together:
>>> (
... all_students
... .group_by(["subject", "reason"])
... .agg(
... min=pl.col("absences").min(),
... max=pl.col("absences").max(),
... mean=pl.col("absences").mean(),
... )
... .sort("reason", descending=True)
... .sort("subject")
... )
shape: (8, 5)
┌─────────┬────────────┬─────┬─────┬──────────┐
│ subject ┆ reason ┆ min ┆ max ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪════════════╪═════╪═════╪══════════╡
│ M ┆ reputation ┆ 0 ┆ 56 ┆ 6.647619 │
│ M ┆ other ┆ 0 ┆ 20 ┆ 5.611111 │
│ M ┆ home ┆ 0 ┆ 75 ┆ 7.146789 │
│ M ┆ course ┆ 0 ┆ 23 ┆ 3.972414 │
│ P ┆ reputation ┆ 0 ┆ 32 ┆ 3.811189 │
│ P ┆ other ┆ 0 ┆ 16 ┆ 2.777778 │
│ P ┆ home ┆ 0 ┆ 30 ┆ 4.456376 │
│ P ┆ course ┆ 0 ┆ 26 ┆ 3.389474 │
└─────────┴────────────┴─────┴─────┴──────────┘
When adding in chained calls to .sort()
, you need to take care with their order. To make sure you sort by reason within subject, .sort("reason")
must be called before .sort("subject")
. This is the reverse of when you passed both columns to a single .sort()
earlier. To sort the reason
column’s data in descending order, you must pass descending=True
to it as well.
As the updated output shows, the subject
column is in the same ascending order as before, but the reason
column is now in descending order within each of the two subjects.
Just before you move on, it’s time for another round of learning consolidation.
Trying Out Your Data Grouping and Aggregation Skills
The internet
column in course.parquet
indicates if the student has home internet access. You’ve been asked to investigate how many students taking each subject have access and if this affects their pass rates. Remember, if their G3
score is 13
or higher, a student has passed.
Display details of the total students, number of passes, and percentage of passes for each subject for those who either have home internet access or don’t.
As an extra challenge, see if you can format the percentages using the percent (%
) sign.
One possible solution could be:
┌─────────┬──────────┬───────┬────────┬────────────┐
│ subject ┆ internet ┆ total ┆ passes ┆ percentage │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 ┆ u32 ┆ str │
╞═════════╪══════════╪═══════╪════════╪════════════╡
│ M ┆ no ┆ 66 ┆ 14 ┆ 21.21% │
│ M ┆ yes ┆ 329 ┆ 117 ┆ 35.56% │
│ P ┆ no ┆ 151 ┆ 45 ┆ 29.8% │
│ P ┆ yes ┆ 498 ┆ 231 ┆ 46.39% │
└─────────┴──────────┴───────┴────────┴────────────┘
You’ll find the code that produced this result in the solutions.ipynb
file in your downloads.
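One untested sketch that should get you close is shown below. It may differ from the downloadable solution, and pl.format() is just one way to attach the % sign:

>>> (
...     all_students
...     .group_by("subject", "internet")
...     .agg(
...         total=pl.len(),  # Number of students in the group
...         passes=(pl.col("G3") >= 13).sum(),  # Count of passing students
...     )
...     .with_columns(
...         percentage=pl.format(
...             "{}%", (pl.col("passes") / pl.col("total") * 100).round(2)
...         )
...     )
...     .sort("subject", "internet")
... )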
Next, you’ll learn a little about grouping time series data.
Grouping and Aggregating Time Series With .group_by_dynamic()
Time series analysis is a major part of everyday data analysis. Polars has its own grouping method, .group_by_dynamic(),
which is dedicated to grouping data based on time values.
In this section, you’ll use a file named math_classes.parquet
that contains details of absence rates for a six-month semester of algebra, geometry, and calculus classes. The file is included in the downloadable materials and contains the following fields:
| Field | Description |
|---|---|
| class_start | Start date and time of the class |
| class_subject | Subject name: algebra, geometry, or calculus |
| absences | Number of student absences |
| lecturer_initials | Initials of the lecturer teaching the class |
Before you begin, you decide to take a look at a part of the data:
>>> import polars as pl
>>> math_attendance = pl.read_parquet("math_classes.parquet")
>>> math_attendance.head()
shape: (5, 4)
┌─────────────────────┬───────────────┬──────────┬───────────────────┐
│ class_start ┆ class_subject ┆ absences ┆ lecturer_initials │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ str │
╞═════════════════════╪═══════════════╪══════════╪═══════════════════╡
│ 2024-01-03 09:00:00 ┆ algebra ┆ 3 ┆ DH │
│ 2024-01-04 13:30:00 ┆ geometry ┆ 4 ┆ PS │
│ 2024-01-05 10:00:00 ┆ calculus ┆ 3 ┆ LM │
│ 2024-01-10 09:00:00 ┆ algebra ┆ 2 ┆ DH │
│ 2024-01-11 13:30:00 ┆ geometry ┆ 7 ┆ PS │
└─────────────────────┴───────────────┴──────────┴───────────────────┘
The file contains a row for each weekly class in the three mathematical subjects. Each class runs once a week. The semester starts in January and continues through to the end of June. You used math_attendance.head()
to view only the first five rows.
Performing a Time-Series Grouping
Suppose you’re reviewing the absences from the course. You’d like to see the total and mean absences for each week. You do this using .group_by_dynamic()
:
>>> (
... math_attendance
... .group_by_dynamic(
... index_column="class_start",
... every="1w",
... closed="both",
... )
... .agg(
... total_absences=pl.col("absences").sum(),
... mean_absences=pl.col("absences").mean(),
... )
... )
shape: (26, 3)
┌─────────────────────┬────────────────┬───────────────┐
│ class_start ┆ total_absences ┆ mean_absences │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ f64 │
╞═════════════════════╪════════════════╪═══════════════╡
│ 2024-01-01 00:00:00 ┆ 10 ┆ 3.333333 │
│ 2024-01-08 00:00:00 ┆ 10 ┆ 3.333333 │
│ 2024-01-15 00:00:00 ┆ 11 ┆ 3.666667 │
│ 2024-01-22 00:00:00 ┆ 10 ┆ 3.333333 │
│ 2024-01-29 00:00:00 ┆ 4 ┆ 1.333333 │
│ … ┆ … ┆ … │
│ 2024-05-27 00:00:00 ┆ 15 ┆ 5.0 │
│ 2024-06-03 00:00:00 ┆ 12 ┆ 4.0 │
│ 2024-06-10 00:00:00 ┆ 17 ┆ 5.666667 │
│ 2024-06-17 00:00:00 ┆ 6 ┆ 2.0 │
│ 2024-06-24 00:00:00 ┆ 14 ┆ 4.666667 │
└─────────────────────┴────────────────┴───────────────┘
The .group_by_dynamic() method allows you to group data based on a time value. In this example, you use it to calculate the total and mean weekly absences. Here’s a breakdown of the arguments you used:
- First, you set index_column="class_start" to tell .group_by_dynamic() to use this column for grouping.
- Next, you pass every="1w" to group the class_start values into weekly intervals. The "1w" string starts each week on Monday by default.
- The final parameter you pass is closed="both" to ensure that both the first and last weeks are included in the output.
- The expressions you pass to .agg() are similar to the ones you used earlier. Once again, you define the column headings for the results using keywords to avoid naming conflicts and make the results easy to read.
Your output contains a row for each week’s data. Each row of the total_absences
column contains the total absences for each week, while the mean_absences
column contains their average.
Suppose you want to dig deeper and find the total absences for each quarter, but group the data by each subject. The code you use to do this is as follows:
>>> (
... math_attendance
... .group_by_dynamic(
... index_column="class_start",
... every="1q",
... closed="both",
... group_by="class_subject",
... )
... .agg(pl.col("absences").sum())
... )
shape: (6, 3)
┌───────────────┬─────────────────────┬──────────┐
│ class_subject ┆ class_start ┆ absences │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ i64 │
╞═══════════════╪═════════════════════╪══════════╡
│ algebra ┆ 2024-01-01 00:00:00 ┆ 56 │
│ algebra ┆ 2024-04-01 00:00:00 ┆ 44 │
│ geometry ┆ 2024-01-01 00:00:00 ┆ 35 │
│ geometry ┆ 2024-04-01 00:00:00 ┆ 40 │
│ calculus ┆ 2024-01-01 00:00:00 ┆ 41 │
│ calculus ┆ 2024-04-01 00:00:00 ┆ 39 │
└───────────────┴─────────────────────┴──────────┘
Your code here isn’t too different from the previous version. You assign 1q
to the every
parameter to group the output into quarterly intervals.
To specify that you want to group by each subject within each quarter, you assign class_subject
to the group_by
parameter of .group_by_dynamic()
. You also pass the pl.col("absences").sum()
expression to .agg()
.
This time, your output contains a separate row for each quarter for each of the three subjects. The first and second quarters start on January 1st and April 1st, respectively. The final column shows the total absences for each subject and quarter.
Before you move on, it’s time for some more practice.
Trying Out Your Time Series Analysis Skills
Using the data in math_classes.parquet
, find each lecturer’s monthly average student absences
.
One possible solution could be:
┌───────────────────┬─────────────────────┬──────────┐
│ lecturer_initials ┆ class_start ┆ absences │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ f64 │
╞═══════════════════╪═════════════════════╪══════════╡
│ DH ┆ 2024-01-01 00:00:00 ┆ 4.0 │
│ DH ┆ 2024-02-01 00:00:00 ┆ 3.4 │
│ DH ┆ 2024-03-01 00:00:00 ┆ 4.25 │
│ DH ┆ 2024-04-01 00:00:00 ┆ 2.5 │
│ DH ┆ 2024-05-01 00:00:00 ┆ 2.0 │
│ … ┆ … ┆ … │
│ LM ┆ 2024-02-01 00:00:00 ┆ 3.0 │
│ LM ┆ 2024-03-01 00:00:00 ┆ 3.4 │
│ LM ┆ 2024-04-01 00:00:00 ┆ 3.8 │
│ LM ┆ 2024-05-01 00:00:00 ┆ 3.666667 │
│ LM ┆ 2024-06-01 00:00:00 ┆ 4.6 │
└───────────────────┴─────────────────────┴──────────┘
You’ll find the code that produced this result in the solutions.ipynb
file in your downloads.
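One possible untested sketch groups by lecturer within monthly windows. The "1mo" interval string is Polars' monthly step, and the final .sort() only tidies the output:

>>> (
...     math_attendance
...     .group_by_dynamic(
...         index_column="class_start",
...         every="1mo",  # Monthly intervals
...         group_by="lecturer_initials",  # Sub-group each month by lecturer
...     )
...     .agg(pl.col("absences").mean())
...     .sort("lecturer_initials", "class_start")
... )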
Next up, you’ll learn about Polars window functions.
Grouping and Aggregating Using Window Functions
Window functions allow you to perform more complex aggregations within .select()
.
In the grouping examples you’ve seen, the resulting output contained a single row for each data grouping. For example, previously, you grouped the records by reason within subject, so your result contained a separate row for each subject-reason combination.
Window functions also perform aggregations over groups. However, they apply their results to each of the original rows within the group. So, the resulting DataFrame will contain the same number of rows as the original, but with an aggregated value assigned to each row.
Working With Window Functions
To use a window function, you use an aggregation function just like you did before. This time, however, you apply the function to the column expression within .select()
, and then use .over()
to apply the results over each group member.
This is best understood by example. To gauge student motivation, suppose you want to see how the actual absence figure for each student compares to the mean value for each reason within school within subject. This is where window functions can help:
>>> all_students = pl.read_parquet("course.parquet")
>>> all_students.select(
... pl.col("subject", "school", "student_id", "reason", "absences"),
... mean_absences=(
... pl.col("absences")
... .mean()
... .over("subject", "school", "reason")
... )
... )
shape: (1_044, 6)
┌─────────┬────────┬────────────┬────────┬──────────┬───────────────┐
│ subject ┆ school ┆ student_id ┆ reason ┆ absences ┆ mean_absences │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 ┆ f64 │
╞═════════╪════════╪════════════╪════════╪══════════╪═══════════════╡
│ M ┆ GP ┆ 10001 ┆ course ┆ 6 ┆ 4.104839 │
│ M ┆ GP ┆ 10002 ┆ course ┆ 4 ┆ 4.104839 │
│ M ┆ GP ┆ 10003 ┆ other ┆ 10 ┆ 6.518519 │
│ M ┆ GP ┆ 10004 ┆ home ┆ 2 ┆ 7.397959 │
│ M ┆ GP ┆ 10005 ┆ home ┆ 4 ┆ 7.397959 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ P ┆ MS ┆ 10645 ┆ course ┆ 4 ┆ 2.627119 │
│ P ┆ MS ┆ 10646 ┆ course ┆ 4 ┆ 2.627119 │
│ P ┆ MS ┆ 10647 ┆ course ┆ 6 ┆ 2.627119 │
│ P ┆ MS ┆ 10648 ┆ course ┆ 6 ┆ 2.627119 │
│ P ┆ MS ┆ 10649 ┆ course ┆ 4 ┆ 2.627119 │
└─────────┴────────┴────────────┴────────┴──────────┴───────────────┘
You created your new column using pl.col("absences").mean()
to indicate that you want the mean of the original absences. This time, instead of using it within .agg()
, you call .over("subject", "school", "reason")
to calculate the mean of each reason within school within subject grouping, and to apply the mean value to each row of the associated original data. In other words, you’ve applied the corresponding mean value to each student’s row.
For example, look at the mean_absences figures for students 10004 and 10005 in the output above. Both rows contain 7.397959 because this is the mean absence value corresponding to the M-GP-home combination, the grouping that both students belong to.
Similarly, the bottom five rows in the output also contain identical mean absences because they all belong to the same grouping. In this case, it’s P
-MS
-course
.
Looking at the output, you notice that some students—such as student 10001
—have a higher-than-average number of absences compared to others who are taking the same course, at the same school, for the same reason. If you’re only interested in students with above-average absences, then you can filter by the new column:
>>> all_students.select(
... pl.col("subject", "school", "student_id", "reason", "absences"),
... mean_absences=(
... pl.col("absences")
... .mean()
... .over("subject", "school", "reason")
... )
... ).filter(pl.col("absences") > pl.col("mean_absences"))
shape: (381, 6)
┌─────────┬────────┬────────────┬────────────┬──────────┬───────────────┐
│ subject ┆ school ┆ student_id ┆ reason ┆ absences ┆ mean_absences │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 ┆ f64 │
╞═════════╪════════╪════════════╪════════════╪══════════╪═══════════════╡
│ M ┆ GP ┆ 10001 ┆ course ┆ 6 ┆ 4.104839 │
│ M ┆ GP ┆ 10003 ┆ other ┆ 10 ┆ 6.518519 │
│ M ┆ GP ┆ 10006 ┆ reputation ┆ 10 ┆ 6.72 │
│ M ┆ GP ┆ 10019 ┆ course ┆ 16 ┆ 4.104839 │
│ M ┆ GP ┆ 10026 ┆ home ┆ 14 ┆ 7.397959 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ P ┆ MS ┆ 10645 ┆ course ┆ 4 ┆ 2.627119 │
│ P ┆ MS ┆ 10646 ┆ course ┆ 4 ┆ 2.627119 │
│ P ┆ MS ┆ 10647 ┆ course ┆ 6 ┆ 2.627119 │
│ P ┆ MS ┆ 10648 ┆ course ┆ 6 ┆ 2.627119 │
│ P ┆ MS ┆ 10649 ┆ course ┆ 4 ┆ 2.627119 │
└─────────┴────────┴────────────┴────────────┴──────────┴───────────────┘
The filter expression is similar to those you used earlier. You use pl.col("absences") > pl.col("mean_absences")
to make sure only those absences
greater than the mean_absences
are included in the results.
Before moving on, it’s time for yet another exercise:
Trying Out Your Window Function Skills
Using the course.parquet
data, work out the mean values of the first (G1
) and second (G2
) period grades for each reason
within subject
. Your output should include the following columns, in this order: subject
, reason
, G1
, mean_G1
, G2
, and mean_G2
. Display only the rows where both G1
and G2
are above their group’s mean.
One possible solution could be:
┌─────────┬────────────┬─────┬───────────┬─────┬───────────┐
│ subject ┆ reason ┆ G1 ┆ mean_G1 ┆ G2 ┆ mean_G2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 ┆ i64 ┆ f64 │
╞═════════╪════════════╪═════╪═══════════╪═════╪═══════════╡
│ M ┆ home ┆ 15 ┆ 10.816514 ┆ 14 ┆ 10.743119 │
│ M ┆ reputation ┆ 15 ┆ 11.457143 ┆ 15 ┆ 11.257143 │
│ M ┆ home ┆ 12 ┆ 10.816514 ┆ 12 ┆ 10.743119 │
│ M ┆ home ┆ 16 ┆ 10.816514 ┆ 18 ┆ 10.743119 │
│ M ┆ home ┆ 14 ┆ 10.816514 ┆ 15 ┆ 10.743119 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ P ┆ home ┆ 17 ┆ 11.657718 ┆ 18 ┆ 11.785235 │
│ P ┆ home ┆ 14 ┆ 11.657718 ┆ 15 ┆ 11.785235 │
│ P ┆ other ┆ 14 ┆ 10.694444 ┆ 17 ┆ 10.777778 │
│ P ┆ course ┆ 15 ┆ 10.982456 ┆ 15 ┆ 11.147368 │
│ P ┆ course ┆ 11 ┆ 10.982456 ┆ 12 ┆ 11.147368 │
└─────────┴────────────┴─────┴───────────┴─────┴───────────┘
You’ll find the code that produced this result in the solutions.ipynb
file in your downloads.
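One untested sketch uses .over() twice, once per grade column. The official solution may be structured differently:

>>> (
...     all_students
...     .select(
...         pl.col("subject", "reason", "G1"),
...         mean_G1=pl.col("G1").mean().over("subject", "reason"),
...         G2=pl.col("G2"),
...         mean_G2=pl.col("G2").mean().over("subject", "reason"),
...     )
...     .filter(
...         pl.col("G1") > pl.col("mean_G1"),  # Both predicates must hold
...         pl.col("G2") > pl.col("mean_G2"),
...     )
... )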
The final section of your journey through aggregation and grouping will show you how to create pivot tables in Polars.
Grouping and Aggregating Using Pivot Tables
A pivot table is a data analysis tool that allows you to take columns of raw data from a DataFrame, summarize them, and then analyze the summary data. This summarization may include several statistical calculations, such as sums, averages, and so on, revealing insights previously hidden within the original data.
Note: Polars pivot tables will only work on DataFrames, not LazyFrames. This is because all data needs to be available to allow them to determine what columns are required in their output.
In this section, you’ll learn how pivot tables allow you to group and aggregate data.
Creating a Pivot Table
Suppose you wanted to analyze the average number of failures and absences for each subject at each school to see if there were any patterns between the two. One way of doing this would be to reuse the techniques you’ve already seen:
>>> import polars as pl
>>> all_students = pl.read_parquet("course.parquet")
>>> (
... all_students
... .group_by(["subject", "school"])
... .agg(
... mean_absence=pl.col("absences").mean(),
... mean_failure=pl.col("failures").mean(),
... )
... )
shape: (4, 4)
┌─────────┬────────┬──────────────┬──────────────┐
│ subject ┆ school ┆ mean_absence ┆ mean_failure │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞═════════╪════════╪══════════════╪══════════════╡
│ M ┆ GP ┆ 5.965616 ┆ 0.318052 │
│ M ┆ MS ┆ 3.76087 ┆ 0.456522 │
│ P ┆ MS ┆ 2.619469 ┆ 0.314159 │
│ P ┆ GP ┆ 4.21513 ┆ 0.172577 │
└─────────┴────────┴──────────────┴──────────────┘
The output shows the data grouped by school within each subject. For each grouping, the mean absence and failure figures are shown.
As an alternative, you could create a pivot table of the data using .pivot()
:
>>> (
... all_students.pivot(
... on="school",
... index="subject",
... values=["absences", "failures"],
... aggregate_function="mean"
... )
... )
shape: (2, 5)
┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ subject ┆ absences_GP ┆ absences_MS ┆ failures_GP ┆ failures_MS │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ M ┆ 5.965616 ┆ 3.76087 ┆ 0.318052 ┆ 0.456522 │
│ P ┆ 4.21513 ┆ 2.619469 ┆ 0.172577 ┆ 0.314159 │
└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘
To create your pivot table, you pass several parameters to .pivot()
. By passing on="school"
, you tell Polars to aggregate the data for each school. In this dataset, they’re annotated by GP
and MS
, and each school will have a separate set of results. Passing in a list of columns is possible if you want a finer-grained analysis. You’ll see how this works later.
The index
parameter defines the rows of the output. By passing subject
, you specify that you want each subject’s data in a separate row. You’ve created separate rows for math (M
) and Portuguese (P
). This is similar to the grouping functionality you saw earlier. Although not relevant here, it’s also possible to specify sub-groupings by passing in a list of columns to use.
The values
parameter defines the data columns on which you want to perform the aggregate calculation. In this case, you want to do so on both the absences
and failures
columns. This gives you results for both of these.
Finally, you need to define which aggregate calculation you wish to perform. In this case, you want to calculate mean values, which you do by setting aggregate_function
to "mean"
. One drawback with .pivot()
is that you can only pass a single aggregation function to the aggregate_function
parameter. If you use .agg()
, then you can pass in several aggregation functions as you saw earlier. However, you only need one here.
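If you ever do need two aggregations in pivoted form, one possible workaround (sketched here, untested) is to build two pivot tables and join them. The _max suffix below is an arbitrary choice:

>>> means = all_students.pivot(
...     on="school", index="subject",
...     values="absences", aggregate_function="mean",
... )
>>> maxes = all_students.pivot(
...     on="school", index="subject",
...     values="absences", aggregate_function="max",
... )
>>> means.join(maxes, on="subject", suffix="_max")  # Columns: GP, MS, GP_max, MS_max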
When you first look at your pivot table, you’ll notice that it has created some of its own column headings. As mentioned earlier, the main aggregated data comes from the original absences
and failures
columns. To present these aggregations for each school, Polars creates a column for each of them. That’s where the absences_GP
, absences_MS
, failures_GP
, and failures_MS
column headings come from.
Customizing Column Order
If you want to make the output more readable so that each school’s mean absence and failure figures are next to each other, you can use .select()
to define the order in which the columns are displayed:
>>> (
... all_students
... .pivot(
... on="school",
... index="subject",
... values=["absences", "failures"],
... aggregate_function="mean",
... )
... .select(
... pl.col(
... "subject",
... "absences_GP",
... "failures_GP",
... "absences_MS",
... "failures_MS",
... )
... )
... )
shape: (2, 5)
┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ subject ┆ absences_GP ┆ failures_GP ┆ absences_MS ┆ failures_MS │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ M ┆ 5.965616 ┆ 0.318052 ┆ 3.76087 ┆ 0.456522 │
│ P ┆ 4.21513 ┆ 0.172577 ┆ 2.619469 ┆ 0.314159 │
└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘
Although the data in each column is the same, you’ve arranged them so that the columns for each school appear next to one another.
The output shows the mean absences and failures for each subject and school. Interestingly, GP
school appears to fare slightly better than MS
. Although its absence rate is higher, its failure rate is lower. If any students are reading this, please don’t take it to mean that less attendance will lead to greater success. In practice, the opposite is most definitely true.
Digging Deeper With Your Analysis
Finally, you might want to sub-analyze your data to determine if it’s related to the reasons for attending the schools. This can be done by tweaking your pivot table:
>>> (
... all_students
... .pivot(
... on="school",
... index=["subject", "reason"],
... values=["absences", "failures"],
... aggregate_function="mean",
... )
... .select(
... pl.col(
... "subject",
... "reason",
... "absences_GP",
... "failures_GP",
... "absences_MS",
... "failures_MS",
... )
... )
... )
shape: (8, 6)
┌─────────┬────────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ subject ┆ reason ┆ absences_GP ┆ failures_GP ┆ absences_MS ┆ failures_MS │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪════════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ M ┆ course ┆ 4.104839 ┆ 0.33871 ┆ 3.190476 ┆ 0.52381 │
│ M ┆ other ┆ 6.518519 ┆ 0.333333 ┆ 2.888889 ┆ 0.222222 │
│ M ┆ home ┆ 7.397959 ┆ 0.357143 ┆ 4.909091 ┆ 0.636364 │
│ M ┆ reputation ┆ 6.72 ┆ 0.25 ┆ 5.2 ┆ 0.2 │
│ P ┆ course ┆ 3.928144 ┆ 0.245509 ┆ 2.627119 ┆ 0.449153 │
│ P ┆ other ┆ 3.407407 ┆ 0.185185 ┆ 2.4 ┆ 0.244444 │
│ P ┆ home ┆ 4.869565 ┆ 0.130435 ┆ 3.058824 ┆ 0.147059 │
│ P ┆ reputation ┆ 4.166667 ┆ 0.105263 ┆ 2.413793 ┆ 0.068966 │
└─────────┴────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
To perform the calculation by reason
within subject
, you pass both columns as a list to the index
parameter. You also adjust .select()
to include reason
. These small changes have made a significant difference to your output. This time, the results are broken down at a finer grain.
Again, it appears that, on average, more students seem to be failing math at MS
despite their better attendance. The same trend is broadly similar for Portuguese. This time, you can also see that students who chose a school because it was closest to their home have the highest average absence rates.
There’s time for one final exercise.
Trying Out Your Pivot Table Skills
Create a pivot table that contains rows for gender (sex
) within each subject
. For each of these combinations, calculate the average of grade G1
and the average of grade G2
for each school.
One possible solution could be:
┌─────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ subject ┆ sex ┆ G1_GP ┆ G1_MS ┆ G2_GP ┆ G2_MS │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ M ┆ F ┆ 10.579235 ┆ 10.92 ┆ 10.398907 ┆ 10.32 │
│ M ┆ M ┆ 11.337349 ┆ 10.380952 ┆ 11.204819 ┆ 10.047619 │
│ P ┆ F ┆ 12.28692 ┆ 10.582192 ┆ 12.50211 ┆ 10.719178 │
│ P ┆ M ┆ 11.602151 ┆ 9.7875 ┆ 11.688172 ┆ 10.0875 │
└─────────┴─────┴───────────┴───────────┴───────────┴───────────┘
You’ll find the code that produced this result in the solutions.ipynb
file in your downloads.
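One untested sketch is shown below. The G1_GP and similar column names follow the naming pattern you saw earlier, and the final .sort() is optional:

>>> (
...     all_students
...     .pivot(
...         on="school",
...         index=["subject", "sex"],  # Rows: gender within subject
...         values=["G1", "G2"],
...         aggregate_function="mean",
...     )
...     .select(
...         pl.col("subject", "sex", "G1_GP", "G1_MS", "G2_GP", "G2_MS")
...     )
...     .sort("subject", "sex")
... )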
Congratulations on completing this tutorial. You can now look forward to confidently aggregating and grouping your datasets to reveal previously hidden insights.
Conclusion
This tutorial showed you how to summarize data using various aggregation and grouping techniques. Now, you should have a solid understanding of how to use Polars .group_by()
to group data by column values.
You’ve also learned how to:
- Use simple aggregation to find summary values
- Filter data based on aggregated values
- Subdivide your data into groups and perform aggregations on each group
- Summarize time series data
- Use window functions to compare summarized and original data
- Perform more intense summarizations with pivot tables
While you’ve had a good overview of typical data aggregation and grouping techniques, you’re strongly encouraged to practice using your own datasets or those freely available on sites like Kaggle. It’s also worthwhile to keep an eye on the Polars documentation. Matt Harrison’s Effective Polars will give you many examples to play with.
Frequently Asked Questions
Now that you have some experience grouping data using Polars .group_by()
in Python, you can use the questions and answers below to check your understanding and recap what you’ve learned.
These FAQs are related to the most important concepts you’ve covered in this tutorial.

What does Polars .group_by() do?

You use Polars .group_by() to create groups based on unique values in a specified column. Once grouped, you can perform aggregations on each group independently to summarize your data.

Does Polars .group_by() preserve the order of your data?

Polars .group_by() doesn’t preserve the order of the data. The method processes data based on the unique values of the specified column, disregarding the original order.

How do you group time series data in Polars?

You handle time series data in Polars using .group_by_dynamic(), which lets you group data based on time intervals. You specify the time column and the interval, such as "1w" for weekly, to create time-based groupings for analysis.

What are window functions in Polars, and how do you use them?

Window functions in Polars allow you to perform aggregations over groups while retaining the original DataFrame’s structure. You use them by applying aggregation functions with .over(), which assigns the aggregated value to each row in the group.

How do you create pivot tables in Polars?

You create pivot tables in Polars using .pivot(), which summarizes data by aggregating specified columns over defined indexes and columns. You choose the aggregation function, such as "mean" or "sum," to analyze patterns in your data.