06.12.2017       Выпуск 207 (04.12.2017 - 10.12.2017)       Статьи

Django 2.0: оконные выражения (Window expressions) Django

Читать>>




Экспериментальная функция:

Ниже вы видите текст статьи по ссылке. По нему можно быстро понять ссылка достойна прочтения или нет

Просим обратить внимание, что текст по ссылке и здесь может не совпадать.

So what is an over clause?

An over clause is of this format

SELECT depname, empno, salary,
  avg(salary)
    OVER (PARTITION BY depname)
FROM empsalary;

Compare this to a similar GROUP BY statement

SELECT depname, avg(salary)
FROM empsalary
GROUP BY depname;

The difference is a GROUP BY has as many rows as grouping elements, here number of depname. An over clause adds the the aggregated result to each row of the select.

Postgres documentation says, "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result." This is true for all other DB implementation as well.

What are real world uses of over expressions?

We will use the Django ORM with the Window expression to to some analysis on the most prolific committers to Django. To do this we will export the commiter names and time of commit to a csv.

git log  --no-merges --date=iso --pretty=format:'%h|%an|%aI' > commits.iso.csv

This is not ranking of Django developers, just of their number of commits, which allows us an interestig dataset. I am grateful to everyone who has contributed to Django - they have made my life immesureably better.

With some light data wrangling using Pandas, we transform this to a per author, per year data and import to Postgres. Our table structure looks like this.

experiments=# \d commits_by_year;
   Table "public.commits_by_year"
    Column     |  Type   | Modifiers
---------------+---------+-----------
 id            | bigint  |
 author        | text    |
 commit_year   | integer |
 commits_count | integer |

We define a model to interact with this table.

from django.db import models


class Committer(models.Model):
    author = models.CharField(max_length=100)
    commit_year = models.PositiveIntegerField()
    commits_count = models.PositiveIntegerField()

    class Meta:
        db_table = 'commits_by_year'

Lets quickly test if our data is imported. You can get a csv from here, and import to Postgres to follow along.

In [2]: Committer.objects.all().count()
Out[2]: 2318

Let us setup our environment and get the imports we need.

## Some ORM imports which we are going to need

from django.db.models import Avg, F, Window
from django.db.models.functions import  Rank, DenseRank, CumeDist
from django_commits.models import Committer

# We will use pandas to display the queryset in tanular format
import pandas
pandas.options.display.max_rows=20

# An utility function to display querysets
def as_table(values_queryset):
    return pandas.DataFrame(list(values_queryset))

Lets quickly look at the data we have.

as_table(Committer.objects.all().values(
  "author", "commit_year", "commits_count"
))
authorcommit_yearcommits_count
0Tim Graham2017373
1Sergey Fedoseev2017158
2Mariusz Felisiak2017113
3Claude Paroz2017102
4Mads Jensen201755
5Simon Charette201740
6Jon Dufresne201733
7Anton Samarchyan201727
8François Freitag201717
9Srinivas Reddy Thatiparthy201714
............
2308Malcolm Tredinnick2006175
2309Georg Bauer200690
2310Russell Keith-Magee200686
2311Jacob Kaplan-Moss200683
2312Luke Plant200620
2313Wilson Miner200612
2314Adrian Holovaty20051015
2315Jacob Kaplan-Moss2005130
2316Georg Bauer2005112
2317Wilson Miner200520

2318 rows × 3 columns

We will now use the Window expression to get the contributors ranked by number of commits, within each year. We will go over the code in detail, but lets look at the queryset and results.

# Find out who have been the most prolific contributors
# in the years 2010-2017

dense_rank_by_year = Window(
    expression=DenseRank(),
    partition_by=F("commit_year"),
    order_by=F("commits_count").desc()
)

commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        the_rank=dense_rank_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "the_rank"
    )
as_table(commiters_with_rank)
authorcommit_yearcommits_countthe_rank
0Tim Graham20173731
1Sergey Fedoseev20171582
2Mariusz Felisiak20171133
3Claude Paroz20171024
4Mads Jensen2017555
5Simon Charette2017406
6Jon Dufresne2017337
7Anton Samarchyan2017278
8François Freitag2017179
9Srinivas Reddy Thatiparthy20171410
...............
171Joseph Kocherhans20105311
172Ramiro Morales20105311
173Jacob Kaplan-Moss20104212
174Chris Beaven20102913
175Malcolm Tredinnick20102614
176Honza Král20102015
177Carl Meyer20101716
178Ian Kelly20101716
179Simon Meers20101117
180Gary Wilson Jr20101018

181 rows × 4 columns

Lets look a the the ORM code in more detail here.

# We are creating the Window function part of our SQL query here
dense_rank_by_year = Window(
    # We want to get the Rank with no gaps
    expression=DenseRank(),
    # We want to partition the queryset on commit_year
    # Each distinct commit_year is a different partition
    partition_by=F("commit_year"),
    # This decides the ordering within each partition
    order_by=F("commits_count").desc()
)


commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    # Standard filter oprtation, limit rows to 2010-2017
    ).annotate(
    # For each commiter, we are annotating its rank
        the_rank=dense_rank_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "the_rank"
    )
as_table(commiters_with_rank)

Now lets try getting the average commits per commiter for each year along with the other data.

avg_commits_per_year = Window(
    # We want the average of commits per committer, with each partition
    expression=Avg("commits_count"),
    # Each individual year is a partition.
    partition_by=F("commit_year")
)

commiters_with_yearly_average = Committer.objects.filter().annotate(
      avg_commit_per_year=avg_commits_per_year
    ).values(
        "author", "commit_year", "commits_count", "avg_commit_per_year"
    )
# We could have done further operation with avg_commit_per_year
# Eg: F(commits_count) - F(avg_commit_per_year),
# would tell us committers who commit more than average
as_table(commiters_with_yearly_average)

This gives us

authoravg_commit_per_yearcommit_yearcommits_count
0Wilson Miner319.250000200520
1Adrian Holovaty319.25000020051015
2Jacob Kaplan-Moss319.2500002005130
3Georg Bauer319.2500002005112
4Russell Keith-Magee188.571429200686
5Jacob Kaplan-Moss188.571429200683
6Luke Plant188.571429200620
7Wilson Miner188.571429200612
8Adrian Holovaty188.5714292006854
9Malcolm Tredinnick188.5714292006175
...............
2308Adam Johnson4.916084201713
2309Tom4.916084201713
2310Srinivas Reddy Thatiparthy4.916084201714
2311François Freitag4.916084201717
2312Anton Samarchyan4.916084201727
2313Jon Dufresne4.916084201733
2314Simon Charette4.916084201740
2315Mads Jensen4.916084201755
2316Claude Paroz4.9160842017102
2317Mariusz Felisiak4.9160842017113

2318 rows × 4 columns

You could try other Window functions such as CumeDist, Rank or Ntile.

from django.db.models.functions import CumeDist
cumedist_by_year = Window(
    expression=CumeDist(),
    partition_by=F("commit_year"),
    order_by=F("commits_count").desc()
)

commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        cumedist_by_year=cumedist_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "cumedist_by_year"
    )
as_table(commiters_with_rank)

Until now, we have partitioned on commit_year. We can partition on other fields too. We will partition on author to find out how their contributions have changed over the years using the Lag window expression.

from django.db.models.functions import Lag
from django.db.models import Value
commits_in_previous_year = Window(
    expression=Lag("commits_count", default=Value(0)),
    partition_by=F("author"),
    order_by=F("commit_year").asc(),
)

commiters_with_pervious_year_commit = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        commits_in_previous_year=commits_in_previous_year
    ).order_by(
        "author", "-commit_year"
    ).values(
        "author", "commit_year", "commits_count", "commits_in_previous_year"
    )
as_table(commiters_with_pervious_year_commit)
authorcommit_yearcommits_countcommits_in_previous_year
0Adam Chainz20164212
1Adam Chainz2015120
2Adam Johnson2017130
3Adrian Holovaty20124098
4Adrian Holovaty20119872
5Adrian Holovaty2010720
6Akshesh2016310
7Alasdair Nicol20161319
8Alasdair Nicol20151917
9Alasdair Nicol2013170
...............
171Timo Graham20121370
172Timo Graham20117060
173Timo Graham2010600
174Tom2017130
175Unai Zalakain2013170
176Vajrasky Kok2013140
177areski2014150
178eltronix2016100
179wrwrwr2014210
180Łukasz Langa2013150

181 rows × 4 columns

I hope this tutorial has been helpful in understanding the window expressions. While still not as felxible as SqlAlchemy, Django ORM has become extremely powerful with recent Django releases. Stay tuned for more advanced ORM tutorials.



Лучшая Python рассылка




Разместим вашу рекламу

Пиши: mail@pythondigest.ru

Нашли опечатку?

Выделите фрагмент и отправьте нажатием Ctrl+Enter.

Система Orphus