04.07.2019       Выпуск 289 (01.07.2019 - 07.07.2019)       Статьи

Метод apply в pandas

Читать>>




Экспериментальная функция:

Ниже вы видите текст статьи по ссылке. По нему можно быстро понять ссылка достойна прочтения или нет

Просим обратить внимание, что текст по ссылке и здесь может не совпадать.

Filtering a dataframe

Filtering….

Pandas make filtering and subsetting dataframes pretty easy. You can filter and subset dataframes using normal operators and &,
,~

operators.

# Single condition: dataframe with all movies rated greater than 8df_gt_8 = df[df['Rating']>8]# Multiple conditions: AND - dataframe with all movies rated greater than 8 and having more than 100000 votesAnd_df = df[(df['Rating']>8) & (df['Votes']>100000)]# Multiple conditions: OR - dataframe with all movies rated greater than 8 or having a metascore more than 90Or_df = df[(df['Rating']>8) 
(df['Metascore']>80)]

# Multiple conditions: NOT - dataframe with all emovies rated greater than 8 or having a metascore more than 90 have to be excludedNot_df = df[~((df['Rating']>8)
(df['Metascore']>80))]

Pretty simple stuff.

But sometimes we may need to do complex filtering operations.

And sometimes we need to do some operations which we won’t be able to do using just the above format.

For instance: Let us say we want to filter those rows where the number of words in the movie title is greater than or equal to than 4.

How would you do it?

Trying the below will give you an error. Apparently, you cannot do anything as simple as split with a series.

new_df = df[len(df['Title'].split(" "))>=4]
-------------------------------------------
AttributeError: 'Series' object has no attribute 'split'


One way is to first create a column which contains no of words in the title using apply and then filter on that column.

#create a new column
df['num_words_title'] = df.apply(lambda x : len(x['Title'].split(" ")),axis=1)

#simple filter on new column
new_df = df[df['num_words_title']>=4]

And that is a perfectly fine way as long as you don’t have to create a lot of columns. But, I prefer this:

new_df = df[df.apply(lambda x : len(x['Title'].split(" "))>=4,axis=1)]

What I did here is that my apply function returns a boolean which can be used to filter.

Now once you understand that you just have to create a column of booleans to filter, you can use any function/logic in your apply statement to get however complex a logic you want to build.

Let us see another example. I will try to do something a little complex to just show the structure.

We want to find movies for which the revenue is less than the average revenue for that particular year?

year_revenue_dict = df.groupby(['Year']).agg({'Rev_M':np.mean}).to_dict()['Rev_M']def bool_provider(revenue, year):
return revenue<year_revenue_dict[year]

new_df = df[df.apply(lambda x : bool_provider(x['Rev_M'],x['Year']),axis=1)]



We have a function here which we can use to write any logic. That provides a lot of power for advanced filtering as long as we can play with simple variables.






Разместим вашу рекламу

Пиши: mail@pythondigest.ru

Нашли опечатку?

Выделите фрагмент и отправьте нажатием Ctrl+Enter.

Система Orphus