com/redactor/raw/2020-10-11_12-50-03-484dad1e2ae6699a90c68800f1a24ca9.png">

To visualise this data in a bar chart we can use the plotly express (our px) bar() function:

bar = px.bar(x = top10_category.index, # index = category name
             y = top10_category.values)

bar.show()

Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is. First, we have to group all our apps by category and sum the number of installations:

category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

Then we can create a horizontal bar chart, simply by adding the orientation parameter:

h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h')

h_bar.show()

We can also add a custom title and axis labels like so:

h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h',
               title='Category Popularity')

h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

Now we see that Games and Tools are actually the most popular categories. If we plot the popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. Do few apps have most of the downloads or are the downloads spread out over many apps?

Challenge

As a challenge, lets use plotly to create a scatter plot that looks like this:

Create a DataFrame that has the number of apps in one column and the number of installs in another:

Then use the plotly express examples from the documentation alongside the .scatter() API reference to create scatter plot that looks like the chart above.

Hint: Use the size, hover_name and color parameters in .scatter(). To scale the y-axis, call .update_layout() and specify that the y-axis should be on a log-scale like so: yaxis=dict(type='log')

Solution: Create a scatter plot with Plotly

First, we need to work out the number of apps in each category (similar to what we did previously).

cat_number = df_apps_clean.groupby('Category').agg({'App': pd.Series.count})

Then we can use .merge() and combine the two DataFrames.

cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)

Now we can create the chart. Note that we can pass in an entire DataFrame and specify which columns should be used for the x and y by column name.

scatter = px.scatter(cat_merged_df, # data
                    x='App', # column name
                    y='Installs',
                    title='Category Concentration',
                    size='App',
                    hover_name=cat_merged_df.index,
                    color='Installs')

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))

scatter.show()

What we see is that the categories like Family, Tools, and Game have many different apps sharing a high number of downloads. But for the categories like video players and entertainment, all the downloads are concentrated in very few apps.