Python Data Project: Analyzing the App Stores

Identifying Profitable App Profiles for the Apple and Android Markets

Analysis by Lauren Holstein: January 2024

1. Project Goals
2. Exploring the Data Sources
3. Data Cleaning
- 3a. Removing Incorrect Data
- 3b. Removing Duplicate Entries
- 3c. Removing Non-English Apps
- 3d. Isolating Free Apps
4. Data Analysis
- 4a. Most Common App Genres
- 4b. Insights: Common App Genres
- 4c. Most Popular Apps by Genre
- 4d. Insights: Popular Apps by Genre

1. Project Goals

As an analyst at a company that creates free apps for the Android and Apple stores, my goal is to understand what types of apps attract the most users.

In this project, I will perform the following tasks:

- Analyze data from Google Play and the App Store
- Identify app usage trends
- Provide recommendations for increasing ad revenue through the company’s free apps

2. Exploring the Data Sources

Approximately 2 million iOS apps and 2.1 million Android apps were available through their respective stores as of September 2018. For this project, I will analyze a sample of apps from each store rather than each store’s entire catalog:

Google Play Store: A dataset containing information on approximately 10,000 Android apps from Google Play collected in August 2018.

Apple App Store: A dataset containing information on approximately 7,000 iOS apps from the App Store collected in July 2017.

As a first step, I will download the datasets and load them into Jupyter Notebook:

from csv import reader

# The Android dataset from the Google Play Store
# (utf8 is specified because some app names contain non-English characters)
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# The Apple dataset from the App Store
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]
apple = apple[1:]
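
Both files above are loaded with the same steps, so the steps could be wrapped in a small helper. The following is only a sketch: load_dataset() is a hypothetical name that does not appear in the original notebook, and the demo writes a tiny temporary CSV so that it runs without the real files.

```python
import csv
import os
import tempfile


def load_dataset(path, encoding='utf8'):
    """Read a CSV file and return (header, rows)."""
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Demo on a small temporary file (illustrative data, not from the real datasets)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nExample App,TOOLS\n')
    demo_path = tmp.name

demo_header, demo_rows = load_dataset(demo_path)
os.remove(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would look like android_header, android = load_dataset('googleplaystore.csv').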

To make the datasets easier to explore, I’ll use the function explore_data(), which prints a slice of rows with blank lines between them and, optionally, the number of rows and columns.

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Here is a glimpse at the headers and first three rows of each dataset:

# Android Dataset
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13

Android dataset summary: This dataset contains information about 10,841 apps. The attributes most relevant to this analysis include App, Category, Installs, Type, Price, and Genres.

# Apple Dataset
print(apple_header)
print('\n')
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16

Apple dataset summary: This dataset contains information about 7,197 apps. The attributes most relevant to this analysis include track_name, price, rating_count_tot, and prime_genre.

3. Data Cleaning

3a. Removing Incorrect Data

The Android dataset’s source page on Kaggle includes a discussion section, where several users have identified a problem with entry 10472.

As a first step to remove the incorrect data, I’ll print row 10472 to confirm an error exists:

print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

Comparing entry 10472 to the list’s headers and a correct row shows that the Category field for app 10472 is missing, which has shifted the remaining columns one position to the left:

print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

print(android[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

After confirming that entry 10472 contains an error, I will delete its row:

del android[10472]
# It is important to only run this code once, 
# otherwise it will delete the row 
# that now occupies index 10472.

print(android[10472])
# If only the faulty row was deleted,
# this code should print info on
# an app called osmino Wi-Fi: free WiFi.

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
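
A more general check is to flag any row whose length differs from the header’s, which is how a shifted row like entry 10472 (12 fields against 13 headers) would surface without relying on the Kaggle discussion. This is a minimal sketch: the helper name find_malformed_rows() and the sample rows are illustrative assumptions, not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Illustrative sample: the second row is missing one field
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]

bad_rows = find_malformed_rows(sample_rows, sample_header)
print(bad_rows)
```

On the real data, this would be called as find_malformed_rows(android, android_header) before any row is deleted.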

3b. Removing Duplicate Entries

Next, I’ll identify and remove duplicate entries from the dataset.

duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

The code above loops through every app in the Android list to identify duplicates based on app name; it found 1,181 duplicates.

print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Duplicate apps list sample: ', duplicate_apps[0:5])

Number of duplicate apps:  1181


Duplicate apps list sample:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
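
Checking membership in a list rescans the list on every lookup, which gets slower as the list grows. A sketch of the same duplicate count using a set for the "already seen" names is shown here; find_duplicates() and the sample names are hypothetical and used only for illustration.

```python
def find_duplicates(names):
    """Return repeated names, one entry per extra occurrence."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Illustrative sample: 'Box' appears three times, 'Google' twice
sample_names = ['Box', 'Google', 'Box', 'Slack', 'Box', 'Google']
sample_duplicates = find_duplicates(sample_names)
print('Number of duplicate names:', len(sample_duplicates))
print(sample_duplicates)
```

For the Android list, the input would be the app names, for example [app[0] for app in android].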

After identifying duplicate entries, I’ll need to determine the criteria for keeping or discarding duplicates.

Inspecting duplicates — for example, duplicates of the Google app — reveals that the entries are identical except for the number of reviews. This suggests that entries showing more reviews were collected more recently than entries with fewer reviews, which would make review count a decent criterion for inclusion or deletion from the dataset.

In other words, I want to include only the most up-to-date data possible.

for app in android:
    name = app[0]
    if name == 'Google':
        print(app)

['Google', 'TOOLS', '4.4', '8033493', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Google', 'TOOLS', '4.4', '8021623', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 3, 2018', 'Varies with device', 'Varies with device']

The following code creates a dictionary that maps each app name to its highest review count:

max_app_ratings = {}

for app in android:
    name = app[0]
    ratings_count = float(app[3])
    if name in max_app_ratings and ratings_count > max_app_ratings[name]:
        max_app_ratings[name] = ratings_count
    elif name not in max_app_ratings:
        max_app_ratings[name] = ratings_count

# Shows the first five entries in the dictionary
first_five = dict(list(max_app_ratings.items())[:5])
print('Max app ratings dictionary sample: ', first_five)
print('\n')
print('Max app ratings dictionary length: ', len(max_app_ratings))
print('Expected length: ', len(android)-len(duplicate_apps))

Max app ratings dictionary sample:  {'Photo Editor & Candy Camera & Grid & ScrapBook': 159.0, 'Coloring book moana': 974.0, 'U Launcher Lite – FREE Live Cool Themes, Hide Apps': 87510.0, 'Sketch - Draw & Paint': 215644.0, 'Pixel Draw - Number Art Coloring Book': 967.0}


Max app ratings dictionary length:  9659
Expected length:  9659

After confirming the max_app_ratings dictionary is the expected length, I’ll use it to remove duplicate rows from the Android dataset:

android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == max_app_ratings[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print('Android clean list length: ', len(android_clean))
print('\n')
print('Android clean list sample: ', android_clean[:3])
print('\n')
print('Already added list sample: ', already_added[:3])

Android clean list length:  9659


Android clean list sample:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


Already added list sample:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint']

I now have a list of Android apps cleaned of duplicates.

The Apple store dataset did not contain any duplicates.
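
The Android deduplication above uses two passes (build the dictionary of maximum review counts, then filter the rows). As a sketch, the same rule can be applied in one pass by storing the best row per name. keep_most_reviewed() is a hypothetical helper, and the sample rows are shortened for illustration; the two Google review counts are taken from the output above, while 'Example App' is made up.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dictionaries preserve insertion order, so apps appear in
    # the order their names were first seen.
    return list(best.values())


sample_rows = [
    ['Google', 'TOOLS', '4.4', '8021623'],
    ['Example App', 'TOOLS', '4.0', '120'],
    ['Google', 'TOOLS', '4.4', '8033493'],
]

deduplicated = keep_most_reviewed(sample_rows)
print(deduplicated)
```

Note that the row order can differ from android_clean, which appends each app at the position of its most-reviewed row rather than at the first occurrence of its name.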

3c. Removing Non-English Apps

My company develops apps for an English-speaking audience, so I’ll want to remove any non-English apps from the dataset before my analysis.

Both the Apple and Android datasets contain non-English apps, such as the examples below:

print(apple[813])
print(android_clean[4412])

['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']
['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up']

To find non-English apps, I’ll look for characters in an app’s name that are not commonly used in English text. Every character has a corresponding number (its Unicode code point), and the characters commonly used in English text fall within the ASCII (American Standard Code for Information Interchange) range of 0 to 127.

By looping over a name’s characters and using the ord() function to retrieve each one’s number, I can check whether every character is less than or equal to 127.
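
As a quick illustration of that check, the snippet below prints the number ord() returns for a few characters and whether it falls within the 0 to 127 range. The choice of sample characters is mine; they are taken from app names that appear elsewhere in this project.

```python
sample_characters = ['a', 'Z', '™', '爱', '😜']

# Map each character to its code point
code_points = {character: ord(character) for character in sample_characters}

for character, code_point in code_points.items():
    print(character, code_point, code_point <= 127)
```

Only the two Latin letters pass the check; the trademark symbol, the Chinese character, and the emoji all have numbers well above 127.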

The following function, eng_name(), takes in a string and returns False if it contains any character outside that range and True otherwise.

def eng_name(name):
    for character in name:
        if ord(character) > 127:
            return False
    return True
        
print('Instagram is an English name: ', eng_name('Instagram'))
print('爱奇艺PPS -《欢乐颂2》电视剧热播 is an English name: ', eng_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))

Instagram is an English name:  True
爱奇艺PPS -《欢乐颂2》电视剧热播 is an English name:  False

The limitation of this function is that it will incorrectly label names as non-English if they contain characters such as trademark symbols or emojis. See the following examples:

print('Docs To Go™ Free Office Suite is an English name: ', eng_name('Docs To Go™ Free Office Suite'))
print('Instachat 😜 is an English name: ', eng_name('Instachat 😜'))

Docs To Go™ Free Office Suite is an English name:  False
Instachat 😜 is an English name:  False

To decrease the number of misidentified English apps, I’ll add the criterion that a name must contain more than 3 non-English characters to be excluded from the dataset.

def eng_name(name):
    non_eng = 0
    for character in name:
        if ord(character) > 127:
            non_eng += 1
    if non_eng > 3:
        return False
    return True

print('Docs To Go™ Free Office Suite is an English name: ', eng_name('Docs To Go™ Free Office Suite'))
print('Instachat 😜 is an English name: ', eng_name('Instachat 😜'))
print('爱奇艺PPS -《欢乐颂2》电视剧热播 is an English name: ', eng_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))

Docs To Go™ Free Office Suite is an English name:  True
Instachat 😜 is an English name:  True
爱奇艺PPS -《欢乐颂2》电视剧热播 is an English name:  False

While not a perfect solution, the function now correctly identifies English apps that contain up to three emojis or other symbols.

I’ll clean the Android list first:

english_apps_android = []
non_eng_apps_android = []
for app in android_clean:
    name = app[0]
    if eng_name(name):
        english_apps_android.append(app)
    else:
        non_eng_apps_android.append(app)
print('English Android app list length: ', len(english_apps_android))   
print('\n')
names_eng_android = []
for app in english_apps_android:
    names_eng_android.append(app[0])
print('Sample of names on English Android app list: ', names_eng_android[:5])
print('\n')
names_non_eng_android = []
for app in non_eng_apps_android:
    names_non_eng_android.append(app[0])
print('Non-English Android app list length: ', len(non_eng_apps_android))
print('\n')
print('Sample of names on non-English Android app list: ', names_non_eng_android[:5])

English Android app list length:  9614


Sample of names on English Android app list:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instructions']


Non-English Android app list length:  45


Sample of names on non-English Android app list:  ['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости', 'صور حرف H', 'L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]']

The function separated 45 non-English Android apps from the deduplicated list, leaving 9,614 English apps.

Next, I’ll clean the Apple list:

english_apps_apple = []
non_eng_apps_apple = []
for app in apple:
    name = app[1]
    if eng_name(name):
        english_apps_apple.append(app)
    else:
        non_eng_apps_apple.append(app)
print('English Apple app list length: ', len(english_apps_apple))   
print('\n')
names_eng_apple = []
for app in english_apps_apple:
    names_eng_apple.append(app[1])
print('Sample of names on English Apple app list: ', names_eng_apple[:5])
print('\n')
names_non_eng_apple = []
for app in non_eng_apps_apple:
    names_non_eng_apple.append(app[1])
print('Non-English Apple app list length: ', len(non_eng_apps_apple))
print('\n')
print('Sample of names on non-English Apple app list: ', names_non_eng_apple[:5])

English Apple app list length:  6183


Sample of names on English Apple app list:  ['Facebook', 'Instagram', 'Clash of Clans', 'Temple Run', 'Pandora - Music & Radio']


Non-English Apple app list length:  1014


Sample of names on non-English Apple app list:  ['爱奇艺PPS -《欢乐颂2》电视剧热播', '聚力视频HD-人民的名义,跨界歌王全网热播', '优酷视频', '网易新闻 - 精选好内容，算出你的兴趣', '淘宝 - 随时随地，想淘就淘']

The function separated 1,014 non-English Apple apps from our list, leaving 6,183 English apps.

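3d. Isolating Free Apps

Because the company only builds free apps, the last cleaning step is to isolate the free apps in each dataset. Prices are stored as strings: '0' for free Android apps and '0.0' for free Apple apps.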

# Free Android Apps
free_android_apps = []
for app in english_apps_android:
    price = app[7]
    if price == '0':
        free_android_apps.append(app)

print('Free Android app list length: ', len(free_android_apps))
print('\n')
print('Sample of free Android apps: ', free_android_apps[:3])
print('\n')

# Free Apple Apps
free_apple_apps = []
for app in english_apps_apple:
    price = app[4]
    if price == '0.0':
        free_apple_apps.append(app)

print('Free Apple app list length: ', len(free_apple_apps))
print('\n')
print('Sample of free Apple apps: ', free_apple_apps[:3])

Free Android app list length:  8864


Sample of free Android apps:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


Free Apple app list length:  3222


Sample of free Apple apps:  [['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]

There are 8,864 free Android apps and 3,222 free Apple apps in the datasets.
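
Because the two datasets write a free price differently ('0' and '0.0'), a numeric comparison is one way to apply a single test to both. The sketch below is not part of the original notebook: is_free() is a hypothetical helper, and stripping a leading '$' assumes that paid Android prices are formatted like '$4.99'.

```python
def is_free(price):
    """Return True if a price string represents zero.

    Assumes prices look like '0', '0.0', '1.99' or '$4.99';
    any other format raises ValueError.
    """
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0


sample_prices = ['0', '0.0', '$4.99', '1.99']
free_flags = [is_free(price) for price in sample_prices]
print(free_flags)
```

An exact string comparison is stricter; the numeric version trades that strictness for tolerance of formatting differences.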

4. Data Analysis

The overall goal of this project is to determine what type of free apps will likely attract more users, which should increase ad revenue.

The company’s app validation strategy involves first building a minimal version of the app for the Android market and making it available on Google Play. If the app performs well in that market, the company continues to develop it. After six months, if the app is profitable, the company builds an iOS version and makes it available on the Apple App Store. Therefore, the end goal for any successful app is to add it to both the Android and Apple markets.

4a. Most Common App Genres

As a first step in my data analysis, I’ll determine the most common app genres in the Android and Apple markets.

As a reminder, here are the columns in each dataset:

Android: ‘App’, ‘Category’, ‘Rating’, ‘Reviews’, ‘Size’, ‘Installs’, ‘Type’, ‘Price’, ‘Content Rating’, ‘Genres’, ‘Last Updated’, ‘Current Ver’, ‘Android Ver’

Apple: ‘id’, ‘track_name’, ‘size_bytes’, ‘currency’, ‘price’, ‘rating_count_tot’, ‘rating_count_ver’, ‘user_rating’, ‘user_rating_ver’, ‘ver’, ‘cont_rating’, ‘prime_genre’, ‘sup_devices.num’, ‘ipadSc_urls.num’, ‘lang.num’, ‘vpp_lic’

The primary columns I’ll need for this analysis are the prime_genre column from the Apple dataset and the Genres and Category columns from the Android dataset.

I’ll make frequency tables showing the percentage of all apps that fit each genre/category using the freq_table() function. Then I’ll display them, sorted from most to least common, using the display_table() function.

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        item = row[index]
        total += 1
        if item in table:
            table[item] += 1
        else:
            table[item] = 1
    
    percent_table = {}
    for key in table:
        percent_table[key] = (table[key] / total) * 100.0
            
    return percent_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# Apple Prime_Genre Table
display_table(free_apple_apps, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

# Android Category Table
display_table(free_android_apps, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0.6430505415162455
COMICS : 0.6204873646209386
BEAUTY : 0.5979241877256317

# Android Genres Table
display_table(free_android_apps, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
... 
+85 more rows
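
As a cross-check on freq_table(), the same percentages can be computed with collections.Counter from the standard library. This is a sketch on made-up rows; freq_table_counter() is a hypothetical name, not a function from the original notebook.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Illustrative rows: [name, genre]
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

sample_table = freq_table_counter(sample_rows, 1)
for genre, percentage in sorted(sample_table.items(),
                                key=lambda kv: kv[1], reverse=True):
    print(genre, ':', percentage)
```

Calling freq_table_counter(free_apple_apps, 11) should give the same percentages as the Apple table above, provided the same cleaned list is used.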

4b. Insights: Common App Genres

Now that I’ve created frequency tables for genres in the Apple and Android app stores, I’ll explore the data for insights.

Apple

The Apple table shows that games make up the most common genre in the free English app market by a significant margin with 58.2% of all apps in that category. Entertainment apps make up the next largest genre with 7.9%, followed by Photo & Video with 5.0%, Education with 3.7%, and Social Networking with 3.3%.

Apps used for fun and entertainment (in genres like games, entertainment, photo & video, sports, music, etc.) appear to significantly outnumber apps used for practical or educational purposes (in genres like education, utilities, productivity, finance, weather, etc.).

Android

The Android dataset shows a different distribution; its genres and categories are broken into more segments than the Apple dataset. The Android dataset also shows a more balanced representation of fun and practical apps.

The Android category table shows that 18.9% of free English apps fit into the family category, which includes children’s games, learning tools, and other child-friendly apps. The next largest categories are games with 9.7%; tools with 8.5%; and business with 4.6% of free English apps.

The Android genre table, which is broken down into many subcategories, shows that 8.4% of free English apps fit the tools genre; 6.1% fit the entertainment genre; 5.3% fit the education genre; and 4.6% fit the business genre.

Takeaways

The analysis suggests that creating an entertainment app might best fit the Apple store, while either an entertainment or practical app would fit the Android store. However, it’s important to keep in mind that just because these genre categories contain the most apps, it does not mean these genres contain the most used apps. For this reason, app popularity based on the number of users will be the next subject of my analysis.

4c. Most Popular Apps by Genre

Apps only generate ad revenue if people use them, which is why app usage is an important consideration. For this analysis, I’ll look at the rating_count_tot column in the Apple dataset and the Installs column in the Android dataset. The Apple dataset has no install counts, so the total number of ratings serves as a proxy for usage.

Apple

# Average number of user ratings per genre
# Apple

apple_genre_table = freq_table(free_apple_apps, 11)
avg_rating_dict = {}

for genre in apple_genre_table:
    total = 0
    num_apps_genre = 0
    
    # Summing reviews and counting apps
    for app in free_apple_apps:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            num_apps_genre += 1
    avg_ratings = total / num_apps_genre
    avg_rating_dict[genre] = avg_ratings
    
# For easier readability, I'll sort the dictionary by values 
sorted_dict = sorted(avg_rating_dict.items(), key=lambda kv: 
                 kv[1], reverse = True)
print('Average number of reviews by genre:')
print('\n')
for row in sorted_dict:
    print(row[0], ": ", row[1])

Average number of reviews by genre:


Navigation :  86090.33333333333
Reference :  74942.11111111111
Social Networking :  71548.34905660378
Music :  57326.530303030304
Weather :  52279.892857142855
Book :  39758.5
Food & Drink :  33333.92307692308
Finance :  31467.944444444445
Photo & Video :  28441.54375
Travel :  28243.8
Shopping :  26919.690476190477
Health & Fitness :  23298.015384615384
Sports :  23008.898550724636
Games :  22788.6696905016
News :  21248.023255813954
Productivity :  21028.410714285714
Utilities :  18684.456790123455
Lifestyle :  16485.764705882353
Entertainment :  14029.830708661417
Business :  7491.117647058823
Education :  7003.983050847458
Catalogs :  4004.0
Medical :  612.0

Navigation is the genre with the highest average number of reviews. However, I want to take a closer look before drawing any conclusions. I’ll look at the number of reviews for each app in the navigation genre:

for app in free_apple_apps:
    if app[11] == 'Navigation':
        print(app[1], ": ", app[5])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5

On closer inspection, the navigation genre contains only 6 apps, and the genre’s average is skewed heavily by two widely used apps, Waze and Google Maps. There is a sharp dropoff in review count for apps 3 to 6. For these reasons, I would not recommend making a navigation app, despite the genre having the highest average number of reviews.
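
One way to quantify that skew is to compare the mean with the median, which is far less sensitive to a couple of very large values. The sketch below uses the six navigation review counts printed above; using the median here is my addition, not a step from the original analysis.

```python
from statistics import mean, median

# Review counts for the six navigation apps listed above
navigation_reviews = [345046, 154911, 12811, 3582, 187, 5]

navigation_mean = mean(navigation_reviews)
navigation_median = median(navigation_reviews)

print('Mean:', navigation_mean)
print('Median:', navigation_median)
```

The mean matches the genre average of roughly 86,090 reported earlier, while the median is 8,196.5, about a tenth of that figure.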

I’ll explore the review counts for other genres. I’ll limit each listing to the top ten apps in the genre so I can see which genres have sharp dropoffs in review numbers and which have a more even distribution.

count = 0
for app in free_apple_apps:
    if app[11] == 'Reference' and count < 10:
        print(app[1], ": ", app[5])
        count += 1

Bible :  985920
Dictionary.com Dictionary & Thesaurus :  200047
Dictionary.com Dictionary & Thesaurus for iPad :  54175
Google Translate :  26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran :  18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition :  17588
Merriam-Webster Dictionary :  16849
Night Sky :  12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) :  8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools :  4693

Like navigation, the reference genre has a few dominant apps with most of the reviews. By the 10th most popular app, the count has dropped to only a few thousand reviews. This suggests reference might be a difficult genre to compete in; only the top 5 or so apps in the space manage to attract a significant number of users.
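
The listings in this section take the first ten matching rows in dataset order, so they show the top ten only if the rows are already ordered by review count. A sketch that sorts explicitly is shown below. top_apps() is a hypothetical helper, and the sample rows are shortened to [name, reviews, genre] with values copied from the outputs above.

```python
def top_apps(dataset, genre, n=10, name_index=1, reviews_index=5, genre_index=11):
    """Return the n most-reviewed (name, review count) pairs in a genre."""
    in_genre = [row for row in dataset if row[genre_index] == genre]
    ranked = sorted(in_genre, key=lambda row: int(row[reviews_index]),
                    reverse=True)
    return [(row[name_index], int(row[reviews_index])) for row in ranked[:n]]


# Shortened sample rows in deliberately shuffled order
sample_rows = [
    ['Geocaching®', '12811', 'Navigation'],
    ['Facebook', '2974676', 'Social Networking'],
    ['Waze - GPS Navigation, Maps & Real-time Traffic', '345046', 'Navigation'],
    ['Railway Route Search', '5', 'Navigation'],
    ['Google Maps - Navigation & Transit', '154911', 'Navigation'],
]

top_navigation = top_apps(sample_rows, 'Navigation', n=3,
                          name_index=0, reviews_index=1, genre_index=2)
for name, reviews in top_navigation:
    print(name, ': ', reviews)
```

The default indices match the Apple dataset’s columns, so top_apps(free_apple_apps, 'Reference') would be the call on the real list.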

count = 0
for app in free_apple_apps:
    if app[11] == 'Social Networking' and count < 10:
        print(app[1], ": ", app[5])
        count += 1

Facebook :  2974676
Pinterest :  1061624
Skype for iPhone :  373519
Messenger :  351466
Tumblr :  334293
WhatsApp Messenger :  287589
Kik :  260965
ooVoo – Free Video Call, Text and Voice :  177501
TextNow - Unlimited Text + Calls :  164963
Viber Messenger – Text & Call :  164249

In contrast to the navigation and reference genres, the social networking genre shows a higher number of reviews across a larger number of apps. Even the tenth most popular app in the genre has 150k+ reviews.

Expanding my search shows that apps in the top 20 still have ~50k reviews. This suggests that an app in the social networking genre may not need to reach Facebook or Pinterest levels of success to attract a meaningful number of users.

count = 0
for app in free_apple_apps:
    if app[11] == 'Social Networking' and count < 10:
        count += 1
    elif app[11] == 'Social Networking' and count < 20:
        print(app[1], ": ", app[5])
        count += 1

Followers - Social Analytics For Instagram :  112778
MeetMe - Chat and Meet New People :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos :  90414
InsTrack for Instagram - Analytics Plus More :  85535
Tango - Free Video Call, Voice and Chat :  75412
LinkedIn :  71856
Match™ - #1 Dating App. :  60659
Skype for iPad :  60163
POF - Best Dating App for Conversations :  52642
Timehop :  49510

Photo & Video is another genre that stands out as having a larger number of popular apps. Looking past the biggest names in the space, there are still many apps that have tens of thousands of reviews.

count = 0
for app in free_apple_apps:
    if app[11] == 'Photo & Video' and count < 10:
        count += 1
    elif app[11] == 'Photo & Video' and count < 20:
        print(app[1], ": ", app[5])
        count += 1

Mixgram - Picture Collage Maker - Pic Photo Editor :  54282
Shutterfly: Prints, Photo Books, Cards Made Easy :  51427
Pic Jointer – Photo Collage, Camera Effects Editor :  51330
Color Pop Effects - Photo Editor & Picture Editing :  45320
Photo Grid - photo collage maker & photo editor :  40531
iSwap Faces LITE :  39722
MOLDIV - Photo Editor, Collage & Beauty Camera :  39501
Photo Editor by Aviary :  39501
Photo Lab: Picture Editor, effects & fun face app :  34585
Rookie Cam - Photo Editor & Filter Camera :  33921

Android

Next, I’ll perform the same analysis steps on the Android dataset. I’ll calculate the average number of installs per category, since the Category column seems most analogous to Apple’s prime_genre column.

# Average number of installs per category
# Android

android_genre_table = freq_table(free_android_apps, 1)
avg_installs_dict = {}

for genre in android_genre_table:
    total = 0
    num_apps_genre = 0
    
    # Summing installs and counting apps
    for app in free_android_apps:
        genre_app = app[1]
        if genre_app == genre:
            
            # Removing commas and plus sign, converting string to float
            installs_v1 = app[5]
            installs_v2 = installs_v1.replace(",", "")
            installs_v3 = installs_v2.replace("+", "")
            installs = float(installs_v3)
            
            total += installs
            num_apps_genre += 1
    avg_installs = total / num_apps_genre
    avg_installs_dict[genre] = avg_installs
    
# For easier readability, I'll sort the dictionary by values 
sorted_dict = sorted(avg_installs_dict.items(), key=lambda kv: 
                 kv[1], reverse = True)
print('Average number of installs by genre:')
print('\n')
for row in sorted_dict:
    print(row[0], ": ", row[1])

Average number of installs by genre:


COMMUNICATION :  38456119.167247385
VIDEO_PLAYERS :  24727872.452830188
SOCIAL :  23253652.127118643
PHOTOGRAPHY :  17840110.40229885
PRODUCTIVITY :  16787331.344927534
GAME :  15588015.603248259
TRAVEL_AND_LOCAL :  13984077.710144928
ENTERTAINMENT :  11640705.88235294
TOOLS :  10801391.298666667
NEWS_AND_MAGAZINES :  9549178.467741935
BOOKS_AND_REFERENCE :  8767811.894736841
SHOPPING :  7036877.311557789
PERSONALIZATION :  5201482.6122448975
WEATHER :  5074486.197183099
HEALTH_AND_FITNESS :  4188821.9853479853
MAPS_AND_NAVIGATION :  4056941.7741935486
FAMILY :  3695641.8198090694
SPORTS :  3638640.1428571427
ART_AND_DESIGN :  1986335.0877192982
FOOD_AND_DRINK :  1924897.7363636363
EDUCATION :  1833495.145631068
BUSINESS :  1712290.1474201474
LIFESTYLE :  1437816.2687861272
FINANCE :  1387692.475609756
HOUSE_AND_HOME :  1331540.5616438356
DATING :  854028.8303030303
COMICS :  817657.2727272727
AUTO_AND_VEHICLES :  647317.8170731707
LIBRARIES_AND_DEMO :  638503.734939759
PARENTING :  542603.6206896552
BEAUTY :  513151.88679245283
EVENTS :  253542.22222222222
MEDICAL :  120550.61980830671

For the Android dataset, apps in the communication category are the most used with an average install count of 38,456,119. Video player apps are the second most popular at 24,727,872 installs, followed by social apps with 23,253,652 installs.
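
The averages above are computed by looping over the full app list once per category. As a rough sketch of an alternative, the same averages can be accumulated in a single pass with two dictionaries. The function name `average_installs_by_category` and the rows in `sample_android` are hypothetical; the column positions (1 for the category, 5 for the installs string) are assumed to match the code above.

```python
from collections import defaultdict


def average_installs_by_category(apps, category_col=1, installs_col=5):
    """Return {category: average installs} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for app in apps:
        category = app[category_col]
        # Strip commas and the plus sign, as in the original loop
        installs = float(app[installs_col].replace(',', '').replace('+', ''))
        totals[category] += installs
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows for illustration; not taken from the Google Play dataset
sample_android = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '500,000+'],
    ['App C', 'PHOTOGRAPHY', '', '', '', '10,000+'],
]

averages = average_installs_by_category(sample_android)
for category, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(category, ": ", avg)
```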

Instead of showing precise install numbers, the Android dataset reports installs in open-ended brackets (10,000,000+; 5,000,000+; 1,000,000+; etc.), so these averages are rough estimates rather than exact figures.
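
The install values are stored as strings such as `10,000,000+`, which is why the code strips the commas and the plus sign before converting them. A small sketch of that conversion as a reusable helper follows. The name `installs_lower_bound` is hypothetical, and reading the trailing plus sign as "at least this many" is an interpretation rather than something the dataset states.

```python
def installs_lower_bound(raw):
    """Convert a label such as '1,000,000+' to the integer 1000000."""
    return int(raw.replace(',', '').replace('+', ''))


for label in ['10,000,000+', '5,000,000+', '1,000,000+']:
    value = installs_lower_bound(label)
    # The ',' format specifier re-inserts thousands separators for display
    print(label, '->', value, '->', f'{value:,}')
```

Working with integers and formatting them with `f'{value:,}'` can also make long printed figures such as `1000000000.0` easier to read.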

As with the Apple dataset, I’ll explore these categories in more depth.

install_dict = {}
for app in free_android_apps:
    installs_v1 = app[5]
    installs_v2 = installs_v1.replace(",", "")
    installs_v3 = installs_v2.replace("+", "")
    installs = float(installs_v3)
    name = app[0]
    if app[1] == 'COMMUNICATION':
        install_dict[name] = installs
                                      

print('\n')
# For easier readability, I'll sort the dictionary by values 
sorted_dict = sorted(install_dict.items(), key=lambda kv: 
                 kv[1], reverse = True)
print('Installs per app:')
print('\n')
count = 0
for row in sorted_dict:
    if count < 20:
        print(row[0], ": ", row[1])
        count += 1

Installs per app:


WhatsApp Messenger :  1000000000.0
Messenger – Text and Video Chat for Free :  1000000000.0
Skype - free IM & video calls :  1000000000.0
Google Chrome: Fast & Secure :  1000000000.0
Gmail :  1000000000.0
Hangouts :  1000000000.0
Google Duo - High Quality Video Calls :  500000000.0
imo free video calls and chat :  500000000.0
LINE: Free Calls & Messages :  500000000.0
UC Browser - Fast Download Private & Secure :  500000000.0
Viber Messenger :  500000000.0
imo beta free calls and text :  100000000.0
Android Messages :  100000000.0
Who :  100000000.0
GO SMS Pro - Messenger, Free Themes, Emoji :  100000000.0
Firefox Browser fast & private :  100000000.0
Messenger Lite: Free Calls & Messages :  100000000.0
Kik :  100000000.0
KakaoTalk: Free Calls & Text :  100000000.0
Opera Mini - fast web browser :  100000000.0

There appears to be more competition from well-known companies in the Android market, which might make gaining traction in the communication genre difficult.

I explored the other categories, and as in the Apple market, the Photography category stood out as having a wide variety of successful apps. Instead of being dominated by a few major players, this category seemed to allow room for competition; many mid-tier apps managed to attract large user bases.

install_dict = {}
for app in free_android_apps:
    installs_v1 = app[5]
    installs_v2 = installs_v1.replace(",", "")
    installs_v3 = installs_v2.replace("+", "")
    installs = float(installs_v3)
    name = app[0]
    if app[1] == 'PHOTOGRAPHY':
        install_dict[name] = installs
                                      

print('\n')
# For easier readability, I'll sort the dictionary by values 
sorted_dict = sorted(install_dict.items(), key=lambda kv: 
                 kv[1], reverse = True)
print('Installs per app:')
print('\n')
count = 0
for row in sorted_dict:
    if count < 20:
        print(row[0], ": ", row[1])
        count += 1

Installs per app:


Google Photos :  1000000000.0
B612 - Beauty & Filter Camera :  100000000.0
YouCam Makeup - Magic Selfie Makeovers :  100000000.0
Sweet Selfie - selfie camera, beauty cam, photo edit :  100000000.0
Retrica :  100000000.0
Photo Editor Pro :  100000000.0
BeautyPlus - Easy Photo Editor & Selfie Camera :  100000000.0
PicsArt Photo Studio: Collage Maker & Pic Editor :  100000000.0
Photo Collage Editor :  100000000.0
Z Camera - Photo Editor, Beauty Selfie, Collage :  100000000.0
PhotoGrid: Video & Pic Collage Maker, Photo Editor :  100000000.0
Candy Camera - selfie, beauty camera, photo editor :  100000000.0
YouCam Perfect - Selfie Photo Editor :  100000000.0
Camera360: Selfie Photo Editor with Funny Sticker :  100000000.0
S Photo Editor - Collage Maker , Photo Collage :  100000000.0
AR effect :  100000000.0
Cymera Camera- Photo Editor, Filter,Collage,Layout :  100000000.0
LINE Camera - Photo editor :  100000000.0
Photo Editor Collage Maker Pro :  100000000.0
Motorola Camera :  50000000.0
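
One way to put a number on how crowded each install tier is would be to count the apps in a category per install bracket. The sketch below does this with `collections.Counter`. The function names `bracket_counts` and `bracket_value` and the rows in `sample_android` are hypothetical, and the column positions (1 for the category, 5 for the installs string) are assumed to match the code above.

```python
from collections import Counter


def bracket_counts(apps, category, category_col=1, installs_col=5):
    """Count apps per install bracket (e.g. '100,000,000+') within one category."""
    return Counter(app[installs_col] for app in apps
                   if app[category_col] == category)


def bracket_value(label):
    """Numeric value of a bracket label, used only to order the output."""
    return int(label.replace(',', '').replace('+', ''))


# Made-up rows for illustration; not taken from the Google Play dataset
sample_android = [
    ['App A', 'PHOTOGRAPHY', '', '', '', '1,000,000,000+'],
    ['App B', 'PHOTOGRAPHY', '', '', '', '100,000,000+'],
    ['App C', 'PHOTOGRAPHY', '', '', '', '100,000,000+'],
    ['App D', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
]

counts = bracket_counts(sample_android, 'PHOTOGRAPHY')
for label in sorted(counts, key=bracket_value, reverse=True):
    print(label, ": ", counts[label])
```

A category where most apps sit in the brackets just below the top tier would fit the "room for competition" pattern described above.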

4d. Insights: Popular Apps by Genre

Despite having a high average number of reviews, several genres (such as navigation and reference) were heavily skewed by the big companies in those spaces. Instead, I searched for genres where a relatively large number of apps achieved review counts in the tens of thousands or more, or, in the Google Play Store, categories where many apps reached millions of installs.
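
A rough way to check whether a genre's average is being pulled up by a few very large apps is to compare its mean with its median. The sketch below uses Python's `statistics` module; the function name `skew_summary` and all of the numbers in `sample_counts` are made up for illustration and do not come from either dataset.

```python
from statistics import mean, median


def skew_summary(values_by_genre):
    """Return {genre: (mean, median)} for each genre's list of counts."""
    return {genre: (mean(values), median(values))
            for genre, values in values_by_genre.items()}


# Hypothetical review counts, not taken from either dataset
sample_counts = {
    'Genre dominated by one app': [1_000_000, 500, 300, 200],
    'Genre with many mid-sized apps': [60_000, 50_000, 40_000, 30_000],
}

for genre, (avg, mid) in skew_summary(sample_counts).items():
    print(genre, ": mean =", avg, ", median =", mid)
```

A mean far above the median suggests that a handful of apps account for most of the total, while similar values suggest that usage is spread across many apps.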

Apple

In the Apple dataset, two genres stood out as having a large number of popular apps and room for competition: Social Networking and Photo & Video.

Two of the top 20 social networking apps focused on Instagram analytics; perhaps analytics for another popular social platform could be a viable focus for a new app.

Alternatively, in the Photo & Video genre, a new app could focus on photo effects, emoji, augmented reality, collages, or video filters.

Android

In the Android dataset, the Photography category stood out as allowing for the most competition. Rather than being dominated by a few major companies, the category included many mid-tier apps with large user bases.

I believe this category could be worth focusing on because it overlaps with the Photo & Video genre in the Apple market; an app developed for this genre could potentially perform well in both app stores.

Header photo by Yura Fresh on Unsplash.
