Loading

Task 1: Next Product Recommendation

Task 1 - Getting Started

Make your first submission on Task 1

dipam

Amazon KDD Cup 2023 - Task 1 - Next Product Recommendation

This notebook will contains instructions and example submission with random predictions.

Installations 🤖

  1. aicrowd-cli for downloading challenge data and making submissions
  2. pyarrow for saving to parquet for submissions
In [ ]:
!pip install aicrowd-cli pyarrow
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.15-py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 1.7 MB/s eta 0:00:00
Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (4.65.0)
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl (54 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.5/54.5 KB 4.3 MB/s eta 0:00:00
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (0.10.2)
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp39-cp39-manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 19.4 MB/s eta 0:00:00
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.1/170.1 KB 9.1 MB/s eta 0:00:00
Collecting click<8,>=7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.8/82.8 KB 5.3 MB/s eta 0:00:00
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 214.4/214.4 KB 14.1 MB/s eta 0:00:00
Collecting python-slugify<6,>=5.0.0
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Requirement already satisfied: requests<3,>=2.25.1 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (2.25.1)
Collecting semver<3,>=2.13.0
  Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 KB 4.6 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.16.6 in /usr/local/lib/python3.9/dist-packages (from pyarrow) (1.22.4)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify<6,>=5.0.0->aicrowd-cli) (1.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2022.12.7)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.15)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 4.1 MB/s eta 0:00:00
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.9/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Installing collected packages: commonmark, smmap, semver, pyzmq, python-slugify, colorama, click, rich, requests-toolbelt, gitdb, GitPython, aicrowd-cli
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 23.2.1
    Uninstalling pyzmq-23.2.1:
      Successfully uninstalled pyzmq-23.2.1
  Attempting uninstall: python-slugify
    Found existing installation: python-slugify 8.0.1
    Uninstalling python-slugify-8.0.1:
      Successfully uninstalled python-slugify-8.0.1
  Attempting uninstall: click
    Found existing installation: click 8.1.3
    Uninstalling click-8.1.3:
      Successfully uninstalled click-8.1.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 2.2.3 requires click>=8.0, but you have click 7.1.2 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.15 click-7.1.2 colorama-0.4.6 commonmark-0.9.1 gitdb-4.0.10 python-slugify-5.0.2 pyzmq-22.1.0 requests-toolbelt-0.10.1 rich-10.16.2 semver-2.13.0 smmap-5.0.0

Login to AIcrowd and download the data 📚

In [ ]:
!aicrowd login
In [ ]:
!aicrowd dataset download --challenge task-1-next-product-recommendation
sessions_test_task1.csv: 100% 19.4M/19.4M [00:01<00:00, 14.2MB/s]
sessions_test_task2.csv: 100% 1.92M/1.92M [00:00<00:00, 4.04MB/s]
sessions_test_task3.csv: 100% 3.15M/3.15M [00:00<00:00, 5.91MB/s]
products_train.csv: 100% 589M/589M [01:10<00:00, 8.32MB/s]
sessions_train.csv: 100% 259M/259M [00:38<00:00, 6.74MB/s]

Setup data and task information

In [ ]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache
In [ ]:
train_data_dir = '.'
test_data_dir = '.'
task = 'task1'
PREDS_PER_SESSION = 100
In [ ]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))

Data Description

The Multilingual Shopping Session Dataset is a collection of anonymized customer sessions containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: user sessions and product attributes. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.


Each product as its associated information:

locale: the locale code of the product (e.g., DE)

id: a unique for the product. Also known as Amazon Standard Item Number (ASIN) (e.g., B07WSY3MG8)

title: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

price: price of the item in local currency (e.g., 24.99)

brand: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

color: color of the item (e.g., “Black”)

size: size of the item (e.g., “xxl”)

model: model of the item (e.g., “iphone 13”)

material: material of the item (e.g., “cotton”)

author: author of the item (e.g., “J. K. Rowling”)

desc: description about a item’s key features and benefits called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)

EDA 💽

In [ ]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

    train_l = sess_train['prev_items'].apply(lambda sess: len(sess))
    test_l = sess_test['prev_items'].apply(lambda sess: len(sess))

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
             f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")
In [ ]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)
Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 104568
Test session lengths - Mean: 57.23 | Median 40.00 | Min: 27.00 | Max 700.00 

======================================================================== 

Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 96467
Test session lengths - Mean: 59.90 | Median 40.00 | Min: 27.00 | Max 1479.00 

======================================================================== 

Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 115936
Test session lengths - Mean: 53.51 | Median 40.00 | Min: 27.00 | Max 872.00 

======================================================================== 

Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 0
======================================================================== 

Locale: FR 
Number of products: 44577 
Number of train sessions: 117561 
Train session lengths - Mean: 47.25 | Median 40.00 | Min: 27.00 | Max 687.00 
Number of test sessions: 0
======================================================================== 

Locale: IT 
Number of products: 50461 
Number of train sessions: 126925 
Train session lengths - Mean: 48.80 | Median 40.00 | Min: 27.00 | Max 621.00 
Number of test sessions: 0
======================================================================== 

In [ ]:
products.sample(5)
Out[ ]:
id locale title price brand color size model material author desc
1535876 B08B3QTXJZ IT kwmobile Custodia Compatibile con Apple iPhone... 8.49 KW-Commerce blu chiaro matt NaN 49982.58_m000813 Silicone NaN ANTI URTO: i bordi rialzati della copertina pr...
1198060 B0B2KN1Q6M UK Me To You Bear Sister Just For You Birthday Card 2.99 Carte Blanche NaN NaN NaN NaN NaN NaN
1024050 B099W7JSMT UK Syhood 32.8 Feet Christmas Metallic Tinsel Twi... 8.99 Syhood Blue NaN NaN Metal NaN Christmas style decor: the Christmas metallic ...
895070 B0BCG44MBT JP ラップタオル大人用 速乾大きいサイズ 風呂用サウナ 着るバスシャワー超吸水水泳 温泉湯浴み着... 1589.00 OTTCFRN ピンク ワンサイズ NaN ポリエステル NaN 3 Dトリミング設計、(非ベルクロデザイン)よりユーザーフレンドリーで、使用時に音が出ず、肌...
1084330 B007E9VUQS UK Smiffys Make-Up FX Face and Body Paint, 16 ml ... 2.94 Smiffy's Brown (dark) One Size 39184 NaN NaN Add colour to your dress-up costume!
In [ ]:
train_sessions = read_train_data()
train_sessions.sample(5)
Out[ ]:
prev_items next_item locale
2994683 ['B07T3GN2VH' 'B07T2FDFKZ' 'B07T3DJMT5' 'B098T... B07ZJZNRMP UK
190907 ['B07ZRN33PQ' 'B07ZRMCRG7' 'B09C24TXP4' 'B09C2... B091G94JDR DE
3595388 ['B09BJNQRNZ' 'B08XK8M5Z3' 'B09DPD5QJ8' 'B09KN... B0B4RYT3ZS IT
465436 ['B07GXQCFXK' 'B07GXQD5Y3' 'B00E6722OK'] B00D3HZYGW DE
3518477 ['B09ZY6WYJX' 'B0B85CSNXW' 'B09ZY6WYJX' 'B0B85... B083DRSWKR IT
In [ ]:
test_sessions = read_test_data(task)
test_sessions.sample(5)
Out[ ]:
prev_items locale
202046 ['B08H95Y452' 'B0BG3GRMF9' 'B0BG3GRMF9' 'B0BF5... UK
98284 ['B09RQ8T72D' 'B09998MBFM' 'B09RQ8T72D'] DE
191260 ['B0871Z739B' 'B09N92NHGR' 'B0871Z739B'] JP
113547 ['B0B56Q2VXW' 'B0B56NPJ4G' 'B0B56Q2VXW'] JP
102804 ['B08G97TPH8' 'B08G91WFQR' 'B08G93D8LZ' 'B082P... DE

Generate Submission 🏋️‍♀️

Submission format:

  1. The submission should be a parquet file with the sessions from all the locales.
  2. Predicted products ids per locale should only be a valid product id of that locale.
  3. Predictions should be added in new column named "next_item_prediction".
  4. Predictions should be a list of string id values
In [ ]:
def random_predicitons(locale, sess_test_locale):
    random_state = np.random.RandomState(42)
    products = read_product_data().query(f'locale == "{locale}"')
    predictions = []
    for _ in range(len(sess_test_locale)):
        predictions.append(
            list(products['id'].sample(PREDS_PER_SESSION, replace=True, random_state=random_state))
        ) 
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale
In [ ]:
test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
for locale in test_locale_names:
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predicitons(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)
Out[ ]:
locale next_item_prediction
197622 JP [B0B3JKGTBH, B07WFD1L1R, B0B1N2FMMG, B0BLJSMWJ...
108611 JP [B07WV5GXPB, B0B7DS3HQL, B0866HDFTS, B009GQYDX...
284074 UK [B09BB5SPR3, B0816CXMSZ, B08JV76967, B08MW68KC...
34652 DE [B007H6POYW, B08M5GZGFT, B08JQZMFL7, B0BKP9BSL...
268639 UK [B06XCGCKG7, B0B9NXKN54, B091YX63K7, B00Z65X1G...

Validate predictions ✅

In [ ]:
def check_predictions(predictions, check_products=False):
    """
    These tests need to pass as they will also be applied on the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale =  predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} doesn't match"

        if check_products:
            # This check is not done on the evaluator
            # but you can run it to verify there is no mixing of products between locales
            # Since the ground truth next item will always belong to the same locale
            # Warning - This can be slow to run
            products = read_product_data().query(f'locale == "{locale}"')
            predicted_products = np.unique( np.array(list(preds_locale["next_item_prediction"].values)) )
            assert np.all( np.isin(predicted_products, products['id']) ), f"Invalid products in {locale} predictions"
In [ ]:
check_predictions(predictions)
In [ ]:
# Its important that the parquet file you submit is saved with pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')

Submit to AIcrowd 🚀

In [ ]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-1-next-product-recommendation -f "submission_task1.parquet"

Comments

Hooked
Over 1 year ago

Thanks!

curiosityoftiane
Over 1 year ago

Thanks!

lu_zhihong
Over 1 year ago

Thanks

sanghyeon_lee
Over 1 year ago

Thanks for sharing this notebook. It’s really helpful. In the EDA section, however, “sess_train[‘prev_items’].apply(lambda sess: len(sess))” returns the length of the string, not the number of products in the session. You need to use eval() first to make the string into an actual list.

asad_baloch
About 2 months ago

zxczx

You must login before you can post a comment.

Execute