Using the Amazon musical instrument review data on Kaggle, retrieve the summary column, perform tokenization, stemming, and lemmatization. Construct results of text processing as a jupyter notebook.

You can download the PDF here and the jupyter notebook here

Data Loading & Pre-Processing

import pandas as pd

#Get the dataframe
path="Downloads/Musical_instruments_reviews.csv"
df = pd.read_csv(path)
df.head()
reviewerID asin reviewerName helpful reviewText overall summary unixReviewTime reviewTime
0 A2IBPI20UZIR0U 1384719342 cassandra tu "Yeah, well, that's just like, u... [0, 0] Not much to write about here, but it does exac... 5.0 good 1393545600 02 28, 2014
1 A14VAT5EAX3D9S 1384719342 Jake [13, 14] The product does exactly as it should and is q... 5.0 Jake 1363392000 03 16, 2013
2 A195EZSQDW3E21 1384719342 Rick Bennette "Rick Bennette" [1, 1] The primary job of this device is to block the... 5.0 It Does The Job Well 1377648000 08 28, 2013
3 A2C00NNG1ZQQG2 1384719342 RustyBill "Sunday Rocker" [0, 0] Nice windscreen protects my MXL mic and preven... 5.0 GOOD WINDSCREEN FOR THE MONEY 1392336000 02 14, 2014
4 A94QU4C90B1AX 1384719342 SEAN MASLANKA [0, 0] This pop filter is great. It looks and perform... 5.0 No more pops when I record my vocals. 1392940800 02 21, 2014
#Get summary column
summary=df['summary']
summary
0                                                     good
1                                                     Jake
2                                     It Does The Job Well
3                            GOOD WINDSCREEN FOR THE MONEY
4                    No more pops when I record my vocals.
                               ...                        
10256                                           Five Stars
10257    Long life, and for some players, a good econom...
10258                                     Good for coated.
10259                                          Taylor Made
10260    These strings are really quite good, but I wou...
Name: summary, Length: 10261, dtype: object

Tokenizer

import nltk
from nltk.tokenize import word_tokenize
import spacy
C:\Users\annet\anaconda3\lib\site-packages\scipy\__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
# Tokenize Summary column
nltk.download('punkt')
df['text_token'] = df.apply(lambda row: word_tokenize(row['summary']), axis=1)
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\annet\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
df['text_token'] 
0                                                   [good]
1                                                   [Jake]
2                               [It, Does, The, Job, Well]
3                      [GOOD, WINDSCREEN, FOR, THE, MONEY]
4         [No, more, pops, when, I, record, my, vocals, .]
                               ...                        
10256                                        [Five, Stars]
10257    [Long, life, ,, and, for, some, players, ,, a,...
10258                               [Good, for, coated, .]
10259                                       [Taylor, Made]
10260    [These, strings, are, really, quite, good, ,, ...
Name: text_token, Length: 10261, dtype: object

Stemmer

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
df['stemmed_summary'] = summary.apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df['stemmed_summary']
0                                                     good
1                                                     jake
2                                      it doe the job well
3                            good windscreen for the money
4                     no more pop when i record my vocals.
                               ...                        
10256                                            five star
10257    long life, and for some players, a good econom...
10258                                     good for coated.
10259                                          taylor made
10260    these string are realli quit good, but i would...
Name: stemmed_summary, Length: 10261, dtype: object
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
df['stemmed_summary2'] = df['text_token'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])
df['stemmed_summary2'] 
0                                                   [good]
1                                                   [jake]
2                                [it, doe, the, job, well]
3                      [good, windscreen, for, the, money]
4           [no, more, pop, when, i, record, my, vocal, .]
                               ...                        
10256                                         [five, star]
10257    [long, life, ,, and, for, some, player, ,, a, ...
10258                                 [good, for, coat, .]
10259                                       [taylor, made]
10260    [these, string, are, realli, quit, good, ,, bu...
Name: stemmed_summary2, Length: 10261, dtype: object

Lemmatization

from nltk.stem import WordNetLemmatizer
nlp = spacy.load('en_core_web_sm')
df['lemmatized_summary'] = df['summary'].apply(lambda x: [token.lemma_ for token in nlp(str(x))])
df['lemmatized_summary']
0                                                   [good]
1                                                   [Jake]
2                                 [it, do, the, Job, well]
3                      [good, WINDSCREEN, for, the, money]
4           [no, more, pop, when, I, record, my, vocal, .]
                               ...                        
10256                                         [five, star]
10257    [long, life, ,, and, for, some, player, ,, a, ...
10258                                 [good, for, coat, .]
10259                                       [Taylor, make]
10260    [these, string, be, really, quite, good, ,, bu...
Name: lemmatized_summary, Length: 10261, dtype: object
df
reviewerID asin reviewerName helpful reviewText overall summary unixReviewTime reviewTime text_token stemmed_summary stemmed_summary2 lemmatized_summary
0 A2IBPI20UZIR0U 1384719342 cassandra tu "Yeah, well, that's just like, u... [0, 0] Not much to write about here, but it does exac... 5.0 good 1393545600 02 28, 2014 [good] good [good] [good]
1 A14VAT5EAX3D9S 1384719342 Jake [13, 14] The product does exactly as it should and is q... 5.0 Jake 1363392000 03 16, 2013 [Jake] jake [jake] [Jake]
2 A195EZSQDW3E21 1384719342 Rick Bennette "Rick Bennette" [1, 1] The primary job of this device is to block the... 5.0 It Does The Job Well 1377648000 08 28, 2013 [It, Does, The, Job, Well] it doe the job well [it, doe, the, job, well] [it, do, the, Job, well]
3 A2C00NNG1ZQQG2 1384719342 RustyBill "Sunday Rocker" [0, 0] Nice windscreen protects my MXL mic and preven... 5.0 GOOD WINDSCREEN FOR THE MONEY 1392336000 02 14, 2014 [GOOD, WINDSCREEN, FOR, THE, MONEY] good windscreen for the money [good, windscreen, for, the, money] [good, WINDSCREEN, for, the, money]
4 A94QU4C90B1AX 1384719342 SEAN MASLANKA [0, 0] This pop filter is great. It looks and perform... 5.0 No more pops when I record my vocals. 1392940800 02 21, 2014 [No, more, pops, when, I, record, my, vocals, .] no more pop when i record my vocals. [no, more, pop, when, i, record, my, vocal, .] [no, more, pop, when, I, record, my, vocal, .]
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10256 A14B2YH83ZXMPP B00JBIVXGC Lonnie M. Adams [0, 0] Great, just as expected. Thank to all. 5.0 Five Stars 1405814400 07 20, 2014 [Five, Stars] five star [five, star] [five, star]
10257 A1RPTVW5VEOSI B00JBIVXGC Michael J. Edelman [0, 0] I've been thinking about trying the Nanoweb st... 5.0 Long life, and for some players, a good econom... 1404259200 07 2, 2014 [Long, life, ,, and, for, some, players, ,, a,... long life, and for some players, a good econom... [long, life, ,, and, for, some, player, ,, a, ... [long, life, ,, and, for, some, player, ,, a, ...
10258 AWCJ12KBO5VII B00JBIVXGC Michael L. Knapp [0, 0] I have tried coated strings in the past ( incl... 4.0 Good for coated. 1405987200 07 22, 2014 [Good, for, coated, .] good for coated. [good, for, coat, .] [good, for, coat, .]
10259 A2Z7S8B5U4PAKJ B00JBIVXGC Rick Langdon "Scriptor" [0, 0] Well, MADE by Elixir and DEVELOPED with Taylor... 4.0 Taylor Made 1404172800 07 1, 2014 [Taylor, Made] taylor made [taylor, made] [Taylor, make]
10260 A2WA8TDCTGUADI B00JBIVXGC TheTerrorBeyond [0, 0] These strings are really quite good, but I wou... 4.0 These strings are really quite good, but I wou... 1405468800 07 16, 2014 [These, strings, are, really, quite, good, ,, ... these string are realli quit good, but i would... [these, string, are, realli, quit, good, ,, bu... [these, string, be, really, quite, good, ,, bu...

10261 rows × 13 columns