Text Mining: A Case Study

Outline

  • Text Classification Examples
  • Machine Learning/Classification Pipeline
  • Plain Vanilla approach: TF-IDF weighting + Support Vector Machine (SVM)
  • Demo: banner classification

Text Classification Examples

Categorization based on item description

Machine Learning/Classification Pipeline

Banner Classification Pipeline

Term Frequency

The number of times a given word occurs on the receipts of a given banner (document).

Term Frequency Matrix (10000 receipts)

word       Walmart (2240)   non-Walmart (7760)
live       1934             204
money      1871             88
walmart    1632             29
manager    1529             24
  • In the binary-TF case, repeated occurrences count once: a receipt reading 'live live' still contributes only 1.
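As a quick illustration (made-up receipt strings, not the real data), counts like those above can be reproduced with scikit-learn's CountVectorizer; binary=True gives the capped counts mentioned in the note:

from sklearn.feature_extraction.text import CountVectorizer

#hypothetical receipt texts, for illustration only
receipts = ["save money live better",
            "live live better",
            "great food low prices"]

cv = CountVectorizer()           #binary=True would cap every count at 1
tf = cv.fit_transform(receipts)  #sparse (n_receipts x n_words) count matrix

print(cv.get_feature_names())    #['better', 'food', 'great', 'live', 'low', 'money', 'prices', 'save']
print(tf.toarray())              #row 1 has a 2 in the 'live' column; with binary=True it would be 1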

Inverse Document Frequency

A measure of how much information a word provides; that is, whether the term is common or rare across all receipts (documents).

Each word's IDF, computed from the number of receipts containing it (its document frequency):

word       Walmart (2240)               non-Walmart (7760)
live       \(\log\frac{10000}{1934}\)   \(\log\frac{10000}{204}\)
money      \(\log\frac{10000}{1871}\)   \(\log\frac{10000}{88}\)
walmart    \(\log\frac{10000}{1632}\)   \(\log\frac{10000}{29}\)
manager    \(\log\frac{10000}{1529}\)   \(\log\frac{10000}{24}\)
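In general, with \(N\) receipts in total and \(\mathrm{df}(t)\) receipts containing term \(t\), every cell follows

\[\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}\]

so a term found on nearly every receipt gets an IDF near zero, while a rare, discriminative term like 'walmart' gets a large one.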

TF-IDF + SVM

  • Multiply TF by IDF to obtain the TF-IDF matrix.

  • Use the rows of the TF-IDF matrix as feature vectors.

  • Feed the feature vectors into an SVM (see the sketch after this list).
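A minimal sketch of the three steps on toy data (strings and labels are made up; the demo below does the same on real receipts, with TfidfVectorizer fusing the first two steps):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

#toy receipts and labels (1 = walmart-like, 0 = other); purely illustrative
docs   = ["save money live better", "great food low prices",
          "save money save time",   "invest in a future"]
labels = [1, 0, 1, 0]

tf    = CountVectorizer().fit_transform(docs)  #raw term-frequency counts
tfidf = TfidfTransformer().fit_transform(tf)   #multiply TF by IDF (and L2-normalize)
clf   = LinearSVC(C=1).fit(tfidf, labels)      #each TF-IDF row is a feature vector for the SVM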

Tools

In [1]:
import pandas as pd

#load labeled receipts: banner_key (label) and raw OCR'd receipt text
fn = 'c:/work/fun/ds-meetup/data.csv'
data = pd.read_csv(fn)
data[3:10]
Out[1]:
banner_key text
3 whole_foods WFZLE FOODS Invest in a future without poverty...
4 dillons_marketplace dlito 11.4#;1111,770dIons.;_um Great food. Low...
5 sams_club LLUR MANAGER J CUNNINGHAM (907) 522 - 2333 ANC...
6 cvs CVS13114airmacy 10623 618(3(0110N, RIVERVIEW, ...
7 rite_aid 1120 331 1ith us, ifs personal. Stcre #00443 3...
8 walmart Wallmart Save money. Live better. Self Checkou...
9 walmart Walmart Save money. Livn better. 205 1 7si 972...
In [2]:
#share of receipts per banner (DataFrame.sort is the old pandas API)
stat = data[['banner_key']].copy()
stat['ratio'] = 0
stat = (stat.groupby('banner_key').aggregate(len) / float(stat.shape[0])).sort(['ratio'], ascending=False)
stat[:10]
Out[2]:
ratio
banner_key
walmart 0.2240
target 0.0913
walgreens 0.0460
publix 0.0405
kroger 0.0399
cvs 0.0338
costco 0.0273
dollar_tree 0.0249
safeway 0.0208
meijer 0.0189
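(Aside: DataFrame.sort was removed in later pandas releases; on a current install the same per-banner shares come out of a one-liner. A sketch, not run against this data:)

#modern pandas equivalent: share of receipts per banner, sorted descending
stat = data['banner_key'].value_counts(normalize=True)
stat[:10]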
In [3]:
#focus on the biggest banner only: walmart ==> binary classifier

#map non-walmart ==> 0, walmart ==> 1
data['banner_key'][~data['banner_key'].isin(['walmart'])] = 0
data['banner_key'][data['banner_key'].isin(['walmart'])] = 1
data[3:10]
Out[3]:
banner_key text
3 0 WFZLE FOODS Invest in a future without poverty...
4 0 dlito 11.4#;1111,770dIons.;_um Great food. Low...
5 0 LLUR MANAGER J CUNNINGHAM (907) 522 - 2333 ANC...
6 0 CVS13114airmacy 10623 618(3(0110N, RIVERVIEW, ...
7 0 1120 331 1ith us, ifs personal. Stcre #00443 3...
8 1 Wallmart Save money. Live better. Self Checkou...
9 1 Walmart Save money. Livn better. 205 1 7si 972...
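The chained indexing in the cell above works in old pandas but triggers SettingWithCopyWarning today; a modern, warning-free equivalent (a sketch) is:

#1 for walmart, 0 for everything else, without chained assignment
data['banner_key'] = (data['banner_key'] == 'walmart').astype(int)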
In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
In [29]:
""" Get train data; X=input & y=target """
#only 200 samples
X_train = data['text'][:200]
y_train = data['banner_key'][:200].astype(int)
In [39]:
""" Pipeline: raw text ==> TFIDF ==> Linear SVM ==> banner """
pl = Pipeline([
   ('vectorizer', TfidfVectorizer(sublinear_tf=True, analyzer='word')),
   ('classifier', LinearSVC(C=1))
   ])
In [40]:
""" Setup the paramaters """
parameters = {'vectorizer__use_idf':[True,False], 
              'vectorizer__ngram_range':[(1,3)],  
              'vectorizer__binary':[True,False],  
              'classifier__dual':[True],          
              'classifier__C':[1,10]}             
In [41]:
""" GridSearch w/ cross-validation """
n_cores = 1
grid_search = GridSearchCV(pl, parameters, cv = 5, scoring = 'f1', 
                           n_jobs = n_cores, verbose=1, refit=True, 
                           iid=False) 
grid_search.fit(X_train, y_train)  #Search the best parameter setting
Fitting 5 folds for each of 8 candidates, totalling 40 fits

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.7s
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:   31.6s finished

Out[41]:
GridSearchCV(cv=5,
       estimator=Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),...ling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0))]),
       fit_params={}, iid=False, loss_func=None, n_jobs=1,
       param_grid={'vectorizer__use_idf': [True, False], 'vectorizer__ngram_range': [(1, 3)], 'classifier__C': [1, 10], 'vectorizer__binary': [True, False], 'classifier__dual': [True]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='f1',
       verbose=1)
In [42]:
print 'f1 score : %.2f%%' % (grid_search.best_score_*100)
print("Best parameter set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

clf_best = grid_search.best_estimator_
f1 score : 97.89%
Best parameter set:
	classifier__C: 1
	classifier__dual: True
	vectorizer__binary: False
	vectorizer__ngram_range: (1, 3)
	vectorizer__use_idf: False
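
Because refit=True, clf_best was retrained on all 200 training rows and can score unseen receipts directly. A sketch using a hypothetical hold-out slice (the next 200 rows, assuming they were left untouched):

from sklearn.metrics import f1_score

#hypothetical hold-out: rows 200-399 were never seen during training
X_test = data['text'][200:400]
y_test = data['banner_key'][200:400].astype(int)

pred = clf_best.predict(X_test)
print 'hold-out f1 : %.2f%%' % (f1_score(y_test, pred) * 100)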