license: apache-2.0
metrics:
  - accuracy
pipeline_tag: text-classification
tags:
  - CNN
  - NLP
  - Yelp
  - Reviews
  - pre_trained
language:
  - en
datasets:
  - yassiracharki/Yelp_Reviews_for_Binary_Senti_Analysis
library_name: fasttext

Model Card for Model ID

Downloads

!pip install contractions
!pip install textsearch
!pip install tqdm

import nltk
nltk.download('punkt')

Fundamental classes

import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np

Time

import time
import datetime

Preprocessing

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder
import contractions
from bs4 import BeautifulSoup
import re
import tqdm
import unicodedata

seed = 3541
np.random.seed(seed)

Define a dummy loss to bypass the error during model loading

def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred - y_true)

Loading the model trained on Yelp reviews

modelYelp = keras.models.load_model(
    '/kaggle/input/pre-trained-model-binary-cnn-nlp-yelpreviews/tensorflow1/pre-trained-model-binary-cnn-nlp-yelp-reviews/1/Binary_Classification_90_Yelp_Reviews_CNN.h5',
    compile=False
)

Compile the model with the correct loss function and reduction

modelYelp.compile(
    optimizer='adam',
    loss=keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE),
    metrics=['accuracy']
)
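For reference, binary cross-entropy with the `SUM_OVER_BATCH_SIZE` reduction amounts to summing the per-example losses and dividing by the batch size, i.e. the batch mean. The sketch below reproduces that arithmetic in plain Python. The helper name `binary_crossentropy_mean`, the `eps` value, and the labels and probabilities are all made up for illustration; this is not Keras's implementation.

```python
import math

def binary_crossentropy_mean(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch (illustrative helper, not a Keras API)."""
    losses = []
    for t, p in zip(y_true, y_pred):
        # Clip probabilities away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    # "Sum over batch size": add the per-example losses, then divide by the batch size.
    return sum(losses) / len(losses)

# Made-up batch: two positive reviews and one negative review.
y_true = [1, 1, 0]
y_pred = [0.9, 0.6, 0.2]
loss = binary_crossentropy_mean(y_true, y_pred)
print(round(loss, 4))
```

A confident correct prediction (0.9 for a positive review) contributes little to the mean, while the less certain 0.6 contributes the most.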

Loading Yelp test data

dataset_test_Yelp = pd.read_csv('/kaggle/input/yelp-reviews-for-sentianalysis-binary-np-csv/yelp_review_sa_binary_csv/test.csv')

Loading Yelp train data (to be used on the label encoder)

dataset_train_Yelp = pd.read_csv('/kaggle/input/yelp-reviews-for-sentianalysis-binary-np-csv/yelp_review_sa_binary_csv/train.csv')

Shuffling the test and train data

test_Yelp = dataset_test_Yelp.sample(frac=1)
train_Yelp = dataset_train_Yelp.sample(frac=1)

Taking a tiny portion of the train dataset (because it will only be used to fit the label encoder)

train_Yelp = train_Yelp.iloc[:100, :]

Taking only necessary columns

y_test_Yelp = test_Yelp['class_index'].values
X_train_Yelp = train_Yelp['review_text'].values
y_train_Yelp = train_Yelp['class_index'].values
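A label encoder only needs to see each class value once, which is why a small slice of the training labels is enough to fit it. The sketch below is a plain-Python stand-in for that idea (sorted distinct labels mapped to 0..n-1), not scikit-learn's `LabelEncoder` itself. The helper names `fit_label_mapping` and `encode` are hypothetical, and the label values 1 and 2 are an assumption about the `class_index` column.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer 0..n-1, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}

def encode(labels, mapping):
    """Replace every label with its integer code."""
    return [mapping[label] for label in labels]

# Assumed label values for illustration: 1 = negative, 2 = positive.
train_sample = [2, 1, 1, 2, 1]
test_labels = [1, 2, 2, 1]

mapping = fit_label_mapping(train_sample)
encoded_test = encode(test_labels, mapping)
print(mapping)
print(encoded_test)
```

The slice used for fitting has to contain every class; a label that was never seen during fitting cannot be encoded afterwards.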

Preprocess corpus function

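def pre_process_corpus(corpus):
    processed_corpus = []
    for doc in tqdm.tqdm(corpus):
        # Expand contractions ("don't" -> "do not")
        doc = contractions.fix(doc)
        # Strip HTML markup
        doc = BeautifulSoup(doc, "html.parser").get_text()
        # Remove accents and any other non-ASCII characters
        doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        # Keep only letters and whitespace; flags must be passed by keyword,
        # because the fourth positional argument of re.sub is count
        doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
        doc = doc.lower()
        doc = doc.strip()
        processed_corpus.append(doc)
    return processed_corpus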
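To see what the cleaning steps do to a single string, here is a reduced sketch that uses only the standard library. It leaves out the `contractions.fix` and BeautifulSoup steps, and both the helper name `clean_text` and the sample review are hypothetical, chosen only for illustration.

```python
import re
import unicodedata

def clean_text(doc):
    """Apply the accent-stripping, letter-filtering, lowercasing and trimming steps."""
    # Decompose accented characters, then drop everything that is not ASCII.
    doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Remove every character that is not a letter or whitespace.
    doc = re.sub(r'[^a-zA-Z\s]', '', doc)
    return doc.lower().strip()

# Made-up review text with an accent, punctuation, digits and padding spaces.
sample = "  The Café was GREAT!!! 10/10, would go again.  "
cleaned = clean_text(sample)
print(cleaned)
```

Digits and punctuation are deleted rather than replaced, so the whitespace around them stays in place and is not collapsed.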

Preprocessing the Data

X_test_Yelp = pre_process_corpus(test_Yelp['review_text'].values)
X_train_Yelp = pre_process_corpus(X_train_Yelp)

Creating and Fitting the Tokenizer

etc ...
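The tokenizer step itself is not shown here. As a rough illustration of what fitting a word index and producing fixed-length integer sequences involves in general, the following plain-Python sketch may help. It is a simplified stand-in, not the Keras `Tokenizer` and not the settings used for this model: the helper names, the sample texts, `max_len=4`, and the choice to truncate and pad on the right are all assumptions made for the example.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> integer index, most frequent words first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left free so it can be used as the padding value.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, max_len):
    """Convert texts to integer sequences, drop unknown words, then cut or pad to max_len."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[:max_len]                     # truncate long reviews
        seq = seq + [0] * (max_len - len(seq))  # right-pad short reviews with zeros
        padded.append(seq)
    return padded

# Made-up, already cleaned texts.
train_texts = ["good food good service", "bad food"]
word_index = fit_word_index(train_texts)
X = texts_to_padded(["good food", "bad terrible service here today"], word_index, max_len=4)
print(word_index)
print(X)
```

Words that never appeared in the fitting texts are dropped in this sketch, so every row holds only known indices and zeros.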

More information is available on the model page on Kaggle:

https://www.kaggle.com/models/yacharki/pre-trained-model-binary-cnn-nlp-yelpreviews