Hello, suppose we have a dataset with text, numerical, and categorical columns to be used for text classification. What options do we have for using the additional (numerical and categorical) columns in classification? Here are the options I can think of:
Option 1: Combine the categorical values with the text using [SEP].
Option 2: Concatenate the numerical/categorical data to the [CLS] embedding and pass the result to a linear layer.
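To make option 1 concrete, here is a minimal sketch of folding the extra columns into the input string. The helper name `serialize_row` and the column names are hypothetical, and the sketch assumes the tokenizer maps the literal string [SEP] to its separator token, as BERT-style tokenizers typically do.

```python
def serialize_row(text, categorical, numerical, sep=" [SEP] "):
    """Fold extra columns into the text so a plain text classifier sees them."""
    # Hypothetical "name: value" layout; any consistent layout would do.
    parts = [f"{name}: {value}" for name, value in categorical.items()]
    # ":g" drops trailing zeros so numbers tokenize more compactly.
    parts += [f"{name}: {value:g}" for name, value in numerical.items()]
    parts.append(text)
    return sep.join(parts)


row = serialize_row(
    "Battery died after two weeks.",
    categorical={"product": "phone", "channel": "web"},
    numerical={"rating": 2.0},
)
print(row)
# product: phone [SEP] channel: web [SEP] rating: 2 [SEP] Battery died after two weeks.
```

The resulting string can be tokenized like any other input, so the model architecture stays unchanged. The trade-off is that numbers are seen as tokens rather than as magnitudes.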
Any help or tutorials on this would be greatly appreciated.
Thanks
Great question. I have been looking for answers to the same problem for my academic project. I hope we will get some help here.
@lewtun @sgugger, could you please share some suggestions or insights on this topic?
An interesting toy project of mine is finding paragraph breaks in lines of text that were broken from a well-formatted original based on string length, like text extracted from PDFs. An approach based on domain knowledge (human grammar and punctuation) would be that a paragraph break generally only occurs after a sentence break, meaning a "." or "!". A paragraph break is especially likely if the final punctuation is followed by significant whitespace before the line end. This is the sort of pattern that can be captured nicely by a regular expression such as "(\.|!)\s*$".
Matching this regexp against a string yields a boolean (true or false), which can be encoded as 1 or 0. Combining (a) that matching test, expressed as a string function with a boolean return value, with (b) a pre-trained BERT model should be more accurate than either alone.
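As a rough sketch, the matching test could look like the following. It reads the pattern as `(\.|!)\s*$`, with the period escaped and `\s` for whitespace; the function name `ends_sentence` and the sample lines are made up for illustration.

```python
import re

# Sentence-final "." or "!" followed only by optional whitespace up to the line end.
SENTENCE_END = re.compile(r"(\.|!)\s*$")


def ends_sentence(line: str) -> int:
    """Return 1 if the line ends with sentence-final punctuation, else 0."""
    return int(SENTENCE_END.search(line) is not None)


lines = ["the end of the story.   ", "and this line keeps", "Stop!"]
features = [ends_sentence(line) for line in lines]
print(features)  # [1, 0, 1]
```

Each line then carries a single 0/1 feature that can sit alongside whatever the language model produces for the same line.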
How best to accomplish this, given the rather rigid structure of BERT-type LLMs? A good solution can point the way to combining narrow, human rule-based heuristics with the complexity of LLMs, which, despite their rich context-based learning, can have difficulty learning narrow rules like a regular expression match. Such hybrid models may excel at extracting rules and formulas from natural language text, for example from legal documents.
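One possible way to wire this up, in the spirit of option 2 from the first post, is a small classification head that concatenates the rule feature to the [CLS] embedding before a linear layer. This is only a sketch under assumptions: the class name `TextPlusFeaturesHead` is hypothetical, the hidden size of 768 matches BERT-base, and a random tensor stands in for the encoder output so the example runs without downloading a model.

```python
import torch
from torch import nn


class TextPlusFeaturesHead(nn.Module):
    """Classify from a [CLS] embedding concatenated with extra features."""

    def __init__(self, hidden_size: int, num_extra: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size + num_extra, num_labels)

    def forward(self, cls_embedding, extra_features):
        # cls_embedding: (batch, hidden_size); extra_features: (batch, num_extra)
        combined = torch.cat([cls_embedding, extra_features], dim=-1)
        return self.classifier(combined)


torch.manual_seed(0)
head = TextPlusFeaturesHead(hidden_size=768, num_extra=1, num_labels=2)
cls_embedding = torch.randn(3, 768)  # stand-in for the encoder's [CLS] output
rule_feature = torch.tensor([[1.0], [0.0], [1.0]])  # regex match per line (0 or 1)
logits = head(cls_embedding, rule_feature)
print(logits.shape)  # torch.Size([3, 2])
```

In a real setup the stand-in tensor would be replaced by the encoder's output for each line, and the head would be trained together with (or on top of) the encoder. Numerical columns would likely need normalizing, and categorical ones a one-hot or learned embedding, before being concatenated this way.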