NLTK TweetTokenizer
NLTK's (Natural Language Toolkit) TweetTokenizer and word_tokenize both split a given sentence into words, and on plain text they behave much the same. TweetTokenizer, however, is tuned for tweets: it keeps hashtags, @-mentions, and emoticons intact as single tokens, while word_tokenize breaks them apart.
Example
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

tt = TweetTokenizer()
tweet = "This is a coooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaayyyy too much for you!!!!!!"

print(tt.tokenize(tweet))
print(word_tokenize(tweet))

# Output:
# ['This', 'is', 'a', 'coooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# ['This', 'is', 'a', 'coooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']
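TweetTokenizer also takes constructor flags that the default call above leaves off. Below is a short sketch using NLTK's documented strip_handles and reduce_len options; the printed list is indicative, and exact output may vary with your NLTK version.

from nltk.tokenize import TweetTokenizer

# strip_handles removes @-mentions; reduce_len shortens runs of
# 3+ repeated characters down to 3 (e.g. "waaaayyyy" -> "waaayyy")
tt = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tt.tokenize("@remy: This is waaaayyyy too much for you!!!!!!"))
# e.g. [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']

A third option, preserve_case=False, lowercases all tokens except emoticons.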
Alternatives
- RegexpTokenizer (see the sketch after this list)
- SExprTokenizer
- stanford_segmenter
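Of these, RegexpTokenizer is the closest replacement when all you need is hashtag-aware splitting, since you define the token pattern yourself. A minimal sketch; the regex here is an illustration, not an official recipe:

from nltk.tokenize import RegexpTokenizer

# match hashtags and @-mentions as single tokens, then words,
# then any other non-space character
tokenizer = RegexpTokenizer(r'[#@]\w+|\w+|\S')
print(tokenizer.tokenize("This is a coooool #dummysmiley and @remy too!"))
# e.g. ['This', 'is', 'a', 'coooool', '#dummysmiley', 'and', '@remy', 'too', '!']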
Download and Install NLTK
sudo pip install -U numpy
sudo pip install -U nltk
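To verify the installation, a quick check (assuming pip installed nltk for the python on your PATH):

python -c "import nltk; print(nltk.__version__)"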
Downloading NLTK Data on Linux
Run the command
python -m nltk.downloader all |
To ensure a central installation, run the command
sudo python -m nltk.downloader -d /usr/local/share/nltk_data all |
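Downloading everything is heavyweight. The word_tokenize example above only needs the Punkt models, which can be fetched on their own from inside Python:

import nltk

# fetch just the Punkt tokenizer models used by word_tokenize
nltk.download('punkt')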