CIB Mango Tree
A Civic Tech DC Project
Copypasta
This test scans for identical strings of text that appear across
multiple posts in your dataset. The test reveals possible networks
of bots or human users seeking to copy and paste identical messages,
AKA "copypasta." By copying and pasting identical text, human and
bot networks can influence discourse towards a particular narrative.
See Code for Test on GitHub
Weber, Derek, and Frank Neumann. “Amplifying influence through
coordinated behaviour in social networks.” Social Network Analysis
and Mining 11, no. 1 (October 2021).
Username
Unique Post Number
Post Content
-
N-gram tokenizing:
The test will break up the content of each post
in your dataset into sequences of words, or “n-grams.”
Each word separated by a space is an individual token
in the n-gram, and the number of tokens indicates the
length of an n-gram.
-
Scanning for repeated n-grams:
The test will check whether any n-gram of length
three (e.g. “he eats snails”) to five (e.g. “The
last time I voted”) is repeated in another post,
and will report every n-gram that appears more
than once across different posts.
-
Counting repeated n-grams:
The test will tally the number of times an n-gram is
repeated and will sort n-grams by frequency of repetitions
(highest to lowest) in the data output. This allows you
to see if a particular phrase was copied and pasted many
times across posts.
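The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: the function names `ngrams` and `repeated_ngrams` and the `(username, post_id, content)` tuple layout are hypothetical. The sketch also assumes that text is compared exactly as written (no lowercasing or punctuation stripping) and splits on any whitespace, which may differ from how the test itself normalizes posts.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngrams(posts, min_n=3, max_n=5):
    """Return (n-gram, count) pairs for n-grams found in more than one post.

    posts: iterable of (username, post_id, content) tuples (hypothetical layout).
    The result is sorted by count, highest first.
    """
    counts = defaultdict(int)    # n-gram -> total number of occurrences
    post_ids = defaultdict(set)  # n-gram -> ids of the posts that contain it
    for _username, post_id, content in posts:
        for n in range(min_n, max_n + 1):
            for gram in ngrams(content, n):
                counts[gram] += 1
                post_ids[gram].add(post_id)
    # Keep only n-grams that occur in at least two different posts.
    repeated = {g: c for g, c in counts.items() if len(post_ids[g]) > 1}
    return sorted(repeated.items(), key=lambda item: item[1], reverse=True)
```

For example, `ngrams("he eats snails every day", 3)` returns `['he eats snails', 'eats snails every', 'snails every day']`. A post shorter than `n` tokens simply yields no n-grams of that length.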
Output Format: .csv File
-
The test will produce a CSV file where repeated n-grams
are presented in rows, with each row corresponding to
an individual post where a repeated n-gram was found.
-
The rows are arranged by n-grams of longest lengths
first. The CSV will present all repeated n-grams of
length 5 first (if found), followed by repeated n-grams
of length 4 (if found), then 3 (if found).
-
The content of the n-gram is displayed, and rows are
further sorted by the number of times the particular
n-gram was repeated by all users in the dataset,
starting from the n-grams repeated the most to the
n-grams repeated the least (a minimum of two occurrences).
-
The username of the user who posted the n-gram is
presented, and the rows are further sorted by how many
times each user posted that n-gram, from most to least.
-
The content of the entire post in which the n-gram appeared
is presented, as well as the timestamp for when the post was made.
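The row ordering described above amounts to a three-level sort. The sketch below shows one way to express it with Python's standard `csv` module. It is an illustration only: the column names (`ngram_length`, `ngram`, `total_reps`, `username`, `user_reps`, `post_content`, `timestamp`) are assumptions and may not match the headers in the file the test actually produces.

```python
import csv
import io

# Hypothetical column names; the real output file may label these differently.
FIELDNAMES = [
    "ngram_length", "ngram", "total_reps",
    "username", "user_reps", "post_content", "timestamp",
]


def sort_rows(rows):
    """Order rows by n-gram length, then total repetitions, then per-user
    repetitions, each from highest to lowest."""
    return sorted(
        rows,
        key=lambda r: (-r["ngram_length"], -r["total_reps"], -r["user_reps"]),
    )


def to_csv(rows):
    """Return the sorted rows as CSV text, one row per post."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(sort_rows(rows))
    return buffer.getvalue()
```

Because Python's sort is stable, rows that tie on all three keys keep the order in which they were supplied.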