CIB Mango Tree
A Civic Tech DC Project
Copypasta
This test scans for identical strings of text that appear across multiple posts in your dataset. It can reveal possible networks of bots or human users who copy and paste identical messages, also known as "copypasta." By posting identical text, human and bot networks can steer discourse toward a particular narrative.
See Code for Test on GitHub
Weber, Derek, and Frank Neumann. “Amplifying influence through coordinated behaviour in social networks.” Social Network Analysis and Mining 11, no. 1 (October 2021).
Username
Unique Post Number
Post Content
Timestamp

  1. N-gram tokenizing: The test breaks the content of each post in your dataset into sequences of consecutive words, or “n-grams.” Each word separated by a space is an individual token, and the number of tokens in a sequence is the length of the n-gram.
  2. Scanning for repeated n-grams: The test checks whether any n-gram of length three (e.g. “he eats snails”) to five (e.g. “The last time I voted”) also appears in another post, and reports every n-gram that appears more than once across different posts.
  3. Counting repeated n-grams: The test will tally the number of times an n-gram is repeated and will sort n-grams by frequency of repetitions (highest to lowest) in the data output. This allows you to see if a particular phrase was copied and pasted many times across posts.
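
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function names `ngrams` and `repeated_ngrams` are hypothetical, matching is assumed to be case-sensitive and based on whitespace splitting only, and "repeated" is read as "appears in at least two different posts."

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return every n-gram in `text` as a space-joined string.

    Tokens are whatever `str.split()` yields, i.e. words separated by whitespace.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngrams(posts, min_n=3, max_n=5):
    """Count n-grams of length min_n..max_n that occur in more than one post.

    `posts` maps a unique post number to the post content (assumed input shape).
    Returns (n-gram, total occurrences) pairs, most frequent first.
    """
    counts = Counter()               # total occurrences of each n-gram
    posts_seen = defaultdict(set)    # which posts each n-gram appears in
    for post_id, text in posts.items():
        for n in range(min_n, max_n + 1):
            for gram in ngrams(text, n):
                counts[gram] += 1
                posts_seen[gram].add(post_id)
    # Keep only n-grams shared by at least two different posts.
    repeated = {g: c for g, c in counts.items() if len(posts_seen[g]) > 1}
    # Sort by frequency (highest to lowest); ties are broken alphabetically.
    return sorted(repeated.items(), key=lambda item: (-item[1], item[0]))


posts = {
    1: "he eats snails daily",
    2: "I heard he eats snails",
    3: "nothing in common here",
}
print(repeated_ngrams(posts))
```

With these three sample posts, the only n-gram shared by two posts is “he eats snails,” so it is the only entry returned.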

Output Format: .csv File
  • The test produces a CSV file in which each row corresponds to an individual post where a repeated n-gram was found.
  • Rows are ordered by n-gram length, longest first: all repeated n-grams of length 5 (if found), followed by those of length 4 (if found), then length 3 (if found).
  • The content of the n-gram is displayed. Within each length, rows are sorted by the number of times the n-gram was repeated by all users in the dataset, from most repeated to least repeated (a minimum of two occurrences).
  • The unique username of the user who posted the n-gram is displayed. Rows for a given n-gram are further sorted by how many times each user posted it, from most to least.
  • The content of the entire post in which the n-gram appeared is presented, as well as the timestamp for when the post was made.
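
The row ordering described above can be sketched as a multi-key sort. This is an illustration under stated assumptions rather than the tool's actual implementation: the column names `ngram`, `username`, `post_content`, and `timestamp` and the helper names `sort_rows` and `to_csv` are hypothetical, and each input row is assumed to represent one post containing one repeated n-gram.

```python
import csv
import io
from collections import Counter


def sort_rows(rows):
    """Order rows by n-gram length, then total repetitions, then per-user repetitions."""
    total = Counter(r["ngram"] for r in rows)                        # rows per n-gram
    per_user = Counter((r["ngram"], r["username"]) for r in rows)    # rows per (n-gram, user)
    return sorted(
        rows,
        key=lambda r: (
            -len(r["ngram"].split()),                    # longest n-grams first
            -total[r["ngram"]],                          # most repeated n-grams first
            r["ngram"],                                  # keep rows for one n-gram together
            -per_user[(r["ngram"], r["username"])],      # most active users first
            r["username"],                               # stable tie-break (an assumption)
        ),
    )


def to_csv(rows):
    """Return the sorted rows as CSV text with a header line."""
    fields = ["ngram", "username", "post_content", "timestamp"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(sort_rows(rows))
    return buffer.getvalue()
```

Negating the numeric keys gives a descending sort on length and counts while keeping the alphabetical tie-breaks ascending, so a single call to `sorted` covers all of the ordering rules.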
Sample CSV output