Command Line Usage

Basic Usage

The simplest usage is copydetect -t DIRS, where DIRS is a space-separated list of directories to search for input files. This will recursively search for all files in the provided directories and compare every file with every other file. To look only at specific file extensions, use -e followed by another space-separated list (for example, copydetect -t student_code -e cc cpp h)

If the files you want to compare to are different from the files you want to check for plagiarism (for example, if you want to also compare to submissions from previous semesters), use -r to provide a list of reference directories. For example, copydetect -t PA01_F20 -r PA01_F20 PA01_S20 PA01_F19. To avoid matches with code that was provided to students, use -b to specify a list of directories containing boilerplate code.

There are several options for tuning the sensitivity of the detector. The noise threshold, set with -n, is the minimum number of matching characters between two documents that is considered plagiarism. Note that this is AFTER tokenization and filtering, where variable names have been replaced with V, function names with F, etc. If you change -n (default value: 25), you will also have to change the guarantee threshold, -g (default value: 30). This is the number of matching characters for which the detector is guaranteed to detect the match. If speed isn’t an issue, you can set this equal to the noise threshold. Finally, the display threshold, -d (default value: 0.33), is used to determine what percentage of code similarity is considered interesting enough to display on the output report. The distribution of similarity scores is plotted on the output report to assist selection of this value.

There are several other command line options for different use cases. If you only want to check for “lazy” plagiarism (direct copying without changing variable names or reordering code), -f can be used to disable code filtering. If you don’t want to compare files in the same leaf directory (for example, if code is split into per-student directories and you don’t care about self plagiarism), use -l. For a complete list of configuration options, see the following section.

Configuration Options

Configuration options can be provided either by using the command line arguments or by using a JSON file. If a JSON file is used, specify it on the command line using -c (e.g., copydetect -c configuration.json). A sample configuration file is available here. The following list provides the names of each JSON configuration key along with its associated command line arguments.

  • test_directories (-t, --test-dirs): a list of directories to recursively search for files to check for plagiarism.
  • reference_directories (-r, --ref-dirs): a list of directories to search for files to compare the test files to. This should generally be a superset of test_directories. If not provided, the test directories are used as reference directories.
  • boilerplate_directories (-b, --boilerplate-dirs): a list of directories containing boilerplate code. Matches between fingerprints present in the boilerplate code will not be considered plagiarism.
  • extensions (-e, --extensions): a list of file extensions containing code the detector should look at.
  • noise_threshold (-n, --noise-thresh): the smallest sequence of matching characters between two files which should be considered plagiarism. Note that tokenization and filtering replaces variable names with V, function names with F, object names with O, and strings with S so the threshold should be lower than you would expect from the original code.
  • guarantee_threshold (-g, --guarantee-thresh): the smallest sequence of matching characters between two files for which the system is guaranteed to detect a match. This must be greater than or equal to the noise threshold. If computation time is not an issue, you can set guarantee_threshold = noise_threshold.
  • display_threshold (-d, --display-thresh): the similarity percentage cutoff for displaying similar files on the detector report.
  • force_language (-o, --force-language): forces the tokenizer to tokenize input as a specific language, rather than automatically detecting the language using the file extension.
  • same_name_only (-s, --same-name): if true, the detector will only compare files that have the same name (for example, decision_tree.py will not be compared to k_nn.py). Note that this also means that, for example, bryson_k_nn.py will not be compared to sara_k_nn.py.
  • ignore_leaf (-l, --ignore-leaf): if true, the detector will not compare files located in the same leaf directory.
  • disable_filtering (-f, --disable-filter): if true, the detector will not tokenize and filter code before generating file fingerprints.
  • disable_autoopen (-a, --disable-autoopen): if true, the detector will not automatically open a browser window to display the report.
  • truncate (-T, --truncate): if true, highlighted code will be truncated to remove non-highlighted regions from the displayed output (sections not within 10 lines of highlighted code will be replaced with “…”).
  • out_file (-O, --out-file): path to save output report to. A ‘.html’ extension will be added to the path if not provided. If a directory is provided instead of a file, the report will be saved to that directory as report.html.
  • encoding (--encoding): encoding to use for reading files (the default is UTF-8). If files use varying encodings, –encoding DETECT can be used to detect the encoding of all files (note: encoding detection requires the chardet package).