API

Copydetect can also be run via the Python API. An example of basic usage is provided below.

>>> from copydetect import CopyDetector
>>> detector = CopyDetector(test_dirs=["tests"], extensions=["py"],
...                         display_t=0.5)
>>> detector.add_file("copydetect/utils.py")
>>> detector.run()
  0.00: Generating file fingerprints
   100%|████████████████████████████████████████████████████| 8/8
  0.31: Beginning code comparison
   100%|██████████████████████████████████████████████████| 8/8
  0.31: Code comparison completed
>>> detector.generate_html_report()
Output saved to report/report.html

For advanced use cases, the API contains a CodeFingerprint class for performing general file comparisons. An example of basic usage is provided below:

>>> import copydetect
>>> fp1 = copydetect.CodeFingerprint("sample1.py", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("sample2.py", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(
...     fp1, fp2)
>>> token_overlap
53
>>> similarities[0]
0.828125
>>> similarities[1]
0.8412698412698413
>>> code1, _ = copydetect.utils.highlight_overlap(
...     fp1.raw_code, slices[0], ">>", "<<")
>>> code2, _ = copydetect.utils.highlight_overlap(
...     fp2.raw_code, slices[1], ">>", "<<")
>>> print(code1)
def hashed_kgrams(string, k):
    """Return hashes of all k-grams in string"""
    >>hashes = [hash(string[offset:offset+k])
              for offset in range(len(string) - k + 1)]
    return np.array(hashes)<<

>>> print(code2)
def hash_f(s, k):
    >>h = [hash(s[o:o+k]) for o in range(len(s)-k+1)]
    return np.array(h)<<

Detector

This module contains functions for detecting overlap between a set of test files (files to check for plagiarism) and a set of reference files (files that might have been plagiarized from).

class copydetect.detector.CodeFingerprint(file, k, win_size, boilerplate=None, filter=True, language=None, fp=None, encoding: str = 'utf-8')

Class for tokenizing, filtering, fingerprinting, and winnowing a file. Maintains information about fingerprint indexes and token indexes to assist code highlighting for the output report.

Parameters:
  • file (str) – Path to the file fingerprints should be extracted from.
  • k (int) – Length of k-grams to extract as fingerprints.
  • win_size (int) – Window size to use for winnowing (must be >= 1).
  • boilerplate (array_like, optional) – List of fingerprints to use as boilerplate. Any fingerprints present in this list will be discarded from the hash list.
  • filter (bool, default=True) – If set to False, the code will not be tokenized and filtered.
  • fp (TextIO, default=None) – I/O stream for data to create a fingerprint for. If provided, the “file” argument will not be used to load a file from disk but will still be used for language detection and displayed on the report.
  • encoding (str, default="utf-8") – Text encoding to use for reading the file. If “DETECT”, the chardet library will be used (if installed) to automatically detect the file encoding.
filename

Name of the originally provided file.

Type:str
raw_code

Unfiltered code.

Type:str
filtered_code

Code after tokenization and filtering. If filter=False, this is the same as raw_code.

Type:str
offsets

The cumulative number of characters removed during filtering at each index of the filtered code. Used for translating locations in the filtered code to locations in the unfiltered code.

Type:Nx2 array of ints
hashes

Set of fingerprint hashes extracted from the filtered code.

Type:Set[int]
hash_idx

Mapping of each fingerprint hash back to all indexes in the original code in which this fingerprint appeared.

Type:Dict[int, List[int]]
k

Value of provided k argument.

Type:int
language

If set, will force the tokenizer to use the provided language rather than guessing from the file extension.

Type:str
token_coverage

The number of tokens in the tokenized code which are considered for fingerprint comparison, after performing winnowing and removing boilerplate.

Type:int
class copydetect.detector.CopyDetector(test_dirs=None, ref_dirs=None, boilerplate_dirs=None, extensions=None, noise_t=25, guarantee_t=25, display_t=0.33, same_name_only=False, ignore_leaf=False, autoopen=True, disable_filtering=False, force_language=None, truncate=False, out_file='./report.html', silent=False, encoding: str = 'utf-8')

Main plagiarism detection class. Searches the provided directories and uses the detection parameters to calculate the similarity between all files found in those directories.

Parameters:
  • test_dirs (list) – (test_directories) A list of directories to recursively search for files to check for plagiarism.
  • ref_dirs (list) – (reference_directories) A list of directories to search for files to compare the test files to. This should generally be a superset of test_directories.
  • boilerplate_dirs (list) – (boilerplate_directories) A list of directories containing boilerplate code. Matches between fingerprints present in the boilerplate code will not be considered plagiarism.
  • extensions (list) – A list of file extensions containing code the detector should look at.
  • noise_t (int) – (noise_threshold) The smallest sequence of matching characters between two files which should be considered plagiarism. Note that tokenization and filtering replaces variable names with V, function names with F, object names with O, and strings with S so the threshold should be lower than you would expect from the original code.
  • guarantee_t (int) – (guarantee_threshold) The smallest sequence of matching characters between two files for which the system is guaranteed to detect a match. This must be greater than or equal to the noise threshold. If computation time is not an issue, you can set guarantee_threshold = noise_threshold.
  • display_t (float) – (display_threshold) The similarity percentage cutoff for displaying similar files on the detector report.
  • same_name_only (bool) – If true, the detector will only compare files that have the same name.
  • ignore_leaf (bool) – If true, the detector will not compare files located in the same leaf directory.
  • autoopen (bool) – If true, the detector will automatically open a web browser to display the results of generate_html_report.
  • disable_filtering (bool) – If true, the detector will not tokenize and filter code before generating file fingerprints.
  • force_language (str) – If set, forces the tokenizer to use a particular programming language regardless of the file extension.
  • truncate (bool) – If true, highlighted code will be truncated to remove non-highlighted regions from the displayed output.
  • out_file (str) – Path to output report file.
  • silent (bool) – If true, all logging output will be suppressed.
  • encoding (str, default="utf-8") – Text encoding to use for reading the file. If “DETECT”, the chardet library will be used (if installed) to automatically detect the file encoding.
add_file(filename, type='testref')

Adds a file to the list of test files, reference files, or boilerplate files.

Parameters:
  • filename (str) – Name of file to add.
  • type ({"testref", "test", "ref", "boilerplate"}) – Type of file to add. “testref” will add the file as both a test and reference file.
classmethod from_config(config)

Initializes a CopyDetector object using the provided configuration dictionary.

Parameters:config (dict) – Configuration dictionary using CLI parameter names.
Returns:CopyDetector object initialized with config
Return type:CopyDetector
generate_html_report(output_mode='save')

Generates an HTML report listing all files with similarity above the display_threshold, with the copied code segments highlighted.

Parameters:output_mode ({"save", "return"}) – If “save”, the output will be saved to the file specified by self.out_file. If “return”, the output HTML will be directly returned by this function.
get_copied_code_list()

Get a list of copied code to display on the output report. Returns a list of tuples containing the similarity score, the test file name, the compare file name, the highlighted test code, and the highlighted compare code.

Returns:list of similarity data between each file pair which achieves a similarity score above the display threshold, ordered by percentage of copying in the test file. Each element of the list contains [test similarity, reference similarity, path to test file, path to reference file, highlighted test code, highlighted reference code, number of overlapping tokens]
Return type:list
run()

Runs the copy detection loop for detecting overlap between test and reference files. If no files are in the provided directories, the similarity matrix will remain empty and any attempts to generate a report will fail.

copydetect.detector.compare_files(file1_data, file2_data)

Computes the overlap between two CodeFingerprint objects using the generic functions from copydetect.utils. Returns the number of overlapping tokens and two tuples containing the overlap percentage and copied slices for each unfiltered file.

Parameters:
  • file1_data (CodeFingerprint) – Fingerprint of the first file.
  • file2_data (CodeFingerprint) – Fingerprint of the second file.
Returns:

  • token_overlap (int) – Number of overlapping tokens between the two files.
  • similarities (tuple of 2 floats) – For both files: number of overlapping tokens divided by the total number of tokens in that file.
  • slices (tuple of 2 2xN int arrays) – For both files: locations of copied code in the unfiltered text. Dimension 0 contains slice starts, dimension 1 contains slice ends.
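As a concrete check of the similarity arithmetic, the values from the doctest at the top of this page are consistent with dividing the token overlap by each file's token coverage. The coverage values below (64 and 63) are inferred from the reported similarities, not printed by the library:

```python
# Similarity arithmetic from the doctest earlier on this page.
# The per-file token_coverage values are an assumption inferred from the
# reported similarities, not values output by copydetect itself.
token_overlap = 53
coverage_file1, coverage_file2 = 64, 63

sim1 = token_overlap / coverage_file1   # 0.828125
sim2 = token_overlap / coverage_file2   # 0.8412698412698413
```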

Utils

This module contains functions for tokenizing/filtering code as well as generic functions for detecting overlap between two documents.

copydetect.utils.filter_code(code, filename, language=None)

Tokenize and filter a code document. Replace variable names with V, function names with F, object names with O, and strings with S. Return the filtered document and a list of offsets indicating how many characters were removed by filtering at each index in the resulting document where filtering occurred (this is used later to highlight the original code using plagiarism detection results on the filtered code).
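A toy regex-based sketch of the filtering idea. Unlike the real filter_code, it collapses every identifier (keywords included) to V, does not distinguish variable, function, and object names, and does not track offsets; it only illustrates the normalization step:

```python
import re

def filter_code_sketch(code):
    """Illustrative only: collapse string literals to S and all identifiers
    to V. The real tokenizer is language-aware and preserves keywords."""
    code = re.sub(r'"[^"]*"|\'[^\']*\'', "\0", code)   # strings -> sentinel
    code = re.sub(r"\b[A-Za-z_]\w*\b", "V", code)      # identifiers -> V
    return code.replace("\0", "S")                     # sentinel -> S
```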

copydetect.utils.find_fingerprint_overlap(hashes1, hashes2, idx1, idx2)

Finds the indexes of overlapping values between two lists of hashes. Returns two lists of indexes, one for the first hash list and one for the second. The indexes of the original hashes are provided in case boilerplate results in gaps.
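The overlap lookup can be sketched as a set intersection followed by a mapping back to the provided index lists (a hypothetical reimplementation; the library's actual code may differ):

```python
def fingerprint_overlap_sketch(hashes1, hashes2, idx1, idx2):
    """Return the original-code indexes of hashes shared by both files."""
    common = set(hashes1) & set(hashes2)
    overlap1 = [i for h, i in zip(hashes1, idx1) if h in common]
    overlap2 = [i for h, i in zip(hashes2, idx2) if h in common]
    return overlap1, overlap2
```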

copydetect.utils.get_copied_slices(idx, k)

Given k and a list of indexes detected by find_fingerprint_overlap, generates a list of slices where the copied code begins and ends. Returns a 2D array where the first dimension is slice start locations and the second dimension is slice end locations.
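The slice construction can be sketched as an interval merge: each matched index i covers the span [i, i+k), and spans that touch or overlap are joined (an illustrative reimplementation, returning plain lists rather than a 2D array):

```python
def copied_slices_sketch(idx, k):
    """Merge match indexes into [start, end) slices, joining matches
    whose k-wide spans touch or overlap."""
    idx = sorted(idx)
    starts, ends = [idx[0]], []
    for prev, cur in zip(idx, idx[1:]):
        if cur - prev > k:        # spans are disjoint: close current slice
            ends.append(prev + k)
            starts.append(cur)
    ends.append(idx[-1] + k)
    return starts, ends
```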

copydetect.utils.get_document_fingerprints(doc, k, window_size, boilerplate=None)

Given a document, computes all k-gram hashes and uses the winnowing algorithm to reduce their number. Optionally takes a list of boilerplate hashes to remove from the winnowed list. Returns the selected hashes and their indexes in the original list.

copydetect.utils.get_token_coverage(idx: Dict[int, List[int]], k: int, token_len: int)

Determines the number of tokens in the original document which are included in the winnowed indices.
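The coverage count can be sketched as the size of the union of the k-token spans starting at each winnowed index, clipped to the document length (an illustrative reimplementation, not the library's code):

```python
from typing import Dict, List

def token_coverage_sketch(idx: Dict[int, List[int]], k: int,
                          token_len: int) -> int:
    """Count tokens covered by at least one winnowed k-gram."""
    covered = set()
    for positions in idx.values():
        for start in positions:
            covered.update(range(start, min(start + k, token_len)))
    return len(covered)
```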

copydetect.utils.hashed_kgrams(string, k)

Return hashes of all k-grams in a string.
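The doctest earlier on this page shows the implementation; as a plain-Python sketch without the NumPy conversion:

```python
def hashed_kgrams(string, k):
    """Hash every length-k substring (k-gram) of string."""
    return [hash(string[offset:offset + k])
            for offset in range(len(string) - k + 1)]

# "abcde" with k=3 yields the hashes of "abc", "bcd", and "cde"
```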

copydetect.utils.highlight_overlap(doc, slices, left_hl, right_hl, truncate=-1, escape_html=False)

Highlights copied code in a document given the slices containing copied code and strings to use for the highlight start and end. Returns the document annotated with the highlight strings as well as the percentage of code which was highlighted. If truncate is set to an integer, everything not within that many lines of highlighted code will be replaced with “…”.
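Ignoring truncation and HTML escaping, the core of the highlighting can be sketched as wrapping each slice in the marker strings and reporting the highlighted fraction (a simplified reimplementation, not the library's code):

```python
def highlight_overlap_sketch(doc, slices, left_hl, right_hl):
    """Wrap each [start, end) slice of doc in the given marker strings
    and report the fraction of characters highlighted."""
    starts, ends = slices
    out, prev = [], 0
    for start, end in zip(starts, ends):
        out += [doc[prev:start], left_hl, doc[start:end], right_hl]
        prev = end
    out.append(doc[prev:])
    highlighted = sum(e - s for s, e in zip(starts, ends))
    return "".join(out), highlighted / len(doc)
```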

copydetect.utils.winnow(hashes, window_size, remove_duplicates=True)

Implementation of the robust winnowing algorithm described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf. Returns a list of selected hashes and the indexes of those hashes.
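A simplified sketch of winnowing that selects the rightmost minimal hash of each sliding window and deduplicates by position. The robust variant in the paper additionally prefers to re-select the previous window's minimum; on the paper's example sequence both yield the fingerprints 17, 17, 8, 39, 17 at positions 3, 6, 8, 11, 15:

```python
def winnow_sketch(hashes, window_size):
    """Select the rightmost minimal hash in each sliding window, then
    deduplicate selections that land on the same position."""
    selected = {}
    for start in range(len(hashes) - window_size + 1):
        window = hashes[start:start + window_size]
        smallest = min(window)
        # rightmost occurrence of the minimum within this window
        offset = max(i for i, h in enumerate(window) if h == smallest)
        selected[start + offset] = smallest
    idx = sorted(selected)
    return [selected[i] for i in idx], idx
```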