NEAR DEDUPLICATION
Best for large-scale document reviews and/or messy data imports.
- Near dedupe differs from our standard “dedupe” in that it goes beyond hash values (electronic thumbprints) to identify similar documents as well as exact duplicates
- Process can be done on native or imaged/produced data
- Steps of analysis:
- Extract the text of every document
- Text files are then analyzed through a “scoring” process
- This creates one “Master” doc and scores all other nearly duplicative docs against it
- Each score equates to a percentage of similarity
- E.g., Document B is 97% similar to Master Document A
- The client has free rein over how stringent they want to be about what’s considered a Duplicate
- E.g., one client may treat anything above 85% similar as a duplicate, whereas another may want to be stricter and set the threshold at 90% and above
- Once this scoring is complete, the results are viewable in Nextpoint
- The Masters are isolated into one folder, and the Duplicates into another
- The “Similarity Score” is brought into a coding field for visibility and sorting purposes
- The Related Document window will further show you a “cluster” of nearly duplicate documents
- The Master will show up on the top level, with all other duplicates showing up as “related” documents underneath
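
The scoring and threshold steps above can be illustrated with a rough sketch. This is not the engine used in the actual process, which is not described here. It assumes a simple character-level comparison from Python's standard `difflib` module, and the names `similarity`, `cluster_near_duplicates`, and the sample documents are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between two extracted texts as a percentage (0-100)."""
    # Assumption: difflib's ratio stands in for whatever scoring a real tool uses.
    return SequenceMatcher(None, a, b).ratio() * 100


def cluster_near_duplicates(docs: dict[str, str], threshold: float = 85.0) -> dict[str, list[tuple[str, float]]]:
    """Group documents into clusters of one Master plus its near duplicates.

    Greedy approach (an assumption): each document is scored against the
    existing Masters in order. It joins the first Master it matches at or
    above the threshold; otherwise it becomes a new Master itself.
    """
    clusters: dict[str, list[tuple[str, float]]] = {}
    for doc_id, text in docs.items():
        for master_id in clusters:
            score = similarity(docs[master_id], text)
            if score >= threshold:
                clusters[master_id].append((doc_id, round(score, 1)))
                break
        else:
            # No Master was similar enough, so this document starts a new cluster.
            clusters[doc_id] = []
    return clusters


# Hypothetical extracted text, keyed by document ID.
docs = {
    "A": "The quick brown fox jumps over the lazy dog.",
    "B": "The quick brown fox jumps over the lazy dog!",
    "C": "Quarterly revenue figures attached for review.",
}

for master, duplicates in cluster_near_duplicates(docs, threshold=85.0).items():
    print(master, duplicates)
```

With the 85% threshold, Document B lands under Master Document A with its similarity score, while Document C stands alone as its own Master. Raising the threshold makes the grouping stricter, so fewer documents are treated as duplicates.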