Going from UnidiffFormater to InlineDiffBuilder

Apr 18, 2011 at 7:28 PM

First, thanks for DiffPlex!

Now my issue: I'm using the latest version of DiffPlex and trying to use InlineDiffBuilder to create a string showing the changes between two paragraphs of text. In an older project I used UnidiffFormatter, and after changing the code to use CreateWordDiffs, I got my string back with contiguous word deletions (or insertions) reported as a single delete (or insert). InlineDiffBuilder marks each individual word as a separate delete (or insert). Is this a difference between how the two DiffBuilders work? Is the best solution to go back to UnidiffFormater? Any direction you could provide would be great. Thanks!

Coordinator
Apr 22, 2011 at 3:02 PM
Edited Apr 22, 2011 at 3:03 PM

Hi,

I am a bit confused about what behavior you are looking for. The InlineDiffBuilder is currently designed to do only line diffs, and it groups deletes and inserts together.

So with the inline diff builder if you have:

OldText:

line one
line two
same line
line three

and NewText:

line uno
line dos
same line
line tres


Then it creates this diff:

- line one
- line two
+ line uno
+ line dos
  same line
- line three
+ line tres

 
It does a line by line diff and groups inserts and deletes together.

What behavior do you expect or want?
 

Thanks,
-Matt

 

Apr 22, 2011 at 4:41 PM

Hi Matt,

I have two paragraphs of text and I need the output to be one paragraph that combines the additions/deletions. It would look like this:

This is paragraph 1 and it
has some text in it that
looks like this and that

This is paragraph 2 and it
has more stuff than before
that looks like this

OUTPUT:

This is paragraph 12 and it
has some text in it than before
that looks like this and that

I might have made a mistake but hopefully you get the point :)

From other discussions, I got the impression that I should move away from UnidiffFormatter to InlineDiffBuilder, but I didn't see an easy way to make it work the way I described. I went back to UnidiffFormatter (after adding the line of code that seemed to be missing at the end).

The only issue I'm seeing now has to do with the locality of the changes. For example, when text 1 is short and text 2 is much longer, it will sometimes find a word from text 1 near the end of text 2 and show those words as matching. Technically, yes, it matches, but from the reader's perspective it doesn't make much sense. I might try chunking the text into smaller sections so this doesn't happen.

Hope this was all semi-clear and thanks for any directions/help you can provide.
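
The chunking idea mentioned above could be sketched roughly as follows. This is only an illustration and not part of DiffPlex: `ChunkedDiff`, `SplitIntoSentences`, and `DiffByChunk` are hypothetical names, the sentence splitting is deliberately naive, and chunks are paired by position, which assumes both texts have roughly the same sentence structure. The actual diff is passed in as a delegate, so any word diff can be plugged in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DiffPlex.
public static class ChunkedDiff
{
    // Splits text into sentence-like chunks, keeping the terminator with each chunk.
    public static List<string> SplitIntoSentences(string text)
    {
        var chunks = new List<string>();
        int start = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '.' || c == '!' || c == '?' || c == '\n')
            {
                string chunk = text.Substring(start, i - start + 1).Trim();
                if (chunk.Length > 0) chunks.Add(chunk);
                start = i + 1;
            }
        }

        // Whatever follows the last terminator is a chunk of its own.
        string tail = text.Substring(start).Trim();
        if (tail.Length > 0) chunks.Add(tail);
        return chunks;
    }

    // Diffs chunk i of oldText against chunk i of newText with the supplied
    // diff function. Pairing by index is an assumption of this sketch.
    public static List<string> DiffByChunk(string oldText, string newText,
                                           Func<string, string, string> diffChunk)
    {
        List<string> oldChunks = SplitIntoSentences(oldText);
        List<string> newChunks = SplitIntoSentences(newText);
        int count = Math.Max(oldChunks.Count, newChunks.Count);

        var results = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            // A missing counterpart is treated as an empty chunk.
            string a = i < oldChunks.Count ? oldChunks[i] : string.Empty;
            string b = i < newChunks.Count ? newChunks[i] : string.Empty;
            results.Add(diffChunk(a, b));
        }
        return results;
    }
}
```

Because each chunk is diffed only against its counterpart, a word near the start of text 1 can no longer be matched against a word near the end of text 2. The trade-off is that an inserted or removed sentence shifts every later pairing, so a real implementation would probably need to align the chunks first.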

Coordinator
Apr 22, 2011 at 4:45 PM

So you edited the InlineDiffBuilder to do word diffs? The checked-in code does not do that.

Is the output you showed what you want, or what you are getting?

Can you show me both what you want and what you get?

Apr 22, 2011 at 4:49 PM

I tried playing with InlineDiffBuilder but went back to Unidiff. I tried InlineDiffBuilder because you had stated that it replaced Unidiff and was better written. I was asking if there was a way to get InlineDiffBuilder to do what I wanted.

The output I showed is both what I want AND what I'm getting now using Unidiff (with some code changes).

Jun 13, 2011 at 8:54 PM
Edited Jun 24, 2011 at 7:54 AM

I have made a variant of UnidiffFormater that collects sequences of inserts and deletes to make the result readable, like this:

[run on CreateWordDiffs result]

We the People of the united states of america in order to form a more perfect union establish justice ensure domestic tranquility provide for promote the
common defence general welfare and secure the blessing of liberty to ourselves and our posterity do ordain and establish this constitution for the United States of America


[run on CreateCharacterDiffs result]

We the People in ofrder theo unitedform states mofre amperfect unicaon establish justice ensure domestic tranquility promovidte for the
commogeneral dwelfare anced secure the blessing of liberty to ourselves and our posterity do ordain and establish this constitution for the United States of America

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using DiffPlex;
using DiffPlex.Model;

public static class UnidiffSeqFormater
{
    private const string NoChangeSymbol = "=";
    private const string InsertSymbol = "+";
    private const string DeleteSymbol = "-";

    public static List<string> Generate (DiffResult diffresult)
    {
        var uniLines = new List<string>();
        int bPos = 0;

        foreach (DiffBlock diffBlock in diffresult.DiffBlocks)
        {
            for (; bPos < diffBlock.InsertStartB; bPos++)
            {
                uniLines.Add(NoChangeSymbol + diffresult.PiecesNew[bPos]);
            }

            int i = 0;
            for (; i < Math.Min(diffBlock.DeleteCountA, diffBlock.InsertCountB); i++)
            {
                uniLines.Add(DeleteSymbol + diffresult.PiecesOld[i + diffBlock.DeleteStartA]);
                uniLines.Add(InsertSymbol + diffresult.PiecesNew[i + diffBlock.InsertStartB]);
                bPos++;
            }

            if (diffBlock.DeleteCountA > diffBlock.InsertCountB)
            {
                for (; i < diffBlock.DeleteCountA; i++)
                {
                    uniLines.Add(DeleteSymbol + diffresult.PiecesOld[i + diffBlock.DeleteStartA]);
                }
            }
            else
            {
                for (; i < diffBlock.InsertCountB; i++)
                {
                    uniLines.Add(InsertSymbol + diffresult.PiecesNew[i + diffBlock.InsertStartB]);
                    bPos++;
                }
            }
        }
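        //***** added bugfix (thanks Wawel for pointing this out)
        // Emit the unchanged pieces that follow the last diff block.
        for (; bPos < diffresult.PiecesNew.Length; bPos++)
        {
            // The first version of this fix omitted NoChangeSymbol here,
            // which made GenerateSeq drop the first character of each piece.
            uniLines.Add(NoChangeSymbol + diffresult.PiecesNew[bPos]);
        }
        //**** end bugfix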

        return uniLines;
    }

    public static string GenerateSeq (DiffResult diffresult)
    {
        return GenerateSeq(diffresult,
                           "[+[",
                           "]+]",
                           "[-[",
                           "]-]");
    }

    public static string GenerateSeq (DiffResult diffresult, string insertSymbolS, string insertSymbolE, string deleteSymbolS, string deleteSymbolE)
    {
        List<string> result = Generate(diffresult);
        var outputSb = new StringBuilder();
        while (result.Count > 0)
        {
            if (result[0].StartsWith(DeleteSymbol))
            {
                int dix = 0;
                outputSb.Append(deleteSymbolS);
                while (dix < result.Count &&  result.Count > 0 && !result[dix].StartsWith(NoChangeSymbol))
                {
                    if (result[dix].StartsWith(DeleteSymbol))
                    {
                        outputSb.Append(result[dix].Substring(1));
                        result.RemoveAt(dix);
                    }
                    else
                    {
                        ++dix;
                    }
                }
                outputSb.Append(deleteSymbolE);
            }
            else if (result[0].StartsWith(InsertSymbol))
            {
                int dix = 0;
                outputSb.Append(insertSymbolS);
                while (dix < result.Count &&  result.Count > 0 && !result[dix].StartsWith(NoChangeSymbol))
                {
                    if (result[dix].StartsWith(InsertSymbol))
                    {
                        outputSb.Append(result[dix].Substring(1));
                        result.RemoveAt(dix);
                    }
                    else
                    {
                        ++dix;
                    }
                }
                outputSb.Append(insertSymbolE);
            }
            else
            {
                outputSb.Append(result[0].Substring(1));
                result.RemoveAt(0);
            }
        }
        return outputSb.ToString();
    }
}

internal static class Program
{
    private const string OldText = @"We the people of the united states of america establish justice ensure domestic tranquility 
provide for the common defence secure the blessing of liberty to ourselves and our posterity";
    private const string NewText = @"We the People in order to form a more perfect union establish justice ensure domestic tranquility 
promote the general welfare and secure the blessing of liberty to ourselves and our posterity do ordain and establish this constitution 
for the United States of America";

    private static void Main (string[] args)
    {
        var d = new Differ();
        DiffResult res = d.CreateWordDiffs(OldText, NewText, true, true, new[] {' ', '\r', '\n'});
        string x = UnidiffSeqFormater.GenerateSeq(res, "<span style='color:green'>", "</span>",
                                                  "<span style='color:red'>", "</span>");
        Debug.WriteLine("<br>----------------------------------------------<br>");
        Debug.WriteLine("");
        Debug.WriteLine(x);

        res = d.CreateCharacterDiffs(OldText, NewText, true, true);
        x = UnidiffSeqFormater.GenerateSeq(res, "<span style='color:green'>", "</span>",
                                           "<span style='color:red'>", "</span>");
        Debug.WriteLine("<br>----------------------------------------------<br>");
        Debug.WriteLine("");
        Debug.WriteLine(x);
    }
}
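
For comparison, the grouping done by GenerateSeq can also be sketched as a single pass that buffers deletes and inserts until the next unchanged piece, which avoids the repeated RemoveAt calls and the index bookkeeping. This is a minimal sketch, not a drop-in replacement: `RunGrouper`, `Group`, and `Flush` are hypothetical names, and it assumes every piece carries a one-character prefix ("=", "+", or "-") as produced by Generate. Unlike GenerateSeq, it emits no markers for a run whose text is empty.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical single-pass variant of the grouping step; not part of DiffPlex.
public static class RunGrouper
{
    public static string Group(IEnumerable<string> prefixedPieces,
                               string insertStart, string insertEnd,
                               string deleteStart, string deleteEnd)
    {
        var output = new StringBuilder();
        var deleted = new StringBuilder();   // pending "-" pieces of the current run
        var inserted = new StringBuilder();  // pending "+" pieces of the current run

        foreach (string piece in prefixedPieces)
        {
            if (piece.Length == 0) continue; // defensive: no prefix to inspect

            string body = piece.Substring(1);
            switch (piece[0])
            {
                case '-':
                    deleted.Append(body);
                    break;
                case '+':
                    inserted.Append(body);
                    break;
                default:
                    // An unchanged piece ends the run: write deletes, then inserts.
                    Flush(output, deleted, deleteStart, deleteEnd);
                    Flush(output, inserted, insertStart, insertEnd);
                    output.Append(body);
                    break;
            }
        }

        // Write a run that reaches the end of the input.
        Flush(output, deleted, deleteStart, deleteEnd);
        Flush(output, inserted, insertStart, insertEnd);
        return output.ToString();
    }

    private static void Flush(StringBuilder output, StringBuilder run,
                              string start, string end)
    {
        if (run.Length == 0) return;
        output.Append(start).Append(run.ToString()).Append(end);
        run.Clear();
    }
}
```

Calling Group on the list returned by Generate, with the same four marker strings, is intended to give the same grouped text as GenerateSeq for word and character diffs.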


Edited: Inserted bugfixes 2011-06-24.

Jun 14, 2011 at 3:02 PM
Edited Jun 14, 2011 at 3:04 PM

There was an error in the code above: the inner while loops need to test for the end of the list, like this:

                while (dix < result.Count && result.Count > 0 && !result[dix].StartsWith(NoChangeSymbol))

Jun 22, 2011 at 11:21 PM

Great code, jorgenlindell, thank you! I found one other piece that needed to be changed, in your last "else" case in GenerateSeq:

                else
                {
                    //outputSb.Append(result[0].Substring(1));
                    // Check the first character - the NoChangeSymbol only appears here in Character comparisons.
                    // It does not in Word comparisons, so I was losing the first character of each matching word.
                    outputSb.Append(result[0].Substring(0,1) == NoChangeSymbol ? result[0].Substring(1) : result[0]);
                    result.RemoveAt(0);
                }


Thanks again, you saved me a ton of time and headache!

Jun 23, 2011 at 12:45 AM
Edited Jun 23, 2011 at 12:52 AM


Hmm, that's strange...?
The sequences in red and green I showed above the code were actually generated by it. If what you say were needed, shouldn't they have come out broken?

Ahh, wait a minute..

Isn't it rather a bug in the bugfix in Generate?
Where it says:

       //***** added bugfix (thanks Wawel for pointing this out)
        for (; bPos < diffresult.PiecesNew.Length; bPos++)
        {
            uniLines.Add(diffresult.PiecesNew[bPos]);
        }

        //**** end bugfix 
   

I think it should be:

            uniLines.Add(NoChangeSymbol + diffresult.PiecesNew[bPos]);

That is the only place anything could be generated without a symbol.

Jun 24, 2011 at 1:28 AM

Ah, you're absolutely right, it was the bug fix that was incorrect.

 

Thanks!
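
The lost first character is easy to reproduce in isolation. The sketch below mirrors only the final else branch of GenerateSeq, which strips one character from every entry on the assumption that it is a prefix symbol; `PrefixDemo` and `StripPrefix` are hypothetical names used for illustration.

```csharp
using System;

// Hypothetical demo class; it only isolates the Substring(1) call from GenerateSeq.
public static class PrefixDemo
{
    // Drops the first character, assuming it is a one-character prefix symbol.
    public static string StripPrefix(string entry)
    {
        return entry.Substring(1);
    }
}
```

With the prefix in place, StripPrefix("=posterity") returns "posterity". Without it, StripPrefix("posterity") returns "osterity", which is the symptom that the incorrect bugfix caused for the trailing unchanged pieces.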

Jun 24, 2011 at 7:54 AM

Modified the code to include that fix.