Parse problem

Discussion in 'Rebol' started by MaxV, May 2, 2012.

  1. MaxV

    MaxV Member

    Hello,
    I need to extract pieces of text from an HTML file. All sentences begin with 4 digits (like 0001) and end with a <br> tag.
    Example:

    How can I extract the sentences?
  2. swhite

    swhite Member

    Are you hoping to use the "parse" function, or is any solution acceptable as long as it works?

    Is it possible that a sentence could contain a four-digit number that is not the start of a sentence?

    Will there be other html tags in a sentence, or is the only html tag present the <br> tag at the end of each sentence?

    If every sentence does indeed end with the <br> tag, then it seems like the initial four-digit number is irrelevant; the <br> tag can be used to divide up the text.

    If the area of text with the numbered lines is part of a larger block of text, you still could break it up on the <br> tags, and then use a brute-force check on each of the first four characters of each line to see if all four are numeric. Only lines with four initial digits would be saved.

    In the larger picture, a question like this is an interesting training and learning exercise. I can see how to do this in another language, but I am not sure what features of REBOL could be put together to solve this problem with REBOL.
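
    One way the split-and-check idea above might look in REBOL, as a minimal sketch assuming REBOL 2. The sample text and the names ws, pos, mark, line and kept are made up for illustration; only digits and extracted echo names used elsewhere in this thread.

    ```rebol
    REBOL []

    digits: charset "0123456789"
    ws: charset " ^-^/"              ; space, tab, newline
    extracted: copy []

    ; Hypothetical sample input, not the original poster's file
    text: {Intro text<br>
    0001First sentence.<br>
    not numbered<br>
    0002Second sentence.<br>}

    pos: text
    ; find returns none when no more "<br>" is left, which ends the loop
    while [mark: find pos "<br>"] [
        line: copy/part pos mark     ; text between the previous tag and this one
        ; brute-force check: optional whitespace, then four digits, then anything
        parse/all line [
            any ws copy kept [4 digits to end] (append extracted kept)
        ]
        pos: skip mark 4             ; step past the 4 characters of "<br>"
    ]

    probe extracted
    ```

    With this sample, only the two lines that start with four digits are expected to end up in extracted; the digits are still attached to each line.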
  3. MaxV

    MaxV Member

    Well, I suppose that parse is useful, because I started with:
    Code:
    digits: charset "0123456789"
     
    parse text [ thru 4 digits  copy temp to <br>]
    but it doesn't work. The solution must be something like that...
  4. MaxV

    MaxV Member

    Well, let's say it is probably closer to something like:
    Code:
    extracted: copy []
    digits: charset "0123456789"
     
    parse text [any [ thru 4 digits  copy temp to <br>(append extracted temp)]]
  5. swhite

    swhite Member

    If the text looks like this:
    0001This is the start of article number one, etc.<br>
    0002This is the start of article number two<br>
    ...
    then why worry about the numbers at all? The <br> tag alone separates all the lines. Why not just:

    parse text [any [copy temp to <br> (append extracted temp) thru <br>]]

    "temp" will be a line of text with four digits at the front, which is what you are looking for, but if you do not want those four digits, or want to do something with them, do it in a separate procedure. Instead of "(append extracted temp)" write a separate procedure to deal with each line, like "(REMOVE-LINE-NUMBER temp)". Then, in the REMOVE-LINE-NUMBER procedure, use brute force to remove the first four characters, and append what is left to "extracted." If the first four characters are NOT numbers, then this is NOT one of the lines you are looking for, and you could ignore it and NOT append it to "extracted."

    I realize this is not very "REBOL-ish." I still have to plod along when I do a REBOL program. The "parse" function is especially obscure to me.
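
    The separate-procedure idea above could be sketched roughly like this, assuming REBOL 2. The helper remove-line-number is hypothetical (named after the suggestion above), the sample text is made up, and each numbered line is assumed to start immediately after the previous <br> with no newline in between.

    ```rebol
    REBOL []

    digits: charset "0123456789"
    extracted: copy []

    ; Hypothetical helper: keep a line only if it starts with four digits,
    ; and store it without those digits.
    remove-line-number: func [line [string! none!]] [
        if all [
            string? line                      ; copy can yield none for an empty match
            parse/all line [4 digits to end]  ; true only with four leading digits
        ][
            append extracted copy skip line 4 ; drop the first four characters
        ]
    ]

    ; Hypothetical sample input
    text: {0001First article.<br>no number here<br>0002Second article.<br>}

    parse/all text [
        any [copy temp to "<br>" (remove-line-number temp) thru "<br>"]
    ]

    probe extracted
    ```

    The thru "<br>" at the end of the loop body moves past each tag, so every pass starts on a fresh line. Lines without a leading number are simply not appended.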
  6. endo

    endo New Member

    Try this:
    parse text [ some [4 digits copy temp to <br> (print temp) | skip]]

    MaxV: I didn't see your post :oops:
  7. MaxV

    MaxV Member

    Wow, it works perfectly! I had never used the skip word this way. Fantastic!
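
    For anyone reading later, here is a minimal sketch of how the rule from post 6 behaves, assuming REBOL 2 and a made-up sample text. It collects the matches in a block instead of printing them.

    ```rebol
    REBOL []

    digits: charset "0123456789"
    extracted: copy []

    ; Hypothetical sample input with some markup before the numbered lines
    text: {<p>header</p>0001First sentence.<br>
    0002Second sentence.<br>}

    parse/all text [
        some [
            ; Alternative 1: four digits at the current position,
            ; so copy everything after them up to the next <br>.
            4 digits copy temp to <br> (append extracted temp)
            ; Alternative 2: no match here, so advance one character
            ; and let the loop try again at the next position.
            | skip
        ]
    ]

    probe extracted
    ```

    Because copy comes after 4 digits, the captured text does not include the number. The skip alternative is what lets the rule walk over everything that is not a numbered sentence, including the <br> tags themselves, until the end of the input.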
