Talk:Ns urlencode

From AOLserver Wiki

RHS 21Oct2004

The command ns_urlencode currently doesn't do what its name implies. It encodes everything as if it were a query value. That means that you can do:

set url "http://www.aolserver.com/a/page/name?"
set queryList {}
foreach {key value} {key1 val1 key2 val2 ... ... keyN valN} {
    lappend queryList [ns_urlencode $key]=[ns_urlencode $value]
}
append url [join $queryList &]
# http://www.aolserver.com/a/page/name?key1=val1&key2=val2&%2e%2e%2e=%2e%2e%2e&keyN=valN

You cannot, however, do the following and have it behave correctly:

set url [ns_urlencode {http://www.aolserver.com/a/page/name?key1=value1&key2=value2&....&keyN=valueN}]
# http%3a%2f%2fwww%2eaolserver%2ecom%2fa%2fpage%2fname%3fkey1%3dvalue1%26key2%3dvalue2%26%2e%2e%2e%2e%26keyN%3dvalueN

I'd like to change ns_urlencode (and ns_urldecode, which has its own issues) to be able to handle actual URLs (or, more precisely, URIs). The discussions I've had with Dossy on the subject led to two possible solutions:

  1. Add flags to ns_urlencode (and ns_urldecode) for -url and -query, which would change the behavior to either "treat this as a full URL, and escape it as needed" or "treat this as a query element, and escape it as needed". The default behavior would be that of -query (the same as it is now).
  2. Add a new command ns_url (with subcommands encode and decode) which would take the same flags, but the default would be to treat the input as if -url was specified. The original commands would then be deprecated (but still available for a long time).
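
To make option 2 concrete, here is a rough sketch of what the wrapper might look like. The ns_url command does not exist; the sketch also assumes that both ns_urlencode and ns_urldecode already accept the proposed -url and -query flags, which is not true of the current commands.

```tcl
# Hypothetical sketch of option 2; ns_url is not an existing AOLserver command.
# Assumes ns_urlencode and ns_urldecode accept the proposed -url/-query flags.
proc ns_url {subcmd args} {
    switch -exact -- $subcmd {
        encode  { set cmd ns_urlencode }
        decode  { set cmd ns_urldecode }
        default {
            error "bad subcommand \"$subcmd\": should be encode or decode"
        }
    }
    if { [llength $args] == 0 } {
        error "wrong # args: should be \"ns_url $subcmd ... data\""
    }
    # Everything before the last argument is an option; the last is the data.
    set opts [lrange $args 0 end-1]
    if { [lsearch -exact $opts -url] < 0 && [lsearch -exact $opts -query] < 0 } {
        # No explicit mode given: default to -url, unlike the old commands.
        set args [linsert $args 0 -url]
    }
    return [uplevel 1 [linsert $args 0 $cmd]]
}
```

With this in place, "ns_url encode $data" would behave like "ns_urlencode -url $data", while "ns_url encode -query $data" would keep the current behavior.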

Questions:

  • How to encode things. I would assume the default encoding mechanism would follow the rules of HTTP (i.e., use + for space in the query section, but %20 in the resource section)?
  • Is it worthwhile implementing any other schemes, or should they all just follow the HTTP lead?
  • Should the scheme be added as a flag, determined from the URL (which may or may not have the scheme), or both?
  • Are there any test cases out there for url encoding? I searched, but couldn't find any.
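
As a small illustration of the first question, the sketch below handles spaces only: %20 before the first "?" and + after it. The proc name encodeSpaces is made up for this example, and fragments and all other characters are deliberately ignored.

```tcl
# Illustration only (hypothetical helper): shows the different treatment of
# a space in the resource part and in the query part of an HTTP URL.
proc encodeSpaces {url} {
    set idx [string first ? $url]
    if { $idx < 0 } {
        # No query part: every space belongs to the resource.
        return [string map {" " %20} $url]
    }
    set resource [string range $url 0 [expr {$idx - 1}]]
    set query    [string range $url [expr {$idx + 1}] end]
    return [string map {" " %20} $resource]?[string map {" " +} $query]
}

puts [encodeSpaces "http://www.aolserver.com/p1 p2?arg1=value 2"]
# http://www.aolserver.com/p1%20p2?arg1=value+2
```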

The code I have currently:

 # File: url.tcl

 namespace eval ::ns::url {

 }

 proc ns_urlencode {args} {
    if { [llength $args] == 0 } {
        error "wrong # args: should be \"ns_urlencode ... data\""
    }
    set charset utf-8
    set mode query

    while { [llength $args] > 1 } {
        switch -exact -- [lindex $args 0] {
            -charset {
                set charset [lindex $args 1]
                set args [lrange $args 2 end]
            }
            -query {
                set mode query
                set args [lrange $args 1 end]
            }
            -url {
                set mode url
                set args [lrange $args 1 end]
            }
            -html {
                set mode html
                set args [lrange $args 1 end]
            }
            default {
                error "invalid option \"[lindex $args 0]\": should be\
                       \"ns_urlencode ... data\""
            }
        }
    }
    
    set url [lindex $args end]
    # Convert the Tcl string to the raw bytes of the target charset;
    # each byte is then percent-encoded individually.
    set url [encoding convertto $charset $url]

    switch -exact -- $mode {
        query {
            set result [::ns::url::encodeQuery $url]
        }
        url {
            set result [::ns::url::encodeUrl $url]
        }
        html {
            # Encode as a URL, then escape the characters that are special in HTML
            set result [::ns::url::encodeUrl $url]
            set result [string map [list \
                    &  {&amp;} \
                    <  {&lt;}  \
                    >  {&gt;}  \
                    \" {&quot;} ] $result]
        }
        default {
            error "should not happen"
        }
    }

    return $result
 }

 proc ns_urldecode {args} {
    if { [llength $args] == 0 } {
        error "wrong # args: should be \"ns_urldecode ... data\""
    }

    set charset utf-8
    set mode query

    while { [llength $args] > 1 } {
        switch -exact -- [lindex $args 0] {
            -charset {
                set charset [lindex $args 1]
                set args [lrange $args 2 end]
            }
            default {
                error "invalid option \"[lindex $args 0]\": should be\
                       \"ns_urldecode ... data\""
            }
        }
    }
    

    set url [lindex $args end]
    
    set matchList [regexp -all -inline {%([0-9a-fA-F]{2})} $url]

    set mapList {+ { }}
    foreach {str code} $matchList {
        lappend mapList $str [format %c [scan $code %x]]
    }
    set url [string map $mapList $url]
    set url [encoding convertfrom $charset $url]

    return $url
 }

 proc ::ns::url::encodeQuery {url} {
    set output ""
    #set encoding iso8859-1
    set re {^[a-zA-Z0-9 ]$}

    foreach elem [split $url ""] {
        #if { ![string is alnum -strict $elem] } {}
        if { ![regexp $re $elem] } {
            scan $elem "%c" elem
            # Always emit two hex digits (e.g. %0a, not %a)
            set elem [format %%%02x $elem]
        } elseif { [string equal " " $elem] } {
            set elem +
        }
        append output $elem
    }
    return $output
 }

 proc ::ns::url::encodeResource {url} {
    set output ""

    set alphanum {[a-zA-Z0-9]}
    set mark {[-_.!~*'()]}
    set other {[:@&=+$,]}
    set special "\[/#\]"
    set re "${alphanum}|${mark}|${other}|${special}"
    
    set fragCount 0
    set index 0
    while { [set index [string first "#" $url $index]] >= 0 } {
        incr fragCount
        incr index
    }

    foreach elem [split $url ""] {
        if { ![regexp $re $elem] } {
            scan $elem "%c" elem
            # Always emit two hex digits (e.g. %0a, not %a)
            set elem [format %%%02x $elem]
        } elseif { [string equal " " $elem] } {
            set elem %20
        } elseif { [string equal "#" $elem] && ($fragCount > 1) } {
            set elem %23
            incr fragCount -1
        }
        append output $elem
    }
    return $output
 }

 proc ::ns::url::encodeQueryString {query} {
    set output ""

    set outList {}
    foreach elem [split $query &] {
        set index [string first = $elem]
        if { $index >= 0 } {
            set key [string range $elem 0 [expr {$index -1}]]
            set value [string range $elem [expr {$index +1}] end]
            lappend outList "[encodeQuery $key]=[encodeQuery $value]"
        } else {
            lappend outList [encodeQuery $elem]
        }
    }

    return [join $outList &]
 }

 proc ::ns::url::encodeFragment {url} {
    set output ""
    set reserved {[;/?:@&=+$,]}
    set alphanum {[0-9a-zA-Z]}
    set mark {[-_.!~*'()]}
    set validRE "^(${reserved}|${alphanum}|${mark})*\$"
    # Group the alternation so both anchors apply to either branch
    set charRE "^(${alphanum}|${mark})\$"

    # Should we throw an error for an invalid fragment?
    # Are we even sure what an invalid fragment is?

    foreach elem [split $url ""] {
        if { ![regexp $charRE $elem] } {
            scan $elem "%c" elem
            # Always emit two hex digits (e.g. %0a, not %a)
            set elem [format %%%02x $elem]
        }
        append output $elem
    }
    return $output
 }

 proc ::ns::url::encodeUrl {url} {
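    # Based on the URI-splitting regexp from RFC 2396, Appendix B, with an
    # extra outer group capturing everything before the query (the resource)
    set urlRegexp {^((([^:/?#]+):)?(//([^/?#]*))?([^?#]*))(\?([^#]*))?(#(.*))?}

    if { ![regexp $urlRegexp $url -> \
            resource p1 p2 p3 p4 p5 query p7 fragment p9] } {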
        error "Invalid url"
    }
    set result [::ns::url::encodeResource $resource]
    if { [info exists query] && [string length $query] } {
        if { ![info exists p7] } {
            set p7 ""
        }
        append result ?[::ns::url::encodeQueryString $p7]
    }
    if { [info exists fragment] && [string length $fragment] } {
        if { ![info exists p9] } {
            set p9 ""
        }
        append result #[::ns::url::encodeFragment $p9]
    }
 
    return $result           
 }

 # ########################################
 # File: url.test
 package require tcltest
 namespace import tcltest::*

 source url.tcl

 # ========================================
 test encodeQueryString-1.1 {
    Break up the query string, and encode each section
 } -body {
    ::ns::url::encodeQueryString "arg1=value1&arg2=value2"
 } -result {arg1=value1&arg2=value2}

 test encodeQueryString-1.2 {
    With an equal that needs escaping
 } -body {
    ::ns::url::encodeQueryString "arg1=value11=value12&arg2=value2"
 } -result {arg1=value11%3dvalue12&arg2=value2}

 # ========================================
 test encode-1.1 {
    Basic encoding of url, in query mode
 } -body {
    ns_urlencode http://www.aolserver.com/
 } -result {http%3a%2f%2fwww%2eaolserver%2ecom%2f}

 test encode-1.2 {
    Non-ASCII characters are encoded as their utf-8 bytes (the default charset)
 } -body {
    ;# 0xe9 is the iso8859-1 for lowercase accented 'e'
    set data [ns_urlencode [format %c 0xe9]]
 } -result {%c3%a9}

 test encode-2.1 {
    Test that query mode encodes the entire string as a query
 } -body {
    ns_urlencode "http://www.aolserver.com/bob?this is a=query"
 } -result {http%3a%2f%2fwww%2eaolserver%2ecom%2fbob%3fthis+is+a%3dquery}

 test encode-3.1 {
    Mode url should not encode the url part with query rules
 } -body {
    ns_urlencode -url http://www.aolserver.com/bob
 } -result {http://www.aolserver.com/bob}

 test encode-3.2 {
    Url mode, with a query component
 } -body {
    ns_urlencode -url {http://www.aolserver.com/bob?arg1 2}
 } -result {http://www.aolserver.com/bob?arg1+2}

 test encode-3.3 {
    A space in the url part is escaped as %20, instead of a plus
 } -body {
    ns_urlencode -url {http://www.aolserver.com/p1 p2}
 } -result {http://www.aolserver.com/p1%20p2}

 test encode-3.4 {
    The first equal sign of a query element need not be escaped
 } -body {
    ns_urlencode -url "page?arg1=value1"
 } -result {page?arg1=value1}

 test encode-3.5 {
    The second equal sign in a query component needs to be escaped
 } -body {
    ns_urlencode -url "page?arg1=value1=value2"
 } -result {page?arg1=value1%3dvalue2}

 test encode-3.6 {
    Multiple query elements get quoted ok
 } -body {
    ns_urlencode -url "page?arg1=value1&arg2=value 2"
 } -result {page?arg1=value1&arg2=value+2}

 ## Fragments
 test encode-5.1 {
    Http, with fragment
 } -body {
    ns_urlencode -url "http://a.server.com/page#label"
 } -result {http://a.server.com/page#label}

 test encode-5.2 {
    Http, with fragment and a # that needs escaping
 } -body {
    ns_urlencode -url "http://a.server.com/page#notlabel#label"
 } -result {http://a.server.com/page#notlabel%23label}

 test encode-5.3 {
    Anything after the # is encoded as a fragment identifier
 } -body {
    ns_urlencode -url "http://a.server.com/page#sign?arg1=a"
 } -result "http://a.server.com/page#sign%3farg1%3da"

 ## Mode: html
 # Used to encode a url for use in an html page
 # Otherwise, performs the same encodings as -url
 test encode-6.1 {
    Html mode, the & should be replaced by &amp;
 } -body {
    ns_urlencode -html "page?arg1=value1&arg2=value 2"
 } -result {page?arg1=value1&amp;arg2=value+2}

 ## Special case tests
 test encode-7.1 {
    No encoding needed for ~
 } -body {
    ns_urlencode -url "ftp://a.server.com/~joe home%page"
 } -result {ftp://a.server.com/~joe%20home%25page}

 test encode-7.2 {
    Empty query should still include the ?
 } -body {
    ns_urlencode -url "http://a.server.com/page?"
 } -result {http://a.server.com/page?}

 test encode-7.3 {
    Empty fragment should still include the #
 } -body {
    ns_urlencode -url "http://a.server.com/page#"
 } -result {http://a.server.com/page#}

 test encode-7.4 {
    Handle username@host:port
 } -body {
    ns_urlencode -url "ftp://rseeger@a.server.com:80/page"
 } -result {ftp://rseeger@a.server.com:80/page}

 # ====================
 test encode-1.1E {
    Throw an error if no args are supplied
 } -body {
    ns_urlencode
 } -returnCodes error -match glob -result {wrong # args: should be *}

 test encode-1.2E {
    Throw an error if invalid option is specified
 } -body {
    ns_urlencode -notvalid bob
 } -returnCodes error -match glob -result {invalid option "-notvalid": should be *}

 # ========================================
 
 test decode-1.1 {
    Basic decoding of url
 } -body {
    ns_urldecode http%3a%2f%2fwww%2eaolserver%2ecom%2f
 } -result {http://www.aolserver.com/}

 test decode-1.2 {
    Don't decode characters twice
 } -body {
    # %23 is the code for #
    # %2523 is that, encoded
    ns_urldecode {%2523}
 } -result {%23}

 test decode-2.1 {
    Decode with charset: utf-8 (the default)
 } -body {
    ns_urldecode -charset utf-8 %c3%a9
 } -result [format %c 0xe9]

 test decode-2.2 {
    Decode with charset: iso8859-1
 } -body {
    string length [ns_urldecode -charset iso8859-1 %c3%a9]
 } -result 2