Send Bristol mailing list submissions to
bristol@mailman.lug.org.uk
To subscribe or unsubscribe via the World Wide Web, visit
https://mailman.lug.org.uk/mailman/listinfo/bristol
or, via email, send a message with subject or body 'help' to
bristol-request@mailman.lug.org.uk
You can reach the person managing the list at
bristol-owner@mailman.lug.org.uk
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bristol digest..."
Today's Topics:
1. Re: Forrin characters in web pages (Amias Channer)
----------------------------------------------------------------------
Message: 1
Date: Sun, 17 May 2015 14:21:24 +0100
From: Amias Channer <me@amias.net>
To: Bristol and Bath Linux User Group <bristol@mailman.lug.org.uk>
Subject: Re: [bristol] Forrin characters in web pages
Message-ID:
<CAMgU7XXgE00aos52EpL8eW0CgHiddKZyjVtxT3qFgvh2p9e7TQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Hello Andrew,
It sounds like the server is taking in UTF-8 but not outputting it.
My guess would be that the non-UTF-8-compliant pages are generated by a CGI script
or some kind of templating system that, for whatever reason, isn't UTF-8 aware.
There could also be a database that is doing this.
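
One rough way to see which saved pages differ is to look at the raw bytes. This is only a sketch in Python (my choice of language, nobody in the thread is using it), and the function name describe_encoding is made up for illustration. It reports whether a page's bytes are plain ASCII, valid UTF-8 with accented characters, or not valid UTF-8 at all:

```python
def describe_encoding(raw: bytes) -> str:
    """Classify raw page bytes (hypothetical helper, for illustration only)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes such as a lone 0xE7 (Latin-1 c-cedilla) are not valid UTF-8.
        return "not valid UTF-8"
    if text.isascii():
        # No accented characters survived: the name was already flattened.
        return "pure ASCII"
    return "valid UTF-8 with non-ASCII characters"


if __name__ == "__main__":
    samples = {
        "utf-8 page": "Fran\u00e7ois".encode("utf-8"),
        "flattened page": b"Francois",
        "latin-1 page": "Fran\u00e7ois".encode("latin-1"),
    }
    for label, raw in samples.items():
        print(label, "->", describe_encoding(raw))
```

Running this over the downloaded HTML files should show whether the plain-'c' pages are genuinely ASCII or are in some other encoding.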
Perl would definitely be the right choice; this kind of regex work is very easy
with it. I wouldn't use C on data from web pages unless I'd done some kind
of sanitising of it first.
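
For matching the same name across pages, a minimal sketch of the accent-folding idea follows. It is in Python rather than Perl or C, and the names fold_accents and same_name are hypothetical. It assumes the page text has already been decoded to Unicode. It relies on Unicode decomposition (NFD), which splits a letter like c-cedilla into a plain 'c' plus a combining mark that can then be dropped:

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining accents removed (illustrative helper)."""
    # NFD splits U+00E7 (c-cedilla) into 'c' followed by U+0327.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_name(a: str, b: str) -> bool:
    """Compare two names, ignoring accents and letter case."""
    return fold_accents(a).casefold() == fold_accents(b).casefold()


if __name__ == "__main__":
    print(fold_accents("Fran\u00e7ois"))
    print(same_name("Fran\u00e7ois", "Francois"))
```

This only handles letters that decompose into a base letter plus an accent. Characters with no such decomposition (for example the o-slash in Danish names) would need a hand-written mapping table.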
Cheers
Amias
On 16 May 2015 at 16:53, Andrew McLean <am57762@gmail.com> wrote:
> Thanks for the various comments. I am using UTF-8 (Mint 17.1) etc.
> The real problem is that the same names appear on different web pages
> sometimes with an accented character (e.g. c-cedilla in "Francois") and
> sometimes with the equivalent non-accented character (plain 'c' in this
> example). I need to find the same name in different pages where in the
> raw HTML files it is spelt differently.
> So it's not really about how the data is handled on my system; it's the fact
> that the same data (someone's name, in this case) is presented differently
> on different web pages.
> I have now written my own C code to find these chars and replace them
> with the plain equivalent. Fiddly, and I should probably learn Perl...
> Andrew
------------------------------
_______________________________________________
Bristol mailing list
Bristol@mailman.lug.org.uk
https://mailman.lug.org.uk/mailman/listinfo/bristol
End of Bristol Digest, Vol 601, Issue 1
***************************************