I have an HTML page that contains some file names that I want to download from a web server. I need to read these file names in order to build a list that will be passed to my web application, which downloads the files from the server. Each file name has an extension.
I have searched on this topic but haven't found anything except -
Is there no other way to search an HTML file for text that matches a pattern like filename.ext?
Sample HTML that contains a file name -
<p class=3DMsoNormal style=3D'margin-top:0in;margin-right:0in;margin-bottom=:0in; margin-left:1.5in;margin-bottom:.0001pt;text-indent:-.25in;line-height:normal;mso-list:l1 level3 lfo8;tab-stops:list 1.5in'><![if !supportLists]> <span style=3D'font-family:"Times New Roman","serif";mso-fareast-font-family:"Times New Roman"'><span style=3D'mso-list:Ignore'>1.<span style=3D'font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span style=3D'font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman"'>**13572_PostAccountingReport_2009-06-03.acc**<o:p></o:p></span></p>
I can't use HTML Agility Pack because I'm not allowed to download or use any application or tool.
Can't this be achieved with some other logic?
This is what I have done so far:
string pageSource = "";
string geturl = @"C:\Documents and Settings\NASD_Download.mht";
WebRequest getRequest = WebRequest.Create(geturl);
WebResponse getResponse = getRequest.GetResponse();
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream()))
{
    pageSource = sr.ReadToEnd();
    pageSource.Replace("=", "");
}
var fileNames = from Match m in Regex.Matches(pageSource, @"[0-9]+_+[A-Za-z]+_+[0-9]+-+[0-9]+-+[0-9]+.+[a-z]")
                select m.Value;
foreach (var s in fileNames)
    Response.Write(s);
Because an "=" occurs inside every file name, I am not able to match the file names. How can I remove the occurrences of "=" from the pageSource string?
Thanks in advance
Akhil
Well, bearing in mind that regular expressions aren't ideal for finding values in HTML:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++) {
    var match = p[i].innerHTML.match(/\s(\S+\.ext)\s/);
    if (match)
        files.push(match[1]);
}
Note: Read the comments to the question.
If the extension can be anything, you can use this:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++) {
    var match = p[i].innerHTML.match(/\b(\S+\.\S+)\b/);
    console.log(match);
    if (match)
        files.push(match[1]);
}
document.getElementById('result').innerHTML = files + "";
But this is really not reliable.
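To see why, here is a minimal sketch that runs the same generic pattern on a plain string instead of the DOM. The html excerpt is a shortened, already-decoded piece of the sample from the question, and firstGenericMatch is a hypothetical helper name; both are assumptions made only for illustration.

```javascript
// Return the first thing the generic "anything.anything" pattern captures,
// or null when nothing matches.
function firstGenericMatch(html) {
  var match = html.match(/\b(\S+\.\S+)\b/);
  return match ? match[1] : null;
}

// Shortened, decoded excerpt of the sample markup (assumed for this demo).
var html =
  "<span style='font:7.0pt \"Times New Roman\"'>" +
  "13572_PostAccountingReport_2009-06-03.acc<o:p></o:p></span>";

console.log(firstGenericMatch(html));
```

With this input the first hit is a fragment of the style attribute (style='font:7.0pt), not the file name, because any run of non-space characters containing a dot satisfies the pattern.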
It may be impossible to get the file names with a generic pattern, because values such as 1.5in, -.25in, 7.0pt and the like match it too. Try to be more specific (if possible), for example
/[a-z0-9_-]+\.[a-z]+/gi
or
/>[a-z0-9_-]+\.[a-z]+</gi
(markup included) or even
/>\d+_PostAccountingReport_\d+-\d+-\d+\.[a-z]+</gi
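As for the "=" characters: they most likely come from the quoted-printable encoding used inside the .mht file, where =3D stands for a literal "=" and a lone "=" at the end of a line is a soft line break. Below is a rough sketch that undoes both before applying the last pattern above. It assumes that encoding and assumes the file name sits directly between ">" and "<"; decodeQuotedPrintable, extractFileNames and the sample string are hypothetical names and data, and the decoder only handles single-byte escapes. (In the C# code from the question, note that pageSource.Replace("=", "") returns a new string, so the result has to be assigned back to pageSource to have any effect.)

```javascript
// Undo the two quoted-printable features that matter here:
// soft line breaks ("=" at end of line) and "=XX" hex escapes such as "=3D".
function decodeQuotedPrintable(text) {
  return text
    .replace(/=\r?\n/g, '')                          // join soft-wrapped lines
    .replace(/=([0-9A-F]{2})/gi, function (_, hex) { // "=3D" becomes "="
      return String.fromCharCode(parseInt(hex, 16));
    });
}

// Collect every file name that sits directly between ">" and "<".
function extractFileNames(source) {
  var pattern = />(\d+_PostAccountingReport_\d+-\d+-\d+\.[a-z]+)</gi;
  var decoded = decodeQuotedPrintable(source);
  var names = [];
  var match;
  while ((match = pattern.exec(decoded)) !== null) {
    names.push(match[1]); // group 1 excludes the surrounding markup
  }
  return names;
}

// Hypothetical excerpt in which a soft line break splits the file name.
var sample =
  "<span style=3D'font-family:\"Times New Roman\"'>13572_PostAccounting=\r\n" +
  "Report_2009-06-03.acc<o:p></o:p></span>";

console.log(extractFileNames(sample));
```

Decoding first, rather than deleting every "=", keeps the attribute markup intact, so the same decoded text can still be searched with the other patterns.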