No. of Recommendations: 79
Folks, it looks like the end of the MI Board index. It seems that The Fools have added some protection to their message pages that prevents me from reading them in a PHP program. I suspect it is to keep the boards from being crawled, which technically is what I'm doing. I can't see any way around it and since the new system explicitly blocks what I'm doing, I'm not sure I SHOULD find a way around it.

Unless anyone knows better, I think this is it. Oh well, I've got the first 15 1/2 years.

I was kind of hoping that they'd upgraded their search capability. But no.

MarkW
No. of Recommendations: 28
Just a thought. Since you are providing a definite service to the members of this board, which in turn makes the Fool MI board more usable and therefore more popular, might they offer you special access if they were aware of the situation?

Then again, I might be wrong in assuming anyone in power could recognize the true value your search engine has had for this board.

RAM
No. of Recommendations: 1
Unless anyone knows better, I think this is it. Oh well, I've got the first 15 1/2 years.


Hi MarkW! I thought I was the only person who had tried to do this. I should have known better -- you MI guys have done a lot.

How are/were you going about this? When I went through this exercise a few years ago (against all the TMF boards), I created a bunch of batch files to cURL the message.asp HTML into files, 1 per MessageID, and then scraped the message text out of the HTML with a separate procedure. Crude, but it worked for all messages that could be accessed anonymously. Paid boards, for instance, would redirect me to the same marketing page. And deleted messages can't be retrieved anonymously. That's where I lost interest, figuring out how to authenticate with cURL. I never picked it up again.

I'm looking at my old batch files, and it does seem like the statements I sent previously no longer work. For instance:

CURL http://boards.fool.com/Message.asp?mid=10000999

produces the output:

<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=1_5F0C167DF4DE89374E45D04ECA756C67268D40A03FC4&incident_id=143000210008412901-39402111587254617&edet=15&cinfo=48e5ae8ec753840404000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 143000210008412901-39402111587254617</iframe></html>

So on the face of it, I'd agree they have added something to prevent this; however, you can still access the page anonymously through a browser. I tested this just now by logging out and just navigating to the link I was trying to cURL, and it came out fine on my browser. From this I figure that it's still possible to grab these files. Maybe you have to impersonate a browser to grab them. I don't know enough about this to be helpful, but I think a similar trick can be used to fool your phone carrier into thinking you're using your phone's browser (not a tethered machine).
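
For anyone parsing saved responses, here is a minimal sketch of how a script could at least recognise the block page quoted above instead of indexing it as a message. The function name is made up for illustration, and the two marker strings are simply taken from the output shown; the sample body is an abbreviated copy of that output, not a real response.

```python
def is_incapsula_block(body: str) -> bool:
    """Return True if the body looks like the block page quoted above."""
    # Both markers appear in the quoted output; either one is treated as a hit.
    markers = ("_Incapsula_Resource", "Incapsula incident ID")
    return any(marker in body for marker in markers)


# Abbreviated stand-in for the quoted block page (IDs elided on purpose).
blocked = (
    '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>'
    '<iframe src="/_Incapsula_Resource?...">Request unsuccessful. '
    "Incapsula incident ID: ...</iframe></html>"
)

print(is_incapsula_block(blocked))  # True
print(is_incapsula_block("<html><body>An ordinary message page</body></html>"))  # False
```

A check like this only tells you the request was refused; it does nothing to get around the block.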

Sorry if this ground has been well-covered, and if I sound dumb, well... This is a very intimidating board on which to post!
No. of Recommendations: 3
Incidentally, it is a real blast to look through these old, old posts, especially tech companies, especially right when things started to heat up in 98/99. Ah, memories.
No. of Recommendations: 13
That would be a shame, I find the search engine incredibly helpful and am sure many others do too.
No. of Recommendations: 0
I saw whafa's reply. It seems like some IIS-side setting, nothing I know about.
No. of Recommendations: 0
I'm not following you. I just tried to use the mi board search, and typed in a name, and 50 messages from him came up.

Then I typed in a message number, and that post came up.

Thanks.
No. of Recommendations: 0
It looks like they've started using Incapsula as a proxy. They claim to be able to stop site scraping. "Security as a Service," they call it:

http://en.wikipedia.org/wiki/Incapsula

There should be a way to get around it.
No. of Recommendations: 27
Would you mind if I complained to the Fool about this? You're providing a genuinely valuable service to the community that's well beyond anything their search service provides.
No. of Recommendations: 10
Please do - and add every member of the board as a petition signer. Get them to give a scraping hook/method to Mark. As Robbie calls them, these fossils of message board technology need positive help, not reasons to drive traffic away.

FC
No. of Recommendations: 31
How about writing a polite and well-reasoned message to TMF's head honcho, Tom Gardner (TMFtomG), and asking for his support in finding a solution? Point him to the search engine and encourage him to take a spin. Maybe he'll even pay Mark for the right to incorporate the search engine for all the message boards, as part of TMF's own web site.

Elan
No. of Recommendations: 2
Maybe he'll even pay Mark for the right to incorporate the search engine for all the message boards, as part of TMF's own web site.

I suggested this over at "Improve the Fool" once.
Got shot down loudly, accusations of pumping my own index product, which of course it wasn't.
We can try again.
That's probably a good place to do it.

Jim
No. of Recommendations: 16
I suggested this over at "Improve the Fool" once.
...
That's probably a good place to do it.



I just posted a plea over there.
http://boards.fool.com/the-once-and-future-index-tool-309392...
Might help to comment or rec that thread?

Jim
No. of Recommendations: 41
First of all, I salute Mark for what he's done so far. This message board would be next to worthless to me, and probably quite a few others, without the ability to search it.

Since the cat is out of the bag, I'll go ahead and post what I would have added to my last private email to you.

I'm having no trouble at all backing up posts with message numbers and URLs using iMacros for Firefox (an add-on I shamelessly recommend for all deficiencies in the GTR1 backtester's user interface). Here is a code sample:

' Suppress the pop-up that normally displays each extracted value
SET !EXTRACT_TEST_POPUP NO
SET extURL {{!URLCURRENT}}
' Board name
TAG POS=1 TYPE=A ATTR=ID:ctl01_ctl00_BaseContentPlaceHolder_BoardsBaseContentPlaceHolder_browseHeader_lnkBoardName EXTRACT=TXT
SET extBoardName {{!EXTRACT}}
SET !EXTRACT NULL
' Author
TAG POS=1 TYPE=A ATTR=TITLE:View*this*fool*profile* EXTRACT=TXT
SET extAuthor {{!EXTRACT}}
SET !EXTRACT NULL
' Message number
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:aspnetForm ATTR=ID:ctl01_ctl00_BaseContentPlaceHolder_BoardsBaseContentPlaceHolder_ctlMessageHeader_txtMessageNumber EXTRACT=TXT
SET extMsgNum {{!EXTRACT}}
SET !EXTRACT NULL
' Subject (located relative to the previous TAG)
TAG POS=R1 TYPE=A ATTR=CLASS:pnvalink EXTRACT=TXT
SET extSubject {{!EXTRACT}}
SET !EXTRACT NULL
' Date (located relative to the previous TAG)
TAG POS=R1 TYPE=TD ATTR=CLASS:pbnav EXTRACT=TXT
SET extDate {{!EXTRACT}}
SET !EXTRACT NULL
' Append "message number;URL" to the index file, then save the page itself
ADD !EXTRACT {{extMsgNum}};{{extURL}}
SAVEAS TYPE=EXTRACT FOLDER=* FILE=msg_urls.txt
SAVEAS TYPE=HTM FOLDER=* FILE={{extMsgNum}}
' Move on to the next message
TAG POS=1 TYPE=B ATTR=TXT:Next

To run it after installing iMacros, go to any MI message board post you want to start archiving from. Open the iMacros widget, find the macro called #Current.iim, open it for editing and replace its code with the above. Then set the "Repeat Macro" setting to whatever you want and click "Play (Loop)". For the sake of demonstration, the macro extracts more information than it saves; some of the tags might need tweaking, or can be eliminated if they stop working.

When I start this macro with Firefox sitting at the top message in this thread (and with the upper loop limit set to 20), I get a file in my iMacros "Downloads" folder called msg_urls.txt containing the following:

"245926;http://boards.fool.com/unless-anyone-knows-better-i-think-th...
"245927;http://boards.fool.com/incidentally-it-is-a-real-blast-to-lo...
"245928;http://boards.fool.com/that-would-be-a-shame-i-find-the-sear...
"245929;http://boards.fool.com/actually-it-gave-me-an-idea-that-you-...
"245930;http://boards.fool.com/i-saw-whafas-reply-it-seems-like-some...
"245931;http://boards.fool.com/im-not-following-you-i-just-tried-to-...
"245932;http://boards.fool.com/thats-a-big-compliment-to-be-mentione...
"245933;http://boards.fool.com/it-looks-like-theyve-started-using-in...
"245934;http://boards.fool.com/you-realize-you-did-not-even-implemen...
"245935;http://boards.fool.com/let-me-define-terms-a-little-more-clo...
"245936;http://boards.fool.com/would-you-mind-if-i-complained-to-the...
"245937;http://boards.fool.com/please-do-and-add-every-member-of-the...
"245938;http://boards.fool.com/thanks-again-robbie-i-got-this-to-wor...
"245939;http://boards.fool.com/sorry-cannot-make-out-what-13-will-do...
"245940;http://boards.fool.com/how-about-writing-a-polite-and-well-r...
"245941;http://boards.fool.com/maybe-hell-even-pay-mark-for-the-righ...
"245942;http://boards.fool.com/i-suggested-this-over-at-quotimprove-...

I also get the following files:

245926.htm
245927.htm
245928.htm
245929.htm
245930.htm
245931.htm
245932.htm
245933.htm
245934.htm
245935.htm
245936.htm
245937.htm
245938.htm
245939.htm
245940.htm
245941.htm
245942.htm

These html files could presumably then be parsed and indexed by your existing PHP program, hopefully without much modification.
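
As a rough illustration of that hand-off, here is a minimal sketch (in Python rather than PHP, purely for brevity) that reads lines in the msg_urls.txt format shown above and pairs each message number with its saved .htm file. The function names are hypothetical, and the sample line is copied from the listing above with its URL still truncated.

```python
from typing import List, Tuple


def parse_extract_line(line: str) -> Tuple[int, str]:
    """Split one 'msgnum;url' line into (message number, URL)."""
    # The extract file wraps each record in double quotes; drop them if present.
    cleaned = line.strip().strip('"')
    number, url = cleaned.split(";", 1)
    return int(number), url


def parse_extract_file(text: str) -> List[Tuple[int, str, str]]:
    """Return (message number, URL, saved file name) for each non-blank line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        number, url = parse_extract_line(line)
        # The macro saves each page under its message number, e.g. 245926.htm.
        rows.append((number, url, f"{number}.htm"))
    return rows


sample = '"245926;http://boards.fool.com/unless-anyone-knows-better-i-think-th...\n'
for row in parse_extract_file(sample):
    print(row)
```

The indexer could then loop over these rows, open each .htm file, and store the URL alongside whatever it parses out of the page.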

This was actually the first iMacro I've ever written that does its own parsing. Normally I launch Firefox from within other code and run macros that simply save raw pages for parsing later within my own code. But I've written this example because it's portable and demonstrates the main features of iMacros.

For example, I launch iMacros from within C++ with two lines:

sprintf(cmd, "\"%s\" -new-window imacros://run/?m=%s%d.iim", FirefoxPath, Progname, i);
system(cmd);

Since my ping time to US websites from Australia is often pathetic, I make use of my CPU's idle time by running lots of iMacros simultaneously. I signal to my C++ programs that an iMacro has completed by making the last line of my macro point the browser to a local html file with a distinctive title. When my program detects that a window is open with that title, it kills it and moves on (using FindWindow and SendMessage in windows.h).
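
The same launch step can be sketched in Python for anyone not working in C++. This assumes only what the two lines above show: a Firefox path, the -new-window flag, and an imacros://run/?m= URL naming a numbered macro file. The function names and the example arguments are made up for illustration, and the window-title completion signal is not covered here.

```python
import subprocess
from typing import List


def imacros_command(firefox_path: str, prog_name: str, index: int) -> List[str]:
    """Build the argument list for one Firefox window running <prog_name><index>.iim."""
    return [firefox_path, "-new-window", f"imacros://run/?m={prog_name}{index}.iim"]


def launch_all(firefox_path: str, prog_name: str, count: int) -> List[subprocess.Popen]:
    """Start one Firefox window per macro without waiting for any to finish."""
    return [
        subprocess.Popen(imacros_command(firefox_path, prog_name, i))
        for i in range(count)
    ]


# Only the command is printed here; launch_all() would actually start Firefox.
print(imacros_command("firefox", "archive", 3))
# ['firefox', '-new-window', 'imacros://run/?m=archive3.iim']
```

Passing the arguments as a list avoids the quoting that the sprintf version has to do by hand around the Firefox path.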

Note to Fool.com: If you are paying programmers to read this thread and find ways to thwart our efforts at making the message board usable, then I humbly suggest that your funds would be better spent on upgrading to a 21st century forum platform instead. Posting plain text messages (without even the ability to fix typos) in threads was novel in the 1980s and even 1990s, but in 2013 it's a joke, and not in the good TMF jester kind of way.

Robbie Geary
No. of Recommendations: 9
...then I humbly suggest that your funds would be better spent on upgrading to a 21st century forum platform instead. Posting plain text messages (without even the ability to fix typos) in threads was novel in the 1980s and even 1990s, but in 2013 it's a joke, and not in the good TMF jester kind of way.

Just be sure to bring all of the existing content over to that modern message board!

Mark
No. of Recommendations: 2
Hi Jim,

I've contacted the appropriate people to see if we can whitelist specific persons who collect data benignly. Hopefully it's possible. We had major problems with data scrapers clogging our servers, so we had to do something about it. Unfortunately, some good guys got caught in the net.

Richard


This is from the Motley Fool. There is hope.

I understand from Jim's post that today is the last day it is available.

That answers my question.
No. of Recommendations: 81
I've just been pinged by Tom Chilcutt, an engineer at the Fool. Jim, I'll bet it's because of your plea. We are going to get this to work. Probably a lot better than it did before the application change.

I read the page using the PHP file_get_contents() function and used regular expressions to get the subject, content, author, parent message, etc. Since they started using Incapsula, I started getting the output that whafa posted. I tried cURL with the same result. I tried the wget command. Same thing. I'd just gotten off a 26-hour work day, so I was pretty easy to defeat.
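
For readers curious what the regular-expression step might look like, here is a minimal sketch in Python (not the actual PHP, which isn't shown in this thread). The attribute names it keys on (the "View this fool profile" title, the pnvalink class, and an ID ending in txtMessageNumber) are borrowed from the iMacros sample posted earlier; the HTML fragment itself, the attribute order, and the function name are assumptions made up for the example.

```python
import re
from typing import Dict, Optional

# One pattern per field. Each assumes double-quoted attributes, and the
# message-number pattern assumes id="..." comes before value="...".
PATTERNS = {
    "author": re.compile(
        r'<a[^>]*title="View this fool profile[^"]*"[^>]*>(.*?)</a>', re.I | re.S
    ),
    "subject": re.compile(r'<a[^>]*class="pnvalink"[^>]*>(.*?)</a>', re.I | re.S),
    "msg_num": re.compile(
        r'<input[^>]*id="[^"]*txtMessageNumber"[^>]*value="(\d+)"', re.I | re.S
    ),
}


def extract_fields(html: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None where nothing matches."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(html)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Hypothetical fragment, not copied from a real message page.
sample = (
    '<a title="View this fool profile" href="#">ExampleAuthor</a>'
    '<input type="text" id="ctl01_txtMessageNumber" value="245926">'
    '<a class="pnvalink" href="#">Example subject</a>'
)
print(extract_fields(sample))
# {'author': 'ExampleAuthor', 'subject': 'Example subject', 'msg_num': '245926'}
```

Returning None for a missing field, rather than raising, makes it easy to spot pages (such as a block page) where none of the expected markup is present.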

I am completely swamped at the moment and may not be able to get back to this for a week or two. But it WILL be fixed!
No. of Recommendations: 3
No question, when mungofitch speaks, people listen!
No. of Recommendations: 10
No question, when mungofitch speaks, people listen!

Gosh, 175 recs in a day.
As a rec hog I should stop doing investment systems and just praise the search engine more often.

Jim

[great search engine!]